Repeated Sequences in Tree Genetic Programming

W. B. Langdon
Computer Science
University of Essex

Paper presented at EuroGP-2005, pages 190-202

Introduction

Repeats in DNA

Demonstration problems

Mackey-Glass Chaotic Time Series

Mackey-Glass benchmark and first Evolved Linear (2XO) Function

Predicting Protein Location

Animal Nuclear Proteins

Non-linear 2D projection from 20 Dimensional Space
Nuclear and Non-Nuclear Animal Proteins

Animal Nuclear Proteins

Serine v Proline - 4 Valine
Non-linear 2D projection from 20 Dimensional Space

Genetic Programming Approaches


Performance

(all approaches solve problems)

Predicting M-G chaotic Time Series     RMS error    Mean
  Linear GP Nordin 2XO                 1.60-5.37    3.79
  Tree GP 2XO 500 generations          1.08-3.78    2.41

Nuclear Protein prediction (holdout set)
  Discipulus                           78-82%       80%
  Tree GP 2XO 50 generations           80-83%       82%

Evolution of Nuclear Protein Prediction Accuracy

Protein Prediction. Tree GP, Two point XO

Evolution of Protein program Size

Protein Prediction. Tree GP, Two point XO

Size of Repeated Tree Fragments

Evolution of Exact Repeats in First Protein Prediction 2XO run

Repeated Tree Fragments

Best of first Protein runs
Red 128 Blue 85 Green 79 Grey 11-63

Repeated Subtrees

Best of first Protein runs
Red 39 Blue 19 Green 15 Black 11 Grey 7
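
One way to look for exact repeated subtrees in an evolved program is to count how often each subtree occurs. The following is a minimal sketch, not the method used to produce the figures: it assumes programs are stored as nested Python tuples of the form (operator, child, child, ...), and the names subtree_size, all_subtrees, repeated_subtrees and the toy program are hypothetical. It only finds whole repeated subtrees, not the more general repeated tree fragments.

```python
from collections import Counter


def subtree_size(tree):
    """Number of nodes in a tree: a leaf counts 1, a tuple is (op, child, ...)."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(subtree_size(child) for child in tree[1:])


def all_subtrees(tree):
    """Yield every subtree, including the tree itself and each leaf."""
    yield tree
    if isinstance(tree, tuple):
        for child in tree[1:]:
            yield from all_subtrees(child)


def repeated_subtrees(tree, min_size=3):
    """Map each subtree with at least min_size nodes that occurs more than once to its count."""
    counts = Counter(s for s in all_subtrees(tree) if subtree_size(s) >= min_size)
    return {s: n for s, n in counts.items() if n > 1}


# Hypothetical toy program (assumption): (x + y) * ((x + y) - (x + y))
program = ("mul", ("add", "x", "y"), ("sub", ("add", "x", "y"), ("add", "x", "y")))
repeats = repeated_subtrees(program)
print(repeats)
```

In this toy program the only repeated subtree of three or more nodes is (x + y), which occurs three times. Recomputing subtree_size for every subtree is quadratic in the worst case, which is acceptable for a sketch but would need caching for large evolved programs.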

Largest Repeats M-G and Protein

Largest repeated fragment in Mackey-Glass and Protein Location best of run programs

Fraction of Program made of Repeated Subtrees

First Protein prediction 2XO Best program's repeated subtrees

Semantically repeated subtrees

First Protein Nuclear 2XO run best program Correlated subtrees

"Fitness" of Subtrees

Entropy first Protein prediction run

Entropy in best program at end of first Protein prediction run
In most cases variation across the training data increases monotonically from the leaves (bottom) to the root (top).
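
The leaf-to-root trend can be illustrated by measuring, for each subtree, the Shannon entropy of the values it produces over the training cases. This is a sketch under stated assumptions: the slides do not define the entropy measure, so the empirical Shannon entropy in bits is assumed here, and the tuple representation, the OPS table, the training cases and the toy program are all hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical function set (assumption): binary add and multiply only.
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


def evaluate(tree, case):
    """Evaluate a tree on one training case (a dict from variable name to value)."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, case), evaluate(right, case))
    if isinstance(tree, str):
        return case[tree]  # input variable
    return tree  # numeric constant


def subtree_entropies(tree, cases):
    """Return (subtree, entropy of its outputs over the cases) for every subtree."""
    result = [(tree, shannon_entropy([evaluate(tree, c) for c in cases]))]
    if isinstance(tree, tuple):
        for child in tree[1:]:
            result.extend(subtree_entropies(child, cases))
    return result


# Hypothetical training cases and toy program x + (y * 2) (assumptions).
cases = [{"x": 0, "y": 0}, {"x": 0, "y": 1}, {"x": 1, "y": 0}, {"x": 1, "y": 1}]
program = ("add", "x", ("mul", "y", 2))
entropies = dict(subtree_entropies(program, cases))
```

In this toy case the constant leaf has zero entropy, the variable leaves and the product have one bit each, and the root has two bits, so entropy never decreases on the way up the tree. Real evolved programs need not behave this way at every node, which is consistent with the "in most cases" qualification above.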

Subtree Fitness in first Protein prediction run

Subtree Fitness in best program at end of first Protein prediction run

Important nodes first Protein prediction run

Important nodes in best program at end of first Protein prediction run
Black: changes more than 10 training cases

Discussion

Conclusions



More information

More information on GP

References:

GP Parameters for Mackey-Glass time series prediction

Nuclear v Non-nuclear Protein Prediction