# Is Bloat Due to Introns?

W. B. Langdon and W. Banzhaf

 Computer Science, University College, London Gower Street, London, WC1E 6BT
 Department of Computer Science, University of Dortmund D-44221 Dortmund

1. Fitness Based on Syntax

Syntax of tree (i.e. its sequence of functions and leafs) is reduced to hash code.

This is randomised to give a fitness value.

NB fitness depends only on syntax, each tree always has same fitness, many trees share the same fitness value.

• Traverse tree depth first, i.e.  208042962,34611749.
• Each syntax element has an opcode.
• The opcodes are packed tightly into 32 bit integers.
The first nine opcodes are packed into one integer and the rest into a second.
• These are combined using XOR,
208042962 xor 3461749 = C667BD216 xor 210222516 = E7659F716 = 242637303.
• 242637303 is the hash value of the tree.
• 242637303 is randomised by Park-Miller to yield 65D2E65316
65D2E65316 has 17 bits set and so the fitness of the tree is 17.

1.1 Introns

If using introns, the whole of subtrees starting with ``1'' are replaced by their first terminal.

So
208042962,34611749 becomes 208042962,3467.

XOR becomes C667BD216 xor D8B16 = C66765916

and randomising yields 4AA4B3316, which has 13 bits set.

1.2 Fitness Based on Building Blocks

Fitness given by combining ``behaviour'' of every subtree in the program.

``Behaviour'' of whole tree is union of bits set by each subtree.
Fitness is the number of bits set in the union.

Which bit is set by a subtree is given by calculating is fitness (using algorithm on page -).

Easy - Uniform

Hard - Binomial distribution

2. Building Block Fitness Distributions

Initially bigger trees have an advantage
but if size size
>200 all same fitness.

Slow change in distribution w.r.t. size for big trees

3.1 Results - Random - Anti-bloat

Population quickly reaches a fitness plateau.

Followed by convergence towards a small individual with high fitness.

Stable (1000 generations) populations formed by its children.

3.2 Results - Easy Building Blocks

GP rapidly finds trees with maximum fitness.

These spread throughout the population but each generation contains some small low fitness children.

Positive correlation between size and fitness drives increase in size (Price's theorem).

Bloat continues even when most programs in the search space have maximal fitness.

3.3 Results - Hard Building Blocks

Fitness continues to improve during run but each improvement takes longer to find.

Population converges towards current highest fitness but short and low fitness children are produced in each generation.

Positive correlation between size and fitness drives increase in size.

3.4 Linear Increase in Depth

With bushy initial trees, (except for random and easy problems) average program height increases roughly linearly by about one level per generation.

The non-linear, slower depth increase in the easy search spaces is produced by reduced selection pressure.

Other types of initial population increase their depth faster than those starting with bushy trees.

3.5 Results - Evolution of Shape

Populations evolve away from the bushy trees created by ``ramped half-and-half'' towards the most popular part of the search space.

3.6 Results - Sub Quadratic Power law

Mean of 10 runs. Power law fit of mean program size v. gen.
 Problem Final population Depth Power law Fitness mean size mean per gen Exponent Random 25 (1) 6 (0.9) 3 intron 25 (1) 530 (200) 33 0.6 (0.2) 1.23 (.09) BBlock (easy) 32 (0) 870 (100) 33 0.4 (0.1) 1.02 (.12) BBlock (hard) 26 (.8) 9000 (4600) 132 2.8 (1.7) 1.34 (.19)

Since programs remain close to the ridge and they increase their depth at a constant rate this leads to sub-quadratic growth in their size.

4. Conclusions

Linear GP on totally random landscape does not bloat without introns.

Tree GP on totally random landscape does not bloat without introns but is bloat due to introns?

Introduction of correlation (via building blocks or introns) to random landscape and bloat occurs regardless of presence of introns.

Regard bloat as random diffusive expansion of population into available search space. Most of search space is occupied by big random programs (cf. ridge).

GP behaves on these artificial problems like it does on real program spaces (i.e. with semantics).

C++ code may be found in ftp://cs.ucl.ac.uk/wblangdon/gp-code