# Foundation of Genetic Programming

## Glossary

• Arity Number of arguments that a function has. For example, + usually adds two numbers together. The numbers are its arguments. That is, + is a binary function and so has an arity of 2.

• Binary trees In binary trees, each function (internal node) has exactly two inputs. So the number of functions = [l/2]floor and number of terminals = [l/2]rceil. Since l is odd, number of terminals = number of functions + 1.

• Binomial distribution The probability of exactly k independent events each with a probability p in M trials is M choose k pk (1-p)M-k, where M choose k = M!/(k! (M-k)!). The mean of the Binomial distribution is pM and its variance is p(1-p)M.

• Boolean Dealing only with the two logical values: true (1) and false (0). Named after the mathematician George Boole.

• Building block hypothesis [Goldberg, 1989, page 41] talking of binary bit string genetic algorithms says Short, low order, and highly fit schemata are sampled, recombined'' (crossed over), and resampled to form strings of potentially higher fitness. In a way, by working with these particular schemata (the building blocks), we have reduced the complexity of our problem; instead of building high-performance strings by trying every conceivable combination, we construct better and better strings from the best partial solutions of past samplings.'' Just as a child creates magnificent fortresses through the arrangement of simple blocks of wood'' (building blocks), so does a genetic algorithm seek near optimal performance through the juxtaposition of short, low-order, high-performance schemata, or building blocks.'' Note [Goldberg, 1989] suggests that building blocks are highly fit schemata with only a few defined bits (low order), and that these are close together (short).

• Convergence Precisely every individual in the population is identical. Such convergence is seldom seen in genetic programming using Koza's subtree swapping crossover. However, populations often stabilise after a time, in the sense that the best programs all have a common ancestor and their behaviour is very similar (or identical) both to each other and to that of high fitness programs from the previous (and future?) generations. Often the term convergence'' is loosely used.

• Deception Deception is where the gradient in the fitness landscape leads away from locations with the best possible fitness. This is because the genetic algorithm, genetic programming or other search technique, is deceived'' into being attracted towards local peaks rather than the global best-fitness location.

• Defining Length L(H) The maximum distance between two defining (i.e. non-#) symbols in schema H. In tree GP schemata, L(H) is the number of links in the minimum tree fragment including all the non-= symbols within a schema H

• Diploid A species is diploid if its genetic chromosomes are paired. (In general, each cell has multiple pairs of chromosomes). A species is haploid if its chromosomes are not paired. In diploid species, the chromosome pairs separate during sexual reproduction. When they come together again in the child, one chromosome in each pair comes from the mother and the other from the father.

• Disruption If the child of an individual that matches schema H does not itself match H, the schema is said to have been disrupted.

• Diversity Variation between individuals in the population. Typically diversity refers to genetic variation. In bit string genetic algorithms, diversity may be measured by Hamming distance (i.e. counting the number of bits that are different) between bit strings. [Koza, 1992] defines variety'' in the population by the number of unique programs it contains, but this measure takes no notice of the fact that the behaviour of genetically different programs can be very similar or even identical.

• Dominance When one gene overrides the effect of another. Typically this refers to genes at the same locus of a pair of chromosomes (see diploid). A potential use'' is one gene acts as a backup for the other, taking on its function should the first one be damaged (e.g. by a mutation).

• Effective fitness The effective fitness of a schema is its fitness'' to take into account crossover and mutation. This can be thought of as the fitness that the schema would need to have to grow/shrink as it does with crossover and mutation if they were not there. feff(H,t) = alpha(H,t)/alpha(H,t-1) * f(t)

• Eigenvector An eigenvector v of a square matrix M has the property that multiplying M by it yields another vector which is parallel to v. That is, vM = lambda v where lambda is a (possibly) complex number known as the eigenvalue of v.

• Enumeration A search in which all possible solutions are tried in sequence.

• Epistasis Non-linear interaction between genes.

• Genetic drift Genes are inherently digital, in that an individual has an integer number of copies of each gene, and its children inherit an integer number of them. That is, an individual cannot have half a gene. While fitness may play a part in how many children inherit how many copies of a gene, in each individual child there is a large element of chance. Genetic drift is where random (i.e. not related to fitness) fluctuations in the genetic material in the population occur and lead to macroscopic changes to the population. Naturally genetic drift is more important in smaller rather than larger populations.

• Genetic drift As a crude example, consider a rare gene, say it occurs in 1% of the population but has no effect on fitness. So we expect it to occur in 1% of the next generation, and 1% of the next and so on. However, what happens if there are only 100 individuals in the population? In the current population there is one individual with the gene. In the next population (for simplicity also of size 100) we expect on average there to be one individual with the gene. However, it is quite likely there will be two or more, or that there will be none. Obviously if there are none, then there can be none in the next generation. While if there are two then on average the third population will also have twice as many as in the initial population. So, even though this gene is unrelated to fitness, its concentration is liable to drift with time. The number of generations for significant change to occur randomly is O(|population|2).

• Genotype The complete list of genes contained within an individual. In general, the position of the genes, as well as which ones are present, may be important.

• Haploid Unlike, diploid species, the chromosomes are not paired (although each cell may have more than one chromosome).

• Hits A program scores a hit'' when its output is sufficiently close or equal to the target value for a particular test input. In some problems a program's fitness may be given by the number of hits it collects when run on the entire test set.

• Instantiate An instance of. For example, an individual program may match a similarity template, such as a schema, one or more times. Each match is an instantiation (instance) of the schema.

• Leaf Outermost part of program tree. In contrast to functions or internal nodes, leaves have no arguments. Also called a terminal. In many cases in GP leaves are the inputs to the program.

• Length of a schema The total number of nodes in the schema is called the length N(H) of a schema H. N(H) is also equal to the number of nodes in the programs matching H.

• Linear tree A program composed of a variable number of unary functions and a single terminal. Note linear tree GP differs from bit string genetic algorithms since a population may contain programs of different lengths and there may be more than two types functions or more than two types of terminals.

• Local peaks These are locations in the fitness landscape where the fitness is below the best possible, and where every movement away from them leads to other locations all of whom have lower fitness.

• Loci Plural of locus.

• Locus A specific point of the chromosome. In bit string genetic algorithms it is a particular bit. More generally, it is a specific location in the genetic material that can take on one of a number of values, known as alleles. Commonly in genetic algorithms a locus controls a specific parameter, but many-to-many mappings between gene loci and parameters are possible (and believed to be prevalent in real life).

• Markov process A discrete random process where the chance of moving to another state depends only on the current state of the process and not on any earlier history.

• Mating pool While we mainly treat the population as a whole, an equivalent approach is to separate from it and place into a mating pool those individuals that will have children. Fitness selection is performed separately before genetic operations. Children are created from parents chosen only from the mating pool.

• Monte Carlo In a Monte Carlo search, points in the search space are sampled randomly. If sufficient independent points are sampled, a reasonable estimate of the whole search space can be made.

• Mutation In Nature, a mutation is viewed as a mistake when DNA is copied. E.g. when a cell divides to create two cells. Such an error introduces a random change to the DNA. Such changes are usually considered to be harmful, and Nature has elaborate error-detection techniques which considerably reduce the rate at which errors are introduced. Note if the cell with the error survives then all the cells it produces will also have the error. If the cell is a germ cell, i.e. used to create new individuals, then the children produced using it will inherit the error.

In artificial evolution, e.g. GAs and GP, mutation is used to mean any inherited random change to the genes. However, mutation is not used to describe crossover, i.e. the process whereby children are created from a combination of genes from two (or more) parents in which genes are copied correctly. In GAs it is common to use mutation (i.e. random changes) in combination with crossover. In traditional genetic algorithms mutation means choosing at random bits in the chromosome and flipping them. Each bit is considered independently and a decision is made with low probability pmut if it is to be changed or not. Note the number of bits changed is variable, and lies between 0 and N (where N is the number of bits in the chromosome) but is on average pmut N. It is commonly recommended to set pmut to about N-1 [Back, 1996]

In evolutionary programming, evolutionary strategies and real-valued genetic algorithms, it is common to apply mutation to every gene by adding a small random value to the gene. In genetic programming mutation is becoming increasingly often used. However, there are many different types of random changes that can be made to programs (see [Langdon,1998, pages 34-36]). Also, some authors recommend using a number of different types of mutations in combination with each other [Angeline, 1998]. One unfortunate source of confusion is pmut. In GAs it means the chance of changing each gene per generation. In some cases in GP, pmut is used to mean the chance that a child will be produced using mutation, rather than crossover. That is, treating pmut as analogous to pxo. Here we use pm' to denote this second meaning.

• Order O(H) The number of defining symbols in schema H.

• Parity problems Benchmark problems widely used in GP but inherited from the artificial neural network community. Parity is calculated by summing all the binary inputs and reporting if the sum is odd or even. This is considered difficult because: (1) a very simple artificial neural network cannot solve it and (2) all inputs need to be considered and a change to any one of them changes the answer.

• Polyploid A species with more than two chromosomes grouped together. See diploid and haploid.

• Premature convergence This loosely means that something has gone wrong. More precisely, that the population has converged (every individual in the population is identical, see convergence) to a suboptimal solution. Often the term premature convergence'' is loosely used.

• Propagation The inheritance of characteristics of one generation by the next. For example, a schema is propagated if individuals in the current generation match it and so do those in the next generation. Those in the next generation may be (but don't have to be) children of parents who matched it.

• Recombination This is creating children by combining the genetic material from two (or more) parents. Effectively it is another name for crossover.

• Roulette-wheel selection The archetypal selection scheme for genetic algorithms. We imagine a biased roulette wheel where each individual in the population has its own slot and the width of its slot is proportional to its fitness. That is, above-average individuals have wider-than-average slots. Individuals are selected to be parents of children in the next generation independently, one at a time, by spinning the wheel and dropping a ball into it. The chance of drawing a particular individual is proportional to the width of its slot, i.e. proportional to its fitness. However, others selection techniques, such as stochastic universal sampling [Back, 1996, page 120] or tournament selection, are often used in practice. This is because they have less stochastic noise, or are fast, easy to implement and have a constant selection pressure [Blickle, 1996].

• Schema A set of programs or bit strings that have some genotypic similarity. Usually the set is specified by defining a similarity template which members of the set must match. The template specifies the fixed part, which the programs must match, and the variable part. Don't care symbols are used to define the variable part. In tree schemata, both the content and the shape of the tree must be considered.

• Schemata Plural of schema.

• Stochastic matrix A matrix whose elements lie between 0 and 1 and where the sum of all the elements in each row is 1. A doubly stochastic matrix has the additional property that its transpose (i.e. the matrix whose i,j element is equal to the j,i element of the original matrix) is also a stochastic matrix.

• Terminal Another name for leaf.

• Unary function A function that takes one argument.

• Wild-card symbol This is also known as a don't care'' symbol. These are =, * and #. They indicate how a schema can match an actual program (or bit string) in the population.

Glossary from Genetic Programming and Data Structures.

W.B.Langdon 19 October 2002