Transferring Computer Science Research to Mining DNA chip Protein Expression
W. B. Langdon
Seconded to GlaxoSmithKline from
University College, London
Presented at MIPNETS, 25-27 June 2003.
Introduction
DNA chips estimate the concentration of individual gene products. A chip can measure 1000s of gene expression levels simultaneously.
Tissue samples (patients) are in short supply and chips are expensive. => Few training records with many attributes. => Over fitting.
Genetic programming used for feature selection
3 Stage GP. 10,000 genes -> 100s -> 10s
Last stage GP used to produce predictive model
GSK: 578 examples, 4548 -> 361 -> 18 (=> 6) genes
100 GP runs with 4548 genes
data split 50:50 train:verification
fitness = 1/2 true positive rate + 1/2 true neg (ROC)
binary arithmetic operators (+ - * /), 4548 constants
five trees per individual
pop 500, 50 generations
50% size fair crossover, 50% 1 of 4 mutations
Extract all genes used in best of run model
361 appeared more than once (1730 used)
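The fitness measure listed above (1/2 true positive rate + 1/2 true negative rate) can be sketched as follows. This is a minimal illustration, not the GProc code; the function name and the boolean list inputs are assumptions made for the example.

```python
def balanced_fitness(predictions, labels):
    """Return 0.5 * true positive rate + 0.5 * true negative rate.

    predictions and labels are equal-length sequences of booleans
    (True = positive class). Names are illustrative only.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    # Correctly predicted positives and negatives.
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    tn = sum(1 for p, y in zip(predictions, labels) if not p and not y)
    return 0.5 * tp / n_pos + 0.5 * tn / n_neg
```

Because each class contributes half of the score, a model that always predicts the majority class scores only 0.5, however unbalanced the classes are.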
18=>6 gene model
100 GP runs with 18 genes
Over fitting.
Select model from generation 10
Highest fitness and smallest model, with least genes
run 40, 6 genes, size=23
IF ((gene1 + gene2 + gene3 - 13.8 - gene4 (0.00918 * gene5 * gene1 - 2230) / gene6) >= 0)
THEN predict positive
71% of positive examples correct and 81% of the negatives correct
Training Set vs. Test set fitness
ALL Leukemia: 119 patients, 12625 -> 139 -> 17 (=> 9) genes
100 GP runs with 12625 genes
no data split
pop 10,000 (all genes in gen 0), 5 generations
Extract all genes used in best of run model
139 appeared more than once (1737 used)
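The gene extraction step above (collect the genes used by each best of run model, keep those that recur) might look like the sketch below. It assumes each model is available as a list of gene names and that a gene is counted at most once per model; the slides do not say whether repeated use inside one model counts, so that choice and all names here are hypothetical.

```python
from collections import Counter


def select_recurring_genes(models, min_count=2):
    """Return genes that appear in at least min_count best of run models.

    models: iterable of lists of gene names, one list per GP run.
    min_count=2 corresponds to "appeared more than once".
    """
    counts = Counter()
    for genes in models:
        # Assumption: count each gene once per model, however often it is used.
        counts.update(set(genes))
    return sorted(g for g, c in counts.items() if c >= min_count)
```

The same function with `min_count=10` would express an "appeared ten or more times" threshold.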
Leukemia 2nd stage 139->17 genes
Leukemia GP linear 9 genes
Leave one out: 10 GP runs each with 17 genes, 1180 runs.
linear model with 9 (of 12,625) inputs
IF ((40371_at/31 + 34985_at + 1828_s_at + 451_at + 41419_at - 34052_at - 39535_at - 36730_at - 32483_at) >= 0)
THEN Predict positive
Leave one out estimate
76% of positive examples correct and 71% of the negatives correct
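As an illustration, the nine gene linear rule above can be written as a small function. This sketch assumes the expression values arrive as a dictionary keyed by probe ID and reads "40371_at /31" as division by 31; the function name is hypothetical.

```python
def predict_positive(expr):
    """Apply the nine gene linear rule to one sample.

    expr: mapping from probe ID (e.g. "40371_at") to its expression value.
    Returns True when the rule predicts the positive class.
    """
    score = (
        expr["40371_at"] / 31
        + expr["34985_at"]
        + expr["1828_s_at"]
        + expr["451_at"]
        + expr["41419_at"]
        - expr["34052_at"]
        - expr["39535_at"]
        - expr["36730_at"]
        - expr["32483_at"]
    )
    return score >= 0
```

Five probes raise the score and four lower it, so the rule amounts to comparing two groups of expression levels.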
60 Childhood Cancers 7129 -> 404 => 2 genes
600 GP runs with 7129 genes
ten leave one out (10*60=600)
pop 500, 50 generations
Extract all genes used in best of run model
404 appeared ten or more times (6970 used)
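The run count above (ten leave one out passes over 60 examples, 600 runs) follows from a schedule like the one sketched here. The generator only enumerates which example is held out for each run; the GP run itself is not shown, and the names are assumptions for illustration.

```python
def leave_one_out_schedule(n_examples, runs_per_fold):
    """Yield (held_out, run, train_indices) for repeated leave one out.

    Each example is held out in turn, and runs_per_fold independent
    runs are scheduled on the remaining n_examples - 1 examples.
    """
    for held_out in range(n_examples):
        train = [i for i in range(n_examples) if i != held_out]
        for run in range(runs_per_fold):
            yield held_out, run, train


schedule = list(leave_one_out_schedule(60, 10))
print(len(schedule))  # 60 * 10 runs
```

Each held out example is never seen by the runs that are later scored on it, which is what makes the estimate usable when examples are scarce.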
Cancer 404-> 2 genes
Cancer final 2 gene model
Summary
Many variables, few examples (578, 119, 60). Fear over fitting.
Simple models, few (6, 9, 2) genes.
Leave one out cross validation or holdout set
Genetic programming used to select genes
Fitness = area under ROC (cf. class imbalance)
Population sized to get all variables: n(log n + 0.6)
Few generations to reduce over fitting
Size fair crossover and mutation eliminate bloat
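The population sizing rule n(log n + .6) in the summary resembles the coupon collector estimate for the number of uniform random draws needed to see all n items at least once. The sketch below evaluates it, assuming "log" means the natural logarithm and that every terminal in the initial population is one independent draw; `leaves_per_individual` is a hypothetical parameter, not a figure from the slides.

```python
import math


def draws_to_see_all(n):
    """Approximate number of uniform random draws needed to see all n items."""
    return n * (math.log(n) + 0.6)


def min_population(n_genes, leaves_per_individual):
    """Population size so the initial terminals cover the estimated draws.

    leaves_per_individual: assumed number of gene terminals per individual.
    """
    return math.ceil(draws_to_see_all(n_genes) / leaves_per_individual)


for n in (4548, 12625, 7129):
    print(n, round(draws_to_see_all(n)))
```

Under these assumptions, 12625 genes need roughly 127,000 draws, so a population of 10,000 would cover them if each individual contributed about 13 gene terminals.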
ftp://ftp.cs.ucl.ac.uk/genetic/gp-code/GProc-1.8b.tar.gz
W.B.Langdon
4 July 2003 (last update 4 Oct 2012)