Evolutionary Data Fusion

Evolution is used to perform data fusion on different data sources for drug discovery applications. It creates non-linear combinations of conventional classifiers (support vector machines, neural networks, C4.5, naive Bayes, linear classifiers, etc.) with improved performance. The classifiers are pre-trained, for example on molecule binding data mining problems. However, the technique is not specific to the pharmaceutical industry and could be applied generally. The classifiers can be of any type or mixtures of types. Indeed, the classifiers can be trained on different data.

This is part of my work on Intelligent Data Analysis and Fusion Techniques in Pharmaceuticals, Bioprocessing and Process Control as part of the Rocket Faraday INTErSECT partnership project between UCL, GSK, Unilever, SPSS, NPL and Sira. (Follow up work at GSK.)

Genetic Programming for Improved Receiver Operating Characteristics

Genetic programming (GP) is an automatic means of program generation. It uses the principles of Darwinian natural selection to evolve a population of programs. Each generation the fitter programs are selected to be parents and their children are created by crossing over and mutating the parents. With succeeding generations we hope fitter innovations will be introduced into the programs and the population will improve.
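The generational cycle described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the system used in the project: the tuple-based trees, the toy target x*x + x, tournament selection of size 3, the 10% mutation rate, elitism and the depth limit are all choices made for the example.

```python
import random

MAX_DEPTH = 6  # assumed limit on tree depth, to stop programs growing without bound


def random_tree(levels, rng):
    """Grow a random arithmetic expression over the variable x and small constants."""
    if levels == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else float(rng.randint(-2, 2))
    return (rng.choice("+-*"), random_tree(levels - 1, rng), random_tree(levels - 1, rng))


def run(tree, x):
    """Execute a program (expression tree) on the input x."""
    if isinstance(tree, tuple):
        op, left, right = tree
        a, b = run(left, x), run(right, x)
        return a + b if op == "+" else a - b if op == "-" else a * b
    return x if tree == "x" else tree


def tree_depth(tree):
    return 1 + max(tree_depth(tree[1]), tree_depth(tree[2])) if isinstance(tree, tuple) else 0


def fitness(tree):
    """Negative squared error against the toy target x*x + x (higher is fitter)."""
    return -sum((run(tree, x) - (x * x + x)) ** 2 for x in range(-5, 6))


def paths(tree, path=()):
    """Yield the path from the root to every node of the tree."""
    yield path
    if isinstance(tree, tuple):
        yield from paths(tree[1], path + (1,))
        yield from paths(tree[2], path + (2,))


def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(parts)


def crossover(mum, dad, rng):
    """Replace a random subtree of mum with a random subtree of dad."""
    donor = get(dad, rng.choice(list(paths(dad))))
    return replace(mum, rng.choice(list(paths(mum))), donor)


def mutate(tree, rng):
    """Replace a random subtree with a freshly grown one."""
    return replace(tree, rng.choice(list(paths(tree))), random_tree(2, rng))


def tournament(population, rng, size=3):
    """Pick the fittest of a few randomly chosen programs to be a parent."""
    return max(rng.sample(population, size), key=fitness)


def evolve(pop_size=50, generations=20, seed=0):
    rng = random.Random(seed)
    population = [random_tree(3, rng) for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        best = max(population, key=fitness)
        history.append(fitness(best))
        children = [best]  # elitism: the best program survives unchanged
        while len(children) < pop_size:
            mum = tournament(population, rng)
            dad = tournament(population, rng)
            child = crossover(mum, dad, rng)
            if rng.random() < 0.1:
                child = mutate(child, rng)
            # Reject oversized children and keep a copy of the parent instead.
            children.append(child if tree_depth(child) <= MAX_DEPTH else mum)
        population = children
    return max(population, key=fitness), history


best, history = evolve()
print(best, history[0], history[-1])
```

Because the best program is copied into each new generation, the best fitness in this sketch can never get worse from one generation to the next. Whether it actually improves depends on the random seed and the settings.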

We use GP as the means to combine classifiers which have already been trained to some level of performance on molecule binding data mining problems. The classifiers can be of any type or mixtures of types. Indeed the classifiers can be trained on different data.

Initially GP starts with random non-linear combinations of the supplied classifiers (and possibly also of the raw data they were trained on). Over successive generations the better combinations are selected from the population and used to create new combinations, so that better classifiers are evolved.
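As a rough illustration of what a random non-linear combination might look like, the sketch below grows random expression trees whose leaves are classifier outputs or constants. The function set (add, sub, mul, max, min), the tuple representation and the names `random_tree` and `evaluate` are assumptions made for this example; the source does not say which functions or representation were used.

```python
import operator
import random

# Assumed function set; the real system may use a different one.
FUNCTIONS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "max": max,
    "min": min,
}


def random_tree(n_classifiers, depth, rng):
    """Grow a random expression tree over classifier outputs and constants."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.8:
            return ("clf", rng.randrange(n_classifiers))  # output of one classifier
        return ("const", rng.uniform(-1.0, 1.0))
    name = rng.choice(sorted(FUNCTIONS))
    return (
        name,
        random_tree(n_classifiers, depth - 1, rng),
        random_tree(n_classifiers, depth - 1, rng),
    )


def evaluate(tree, outputs):
    """Fuse the classifiers' outputs for one example into a single score."""
    kind = tree[0]
    if kind == "clf":
        return outputs[tree[1]]
    if kind == "const":
        return tree[1]
    return FUNCTIONS[kind](evaluate(tree[1], outputs), evaluate(tree[2], outputs))


rng = random.Random(0)
population = [random_tree(3, depth=3, rng=rng) for _ in range(5)]
for tree in population:
    # Made-up outputs of three classifiers on a single example.
    print(tree, "->", evaluate(tree, [0.9, 0.2, 0.6]))
```

Each tree maps the individual classifiers' outputs for an example to one fused score, which can then be thresholded or ranked like the output of any other classifier.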

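The Receiver Operating Characteristics (ROC) of a classifier show its performance as a trade-off between sensitivity (the true positive rate) and specificity (one minus the false positive rate). Better classifiers have a higher area under their ROC curve. The ideal classifier has an area of one, while a classifier that guesses at random has an area of about one half.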
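The area under the ROC curve can be computed without drawing the curve, because it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counting one half). The sketch below uses this rank-based formulation; the function name `roc_auc` and the example scores are made up for illustration.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    scores -- classifier outputs, higher meaning "more likely positive"
    labels -- 1 for positive examples, 0 for negative examples
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive correctly ranked above negative
            elif p == n:
                wins += 0.5  # a tie counts as half
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
print(roc_auc([0.9, 0.8, 0.2, 0.1], labels))  # perfect ranking -> 1.0
print(roc_auc([0.9, 0.2, 0.8, 0.1], labels))  # one misordered pair -> 0.75
```

The double loop is quadratic in the number of examples, which is fine for a sketch; a sort-based version would be preferable for large datasets.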

The fitness of each non-linear combination of classifiers is the area under its ROC curve. This approach has been demonstrated by evolving improved data fusion classifiers for 1) contrived, 2) artificial and 3) several machine learning benchmarks. It has been tested in blind trials on QSAR drug activity datasets provided by GSK.
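To make the fitness measure concrete, the sketch below scores a candidate combination by the area under the ROC curve of its fused output. The `fitness` signature, the use of a plain Python callable as the combination, and the six made-up examples are assumptions for illustration only; they are not data or code from the project.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fitness(combine, classifier_scores, labels):
    """Fitness of a candidate combination: the AUC of its fused score.

    combine           -- hypothetical callable taking one score per classifier
    classifier_scores -- one list of scores per pre-trained classifier
    """
    fused = [combine(*row) for row in zip(*classifier_scores)]
    return roc_auc(fused, labels)


# Made-up scores from two imaginary classifiers on six examples.
labels = [1, 1, 1, 0, 0, 0]
clf_a = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
clf_b = [0.4, 0.9, 0.8, 0.1, 0.6, 0.2]

print(fitness(lambda a, b: a, [clf_a, clf_b], labels))      # classifier A alone
print(fitness(lambda a, b: b, [clf_a, clf_b], labels))      # classifier B alone
print(fitness(lambda a, b: a * b, [clf_a, clf_b], labels))  # non-linear fusion
```

On this toy data each classifier alone misorders one of the nine positive/negative pairs (an area of 8/9), while their product ranks every positive above every negative (an area of 1.0). The example was constructed to show this effect; real data need not behave so neatly.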

Project Research Papers

W. B. Langdon RN/01/19

W.B.Langdon 9 October 2001
(last update 29 July 2007)