Expanded abstract
BioData Mining
Mycoplasma Contamination in The 1000 Genomes Project
http://www.cs.ucl.ac.uk/staff/W.Langdon/

Background: In silico biology is increasingly important and is often based on public data. Biologists increasingly use sophisticated computer tools to analyse not only their own data but also data given to them directly by other biologists. Publicly funded projects, for example the 1000 Genomes Project, have made vast quantities of their data available. In fact, making data available is often a condition of funding or of publication. Exponentially growing quantities of such data are now available to people remote from the original source. With time, the data will increasingly be used by people who have never met the data provider. The original experimenter may have changed jobs or even retired, yet others will still be using the results of their endeavours. Because the data are freely available, they can also be used by non-biologists (e.g. computer scientists) who have little experience of the common procedures and problems associated with microbiology wet lab experiments. Once data are on the Internet they take on a life of their own: they may pop up in unexpected locations and be used in unanticipated ways. The current bioinformatics infrastructure already routinely distributes key datasets (such as the DNA sequences harvested by the 1000 Genomes Project) globally. While the problem of contamination is well recognised in microbiology labs, the corresponding problem of database corruption has received less attention. For example, at least one microbial gene has been uploaded into the database holding the reference human genome and is still there. It has been copied across the planet and has even been incorporated into some bioinformatics hardware.
Results: Bearing the above in mind, a large sample of freely available DNA sequences from the 1000 Genomes Project (1000G) was downloaded from the Internet. Each DNA sequence was tested against both the human genome and the genomes of 30 different species of Mycoplasma bacteria. Although only a fraction of 1000G DNA sequences are from Mycoplasma, they are well spread, so that it appears that in excess of seven percent of 1000G samples are contaminated with some DNA from Mycoplasma or a similar species. Many of the DNA sequences are of low quality. Even so, NCBI BLAST searches restricted to high quality, high entropy sequences give the same result: some 1000G DNA sequences match Mycoplasma but match no human sequence. Some locations have a higher contamination rate than others; this may be related to differences in DNA scanner technology between the members of the 1000 Genomes Project consortium.
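
The screening logic described in the Results can be illustrated with a minimal Python sketch. It is not the pipeline used for the study. The function names, the entropy and quality thresholds, and the Phred+33 quality encoding are all assumptions made for illustration, and the alignment step itself (for example a BLAST search) is assumed to have been run elsewhere, with its outcome supplied as two booleans per read.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy of the base composition, in bits per base."""
    if not seq:
        return 0.0
    n = len(seq)
    counts = Counter(seq.upper())
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 is assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_filter(seq: str, quality: str,
                  min_entropy: float = 1.8,
                  min_quality: float = 30.0) -> bool:
    """Keep only high quality, high entropy reads.

    Both thresholds are hypothetical values chosen for this sketch.
    """
    return (shannon_entropy(seq) >= min_entropy
            and mean_phred(quality) >= min_quality)


def classify(hits_human: bool, hits_mycoplasma: bool) -> str:
    """Label a read from the outcome of two external alignment searches."""
    if hits_mycoplasma and not hits_human:
        return "mycoplasma_only"
    if hits_human and hits_mycoplasma:
        return "both"
    if hits_human:
        return "human"
    return "unmatched"


def contaminated_fraction(labels_by_sample: dict) -> float:
    """Fraction of samples with at least one Mycoplasma-only read."""
    if not labels_by_sample:
        return 0.0
    flagged = sum(1 for labels in labels_by_sample.values()
                  if "mycoplasma_only" in labels)
    return flagged / len(labels_by_sample)
```

In this sketch a sample counts as contaminated as soon as a single read matches Mycoplasma and matches no human sequence. That is one possible reading of how a small fraction of reads, spread across many samples, can still flag more than seven percent of samples.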