W.B.Langdon. 21 Aug 2014. 2008 papers, full list
Background: High Density Oligonucleotide arrays (HDONAs), such as the Affymetrix HG-U133A GeneChip, use sets of probes chosen to match specified genes, with the expectation that if a particular gene is highly expressed then all the probes in that gene's probe set will provide a consistent message signifying the gene's presence. We have tested this expectation by examining the correlations between probes using the data on thousands of arrays that are available in the NCBI Gene Expression Omnibus (GEO) repository.
Results: We have identified probes that are not well-correlated with the other probes in their probeset but nevertheless are highly correlated with probes in other probesets. The common element of these highly correlated probes is that they contain a G-spot (a sequence of four or more guanines).
Conclusions: Since these G-spot probes generally show little correlation with the other members of their probesets they are not fit for purpose and their values should be excluded when calculating gene expression values. This has serious implications, since more than 40 per cent of the probesets in the HG-U133A GeneChip contain at least one such probe. Future array designs should avoid these untrustworthy probes.
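The G-spot criterion described above (a run of four or more guanines) is simple to check mechanically. The following is a minimal sketch, not the authors' code: the function names, the probe identifiers and the dictionary layout are hypothetical, and the probe sequences are made up for illustration.

```python
import re

# A G-spot is defined in the text as four or more consecutive guanines.
G_SPOT = re.compile(r"G{4,}")


def has_g_spot(probe_sequence):
    """Return True if the probe sequence contains a run of >= 4 guanines."""
    return G_SPOT.search(probe_sequence.upper()) is not None


def split_probeset(probes):
    """Split a probeset into trusted probes and G-spot probes.

    `probes` is a hypothetical mapping of probe identifier -> sequence.
    Returns (kept, flagged), two dictionaries with the same layout.
    """
    kept = {name: seq for name, seq in probes.items() if not has_g_spot(seq)}
    flagged = {name: seq for name, seq in probes.items() if has_g_spot(seq)}
    return kept, flagged


if __name__ == "__main__":
    # Illustrative sequences only; real probes are 25 bases long.
    example = {
        "probe_1": "ACGTACGTTAGC",
        "probe_2": "ACGGGGTACGTA",  # contains GGGG
        "probe_3": "GGGACGGGTTAA",  # two runs of three G, not a G-spot
    }
    kept, flagged = split_probeset(example)
    print("kept:", sorted(kept))
    print("flagged:", sorted(flagged))
```

Under this sketch, only the `kept` probes would contribute to a gene expression value, in line with the recommendation to exclude G-spot probes.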
Modern biology has moved from a science of individual measurements to a science where data are collected on an industrial scale. Foremost amongst the new tools for biochemistry are chip arrays which, in one operation, measure hundreds of thousands or even millions of DNA sequences or RNA transcripts. Whilst this is impressive, increasingly sophisticated analysis tools have been required to convert gene array data into gene expression levels. Despite the assumption that noise levels are low, since the number of measurements for an individual gene is small, identifying which signals are affected by noise is a priority.
Analysis of high-density oligonucleotide arrays (HDONAs) from NCBI GEO shows that, even in the best Human GeneChips, 1/4% of data are affected by spatial noise. Earlier designs are noisier, and spatial defects may affect more than 25% of probes.
BioConductor R code is available as supplementary material and via TCBB-2007-11-0161_noCEL.tar
We have calculated the correlation between most human genes, using thousands of public Affymetrix HG-U133 +2 high-density oligonucleotide arrays (HDONAs). The correspondences show highly structured interactions between EBI Ensembl exons across a wide range of tissues and disease states taken from NCBI GEO. Eigenvalues are used to find and display the principal components of the gene expression mRNA data. The PCA suggests almost all genes interact in a connected graph. There are thousands of strongly interacting genes but the whole network is sparse, with many genes not correlating strongly. So far, few power laws typical of small world networks and anticipated in gene regulatory networks have been found.
The 300 million correlations are organised by gene/exon and are available via a web interface.
Variation in tissue sample preparation leads to variation across the transcriptome, not just between experiments but also between individual microarrays. Normalisation is essential before data from different arrays can be compared. Quantile normalisation can be used to force data from a single GeneChip to take a given distribution. However, quantile normalisation can be blind to the consistent spatial variation we note in thousands of Affymetrix high-density oligonucleotide arrays (HDONAs) from NCBI GEO. We propose a simple, computationally efficient normalisation technique which takes the spatial aspect into account. BioConductor R code is included.
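For readers unfamiliar with quantile normalisation, the sketch below shows the standard rank-based form mentioned above. It illustrates only the baseline technique, not the spatial method the paper proposes, and it is written in Python rather than the BioConductor R code that accompanies the paper. The function name and the tie-breaking rule (ties keep their original order) are assumptions made for this example.

```python
def quantile_normalise(values, reference):
    """Force `values` to take the distribution of `reference`.

    The smallest value is replaced by the smallest reference value, the
    second smallest by the second smallest reference value, and so on.
    Both sequences must have the same length. Ties in `values` are broken
    by original position (an arbitrary choice made for this sketch).
    """
    if len(values) != len(reference):
        raise ValueError("values and reference must have the same length")
    target = sorted(reference)
    # Indices of `values` ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    normalised = [0.0] * len(values)
    for rank, index in enumerate(order):
        normalised[index] = target[rank]
    return normalised


if __name__ == "__main__":
    chip = [5.0, 1.0, 3.0]
    reference = [10.0, 20.0, 30.0]
    print(quantile_normalise(chip, reference))
```

Because the result depends only on the rank of each value and never on where the probe sits on the chip, a method of this form cannot by itself detect or correct a spatial pattern, which is the limitation noted above.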
The limited numerical precision of the nVidia GeForce 8800 GTX and other GPUs requires careful implementation of pseudo-random number generators (PRNGs). The Park-Miller PRNG is programmed using the G80's native Value4f floating point in RapidMind C++. The speed-up is more than 40-fold. Code is available via ftp://ftp.cs.ucl.ac.uk/genetic/gp-code/random-numbers/gpu_park-miller.tar.gz
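As a reference point, the Park-Miller "minimal standard" generator is the recurrence x(n+1) = a * x(n) mod (2^31 - 1). The sketch below uses exact Python integers and the multiplier a = 16807 from Park and Miller's original proposal. It is not the RapidMind GPU code: the abstract does not state which multiplier or which floating-point decomposition the GPU version uses, so both are assumptions here, and the function name is hypothetical.

```python
MODULUS = 2**31 - 1   # Mersenne prime 2147483647
MULTIPLIER = 16807    # assumed: the original "minimal standard" multiplier


def park_miller(seed):
    """Yield successive Park-Miller states starting from `seed`.

    The seed must lie in 1 .. MODULUS - 1. Python integers are exact, so
    the product MULTIPLIER * state (up to about 2**45) cannot overflow.
    """
    if not 1 <= seed < MODULUS:
        raise ValueError("seed must be in 1 .. 2**31 - 2")
    state = seed
    while True:
        state = (MULTIPLIER * state) % MODULUS
        yield state


if __name__ == "__main__":
    generator = park_miller(1)
    for _ in range(10000):
        value = next(generator)
    # Park and Miller's published check: seed 1 gives 1043618065
    # after 10000 steps.
    print(value)
```

The intermediate product needs about 46 bits, more than the 24-bit significand of a single-precision float can hold exactly, which is why a floating-point GPU implementation has to split the arithmetic with care.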