Contest datasets

Primary dataset

The laboratories of Prof. Cristin Print and collaborators make available raw and processed data from a small microarray gene expression time-course experiment that is typical of gene expression time-course data sets yet provides an unusual opportunity for pushing the performance of analysis methods.

The experiment recorded the response of human vascular endothelial cells to serum withdrawal, triggering apoptosis. Apoptosis is known to be a major process for tissue remodelling during development and homeostasis in the adult, and also has a central role in many diseases. An initial, preliminary analysis and discussion of this data set has been published (1) and provides a good introduction to the biological background and context of the experiment.

The data set is typical in that a complex biological phenomenon is probed by a time course with only a few measurements, in this case 8 time-points and 3 replicate pools of cells from 10 distinct individuals each. It thus provides the classical challenge of microarray data analysis of extracting insight in a data space of very uneven dimensionalities, in this case 20k variables x (8×3) measurements. Also, a large number of independent experiments and a large body of established knowledge are available regarding apoptosis. Taking advantage of such external information for inference is again a typical challenge of the field.

The experiment is unusual, however, in that by design its focus is on detecting possible early causes. The challenge is hence to identify candidate regulators, rather than primarily their targets, by concentrating on very early time-points. We believe that this type of challenge will become an increasingly central task for microarray analysis, particularly when considering the platform's strength in detecting low-copy-number molecules, such as transcription factors, that potentially drive later transcriptional events. The development of improved algorithms in this area will therefore continue to grow in relevance. The performance of new algorithms, however, is hard to assess without additional laboratory experiments. As part of every year's analysis challenge, the Program Committee will vote for the most interesting analysis. For this contest, Prof. Print's laboratory has kindly offered to experimentally test predictions by siRNA knock-down of the most promising candidates emerging (with a budget for costs of NZ$5–10k). Together with the experimental design, this offers a special opportunity for developing and testing novel integrative algorithms for the detection of regulatory factors from typical (small) time-course data sets and available external knowledge.

We look forward to a lively contest!

Data download

We provide two archive versions,

  1. One version without original microarray scan images [54 MB], and

The archives are in 'zip' format and can be unpacked with the pre-installed tools on most machines, or with any free 'unzip' software.

As is traditional in CAMDA contests, neither we nor Prof. Print's group can provide advice on the datasets to individuals, as dealing with the files forms part of the analysis challenge. If you think, however, that there is a genuine problem with the files, please contact us and we will publish update information on this site.


(1) Affara M, Dunmore B, Savoie C, Imoto S, Tamada Y, Araki H, Charnock-Jones DS, Miyano S, Print C. (2007) 'Understanding endothelial cell apoptosis: what can the transcriptome, glycome and proteome reveal?' Phil. Trans. R. Soc. B. 362, 1469–87.
PDF reprint

Emerald dataset

The Emerald workshop experiments have just been completed and are now available for download!

A Microarray Experiment to Study the Relative Magnitudes of Technical and Biological Variation

Microarray science and technology have progressed to the point at which careful work yields reliable measurements. There is a growing understanding of the sources of variability in microarray experiments, and ways to control that variability are spreading. In part because the technical variability observed in contemporary microarray experiments has become better controlled, statistically significant lab-to-lab and batch-to-batch effects have been observed. A number of experiments which study the same samples across a variety of laboratories and platforms have reported this. The essential question is whether these effects are significant with respect to the biological variability observed amongst the samples. This question lies at the heart of establishing the fitness for purpose of microarrays for biological studies.

We have data available produced by three different laboratories measuring the same samples on three different platforms – each with their own batch factors (Liggett et al., 2008). The platforms are the Affymetrix Rat Genome 230 2.0 array, the Illumina RatRef-12 array, and the Agilent Whole Rat Genome array. The samples are a titration mixture of RNA isolated from kidney and liver, from 6 different normal control rats from an earlier experiment at Novartis. This titration presents a series of 4 samples from each rat: RNA from the kidney, a mixture of 75% RNA from kidney and 25% from liver, a mixture of 25% RNA from kidney and 75% from liver, and RNA from the liver. These samples were measured in replicate, for each animal. Pooled samples from the various animals were also measured, for a nominal 96 arrays from each platform.

The relationship amongst these samples enables model-based analysis, amongst other approaches. Model-based approaches can be compelling because they permit observation and apportionment of variation in the residuals. The titration samples present interesting opportunities for alternative analyses as well, with the titration fraction as a surrogate or proxy for RNA concentration.
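As one hedged illustration of what such a model-based analysis could look like, the sketch below predicts the signal of the two mixed samples from the two pure-tissue samples and reports the residuals. It assumes a simple linear mixing model on the linear (not log) intensity scale, and that the nominal RNA fractions carry over directly to the measured transcript signal; neither assumption is stated by the data providers. The function names and the example values are hypothetical and are not taken from the dataset.

```python
# Nominal kidney RNA fractions of the four titration samples:
# pure kidney, 75% kidney / 25% liver, 25% kidney / 75% liver, pure liver.
KIDNEY_FRACTIONS = [1.0, 0.75, 0.25, 0.0]


def predict_mixture(kidney_signal, liver_signal, kidney_fraction):
    """Expected linear-scale signal of a mixture under linear mixing (an assumption)."""
    return kidney_fraction * kidney_signal + (1.0 - kidney_fraction) * liver_signal


def titration_residuals(observed):
    """Residuals of the two mixed samples against predictions from the pure samples.

    `observed` holds linear-scale signals for one probe in one animal, ordered
    as in KIDNEY_FRACTIONS: [kidney, 75/25 mix, 25/75 mix, liver].
    """
    kidney, liver = observed[0], observed[3]
    return [
        observed[i] - predict_mixture(kidney, liver, KIDNEY_FRACTIONS[i])
        for i in (1, 2)
    ]


# Made-up signals for a single hypothetical probe (not real data).
example = [800.0, 660.0, 340.0, 200.0]
residuals = titration_residuals(example)
print(residuals)
```

Under these assumptions, residuals that are consistently offset in one direction across probes or animals would hint at a systematic effect (for example from preprocessing), whereas residuals scattered around zero would be more consistent with random technical or biological variation.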

A particular interest for this CAMDA dataset is its use for evaluating the performance of different preprocessing approaches and techniques. We encourage research groups to address this question. Assessment using a model-based approach might enable estimation of any bias that might be introduced in preprocessing. Such estimates would, for the first time, provide valuable quantitative insight to enable the microarray data analysis community to make appropriate compromises when selecting a preprocessing pipeline.

Data download

The Affymetrix data is available from ArrayExpress, accession number E-TABM-536.

The Illumina data will be available from ArrayExpress shortly, as accession number E-TABM-554. You can also download the original data files from the CAMDA server below or this mirror: Illumina Description (4kB .zip), Illumina Data (49MB .zip).

The Agilent data will be available from ArrayExpress shortly, as accession number E-TABM-555. You can also download the original data files from the CAMDA server below or this mirror: Agilent Description (73kB .zip), Agilent Data (828MB [!] .zip).

We look forward to a lively contest!

CAMDA Download Mirror


We provide a preprint of an analysis of the Novartis / Affymetrix data.

(2) Liggett W, Peterson R, Salit M. (2008) 'Technical vis-à-vis biological variation in gene expression measurements', preprint.

contest_dataset.txt · Last modified: 2008/11/23 19:24 by pablo