Contest datasets

As is traditional in CAMDA contests, neither we nor the producers of the data can advise individuals on the datasets, because dealing with the files is part of the analysis challenge. There is, however, an open forum where participants can freely discuss the contest datasets, and you are encouraged to take part.

We look forward to a lively contest!

Dataset 1: TGP dataset from the Japanese Toxicogenomics Project

Data Description

The TGP dataset contains more than 21,000 arrays for rats treated mainly with human drugs and profiled using the Affymetrix RAE230_2.0 GeneChip®. The main target organ profiled is the liver.

In this project, only the data for liver are provided. The data package contains the following files:

  1. TGP Description (Word document) – provides a brief introduction to the TGP data and the human hepatotoxic potential of each drug. More information is available from the two references below. Citation 1: Uehara T, Ono A, Maruyama T, Kato I, Yamada H, Ohno Y, Urushidani T. The Japanese toxicogenomics project: application of toxicogenomics. Mol Nutr Food Res. 2010; 54(2):218-27. Citation 2: Chen M, et al. FDA-approved drug labeling for the study of drug-induced liver injury. Drug Discov Today. 2011; 16(15-16):697-703.
  2. Drug Information (Excel table) – basic information about the individual drugs, extracted from DrugBank. The last three columns contain the human hepatotoxicity data for each drug, as described in the paper by Chen et al. (citation 2 above).
  3. Pathology Data (Excel table) – a significant portion of the TGP data is derived from in vivo assays using two different treatment protocols (single treatment and daily repeated treatment). Pathology and clinical chemistry data for each rat, anchored to the corresponding array, are summarized in this table.
  4. Array Metadata (CSV format) – metadata (e.g., dose, time, sacrifice time) for each array are summarized. Phenotypic data anchored to each array are available from the “Pathology Data” table mentioned above.
  5. MAS5 data (folder) – contains all the array data in MAS5-normalized format.
  6. RAW data (folder) – contains all the array data in CEL format.
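
As a starting point for working with the Array Metadata file, the sketch below groups array IDs by compound, dose and time using only the Python standard library. The column names (array_id, compound, dose_level, sacrifice_time) and the sample rows are hypothetical placeholders, not the actual headers of the contest file; check the real CSV header and adjust the constants accordingly.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names: replace with the real headers of the metadata CSV.
ARRAY_ID = "array_id"
COMPOUND = "compound"
DOSE = "dose_level"
TIME = "sacrifice_time"


def group_arrays(metadata_file):
    """Return a dict mapping (compound, dose, time) to a list of array IDs."""
    groups = defaultdict(list)
    for row in csv.DictReader(metadata_file):
        key = (row[COMPOUND], row[DOSE], row[TIME])
        groups[key].append(row[ARRAY_ID])
    return dict(groups)


# Made-up rows standing in for the real file; in practice pass
# open("<path to metadata csv>", newline="") instead.
sample = io.StringIO(
    "array_id,compound,dose_level,sacrifice_time\n"
    "A001,drugX,High,24 hr\n"
    "A002,drugX,High,24 hr\n"
    "A003,drugX,Control,24 hr\n"
)
groups = group_arrays(sample)
for key, array_ids in sorted(groups.items()):
    print(key, array_ids)
```

Grouping replicates in this way makes it straightforward to pair each treated group with its matching control before looking at the expression data.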


This is a typical toxicogenomics dataset. It can be used to address two of the most important questions in toxicology and safety evaluation:

  • Question 1: Can we replace the animal study with an in vitro assay? Current safety assessment relies largely on animal models, which are time-consuming, labor-intensive, and at odds with animal rights concerns. There is a paradigm shift in toxicology toward exploring whether the animal model can be replaced with in vitro assays coupled with toxicogenomics. The TGP data contain both in vitro and animal data, which is essential for addressing this question.
  • Question 2: Can we predict liver injury in humans using toxicogenomics data from animals? Around 40% of drug-induced liver injury (DILI) cases are not detected in preclinical studies using conventional indicators (such as pathology and clinical chemistry data). It has been hypothesized that genomic biomarkers will be more sensitive than conventional markers in detecting human hepatotoxicity signals in preclinical studies (i.e., in vitro and in vivo assays). In this project, we provide the human hepatotoxicity data for most of the drugs (the last three columns in the table named “Drug Information”). Contestants can explore the possibility of predicting DILI potential in humans using the in vitro data from rat primary hepatocytes or human primary hepatocytes, or the animal data from the two different treatment protocols. Alternatively, these data can be combined to enhance the predictive power for human hepatotoxic potential.
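
When evaluating a DILI predictor for Question 2, one design choice worth considering is that each drug contributes many arrays while the hepatotoxicity label belongs to the drug. The sketch below is one possible way (an assumption on our part, not a contest requirement) to build cross-validation folds so that all arrays of a drug stay in the same fold. The function name and drug names are hypothetical.

```python
from collections import defaultdict


def drug_level_folds(sample_drugs, n_folds):
    """Split sample indices into folds so that no drug spans two folds.

    sample_drugs[i] is the drug that sample (array) i was treated with.
    """
    drugs = sorted(set(sample_drugs))
    # Round-robin assignment of drugs to folds (deterministic, for illustration).
    fold_of_drug = {drug: i % n_folds for i, drug in enumerate(drugs)}
    folds = defaultdict(list)
    for index, drug in enumerate(sample_drugs):
        folds[fold_of_drug[drug]].append(index)
    return [folds[k] for k in range(n_folds)]


# Made-up example: six arrays from four drugs.
sample_drugs = ["drugA", "drugA", "drugB", "drugC", "drugC", "drugD"]
folds = drug_level_folds(sample_drugs, n_folds=2)
print(folds)
```

Holding out whole drugs means the reported accuracy reflects performance on drugs the model has not seen, which is the situation of interest when predicting human hepatotoxic potential for a new compound.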

Dataset 2: KPGP-38 Human Genomes

Both bioinformatics and medical informatics professionals are challenged by the massive data generated by genome sequencing and the complex data in electronic medical records. How do we leverage these data to improve patient care? The second track of CAMDA will address some of these issues. The purpose of this track is exploratory. We see it more as a collaborative effort by the community to define the problems than as a competition among groups for the best answer. Thus, the tasks in this track are less strictly defined, and regular conference calls will be held for discussion.

Data Description

38 human genomes sequenced on the Illumina HiSeq 2000 platform with 30x to 40x coverage.

The human subjects are participants in the Korean Personal Genome Project, which is part of the international Personal Genome Project (PGP). Limited medical record data are available for the subjects.


Level I Challenge: Variant Calling

DNA sequence variants take many forms, such as single nucleotide polymorphisms (SNPs), insertions, deletions, inversions, repeats, etc.

To start with, we will focus on SNPs. From the raw data, how can we detect SNPs? Are we able to consistently identify them among different research groups?

A web server will be set up to automatically assess the consistency of SNP identification on the data set across different participants.
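
To illustrate what a consistency assessment between groups could look like, the sketch below computes the pairwise Jaccard index between SNP call sets. This is only one possible concordance measure and is not a description of the method the web server will use. The call representation (chromosome, position, reference allele, alternate allele), the group names, and the example calls are all made up.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0  # two empty call sets are treated as fully concordant
    return len(a & b) / len(a | b)


def pairwise_concordance(calls_by_group):
    """Return {(group1, group2): Jaccard index} for every pair of groups."""
    return {
        (g1, g2): jaccard(calls_by_group[g1], calls_by_group[g2])
        for g1, g2 in combinations(sorted(calls_by_group), 2)
    }


# Each SNP call is (chromosome, position, reference allele, alternate allele).
calls = {
    "group_a": {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"), ("chr2", 500, "G", "A")},
    "group_b": {("chr1", 1000, "A", "G"), ("chr2", 500, "G", "A")},
    "group_c": {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "G")},
}
scores = pairwise_concordance(calls)
for pair, score in scores.items():
    print(pair, round(score, 3))
```

Note that group_a and group_c call the same position on chr1 with different alternate alleles; under this representation those count as different calls, which is a choice a real assessment would need to make explicitly.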

Results from CAMDA 2012 will be used as a baseline for future studies.

Level II Challenge: Analysis of one Genome on Pharmacogenomics

For this challenge, we will explore how a physician can utilize the genome information in clinical care, especially pharmacogenomics. Given the medical record information and the genomic data, what kind of advice would a physician give to this patient with regard to drug usage? The crux of the problem is how to solve n=1.

Level III Challenge: Analysis of 38 Genomes

This is an open-ended challenge. We invite innovative ideas on utilizing the data set for biomedical informatics development. The sky is the limit.

Data Download

Before downloading, you should read and accept the data download agreement.

contest_dataset.txt · Last modified: 2012/10/22 11:17 by okko