Collection of high-throughput computational biology benchmarking data sets

Introduction

Benchmarking of new computational biology methods, as well as interpretation of published benchmarking studies, can be complicated by the lack of "standardized" benchmarking data sets and result representations. To address these issues, starting with a set of well-defined computational biology questions, we collect on this page links to (synthetic or real) benchmarking data sets with extensive truth available. In many cases, the data sets are available in multiple formats ("raw" and "processed"), which allows methods working with different input formats to be compared.

In an effort to standardize result representations, we have developed the iCOBRA R package. Based on a collection of results (p-values or general scores) and true labels or a continuous target, iCOBRA can be used to calculate a set of widely used performance metrics and generate flexible and customizable graphs. The package also contains a shiny application for interactive use. The application can be run locally, or accessed via our shiny server.
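To make the kind of metric mentioned above concrete, the following is a minimal Python sketch (not iCOBRA's implementation, which is an R package) of how a true positive rate and a false discovery rate can be computed from adjusted p-values and binary truth labels. The function name `tpr_fdr`, the 0.05 threshold and the choice to treat missing values as "not called" are assumptions made for illustration.

```python
def tpr_fdr(adj_pvalues, truth, threshold=0.05):
    """Return (TPR, FDR) when calling features with adjusted p-value <= threshold.

    adj_pvalues: list of floats, or None for features without a result
                 (assumption: missing values count as not called).
    truth:       list of 0/1 labels, 1 meaning truly differential.
    """
    called = [p is not None and p <= threshold for p in adj_pvalues]
    tp = sum(1 for c, t in zip(called, truth) if c and t)
    fp = sum(1 for c, t in zip(called, truth) if c and not t)
    n_true = sum(1 for t in truth if t)
    n_called = tp + fp
    tpr = tp / n_true if n_true else float("nan")
    fdr = fp / n_called if n_called else 0.0
    return tpr, fdr


# Toy input: six features, the last one has no result from the method.
adj_p = [0.001, 0.03, 0.20, 0.04, 0.60, None]
status = [1, 1, 1, 0, 0, 1]
tpr, fdr = tpr_fdr(adj_p, status)
print(f"TPR = {tpr:.2f}, FDR = {fdr:.2f}")
```

Evaluating the same function over a grid of thresholds gives the points of an FDR/TPR curve.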

For more information about iCOBRA, we refer to the Nature Methods publication and the Bioconductor package.

If you have questions or would like to add a link to another benchmarking data set, please contact Mark Robinson or Charlotte Soneson.

Differential isoform usage, human and fruit fly

This is a synthetic data set intended for evaluating methods that detect differential isoform usage. The data set contains two simulations: one representing data from Homo sapiens and the other representing data from Drosophila melanogaster. Each simulation consists of six samples (representing biological replicates), three from each of two simulated conditions. The generation and processing of the data, as well as the results of applying eleven different methods for detecting differential isoform usage, are described in Soneson, Matthes et al (2015). All data files have been deposited in ArrayExpress, with accession number E-MTAB-3766. From the ArrayExpress website, the following data types are available:

FASTQ files with simulated reads (available via links in the "Raw" column). Data were simulated as paired-end reads, and thus there are two FASTQ files for each sample.
BAM files with aligned reads (available via links in the "Processed" column). Reads were aligned to the genome and transcriptome using TopHat.

In addition, we provide ground truth on two levels: first, on the transcript level (true expression, true differential usage status), and second, on the gene level (the main focus of the manuscript mentioned above, giving the true differential isoform usage status for each gene). The gene-level truth archives below also contain gene-level differential isoform usage results (adjusted p-values) for each method evaluated in the manuscript. All text files contained in the archives below can be used as input to the iCOBRA shiny app.

Transcript level truth (true TPM expression value for each transcript in each sample, differential usage status for each transcript, human and fruit fly): transcript_truth.zip
Gene level DTU truth and results, fruit fly simulation: diff_splicing_comparison_drosophila.zip
Gene level DTU truth and results, human simulation: diff_splicing_comparison_human.zip
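Before computing any metric, truth and result tables have to be matched by feature identifier. The Python sketch below illustrates that step on two small tab-delimited tables. The column names (`gene`, `ds_status`, `methodA`, `methodB`) and the use of `NA` for missing values are hypothetical and are not taken from the archives above, so check the actual file headers before adapting it.

```python
import csv
import io

# Hypothetical file contents; the real archives may use other column names.
truth_txt = "gene\tds_status\ng1\t1\ng2\t0\ng3\t1\n"
results_txt = "gene\tmethodA\tmethodB\ng1\t0.01\t0.20\ng2\t0.50\tNA\ng4\t0.03\t0.04\n"


def read_table(text):
    """Parse a tab-delimited table into {gene: row_dict}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["gene"]: row for row in reader}


def to_float(value):
    """Convert a text field to float, mapping empty or NA fields to None."""
    return None if value in ("", "NA") else float(value)


truth = read_table(truth_txt)
results = read_table(results_txt)

# Keep every gene that has a truth label; genes without a result get None.
merged = {}
for gene, trow in truth.items():
    rrow = results.get(gene, {})
    merged[gene] = {
        "status": int(trow["ds_status"]),
        "methodA": to_float(rrow.get("methodA", "NA")),
        "methodB": to_float(rrow.get("methodB", "NA")),
    }

for gene in sorted(merged):
    print(gene, merged[gene])
```

Joining on the truth table rather than on the result table keeps genes that a method failed to report, so they can be counted as missed rather than silently dropped.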

Differential gene expression, human

This is a synthetic data set generated with the aim of evaluating methods for differential expression analysis from RNA-seq data. Here, the count matrices were simulated directly (and hence no raw data files are available). The data were simulated according to the process described by Soneson and Delorenzi (2013), using the compcodeR R package. Many more similar simulated data sets (including replicates of the one provided here), as well as results from several methods and parameter settings, are available from the website of the compcodeR package. The simulation data below are extracted from the "NB_625_625" data set from this webpage, with five samples in each of two conditions. Note that the results provided below were not necessarily obtained with the latest version of the respective software, and only serve as example results for comparison. The following settings were used for each of the included methods:

baySeq: version 1.14.1, quantile normalization, equal dispersion in the two groups, 5000 resamplings
DESeq2: version 1.2.5, parametric dispersion fit, Wald test, beta prior, independent filtering and Cook's distance cutoff (default), no imputation
DSS: version 1.8.0, quantile normalization, no trend
edgeRGLM: version 3.4.0, GLM, TMM normalization, Cox Reid likelihood, tagwise dispersion, shrunk towards trend
voomlimma: limma version 3.18.1, edgeR version 3.4.0, TMM normalization
SAMseq: version 2.0

The data and results can be accessed via the link below. The text files with truth and results can be used as the input for the iCOBRA shiny app:

Compressed archive containing count matrix, sample annotation, truth and merged results.
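To illustrate what simulating a count matrix directly can look like, here is a minimal Python sketch that draws counts for two conditions with five samples each. It assumes negative binomial counts (which the "NB" in the data set name suggests) sampled as a gamma-Poisson mixture. It is not the compcodeR procedure: the number of genes, mean range, fold change and dispersion are arbitrary illustrative values, and all function names are hypothetical.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson variate with Knuth's method (suitable for small means only)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def nb_count(rng, mean, dispersion):
    """Negative binomial draw with variance = mean + dispersion * mean**2."""
    # Gamma with shape 1/dispersion and scale mean*dispersion has the given mean.
    lam = rng.gammavariate(1.0 / dispersion, mean * dispersion)
    return poisson(rng, lam)


def simulate_counts(n_genes=20, n_per_group=5, n_de=4, fold=4.0,
                    dispersion=0.1, seed=1):
    """Return (counts, truth): a gene x sample matrix and 0/1 DE labels."""
    rng = random.Random(seed)
    counts, truth = [], []
    for g in range(n_genes):
        base = rng.uniform(5, 50)            # mean in condition 1
        is_de = g < n_de                     # first n_de genes are upregulated
        mean2 = base * fold if is_de else base
        row = [nb_count(rng, base, dispersion) for _ in range(n_per_group)]
        row += [nb_count(rng, mean2, dispersion) for _ in range(n_per_group)]
        counts.append(row)
        truth.append(int(is_de))
    return counts, truth


counts, truth = simulate_counts()
print(len(counts), "genes x", len(counts[0]), "samples;", sum(truth), "truly DE")
```

Because the truth vector is produced alongside the counts, every gene has a known differential expression status, which is what makes such simulations usable for benchmarking.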

Relative inclusion of alternative splicing events, human

The data set from Shen et al (2012) was used by Alamancos et al (2015) to evaluate methods for estimating relative inclusion rates (psi values) of alternative splicing events. Four methods/method combinations (Sailfish+SUPPA, RSEM+SUPPA, MATS, MISO) were used to estimate the relative inclusion of events. The data set contains two samples: MDA-MB-231 cells with overexpression of ESRP1, and MDA-MB-231 cells with an empty vector (EV). The relative inclusion rates for 163 alternative splicing events (skipped exons) were estimated by RT-PCR, and these estimates are treated as the ground truth in the iCOBRA data files below. The evaluations were performed using three different annotation catalogs (Ensembl, Ensembl CDS and RefSeq), and in each case the evaluation used all events for which psi values were available from all methods.
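As a rough illustration of the quantities involved, the Python sketch below computes a psi value as the fraction of abundance supporting inclusion, restricts the comparison to events that have an estimate from every method, and computes a Pearson correlation against reference values. This is a simplified sketch, not the procedure of any of the tools named above. The function names, event identifiers and numbers are made up for the example.

```python
import math


def psi(inclusion_abundances, skipping_abundances):
    """Relative inclusion: inclusion / (inclusion + skipping), or None if both are 0."""
    inc = sum(inclusion_abundances)
    total = inc + sum(skipping_abundances)
    return inc / total if total > 0 else None


def complete_events(estimates):
    """Return sorted event ids that have a non-missing psi for every method.

    estimates: {method_name: {event_id: psi or None}}
    """
    tables = list(estimates.values())
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)
    return sorted(e for e in shared if all(t[e] is not None for t in tables))


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy example with hypothetical events and methods.
estimates = {
    "methodA": {"e1": 0.20, "e2": 0.80, "e3": None},
    "methodB": {"e1": 0.25, "e2": 0.70, "e3": 0.50, "e4": 0.10},
}
events = complete_events(estimates)
print(events)
print(psi([3.0, 1.0], [4.0]))
```

Restricting to the shared events means that every method is scored on the same set, so the correlations are directly comparable.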

The raw data is available from SRA under accession ID SRX122589.
Ensembl annotation release 75 and RefSeq annotation (NM_ and NR_ transcripts) assembly hg19 were used to define alternative splicing events.
The psi estimates for the alternative splicing events can be downloaded from the Supplementary website of Alamancos et al (2015), or from here.
True (RT-PCR) and estimated psi values used for the correlation estimates (a subset of all the events in the previous point, as described in the paper) were contributed by the authors and are available in a compressed archive. This archive contains two additional method combinations: kallisto+SUPPA and salmon+SUPPA.

Transcript/gene abundance estimation

This data set was generated for the study by Kanitz et al (2015) with the purpose of evaluating methods for transcript abundance estimation from RNA-seq data. Eleven transcript abundance estimation methods were evaluated by applying them to synthetic data with different sequencing depths (1, 3, 10, 30 and 100 million reads) as well as to real RNA-seq data from human and mouse. The "ground truth" for the experimental data sets was obtained using 3' end sequencing with the A-seq-2 protocol, as described by Kanitz et al.

Raw data (FASTQ files, BAM files) and analysis scripts are available from the supplementary website of the paper.
Truth and result files suitable for iCOBRA were provided by the authors and are available as a compressed archive.
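When the truth is a continuous value such as an abundance, a natural comparison is a rank correlation between estimated and true values, possibly after summing transcript estimates to the gene level. The Python sketch below shows both steps. It is an illustration only and is not claimed to be the evaluation used by Kanitz et al. The transcript-to-gene mapping, identifiers and values are hypothetical.

```python
from collections import defaultdict


def gene_level(transcript_abundance, tx2gene):
    """Sum transcript abundances (e.g. TPM) over the transcripts of each gene."""
    totals = defaultdict(float)
    for tx, value in transcript_abundance.items():
        totals[tx2gene[tx]] += value
    return dict(totals)


def average_ranks(values):
    """1-based ranks, with tied values receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den


# Hypothetical estimates and truth.
tx2gene = {"t1": "gA", "t2": "gA", "t3": "gB", "t4": "gC"}
estimated_tx = {"t1": 10.0, "t2": 5.0, "t3": 0.5, "t4": 40.0}
true_gene = {"gA": 12.0, "gB": 1.0, "gC": 35.0}

estimated_gene = gene_level(estimated_tx, tx2gene)
genes = sorted(true_gene)
rho = spearman([estimated_gene[g] for g in genes], [true_gene[g] for g in genes])
print(estimated_gene, rho)
```

A rank correlation is insensitive to the scale of the estimates, which helps when methods report abundances in different units.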

Metagenome analysis

This synthetic data set was generated in a recent preprint by Lindgreen et al, aimed at evaluating methods for metagenome abundance quantification. Estimates are obtained from 14 different methods, at two levels of detail: phyla and genera. Six samples (A1-A3, B1-B3) were simulated.

Raw data is available from the supplementary website of Lindgreen et al (2015).
The true and estimated relative abundances were extracted from the Supplementary Tables of Lindgreen et al (2015). Truth and result files in a format suitable for iCOBRA are provided here as a compressed archive.
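For readers unfamiliar with relative abundances, the Python sketch below converts per-taxon read counts into relative abundances and measures the total absolute deviation from a known composition. It is a generic illustration and not the evaluation performed by Lindgreen et al. The function names and all numbers are made up.

```python
def relative_abundance(counts):
    """Scale per-taxon counts so that they sum to 1."""
    total = sum(counts.values())
    return {taxon: value / total for taxon, value in counts.items()}


def l1_error(estimated, truth):
    """Sum of absolute differences over all taxa seen in either profile."""
    taxa = set(estimated) | set(truth)
    return sum(abs(estimated.get(t, 0.0) - truth.get(t, 0.0)) for t in taxa)


# Hypothetical phylum-level read counts for one sample and its true composition.
assigned_reads = {"Proteobacteria": 80, "Firmicutes": 20}
true_profile = {"Proteobacteria": 0.7, "Firmicutes": 0.3}

estimated_profile = relative_abundance(assigned_reads)
error = l1_error(estimated_profile, true_profile)
print(estimated_profile, round(error, 3))
```

Taxa that appear in only one of the two profiles are treated as having abundance 0 in the other, so false positive and false negative taxa both add to the error.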