Collection of high-throughput computational biology benchmarking data sets


Benchmarking of new computational biology methods, as well as interpretation of published benchmarking studies, can be complicated by the lack of "standardized" benchmarking data sets and result representations. To address these issues, starting with a set of well-defined computational biology questions, we collect on this page links to (synthetic or real) benchmarking data sets for which extensive ground truth is available. In many cases, the data sets are available in multiple formats ("raw" and "processed"), which allows methods working with different input formats to be compared.

In an effort to standardize result representations, we have developed the iCOBRA R package. Based on a collection of results (p-values or general scores) and true labels or a continuous target, iCOBRA can be used to calculate a set of widely used performance metrics and generate flexible and customizable graphs. The package also contains a shiny application for interactive use. The application can be run locally, or accessed via our shiny server.
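As an illustration of the kind of metric computed from adjusted p-values and binary true labels, the following minimal Python sketch calculates the true positive rate (TPR) and the observed false discovery rate (FDR) at a fixed threshold. It is not iCOBRA code; the function name `tpr_fdr`, the dictionary-based input and the toy values are assumptions made for this example only.

```python
def tpr_fdr(adj_pvalues, truth, threshold=0.05):
    """Return (TPR, FDR) at the given adjusted p-value threshold.

    adj_pvalues: dict mapping feature -> adjusted p-value (None if missing)
    truth:       dict mapping feature -> 1 (truly differential) or 0
    Hypothetical helper for illustration; not part of iCOBRA.
    """
    # Features with known truth that the method calls significant.
    called = {
        f for f in truth
        if adj_pvalues.get(f) is not None and adj_pvalues[f] <= threshold
    }
    positives = {f for f, status in truth.items() if status == 1}

    true_pos = len(called & positives)
    false_pos = len(called - positives)

    tpr = true_pos / len(positives) if positives else float("nan")
    fdr = false_pos / len(called) if called else 0.0
    return tpr, fdr


# Toy example (made-up values, not taken from any data set on this page).
truth = {"g1": 1, "g2": 1, "g3": 0, "g4": 0, "g5": 1}
adj_p = {"g1": 0.001, "g2": 0.20, "g3": 0.03, "g4": 0.80, "g5": 0.04}

tpr, fdr = tpr_fdr(adj_p, truth, threshold=0.05)
print(f"TPR = {tpr:.3f}, FDR = {fdr:.3f}")
```

In this sketch, features with a missing adjusted p-value are treated as not called, which lowers the TPR rather than dropping the feature from the evaluation.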

For more information about iCOBRA, we refer to the Nature Methods publication and the Bioconductor package.

If you have questions or would like to add a link to another benchmarking data set, please contact Mark Robinson or Charlotte Soneson.

Differential isoform usage, human and fruit fly

This is a synthetic data set intended for evaluating methods that detect differential isoform usage. The data set contains two simulations: one representing data from Homo sapiens and the other representing data from Drosophila melanogaster. Each simulation consists of six samples (representing biological replicates), three from each of two simulated conditions. The generation and processing of the data, as well as the results of applying eleven different methods for differential isoform usage, are described in Soneson, Matthes et al (2015). All data files have been deposited in ArrayExpress, with accession number E-MTAB-3766. From the ArrayExpress website, the following data types are available:

In addition, we provide ground truth on two different levels: first, on transcript level (true expression, true differential usage status) and second, on gene level (the main focus of the manuscript mentioned above, giving the true differential isoform usage status for each gene). The gene-level truth archives below also contain gene-level differential isoform usage results (adjusted p-values) for each method evaluated in the manuscript. All text files contained in the archives below can be used as input to the iCOBRA shiny app.
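To make the relation between the two truth levels concrete, the sketch below computes relative isoform usage (each transcript's share of its gene's total expression) and collapses a transcript-level status to gene level. The rule that a gene is flagged when any of its transcripts is flagged is an assumption made for this illustration, not a description of how the truth files were built; all names and values are hypothetical.

```python
from collections import defaultdict


def isoform_fractions(expression, tx2gene):
    """Return each transcript's fraction of its gene's total expression."""
    gene_totals = defaultdict(float)
    for tx, value in expression.items():
        gene_totals[tx2gene[tx]] += value
    return {
        tx: (value / gene_totals[tx2gene[tx]]
             if gene_totals[tx2gene[tx]] > 0 else 0.0)
        for tx, value in expression.items()
    }


def gene_level_status(tx_status, tx2gene):
    """Flag a gene (1) if any of its transcripts is flagged (assumed rule)."""
    status = defaultdict(int)
    for tx, flag in tx_status.items():
        gene = tx2gene[tx]
        status[gene] = max(status[gene], flag)
    return dict(status)


# Toy example: gene gA swaps the usage of its two isoforms, gB is unchanged.
tx2gene = {"t1": "gA", "t2": "gA", "t3": "gB"}
cond1 = {"t1": 30.0, "t2": 10.0, "t3": 5.0}
cond2 = {"t1": 10.0, "t2": 30.0, "t3": 5.0}
tx_status = {"t1": 1, "t2": 1, "t3": 0}

frac1 = isoform_fractions(cond1, tx2gene)
frac2 = isoform_fractions(cond2, tx2gene)
gene_status = gene_level_status(tx_status, tx2gene)
print(frac1, frac2, gene_status)
```

Note that the total expression of gA is identical in both conditions here; only the isoform proportions change, which is what distinguishes differential isoform usage from differential gene expression.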

Differential gene expression, human

This is a synthetic data set generated with the aim of evaluating methods for differential expression analysis from RNA-seq data. Here, the count matrices were simulated directly (and hence no raw data files are available). The data was simulated according to the process described by Soneson and Delorenzi (2013), using the compcodeR R package. Many more similar simulated data sets (including replicates of the one provided here) as well as results from several methods and parameter settings are available from the website of the compcodeR package. The simulation data below is extracted from the "NB_625_625" data set from this webpage, with five samples in each of two conditions. Note that the results provided below may not have been obtained with the latest version of the respective software, and only serve as example results for comparison. The following settings were used for each of the included methods:

The data and results can be accessed via the link below. The text files with truth and results can be used as the input for the iCOBRA shiny app:

Relative inclusion of alternative splicing events, human

The data set from Shen et al (2012) was used by Alamancos et al (2015) to evaluate methods for estimating relative inclusion rates (psi values) of alternative splicing events. Four methods/method combinations (Sailfish+SUPPA, RSEM+SUPPA, MATS, MISO) were used to estimate relative inclusion of events. The data set contains two samples: MDA-MB-231 cells with overexpression of ESRP1, and MDA-MB-231 cells with an empty vector (EV). The relative inclusion rates for 163 alternative splicing events (skipped exons) were estimated by RT-PCR, and these estimates are treated as the ground truth in the iCOBRA data files below. The evaluations were performed using three different annotation catalogs (Ensembl, Ensembl CDS and RefSeq). In each case, the evaluation used all events for which psi values were available from all methods.
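The restriction to events with psi values from all methods, followed by a comparison against a continuous ground truth, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the method names, event identifiers and values are invented, and Pearson correlation is used here only as one example of an agreement measure, not necessarily the one used in the cited studies.

```python
import math


def complete_events(estimates, truth):
    """Events with a ground-truth value and a psi estimate from every method."""
    shared = set(truth)
    for values in estimates.values():
        shared &= {event for event, psi in values.items() if psi is not None}
    return shared


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy data (made up): methodA has no estimate for event e4.
rtpcr = {"e1": 0.1, "e2": 0.5, "e3": 0.9, "e4": 0.3}
estimates = {
    "methodA": {"e1": 0.2, "e2": 0.6, "e3": 1.0, "e4": None},
    "methodB": {"e1": 0.1, "e2": 0.4, "e3": 0.9, "e4": 0.35},
}

shared = sorted(complete_events(estimates, rtpcr))
corr = {
    method: pearson([rtpcr[e] for e in shared], [values[e] for e in shared])
    for method, values in estimates.items()
}
print(shared, corr)
```

Evaluating every method on the same set of events keeps the comparison fair: event e4 is excluded for methodB as well, even though methodB has an estimate for it.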

Transcript/gene abundance estimation

This data set was generated for the study by Kanitz et al (2015) with the purpose of evaluating methods for transcript abundance estimation from RNA-seq data. Eleven transcript abundance estimation methods were evaluated by application to synthetic data with different sequencing depths (1, 3, 10, 30 and 100 million reads) as well as real RNA-seq data from human and mouse. The "ground truth" for the experimental data sets was obtained using 3' end sequencing with the A-seq-2 protocol as described by Kanitz et al.

Metagenome analysis

This synthetic data set was generated in a recent preprint by Lindgreen et al, aimed at evaluating methods for metagenome abundance quantification. Estimates were obtained from 14 different methods, at two levels of taxonomic detail: phyla and genera. Six samples (A1-A3, B1-B3) were simulated.