About conquer
The conquer (consistent quantification of external rna-seq data) repository is developed by Charlotte Soneson and Mark D Robinson at the University of Zurich, Switzerland. It is implemented in shiny and provides access to consistently processed public single-cell RNA-seq data sets. Below is a short description of the workflow used to process the raw reads in order to generate the data provided in the repository.
If you use conquer for your work, please cite
- C Soneson & MD Robinson: Bias, robustness and scalability in single-cell differential expression analysis. Nature Methods 15(4):255-261 (2018).
The information provided in the columns Brief description, Protocol and Protocol type was inferred and summarized from the information provided by the data generators in the public repositories. We refer to the original descriptions for more detailed information.
Index building
In order to use Salmon to quantify the transcript abundances in a given sample, we first need to index the corresponding reference transcriptome. For a given organism, we download the fasta files containing cDNA and ncRNA sequences from Ensembl, complement these with ERCC spike-in sequences, and build a Salmon quasi-mapping index for the entire catalog. Note that the scater report for a given data set (available in the scater report column) details the precise version of the transcriptome that was used for the quantification. For data sets with “long” reads (longer than 50 bp) we use the default k=31, while for “short reads” (typically around 25 bp) we set k=15.
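The choice of k described above can be sketched in a few lines of R. This is only an illustration, not the code used to build conquer: the helper names and file names are hypothetical, the handling of reads of exactly 50 bp or less as "short" is an assumption, and the exact `salmon index` options may differ between Salmon versions.

```r
# Pick the k-mer length for the Salmon index from the read length.
# The 50 bp cutoff follows the text; treating every read of 50 bp or
# shorter as "short" is an assumption of this sketch.
choose_k <- function(read_length) {
  if (read_length > 50) 31L else 15L
}

# Assemble (but do not run) a `salmon index` command line.
# The fasta file and index directory names are hypothetical placeholders.
salmon_index_command <- function(fasta, index_dir, read_length) {
  paste("salmon index",
        "-t", shQuote(fasta),       # combined cDNA + ncRNA + ERCC sequences
        "-i", shQuote(index_dir),   # output directory for the index
        "-k", choose_k(read_length))
}

cat(salmon_index_command("cdna_ncrna_ercc.fa", "salmon_index", 100), "\n")
cat(salmon_index_command("cdna_ncrna_ercc.fa", "salmon_index_short", 25), "\n")
```

The resulting string could be passed to `system()` on a machine where Salmon is installed.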
We also create a lookup table relating transcript IDs to the corresponding gene IDs. This information is obtained by parsing the sequence names in the cDNA and ncRNA fasta files. From these names we also obtain the genomic coordinates for each feature.
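As a rough sketch of how such a lookup table could be derived, the base R code below parses two header lines written in the style of Ensembl cDNA/ncRNA fasta files. The headers are made-up placeholders, the helper `extract_field` is hypothetical, and the field layout is assumed to follow the usual Ensembl convention (`chromosome:<assembly>:<name>:<start>:<end>:<strand>`, `gene:`, `gene_symbol:`); it is not the code used for conquer.

```r
# Two header lines in the style of Ensembl cDNA/ncRNA fasta files.
# The entries are made-up placeholders that only mimic the layout.
headers <- c(
  ">ENSMUST00000000001.1 cdna chromosome:GRCm38:1:1000:2000:1 gene:ENSMUSG00000000001.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:GeneA",
  ">ENSMUST00000000002.1 ncrna chromosome:GRCm38:X:500:900:-1 gene:ENSMUSG00000000002.1 gene_biotype:lincRNA transcript_biotype:lincRNA gene_symbol:GeneB"
)

# Return the first capture group of `pattern` for each element of `x`
# (NA where the pattern does not match).
extract_field <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x))
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

coord <- "chromosome:[^:]+:"  # prefix up to and including the assembly name
tx2gene <- data.frame(
  transcript = extract_field(headers, "^>([^ ]+)"),
  gene       = extract_field(headers, "gene:([^ ]+)"),
  symbol     = extract_field(headers, "gene_symbol:([^ ]+)"),
  chromosome = extract_field(headers, paste0(coord, "([^:]+):")),
  start      = as.integer(extract_field(headers, paste0(coord, "[^:]+:([0-9]+):"))),
  end        = as.integer(extract_field(headers, paste0(coord, "[^:]+:[0-9]+:([0-9]+):"))),
  strand     = extract_field(headers, paste0(coord, "[^:]+:[0-9]+:[0-9]+:(-?1)")),
  stringsAsFactors = FALSE
)
print(tx2gene)
```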
Sample list and run matching
The first step is to determine the set of samples included in a given data set. We download a “RunInfo.csv” file for the data set from SRA and a Series Matrix file from GEO, in order to link samples both to individual runs and to phenotypic information. If the data set is not available from GEO, we construct a phenotype data file from the information provided by the corresponding repository.
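The matching of samples to runs and phenotypes can be illustrated with two toy tables. All accessions below are made up, and the `Run` and `SampleName` column names are assumptions about the layout of the RunInfo file; `geo_accession` and `source_name_ch1` are columns that appear in the GEO-derived annotation.

```r
# Toy stand-ins for the two downloaded tables; all accessions are made up.
run_info <- data.frame(
  Run        = c("SRR0000001", "SRR0000002", "SRR0000003"),
  SampleName = c("GSM0000001", "GSM0000001", "GSM0000002"),
  stringsAsFactors = FALSE
)
pheno <- data.frame(
  geo_accession   = c("GSM0000001", "GSM0000002"),
  source_name_ch1 = c("condition A", "condition B"),
  stringsAsFactors = FALSE
)

# All runs belonging to each sample (a sample can have several runs).
runs_per_sample <- split(run_info$Run, run_info$SampleName)

# Attach the phenotype information to every run.
run_pheno <- merge(run_info, pheno,
                   by.x = "SampleName", by.y = "geo_accession",
                   all.x = TRUE)

print(runs_per_sample)
print(run_pheno)
```

Using `all.x = TRUE` keeps runs whose sample has no phenotype record, so that such mismatches show up as missing values rather than disappearing silently.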
Quality control
For each sample in the data set, we find all the corresponding runs, and download and concatenate the corresponding FastQ files from SRA. There is also an optional step to trim adapters from the reads using cutadapt. Next, we run FastQC to generate a quality control file for each concatenated read file (one or two files per sample depending on whether it was processed with a single-end or paired-end sequencing protocol).
Abundance quantification
After the QC, we run Salmon to estimate the abundance of each transcript from the catalog described above in each sample. The Salmon output files are then compressed in an archive and can be downloaded from conquer (see the salmon archive column).
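A quantification call for one sample might be assembled as in the sketch below. The helper name, the file names and the use of automatic library type detection (`-l A`) are assumptions made for illustration; the Salmon parameters actually used for a data set are summarised in the metadata of the corresponding object.

```r
# Assemble (but do not run) a `salmon quant` command line for one sample.
# `fastq` holds one file (single-end) or two files (paired-end).
salmon_quant_command <- function(index_dir, fastq, out_dir) {
  stopifnot(length(fastq) %in% c(1, 2))
  reads <- if (length(fastq) == 2) {
    c("-1", shQuote(fastq[1]), "-2", shQuote(fastq[2]))  # paired-end
  } else {
    c("-r", shQuote(fastq))                              # single-end
  }
  paste(c("salmon quant",
          "-i", shQuote(index_dir),
          "-l A",                    # assumption: let Salmon infer the library type
          reads,
          "-o", shQuote(out_dir)),
        collapse = " ")
}

# Hypothetical file names for a single-end and a paired-end sample.
cmd_se <- salmon_quant_command("salmon_index", "sample1.fastq.gz", "salmon/sample1")
cmd_pe <- salmon_quant_command("salmon_index",
                               c("sample2_1.fastq.gz", "sample2_2.fastq.gz"),
                               "salmon/sample2")
cat(cmd_se, cmd_pe, sep = "\n")
```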
For data obtained with non-full-length library preparation protocols (e.g. targeting only the 3' or 5' end of transcripts), we quantify transcript and gene abundances using the umis pipeline developed by Valentine Svensson. Briefly, we quasimap the reads to the transcriptome using RapMap and use the counting capabilities of umis to obtain feature counts.
Summary report - MultiQC
Once FastQC and Salmon (or RapMap/umis) have been applied to all samples in the data set, we run MultiQC to summarise all the information into one report. This can also be downloaded from conquer (see the MultiQC report column). This report contains quality scores for all the samples and can be used to determine if there are problematic samples and whether the data set is good enough for the purposes of the user or needs to be subsetted.
Data summarisation
The abundances estimated by Salmon are summarised and provided to the user via conquer in the form of a MultiAssayExperiment object. This object can be downloaded via the buttons in the MultiAssayExperiment column. To generate this object, we first use the tximport package to read the Salmon output into R. This returns both count estimates and TPM estimates for each transcript. Next, we summarise the transcript-level information to the gene level. The gene-level TPM is defined as the sum of the TPMs of the corresponding transcripts, and similarly for the gene-level counts. We also provide “scaled TPMs” (see http://f1000research.com/articles/4-1521/ or the tximport vignette for a discussion), that is, summarised TPMs scaled to a “count scale”. In the summarisation step, we make use of the transcript-to-gene lookup table generated above.
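To make the summarisation concrete, the following base R sketch reproduces the arithmetic on a toy example with three transcripts and two samples. It is a simplified re-implementation for illustration, not the tximport code: all values are made up, and the fallback used for genes with zero abundance (the plain mean of the transcript lengths) is an assumption that is simpler than what tximport does.

```r
# Toy transcript-level input: 3 transcripts, 2 samples (all values made up).
tx2gene <- data.frame(transcript = c("tx1", "tx2", "tx3"),
                      gene       = c("geneA", "geneA", "geneB"),
                      stringsAsFactors = FALSE)
dn <- list(tx2gene$transcript, c("s1", "s2"))
tpm    <- matrix(c(10, 30, 60,      0,    0, 100), nrow = 3, dimnames = dn)
count  <- matrix(c( 5, 15, 80,      0,    0,  50), nrow = 3, dimnames = dn)
efflen <- matrix(c(1000, 2000, 500, 1000, 2000, 500), nrow = 3, dimnames = dn)
g <- tx2gene$gene

# Gene-level TPMs and counts: sums over the transcripts of each gene.
gene_tpm   <- rowsum(tpm, g)
gene_count <- rowsum(count, g)

# Average transcript length per gene and sample, weighted by abundance.
avetxlength <- rowsum(tpm * efflen, g) / gene_tpm
# Genes with zero abundance give 0/0; fall back to the unweighted mean length
# (a simplification of this sketch).
unweighted <- rowsum(efflen, g) / as.vector(rowsum(rep(1, nrow(efflen)), g))
missing <- is.nan(avetxlength)
avetxlength[missing] <- unweighted[missing]

# Length-scaled TPMs: multiply the TPMs by the gene's mean length across
# samples, then rescale each sample to its original total count.
scaled <- gene_tpm * rowMeans(avetxlength)
count_lstpm <- t(t(scaled) * (colSums(gene_count) / colSums(scaled)))

print(gene_tpm)
print(avetxlength)
print(count_lstpm)
```

By construction the column sums of `count_lstpm` match those of `gene_count`, which is what is meant by bringing the TPMs onto the "count scale".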
The provided MultiAssayExperiment object contains two “experiments”, corresponding to the gene-level and transcript-level values. The gene-level experiment contains four “assays”:
- TPM
- count
- count_lstpm (count-scale length-scaled TPMs)
- avetxlength (the average transcript length, which can be used as offsets in count models based on the count assay, see http://f1000research.com/articles/4-1521/).
The transcript-level experiment contains three “assays”:
- TPM
- count
- efflength (the effective length estimated by Salmon)
The MultiAssayExperiment also contains the phenotypic data (in the colData slot), as well as some metadata for the data set (the genome, the organism, a summary of the Salmon parameters and the fraction of reads that were mapped, and the date when the object was generated). Please note that the format of MultiAssayExperiment objects changed with version 1.1.49 of the MultiAssayExperiment package, and in particular the pData slot is now deprecated in favor of colData. The objects provided in conquer follow the new format.
Summary report - scater
In order to give users another way of investigating whether a data set is useful for their purposes, we also provide an exploratory analysis report. This is largely based on functions from the scater Bioconductor package, applied to data extracted from the MultiAssayExperiment object. The report calculates and visualises various quality measures for the cells, and provides low-dimensional representations of the cells, colored by different phenotypic annotations.
Acknowledgements
We would like to thank Simon Andrews for help with FastQC, Mike Love and Valentine Svensson for providing instructions on how to retrieve the URL for the FastQ file(s) of a given SRA run (see here and here), Davis McCarthy for input regarding scater and Nicholas Hamilton for instructions on how to generate a standardized report based on a provided R object (see here). Finally, we would like to acknowledge the developers of all the tools we use to prepare the data for conquer.
Presentations/publications
conquer was presented as a poster at the Single Cell Genomics conference in Hinxton, UK, in September 2016. A detailed description of the database and an example of its use in an evaluation of differential expression analysis methods for single-cell RNA-seq data can be found in:
- C Soneson & MD Robinson: Bias, robustness and scalability in differential expression analysis of single-cell RNA-seq data. bioRxiv doi:10.1101/143289 (2017).
Code
The code used for conquer is available via GitHub.
Changelog
- 2017-07-22: Updated to Bioconductor 3.5 (affects MultiAssayExperiment objects and scater results). Updated MultiQC to version 1.1
- 2016-10-27: Updated to Bioconductor 3.4 (affects MultiAssayExperiment objects and scater results)
- 2016-10-25: Updated MultiQC to version 0.8
The following list contains the samples that were excluded from each of the data sets. Most of these samples were excluded since they do not represent single cells. In rare cases (indicated in italics), the download or processing of the sample failed.
GSE45719
GSM1112582, GSM1112583, GSM1112584, GSM1112585, GSM1112586, GSM1112587, GSM1112588, GSM1112589, GSM1278009, GSM1278010, GSM1278011, GSM1278012, GSM1278013, GSM1278014, GSM1278015, GSM1278016, GSM1278026, GSM1278027, GSM1278028, GSM1278029, GSM1278030, GSM1278031, GSM1278032, GSM1278033, GSM1278034, GSM1278035
GSE60749-GPL13112
GSM1487049, GSM1487050, GSM1487051, GSM1487052, GSM1487053, GSM1487054, GSM1487055, GSM1487056, GSM1487057, GSM1487058, GSM1487059, GSM1487060, GSM1487061, GSM1487062, GSM1487063, GSM1487064, GSM1487065, GSM1487066, GSM1487067, GSM1487068, GSM1487069, GSM1487070, GSM1487071, GSM1487072, GSM1487073, GSM1487074
GSE57872
GSM1396263, GSM1396264, GSM1396265, GSM1396266, GSM1396267, GSM1396268, GSM1396269, GSM1396270, GSM1396271, GSM1396272, GSM1396273
GSE48968
GSM1190890, GSM1190891, GSM1190892, GSM1190893, GSM1190894, GSM1190895, GSM1190896, GSM1190897, GSM1190898, GSM1190899, GSM1190900, GSM1190901, GSM1190902
GSE41265
GSM1012795, GSM1012796, GSM1012797, GSM1110889, GSM1110890, GSM1110891
GSE44183-GPL11154
GSM1080212
GSE52529-GPL16791
GSM1269332, GSM1269333, GSM1269334, GSM1269335, GSM1269336, GSM1269337, GSM1269338, GSM1269339, GSM1269340, GSM1269341, GSM1269342, GSM1269343
GSE63818
GSM1677801, GSM1677802, GSM1677803, GSM1677804, GSM1677805, GSM1677806, GSM1677807, GSM1677808, GSM1677809, GSM1677810, GSM1677811, GSM1677812, GSM1677813, GSM1677814, GSM1677815, GSM1677816, GSM1677817, GSM1677818, GSM1677819, GSM1677820, GSM1677821, GSM1677822, GSM1677823, GSM1677824, GSM1677825, GSM1677826, GSM1677827, GSM1677828, GSM1677829, GSM1677830, GSM1677831, GSM1677832, GSM1677833, GSM1677834, GSM1677835, GSM1677836
GSE71585-GPL13112
GSM1840998, GSM1840999, GSM1841000, GSM1839229, GSM1839230
GSE71585-GPL17021
GSM1840992, GSM1840993, GSM1840994, GSM1840995, GSM1840996, GSM1840997, GSM1840931, GSM1840932, GSM1840933, GSM1840934, GSM1840935, GSM1840936, GSM1840937, GSM1840938, GSM1840939, GSM1840940, GSM1840941, GSM1840942, GSM1840943
GSE100911
GSM2696330
GSE80232
GSM2121581, GSM2121582, GSM2121583
GSE84465
GSM2244841, GSM2244965, GSM2245176, GSM2245437, GSM2246972
Using the conquer database
Note! Starting from version 1.1.49, the pData slot in a MultiAssayExperiment is deprecated in favor of colData. The objects included in conquer are now updated to the new version.
To use a data set provided in the conquer database, download the corresponding R object from the MultiAssayExperiment column. As an illustration, we will assume that the file for the GSE41265 data set has been downloaded and is available in the current working directory. First, load the SummarizedExperiment and MultiAssayExperiment packages and read the file into R:
suppressPackageStartupMessages(library(SummarizedExperiment))
suppressPackageStartupMessages(library(MultiAssayExperiment))
(gse41265 <- readRDS("GSE41265.rds"))
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] gene: RangedSummarizedExperiment with 45686 rows and 18 columns
## [2] tx: RangedSummarizedExperiment with 113560 rows and 18 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample availability DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
The resulting object contains both gene and transcript abundances.
experiments(gse41265)
## ExperimentList class object of length 2:
## [1] gene: RangedSummarizedExperiment with 45686 rows and 18 columns
## [2] tx: RangedSummarizedExperiment with 113560 rows and 18 columns
Gene-level data
To access the gene abundances, get the gene experiment:
(gse41265_gene <- experiments(gse41265)[["gene"]])
## class: RangedSummarizedExperiment
## dim: 45686 18
## metadata(0):
## assays(4): TPM count count_lstpm avetxlength
## rownames(45686): ENSMUSG00000000001.4 ENSMUSG00000000003.15 ...
## ERCC-00170 ERCC-00171
## rowData names(3): gene genome symbol
## colnames(18): GSM1012777 GSM1012778 ... GSM1012793 GSM1012794
## colData names(0):
This object contains four assays, which can be accessed via the assays function:
- TPM: transcripts per million abundance estimates for each gene, obtained by summing the transcript TPMs for the gene's isoforms.
- count: gene read counts, obtained by summing the estimated read counts for the gene's isoforms.
- count_lstpm: length-scaled TPMs, an alternative abundance measure on the “count scale” that is not correlated with the average transcript length in a given sample. See the tximport Bioconductor package for more information.
- avetxlength: the average length of the transcripts expressed in each sample for each gene. See the tximport Bioconductor package for more information.
Each of these assays is a matrix with the respective values for each gene and each sample.
head(assays(gse41265_gene)[["TPM"]])
## GSM1012777 GSM1012778 GSM1012779 GSM1012780
## ENSMUSG00000000001.4 16.9260 68.9189000 27.2136000 25.6035
## ENSMUSG00000000003.15 0.0000 0.0000000 0.0000000 0.0000
## ENSMUSG00000000028.14 84.6329 0.0354825 0.0261671 0.0000
## ENSMUSG00000000031.15 0.0000 0.0000000 0.0000000 0.0000
## ENSMUSG00000000037.16 21.0225 0.0000000 0.0000000 0.0000
## ENSMUSG00000000049.11 0.0000 0.0000000 0.0000000 0.0000
## GSM1012781 GSM1012782 GSM1012783 GSM1012784
## ENSMUSG00000000001.4 1.003940 13.9319000 1.62186 27.26950
## ENSMUSG00000000003.15 0.000000 0.0000000 0.00000 0.00000
## ENSMUSG00000000028.14 0.028031 0.0816208 0.00000 18.28858
## ENSMUSG00000000031.15 0.000000 0.0000000 0.00000 0.00000
## ENSMUSG00000000037.16 0.000000 0.0000000 0.00000 0.00000
## ENSMUSG00000000049.11 0.000000 0.0000000 0.00000 0.00000
## GSM1012785 GSM1012786 GSM1012787 GSM1012788
## ENSMUSG00000000001.4 0.220791 0.14094 21.8415 42.991200
## ENSMUSG00000000003.15 0.000000 0.00000 0.0000 0.000000
## ENSMUSG00000000028.14 58.226400 0.00000 0.0000 0.118938
## ENSMUSG00000000031.15 0.000000 0.00000 0.0000 0.000000
## ENSMUSG00000000037.16 0.000000 0.00000 0.0000 0.000000
## ENSMUSG00000000049.11 0.000000 0.00000 0.0000 0.000000
## GSM1012789 GSM1012790 GSM1012791 GSM1012792
## ENSMUSG00000000001.4 156.447 0.109962 1.3144600 27.3401000
## ENSMUSG00000000003.15 0.000 0.000000 0.0000000 0.0000000
## ENSMUSG00000000028.14 0.000 3.537801 0.0382929 0.0340262
## ENSMUSG00000000031.15 0.000 0.000000 0.0000000 0.0000000
## ENSMUSG00000000037.16 0.000 0.000000 0.0000000 0.5705780
## ENSMUSG00000000049.11 0.000 0.000000 0.0000000 0.8658010
## GSM1012793 GSM1012794
## ENSMUSG00000000001.4 23.0052 32.8455000
## ENSMUSG00000000003.15 0.0000 0.0000000
## ENSMUSG00000000028.14 0.0000 0.0453766
## ENSMUSG00000000031.15 0.0000 0.0000000
## ENSMUSG00000000037.16 0.0000 0.0000000
## ENSMUSG00000000049.11 0.0000 0.0000000
head(assays(gse41265_gene)[["count"]])
## GSM1012777 GSM1012778 GSM1012779 GSM1012780
## ENSMUSG00000000001.4 777.612 3903.94 2088.36 1493.55
## ENSMUSG00000000003.15 0.000 0.00 0.00 0.00
## ENSMUSG00000000028.14 2352.123 1.00 1.00 0.00
## ENSMUSG00000000031.15 0.000 0.00 0.00 0.00
## ENSMUSG00000000037.16 90.000 0.00 0.00 0.00
## ENSMUSG00000000049.11 0.000 0.00 0.00 0.00
## GSM1012781 GSM1012782 GSM1012783 GSM1012784
## ENSMUSG00000000001.4 71.8391 875.125000 93.6992 1645.560
## ENSMUSG00000000003.15 0.0000 0.000000 0.0000 0.000
## ENSMUSG00000000028.14 1.0000 2.000001 0.0000 672.918
## ENSMUSG00000000031.15 0.0000 0.000000 0.0000 0.000
## ENSMUSG00000000037.16 0.0000 0.000000 0.0000 0.000
## ENSMUSG00000000049.11 0.0000 0.000000 0.0000 0.000
## GSM1012785 GSM1012786 GSM1012787 GSM1012788
## ENSMUSG00000000001.4 11.00 9 1515.99 1843.77
## ENSMUSG00000000003.15 0.00 0 0.00 0.00
## ENSMUSG00000000028.14 1823.75 0 0.00 1.00
## ENSMUSG00000000031.15 0.00 0 0.00 0.00
## ENSMUSG00000000037.16 0.00 0 0.00 0.00
## ENSMUSG00000000049.11 0.00 0 0.00 0.00
## GSM1012789 GSM1012790 GSM1012791 GSM1012792
## ENSMUSG00000000001.4 9756.93 6.0000 69.0064 1614.74
## ENSMUSG00000000003.15 0.00 0.0000 0.0000 0.00
## ENSMUSG00000000028.14 0.00 121.1216 1.0000 1.00
## ENSMUSG00000000031.15 0.00 0.0000 0.0000 0.00
## ENSMUSG00000000037.16 0.00 0.0000 0.0000 30.00
## ENSMUSG00000000049.11 0.00 0.0000 0.0000 16.00
## GSM1012793 GSM1012794
## ENSMUSG00000000001.4 1277.01 1459.1
## ENSMUSG00000000003.15 0.00 0.0
## ENSMUSG00000000028.14 0.00 1.0
## ENSMUSG00000000031.15 0.00 0.0
## ENSMUSG00000000037.16 0.00 0.0
## ENSMUSG00000000049.11 0.00 0.0
head(assays(gse41265_gene)[["count_lstpm"]])
## GSM1012777 GSM1012778 GSM1012779 GSM1012780
## ENSMUSG00000000001.4 752.8798 3720.7822146 2032.4459094 1452.948
## ENSMUSG00000000003.15 0.0000 0.0000000 0.0000000 0.000
## ENSMUSG00000000028.14 1872.5288 0.9528578 0.9720902 0.000
## ENSMUSG00000000031.15 0.0000 0.0000000 0.0000000 0.000
## ENSMUSG00000000037.16 290.8873 0.0000000 0.0000000 0.000
## ENSMUSG00000000049.11 0.0000 0.0000000 0.0000000 0.000
## GSM1012781 GSM1012782 GSM1012783 GSM1012784
## ENSMUSG00000000001.4 69.9132476 847.647213 91.01965 1582.7900
## ENSMUSG00000000003.15 0.0000000 0.000000 0.00000 0.0000
## ENSMUSG00000000028.14 0.9709756 2.470152 0.00000 528.0124
## ENSMUSG00000000031.15 0.0000000 0.000000 0.00000 0.0000
## ENSMUSG00000000037.16 0.0000000 0.000000 0.00000 0.0000
## ENSMUSG00000000049.11 0.0000000 0.000000 0.00000 0.0000
## GSM1012785 GSM1012786 GSM1012787 GSM1012788
## ENSMUSG00000000001.4 10.61746 8.385097 1487.373 1497.153135
## ENSMUSG00000000003.15 0.00000 0.000000 0.000 0.000000
## ENSMUSG00000000028.14 1392.76246 0.000000 0.000 2.060275
## ENSMUSG00000000031.15 0.00000 0.000000 0.000 0.000000
## ENSMUSG00000000037.16 0.00000 0.000000 0.000 0.000000
## ENSMUSG00000000049.11 0.00000 0.000000 0.000 0.000000
## GSM1012789 GSM1012790 GSM1012791 GSM1012792
## ENSMUSG00000000001.4 9318.475 5.840962 66.3240941 1559.5448194
## ENSMUSG00000000003.15 0.000 0.000000 0.0000000 0.0000000
## ENSMUSG00000000028.14 0.000 93.474485 0.9610815 0.9654487
## ENSMUSG00000000031.15 0.000 0.000000 0.0000000 0.0000000
## ENSMUSG00000000037.16 0.000 0.000000 0.0000000 10.1246935
## ENSMUSG00000000049.11 0.000 0.000000 0.0000000 15.4503225
## GSM1012793 GSM1012794
## ENSMUSG00000000001.4 1201.013 1407.2935134
## ENSMUSG00000000003.15 0.000 0.0000000
## ENSMUSG00000000028.14 0.000 0.9670719
## ENSMUSG00000000031.15 0.000 0.0000000
## ENSMUSG00000000037.16 0.000 0.0000000
## ENSMUSG00000000049.11 0.000 0.0000000
head(assays(gse41265_gene)[["avetxlength"]])
## GSM1012777 GSM1012778 GSM1012779 GSM1012780
## ENSMUSG00000000001.4 3026.4100 3015.1200 3017.9000 3018.6900
## ENSMUSG00000000003.15 553.6516 553.6516 553.6516 553.6516
## ENSMUSG00000000028.14 1830.7909 1500.1200 1502.9000 1462.8034
## ENSMUSG00000000031.15 1022.8460 1022.8460 1022.8460 1022.8460
## ENSMUSG00000000037.16 282.0180 870.1116 870.1116 870.1116
## ENSMUSG00000000049.11 943.5560 943.5560 943.5560 943.5560
## GSM1012781 GSM1012782 GSM1012783 GSM1012784
## ENSMUSG00000000001.4 3021.2400 3018.8100 3015.9000 3014.2200
## ENSMUSG00000000003.15 553.6516 553.6516 553.6516 553.6516
## ENSMUSG00000000028.14 1506.2400 1177.6154 1462.8034 1837.8923
## ENSMUSG00000000031.15 1022.8460 1022.8460 1022.8460 1022.8460
## ENSMUSG00000000037.16 870.1116 870.1116 870.1116 870.1116
## ENSMUSG00000000049.11 943.5560 943.5560 943.5560 943.5560
## GSM1012785 GSM1012786 GSM1012787 GSM1012788
## ENSMUSG00000000001.4 3013.6100 3005.7400 3017.4300 3022.5500
## ENSMUSG00000000003.15 553.6516 553.6516 553.6516 553.6516
## ENSMUSG00000000028.14 1894.6100 1462.8034 1462.8034 592.5530
## ENSMUSG00000000031.15 1022.8460 1022.8460 1022.8460 1022.8460
## ENSMUSG00000000037.16 870.1116 870.1116 870.1116 870.1116
## ENSMUSG00000000049.11 943.5560 943.5560 943.5560 943.5560
## GSM1012789 GSM1012790 GSM1012791 GSM1012792
## ENSMUSG00000000001.4 3015.2300 3017.5800 3014.5600 3015.5600
## ENSMUSG00000000003.15 553.6516 553.6516 553.6516 553.6516
## ENSMUSG00000000028.14 1462.8034 1893.3784 1499.5600 1500.5600
## ENSMUSG00000000031.15 1022.8460 1022.8460 1022.8460 1022.8460
## ENSMUSG00000000037.16 870.1116 870.1116 870.1116 2684.5600
## ENSMUSG00000000049.11 943.5560 943.5560 943.5560 943.5560
## GSM1012793 GSM1012794
## ENSMUSG00000000001.4 3012.8000 3006.4800
## ENSMUSG00000000003.15 553.6516 553.6516
## ENSMUSG00000000028.14 1462.8034 1491.4800
## ENSMUSG00000000031.15 1022.8460 1022.8460
## ENSMUSG00000000037.16 870.1116 870.1116
## ENSMUSG00000000049.11 943.5560 943.5560
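The avetxlength values can serve as offsets in count models based on the count assay. The sketch below shows one way to turn them into log-scale offsets, loosely following the recipe in the tximport vignette but omitting the additional normalisation factors (such as TMM) used there. The two matrices are made-up stand-ins for the count and avetxlength assays, so this is an illustration under those assumptions rather than a recommended analysis.

```r
# Toy gene-level matrices standing in for the count and avetxlength assays.
count <- matrix(c(100, 50, 200, 80), nrow = 2,
                dimnames = list(c("geneA", "geneB"), c("s1", "s2")))
avetxlength <- matrix(c(1000, 2000, 1500, 2000), nrow = 2,
                      dimnames = dimnames(count))

# Centre the lengths of each gene on their geometric mean, so that only
# between-sample differences in length remain.
norm_len <- avetxlength / exp(rowMeans(log(avetxlength)))

# Library sizes computed after removing the length effect.
lib_size <- colSums(count / norm_len)

# Log-scale offsets: one value per gene and sample.
offsets <- log(sweep(norm_len, 2, lib_size, "*"))
print(offsets)
```

A matrix of this shape could then be supplied as the offset of a count-based model in place of a single per-sample library size term.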
Transcript-level data
To access the transcript abundances, get the transcript experiment instead:
(gse41265_tx <- experiments(gse41265)[["tx"]])
## class: RangedSummarizedExperiment
## dim: 113560 18
## metadata(0):
## assays(3): TPM count efflength
## rownames(113560): ENSMUST00000178537.1 ENSMUST00000178862.1 ...
## ERCC-00170 ERCC-00171
## rowData names(4): transcript gene genome symbol
## colnames(18): GSM1012777 GSM1012778 ... GSM1012793 GSM1012794
## colData names(0):
This object contains three assays, which can be accessed via the assays function:
- TPM: transcripts per million abundance estimates for each transcript.
- count: transcript read counts.
- efflength: effective transcript lengths.
Each of these assays is a matrix with the respective values for each transcript and each sample.
head(assays(gse41265_tx)[["TPM"]])
## GSM1012777 GSM1012778 GSM1012779 GSM1012780
## ENSMUST00000178537.1 0 0 0 0
## ENSMUST00000178862.1 0 0 0 0
## ENSMUST00000177564.1 0 0 0 0
## ENSMUST00000196221.1 0 0 0 0
## ENSMUST00000179664.1 0 0 0 0
## ENSMUST00000179520.1 0 0 0 0
## GSM1012781 GSM1012782 GSM1012783 GSM1012784
## ENSMUST00000178537.1 0 0 0 0
## ENSMUST00000178862.1 0 0 0 0
## ENSMUST00000177564.1 0 0 0 0
## ENSMUST00000196221.1 0 0 0 0
## ENSMUST00000179664.1 0 0 0 0
## ENSMUST00000179520.1 0 0 0 0
## GSM1012785 GSM1012786 GSM1012787 GSM1012788
## ENSMUST00000178537.1 0 0 0 0
## ENSMUST00000178862.1 0 0 0 0
## ENSMUST00000177564.1 0 0 0 0
## ENSMUST00000196221.1 0 0 0 0
## ENSMUST00000179664.1 0 0 0 0
## ENSMUST00000179520.1 0 0 0 0
## GSM1012789 GSM1012790 GSM1012791 GSM1012792
## ENSMUST00000178537.1 0 0 0 0
## ENSMUST00000178862.1 0 0 0 0
## ENSMUST00000177564.1 0 0 0 0
## ENSMUST00000196221.1 0 0 0 0
## ENSMUST00000179664.1 0 0 0 0
## ENSMUST00000179520.1 0 0 0 0
## GSM1012793 GSM1012794
## ENSMUST00000178537.1 0 0
## ENSMUST00000178862.1 0 0
## ENSMUST00000177564.1 0 0
## ENSMUST00000196221.1 0 0
## ENSMUST00000179664.1 0 0
## ENSMUST00000179520.1 0 0
head(assays(gse41265_tx)[["count"]])
## GSM1012777 GSM1012778 GSM1012779 GSM1012780
## ENSMUST00000178537.1 0 0 0 0
## ENSMUST00000178862.1 0 0 0 0
## ENSMUST00000177564.1 0 0 0 0
## ENSMUST00000196221.1 0 0 0 0
## ENSMUST00000179664.1 0 0 0 0
## ENSMUST00000179520.1 0 0 0 0
## GSM1012781 GSM1012782 GSM1012783 GSM1012784
## ENSMUST00000178537.1 0 0 0 0
## ENSMUST00000178862.1 0 0 0 0
## ENSMUST00000177564.1 0 0 0 0
## ENSMUST00000196221.1 0 0 0 0
## ENSMUST00000179664.1 0 0 0 0
## ENSMUST00000179520.1 0 0 0 0
## GSM1012785 GSM1012786 GSM1012787 GSM1012788
## ENSMUST00000178537.1 0 0 0 0
## ENSMUST00000178862.1 0 0 0 0
## ENSMUST00000177564.1 0 0 0 0
## ENSMUST00000196221.1 0 0 0 0
## ENSMUST00000179664.1 0 0 0 0
## ENSMUST00000179520.1 0 0 0 0
## GSM1012789 GSM1012790 GSM1012791 GSM1012792
## ENSMUST00000178537.1 0 0 0 0
## ENSMUST00000178862.1 0 0 0 0
## ENSMUST00000177564.1 0 0 0 0
## ENSMUST00000196221.1 0 0 0 0
## ENSMUST00000179664.1 0 0 0 0
## ENSMUST00000179520.1 0 0 0 0
## GSM1012793 GSM1012794
## ENSMUST00000178537.1 0 0
## ENSMUST00000178862.1 0 0
## ENSMUST00000177564.1 0 0
## ENSMUST00000196221.1 0 0
## ENSMUST00000179664.1 0 0
## ENSMUST00000179520.1 0 0
head(assays(gse41265_tx)[["efflength"]])
## GSM1012777 GSM1012778 GSM1012779 GSM1012780
## ENSMUST00000178537.1 12 12 12 12
## ENSMUST00000178862.1 14 14 14 14
## ENSMUST00000177564.1 16 16 16 16
## ENSMUST00000196221.1 9 9 9 9
## ENSMUST00000179664.1 11 11 11 11
## ENSMUST00000179520.1 11 11 11 11
## GSM1012781 GSM1012782 GSM1012783 GSM1012784
## ENSMUST00000178537.1 12 12 12 12
## ENSMUST00000178862.1 14 14 14 14
## ENSMUST00000177564.1 16 16 16 16
## ENSMUST00000196221.1 9 9 9 9
## ENSMUST00000179664.1 11 11 11 11
## ENSMUST00000179520.1 11 11 11 11
## GSM1012785 GSM1012786 GSM1012787 GSM1012788
## ENSMUST00000178537.1 12 12 12 12
## ENSMUST00000178862.1 14 14 14 14
## ENSMUST00000177564.1 16 16 16 16
## ENSMUST00000196221.1 9 9 9 9
## ENSMUST00000179664.1 11 11 11 11
## ENSMUST00000179520.1 11 11 11 11
## GSM1012789 GSM1012790 GSM1012791 GSM1012792
## ENSMUST00000178537.1 12 12 12 12
## ENSMUST00000178862.1 14 14 14 14
## ENSMUST00000177564.1 16 16 16 16
## ENSMUST00000196221.1 9 9 9 9
## ENSMUST00000179664.1 11 11 11 11
## ENSMUST00000179520.1 11 11 11 11
## GSM1012793 GSM1012794
## ENSMUST00000178537.1 12 12
## ENSMUST00000178862.1 14 14
## ENSMUST00000177564.1 16 16
## ENSMUST00000196221.1 9 9
## ENSMUST00000179664.1 11 11
## ENSMUST00000179520.1 11 11
Sample annotations
The sample annotations, downloaded from GEO, are also available in the object:
pdata <- colData(gse41265)
head(pdata, 2)
## DataFrame with 2 rows and 47 columns
## title geo_accession status
## <factor> <factor> <factor>
## GSM1012777 Single cell S1 GSM1012777 Public on May 19 2013
## GSM1012778 Single cell S2 GSM1012778 Public on May 19 2013
## submission_date last_update_date type channel_count
## <factor> <factor> <factor> <factor>
## GSM1012777 Oct 01 2012 May 19 2013 SRA 1
## GSM1012778 Oct 01 2012 May 19 2013 SRA 1
## source_name_ch1 organism_ch1 characteristics_ch1
## <factor> <factor> <factor>
## GSM1012777 BMDC (4h LPS stim) Mus musculus strain: C57BL/6
## GSM1012778 BMDC (4h LPS stim) Mus musculus strain: C57BL/6
## characteristics_ch1.1
## <factor>
## GSM1012777 cell type: Bone Marrow-derived Dendritic Cell (BMDC)
## GSM1012778 cell type: Bone Marrow-derived Dendritic Cell (BMDC)
## characteristics_ch1.2 characteristics_ch1.3
## <factor> <factor>
## GSM1012777 treatment: LPS-stimulation cell count: 1 cell
## GSM1012778 treatment: LPS-stimulation cell count: 1 cell
## characteristics_ch1.4
## <factor>
## GSM1012777
## GSM1012778
## growth_protocol_ch1
## <factor>
## GSM1012777 Cells were cultured and stimulated with LPS as previously described (Amit et. al 2009)
## GSM1012778 Cells were cultured and stimulated with LPS as previously described (Amit et. al 2009)
## molecule_ch1 extract_protocol_ch1
## <factor> <factor>
## GSM1012777 polyA RNA cDNA synthesis and amplification:
## GSM1012778 polyA RNA cDNA synthesis and amplification:
## extract_protocol_ch1.1
## <factor>
## GSM1012777 We used the SMARTer Ultra Low RNA Kit (Clontech, Mountain View, CA) to prepare amplified cDNA. We added 1 l of 12 M 3' SMART primer (5' AAGCAGTGGTATCAACGCAGAGTACT(30)N-1N (N = A, C, G, or T; N-1 = A, G, or C)), 1 l of H2O, and 2.5 l of Reaction Buffer onto the RNA capture beads. We mixed them well by pipetting, heated the mixture at 72 C for 3 minutes and placed it on ice. First-strand cDNA was synthesized with this RNA primer mix by adding 2 l of 5x first-strand buffer, 0.25 l of 100mM DTT, 1 l of 10 mM dNTPs, 1 l of 12 M SMARTer II A Oligo (5' AAGCAGTGGTATCAACGCAGAGTACXXXXX (X = undisclosed base in the proprietary SMARTer oligo sequence)), 100 U SMARTScribe RT, and 10 U RNase Inhibitor in a total volume of 10 l and incubating at 42 C for 90 minutes followed by 10 minutes at 70 C. We purified the first strand cDNA by adding 25 l of room temperature AMPure XP SPRI beads (Beckman Coulter Genomics, Danvers, MA), mixing well by pipetting, incubating at room temperature for 8 minutes. We removed the supernatant from the beads after a good separation was established. We carried out all of the above steps in a PCR product free clean room. We amplified the cDNA by adding 5 l of 10x Advantage 2 PCR Buffer, 2 l of 10 mM dNTPs, 2 l of 12 M IS PCR primer (5' AAGCAGTGGTATCAACGCAGAGT), 2 l of 50x Advantage 2 Polymerase Mix, and 39 l H2O in a total volume of 50 l. We performed the PCR at 95 C for 1 minute, followed by 21 cycles of 15 seconds at 95 C, 30 seconds at 65 C and 6 minutes at 68 C, followed by another 10 minutes at 72 C for final extension. We purified the amplified cDNA by adding 90 l of AMPure XP SPRI beads and washing with 80% ethanol.
## GSM1012778 We used the SMARTer Ultra Low RNA Kit (Clontech, Mountain View, CA) to prepare amplified cDNA. We added 1 l of 12 M 3' SMART primer (5' AAGCAGTGGTATCAACGCAGAGTACT(30)N-1N (N = A, C, G, or T; N-1 = A, G, or C)), 1 l of H2O, and 2.5 l of Reaction Buffer onto the RNA capture beads. We mixed them well by pipetting, heated the mixture at 72 C for 3 minutes and placed it on ice. First-strand cDNA was synthesized with this RNA primer mix by adding 2 l of 5x first-strand buffer, 0.25 l of 100mM DTT, 1 l of 10 mM dNTPs, 1 l of 12 M SMARTer II A Oligo (5' AAGCAGTGGTATCAACGCAGAGTACXXXXX (X = undisclosed base in the proprietary SMARTer oligo sequence)), 100 U SMARTScribe RT, and 10 U RNase Inhibitor in a total volume of 10 l and incubating at 42 C for 90 minutes followed by 10 minutes at 70 C. We purified the first strand cDNA by adding 25 l of room temperature AMPure XP SPRI beads (Beckman Coulter Genomics, Danvers, MA), mixing well by pipetting, incubating at room temperature for 8 minutes. We removed the supernatant from the beads after a good separation was established. We carried out all of the above steps in a PCR product free clean room. We amplified the cDNA by adding 5 l of 10x Advantage 2 PCR Buffer, 2 l of 10 mM dNTPs, 2 l of 12 M IS PCR primer (5' AAGCAGTGGTATCAACGCAGAGT), 2 l of 50x Advantage 2 Polymerase Mix, and 39 l H2O in a total volume of 50 l. We performed the PCR at 95 C for 1 minute, followed by 21 cycles of 15 seconds at 95 C, 30 seconds at 65 C and 6 minutes at 68 C, followed by another 10 minutes at 72 C for final extension. We purified the amplified cDNA by adding 90 l of AMPure XP SPRI beads and washing with 80% ethanol.
## extract_protocol_ch1.2
## <factor>
## GSM1012777 We created Illumina sequencing libraries from this amplified cDNA using standard protocols.
## GSM1012778 We created Illumina sequencing libraries from this amplified cDNA using standard protocols.
## extract_protocol_ch1.3
## <factor>
## GSM1012777 cDNA shearing and library construction:
## GSM1012778 cDNA shearing and library construction:
## extract_protocol_ch1.4
## <factor>
## GSM1012777 We added the purification buffer (Clontech) to the amplified cDNA to make a total volume of 76 l. We sheared the cDNA in a 100 l tube with 10% Duty Cycle, 5% Intensity and 200 Cycles/Burst for 5 minutes in the frequency sweeping mode (Covaris S2 machine, Woburn, MA). We purified the sheared cDNA with 2.2 volumes AMPure XP SPRI beads.
## GSM1012778 We added the purification buffer (Clontech) to the amplified cDNA to make a total volume of 76 l. We sheared the cDNA in a 100 l tube with 10% Duty Cycle, 5% Intensity and 200 Cycles/Burst for 5 minutes in the frequency sweeping mode (Covaris S2 machine, Woburn, MA). We purified the sheared cDNA with 2.2 volumes AMPure XP SPRI beads.
## extract_protocol_ch1.5
## <factor>
## GSM1012777 We prepared indexed paired-end libraries for Illumina sequencing as described (J. Z. Levin et al., Nature Methods 7, 709 (2010)., with the following modifications. First, we used a different indexing adaptor (containing an 8-base barcode) for each library. Second, we size-selected the ligation product by using two rounds of 0.7 volume of AMPure XP SPRI bead cleanup with the first round starting volume at 100 l. Third, we performed PCR with Phusion High-Fidelity DNA polymerase with GC buffer and 2 M betaine. Fourth, we used 55 C as the annealing temperature in PCR with the universal indexing primers (forward primer 5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGAC, reverse primer 5'-CAAGCAGAAGACGGCATACGAGAT). Fifth, we performed 12 cycles of PCR. Sixth, we removed PCR primers using two rounds of 1.0 volume of AMPure beads.
## GSM1012778 We prepared indexed paired-end libraries for Illumina sequencing as described (J. Z. Levin et al., Nature Methods 7, 709 (2010)., with the following modifications. First, we used a different indexing adaptor (containing an 8-base barcode) for each library. Second, we size-selected the ligation product by using two rounds of 0.7 volume of AMPure XP SPRI bead cleanup with the first round starting volume at 100 l. Third, we performed PCR with Phusion High-Fidelity DNA polymerase with GC buffer and 2 M betaine. Fourth, we used 55 C as the annealing temperature in PCR with the universal indexing primers (forward primer 5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGAC, reverse primer 5'-CAAGCAGAAGACGGCATACGAGAT). Fifth, we performed 12 cycles of PCR. Sixth, we removed PCR primers using two rounds of 1.0 volume of AMPure beads.
## taxid_ch1 description
## <factor> <factor>
## GSM1012777 10090 S1
## GSM1012778 10090 S2
## data_processing
## <factor>
## GSM1012777 We created a Bowtie index based on the UCSC knownGene (8) transcriptome, and aligned paired-end reads directly to this index using Bowtie v 0.12.7 with command line options -q --phred33-quals -n 2 -e 99999999 -l 25 -I 1 -X 1000 -a -m 200.
## GSM1012778 We created a Bowtie index based on the UCSC knownGene (8) transcriptome, and aligned paired-end reads directly to this index using Bowtie v 0.12.7 with command line options -q --phred33-quals -n 2 -e 99999999 -l 25 -I 1 -X 1000 -a -m 200.
## data_processing.1
## <factor>
## GSM1012777 Next, we ran RSEM v1.11 with default parameters on these alignments to estimate expression levels. RSEM’s gene level expression estimates (tau) were multiplied by 1,000,000 to obtain transcript per million (TPM) estimates for each gene.
## GSM1012778 Next, we ran RSEM v1.11 with default parameters on these alignments to estimate expression levels. RSEM’s gene level expression estimates (tau) were multiplied by 1,000,000 to obtain transcript per million (TPM) estimates for each gene.
## data_processing.2
## <factor>
## GSM1012777 Genome_build: mm9
## GSM1012778 Genome_build: mm9
## data_processing.3
## <factor>
## GSM1012777 Supplementary_files_format_and_content: File allGenesTPM.txt represents a matrix of gene expression estimates across all non-MolecularBarcode samples. File umbExp.txt represents a matrix of gene expression estimates across all MolecularBarcode samples. Linked as supplementary files on Series record.
## GSM1012778 Supplementary_files_format_and_content: File allGenesTPM.txt represents a matrix of gene expression estimates across all non-MolecularBarcode samples. File umbExp.txt represents a matrix of gene expression estimates across all MolecularBarcode samples. Linked as supplementary files on Series record.
## platform_id contact_name contact_email contact_phone
## <factor> <factor> <factor> <factor>
## GSM1012777 GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468
## GSM1012778 GPL13112 Rahul,,Satija rsatija@nygenome.org 6177022468
## contact_laboratory contact_institute
## <factor> <factor>
## GSM1012777 Satija Lab New York Genome Center
## GSM1012778 Satija Lab New York Genome Center
## contact_address contact_city contact_state
## <factor> <factor> <factor>
## GSM1012777 101 Avenue of the Americas New York City NY
## GSM1012778 101 Avenue of the Americas New York City NY
## contact_zip.postal_code contact_country data_row_count
## <factor> <factor> <factor>
## GSM1012777 10013 USA 0
## GSM1012778 10013 USA 0
## instrument_model library_selection library_source
## <factor> <factor> <factor>
## GSM1012777 Illumina HiSeq 2000 cDNA transcriptomic
## GSM1012778 Illumina HiSeq 2000 cDNA transcriptomic
## library_strategy
## <factor>
## GSM1012777 RNA-Seq
## GSM1012778 RNA-Seq
## relation
## <factor>
## GSM1012777 SRA: http://www.ncbi.nlm.nih.gov/sra?term=SRX190719
## GSM1012778 SRA: http://www.ncbi.nlm.nih.gov/sra?term=SRX190720
## relation.1
## <factor>
## GSM1012777 BioSample: http://www.ncbi.nlm.nih.gov/biosample/SAMN01737621
## GSM1012778 BioSample: http://www.ncbi.nlm.nih.gov/biosample/SAMN01737622
## supplementary_file_1
## <factor>
## GSM1012777 ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX190/SRX190719
## GSM1012778 ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX190/SRX190720
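The relation column printed above stores the SRA experiment for each sample as a URL rather than as a bare accession. The following is a minimal sketch of how the accession could be pulled out with base R. The data frame `pheno` is a hypothetical stand-in, typed in by hand from the two rows shown above; it is not how conquer itself matches samples to runs.

```r
# Hypothetical stand-in for two rows of the phenotype table shown above.
pheno <- data.frame(
  sample   = c("GSM1012777", "GSM1012778"),
  relation = c("SRA: http://www.ncbi.nlm.nih.gov/sra?term=SRX190719",
               "SRA: http://www.ncbi.nlm.nih.gov/sra?term=SRX190720"),
  stringsAsFactors = TRUE  # the printed columns are factors
)

# Convert the factor to character, then drop everything up to "term=".
pheno$srx <- sub("^.*term=", "", as.character(pheno$relation))

print(pheno[, c("sample", "srx")])
```

Converting to character first avoids relying on how a particular R version handles factors inside `sub()`.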
Metadata
Finally, the metadata of the MultiAssayExperiment object records the reference genome and organism, a per-sample summary of the Salmon mapping and abundance estimation, and the date when the object was generated.
names(metadata(gse41265))
## [1] "genome" "organism" "salmon_summary" "creation_date"
metadata(gse41265)$genome
## [1] "GRCm38.84"
metadata(gse41265)$organism
## [1] "Mus musculus"
head(metadata(gse41265)$salmon_summary)
## sample salmon_version libtype
## 1 GSM1012777 0.6.0 IU
## 2 GSM1012778 0.6.0 IU
## 3 GSM1012779 0.6.0 IU
## 4 GSM1012780 0.6.0 IU
## 5 GSM1012781 0.6.0 IU
## 6 GSM1012782 0.6.0 IU
## index seqBias num_processed
## 1 Mus_musculus.GRCm38.84.cdna.ncrna.ercc92.sidx FALSE 21326048
## 2 Mus_musculus.GRCm38.84.cdna.ncrna.ercc92.sidx FALSE 27434011
## 3 Mus_musculus.GRCm38.84.cdna.ncrna.ercc92.sidx FALSE 31142391
## 4 Mus_musculus.GRCm38.84.cdna.ncrna.ercc92.sidx FALSE 26231852
## 5 Mus_musculus.GRCm38.84.cdna.ncrna.ercc92.sidx FALSE 29977214
## 6 Mus_musculus.GRCm38.84.cdna.ncrna.ercc92.sidx FALSE 24148387
## num_mapped percent_mapped
## 1 15794660 74.063
## 2 18787649 68.483
## 3 24373666 78.265
## 4 20122093 76.709
## 5 23706687 79.082
## 6 19342343 80.098
metadata(gse41265)$creation_date
## [1] "Sun Jul 23 20:00:43 2017"
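As a quick sanity check on these mapping statistics, the sketch below rebuilds a small data frame from the six rows printed above (typed in by hand, not read from the object) and checks that `percent_mapped` agrees with `100 * num_mapped / num_processed`. The 70% cutoff used to flag samples is purely an illustrative assumption, not a threshold applied by conquer.

```r
# Values copied from the first six rows of the salmon_summary output above.
salmon_summary <- data.frame(
  sample = c("GSM1012777", "GSM1012778", "GSM1012779",
             "GSM1012780", "GSM1012781", "GSM1012782"),
  num_processed = c(21326048, 27434011, 31142391,
                    26231852, 29977214, 24148387),
  num_mapped = c(15794660, 18787649, 24373666,
                 20122093, 23706687, 19342343),
  percent_mapped = c(74.063, 68.483, 78.265, 76.709, 79.082, 80.098),
  stringsAsFactors = FALSE
)

# Recompute the mapping rate from the raw counts.
recomputed <- 100 * salmon_summary$num_mapped / salmon_summary$num_processed

# The reported values are rounded to three decimals, so allow a small tolerance.
consistent <- all(abs(recomputed - salmon_summary$percent_mapped) < 1e-3)

# Flag samples below an illustrative (assumed) mapping-rate cutoff.
min_percent_mapped <- 70
low_mapping <- salmon_summary[salmon_summary$percent_mapped < min_percent_mapped, ]

print(consistent)
print(low_mapping$sample)
```

With the object loaded, the same steps could be applied to `metadata(gse41265)$salmon_summary` directly instead of the hand-typed data frame.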