Differential transcript usage from RNA-seq data: isoform pre-filtering improves performance of count-based methods
Charlotte Soneson1,2,*, Katarina L Matthes3,4,*, Malgorzata Nowicka1,2, Charity W Law1,2, Mark D Robinson1,2
1 Institute of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, CH-8057 Zurich, Switzerland
2 SIB Swiss Institute of Bioinformatics, University of Zurich, CH-8057 Zurich, Switzerland
3 Division of Chronic Disease Epidemiology, Epidemiology, Biostatistics and Prevention Institute (EPBI), University of Zurich, Hirschengraben 84, CH-8001 Zurich, Switzerland
4 Cancer Registry Zurich and Zug, University Hospital Zurich, Vogelsangstrasse 10, CH-8091 Zurich, Switzerland
* Equal contribution
Background:
Large-scale sequencing of cDNA (RNA-seq) has been a boon to the quantitative analysis of transcriptomes.
A notable application of significant biomedical relevance is the detection of changes in transcript
usage between experimental conditions. For example, discovery of pathological alternative splicing may
allow the development of new treatments or better management of patients. From an analysis perspective,
there are several ways to approach RNA-seq data to unravel differential transcript usage, such as
annotation-based exon-level counting, differential analysis of the "percent spliced in" measure or
quantitative analysis of assembled transcripts. The goal of this research is to compare and contrast
current state-of-the-art methods, as well as to suggest improvements to commonly used workflows.
Results:
We assess the performance of representative workflows using synthetic data, and explore the effect
of using non-standard counting bin definitions as input to a state-of-the-art inference engine
(DEXSeq). Although the canonical counting provided the best results overall, several non-canonical
approaches were as good as or better in specific aspects, and most counting approaches outperformed
the evaluated event- and assembly-based methods. We show that an incomplete annotation catalog
can have a detrimental effect on the ability to detect differential transcript usage in transcriptomes
with few isoforms per gene, and that isoform-level pre-filtering can considerably improve the
false discovery rate (FDR) control.
Conclusion:
Count-based methods generally perform well in the detection of differential transcript usage.
Controlling the FDR at the imposed threshold is difficult, mainly in complex organisms,
but can be improved by pre-filtering of the annotation catalog.
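
The isoform-level pre-filtering mentioned above can be illustrated with a minimal Python sketch. The abstract does not state the filtering criterion, so the rule used here is an assumption: an isoform is kept only if it contributes at least a given fraction of its gene's total abundance in at least one sample. The function name `prefilter_isoforms`, the nested-dictionary input layout, and the default threshold of 0.05 are all hypothetical choices made for illustration.

```python
from typing import Dict, List

# Hypothetical input layout: gene -> isoform -> per-sample abundance (e.g. TPM).
Abundances = Dict[str, Dict[str, List[float]]]


def prefilter_isoforms(
    abundances: Abundances, min_fraction: float = 0.05
) -> Dict[str, List[str]]:
    """Return, per gene, the isoforms that pass the relative-abundance filter.

    Assumed rule (not taken from the text): keep an isoform if its share of
    the gene's total abundance reaches `min_fraction` in at least one sample.
    """
    kept: Dict[str, List[str]] = {}
    for gene, isoforms in abundances.items():
        if not isoforms:
            kept[gene] = []
            continue
        n_samples = len(next(iter(isoforms.values())))
        # Total abundance of the gene in each sample.
        totals = [
            sum(values[s] for values in isoforms.values())
            for s in range(n_samples)
        ]
        kept[gene] = [
            isoform
            for isoform, values in isoforms.items()
            # Samples where the gene is not expressed cannot support an isoform.
            if any(t > 0 and x / t >= min_fraction for x, t in zip(values, totals))
        ]
    return kept


# Toy data: tx3 never exceeds 1% of geneA's abundance and is filtered out.
example = {
    "geneA": {
        "tx1": [90.0, 80.0],
        "tx2": [9.0, 19.0],
        "tx3": [1.0, 1.0],
    }
}
filtered = prefilter_isoforms(example, min_fraction=0.05)
```

In a workflow of this kind, the retained isoforms would define a reduced annotation catalog from which counting bins are derived before running the inference step. The appropriate threshold depends on the data and is not prescribed here.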