RNA-seq analysis protocol

Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks

Abstract

Recent advances in high-throughput cDNA sequencing (RNA-seq) can reveal new genes and splice variants and quantify expression genome-wide in a single assay. The volume and complexity of data from RNA-seq experiments necessitate scalable, fast and mathematically principled analysis software. TopHat and Cufflinks are free, open-source software tools for gene discovery and comprehensive expression analysis of high-throughput mRNA sequencing (RNA-seq) data. Together, they allow biologists to identify new genes and new splice variants of known ones, as well as compare gene and transcript expression under two or more conditions. This protocol describes in detail how to use TopHat and Cufflinks to perform such analyses. It also covers several accessory tools and utilities that aid in managing data, including CummeRbund, a tool for visualizing RNA-seq analysis results. Although the procedure assumes basic informatics skills, these tools assume little to no background with RNA-seq analysis and are meant for novices and experts alike. The protocol begins with raw sequencing reads and produces a transcriptome assembly, lists of differentially expressed and regulated genes and transcripts, and publication-quality visualizations of analysis results. The protocol's execution time depends on the volume of transcriptome sequencing data and available computing resources but takes less than 1 d of computer time for typical experiments and ∼1 h of hands-on time.
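
The workflow summarized in the abstract (align reads, assemble transcripts per sample, merge assemblies, test for differential expression) can be sketched as a small command builder. This is a minimal sketch, not the protocol's own script: the sample labels, file names, output directory names and thread count are hypothetical, and the function only builds the command lines for `tophat`, `cufflinks`, `cuffmerge` and `cuffdiff` without running them. It assumes each sample label starts with its condition name followed by an underscore.

```python
import shlex
from typing import Dict, List


def tuxedo_commands(samples: Dict[str, List[str]], genome_index: str,
                    genome_fasta: str, annotation_gtf: str,
                    threads: int = 8) -> List[List[str]]:
    """Build TopHat -> Cufflinks -> Cuffmerge -> Cuffdiff command lines.

    `samples` maps a label such as "C1_R1" to its FASTQ files. The text before
    the first underscore is treated as the condition (a naming assumption).
    """
    t = str(threads)
    commands: List[List[str]] = []
    for label, fastqs in samples.items():
        # Spliced alignment of the reads against the genome index.
        commands.append(["tophat", "-p", t, "-G", annotation_gtf,
                         "-o", f"{label}_thout", genome_index, *fastqs])
        # Per-sample transcript assembly from the alignments.
        commands.append(["cufflinks", "-p", t, "-o", f"{label}_clout",
                         f"{label}_thout/accepted_hits.bam"])
    # cuffmerge expects a text file listing one assembly GTF per line;
    # "assemblies.txt" is a hypothetical name the user would create.
    commands.append(["cuffmerge", "-g", annotation_gtf, "-s", genome_fasta,
                     "-p", t, "assemblies.txt"])
    # Group alignment files by condition; replicates are comma-separated.
    by_condition: Dict[str, List[str]] = {}
    for label in samples:
        condition = label.split("_")[0]
        by_condition.setdefault(condition, []).append(
            f"{label}_thout/accepted_hits.bam")
    commands.append(["cuffdiff", "-o", "diff_out", "-b", genome_fasta,
                     "-p", t, "-L", ",".join(by_condition), "-u",
                     "merged_asm/merged.gtf",
                     *[",".join(bams) for bams in by_condition.values()]])
    return commands


# Print the commands for two hypothetical paired-end samples.
example = {"C1_R1": ["C1_R1_1.fq", "C1_R1_2.fq"],
           "C2_R1": ["C2_R1_1.fq", "C2_R1_2.fq"]}
for command in tuxedo_commands(example, "genome", "genome.fa", "genes.gtf"):
    print(shlex.join(command))
```

Building the commands as lists keeps file names with spaces safe if the lists are later passed to `subprocess.run`. The differential expression results written by `cuffdiff` are what a visualization tool such as CummeRbund would then read.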

    Acknowledgements

    We are grateful to D. Hendrickson, M. Cabili and B. Langmead for helpful technical discussions. The TopHat and Cufflinks projects are supported by US National Institutes of Health grants RHG (to S.L.S.) and RHG (to L.P.). C.T. is a Damon Runyon Cancer Foundation Fellow. L.G. is a National Science Foundation Postdoctoral Fellow. A.R. is a National Science Foundation Graduate Research Fellow. J.L.R. is a Damon Runyon-Rachleff, Searle, and Smith Family Scholar, and is supported by Director's New Innovator Awards (1DP2OD). This work was funded in part by the Center of Excellence in Genome Science from the US National Human Genome Research Institute (J.L.R.). J.L.R. is an investigator of the Merkin Foundation for Stem Cell Research at the Broad Institute.

    Author information

    Affiliations

    1. Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA

      Cole Trapnell, Loyal Goff, David R Kelley & John L Rinn

    2. Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, Massachusetts, USA

      Cole Trapnell, Loyal Goff, David R Kelley & John L Rinn

    3. Department of Computer Science, University of California, Berkeley, California, USA

      Adam Roberts, Harold Pimentel & Lior Pachter

    4. Department of Electrical Engineering and Computer Science, Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA

      Loyal Goff

    5. Department of Medicine, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA

      Geo Pertea, Daehwan Kim & Steven L Salzberg

    6. Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland, USA

      Geo Pertea & Steven L Salzberg

    7. Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, USA

      Daehwan Kim

    8. Department of Mathematics, University of California, Berkeley, California, USA

      Lior Pachter

    9. Department of Molecular and Cell Biology, University of California, Berkeley, California, USA

      Lior Pachter

    Contributions

    C.T. is the lead developer for the TopHat and Cufflinks projects. L.G. designed and wrote CummeRbund. D.K., H.P. and G.P. are developers of TopHat. A.R. and G.P. are developers of Cufflinks and its accompanying utilities. C.T. developed the protocol, generated the example experiment and performed the analysis. L.P., S.L.S. and C.T. conceived the TopHat and Cufflinks software projects. C.T., D.R.K. and J.L.R. wrote the manuscript.

    Corresponding author

    Correspondence to Cole Trapnell.

    Ethics declarations

    Competing interests

    The authors declare no competing financial interests.

    About this article

    Cite this article

    Trapnell, C., Roberts, A., Goff, L. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc7, – (). https://doi.org//nprot

    Source: https://www.nature.com/articles/nprot

    An RNA-Seq Protocol for Differential Expression Analysis

    Abstract

    Here we consider RNA-Seq, used to measure global gene expression through RNA fragmentation, capture, sequencing, and subsequent computational analysis. Xenopus, with its large number of RNA-rich, synchronously developing, and accessible embryos, is an excellent model organism for exploiting the power of high-throughput sequencing to understand gene expression during development. Here we present a standard RNA-Seq protocol for performing two-state differential gene expression analysis (between groups of replicates of control and treated embryos) using Illumina sequencing. Samples contain multiple whole embryos, and polyadenylated mRNA is measured under relative normalization. The protocol is divided into two parts: wet-lab processes to prepare samples for sequencing and downstream computational analysis including quality control, quantification of gene expression, and differential expression.
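
To make "two-state differential expression under relative normalization" concrete, the sketch below scales each replicate's counts to counts per million, averages the replicates in each group, and reports a log2 fold change per gene. It is only an illustration under stated assumptions: the gene names, the counts and the pseudocount are hypothetical, and it is not the statistical method of the protocol. A real analysis would use a dedicated package that models replicate variability.

```python
import math
from typing import Dict, List

Counts = Dict[str, int]


def counts_per_million(counts: Counts) -> Dict[str, float]:
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: count * 1e6 / total for gene, count in counts.items()}


def mean_cpm(replicates: List[Counts]) -> Dict[str, float]:
    """Average the normalized expression of each gene across replicates."""
    normalized = [counts_per_million(r) for r in replicates]
    genes = normalized[0].keys()
    return {g: sum(n[g] for n in normalized) / len(normalized) for g in genes}


def log2_fold_changes(control: List[Counts], treated: List[Counts],
                      pseudocount: float = 1.0) -> Dict[str, float]:
    """Return log2(treated / control) per gene on the normalized scale.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    ctrl, trt = mean_cpm(control), mean_cpm(treated)
    return {g: math.log2((trt[g] + pseudocount) / (ctrl[g] + pseudocount))
            for g in ctrl}


# Hypothetical counts: two control and two treated replicates.
control = [{"geneA": 100, "geneB": 900}, {"geneA": 120, "geneB": 880}]
treated = [{"geneA": 400, "geneB": 600}, {"geneA": 380, "geneB": 620}]
for gene, lfc in log2_fold_changes(control, treated).items():
    print(f"{gene}\t{lfc:+.2f}")
```

Because the values are relative, a large increase in one abundant gene lowers the apparent share of every other gene, which is one reason to interpret fold changes from relative normalization with care.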

    Footnotes

    • From the Xenopus collection, edited by Hazel L. Sive.

    Source: http://cshprotocols.cshlp.org/content//6/pdb.protabstract

    Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown

    Abstract

    High-throughput sequencing of mRNA (RNA-seq) has become the standard method for measuring and comparing the levels of gene expression in a wide variety of species and conditions. RNA-seq experiments generate very large, complex data sets that demand fast, accurate and flexible software to reduce the raw read data to comprehensible results. HISAT (hierarchical indexing for spliced alignment of transcripts), StringTie and Ballgown are free, open-source software tools for comprehensive analysis of RNA-seq experiments. Together, they allow scientists to align reads to a genome, assemble transcripts including novel splice variants, compute the abundance of these transcripts in each sample and compare experiments to identify differentially expressed genes and transcripts. This protocol describes all the steps necessary to process a large set of raw sequencing reads and create lists of gene transcripts, expression levels, and differentially expressed genes and transcripts. The protocol's execution time depends on the computing resources, but it typically takes under 45 min of computer time. HISAT, StringTie and Ballgown are available from http://ccb.jhu.edu/software.shtml.
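
The four stages named in the abstract (align, assemble, merge, quantify) can be sketched as a command builder. This is a sketch under assumptions rather than the protocol's own script: it assumes the `hisat2` executable (the HISAT2 release of the aligner), paired-end files named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`, and hypothetical output names such as `stringtie_merged.gtf` and `mergelist.txt`. The commands are built but not executed.

```python
import shlex
from typing import List


def hisat_stringtie_commands(samples: List[str], index_prefix: str,
                             annotation_gtf: str,
                             threads: int = 8) -> List[List[str]]:
    """Build HISAT2, SAMtools and StringTie command lines for each sample."""
    t = str(threads)
    merged = "stringtie_merged.gtf"  # hypothetical output name
    commands: List[List[str]] = []
    for s in samples:
        # Spliced alignment; --dta tailors the output for transcript assemblers.
        commands.append(["hisat2", "-p", t, "--dta", "-x", index_prefix,
                         "-1", f"{s}_1.fastq.gz", "-2", f"{s}_2.fastq.gz",
                         "-S", f"{s}.sam"])
        # Sort the alignments and convert them to BAM.
        commands.append(["samtools", "sort", "-@", t,
                         "-o", f"{s}.bam", f"{s}.sam"])
        # Assemble transcripts for this sample, guided by the annotation.
        commands.append(["stringtie", "-p", t, "-G", annotation_gtf,
                         "-o", f"{s}.gtf", "-l", s, f"{s}.bam"])
    # Merge the per-sample assemblies; mergelist.txt lists one GTF per line.
    commands.append(["stringtie", "--merge", "-p", t, "-G", annotation_gtf,
                     "-o", merged, "mergelist.txt"])
    for s in samples:
        # Re-estimate abundances against the merged transcripts (-e) and
        # write Ballgown input tables (-B) next to the output GTF.
        commands.append(["stringtie", "-e", "-B", "-p", t, "-G", merged,
                         "-o", f"ballgown/{s}/{s}.gtf", f"{s}.bam"])
    return commands


for command in hisat_stringtie_commands(["sample1", "sample2"],
                                        "genome_index", "genes.gtf"):
    print(shlex.join(command))
```

The tables written under `ballgown/` would then be loaded in R with the Ballgown package for the differential expression step, which is outside the scope of this sketch.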

    Acknowledgements

    This work was supported in part by the National Institutes of Health under grants RHG (to S.L.S.), RGM (to S.L.S.) and RGM (to J.T.L.), and the National Science Foundation under grant DBI (to M.P.).

    Author information

    Affiliations

    1. Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland, USA

      Mihaela Pertea, Daehwan Kim, Geo M Pertea & Steven L Salzberg

    2. Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland, USA

      Mihaela Pertea & Steven L Salzberg

    3. Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland, USA

      Jeffrey T Leek & Steven L Salzberg

    4. Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, USA

      Steven L Salzberg

    Contributions

    M.P. led the development of the protocol, with help from all the authors. D.K. is the main developer of HISAT, M.P. led the development of StringTie and J.T.L. is the senior author of Ballgown. G.M.P. developed gffcompare and contributed to StringTie. All authors contributed to the writing of the manuscript. S.L.S. supervised the entire project.

    Corresponding author

    Correspondence to Steven L Salzberg.

    Ethics declarations

    Competing interests

    The authors declare no competing financial interests.

    About this article

    Cite this article

    Pertea, M., Kim, D., Pertea, G. et al. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protoc11, – (). https://doi.org//nprot

    Source: https://www.nature.com/articles/nprot

    RNA-Seq

    Overview

    Among the different methods for evaluating gene expression, high-throughput sequencing of RNA, or RNA-seq, is particularly attractive because it can be performed and analyzed without relying on previously available genomic information. During RNA-seq, RNA isolated from samples of interest is used to generate a DNA library, which is then amplified and sequenced. Ultimately, RNA-seq can determine which genes are expressed, the levels of their expression, and the presence of any previously unknown transcripts.

    Here, JoVE presents the basic principles behind RNA-seq. We then discuss the experimental and analytical steps of a general RNA-seq protocol. Finally, we examine how researchers are currently using RNA-seq, for example, to compare gene expression between different biological samples, or to characterize protein-RNA interactions.

    Procedure

    RNA sequencing, or RNA-seq, is a technique that can provide information on the sequence and quantity of every RNA expressed, known as the "transcriptome," in a cell population. Unlike other expression profiling methods such as microarrays, which involve probing for known RNA sequences, RNA-seq can profile gene expression from organisms with un-sequenced genomes. Additionally, RNA-seq can accurately measure a larger range of transcript expression levels than microarrays, especially at very low or very high levels.

    This video will cover the principles of RNA-seq, a protocol for preparing an RNA-seq library and analyzing the data, and some applications of this technique.

    First, let's review some principles behind RNA-seq. Transcriptome sequencing requires isolating a population of transcripts whose levels are to be measured. Most RNA in cells is ribosomal RNA, or rRNA, the central component of the cell's protein-production machinery. To facilitate recovery of other types of transcripts, rRNA is typically removed prior to sequencing by hybridizing the sample to complementary oligonucleotides attached to magnetic beads, and using a magnet to separate the rRNA from the rest of the sample.

    Alternatively, a specific population of RNA can be selected for sequencing. For example, protein-coding messenger RNAs, or mRNAs, can be captured with "oligo-dT," short stretches of deoxy-T nucleotides that bind to the sequence of A bases known as a poly-A tail at the end of these transcripts. The contaminating rRNA is then removed. MicroRNAs, which are short regulatory RNAs, can be selectively isolated for sequencing based on their size. Because RNA is inherently prone to degradation, it is first reverse transcribed to double-stranded DNA.

    Oligonucleotide sequences known as adaptors are then ligated onto the DNA fragments. The adaptors contain constant regions that serve as primer-binding sites for subsequent PCR amplification, and these are usually asymmetric so that the “strandedness” of the template is preserved. The adaptors also contain unique sequences, known as “barcodes,” that identify all fragments originating from a single sample. The library is then amplified by PCR.
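
    The role of barcodes can be illustrated with a small sketch. It assumes, purely for simplicity, that the barcode occupies the first few bases of each read; on real instruments the index is often read separately. The barcode sequences, sample names, and reads below are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_len):
    """Group reads by sample using the barcode at the start of each read."""
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        # Reads whose barcode is not recognised are kept apart.
        sample = barcode_to_sample.get(barcode, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)


# Hypothetical 4-base barcodes and toy reads, for illustration only.
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = ["ACGTGGGCCC", "TGCAAATTTG", "ACGTTTTAAA", "NNNNCCCCGG"]

groups = demultiplex(reads, barcodes, barcode_len=4)
for sample, inserts in sorted(groups.items()):
    print(sample, inserts)
```

    Exact matching is the simplest possible rule; production demultiplexers usually tolerate a small number of mismatches in the barcode.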

    A sequencing chip, on which there are oligonucleotides complementary to the adaptors, is used to immobilize the library sample, which is diluted such that the DNA molecules anneal onto the chip at low density. The DNA is amplified on the chip via a process called “bridge amplification” to form “clonal clusters.” Short fragments of a defined length are then synthesized from one or both ends of these DNA templates, generating hundreds of millions of products known as sequencing reads.

    The sequencing results are then analyzed for quality and the data are processed. Analysis of the sequences can reveal a wide variety of information, including differences in expression levels of RNAs between samples and previously unknown transcripts or forms of transcripts.

    Now that we’ve seen how RNA-seq works, let’s go through a protocol for preparing an RNA-seq library and analyzing the sequence data. RNA is first obtained from the sample of interest, and its quality is checked by electrophoresis, for example by using a microfluidics device called a bioanalyzer. The RNA must be of high quality for accurate sequencing results. To ensure the absence of DNA contamination, RT-PCR for an expressed gene is conducted with or without reverse transcriptase. There should be no products in the absence of reverse transcriptase.

    To select poly-A RNA, the samples are bound to oligo-dT probes attached to magnetic beads. The selected RNA is fragmented into shorter pieces at high temperature in the presence of magnesium ions, reducing length-dependent biases in subsequent reactions and analyses. The fragments are then converted to double-stranded DNA, and adaptors are ligated. The library is amplified by PCR, and its quality is checked on a bioanalyzer and by performing qPCR. The bioanalyzer results should reveal a peak of products at the size expected based on the average fragment size and length of the adaptors.
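
    The expected bioanalyzer peak is simple arithmetic, sketched below under the assumption that one adaptor is ligated to each end of every fragment. The fragment size, adaptor length, and tolerance are hypothetical values, not figures from this protocol.

```python
def expected_library_peak_bp(mean_fragment_bp, adaptor_bp):
    """Expected peak size, assuming one adaptor on each end of the fragment."""
    return mean_fragment_bp + 2 * adaptor_bp


def peak_is_plausible(observed_bp, expected_bp, tolerance_bp=30):
    """True if the observed peak lies within a tolerance of the expected size."""
    return abs(observed_bp - expected_bp) <= tolerance_bp


# Hypothetical numbers: 200 bp fragments and 60 bp adaptors.
expected = expected_library_peak_bp(200, 60)
print(expected)                          # 320
print(peak_is_plausible(335, expected))  # True
print(peak_is_plausible(130, expected))  # False
```

    A peak far below the expected size would be worth investigating before sequencing, since it may indicate products that contain little or no insert.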

    Libraries from different samples, containing different barcoded adaptors, can be mixed together, along with a sample of reference DNA added at low concentration as a quality control for subsequent steps of the process, such as clonal cluster generation and the sequencing reactions. The mixture is added to a sequencing chip and loaded into the machine.

    During the sequencing reaction the density of DNA clusters is monitored: it must not be too high, which can lead to cross-contamination, or too low, which can lead to insufficient data. The quality of the sequencing is given by the Q score, which indicates the likelihood of an incorrect base being identified. The Q scores for most bases should be greater than 30, which corresponds to a chance of less than 1 in 1,000 of an incorrect base call. Recovery of the reference DNA sequences at the expected rate indicates that all library sequences are evenly represented.
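
    The relationship between a Q score and the error probability follows the Phred definition, Q = -10 · log10(P). The sketch below converts between the two and summarizes the fraction of bases at or above Q30 in a quality string. It assumes the common FASTQ convention of ASCII offset 33; the quality string is a toy example.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score corresponding to error probability p."""
    return -10 * math.log10(p)


def decode_fastq_quality(quality_string, offset=33):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def fraction_at_least(scores, threshold=30):
    """Fraction of bases whose score is at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


print(phred_to_error_prob(30))           # 0.001, i.e. 1 error in 1,000 calls
scores = decode_fastq_quality("IIII?5+I")  # toy quality string
print(scores)
print(fraction_at_least(scores, 30))
```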

    Reads generated by sequencing are then overlapped with each other to deduce the RNA that was sequenced. For organisms with genome information available, reads can be aligned to the reference genome. The number of reads per transcript is counted to measure the abundance of each RNA.
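
    Counting reads per transcript can be sketched as an interval-overlap problem. The example below is a simplified illustration rather than the method of any particular tool: it works at the gene level, uses half-open coordinates, and discards reads that overlap no gene or more than one gene. The gene names, coordinates, and alignments are hypothetical.

```python
from collections import Counter


def count_reads_per_gene(alignments, genes):
    """Count aligned reads that overlap exactly one annotated gene.

    alignments: iterable of (chrom, start, end), half-open coordinates.
    genes: dict mapping gene name -> (chrom, start, end), half-open.
    """
    counts = Counter({name: 0 for name in genes})
    for chrom, start, end in alignments:
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if chrom == g_chrom and start < g_end and end > g_start
        ]
        if len(hits) == 1:  # skip unassigned and ambiguous reads
            counts[hits[0]] += 1
    return dict(counts)


# Hypothetical annotation and alignments, for illustration only.
genes = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 450, 900),
    "geneC": ("chr2", 0, 300),
}
alignments = [
    ("chr1", 120, 170),  # geneA
    ("chr1", 300, 350),  # geneA
    ("chr1", 460, 510),  # overlaps geneA and geneB: ambiguous, skipped
    ("chr1", 600, 650),  # geneB
    ("chr2", 10, 60),    # geneC
    ("chr3", 5, 55),     # no annotated gene: skipped
]

gene_counts = count_reads_per_gene(alignments, genes)
print(gene_counts)
```

    The linear scan over all genes keeps the sketch short; for real annotations an interval index would be needed for acceptable speed.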

    After seeing how RNA-seq works, let’s look at some ways it’s being used.

    Transcriptome sequencing can identify genes that are differentially expressed under different conditions. For example, in this experiment, transcriptomes of mosquito larvae produced under different growth conditions were compared. Even though this particular species of disease-carrying mosquito does not have a sequenced genome, researchers were able to compare the obtained transcriptome information to other sequenced species, and identify genes with increased or decreased expression levels.

    RNA-seq can also be used in “massively parallel reporter assays” to study gene regulatory mechanisms. This is done by transfecting mammalian cells with a library of thousands of plasmids, each containing a mutated variant of a gene regulatory site “driving” the transcription of a reporter sequence that is coupled to unique tags. Following RNA isolation and high-throughput sequencing, the levels of each tag are assessed to evaluate reporter expression from each construct, which gives insight into the functional importance of the nucleotides mutated in each regulatory site variant.

    Finally, RNA sequencing can be adapted to study RNA-protein interactions, particularly to identify transcripts that a protein of interest binds to. The protein is immunoprecipitated with antibodies and the bound RNAs are defined by sequencing. If the RNA-protein complexes are crosslinked at the beginning, sequencing analysis can map the site of the crosslink and identify the protein-binding site on the RNA down to the nucleotide level.

    You’ve just watched JoVE’s video on RNA-seq. In this video, we’ve seen how RNA samples are converted into libraries, sequenced, and the resulting data analyzed, as well as the types of information that sequencing analysis can provide. Thanks to its sensitivity, potential to be used in any organism, and the lowered cost of sequencing, RNA-seq is increasingly being used in multiple areas of genetics research, and will provide insight into many questions surrounding cell function and development. Thanks for watching!

    Disclosures

    No conflicts of interest declared.

    Source: https://www.jove.com/v//rna-seq

    RNA Sequencing and Analysis

    Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC/
    Introduction to RNA-seq data analysis

    RNA-Seq: Basics, Applications and Protocol

    Figure 2: A workflow for RNA-seq

    An RNA-seq protocol

    Experiment planning 


    Preparation prior to starting your RNA-seq experiment is essential. Questions to answer before starting include [11]:

    • What method of RNA purification are you using?
    • What read depth will you need?
    • Which platform will you use?
    • Is there a reference genome available and which will you use?
    • How are you assessing the quality of your RNA?
    • Do you need to enrich your target RNA?
    • Will you barcode your RNA?
    • Do you have enough biological and technical replicates?
    • Will you use single-end or paired-end sequencing?
    • What read length will you use?
    • Do you want to retain strand-specific information?

    cDNA library preparation 


    After these points have been considered, you can start preparing your cDNA library. This requires fragmentation of the cDNA, addition of the platform-specific “adapter sequences” and amplification of the cDNA, but the exact procedure depends on the platform used. For strand-specific protocols, the amplification of the cDNA involves a reverse transcriptase-mediated first strand synthesis followed by a DNA polymerase-mediated second strand synthesis [11,12]. Barcodes may also be added to enable multiplexing, so that numerous samples can be sequenced in a single run. It is worth quantifying your library at the end of the preparation stage, both to confirm that the protocol has been successful and to check the quality and concentration of the library for optimal sequencing performance.

    cDNA sequencing


    Once the library is prepared, you can use your chosen sequencing platform to sequence your cDNA library to your desired depth and requirements. Once your transcript data has been produced, you can map the data to your reference genome or assemble it de novo if no reference is available. The alignment process can be complicated by the presence of splice variants and modifications, and the choice of reference genome also affects how difficult this stage is. Software packages such as STAR are useful at this stage, as are quality control tools like Picard or Qualimap [13]. De novo assembly allows for the discovery of novel transcripts in addition to those already known.
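
    As a rough illustration of the mapping step, the sketch below assembles a STAR alignment command line without running it. It assumes that a STAR genome index has already been generated and that reads are paired-end; all paths and file names are hypothetical, and the options should be checked against the manual for the installed STAR version.

```python
import shlex


def star_align_command(genome_dir, fastq_files, out_prefix, threads=4):
    """Build (but do not run) a STAR command line for aligning reads."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,            # directory of a pre-built index
        "--readFilesIn", *fastq_files,        # one file, or two for paired-end
        "--outFileNamePrefix", out_prefix,
        "--outSAMtype", "BAM", "SortedByCoordinate",
    ]


# Hypothetical paths, for illustration only.
cmd = star_align_command(
    "star_index/",
    ["sample_R1.fastq", "sample_R2.fastq"],
    "aligned/sample_",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

    Keeping the command as a list of arguments makes it straightforward to pass to `subprocess.run` once the paths point at real files.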

    RNA-seq data analysis


    After the alignment stage, you can focus on analyzing your data. Tools like Sailfish, RSEM and BitSeq [13] will help you quantify your transcription levels, whilst tools like MISO, which quantifies alternatively spliced genes, are available for more specialized analysis [14]. Many such tools are available, and reading reviews and roundups is the best way to find the right one for your research.
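
    Quantification tools commonly report abundance in transcripts per million (TPM). The sketch below shows only that normalization, assuming read counts have already been assigned to transcripts; the tools named above use statistical models to handle reads that are compatible with several transcripts, which this example does not attempt. Transcript names, counts, and lengths are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Convert per-transcript read counts to transcripts per million.

    counts: dict transcript -> number of assigned reads.
    lengths_bp: dict transcript -> transcript length in bases.
    """
    # Reads per kilobase corrects for longer transcripts yielding more reads.
    rates = {t: counts[t] / (lengths_bp[t] / 1000) for t in counts}
    total = sum(rates.values())
    # Scaling by the total makes values comparable across samples.
    return {t: rate / total * 1e6 for t, rate in rates.items()}


# Hypothetical counts and lengths, for illustration only.
counts = {"tx1": 100, "tx2": 200, "tx3": 300}
lengths = {"tx1": 1000, "tx2": 2000, "tx3": 1000}

abundances = tpm(counts, lengths)
for name, value in abundances.items():
    print(name, round(value))
```

    Here tx1 and tx2 receive the same TPM despite different raw counts, because tx2 is twice as long.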


    To sum up, modern-day RNA-seq is well established as the superior option to microarrays and will likely remain the preferred option for the time being.  

    Challenges of RNA-seq


    Significant progress has been made in the field of RNA-seq over the last decade or so. The associated costs have reduced significantly while throughput has increased, sequence fidelity is far superior to earlier iterations of the NGS technologies and the availability of data analysis tools and pipelines has improved tremendously. However, there remain a number of challenges for scientists to bear in mind when considering RNA-seq experiments. These include:


    Isolating sufficient, high-quality RNA – while the sample quantity requirements for RNA-seq analysis have reduced drastically, it is still important to ensure you are able to obtain sufficient RNA to fulfill all your analysis requirements, including repeats if necessary. It is also important to bear in mind that, while you may isolate total RNA, depending upon your experimental question, you are likely to sequence only a fraction of this (typically messenger RNA (mRNA)), further reducing your sample quantity. The RNA must also be of high quality and purity, as poor samples are likely to lead to poor results or, in some cases, failure of the library preparation protocol. The quality and concentration of RNA can be determined using UV-visible spectroscopy. Unlike DNA, RNA degrades rapidly, so it is important to treat samples with care at all stages of isolation and purification. Degradation may not be uniform, hindering the comparison of transcription levels between genes. Low-level transcripts may be lost from the sequenced population altogether.
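
    The UV-visible measurement can be turned into numbers with two standard conventions: an absorbance of 1.0 at 260 nm corresponds to roughly 40 µg/mL of RNA, and an A260/A280 ratio of about 2.0 is generally taken to indicate pure RNA. The sketch below applies both; the absorbance readings and dilution factor are hypothetical.

```python
RNA_UG_PER_ML_PER_A260 = 40.0  # conventional conversion factor for RNA


def rna_concentration_ug_per_ml(a260, dilution_factor=1.0):
    """Estimate RNA concentration from absorbance at 260 nm."""
    return a260 * RNA_UG_PER_ML_PER_A260 * dilution_factor


def purity_ratio(a260, a280):
    """A260/A280 ratio; values near 2.0 are typical of pure RNA."""
    return a260 / a280


# Hypothetical readings from a sample diluted 10-fold before measurement.
concentration = rna_concentration_ug_per_ml(0.25, dilution_factor=10)
ratio = purity_ratio(0.25, 0.125)
print(concentration)  # 100.0 (µg/mL)
print(ratio)          # 2.0
```

    Absorbance reports on quantity and purity only; it does not reveal whether the RNA is intact, which is why electrophoresis-based checks are used alongside it.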


    The impact of sample pooling – pooling samples prior to library preparation (without the use of barcoding) can reduce sequencing effort and costs or enable sequencing in cases where sample quantities are very limited. However, it is important to account for this during data analysis: each pool counts as one biological replicate, regardless of how many samples went into it. Variations between the pooled samples can lead to misleading results and statistical issues, so the possible implications should be considered during the experimental design process.


    Trading-off sequencing depth against sample number – It may seem appealing to get as many samples done in a single sequencing run as possible to reduce costs and machine time. However, this comes at a cost. The more samples are multiplexed, the fewer reads will be obtained for each of those samples. With reducing read depth comes mounting uncertainty as to the reliability of the sequences obtained. Sequencing technologies are still far from perfect, and mistakes are made in reads. It is therefore important to find the sweet spot between obtaining sufficient read depth to give confidence in the quality and fidelity of the sequencing data obtained and maximizing sequencing capacity to ensure sufficient biological replicates can be analyzed to give meaningful data.
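
    The trade-off can be made concrete with a little arithmetic. The sketch below assumes that reads are divided evenly among multiplexed samples, which real runs only approximate; the run yield and depth target are hypothetical numbers, not recommendations.

```python
def reads_per_sample(total_reads, n_samples):
    """Reads available to each sample if the run is shared evenly."""
    return total_reads // n_samples


def max_samples(total_reads, min_reads_per_sample):
    """Largest number of samples that still meets the depth target."""
    return total_reads // min_reads_per_sample


# Hypothetical run yield and depth target, for illustration only.
run_yield = 400_000_000
target_depth = 25_000_000

print(max_samples(run_yield, target_depth))  # 16
print(reads_per_sample(run_yield, 24))       # 16666666, below the target
```

    Multiplexing 24 samples on this hypothetical run would leave each with about two-thirds of the target depth, so the choice is between fewer samples per run and shallower coverage per sample.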



    Source: https://www.technologynetworks.com/genomics/articles/rna-seq-basics-applications-and-protocol


    Systematic evaluation of RNA-Seq preparation protocol performance

    • Methodology article
    • Open Access
    • Published:
    • Hsueh-Ping Chao1,2,
    • Yueping Chen1,
    • Yoko Takata1,
    • Mary W. Tomida1,
    • Kevin Lin1,
    • Jason S. Kirk3,
    • Melissa S. Simper1,
    • Carol D. Mikulec1,
    • Joyce E. Rundhaug1,
    • Susan M. Fischer1,
    • Taiping Chen1,2,
    • Dean G. Tang1,2,3,
    • Yue Lu1 &
    • Jianjun Shen1,2

    BMC Genomics volume 20, Article number:  ()


    Abstract

    Background

    RNA-Seq is currently the most widely used tool to analyze whole-transcriptome profiles. There are numerous commercial kits available to facilitate preparing RNA-Seq libraries; however, it is still not clear how some of these kits perform in terms of: 1) ribosomal RNA removal; 2) read coverage or recovery of exonic vs. intronic sequences; 3) identification of differentially expressed genes (DEGs); and 4) detection of long non-coding RNA (lncRNA). In RNA-Seq analysis, understanding the strengths and limitations of commonly used RNA-Seq library preparation protocols is important, as this technology remains costly and time-consuming.

    Results

    In this study, we present a comprehensive evaluation of four RNA-Seq kits. We used three standard input protocols: Illumina TruSeq Stranded Total RNA and mRNA kits, a modified NuGEN Ovation v2 kit, and the TaKaRa SMARTer Ultra Low RNA Kit v3. Our evaluation of these kits included quality control measures such as overall reproducibility, 5′ and 3′ end-bias, and the identification of DEGs, lncRNAs, and alternatively spliced transcripts. Overall, we found that the two Illumina kits were most similar in terms of recovering DEGs, and the Illumina, modified NuGEN, and TaKaRa kits allowed identification of a similar set of DEGs. However, we also discovered that the Illumina, NuGEN and TaKaRa kits each enriched for different sets of genes.

    Conclusions

    At the manufacturers’ recommended input RNA levels, all the RNA-Seq library preparation protocols evaluated were suitable for distinguishing between experimental groups, and the TruSeq Stranded mRNA kit was universally applicable to studies focusing on protein-coding gene profiles. The TruSeq protocols tended to capture genes with higher expression and GC content, whereas the modified NuGEN protocol tended to capture longer genes. The SMARTer Ultra Low RNA Kit may be a good choice at the low RNA input level, although it was inferior to the TruSeq mRNA kit at standard input level in terms of rRNA removal, exonic mapping rates and recovered DEGs. Therefore, the choice of RNA-Seq library preparation kit can profoundly affect data outcomes. Consequently, it is a pivotal parameter to consider when designing an RNA-Seq experiment.

    Background

    Omics technology, driven by next-generation sequencing (NGS) coupled with new and increasingly robust bioinformatics pipelines, has triggered exponential growth in the accumulation of large biological datasets. The first NGS study, published in [1], reported the highly accurate sequencing of 25 million DNA bases in less than a day, representing a vast improvement in cost and throughput over traditional Sanger sequencing methods. Shortly thereafter, NGS technology was applied to RNA sequencing (RNA-Seq) [2,3,4,5], and since then, the sensitivity, accuracy, reproducibility, and flexibility of RNA-Seq have made it the gold standard in transcriptomic research. Over the last ten years, approximately 53, RNA-Seq datasets have been deposited in the Gene Expression Omnibus (GEO) database [6]. These RNA-Seq datasets provide information about the whole transcriptome, including gene fusions, differential expression of coding and non-coding genes, and splice variants in different experimental conditions. Increasing evidence confirms that changes in the transcriptome are a result of biological alterations, making RNA-Seq a driving force behind the exploration of global regulatory networks in cells, tissues, organisms, and diseases.

    RNA-Seq is used primarily to identify differentially expressed genes (DEGs) in different biological conditions, but it is also used to discover non-coding RNAs such as microRNAs and long non-coding RNAs (lncRNAs) [7]. RNA-Seq studies have already shown that differences in RNA preparation and enrichment during library preparation can cause fundamental variations in experimental outcomes. Hence, comprehensive evaluation of RNA-Seq library preparation methods by using different kits has provided a baseline from which to compare their overall capabilities and to guide future research applications. Several earlier studies have already identified potential confounding factors affecting RNA-Seq performance and analysis [8,9,10,11,12,13,14,15]. These include two large-scale projects: the Sequencing Quality Control project of the SEQC/MAQC-III (MicroArray Quality Control) Consortium, led by the US Food and Drug Administration [8], and the Association of Biomolecular Resource Facilities (ABRF) next-generation sequencing (NGS) study [9]. Other studies include an evaluation of three Illumina RNA-Seq protocols for degraded and low-quantity samples [10], a study of gene quantification in clinical samples using the Illumina TruSeq Stranded Total RNA and mRNA RNA-Seq protocols [11], and additional investigations focused on low-input or single-cell sequencing [12,13,14,15].

    The SEQC project evaluated the sensitivity, specificity, reproducibility, and complexity of gene expression, DEGs, and splice junction detection from RNA-Seq performed at multiple sites, using the same commercial reference library and External RNA Controls Consortium (ERCC) RNA spike-in controls as well as experimental samples, but using different sequencing platforms and bioinformatics pipelines [8]. Overall, the SEQC project found that RNA-Seq data generated from vendor-prepared libraries were stable across sites but variable across protocols, implying that data variability likely originated from differences in library preparation and/or sequencing platforms. Parameters affecting library preparation include fragmentation time, ribosomal RNA (rRNA) depletion methods, cDNA synthesis procedures, library purification methods, ligation efficiency, and RNA quality. This study [8] also illustrated that for the most highly expressed genes, DEGs were consistently identified across sites and platforms and that de novo splice junction discovery was robust but sensitive to sequencing depth.

    The ABRF-NGS study evaluated not only the sensitivity, specificity, reproducibility, and complexity of gene expression, but also differential gene expression and splice junction detection among different combinations of sequencing platforms and library preparation methods, taking into account size-specific fractionation and RNA integrity [9]. In general, the results across platforms and library preparation methods were highly correlated, but greater read depth was necessary to recover rare transcripts and splice site junctions present at low frequency, especially those resulting from putative novel and complex splicing events. Library preparation influenced the detection of non-polyA tail transcripts, 3′ UTRs, and introns, primarily due to inherent differences between rRNA reduction methods, i.e., rRNA depletion and polyA enrichment, with the former method capturing more structural and non-coding RNAs, and the latter method capturing more full-length mRNAs [9]. More importantly, although gene quantification was robust, transcriptome coverage was sensitive to the pipelines applied during the analyses; however, surrogate variable analysis proved useful in making direct comparisons across platforms.

    Schuierer S. et al. [10] evaluated three Illumina library preparation kits, representing polyA selection, ribosomal RNA depletion and exon capture methods, respectively, on RNA-Seq samples spanning a wide range of input quantity and quality. They found that the ribosomal RNA depletion method generally performed well, whereas the exon capture method performed best for highly degraded RNA samples. Zhao S. et al. [11] evaluated polyA selection vs. rRNA depletion using clinical samples and recommended the former over the latter in most cases where the interest is protein-coding gene quantification.

    More recently, increasing interest in investigating rare cell populations and detailed biological mechanisms has led to a demand for protocols generating high quality libraries from nanogram quantities of total RNA [12, 13] and even single cells [14, 15]. Dissecting the characteristics of RNA-Seq protocols designed to obtain data from low-input or degraded samples will benefit studies involving both rare cell populations and fixed clinical samples. For low-quantity RNA analysis, it has been established that the NuGEN protocol yields data with better transcriptome complexity but has less effective rRNA depletion, while the SMARTer Ultra Low RNA Kit has better performance on transcriptome annotation but demonstrates bias with respect to underrepresenting transcripts with high GC content [12]. cDNA amplification can help compensate for extremely small amounts of starting materials in low quantity RNA-Seq, but amplification itself may introduce problems, such as duplication, that affect library performance [12]. ABRF evaluated several low-input RNA amplification kits and identified certain underlying differences, such as two distinct categories of genes recovered in the libraries prepared with two distinct rRNA-reduction techniques, polyA enrichment and rRNA-depletion [13]. The sensitivity of gene detection and accuracy of gene expression level assessments were consistent across approaches but divergent across RNA input amounts. The SMARTer protocol provided a near perfect correlation between obtained values and the actual amount of ERCC standard included as a spike-in control [13]. Although this prior study provides insight into the effects of RNA amplification, it employed an artificial system using commercial RNA from TaKaRa mixed with the ERCC control RNAs, which likely oversimplifies the transcriptome complexity of real cells, thus necessitating similar work in whole-cell systems.

    The source of data variation among different library preparation methods remains unclear. Therefore, in the present study, we carefully compared the results we obtained from several commercial RNA-Seq library preparation kits with different rRNA depletion and cDNA synthesis methods to understand the strengths of each protocol. The first goal of our study was to investigate confounding factors in RNA-Seq library preparation protocols using three standard input kits: the TruSeq Stranded Total RNA and mRNA Library Prep Kits from Illumina, and a modified NuGEN Ovation® RNA-Seq System. Defining the properties of the data generated using these protocols may aid users in designing their future RNA-Seq strategies. The second goal was to thoroughly evaluate the SMARTer Ultra Low RNA Kit using mouse embryonic stem cells (mESCs). Our results demonstrated that the TruSeq Stranded mRNA protocol was the best for transcriptome profiling and that the TruSeq Stranded Total RNA and mRNA protocols were comparable, whereas the modified NuGEN protocol performed less well for whole transcriptome analysis, but might be a better choice for studies focused on non-coding RNAs. Lastly, although the results obtained with the SMARTer Ultra Low RNA Kit were comparable to those of the TruSeq Stranded mRNA kit for most metrics and for identification of DEGs, the absolute expression levels were only moderately correlated. We conclude that each RNA-Seq protocol has its own strengths for particular applications, which need to be considered for a successful RNA-Seq experiment.

    Results

    Experimental design and RNA-Seq data quality metrics

    Figure 1 outlines the experimental design we used for testing the three standard input protocols (Illumina TruSeq Stranded Total RNA, Illumina TruSeq Stranded mRNA, and modified NuGEN Ovation v2) (Fig. 1a), the ultra-low input protocol (TaKaRa SMARTer Ultra Low RNA Kit) (Fig. 1b), the data analysis flow, and data quality evaluation metrics (Fig. 1c). The RNA-Seq datasets used in the current study were generated during two research-based projects. The first study assessed six xenograft tumors, three from the control group (biological replicates) and three from the experimental group (biological replicates) to test all three standard input protocols (Fig. 1a). Because one of the xenograft tumors from the control group was used up, a different tumor (from a different mouse) had to be used for the libraries prepared with the TruSeq Total RNA protocol ( ng) and the TruSeq mRNA protocol ( ng). The second study assessed three mESC cell lines (biological replicates) from Zbtb24 knockout (1lox/1lox) clones compared with three wild-type (2lox/+) clones (biological replicates) using the TaKaRa SMARTer Ultra Low RNA protocol directly on cells with no RNA preparation step. When RNA was isolated, all total RNA samples had RNA integrity (RIN) numbers > 

    Experimental design and RNA-Seq data quality metrics. a Flow chart outlining the experimental design for comparing the three standard input RNA-Seq library preparation protocols. Six xenograft tumors, 3 from the control group and 3 from the experimental group, were used for all three protocols. Similar amounts of tumor tissue from control and experimental groups were used to isolate total RNA. Separate Illumina Stranded Total RNA and mRNA libraries were prepared using  ng and 1 μg RNA. The modified NuGEN Ovation v2 protocol library was prepared with  ng RNA. Images of the mice and vials were created by the Research Graphics department at MD Anderson Science Park (©MD Anderson), and the pipettes were taken from https://all-free-download.com/free-vectors/. b Flow chart outlining the ultra-low input protocol. Cells from 3 independently derived Zbtb24 wild-type (2lox/+) mESC control lines and 3 independently derived Zbtb24 knockout (1lox/1lox) mESC experimental lines were lysed directly in reaction buffer without isolating total RNA. One hundred cells (~ 1 ng RNA, 18 PCR cycles) and cells (~ 10 ng RNA, 10 PCR cycles) were used to make cDNA for the TaKaRa SMARTer Low Input RNA-Seq kit v3 protocol. One hundred-fifty pg of TaKaRa SMARTer-generated cDNA was then used to prepare the Nextera libraries. c A diagram depicting the data analysis flow and the data quality metrics used in this study to evaluate RNA-Seq protocols. The analysis steps are on the left and the data quality metrics that were derived from each analysis step are on the right.


    We used the manufacturer-recommended optimal input amounts (1 μg for both the Illumina TruSeq Stranded Total RNA and the Illumina TruSeq Stranded mRNA protocols; and  ng for the modified NuGEN Ovation v2; hereafter, “standard protocol”) (Fig. 1a). We also compared all three of these protocols with  ng input RNA (Fig. 1a and in the Additional file Figures). As described in a recent study, and as shown in Fig. 1a, the Illumina TruSeq Stranded Total RNA protocol uses Ribo-Zero to remove rRNA, whereas the TruSeq Stranded mRNA protocol enriches mRNA through polyA selection [11]. In contrast, as shown in Fig. 1a, the modified NuGEN Ovation v2 protocol synthesizes cDNA directly from total RNA with a combination of random primers and oligo [15], followed by cDNA fragmentation on Covaris. Both TruSeq protocols, on the other hand, use divalent cations under elevated temperature to fragment purified RNAs. For the TaKaRa SMARTer Ultra Low RNA Kit, we used total RNA from mESCs cells and mESCs cells or approximately 1 and 10 ng RNA, respectively. To check whether this modified ultra-low input protocol was capable of generating quality data, we compared the mESC dataset derived from the TaKaRa SMARTer cDNA synthesis step combined with Nextera library preparation to the high-quality datasets obtained using the TruSeq Stranded mRNA protocol with 2 μg total RNA as the input level.

    The data analysis flow and the data quality metrics used in this study to evaluate RNA-Seq protocols are diagrammed in Fig. 1c and detailed below.

    Mapping statistics (standard input protocols)

    The high abundance of rRNA in cells creates an important problem in RNA-Seq experiments. rRNA contamination of samples wastes reagents and decreases the recovery of other RNA species of interest. Therefore, we wanted to determine the efficacy of each protocol in removing rRNA. We found that for the libraries created with the modified NuGEN, TruSeq Stranded Total RNA, and TruSeq Stranded mRNA protocols, ~ 17, 5, and 1% of fragments, respectively, could be mapped to rRNA genes (Fig. 2a and Additional file 1: Figure S1A), indicating that under our conditions, the modified NuGEN protocol was inferior to the other two protocols in reducing rRNA contamination. After removing the rRNA reads, we mapped the remaining reads to the whole mouse genome using TopHat. The percentages of fragments with at least one end mapped to the genome were ~ 98% for both TruSeq protocols, and ~ 90% for the modified NuGEN protocol (Fig. 2b and Additional file 1: Figure S1B). The percentages of fragments with both ends mapped were > 93% for both TruSeq Stranded Total RNA and TruSeq Stranded mRNA libraries, and ~ 60% for the modified NuGEN library (Fig. 2b and Additional file 1: Figure S1B). The fragments mapped to multiple locations of the genome accounted for ~ 12–20%, ~ 3–5%, and ~ 2% of total non-rRNA fragments from the samples prepared with the TruSeq Stranded Total RNA, TruSeq Stranded mRNA, and modified NuGEN protocols, respectively (Fig. 2c and Additional file 1: Figure S1C).

    Mapping statistics and read coverage over transcripts for all the libraries prepared with standard input protocols. a The rRNA mapping rate was calculated as the percentage of fragments that were mappable to rRNA sequences. b The non-rRNA mapping rate was calculated from all the non-rRNA fragments as the percentage of fragments with both ends or one end mapped to the genome. c Multiple alignment rates were determined from non-rRNA fragments that were mapped to multiple locations of the genome. d Read-bias was assessed using the read coverage over transcripts. Each transcript was subdivided evenly into bins and the read coverage was averaged over all the transcripts
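
    The three rates defined in the caption above can be sketched as follows. This is only an illustration of the arithmetic, not the authors' pipeline: the function name, the argument names, and the fragment counts in the example are hypothetical, and in practice the counts would come from the aligner's summary output.

```python
def mapping_rates(total, rrna, both_ends, one_end, multi):
    """Summarise mapping statistics as percentages.

    total      -- all sequenced fragments
    rrna       -- fragments mappable to rRNA sequences
    both_ends  -- non-rRNA fragments with both ends mapped to the genome
    one_end    -- non-rRNA fragments with only one end mapped
    multi      -- non-rRNA fragments mapped to multiple genomic locations
    """
    non_rrna = total - rrna
    if total <= 0 or non_rrna <= 0:
        raise ValueError("need at least one rRNA-free fragment")
    return {
        # rRNA rate uses all fragments as the denominator
        "rrna_pct": 100 * rrna / total,
        # the remaining rates use only the non-rRNA fragments
        "both_ends_pct": 100 * both_ends / non_rrna,
        "at_least_one_end_pct": 100 * (both_ends + one_end) / non_rrna,
        "multi_mapped_pct": 100 * multi / non_rrna,
    }


if __name__ == "__main__":
    # Hypothetical counts for a single library
    rates = mapping_rates(total=1_000_000, rrna=50_000,
                          both_ends=900_000, one_end=40_000, multi=47_500)
    for name, value in rates.items():
        print(f"{name}: {value:.1f}%")
```

    Note the change of denominator: the rRNA rate is relative to all fragments, whereas the genome mapping and multiple alignment rates are relative to the fragments left after rRNA removal.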


    Read coverage over transcripts (standard input protocols)

    Positional signal bias in RNA-Seq data can lead to inaccurate transcript quantification. Therefore, we examined the read coverage over transcripts longer than  bps and found excessive enrichment of fragments at the 3′-end and depletion of signal at the 5′-end for samples prepared with the modified NuGEN protocol (Fig. 2d and Additional file 1: Figure S1D). Reads from the TruSeq Stranded Total RNA and TruSeq Stranded mRNA protocols were more evenly distributed along the entire length of the transcript (Fig. 2d and Additional file 1: Figure S1D). Closer examination of each nucleotide within  bps of the 5′- and 3′- ends confirmed that the modified NuGEN protocol failed to capture the RNA signal towards the 5′-end (Additional file 2: Figure S2A, C), and also suggested that the TruSeq Stranded mRNA protocol missed the signal within  bp of the 3′-end, compared to the TruSeq Stranded Total RNA protocol (Additional file 2: Figure S2B, D).
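
    One way to picture the coverage profile used here is to divide each transcript evenly into bins from the 5′-end to the 3′-end and average the coverage per bin over all transcripts, as the figure caption describes. The sketch below assumes per-base coverage is already available as a list of numbers per transcript; the function names are hypothetical and the number of bins is left as a parameter because the source does not state it in this text.

```python
def binned_profile(coverage, n_bins):
    """Mean coverage of one transcript in n_bins equal-width bins (5' to 3')."""
    length = len(coverage)
    if n_bins < 1 or length < n_bins:
        raise ValueError("transcript must be at least n_bins bases long")
    profile = []
    for b in range(n_bins):
        start = b * length // n_bins
        end = (b + 1) * length // n_bins
        segment = coverage[start:end]
        profile.append(sum(segment) / len(segment))
    return profile


def mean_profile(transcripts, n_bins):
    """Average the binned profiles over all transcripts, bin by bin."""
    profiles = [binned_profile(cov, n_bins) for cov in transcripts]
    return [sum(column) / len(column) for column in zip(*profiles)]


if __name__ == "__main__":
    # Two toy transcripts whose coverage rises towards the 3' end
    toy = [[0, 0, 2, 2], [2, 2, 4, 4]]
    print(mean_profile(toy, n_bins=2))
```

    A profile that climbs towards the last bins indicates 3′-end enrichment of the kind reported for the modified NuGEN libraries, while a flat profile indicates even coverage.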

    Representation of the transcriptome (standard input protocols)

    To assess how well the entire transcriptome was represented within the libraries generated by the three RNA-Seq protocols, we first investigated the composition of uniquely mapped fragments in exonic, intronic, and intergenic regions (Fig. 3a and Additional file 3: Figure S3A). We found that for the TruSeq Stranded Total RNA and mRNA protocols, respectively, approximately 67–84% and 88–91% of the fragments were from exonic regions; 14–28 and < 10% were from intronic regions; and the remaining 3–5% were from intergenic regions. For the modified NuGEN protocol, only 35–45% of the fragments were from exonic regions; 47–56% were from intronic regions; and less than 10% were from intergenic regions. Since only the TruSeq protocols are strand-specific, as expected, the majority of the fragments in exonic and intronic regions were from the sense strand of the genes, whereas for the NuGEN libraries about half of the fragments were from the sense strand and the other half were from the antisense strand of the genes.

    Representation of the transcriptome for all the libraries prepared with standard protocols. a Composition of the uniquely mapped fragments, shown as the percentage of fragments in exonic, intronic, and intergenic regions. According to the direction of transcription, exonic and intronic regions were further divided into sense and antisense. b Saturation analysis showing the percentage of coding genes recovered (calculated as the genes with more than 10 fragments) at increasing sequencing depth. c-d Saturation analysis showing the percentage of lncRNAs recovered (calculated as the lncRNAs with more than 10 fragments) at increasing sequencing depth. In c, the six libraries created using each of three protocols (18 libraries total) are plotted individually. In d, the six libraries from the same protocol were pooled. e Saturation analysis showing the number of splice junctions recovered at increasing sequencing depth.


    To evaluate the capability of the RNA-Seq protocols for detecting coding genes and lncRNAs, we performed saturation analysis to count the number of coding genes and lncRNAs detected at increasing sequencing depth. For coding genes, the saturation curves from the TruSeq Stranded Total RNA and mRNA libraries looked very similar and were superior to those from the NuGEN libraries (Fig. 3b and Additional file 3: Figure S3B). For lncRNAs, the modified NuGEN protocol outperformed both the TruSeq Stranded Total RNA and mRNA protocols, yielding more lncRNAs at the same sequencing depth (Fig. 3c and Additional file 3: Figure S3C). However, for lncRNAs, none of the libraries were close to saturation at the sequencing depth used for our experiments. To examine the sequencing depth required to reach saturation for lncRNA detection, we repeated our saturation analysis after pooling samples from the same RNA-Seq protocol together. Our analysis showed that the modified NuGEN protocol still exceeded the other two protocols in lncRNA recovery, even when sequencing depth approached saturation (Fig. 3d and Additional file 3: Figure S3D).
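
    A saturation analysis of this kind can be approximated by subsampling the mapped fragments and counting how many genes pass the detection threshold at each depth. The sketch below is an assumption about how such a curve could be computed, not the authors' code: it takes one gene identifier per uniquely mapped fragment, uses the "more than 10 fragments" threshold from the figure caption, and the function name, fractions, and toy data are hypothetical.

```python
import random
from collections import Counter


def saturation_curve(fragment_genes, fractions, min_fragments=10, seed=0):
    """Return (depth, genes_detected) pairs at increasing sequencing depth.

    fragment_genes -- one gene ID per uniquely mapped fragment
    fractions      -- increasing fractions of the library to subsample
    min_fragments  -- a gene counts as detected with MORE than this many fragments
    """
    rng = random.Random(seed)
    shuffled = list(fragment_genes)
    rng.shuffle(shuffled)  # nested prefixes of one shuffle act as subsamples
    curve = []
    for fraction in fractions:
        depth = int(len(shuffled) * fraction)
        counts = Counter(shuffled[:depth])
        detected = sum(1 for c in counts.values() if c > min_fragments)
        curve.append((depth, detected))
    return curve


if __name__ == "__main__":
    # Toy library: one abundant gene, one moderate gene, one rare gene
    toy = ["geneA"] * 50 + ["geneB"] * 20 + ["geneC"] * 5
    for depth, detected in saturation_curve(toy, (0.25, 0.5, 0.75, 1.0)):
        print(depth, detected)
```

    Dividing the detected count by the number of annotated genes gives the percentage plotted in the figure. A curve that is still rising at full depth, as reported for lncRNAs, means that deeper sequencing would recover more features.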

    Another important application of RNA-Seq is to identify alternatively spliced variants, which frequently occur in mammalian genes [16]. In this regard, we conducted saturation analysis comparing the number of reads to the number of detected splice sites (Fig. 3e and Additional file 3: Figure S3E). We recovered the lowest number of splice junctions using the modified NuGEN protocol and the highest number with the TruSeq Stranded mRNA protocol.

    Concordance of expression quantification (standard input protocols)

    To assess the concordance of the three standard RNA-Seq protocols in expression quantification, we calculated Spearman’s rank correlation coefficients between samples from counts per million (cpm) of fragments mapped to exons. The correlation coefficients were greater than between samples prepared using the same protocol, regardless of whether the samples were biological replicates of the same condition or from different conditions. The correlation coefficients between samples prepared using different protocols were lower: – between the TruSeq Stranded Total RNA and mRNA protocols, – between the TruSeq Stranded Total RNA and modified NuGEN protocols, and – between the TruSeq Stranded mRNA and modified NuGEN protocols (Fig. 4a and Additional file 4: Figure S4A). Unsupervised clustering demonstrated that the whole transcriptome expression profiles obtained from TruSeq Stranded Total RNA and mRNA libraries were more similar to each other than either was to the NuGEN libraries (Fig. 4b and Additional file 4: Figure S4B). Principal component analysis (PCA) recapitulated the clustering analysis: the NuGEN libraries were separated from the TruSeq libraries in the first component, whereas the TruSeq Stranded Total RNA and mRNA libraries were separated in the second component (Fig. 4c and Additional file 4: Figure S4C). Further investigation revealed that the TruSeq protocols tended to capture genes with higher expression and GC content, whereas the modified NuGEN protocol tended to capture longer genes (Additional file 7: Figure S7B-C). Comparing the TruSeq mRNA protocol to the TruSeq Total RNA protocol showed that the TruSeq mRNA protocol preferentially recovered genes with higher GC content and shorter length (Additional file 7: Figure S7A).

    To exclude the possibility that these differences stemmed from batch effects, such as different sets of libraries being prepared at different times, we included additional technical replicates, prepared at different times, for the TruSeq Stranded Total RNA and mRNA protocols (1 μg). Unsupervised clustering suggested that the distance between technical replicates of the same protocol was smaller than the distance between samples prepared with different protocols (Additional file 5: Figure S5A). The technical replicate libraries generated using the same protocol clustered together and were separated from those of different protocols in PCA (Additional file 5: Figure S5B). Taken together, these results demonstrate that the variability among these library preparation protocols was not primarily due to batch effects.

    Concordance of expression quantification between the libraries prepared with standard input protocols. a Scatter plots in a smoothed color density representation (top-right panel) and Spearman’s rank correlation coefficients (bottom-left panel) for all pairs of libraries using log2(cpm + 1) values. b Unsupervised clustering of all the libraries using log2(cpm + 1) values. Euclidean distance with complete linkage was used to cluster the libraries. c Principal component analysis (PCA) of all the libraries, using log2(cpm + 1) values. The values for each gene across all the libraries were centered to zero and scaled to have unit variance before being analyzed. Circles and triangles represent control and experimental libraries, respectively (NuGEN, red; TruSeq mRNA, green; TruSeq Total RNA, blue). For all analyses in Fig. 4, genes represented by fewer than 10 fragments in all the libraries were excluded.
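
    The quantities named in this caption, log2(cpm + 1) values and Spearman’s rank correlation, can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the authors' analysis: the function names and the toy fragment counts are hypothetical, library size is taken as the sum of the supplied counts, and a statistics package would normally be used instead of the hand-written rank correlation.

```python
import math


def log2_cpm(counts):
    """log2(counts per million + 1) for one library's per-gene fragment counts."""
    total = sum(counts)
    return [math.log2(c * 1e6 / total + 1) for c in counts]


def keep_gene(counts_across_libraries, min_fragments=10):
    """Drop a gene only if it has fewer than min_fragments in ALL libraries."""
    return max(counts_across_libraries) >= min_fragments


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    lib_a = [100, 200, 300, 400]  # hypothetical fragment counts per gene
    lib_b = [10, 40, 30, 80]
    print(spearman(log2_cpm(lib_a), log2_cpm(lib_b)))
```

    Because Spearman’s coefficient depends only on ranks, the log2(cpm + 1) transform does not change it for a single pair of libraries; the transform matters for the Euclidean clustering and PCA steps, which operate on the values themselves.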


    Concordance of DEGs recovered with standard input protocols

    PCA demonstrated that all protocols could distinguish between samples representing different biological conditions (Fig. 5a and Additional file 6: Figure S6A). Three hundred ninety-four DEGs were detected across all three RNA-Seq library preparation protocols, accounting for 41, 38, and 28% of the total DEGs detected when using the TruSeq Stranded Total RNA, TruSeq Stranded mRNA, and modified NuGEN protocols, respectively (Fig. 5b). The pairwise scatter plots of log2 ratio values between DEGs from control and experimental mouse tumor tissues showed that the TruSeq Stranded Total RNA and mRNA results were more highly correlated with each other (Spearman’s correlation coefficient = ) than either was with the modified NuGEN protocol (Spearman’s correlation coefficient =  and , respectively) (Fig. 5c and Additional file 6: Figure S6B). That is, the TruSeq Total RNA and mRNA protocols yielded more shared DEGs than either did with the modified NuGEN protocol (Fig. 5c and Additional file 6: Figure S6B). To evaluate how accurate the DEG calls were, we performed qPCR for genes that RNA-Seq data indicated were differentially expressed, and compared the log2 ratio values for these genes as derived from the various RNA-Seq library preparation protocols and qPCR (manuscript in preparation). The DEGs recovered with the TruSeq Total RNA and mRNA protocols had correlation coefficients of and vs. qPCR, whereas the modified NuGEN protocol had a correlation coefficient of (Fig. 5d). In short, the libraries produced by all three standard protocols were sufficient to detect DEGs. However, independent validation of DEGs by qPCR indicated that the differential expression results from the TruSeq Stranded Total RNA and mRNA protocols might be more accurate than those from the modified NuGEN protocol.

    Concordance of differentially expressed genes (DEGs) recovered from libraries prepared with standard protocols. a Principal component analysis (PCA) was performed on the libraries prepared with each standard protocol. b Venn diagram showing the number of DEGs recovered with the three standard protocols. c Pairwise scatter plots of log2 ratio values comparing the DEGs identified in the tumor tissues of control and experimental mice. The black dots represent genes that were called as differentially expressed in libraries from both protocols; colored dots represent genes that were called as differentially expressed in the libraries from only one protocol. The Spearman’s rank correlation coefficient is shown at the top of each plot. The Venn diagram above each plot shows the number of DEGs recovered with the specified protocols. d Scatter plots of log2 ratio values calculated between tumor tissues of control and experimental mice for each protocol vs. qPCR. Spearman’s rank correlation coefficient is shown at the top of each plot.
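
    The Venn-diagram comparison amounts to set arithmetic: the DEGs shared by all protocols are the intersection of the per-protocol DEG sets, and the share they represent is computed separately against each protocol's total. The sketch below uses made-up gene names and a hypothetical function name purely to illustrate the calculation; it does not reproduce the study's gene lists.

```python
def shared_deg_summary(deg_sets):
    """Return the DEGs common to all protocols and their share per protocol.

    deg_sets -- mapping of protocol name to an iterable of DEG identifiers
    """
    sets = {name: set(genes) for name, genes in deg_sets.items()}
    shared = set.intersection(*sets.values())
    percentages = {
        name: 100 * len(shared) / len(genes) for name, genes in sets.items()
    }
    return shared, percentages


if __name__ == "__main__":
    # Hypothetical DEG calls from three protocols
    toy = {
        "protocol_1": ["g1", "g2", "g3", "g4"],
        "protocol_2": ["g1", "g2", "g5"],
        "protocol_3": ["g1", "g2", "g6", "g7", "g8"],
    }
    shared, pct = shared_deg_summary(toy)
    print(sorted(shared))
    for name, value in pct.items():
        print(f"{name}: {value:.0f}% of its DEGs are shared by all protocols")
```

    The same shared set yields a smaller percentage for a protocol that calls more DEGs overall, which is why a single overlap count translates into different percentages for each protocol.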


    Mapping statistics, read coverage bias and transcriptome representation (ultra-low protocol)

    Increasing numbers of omics studies are being designed to investigate minor cell subpopulations, rare cell types, and even single cells. Effectively executing low-input RNA-Seq is essential to achieve these goals. To determine the applicability of the TaKaRa SMARTer Ultra Low RNA Kit v3 with low-level RNA input or mESCs from each of three Zbtb24 knockout (1lox/1lox) clones (biological replicates) and three wild-type (2lox/+) clones (biological replicates), we evaluated its performance by comparing it to that of the TruSeq Stranded mRNA protocol using 2 μg of total RNA, as a “gold standard” that represents overall robustness with regard to rRNA contamination, mRNA species representation, identification of DEGs, and overall reproducibility. The SMARTer kit protocol resulted in libraries with higher levels of rRNA contamination at both the (~ 1 ng RNA) and cell (~ 10 ng RNA) levels than did the TruSeq Stranded mRNA protocol using standard input RNA amounts (Fig. 6a). The percentage of fragments with both ends mapped to the genome was 91–92% for the TruSeq Stranded mRNA protocol and 60–65% for the SMARTer protocol using either or cells (Fig. 6b). The coverage of fragments over transcripts suggested the SMARTer protocol libraries were biased toward the 3′-end of transcripts compared to the TruSeq Stranded mRNA protocol libraries (Fig. 6c). For libraries from the SMARTer protocol with and cells, around 90% of the fragments were from exonic regions, ~ 6% were from intronic regions, and ~ 4% were from intergenic regions, which was comparable to libraries from the TruSeq Stranded mRNA protocol (Fig. 6d). Since the SMARTer protocol is not strand-specific, half of the fragments were from the sense strand and the other half were from the antisense strand of the genes (Fig. 6d). For coding genes, the saturation curves for libraries from the SMARTer protocol with and cells were very similar and were slightly less robust than those from the TruSeq Stranded mRNA protocol (Fig. 6e). The SMARTer protocol recovered more lncRNAs than the TruSeq Stranded mRNA protocol at the same sequencing depth (Fig. 6f). However, at the same sequencing depth, the number of splice junctions detected in libraries from the SMARTer protocol was lower than in libraries from the TruSeq Stranded mRNA protocol (Fig. 6g). Overall, low-input RNA samples subjected to the SMARTer protocol, when compared to the TruSeq Stranded mRNA protocol, produced data with greater rRNA contamination but similar rates of exon detection. Furthermore, we recovered fewer coding genes and splice junctions but more lncRNAs from libraries generated with the SMARTer Ultra Low RNA Kit. In summary, the kit performed well on these low-input samples, but as anticipated, did not capture the range of expression recovered with a kit using more input RNA.

    Mapping statistics, read coverage bias, and transcriptome representation for libraries prepared using the SMARTer Ultra Low RNA Kit. a The percentage of fragments mapped to rRNA sequences. b Of all the non-rRNA fragments, the percentage of fragments with both ends or one end mapped to the genome. c The read coverage over transcripts. Each transcript was subdivided evenly into bins and the read coverage was averaged over all the transcripts. d Composition of the uniquely mapped fragments, shown as the percentage of fragments in exonic, intronic, and intergenic regions. According to the direction of transcription, exonic and intronic regions were further divided to sense and antisense. e Saturation analysis showing the percentage of coding genes recovered at increasing sequencing depth. f Saturation analysis showing the percentage of lncRNAs recovered at increasing sequencing depth. g Saturation analysis showing the number of splice junctions recovered at increasing sequencing depth. For the purpose of evaluation, the above analyses also include the libraries prepared with the TruSeq Stranded mRNA protocol using the same biological conditions


    Concordance of expression quantification and DE detection (ultra-low protocol)

    Spearman’s rank correlation coefficients between the low-input samples prepared from the same or different input quantities were very good (–), indicating high reproducibility with the SMARTer Ultra Low RNA Kit protocol. However, the coefficients between samples prepared using the SMARTer and standard TruSeq Stranded mRNA protocols were lower (–) (Fig. 7a). PCA showed that the variability among samples was largely due to differences between the SMARTer and TruSeq Stranded mRNA libraries, as described in the first component (Fig. 7b). The transcriptome profile changes from biological conditions within each protocol could be explained by the second component (Fig. 7b). Further investigation showed the SMARTer protocol tended to allow recovery of genes with higher expression, lower GC content, and shorter length, compared to the TruSeq mRNA protocol (Additional file 7: Figure S7D-F). There were DEGs shared between the SMARTer libraries generated from either or cells and the TruSeq Stranded mRNA libraries, accounting for 40, 37, and 23% of the total DEGs detected in each, respectively, but the majority of DEGs recovered from the TruSeq Stranded mRNA libraries ( genes) were excluded from the SMARTer libraries (Fig. 7c). The pairwise scatter plots of log2 ratios between biological interventions using DEGs showed that the concordance of DEG detection between the SMARTer libraries prepared with cells vs. cells, or between SMARTer vs. TruSeq Stranded mRNA, was much lower than that between the standard protocols at normal input level (Fig. 7d vs. Figure 5c). In summary, the SMARTer Ultra Low RNA Kit is capable of capturing the effect of biological conditions, but is not as robust as the standard input protocol at a normal input level of 2 μg for the TruSeq Stranded mRNA-Seq protocol.

    Concordance of expression quantification and DEG detection using the SMARTer Ultra Low RNA Kit. For the purpose of evaluation, the libraries prepared from the same biological conditions with the TruSeq Stranded mRNA protocol are also included. a Smoothed color density representation scatter plots (top, right) and Spearman’s rank correlation coefficients (bottom left) for all library pairs using log2(cpm + 1) values. and represent the SMARTer Ultra Low RNA Kit using and cells. b Principal component analysis (PCA) of all libraries using log2(cpm + 1) values. Red, blue, and green represent libraries prepared with the ultra-low protocol cells, ultra-low protocol cells, and TruSeq Stranded mRNA protocol, respectively. Circles and triangles represent control and experimental libraries, respectively. c Venn diagram showing the number of DEGs recovered with the SMARTer Ultra Low RNA ( cells and cells) and the TruSeq Stranded mRNA kits. d Pairwise scatter plots of log2 ratio values between the biological conditions using the DEGs. The black dots represent genes called as differentially expressed in libraries prepared with both kits, and the colored dots represent genes called as differentially expressed in libraries from only one kit. The Spearman’s rank correlation coefficient is shown at the top of each plot. The Venn diagram to the left of each scatter plot shows the number of DEGs called for the data produced using both or only one of the protocols
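    The pairwise concordance measure described in this legend can be illustrated with a minimal sketch, assuming raw per-gene fragment counts are available for each library. The function names and the toy counts below are hypothetical and are not taken from the study; the sketch only shows how a Spearman's rank correlation on log2(cpm + 1) values could be computed.

```python
import math

def log2_cpm(counts):
    """Convert raw fragment counts for one library to log2(cpm + 1)."""
    total = sum(counts)
    return [math.log2(c * 1e6 / total + 1) for c in counts]

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy counts for five genes in two hypothetical libraries.
lib_a = [120, 0, 35, 900, 4]
lib_b = [100, 2, 40, 850, 1]
rho = spearman(log2_cpm(lib_a), log2_cpm(lib_b))
print(round(rho, 3))
```

    Because the log2(cpm + 1) transform is monotonic within a library, it does not change the ranks; it matters for the scatter plots and PCA rather than for the rank correlation itself.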


    Discussion

    Comparing global gene expression in differing biological contexts is a cornerstone of contemporary biology. As microarray technology is being supplanted by RNA-Seq methods for many applications, it is imperative to determine which library preparation protocols are best suited for specific needs, for example the recovery of coding vs. non-coding RNAs and reliable discernment of DEGs. Here, we have examined three different standard RNA-Seq library preparation protocols, and one low-input protocol in terms of overall reproducibility, rRNA contamination, read coverage, 5′- and 3′-end bias, and recovery of exonic vs. intronic sequences, lncRNAs, and DEGs. These protocols were the standard input Illumina TruSeq Stranded Total RNA, Illumina TruSeq Stranded mRNA, and modified NuGEN Ovation v2 kits; and the low input TaKaRa SMARTer Low Input RNA-Seq kit v3, tested at two different input levels, (~ 1 ng RNA) and (~ 10 ng RNA) cells. Although all protocols yielded reproducible data, overall, the Illumina kits generally outperformed the modified NuGEN Ovation v2 kit at standard RNA input levels. The modified NuGEN protocol was useful for the recovery of lncRNAs and intronic sequences, but also had higher levels of rRNA contamination.

    Undesirable recovery of rRNA

    One impediment to the efficient recovery of meaningful RNA-Seq data is repetitive rRNA. Nearly 80% of RNA in a cell is rRNA, making it preferable to remove this class of RNA prior to library construction [17]. RNA-Seq library preparation protocols depend on one of two means of reducing rRNA contamination: rRNA depletion and polyA enrichment. For the three standard protocols and the one ultra-low input protocol we evaluated, the TruSeq Stranded Total RNA and the modified NuGEN Ovation RNA-Seq System V2 protocols employ rRNA depletion methods, whereas the TruSeq Stranded mRNA protocol and SMARTer Ultra-low protocol use polyA enrichment methods to reduce rRNA contamination in sequencing libraries. In our present study, the modified NuGEN protocol libraries averaged 15–20% of their reads mapping to rRNA, as compared to 1–5% for the TruSeq protocols (Fig. 2a and Additional file 1: Figure S1A). These results are consistent with those reported by Adiconis et al. (%) [12], but lower than those reported by Shanker et al. (35%) [13]. However, our NuGEN rRNA mapping rates were much higher than those reported by both Sun et al. [18] and Alberti et al. [19] who had only a 1% rRNA mapping rate for both their Illumina- and NuGEN-created libraries. While we cannot explain the differences in rRNA mapping rates for the NuGEN libraries in these studies, in our core facility, the NuGEN Ovation v2 kit libraries consistently resulted in a 15–20% rRNA mapping rate, not only in this study, but also in prior sequencing libraries constructed in our facility (data not shown), thus providing part of the impetus for the current study. We also examined the rRNA mapping rate in libraries prepared from two polyA-enrichment protocols, the Illumina TruSeq Stranded mRNA protocol and the TaKaRa SMARTer Ultra Low RNA protocol. The SMARTer protocol yielded a 7–9% rRNA mapping rate, which was inferior to the TruSeq protocol at standard RNA input levels (1%) (Fig. 6a). 
The 7–9% mapping rate yielded by the SMARTer protocol in our facility was consistent with that reported by Adiconis et al. [12] and Alberti et al. [19]. Overall, the protocols we tested removed the majority of rRNA. Although the modified NuGEN protocol showed relatively higher rRNA content, the presence of rRNA is not expected to bias expression quantification, so an increase in sequencing depth should compensate.

    Overall mapping, end bias and exonic coverage

    The TruSeq protocols yielded a ≥ 90% overall mapping rate for fragments with both ends mapped to the genome, compared to 60% for the modified NuGEN protocol (Fig. 2b and Additional file 1: Figure S1B). This is on par with a prior study showing NuGEN rRNA-depleted libraries had a 75% alignment rate and TruSeq PolyA-enrichment mRNA libraries had a 90% alignment rate [18].

    To assess whether complete transcripts were evenly captured by the three standard library preparation protocols, we examined read coverage over the length of the full transcript. Our results, like those of Adiconis et al. [12], indicated that NuGEN libraries displayed augmented 3′-end signal and depleted 5′-end signal, perhaps due to the use of a combination of oligo(dT) and random primers during cDNA synthesis [12]. The TruSeq Stranded mRNA libraries were also somewhat biased, as reflected by a lack of reads within  bps of the 3′-end, relative to the TruSeq Total RNA libraries (Additional file 2: Figure S2B, 2D). This may reflect the different rRNA removal approaches used by the TruSeq mRNA (polyA enrichment) and TruSeq Total RNA (rRNA depletion) protocols, which would leave more unmappable reads near the 3′-end in TruSeq mRNA libraries because those reads contain polyA tails.

    To determine how well each protocol performed in recovering the transcriptome, we examined the composition of the uniquely mapped fragments from the two Illumina and the modified NuGEN protocols. Ninety percent of our reads were mapped to exons using the TruSeq Stranded mRNA kit, 67–84% using the Total RNA kit, and 35–46% using the NuGEN kit (Fig. 3a and Additional file 3: Figure S3A). This is consistent with similar studies using these kits [9, 11, 13, 18] and suggests that polyA-enrichment protocols may be superior to rRNA depletion protocols for studies focusing on exonic RNA [11, 13, 18]. This is further supported by our finding that the polyA-based TaKaRa SMARTer Ultra Low RNA Kit had almost the same exonic coverage as the TruSeq Stranded mRNA protocol (Fig. 6d). The inverse was true for the recovery of intronic sequences, with rRNA-depleted libraries outperforming the polyA-enrichment libraries. For example, the modified NuGEN protocol yielded ~ 50% intronic sequences, which was on par with the results of Shanker et al. (after removing PCR duplicates) [13], whereas our TruSeq Stranded Total RNA libraries consisted of 14–28% intronic sequences. In contrast, the TruSeq Stranded mRNA libraries contained only 6–8% intronic sequences (Fig. 3a and Additional file 3: Figure S3A). We also found that the modified NuGEN kit yielded better lncRNA recovery. In this case, better lncRNA recovery may be due to differences in the cDNA synthesis step rather than in the rRNA depletion step: whereas the TruSeq Stranded Total RNA protocol uses only random primers for cDNA synthesis, the modified NuGEN protocol uses a combination of random and oligo(dT) primers, thus allowing more efficient capture of both coding and non-coding RNAs with and without polyA tails [11].
However, it is also possible that some of the lncRNAs identified in the rRNA-depleted libraries are merely false signals originating from intronic reads from other coding genes rather than lncRNAs [11]. Additionally, it is worth noting that in our saturation analysis (Fig. 3b, c Additional file 3: Figure S3B, 3C), the curves reached saturation at ~ 60% coding genes or ~ 30% lncRNAs, suggesting that achieving increased coverage of coding genes or lncRNAs beyond these levels by deeper sequencing would be very difficult.

    Gene quantification and identification of DEGs

    Quantifying gene expression and identifying DEGs between samples from different biological conditions are two of the primary goals of most RNA-Seq experiments. In the current study, we identified and DEGs between experimental and control tumor tissues using the TruSeq Total RNA and mRNA protocols (manuscript in preparation), respectively, which was slightly fewer than the DEGs identified using the modified NuGEN protocol (Fig. 5b). This contrasts with the work of Sun et al., who recovered fewer DEGs from NuGEN libraries than from TruSeq polyA-enrichment libraries [18]. To explore this difference, we validated our RNA-Seq-identified DEGs using qRT-PCR. We found that a greater proportion of DEGs identified using the TruSeq Stranded Total RNA and mRNA libraries were supported by our qRT-PCR results compared to DEGs identified using the modified NuGEN protocol libraries. That is, the modified NuGEN protocol may have resulted in more false-positive DEGs than did the TruSeq protocols. The comparable performance of the TruSeq Total and mRNA protocols in our study contrasts with the results of Zhao et al., who directly compared the TruSeq Stranded Total and mRNA protocols using clinical samples. They found that the TruSeq Stranded mRNA libraries more accurately predicted gene expression levels than the TruSeq Stranded Total RNA libraries [11].

    Although the SMARTer Ultra Low RNA Kit-generated libraries were able to capture the effect of biological differences between experimental and control samples, overall, its performance was inferior to that of the TruSeq Stranded mRNA protocol, given both the higher amount of rRNA recovered and the lower number of DEGs recovered (Figs. 6 and 7). This may be due to the very different levels of input RNA used in these two protocols.

    Limitations and future work

    This study has limitations that could be addressed in future work. For example, it did not include spike-in RNAs, which could serve as a sample-independent benchmark to further evaluate the accuracy of DEG detection in libraries prepared by different protocols. Future work could also investigate additional ultralow-input RNA-Seq protocols and use standard RNA samples such as Universal Human Reference RNA (UHRR) for easier comparison with other studies [20].

    Conclusions

    In summary, all the RNA-Seq library preparation protocols evaluated in this study were suitable for distinguishing between experimental groups when using the manufacturers’ recommended amount of input RNA. However, we made some discoveries that might have been previously overlooked. First, we found that the TruSeq Stranded mRNA protocol is universally applicable to studies focusing on dissecting protein-coding gene profiles when the amount of input RNA is sufficient, whereas the modified NuGEN protocol might provide more information in studies designed to understand lncRNA profiles. Therefore, choosing the appropriate RNA-Seq library preparation protocol for recovering specific classes of RNA should be a part of the overall study design [18]. Second, when dealing with small amounts of input RNA, the SMARTer Ultra Low RNA Kit may be a good choice in terms of rRNA removal, exonic mapping rates and recovered DEGs. Third, our saturation analysis indicated that the required sequencing depth depends on the biological question being addressed by each individual study. Roughly, a minimum of 20 M aligned reads/mate-pairs are required for a project designed to detect coding genes and increasing the sequencing depth to ≥ M reads may be necessary to thoroughly investigate lncRNAs [21] (note: the needed sequencing depth may also vary depending on different biological samples and study designs). Omics technology and big data will facilitate the development of personalized medicine, but we should understand the outcomes of the experimental parameters and control for those as thoroughly as possible.

    Methods

    Biological samples and RNA isolation

    The use of mice in this project has been reviewed and approved by The University of Texas MD Anderson Cancer Center (MD Anderson) IACUC committee (ACUF 04–, S. Fischer) and (ACUF MODIFICATION RN01, T. Chen). C57BL/6 mice were purchased from The Jackson Laboratory (Bar Harbor, ME). For the three standard input RNA-Seq library preparation protocols (Illumina TruSeq Stranded Total RNA, TruSeq Stranded mRNA kit, and the modified NuGEN Ovation RNA-Seq kits), total RNA was isolated from three xenograft tumors (biological replicates) from control [30% calorie-restricted (CR) diet] and experimental [diet-induced obese (OB)] xenograft mouse models in the C57BL/6 genetic background, respectively. C57BL/6 mice were chosen, in part, because they are susceptible to obesity when fed a high-fat diet [22]. We fed the mice two commercial diets following previously established guidelines (Research Diets, Inc., New Brunswick, NJ): a CR diet (D) for lean C57BL/6 mice (30% CR), and a diet-induced obesity (DIO) diet (D; consumed ad libitum) for OB C57BL/6 mice, 10 mice per group [23]. Mice were humanely euthanized using carbon dioxide followed by cervical dislocation, per IACUC-approved procedures. A manuscript describing the details of the mouse obesity/tumor xenograft study, including transcriptomic profiling results, is in preparation. For the SMARTer Ultra Low RNA Kit, designed to evaluate both rare cell populations and fixed clinical samples, three mESC lines (biological replicates) from Zbtb24 knockout (1lox/1lox) clones and three Zbtb24 wild-type (2lox/+) clones were used as experimental and control samples, respectively. The mice used for this part of the study were generated in-house at MD Anderson Science Park. A manuscript describing the Zbtb24 KO mESCs, including transcriptomic profiling results, is also in preparation.

    Total RNA from mouse xenograft tumor tissues was isolated using TRIZOL following the manufacturer’s protocol. Isolated RNA samples were treated with DNase I followed by purification with a QIAGEN RNeasy Mini kit (Madison, WI). Total RNA from mESCs was extracted using the QIAGEN RNeasy Mini kit with on-column DNase treatment following the manufacturer’s protocol. Both concentration and quality of all the isolated RNA samples were measured and checked with an Agilent Bioanalyzer and Qubit. All RNA samples had RNA integrity numbers >  For the low-cell-input experiments, cells and cells (~ 1 and 10 ng RNA, respectively, according to the SMARTer Ultra Low RNA kit user manual) were used directly without isolating total RNA in accordance with manufacturer recommendations.

    TruSeq Stranded Total RNA and mRNA library preparations

    Libraries were prepared using the Illumina TruSeq Stranded Total RNA (Cat. # RS) or mRNA (Cat. # RS) kit according to the manufacturer’s protocol starting with 1 μg total RNA. Briefly, rRNA-depleted RNAs (Total RNA kit) or purified mRNAs (mRNA kit) were fragmented and converted to cDNA with reverse transcriptase. The resulting cDNAs were converted to double stranded cDNAs and subjected to end-repair, A-tailing, and adapter ligation. The constructed libraries were amplified using 8 cycles of PCR.

    NuGEN Ovation RNA-Seq System v2 modified with SPRI-TE library construction system

    Total RNA ( ng) was converted to cDNA using the NuGEN Ovation RNA-Seq System v2 (Cat. # –32) (NuGEN) following the manufacturer’s protocol (NuGEN, San Carlos, CA). NuGEN-amplified double-stranded cDNAs were broken into ~  base pair (bp) fragments by sonication with a Covaris S instrument (Covaris, Woburn, MA). Fragmented cDNAs were processed on a SPRI-TE library construction system (Beckman Coulter, Fullerton, CA). Uniquely indexed NEXTflex adapters (Bioo Scientific, Austin, TX) were ligated onto each sample to allow for multiplexing. Adapter-ligated libraries were amplified [1 cycle at 98 °C for 45 s; 15 cycles at 98 °C for 15 s, 65 °C for 30 s, and 72 °C for 30 s; 1 cycle at 72 °C for 1 min; and a hold at 4 °C] using a KAPA library amplification kit (KAPA Biosystems, Wilmington, MA) and purified with AMPure XP beads (Beckman Coulter).

    Modified protocol for the SMARTer Ultra Low RNA and Nextera DNA library preparation kits

    mESCs were lysed in the reaction buffer included in the SMARTer Ultra Low RNA Kit v3 (Cat. # ) (TaKaRa, Japan). cDNA was then synthesized using the SMARTer Ultra Low RNA Kit followed by library construction using the Nextera DNA Sample Preparation Kit (Cat. # FC) (Illumina, San Diego, CA), according to the manufacturers’ protocols. We performed 10 cycles of PCR for cells (~ 10 ng RNA) (SMARTer ), and 18 cycles of PCR for cells (~ 1 ng RNA) (SMARTer ).

    Next-generation sequencing

    Ten pM of pooled libraries were processed using a cBot (Illumina) for cluster generation before sequencing on an Illumina HiSeq (2 × 76 bp run).

    RNA-Seq data analysis

    Mapping

    Reads were mapped to rRNA sequences (GI numbers: , , , , and Ensembl IDs: ENSMUST, ENSMUST, ENSMUST, ENSMUST) using Bowtie2 (version ) [24]. Reads that were not mapped to rRNAs were then mapped to the mouse genome (mm10) using TopHat (version ) [25].

    Read coverage over transcripts

    The longest transcript from each gene was chosen to represent the gene. The reads were then mapped to all the transcript sequences using Bowtie2. Transcripts with fewer than total fragment counts or shorter than  bps were filtered out leaving at least 12 k transcripts for each sample. Each full-length transcript was subdivided evenly into bins. The mean coverage of fragments over each bin was normalized to the total coverage over the whole transcript and then averaged over all the transcripts. Alternatively, the coverage of fragments over each position of the  bps downstream of the 5′-end or upstream of the 3′-end was normalized by the mean coverage of the whole transcript, and then averaged over all the transcripts.
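    The binning step above can be sketched as follows. This is only an illustration under stated assumptions: per-base coverage is assumed to be available as a list of numbers for each transcript, and `n_bins` is a hypothetical parameter standing in for the number of bins used in the study. The function names are also hypothetical.

```python
def binned_profile(coverage, n_bins):
    """Mean coverage in each bin, divided by the transcript's total coverage."""
    length = len(coverage)
    total = sum(coverage)
    profile = []
    for b in range(n_bins):
        # Integer arithmetic splits the transcript into near-equal bins.
        start = b * length // n_bins
        end = (b + 1) * length // n_bins
        bin_mean = sum(coverage[start:end]) / (end - start)
        profile.append(bin_mean / total)
    return profile

def average_profile(coverages, n_bins):
    """Average the normalized per-bin profiles over all usable transcripts."""
    profiles = [
        binned_profile(c, n_bins)
        for c in coverages
        if len(c) >= n_bins and sum(c) > 0  # skip empty or too-short transcripts
    ]
    return [sum(p[b] for p in profiles) / len(profiles) for b in range(n_bins)]

# Two toy transcripts (5' to 3'), both with more coverage toward the 3'-end.
toy = [
    [1, 1, 3, 3],
    [2, 2, 2, 2, 6, 6, 6, 6],
]
profile = average_profile(toy, n_bins=2)
print(profile)
```

    A profile that rises toward the last bins, as in this toy input, is the kind of pattern that would indicate 3′-end bias.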

    Discovery of splicing junctions

    The number of known splicing junctions (defined as junctions with both 5′- and 3′- splice sites annotated in the reference gene set) supported by at least one read in each sample was counted using RSeQC (version ) [26].

    Saturation plots

    Each point in a saturation curve was generated by randomly selecting the desired number of fragments and calculating the percentage of genes with more than 10 fragments over all the genes. For each sample, this procedure was repeated three times and the curve represents the average percentage of genes at each corresponding number of fragments.
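    A minimal sketch of one saturation-curve point, assuming each mapped fragment has already been assigned to a gene identifier. The function name, the input representation, and the fixed random seed are assumptions made for illustration; the threshold of more than 10 fragments and the three repeats follow the description above.

```python
import random
from collections import Counter

def saturation_point(fragment_genes, n_fragments, n_genes_total,
                     min_fragments=10, repeats=3, seed=0):
    """Average percentage of genes with more than `min_fragments` fragments
    after randomly drawing `n_fragments` fragments, over `repeats` draws."""
    rng = random.Random(seed)
    percentages = []
    for _ in range(repeats):
        # Draw fragments without replacement, then count them per gene.
        sample = rng.sample(fragment_genes, n_fragments)
        per_gene = Counter(sample)
        detected = sum(1 for n in per_gene.values() if n > min_fragments)
        percentages.append(100.0 * detected / n_genes_total)
    return sum(percentages) / repeats

# Toy library: one gene label per fragment; a fourth annotated gene has no fragments.
fragments = ["geneA"] * 50 + ["geneB"] * 20 + ["geneC"] * 5
full_depth = saturation_point(fragments, 75, n_genes_total=4)
shallow = saturation_point(fragments, 10, n_genes_total=4)
print(full_depth, shallow)
```

    Evaluating this function over a grid of increasing `n_fragments` values would trace out one curve per sample.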

    Sample clustering

    Hierarchical clustering of samples was performed in R on the log2(cpm + 1) values of all genes: pairwise Euclidean distances between samples were computed with the dist function, and the samples were then clustered with the hclust function using the complete linkage method.
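    The study performed this step in R. The sketch below re-expresses the same idea in plain Python for illustration only: Euclidean distances on log2(cpm + 1) values followed by complete-linkage agglomerative clustering. The sample names and counts are invented, and the naive merge loop is an assumption about how one might implement it, not a description of R's hclust internals.

```python
import math
from itertools import combinations

def log2_cpm(counts):
    """Convert raw fragment counts for one library to log2(cpm + 1)."""
    total = sum(counts)
    return [math.log2(c * 1e6 / total + 1) for c in counts]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def complete_linkage(profiles):
    """Agglomerative clustering with complete linkage.

    profiles: dict mapping sample name -> expression vector.
    Returns a list of merges as (cluster_a, cluster_b, height) tuples.
    """
    clusters = [(name,) for name in profiles]
    merges = []

    def linkage(pair):
        # Complete linkage: distance between the two farthest members.
        a, b = pair
        return max(euclidean(profiles[x], profiles[y]) for x in a for y in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Toy counts for three genes in four hypothetical libraries.
raw = {
    "ctrl_1": [100, 200, 700],
    "ctrl_2": [110, 190, 700],
    "exp_1": [400, 300, 300],
    "exp_2": [390, 310, 300],
}
profiles = {name: log2_cpm(counts) for name, counts in raw.items()}
merges = complete_linkage(profiles)
for a, b, height in merges:
    print(a, b, round(height, 3))
```

    In this toy input the replicates of each condition merge first and the two conditions join last, which is the pattern one would hope to see when biological condition dominates the variation.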

    Differential expression

    The number of fragments in each known gene from GENCODE Release M4 [27] was enumerated using the htseq-count script within the HTSeq package (version ) [28] with options -m union and -s no/reverse (“no” for strand-unspecific protocols and “reverse” for strand-specific protocols). Fragments that were mapped to multiple genes or multiple locations were discarded. For strand-specific protocols, fragments that were mapped to the antisense strand of the genes were discarded. Genes represented by fewer than 10 fragments in all samples were removed before performing differential expression analysis. Differences in gene expression between conditions were statistically assessed using the R/Bioconductor package edgeR (version ) [29]. Genes with a false discovery rate (FDR) ≤  and length >  bps were called as differentially expressed. The software used in this study is listed in Table 1.
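    Two of the steps above, the low-count filter and the FDR adjustment, can be sketched in a few lines. This is an illustration under assumptions rather than the study's code: the filter is read here as "drop a gene only if every sample has fewer than 10 fragments", and the FDR is assumed to be a Benjamini-Hochberg adjustment of per-gene p-values. The statistical test itself (edgeR) is not reproduced, and the function names and p-values are hypothetical.

```python
def keep_gene(counts_per_sample, min_fragments=10):
    """Keep a gene if at least one sample reaches the fragment threshold."""
    return any(c >= min_fragments for c in counts_per_sample)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy count table: gene -> fragments in each of four samples.
counts = {
    "gene1": [250, 300, 20, 15],
    "gene2": [3, 9, 0, 2],      # below threshold in every sample, so dropped
    "gene3": [12, 8, 40, 35],
}
kept = [gene for gene, row in counts.items() if keep_gene(row)]

# Hypothetical p-values, standing in for the output of a test such as edgeR's.
pvals = [0.01, 0.04, 0.03, 0.20]
fdr = benjamini_hochberg(pvals)
print(kept, [round(q, 4) for q in fdr])
```

    Genes would then be called differentially expressed by comparing the adjusted values with the chosen FDR cutoff and applying the gene-length filter described above.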


    Box plots of gene expression, GC content and gene length

    Between a pair of protocols, the genes with elevated expression in one protocol compared to the other protocol were identified by edgeR at FDR <  and log2 ratio > 1. Then the gene expression, GC content, and gene length for the two groups of more highly expressed genes were plotted in box plots. The gene expression is the average FPKM (number of fragments per kilobase per million mapped fragments) value of all the samples used in the evaluation of the standard input or ultralow input protocols. The longest transcript representing each gene was used to calculate both gene GC content and length.
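    The per-gene quantities used in these box plots follow directly from their definitions. The sketch below, with hypothetical function names and toy inputs, computes FPKM from a fragment count, the length of the representative transcript, and the total number of mapped fragments, and computes GC content from a transcript sequence.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)

def gc_content(sequence):
    """Fraction of G and C bases in a transcript sequence."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

# Toy example: 500 fragments on a 2,000 bp transcript in a library
# with 10 million mapped fragments.
value = fpkm(500, 2000, 10_000_000)
gc = gc_content("ATGCGC")
print(value, round(gc, 3))
```

    Averaging the FPKM values of a gene across samples, as described above, gives the expression value plotted for that gene.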

    Availability of data and materials

    The raw dataset for the ultralow protocol has been deposited in GEO and can be accessed by the accession number GSE The other datasets for the standard input protocols are still being analyzed for a manuscript in preparation. They will be deposited and made available at GEO after the manuscript is submitted. Until then, the datasets are available from the corresponding author on reasonable request.

    Abbreviations

    ABRF: Association of Biomolecular Resource Facilities

    CPM: Count per million fragments mapped to exons

    DEGs: Differentially expressed genes

    ERCC: External RNA Controls Consortium

    FDR: False discovery rate

    FPKM: Fragments per kilobase per million

    GEO: Gene Expression Omnibus

    HC: Hierarchical clustering

    HTS: High-throughput sequencing

    lncRNAs: Long non-coding RNAs

    MD Anderson: The University of Texas MD Anderson Cancer Center

    mESCs: Mouse embryonic stem cells

    NGS: Next-generation sequencing

    PCA: Principal component analysis

    qPCR: Quantitative PCR

    RNA-Seq: Ribonucleic acid sequencing

    rRNA: Ribosomal RNA

    References

    1. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. ;()– https://doi.org//nature PubMed PMID: ; PubMed Central PMCID: PMC

    2. Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. ;– https://doi.org//annurev.genom PubMed PMID:

    3. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. ;5(7)–8. https://doi.org//nmeth PubMed PMID:

    4. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, et al. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. ;()–9. https://doi.org//science PubMed PMID: ; PubMed Central PMCID: PMC

    5. Lister R, O'Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, et al. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell. ;(3)– https://doi.org//j.cell PubMed PMID: ; PubMed Central PMCID: PMC

    6. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. ;41(Database issue):D–D https://doi.org//nar/gks PubMed PMID: ; PubMed Central PMCID: PMC

    7. Oliver HF, Orsi RH, Ponnala L, Keich U, Wang W, Sun Q, et al. Deep RNA sequencing of L. monocytogenes reveals overlapping and extensive stationary phase and sigma B-dependent transcriptomes, including multiple highly transcribed noncoding RNAs. BMC Genomics. ; https://doi.org// PubMed PMID: ; PubMed Central PMCID: PMC

    8. Consortium SM-I. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the sequencing quality control consortium. Nat Biotechnol. ;32(9)– https://doi.org//nbt PubMed PMID: ; PubMed Central PMCID: PMC

    9. Li S, Tighe SW, Nicolet CM, Grove D, Levy S, Farmerie W, et al. Multi-platform assessment of transcriptome profiling using RNA-seq in the ABRF next-generation sequencing study. Nat Biotechnol. ;32(9)– https://doi.org//nbt PubMed PMID: ; PubMed Central PMCID: PMC

    10. Schuierer S, Carbone W, Knehr J, Petitjean V, Fernandez A, Sultan M, et al. A comprehensive assessment of RNA-seq protocols for degraded and low-quantity samples. BMC Genomics. ;18(1) https://doi.org//sy PubMed PMID: ; PubMed Central PMCID: PMC

    11. Zhao S, Zhang Y, Gamini R, Zhang B, von Schack D. Evaluation of two main RNA-seq approaches for gene quantification in clinical RNA sequencing: polyA+ selection versus rRNA depletion. Sci Rep. ;8(1) https://doi.org//s PubMed PMID: ; PubMed Central PMCID: PMC

    Source: https://bmcgenomics.biomedcentral.com/articles//s

