Published: Jul 5, 2022 DOI: 10.21769/BioProtoc.4455 Views: 1297
Edited by: Jinfeng Chen Reviewed by: Hassan Rasouli, Saumik Basu, Aftab Nadeem
Abstract
Identifying differentially expressed (DE) genes across specific conditions is vital in understanding phenotypic variation. RNA sequencing (RNA-seq), a fast-growing technology, provides rich data for quantifying gene expression efficiently. Methods and tools dedicated to differential gene expression analysis from RNA-seq data have also increased rapidly. More than 30 DE methods have been published; however, many comparison studies highlight that no single method outperforms the others in all circumstances. In this study, we test and compare the performance of three widely used R packages, edgeR, DESeq2, and limma voom, using the published Arabidopsis thaliana data of Cumbie et al. Although standard DE analysis has been extensively used and improved over the past years, time course RNA-seq can provide additional insight into gene regulation and biological development, and can also help identify DE genes. Therefore, we also conducted a time course analysis using the published Arabidopsis time course dataset of Ursache et al. These methods are implemented in separate R packages, and we provide detailed R code and explanations to make them more convenient to use.
Keywords: RNA-seq
Background
In recent years, RNA sequencing has become the leading choice for genome-wide relative quantification of gene expression and, in particular, the analysis of differential gene expression (DGE) across multiple conditions of interest (Casassola et al., 2013; Van den Berge et al., 2019; Chung et al., 2021). RNA-seq has mainly been applied to study new disease biology, including studies on disease-related DGE analysis and cancer biomarker discovery, cancer heterogeneity and evolution, drug resistance, the cancer microenvironment and immunotherapy, and neoantigens (Hong et al., 2020). It can also identify host-pathogen interactions in eukaryotic cells, including the immune response (Costa et al., 2013). In addition, it plays a significant role in studying quantitative trait loci associated with gene expression in complex diseases (Costa et al., 2013). DGE analysis is one of the most common applications of RNA-seq. DE genes can be identified from different species, tissues, and periods, revealing their function, potential molecular mechanisms, and potential as biomarkers.
RNA-seq data analysis routinely involves a few steps: trimming adaptor sequences and poor-quality nucleotides; aligning reads to a reference genome or transcriptome, or assembling them de novo; counting mapped reads; normalizing the counts to remove possible bias; and identifying significant differences between two or more conditions (Costa-Silva et al., 2017). DGE analysis is typically the final step in RNA-seq studies. It aims to determine which genes show a statistically significant difference, and to provide the pairwise magnitude of that difference for each gene.
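The last two steps listed above (normalization and comparison between conditions) can be sketched in a few lines. This is a minimal illustration only: it uses counts-per-million scaling and a plain log2 fold change, not the statistical models of edgeR, DESeq2, or limma voom, and it performs no significance test. The gene names, counts, group labels, and function names are all hypothetical.

```python
import math

# Hypothetical toy data: mapped read counts per gene for four samples.
# Assumed layout: samples 0-1 belong to condition A, samples 2-3 to condition B.
counts = {
    "gene1": [100, 120, 400, 380],
    "gene2": [50, 60, 55, 45],
    "gene3": [850, 820, 545, 575],
}
groups = ["A", "A", "B", "B"]


def counts_per_million(counts):
    """Scale each sample by its library size (total counts in that sample)."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
        for gene, row in counts.items()
    }


def log2_fold_change(cpm, groups, numerator="B", denominator="A", pseudo=1.0):
    """Log2 ratio of group means; a pseudocount avoids division by zero."""
    result = {}
    for gene, row in cpm.items():
        num = [v for v, g in zip(row, groups) if g == numerator]
        den = [v for v, g in zip(row, groups) if g == denominator]
        mean_num = sum(num) / len(num)
        mean_den = sum(den) / len(den)
        result[gene] = math.log2((mean_num + pseudo) / (mean_den + pseudo))
    return result


cpm = counts_per_million(counts)
lfc = log2_fold_change(cpm, groups)
for gene, value in lfc.items():
    print(f"{gene}: log2FC(B/A) = {value:.2f}")
```

A fold change alone says nothing about significance; the dedicated packages additionally model the variability between replicates before calling a gene differentially expressed.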
The substantial expansion in RNA-seq has generated more than 30 algorithms and tools for DGE analysis (Lamarre et al., 2018). Table 1 lists essential information on several generally accepted tools dedicated to DGE analysis, and summarizes their assumed distributions and default normalization strategies. The methods fall into four main categories: (1) those that assume the data follow a negative binomial distribution, like DESeq2 and edgeR; (2) those that assume a log-normal distribution, like limma voom; (3) those that assume a Poisson distribution, like Cufflinks; and (4) non-parametric methods, such as SAM-Seq.
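To make the difference between the Poisson and negative binomial assumptions concrete: under a Poisson model, the variance of a gene's counts across replicates equals its mean, whereas the negative binomial model adds a quadratic term, variance = mean + dispersion × mean². The sketch below estimates that dispersion by a simple method of moments. It is an illustration under stated assumptions, not the shrinkage estimators that edgeR or DESeq2 use; the function name and the replicate counts are hypothetical.

```python
from statistics import mean, variance


def moment_dispersion(replicates):
    """Method-of-moments estimate of the negative binomial dispersion.

    Solves variance = mean + dispersion * mean**2 for the dispersion and
    clips at zero, which corresponds to the Poisson case (variance <= mean).
    """
    m = mean(replicates)
    if m == 0:
        return 0.0
    v = variance(replicates)  # sample variance with n - 1 in the denominator
    return max((v - m) / m ** 2, 0.0)


# Hypothetical counts for one gene across biological replicates.
overdispersed = [90, 110, 100, 140, 60]  # variance far exceeds the mean
poisson_like = [98, 102, 100]            # variance below the mean

print(moment_dispersion(overdispersed))
print(moment_dispersion(poisson_like))
```

A dispersion clearly above zero indicates more replicate-to-replicate variability than a Poisson model can describe, which is the usual motivation for negative binomial methods.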
Table 1. RNA-seq DGE tools discussed in this study
DEG tools | References | Assumed distribution | Normalization | Citations (Dec, 2021) |
edgeR | (Robinson et al., 2010) | Negative binomial | TMM | 32,222 |
DESeq2 | (Love et al., 2014) | Negative binomial | RLE | 23,247 |
limma voom | (Ritchie et al., 2015) | Log-normal | TMM | 14,853 |
Cufflinks | (Trapnell et al., 2010) | Poisson | FPKM | 12,191 |
SAM-Seq | (Li and Tibshirani, 2013) | None | Internal | 467 |
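As an illustration of the RLE (relative log expression, also called median-of-ratios) normalization named in Table 1, the following sketch computes one size factor per sample. It is a simplified version written for clarity, assuming a plain genes-by-samples list of counts; the function name and toy matrix are hypothetical, and the R packages handle many details that are omitted here.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of per-sample raw counts.
    Each sample's factor is the median, over genes, of the ratio between
    its count and the gene's geometric mean across all samples.
    """
    n_samples = len(counts[0])
    # Keep only genes with nonzero counts in every sample so logs are defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the geometric mean of each usable gene.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lg for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical example: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 7],  # excluded from the calculation because of the zero count
]
sf = size_factors(toy_counts)
normalized = [[c / f for c, f in zip(row, sf)] for row in toy_counts]
print(sf)
```

Dividing each sample's counts by its size factor puts the samples on a common scale, so that the twofold difference in sequencing depth in this toy example no longer appears as a twofold difference in expression.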
Consequently, many comparison studies have been done, but there is no gold standard. Gierlinski et al. (2015) and Froussios et al. (2019) tried to tackle the obstacle of choosing the best probabilistic model. They recommend using tools based on the negative binomial and log-normal distributions to model the cross-replicate variability of RNA-seq read counts in yeast (Saccharomyces cerevisiae) and Arabidopsis thaliana. Non-parametric methods are seldom used and require more replicates to perform reasonably well, but they can serve as alternatives when the data do not fit the negative binomial law (Lamarre et al., 2018).
The purpose of this protocol is to demonstrate the principal steps needed to generate DGE results using different methods, and to provide a global representation of the expression changes across multiple conditions, especially for plant species. From previous comparison studies (Table 1), we determined that the three most widely used R packages are edgeR, DESeq2, and limma voom, and we apply them to Arabidopsis thaliana data. This paper walks users through an RNA-seq differential expression analysis with each of the three packages, and then compares the three methods. Time course sequencing data is a particular type of RNA-seq data that provides an opportunity to evaluate gene expression patterns at specific stages of development, or at different time points after a specific treatment (Spies et al., 2017). We also provide an example of analyzing Arabidopsis time course data.
Procedure