Differential Expression Analysis: Simple Pair, Interaction, Time-series

Han Qu; Meng Qu; Shibo Wang; Lei Yu; Qiong Jia; Xuesong Wang; Zhenyu Jia

doi:10.21769/BioProtoc.4455


Peer-reviewed



Published: Jul 5, 2022; DOI: 10.21769/BioProtoc.4455

Edited by: Jinfeng Chen; Reviewed by: Hassan Rasouli, Saumik Basu, Aftab Nadeem

Original research article

The authors used this protocol in: https://pubmed.ncbi.nlm.nih.gov/29509844/ (Jul 2018)


Abstract

Identifying differentially expressed (DE) genes across specific conditions is vital for understanding phenotypic variation. RNA sequencing (RNA-seq), a fast-growing technology, quantifies gene expression efficiently and in detail. Methods and tools dedicated to differential gene expression analysis of RNA-seq data have also multiplied rapidly. More than 30 DE methods have been published; however, many comparison studies highlight that no single method outperforms the others in all circumstances. In this study, we test and compare the performance of three widely used R packages (edgeR, DESeq2, and limma voom) on Cumbie's published Arabidopsis thaliana dataset. Although standard DE analysis has been extensively used and improved over the past years, time course RNA-seq can additionally advance the understanding of gene regulation and biological development, and help identify DE genes. Therefore, we also conducted a time course analysis using another published dataset, Ursache's Arabidopsis time course data. These methods are implemented in separate R packages, and we provide detailed R code with explanations to make them more convenient to use.

Keywords: RNA-seq; Bioinformatics; Benchmarking; Differential gene expression; Significantly expressed genes; Time course

Background

In recent years, RNA sequencing has become the leading choice for genome-wide relative quantification of gene expression and, in particular, for the analysis of differential gene expression (DGE) across multiple conditions of interest (Casassola et al., 2013; Van den Berge et al., 2019; Chung et al., 2021). RNA-seq has mainly been applied to study new disease biology, including disease-related DGE analysis, cancer biomarker discovery, cancer heterogeneity and evolution, drug resistance, the cancer microenvironment and immunotherapy, and neoantigens (Hong et al., 2020). It can also identify host-pathogen interactions in eukaryotic cells, including the immune response (Costa et al., 2013). In addition, it plays a significant role in studying quantitative trait loci associated with gene expression in complex diseases (Costa et al., 2013). DGE analysis is one of the most common applications of RNA-seq. DE genes can be identified across species, tissues, and time points, revealing their functions, underlying molecular mechanisms, and potential as biomarkers.

RNA-seq data analysis routinely involves several steps: trimming adaptor sequences and poor-quality nucleotides; aligning reads to a reference genome or transcriptome, or assembling them de novo; counting mapped reads; normalizing the counts to remove possible bias; and identifying significant differences between two or more conditions (Costa-Silva et al., 2017). DGE analysis is typically the final step in RNA-seq studies. It aims to determine which genes differ with statistical significance, and to report the magnitude of the difference for each gene in each pairwise comparison.
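The two outputs of the final step, a magnitude of difference and a significance call, can be sketched in a few lines. The sketch below is illustrative only: the Benjamini-Hochberg correction is an assumption here (it is a widely used way to control the false discovery rate across many genes, not something this section specifies), and the function names, pseudocount, and p-values are hypothetical.

```python
import math


def log2_fold_change(mean_treated, mean_control, pseudocount=1.0):
    """Log2 ratio of two mean expression values.

    The pseudocount (an illustrative choice) avoids division by zero.
    """
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Gene indices ordered from the smallest to the largest p-value
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from some DGE test
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)
significant = [q <= 0.05 for q in adjusted]
```

With these made-up inputs, only the first two genes remain significant at a 0.05 threshold after adjustment, even though four raw p-values fall below 0.05.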

The substantial expansion of RNA-seq has produced more than 30 algorithms and tools for DGE analysis (Lamarre et al., 2018). Table 1 lists essential information on several generally accepted DGE tools, including their assumed distributions and default normalization strategies. The methods fall into four main categories: (1) those that assume the data follow a negative binomial distribution, such as DESeq2 and edgeR; (2) those that assume a log-normal distribution, such as limma voom; (3) those that assume a Poisson distribution, such as Cufflinks; and (4) non-parametric methods, such as SAM-Seq.
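The practical difference between the Poisson and negative binomial assumptions lies in how variance grows with the mean. The sketch below uses the common parameterization in which the negative binomial variance is the mean plus a dispersion term times the squared mean; the function names and the dispersion value are illustrative assumptions, not values taken from this protocol.

```python
def poisson_variance(mean):
    """Under a Poisson model the variance equals the mean."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Negative binomial variance: mean + dispersion * mean^2.

    A dispersion of 0 recovers the Poisson case.
    """
    return mean + dispersion * mean ** 2


# Compare the two models at a few mean expression levels,
# using a hypothetical dispersion of 0.1
for mean in (10, 100, 1000):
    print(mean, poisson_variance(mean), negative_binomial_variance(mean, 0.1))
```

The extra quadratic term lets the negative binomial model absorb the replicate-to-replicate variability that a pure Poisson model would underestimate for highly expressed genes.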

Table 1. RNA-seq DGE tools discussed in this study

DGE tools	References	Assumed distribution	Default normalization	Citations (Dec 2021)
edgeR	(Robinson et al., 2010)	Negative binomial	TMM	32,222
DESeq2	(Love et al., 2014)	Negative binomial	RLE	23,247
limma voom	(Ritchie et al., 2015)	Log-normal	TMM	14,853
Cufflinks	(Trapnell et al., 2010)	Poisson	FPKM	12,191
SAM-Seq	(Li and Tibshirani, 2013)	None	Internal	467
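To make the RLE entry in Table 1 more concrete, the following is a minimal sketch of the median-of-ratios idea behind it: each sample is compared with a per-gene geometric-mean pseudo-reference, and the median ratio becomes that sample's size factor. It is a simplified illustration under stated assumptions, not the implementation used by any package; the function name and the count matrix are hypothetical.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample.

    counts: list of genes, each a list of raw counts per sample.
    """
    n_samples = len(counts[0])
    # Keep only genes with nonzero counts in every sample (log(0) is undefined)
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Per-gene log geometric mean across samples acts as a pseudo-reference
    log_ref = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - ref for row, ref in zip(usable, log_ref)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical counts: sample 2 was sequenced about twice as deeply as sample 1
counts = [[10, 20], [100, 200], [50, 100], [0, 5]]
size_factors = median_of_ratios_size_factors(counts)
# Dividing each column by its size factor puts the samples on a common scale
normalized = [[c / f for c, f in zip(row, size_factors)] for row in counts]
```

Using the median rather than the total count keeps a handful of very highly expressed genes from dominating the scaling.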

Many comparison studies have consequently been carried out, but no gold standard has emerged. Gierlinski et al. (2015) and Froussios et al. (2019) tackled the problem of choosing the best probabilistic model. They recommend tools based on the negative binomial and log-normal distributions for modeling the cross-replicate variability of RNA-seq read counts in yeast (Saccharomyces cerevisiae) and Arabidopsis thaliana. Non-parametric methods are seldom used and require more replicates to perform reasonably well, but they can serve as alternatives when the data do not fit the negative binomial distribution (Lamarre et al., 2018). The purpose of this protocol is to demonstrate the principal steps needed to generate DGE results with different methods, and to provide a global view of expression changes across multiple conditions, especially for plant species. Based on previous comparison studies (Table 1), we selected the three most widely used R packages (edgeR, DESeq2, and limma voom) and applied them to Arabidopsis thaliana data. This protocol walks users through an RNA-seq differential expression analysis with each of the three packages, and compares the three methods. Time course sequencing data are a particular type of RNA-seq data that make it possible to evaluate gene expression patterns at specific stages of development, or at different time points after a specific treatment (Spies et al., 2017). We also provide an example of analyzing Arabidopsis time course data.
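One simple way to compare DE calls across methods is to look at how much their gene lists overlap. The sketch below is only an assumption about what such a comparison might look like, not the comparison procedure of this protocol; the gene identifiers and the DE sets are made up for illustration.

```python
from itertools import combinations

# Hypothetical DE gene sets reported by each method
de_calls = {
    "edgeR": {"AT1G01010", "AT1G01020", "AT1G01030", "AT1G01040"},
    "DESeq2": {"AT1G01010", "AT1G01020", "AT1G01050"},
    "limma voom": {"AT1G01010", "AT1G01030", "AT1G01050"},
}


def jaccard(a, b):
    """Jaccard index: shared genes divided by all genes called by either method."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Genes called DE by every method
shared_by_all = set.intersection(*de_calls.values())

# Pairwise agreement between methods
pairwise = {
    (m1, m2): jaccard(de_calls[m1], de_calls[m2])
    for m1, m2 in combinations(de_calls, 2)
}
```

Genes in the shared set are the calls that do not depend on the choice of method, while low pairwise indices flag where the methods disagree.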

Procedure
