Differential Expression Analysis

Matthew Greenig; Andrew Melville; Derek Huntley; Mark Isalan; Michal Mielcarek

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

Differential Expression Analysis

MG Matthew Greenig

AM Andrew Melville

DH Derek Huntley

MI Mark Isalan

MM Michal Mielcarek

This method is extracted from research article: Front Mol Biosci, Sep 2020

Cross-Sectional Transcriptional Analysis of the Aging Murine Heart

DOI: 10.3389/fmolb.2020.565530

Request a Protocol

Ask a question

Favorite

The R package DESeq2 was used to analyse differential expression. According to the process described by Love et al. (2014), DESeq2 models count data using the following Negative Binomial distribution:

where K_ij is the read count for gene i in sample j. Parameters μ and α are estimated using a generalized linear model (GLM) with a logarithmic link function:

where r indexes columns in the design matrix, which correspond to different conditions in our experiment (values of age/sex across the 14 mice).

The GLM is fit to un-normalized read counts, and normalization constants (s_ij) are estimated using a median-of-ratios method to derive the final μ estimate for each distribution. The α parameter, related to a gene's expression variance, is estimated by fitting a curve across all genes, modeling the relationship between an initial α estimate and mean read count for each gene. The final α estimate for each gene is calculated by shrinking the gene's initial estimate toward the curve—i.e., by using information across all genes of similar expression strength to derive a more reliable estimate for expression variance. DESeq2 also performs shrinkage on β estimates, using a zero-centered normal distribution as a prior to shrink values toward zero. This results in adjusted (shrunken) log2 fold changes, which are less sensitive to changes in expression levels for genes with low transcript counts. To account for the large number of tests being performed, DESeq2 uses the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995) with a false discovery rate of 0.1 to calculate an adjusted p-value for each statistical test.

To analyse expression differences between age groups, DESeq2's Wald test was implemented. The Wald test calculates a p-value of observing a condition's association with a gene's expression, under the null hypothesis that the condition's corresponding GLM coefficient, β_ir, divided by its standard error, is distributed under a standard normal distribution. To contrast the effects of two levels of a condition, DESeq2 estimates a contrast β coefficient, $β_{i}^{c}$ . $β_{i}^{c}$ is calculated as the difference between β_i estimates for two conditions, and a two-tailed p-value is again calculated under the null hypothesis:

where SE( $β_{i}^{c}$ ) is the coefficient's standard error. DESeq2 multiplies the tail integrals of the normal distribution by 2 to achieve a two-tailed test. Genes with adjusted p < 0.05 (after this multiplication) were identified as significant in each pairwise comparison, corresponding to an α level of 0.05.

To identify genes whose expression was significantly associated with different age-related variables, the likelihood ratio test (LRT) statistic was computed using DESeq2 (Love et al., 2014):

where L denotes a likelihood function, Θ is the set of parameters in the full model, and Θ₀ is the set of parameters in a reduced model for a random vector k—in our experiment a vector of transcript counts for a single gene. We used a full model containing age group, sex, and interaction coefficients and generated two different reduced models by removing the coefficients associated with different explanatory variables in the full model. One reduced model was generated by removing all age-related terms from the full model (LRT for age) while the other reduced model was generated by removing only the age/sex interaction term (LRT for age/sex interaction). By noting that $- 2 log t \to_{}^{d} χ_{f}^{2}$ , where f is the difference in the number of independent parameters between the full and reduced models, both LRTs calculate a one-tailed p-value for each gene. For the variable removed from each model, this p-value expresses the importance of that variable in explaining the expression level of each gene. Using this method, genes with adjusted p-values < 0.05 were identified as significant, corresponding to an α level of 0.05.

To visualize the differentially-expressed (DE) genes identified using the methods described above, the R packages EnhancedVolcano (Blighe, 2019) and gplots (Warnes et al., 2019) were used. EnhancedVolcano was used to display each gene's shrunken log2 fold change (LFC) against its adjusted p-value.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol