Advanced Search
Published: Jul 20, 2024 DOI: 10.21769/BioProtoc.5031 Views: 286
Reviewed by: Zihao Zheng
Abstract
Linking genotype to phenotype is a long-standing research topic in quantitative genetics and is essential for biological sciences. As a causal inference method, genome-wide mediation analysis (GMA) has proven to be a powerful tool to complement the widely used genome-wide association study (GWAS) and transcriptome-wide association study (TWAS). GMA can identify single nucleotide polymorphisms (SNPs) that directly affect the final phenotype or do so indirectly through intermediate omics data mediation. When considering transcriptomics data as the intermediate trait, this method can pinpoint specific genes as transcriptional mediators connecting genotype with phenotype. The results from GMA, therefore, provide explicit causation hypotheses for downstream functional studies. This protocol provides code and a workflow to demonstrate the mediation analysis procedure using a simulated dataset. The demo analysis can be conducted in R using a laptop or desktop computer.
Keywords: Genome-wide mediation analysisBackground
Phenotypic traits, especially complex agronomic traits, are usually determined by a large number of genetic loci. The penetrance of the genetic variants often occurs through different layers of intermediate molecular processes, i.e., gene regulation [1], spatiotemporal gene expression [2], and translation [3]. Identifying the genetic loci and revealing the related intermediate molecular processes are, therefore, critical for biological understanding and crop improvement.
Genome-wide association study (GWAS) establishes the association between genotype [i.e., single nucleotide polymorphisms (SNPs)] and traits of interest [4]. Similarly, transcriptome-wide associated study (TWAS) connects SNP with gene expression levels [5]. However, none of these methods bridges the three variables of genotype, intermediate molecular processes, and conventional phenotype in a single analysis. Mediation analysis, as a statistical causal inference method, establishes a causal chain for these three variables, i.e., from genotype as the exposure to the intermediate molecular process (i.e., gene expression) as the mediator and eventually to the conventional phenotype as the outcome. Our previous work has laid down the statistical foundation of the high-dimensional mediation analysis [6] and applied it to a maize diversity panel to identify a number of mediator genes [7]. By modeling genotype to phenotype through two pathways, GMA can detect SNPs 1) directly affecting phenotype without any mediation through the intermediate variables under investigation or 2) indirectly through intermediate processes, i.e., gene expression or other intermediate omics data.
In this workflow, we use simulated transcriptome data as the intermediate molecular traits to demonstrate the GMA procedure. We provide input data formats and output results interpretation. It is worth noting that GMA is not limited to treating transcriptome data as the intermediate variable; other genome-wide variables, such as metabolome or proteome, can also be considered as intermediate variables. These intermediate variables identified through GMA can be further validated through biological functional analyses.
Equipment
Laptop or desktop computer running a Windows or Linux system (i.e., a MacBook Pro computer)
Running time: To run a small dataset (i.e., in this example, approximately 300 different genotypes with 5,000 SNPs and RNA-seq data of 1,200 genes), 5 GB RAM and 4 CPU cores with 0.5 h running time should be sufficient. However, for a larger dataset (i.e., 300 genotypes with approximately 1 million SNPs and 8,000 genes), at least 160 GB RAM and 16 CPU cores with 1 h running time are recommended.
Software and datasets
R v4.1.3 (https://cran.r-project.org/)
R packages:
data.table_1.14.8
glmnet_4.1-8
MASS_7.3-58.2
rrBLUP_4.6.3
parallel_4.1.3
doParallel_1.0.17
CompQuadForm_1.4.3
circlize_0.4.16
dplyr_1.1.0
RcolorBrewer_1.1-3
Input data
To conduct GMA, three input datasets are needed: a SNP matrix as the exposure, the phenotypic values as the outcome, and the omics matrix as the mediator, as well as one optional input dataset (i.e., a matrix of principal components to control population structure). The formats of the input datasets are described below.
Exposure (“input/Z_matrix.txt”): a numerical matrix of bi-allelic SNP data (Table 1). It is coded as “-1, 0, 1” to represent homozygous reference, heterozygous, and homozygous alternative genotypes. For missing data, it can simply be imputed by averaging the genotypic value or using SNP imputation software such as Beagle [8], fastPHASE [9], or LinkImpute [10].
Table 1. SNP genotype matrix (or the Z matrix).
In the matrix, the row represents the plant genotype, and the column represents the SNP site.
| [SNP1] | [SNP2] | … | [SNPM] | |
| [genotype 1] | -1 | 0 | -1 | |
| [genotype 2] | 0 | 1 | 1 | |
| … | … | … | … | … |
| [genotype n] | 1 | -1 | … | 0 |
Confounder (“input/X0_matrix.txt”): optional principal components of the Z matrix (Table 2). This matrix can be calculated from the Z matrix to control for the population structure; confounders can also be anything that may affect mediators and outcomes, like environmental factors that can affect gene expression (mediator) and phenotype (outcome). It is optional for the GMA.
Table 2. Principal components of the Z matrix (or the X0 matrix). In this matrix, the row represents the plant genotype, and the column represents each principal component.
| [PC1] | [PC2] | [PC3] | |
| [genotype 1] | 37.67511 | -2.86529 | 0.788754 |
| [genotype 2] | 62.41023 | -85.0894 | 16.63496 |
| … | … | … | .. |
| [genotype n] | 65.67986 | -96.6183 | 9.525744 |
Mediator (“input/X_matrix.txt”): the intermediate omics dataset (Table 3). For the omics input data, population-wide measurement of each variable is required, for example, a population-wide RNA-seq data for each gene [11].
Table 3. Intermediate omics data matrix (or the X matrix). Here, normalized RNA-seq data were employed as the intermediate trait. The row represents the plant genotype, and the column represents each normalized read count of each gene.
| [gene 1] | [gene 2] | … | [gene q] | |
| [genotype 1] | 2.215721 | -0.74665 | … | 0.037008 |
| [genotype 2] | -0.96617 | 0.433468 | … | 0.441124 |
| … | … | … | … | … |
| [genotype n] | 0.297719 | -0.63026 | … | 0.009335 |
Outcome (“input/y_matrix.txt”): the phenotypic trait of interest (Table 4).
Table 4. Phenotype matrix (or the y matrix).
The row represents the plant genotype, and the column represents a trait of interest (BLUP/BLUE are recommended over raw phenotype).
| [trait1] | |
| [genotype 1] | 4.214806 |
| [genotype 2] | 3.828688 |
| … | … |
| [genotype N] | -3.691982 |
Procedure
© 2024 The Author(s); This is an open access article under the CC BY license (https://creativecommons.org/licenses/by/4.0/).
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.
Share
Bluesky
X
Copy link