Our strategy builds from the above observations and estimates the ITV and ITC to detect latent genetic interactions. To derive estimates of these quantities, we first remove the additive genetic effect from the traits to ensure that any variance and covariance effects are not due to the additive effect. Let us denote the trait residuals as $\hat{e}_{ik} = y_{ik} - x_i \beta_k$ for individual $i = 1, \ldots, n$ and trait $k = 1, \ldots, m$, where we assume the effect size $\beta_k$ is known for simplicity. We can then express the ITV and ITC as a function of these residuals: the ITV of trait $k$ and the ITC between traits $k$ and $l$ are defined as $\mathrm{E}[e_{ik}^2]$ and $\mathrm{E}[e_{ik} e_{il}]$, respectively (Appendix 5.1.1). Thus, we can estimate the ITV by squaring the residuals, $\hat{e}_{ik}^2$, and estimate the ITC between traits $k$ and $l$ by the pairwise product of the residuals (i.e., the cross products), $\hat{e}_{ik} \hat{e}_{il}$. Aggregating the ITV and ITC estimates across all individuals, we denote the cross product (CP) terms in the matrix $\mathbf{Z}_{\mathrm{cp}} \in \mathbb{R}^{n \times m(m-1)/2}$, where the $i$th row vector is $(\hat{e}_{i1}\hat{e}_{i2}, \hat{e}_{i1}\hat{e}_{i3}, \ldots, \hat{e}_{i(m-1)}\hat{e}_{im})$, and the squared residual (SQ) terms in the matrix $\mathbf{Z}_{\mathrm{sq}} \in \mathbb{R}^{n \times m}$, where $z_{ik} = \hat{e}_{ik}^2$.
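As a concrete illustration, the following minimal Python sketch (ours, not taken from the LIT software; all numbers are made up) computes the SQ and CP estimates for a single individual with $m = 3$ traits, assuming the additive effect sizes are known as in the text:

```python
# Toy illustration of the ITV/ITC estimators for one individual
# (hypothetical values; beta is assumed known for simplicity).
import numpy as np

y_i = np.array([1.2, -0.4, 0.7])    # individual i's three traits
x_i = 1.0                           # genotype (minor allele count)
beta = np.array([0.3, 0.1, -0.2])   # assumed-known additive effects

e_i = y_i - x_i * beta              # trait residuals e_ik
sq_i = e_i ** 2                     # ITV estimates: squared residuals
iu = np.triu_indices(len(e_i), k=1) # index pairs k < l
cp_i = np.outer(e_i, e_i)[iu]       # ITC estimates: cross products

print(sq_i)  # row i of Z_sq: 3 squared residuals
print(cp_i)  # row i of Z_cp: pairs (1,2), (1,3), (2,3)
```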
Our inference goal is to assess whether the SNP, $\mathbf{x}$, is independent of the squared residuals and cross products,

$$\mathbf{x}_{\cdot} \;\perp\!\!\!\perp\; \left( \mathbf{Z}_{\mathrm{sq}\,\cdot},\, \mathbf{Z}_{\mathrm{cp}\,\cdot} \right),$$

where ‘$\cdot$’ denotes all the rows (or individuals) and ‘$\perp\!\!\!\perp$’ denotes statistical independence. Equivalently, consider regressing each column $k$ of the squared residuals and cross products on the SNP with coefficient $\alpha_k$. In this regression model, the above corresponds to testing the global null hypothesis $H_0\colon \alpha_k = 0$ for all $k$ versus the alternative hypothesis $H_1\colon \alpha_k \neq 0$ for at least one of the traits. While a regression model can be directly applied to the squared residuals and cross products to test the global null hypothesis (see Appendix for mathematical details), a univariate model approach does not adequately leverage pleiotropy and requires a multiple testing correction, which reduces power.
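To make the univariate baseline concrete, here is an illustrative Python sketch (our own; the name `univariate_global_test` is hypothetical) that regresses each SQ/CP column on the adjusted SNP and applies a Bonferroni correction to the smallest p-value:

```python
# Sketch of the univariate baseline described above (illustrative,
# not the authors' implementation).
import numpy as np
from scipy import stats

def univariate_global_test(x, Z):
    """x: (n,) adjusted genotype; Z: (n, d) SQ/CP matrix."""
    n, d = Z.shape
    xc = x - x.mean()
    pvals = []
    for j in range(d):
        z = Z[:, j] - Z[:, j].mean()
        beta = (xc @ z) / (xc @ xc)              # least-squares slope
        resid = z - beta * xc
        se = np.sqrt((resid @ resid) / (n - 2) / (xc @ xc))
        t = beta / se
        pvals.append(2 * stats.t.sf(abs(t), df=n - 2))
    return min(min(pvals) * d, 1.0)              # Bonferroni-adjusted
```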
To address these issues, we develop a new multivariate kernel-based framework, Latent Interaction Testing (LIT), that captures pleiotropy across the ITV and ITC terms to increase power for detecting latent interactions. There are three key steps in the LIT framework (Figure 1):

1. Standardize the traits and regress out the additive genetic effects, population structure, and any other covariates.
2. Construct estimates of the ITV and ITC from the trait residuals, namely the squared residuals (SQ) and pairwise cross products (CP).
3. Test for association between the adjusted SNP and the SQ/CP terms using a kernel-based distance covariance framework.

We expand on the above steps in detail below.
Step 1: In the first step, LIT standardizes the traits and then regresses out the additive genetic effects, population structure, and any other covariates. This ensures that any differential variance and/or covariance patterns are not due to additive genetic effects or population structure. Suppose there are $q$ measured covariates and principal components to control for structure. We denote these variables in the matrix $\mathbf{C} \in \mathbb{R}^{n \times q}$. After regressing out these variables and the additive genetic effects, the matrix of residuals is $\hat{\mathbf{E}} = \mathbf{Y} - \mathbf{x}\hat{\boldsymbol{\beta}} - \mathbf{C}\hat{\boldsymbol{\Gamma}}$, where $\mathbf{Y} \in \mathbb{R}^{n \times m}$ is the standardized trait matrix, $\hat{\boldsymbol{\beta}} \in \mathbb{R}^{1 \times m}$ is a matrix of effect sizes, and $\hat{\boldsymbol{\Gamma}} \in \mathbb{R}^{q \times m}$ is a matrix of coefficients estimated using least squares. We also regress out population structure from the genotypes, which we denote by $\tilde{\mathbf{x}}$.
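A minimal Python sketch of this step (illustrative; `residualize` is our own name, and for simplicity we use the same covariate matrix to adjust both traits and genotypes):

```python
# Sketch of Step 1 (illustrative): regress covariates/PCs and the
# additive genetic effect out of the standardized traits, and regress
# structure out of the genotypes, via ordinary least squares.
import numpy as np

def residualize(Y, x, C):
    """Y: (n, m) standardized traits; x: (n,) genotype;
    C: (n, q) covariates plus PCs (include an intercept column)."""
    D = np.column_stack([C, x])                  # design: covariates + SNP
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None) # joint least squares
    E = Y - D @ coef                             # trait residual matrix
    gcoef, *_ = np.linalg.lstsq(C, x, rcond=None)
    x_adj = x - C @ gcoef                        # structure-adjusted SNP
    return E, x_adj
```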
The above approach only removes the mean effects and does not correct for variance effects from population structure which can impact type I error rate control [57]. A strategy to adjust for the variance effects is to standardize the genotypes with the estimated individual-specific allele frequencies (IAF), i.e., the allele frequencies given the genetic ancestry of an individual. However, it is computationally costly to standardize the genotypes for biobank-sized datasets as it requires estimating the IAFs of all SNPs using a generalized linear model [58, 59]. Therefore, in this work, we remove the mean effects from structure and then adjust the test statistics with the genomic inflation factor to be conservative. Our software includes an implementation to standardize the genotypes using the IAFs for smaller datasets.
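A common genomic-control recipe consistent with this description (our sketch, not necessarily the authors' exact procedure) converts p-values to one-degree-of-freedom chi-squared quantiles, estimates the inflation factor from their median, and rescales:

```python
# Illustrative genomic-control adjustment (assumed procedure): deflate
# the test statistics by lambda_GC before recomputing p-values.
import numpy as np
from scipy import stats

def genomic_control(pvals):
    """pvals: genome-wide p-values from the latent interaction test."""
    chisq = stats.chi2.isf(pvals, df=1)           # p -> 1-df quantiles
    lam = np.median(chisq) / stats.chi2.ppf(0.5, df=1)
    lam = max(lam, 1.0)                           # only correct inflation
    return stats.chi2.sf(chisq / lam, df=1)       # adjusted p-values
```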
Step 2: The second step uses the residuals, $\hat{\mathbf{E}}$, to reveal any latent interactions by constructing estimates of the ITV and ITC. For the $i$th individual's set of trait residuals, $\hat{e}_{i1}, \ldots, \hat{e}_{im}$, the ITVs are estimated by squaring the trait residuals while the ITCs are estimated by calculating the cross products of the trait residuals. We express the squared residuals as $\hat{e}_{ik}^2$ for $k = 1, \ldots, m$, and the pairwise cross products as $\hat{e}_{ik} \hat{e}_{il}$ for $k < l$. Importantly, when the studentized residuals are used, then $\hat{e}_{ik}^2$ and $\hat{e}_{ik} \hat{e}_{il}$ represent unbiased estimates of the ITVs and ITCs, respectively. We aggregate these terms across all individuals into the matrix $\mathbf{Z} = [\mathbf{Z}_{\mathrm{sq}} \; \mathbf{Z}_{\mathrm{cp}}] \in \mathbb{R}^{n \times m(m+1)/2}$.
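A vectorized Python sketch of this aggregation (illustrative; `sq_cp_matrix` is our own name) builds the combined SQ/CP matrix from the residual matrix in one pass:

```python
# Sketch of Step 2 (illustrative): build Z = [Z_sq  Z_cp] from the
# n x m residual matrix E.
import numpy as np

def sq_cp_matrix(E):
    """E: (n, m) trait residuals. Returns Z: (n, m + m(m-1)/2)."""
    n, m = E.shape
    SQ = E ** 2                          # squared residuals (ITVs)
    iu = np.triu_indices(m, k=1)         # trait pairs k < l
    CP = E[:, iu[0]] * E[:, iu[1]]       # cross products (ITCs)
    return np.hstack([SQ, CP])
```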
Step 3: In the last step, we test for association between the adjusted SNP and the squared residuals and cross products (SQ/CP) using a kernel-based distance covariance framework [31–33]. Specifically, we apply a kernel-based independence test called the Hilbert-Schmidt independence criterion (HSIC), which has been previously used for GWAS data (see, e.g., [35–38]). The HSIC constructs two similarity matrices between individuals using the SQ/CP matrix and genotype matrix, then calculates a test statistic that measures any shared signal between these similarity matrices. To estimate the similarity matrix, a kernel function $k(\cdot, \cdot)$ is specified that captures the similarity between the $i$th and $j$th individuals.
Since our primary application is biobank-sized data, we use a linear kernel so that LIT is computationally efficient. The linear similarity matrix is defined as $\mathbf{K}_x = \tilde{\mathbf{x}} \tilde{\mathbf{x}}^{\top}$ for the genotype matrix and $\mathbf{K}_z = \mathbf{Z} \mathbf{Z}^{\top}$ for the SQ/CP matrix. The linear kernel is a scaled version of the covariance matrix and, for this special case, the HSIC is related to the RV coefficient. We note that one can choose other options for a kernel function, such as a polynomial kernel, a projection kernel, or a Gaussian radial basis function, which can capture non-linear relationships [34, 35].
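A minimal sketch of the linear kernel (ours; with column-centered inputs this is a scaled covariance matrix between individuals):

```python
# Linear similarity matrices (illustrative sketch).
import numpy as np

def linear_kernel(A):
    """A: (n, d) matrix; returns the n x n linear similarity matrix."""
    Ac = A - A.mean(axis=0)              # column-center
    return Ac @ Ac.T

# Usage: Kx = linear_kernel(x_adj[:, None]); Kz = linear_kernel(Z)
```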
Once the similarity matrices $\mathbf{K}_x$ and $\mathbf{K}_z$ are constructed, we can express the HSIC test statistic as

$$T = \frac{1}{n} \operatorname{tr}\left( \mathbf{K}_x \mathbf{K}_z \right),$$

which follows a weighted sum of Chi-squared random variables under the null hypothesis, i.e., $T \sim \sum_{i,j} \lambda_i \mu_j \chi^2_{ij}(1)$, where $\lambda_1 \geq \lambda_2 \geq \cdots$ and $\mu_1 \geq \mu_2 \geq \cdots$ are the ordered non-zero eigenvalues of the respective matrices $\mathbf{K}_x$ and $\mathbf{K}_z$. Intuitively, the test statistic measures the ‘overlap’ between two random matrices, where large values of $T$ imply the two matrices are similar (i.e., a latent genetic interactive effect) while small values of $T$ imply no evidence of similarity (i.e., no latent genetic interactive effects). We can approximate the null distribution of $T$ using Davies’ method, which is computationally fast and accurate for large $n$ [35, 38, 60].
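The sketch below (ours) illustrates the statistic and its null calibration. The paper uses Davies' method for the weighted chi-squared tail probability; to keep this example dependent only on numpy/scipy, we substitute a Satterthwaite-style moment-matching approximation, and the weight scaling follows the $1/n$ convention above:

```python
# Illustrative HSIC test (not the LIT implementation): statistic plus a
# moment-matched stand-in for Davies' method.
import numpy as np
from scipy import stats

def hsic_pvalue(Kx, Kz):
    n = Kx.shape[0]
    T = np.trace(Kx @ Kz) / n
    lam = np.linalg.eigvalsh(Kx)
    mu = np.linalg.eigvalsh(Kz)
    lam, mu = lam[lam > 1e-10], mu[mu > 1e-10]   # non-zero eigenvalues
    w = np.outer(lam, mu).ravel() / n            # weights lambda_i*mu_j/n
    # Match mean/variance of sum_j w_j chi2_1 with scale * chi2_df.
    scale = (w ** 2).sum() / w.sum()
    df = w.sum() ** 2 / (w ** 2).sum()
    return T, stats.chi2.sf(T / scale, df=df)
```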
For the linear kernel considered here, we implement a simple strategy to substantially improve the computational speed of LIT. We first calculate the eigenvectors and eigenvalues of the SQ/CP and genotype matrices to construct the test statistic. Since the number of traits, $m$, is much smaller than the sample size, $n$, we can perform a singular value decomposition to estimate the subset of eigenvectors and eigenvalues in a computationally efficient manner [61–63]. This allows us to circumvent direct calculation and storage of large similarity matrices. Let $\mathbf{K}_x = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{\top}$ and $\mathbf{K}_z = \mathbf{V} \boldsymbol{\Sigma} \mathbf{V}^{\top}$ be the singular value decomposition (SVD) of the similarity matrices, where $\boldsymbol{\Lambda}$ (resp. $\boldsymbol{\Sigma}$) is a diagonal matrix of eigenvalues and $\mathbf{U}$ (resp. $\mathbf{V}$) is a matrix of eigenvectors of the respective kernel matrices. We can then express the test statistic in terms of the SVD components as $T = \frac{1}{n} \sum_{i} \sum_{j} \lambda_i \sigma_j (\mathbf{u}_i^{\top} \mathbf{v}_j)^2$, where $\mathbf{u}_i^{\top} \mathbf{v}_j$ is the inner product between the two eigenvectors. Thus, for a single SNP, the test statistic is

$$T = \frac{1}{n} \sum_{j=1}^{r_z} \lambda_1 \sigma_j (\mathbf{u}_1^{\top} \mathbf{v}_j)^2, \qquad (5)$$

where $r_z$ is the rank of the SQ/CP matrix and $r_x$ is the rank of the genotype matrix such that $r_x = 1$.
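The following sketch (ours) demonstrates this shortcut: thin SVDs of the $n \times d$ inputs recover the kernel eigenvalues and eigenvectors without ever forming the $n \times n$ similarity matrices:

```python
# Rank-based shortcut (illustrative): compute T from thin SVDs,
# avoiding construction of the n x n kernels.
import numpy as np

def hsic_stat_svd(x_adj, Z):
    """x_adj: (n,) adjusted SNP; Z: (n, d) SQ/CP matrix."""
    n = len(x_adj)
    Ux, sx, _ = np.linalg.svd((x_adj - x_adj.mean())[:, None],
                              full_matrices=False)
    Uz, sz, _ = np.linalg.svd(Z - Z.mean(axis=0), full_matrices=False)
    lam, mu = sx ** 2, sz ** 2           # kernel eigenvalues
    G = Ux.T @ Uz                        # eigenvector inner products
    return (np.outer(lam, mu) * G ** 2).sum() / n
```

Because singular values square to the kernel eigenvalues, this reproduces $\frac{1}{n}\operatorname{tr}(\mathbf{K}_x \mathbf{K}_z)$ exactly while storing only $n \times d$ arrays.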
We explore an important aspect of the test statistic in Equation 5, namely, the role of the eigenvalues in determining statistical significance. The above equations show that the eigenvalues of the kernel matrices emphasize the eigenvectors that explain the most variation in the test statistic. While this may be reasonable in some settings, the interaction signal can be captured by eigenvectors that explain the least variation, and this can be very difficult to ascertain beforehand [40]. In this case, the testing procedure will be underpowered. Thus, we also consider weighting the eigenvectors equally in LIT, i.e., setting $\lambda_i \sigma_j = 1$ for all $i$ and $j$, where the $\lambda_i \sigma_j$ are the eigenvalues of the outer product matrix. In this work, we implement a linear kernel (scaled covariance matrix) and so, in this special case, weighting the eigenvectors equally is equivalent to the projection kernel.
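The equal-weight variant amounts to dropping the eigenvalue factors from the sketch above (our illustration):

```python
# Unweighted variant (illustrative): every eigenvector pair contributes
# equally, the projection-kernel analogue of the weighted statistic.
import numpy as np

def hsic_stat_unweighted(x_adj, Z):
    """Same inputs as hsic_stat_svd; eigenvalue weights set to 1."""
    Ux, _, _ = np.linalg.svd((x_adj - x_adj.mean())[:, None],
                             full_matrices=False)
    Uz, _, _ = np.linalg.svd(Z - Z.mean(axis=0), full_matrices=False)
    G = Ux.T @ Uz
    return (G ** 2).sum() / len(x_adj)
```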
In summary, there are two implementations of the LIT framework. The residuals are first transformed to calculate the SQ and CP to reveal any latent interactive effects. We then calculate the weighted and unweighted eigenvectors in the test statistic, which we refer to as weighted LIT (wLIT) and unweighted LIT (uLIT), respectively. We also apply a Cauchy combination test (CCT) [41] to combine the $p$-values from the LIT implementations to maximize the number of discoveries and hedge for various (unknown) settings where one implementation may outperform the other. More specifically, let $p_j$ denote the $p$-value for the $j$th of the $d$ implementations. In this case, the CCT statistic is $T_{\mathrm{CCT}} = \frac{1}{d} \sum_{j=1}^{d} \tan\{(0.5 - p_j)\pi\}$, where $\pi$ is a mathematical constant. A corresponding $p$-value is then calculated using the standard Cauchy distribution. Importantly, when applying genome-wide significance levels, the CCT $p$-value provides control of the type I error rate under arbitrary dependence structures. In the Results section, we refer to the CCT $p$-value as aggregate LIT (aLIT).
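A short sketch of the published CCT recipe [41] as applied here (our code; `cauchy_combination` is our own name):

```python
# Cauchy combination test (illustrative): combine wLIT and uLIT
# p-values into a single aggregate (aLIT) p-value.
import numpy as np
from scipy import stats

def cauchy_combination(pvals):
    pvals = np.asarray(pvals, dtype=float)
    t = np.mean(np.tan((0.5 - pvals) * np.pi))   # equal weights 1/d
    return stats.cauchy.sf(t)                    # standard Cauchy tail

# Usage: p_aLIT = cauchy_combination([p_wLIT, p_uLIT])
```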
We can extend LIT to assess latent interactions within a genetic region (e.g., a gene) consisting of multiple SNPs. In the first step, we regress out the joint additive effects from the multiple SNPs along with any other covariates and population structure. In the second step, we calculate the squared residuals and cross products using the corresponding residual matrix. Finally, in the last step, we construct the similarity matrices and perform inference using the HSIC: the linear similarity matrix for the adjusted genotype matrix $\tilde{\mathbf{X}}$ is $\mathbf{K}_x = \tilde{\mathbf{X}} \tilde{\mathbf{X}}^{\top}$ and our test statistic is $T = \frac{1}{n} \sum_{i=1}^{r_x} \sum_{j=1}^{r_z} \lambda_i \sigma_j (\mathbf{u}_i^{\top} \mathbf{v}_j)^2$, where $r_x$ is the rank of the genotype matrix.
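The single-SNP SVD sketch extends directly to a region (our illustration; the only change is that the genotype input is a matrix):

```python
# Region-based sketch (illustrative): multiple SNPs enter through a
# linear kernel on the adjusted genotype matrix.
import numpy as np

def region_hsic_stat(X_adj, Z):
    """X_adj: (n, s) structure-adjusted SNPs; Z: (n, d) SQ/CP matrix."""
    n = X_adj.shape[0]
    Ux, sx, _ = np.linalg.svd(X_adj - X_adj.mean(axis=0),
                              full_matrices=False)
    Uz, sz, _ = np.linalg.svd(Z - Z.mean(axis=0), full_matrices=False)
    G = Ux.T @ Uz                        # r_x x r_z inner products
    return (np.outer(sx ** 2, sz ** 2) * G ** 2).sum() / n
```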
Compared to the previous section, this extended version of LIT is a region-based test for interactive effects rather than a SNP-by-SNP test. A region-based test has the advantage of reducing the number of tests relative to a SNP-by-SNP approach. However, in this work, we apply LIT in a SNP-by-SNP genome-wide scan to demonstrate its scalability.