In the following sections, we first formalize the input data, and then describe a standard gene–SNP association model, extend this model for the effect of a single SNP on a single gene through perturbation of a certain network branch, and explain how to model SNP–branch perturbation. Lastly, based on this framework, we describe the details of the GEVIN algorithm as a genome-wide analysis of genetic perturbations within the signaling network.
GEVIN takes the following as input. (i) Genotyping of a large collection of SNPs Q across a group of individuals I. Xq is a column vector representing the genotyping of SNP q ∈ Q across all individuals in I. (ii) The transcriptional response of multiple genes in group G following each of the multiple stimulations in group S across all individuals in group I. Let Yg be the measured |I| × |S| transcriptional response matrix of gene g ∈ G, where represents the change in expression level of gene g following stimulation s ∈ S in individual i ∈ I. (iii) Regulatory network MG,S, which is acquired from the scientific literature. Such a network consists of a wiring diagram M that is triggered by the collection of stimulations S and controls the transcriptional response of genes in group G. The structure of the regulatory network M defines a collection of network branches (that is, the distinct signaling pathways), denoted by B. Each branch b ∈ B is characterized by the subset of its triggering stimulations Sb ∈ S (its upstream stimuli) and the subset of its regulated genes Gb ∈ G (its downstream genes). In accordance, we define the activation signature of a branch b ∈ B as a Boolean |G| × |S| matrix Ab, where for each g ∈ Gb and s ∈ Sb, and all other entries are set to zero.
Our analysis generally relies on a standard multivariate regression model (Rencher 2002; Hidalgo and Goodman 2013) to evaluate the association between a single SNP q ∈ Q and a single gene g ∈ G based on its transcriptional response to all stimulations in group S. In this model, the explanatory variable is the genotypic data Xq and the multivariate outcome variable is the transcriptional response matrix Yg:
To gain flexibility in the relationships between the different genotypic groups, the genetic effect was modeled as a categorical variable.
GEVIN relies on a basic assumption that if a given SNP truly perturbs a certain network branch b ∈ B, it would induce genetic variation only in the genes regulated by this branch (g ∈ Gb) and only following the relevant triggering stimulations (s ∈ Sb), as they compose the activation signature of branch b. To model the effect of SNP q on gene g through perturbation of a given branch b ∈ B, we add constraints to the regression model in Equation 1 as follows: the coefficients are fixed to zero for all genes and stimulations in which (that is, = 0 if and only if s ∈ S\Sb or g ∈ G\Gb; Figure S2 in File S1 exemplifies the resulting regression in the case of the toy network in Figure 1A). Finally, a likelihood ratio test of the constrained regression model (comparing the model with and without the SNP variable Xq) provides a P-value for the association of a single gene g and a single SNP q acting through a certain branch b. We refer to this P-value as Pq,b,g. Overall, |Q| × |B| × |G| P-values are calculated for each combination of a gene, a SNP, and a network branch.
A P-value for the genetic perturbation of a branch b by a SNP q, denoted Pq,b, is calculated for each SNP q ∈ Q and branch b ∈ B by combining the P-values of all the genes downstream to the branch using Fisher’s combined probability test (Fisher 1925). Formally, for each SNP q and branch b, Fisher’s test is applied to combine all P-values in the set {Pq,b,g|g ∈ Gb}.
To avoid statistical inflation (e.g., due to correlation among target genes), the combined P-values of each branch b are normalized by the “genomic inflation factor” of the branch, as described in Devlin et al. (2001). The genomic inflation factor of a given branch b is defined as the ratio of the median of the observed (real data) combined P-values vs. the expected median. Here, the expected combined P-values are calculated using 10 permuted data sets that were generated by shuffling the genotypic data labels of all individuals. In all cases, the median is calculated across the set {Pq,b|q ∈ Q} of combined P-values (that is, across all SNPs for the same branch). We refer to these adjusted combined P-values as the “GEVIN scores.”
Finally, a permutation-based false discovery rate (FDR) is determined for each branch to conservatively identify significant SNP–branch perturbations. This is performed by generating R permuted data sets (by randomly reshuffling the genotypic data labels of individuals; here, R = 10), calculating GEVIN scores for all SNPs in each permuted data set, and then computing the FDR of a branch b as the fraction of permuted SNPs that obtained higher GEVIN scores than a certain threshold. Importantly, by shuffling only the genotypic data, the activation signature of each branch is retained, maintaining correlations between the regulating genes and the triggering stimulations of each branch.
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.