eQTL generation and analysis

MF Menachem Fromer
PR Panos Roussos
SS Solveig K Sieberts
JJ Jessica S Johnson
DK David H Kavanagh
TP Thanneer M Perumal
DR Douglas M Ruderfer
EO Edwin C Oh
AT Aaron Topol
HS Hardik R Shah
LK Lambertus L Klei
RK Robin Kramer
DP Dalila Pinto
ZG Zeynep H Gümüş
AC A. Ercument Cicek
KD Kristen K Dang
AB Andrew Browne
CL Cong Lu
LX Lu Xie
BR Ben Readhead
ES Eli A Stahl
MP Mahsa Parvizi
TH Tymor Hamamsy
JF John F Fullard
YW Ying-Chih Wang
MM Milind C Mahajan
JD Jonathan M J Derry
JD Joel Dudley
SH Scott E Hemby
BL Benjamin A Logsdon
KT Konrad Talbot
TR Towfique Raj
DB David A Bennett
PJ Philip L De Jager
JZ Jun Zhu
BZ Bin Zhang
PS Patrick F Sullivan
AC Andrew Chess
SP Shaun M Purcell
LS Leslie A Shinobu
LM Lara M Mangravite
HT Hiroyoshi Toyoshiba
RG Raquel E Gur
CH Chang-Gyu Hahn
DL David A Lewis
VH Vahram Haroutunian
MP Mette A Peters
BL Barbara K Lipska
JB Joseph D Buxbaum
ES Eric E Schadt
request Request a Protocol
ask Ask a question
Favorite

For the 16,423 genes with above-threshold expression, gene-level eQTL (gene expression quantitative trait loci) were derived using the N = 467 genetically-inferred Caucasian samples (209 SCZ cases, 206 Controls, and 52 AFF cases), across the 6.4 million genotyped and imputed markers with imputation score (INFO) ≥ 0.8 and estimated minor allele frequency (MAF) ≥ 0.05. eQTL were computed using a linear model on the imputed genotype dosages using MatrixEQTL69. The gene expression data were adjusted for the covariate model, although without adjusting for ancestry vectors. In addition, the estimated Dx effect was added back to the residuals because we wanted to allow for an effect of diagnosis on gene expression. The 5 ancestry vectors were included instead in the eQTL model to control for ancestry differences in SNP allele frequencies. Thus, the final regression model for eQTL discovery in the full Caucasian CMC cohort was:

FDR was estimated separately for cis-eQTL (defined as ≤ 1 MB between SNP marker and gene position) and trans-eQTL (> 1 MB between marker and gene), controlling for FDR one chromosome at a time. The regression modeling was performed for SNPs on the X chromosome in the same manner as for those on the autosomes (i.e., with a dosage scaling between 0 and 2 for both males and females); this gender-neutral model was appropriate here since the gene expression was already adjusted for gender.

Additionally, eQTL were generated separately in SCZ cases and controls, and the combination of those samples (excluding AFF cases). However, permutation of disease status indicated that the overlaps between case-derived eQTL and control-derived eQTL were similar to the amount expected for two homogeneous sets of these sample sizes, and there was limited evidence for condition-specific eQTL. Nevertheless, to potentially identify eQTL that differ by disease state, a disease-genotype interaction term was also explicitly tested, but only a handful of such associations were found to be significant after controlling for FDR.

Lastly, per-gene permutations were performed to identify genes with at least one significant eQTL after correcting for multiple marker testing13. 1000 permutations were performed per gene and FDR was estimated on the permutation P values using the qvalue R package (Dabney A and Storey JD. qvalue: Q-value estimation for false discovery rate control. R package version 1.43.0).

Using similar techniques to derive isoform expression quantitative trait loci (isoQTLs), we identified 3,355,111 significant cis-isoQTLs at FDR ≤ 5%, representing 27,691 isoforms of 10,779 genes. IsoQTLs and gene-level eQTLs overlapped substantially; 58% of isoQTLs were cis-eQTLs for the parent gene at FDR ≤ 5%; conversely, 71% of cis-eQTLs for genes with at least one represented isoform were isoQTLs at FDR ≤ 5%. There were, however, 1,584 genes having no cis-eQTL (FDR ≤ 5%) that nevertheless had at least one significant isoQTL. At the isoform level, there were 39,414 significant trans-isoQTLs, representing 964 isoforms (836 genes), of which 61% were also trans-eQTLs for the same gene.

Since there exist a number of previous brain eQTL studies, we wanted to assess the overlap of the eQTL derived here from CMC with those existing databases. To that end, eQTL for the DLPFC from the (i) Braincloud16 (GEO accession number GSE30272, n samples = 108), (ii) NIH17 (GEO accession number GSE15745, n samples = 145), and (iii) Harvard Brain Tissue Resource Center (HBTRC) / Harvard Brain Bank (HBB)15 (GEO accession number GSE44772, n samples = 146) datasets were generated as previously described21. In addition, eQTL for the frontal cortex from the (iv) UKBEC data18 (GEO accession number GSE46706, n samples = 134) were generated in a similar manner using imputed genotypes obtained directly from the study authors. eQTL for a (v) meta-analysis of brain cortical regions (N = 424) were also obtained from the supplementary materials included with the publication19; note that this meta-analysis included some of the individual studies above. For each of these 5 datasets, an FDR threshold of 5% was used to declare significance of cis-eQTL, and those associated pairs were carried forward for testing. For RNA-seq-based eQTLs from DLPFC (Brodmann area 9, n samples = 92) that are part of the Genotype-Tissue Expression (GTEx) Project13, we utilized those eQTLs significant after permutation (as performed by the GTEx Consortium); these data were downloaded from the GTEx Portal (www.gtexportal.org), corresponding to dbGaP accession number phs000424.v6.p1.

Next, before performing any comparison analyses, the database eQTL were first filtered, removing all eQTL involving: a) array probes that mapped to more than one gene, b) genes not expressed above the minimum threshold in our cohort (and thus would necessarily be missing from our results), c) genes that could not be uniquely mapped to Ensembl (v70) genes, or d) SNPs not included in our analysis.

Then, because the data herein are substantially larger than existing brain eQTL datasets that were therefore more limited in power for eQTL discovery, we focused on testing the sensitivity of our eQTL towards recapitulating publicly available eQTL. To robustly assess this sensitivity, we considered the eQTL P values from our CMC cohort with regards to the eQTL associations described in each public database we had curated. We scored the overlap using π1, the proportion of non-null hypotheses (as estimated by the ‘qvalue’ package in R) among the distribution of CMC P values for the database eQTL SNP-gene association pairs.

For another comparison with genome-wide eQTL, we also used the unpublished, but publicly available HBCC microarray cohort (dbGAP ID: 000979.v1.p1) described below to generate a large set of eQTL better powered for replication of the CMC-derived eQTL. Genotypes were obtained using the HumanHap650Yv3 or Human1MDuov3 chips, and ancestry components were subsequently inferred as above for CMC. Next, eQTL were generated using N = 279 genetically-inferred Caucasian samples (76 controls, 72 SCZ, 43 BP, 88 MDD) in an analogous manner to CMC, adjusting for diagnosis and 5 ancestry components.

Lastly, we employed the eQTL derived from the unpublished ROS/MAP study (https://www.synapse.org/#!Synapse:syn3219045, details on the study given below) in a limited way to replicate the 5 single-gene QTL associations we detected as having strong overlap with GWAS risk variants. The subset of the ROS/MAP cohort currently RNA-sequenced and analyzed (N = 461 DLPFC samples) was used to derive eQTL. To account for non-genetic factors such as batch effects, age, gender, and technical artifacts in the gene expression data, PEER70 was applied. The optimal numbers of PEER factors for association analysis were determined based on the factors that resulted in the maximal number of cis-eQTLs. This procedure identified between 30 and 40 factors in this DLFPC dataset; here, we used 30 PEER factors. We regressed out these factors from the gene expression levels and used the residuals as phenotypes for all eQTL association analyses. The ROS/MAP study 26 takes advantage of data and biological specimens from more than 1000 persons from two prospective, longitudinal clinical-pathologic studies of older subjects that are non-demented at the time of recruitment (the religious order study and the memory and aging project). The subjects have detailed clinical and phenotypic data such as detailed annual cognitive function testing, clinical evaluations for dementia, and a detailed neuropathologic examination.

From January 1994 through June of 2010, 1,148 persons agreed to annual detailed clinical evaluation and brain donation at the time of death. Of these, 1,139 have completed their baseline clinical evaluation: 68.9% were women; 88.0% were white, non-Hispanic; their mean age was 75.6 years; and mean education was 18.1 years. To date, there have been 287 cases of incident dementia and 273 cases of incident AD with or without a coexisting condition.

From October 1997 through June 2010, 1,403 persons agreed to annual detailed clinical evaluation and donation of brain, spinal cord, nerve, and muscle at the time of death. Of these, 1,372 have completed their baseline clinical evaluation: 72.7% were women; 86.9% were white, non-Hispanic; their mean age was 80.0 years; and mean education was 14.3 years, with 34.0% with 12 or fewer years of education. To date, there have been 250 cases of incident dementia and 238 cases of incident AD with or without a coexisting condition. At this time, over 900 subjects from either ROS or MAP are deceased and have frozen brain tissue available for data generation. To avoid population stratification artifacts in the genetic analyses, the first 500 subjects were randomly selected from among those subjects that are self-reported to be of white, non-Hispanic ancestry, and have genome-wide genotype data (N = 1,709 for the entire ROS and MAP studies) that confirm this self-reported ancestry.

To assess how cis- and trans-eQTLs relate to known enhancer sequences, we tested for overlap between eQTLs and enhancer sequences from the Roadmap Epigenomics Consortium71. More specifically, we used chromatin states for enhancer sequences (active, genic, and weak enhancers), derived from a recent joint analysis that the Roadmap Epigenomics Consortium applied in different chromatin immunoprecipitation sequencing (ChIP-seq) data across 98 human tissues and cell lines. We included tissues that were assayed for 6 different chromatin marks (H3K4me1, H3K4me3, H3K27ac, H3K36me3, H3K27me3, and H3K9me3). We tested for enrichment of significant eQTLs at FDR ≤ 5%, using as an index eQTL SNP (eSNP) the most significantly associated SNP per gene (“max-eQTL”), which resulted in 13,137 and 851 cis- and trans-eSNPs, respectively. For each tissue or cell line, we counted the number of index eSNPs that lie within enhancer sequences respectively found in that tissue or cell line. To assess if this overlap is higher than expected by chance, we generated 1,000 sets of random SNPs matched with the index cis- and trans-eSNPs, in terms of allele frequency, gene density, distance from TSS, and density of tagSNPs arising from genomic variability of linkage disequlibrium. Z scores were estimated as:

Where observed is the number of index eSNPs that lie within enhancers, and meannull and SDnull are the mean and standard deviation of the null distribution of overlap, as estimated using the sets of permuted SNPs.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

post Post a Question
0 Q&A