eQTL generation and analysis

Menachem Fromer; Panos Roussos; Solveig K Sieberts; Jessica S Johnson; David H Kavanagh; Thanneer M Perumal; Douglas M Ruderfer; Edwin C Oh; Aaron Topol; Hardik R Shah; Lambertus L Klei; Robin Kramer; Dalila Pinto; Zeynep H Gümüş; A. Ercument Cicek; Kristen K Dang; Andrew Browne; Cong Lu; Lu Xie; Ben Readhead; Eli A Stahl; Mahsa Parvizi; Tymor Hamamsy; John F Fullard; Ying-Chih Wang; Milind C Mahajan; Jonathan M J Derry; Joel Dudley; Scott E Hemby; Benjamin A Logsdon; Konrad Talbot; Towfique Raj; David A Bennett; Philip L De Jager; Jun Zhu; Bin Zhang; Patrick F Sullivan; Andrew Chess; Shaun M Purcell; Leslie A Shinobu; Lara M Mangravite; Hiroyoshi Toyoshiba; Raquel E Gur; Chang-Gyu Hahn; David A Lewis; Vahram Haroutunian; Mette A Peters; Barbara K Lipska; Joseph D Buxbaum; Eric E Schadt

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

eQTL generation and analysis

MF Menachem Fromer

PR Panos Roussos

SS Solveig K Sieberts

JJ Jessica S Johnson

DK David H Kavanagh

TP Thanneer M Perumal

DR Douglas M Ruderfer

EO Edwin C Oh

AT Aaron Topol

HS Hardik R Shah

LK Lambertus L Klei

RK Robin Kramer

DP Dalila Pinto

ZG Zeynep H Gümüş

AC A. Ercument Cicek

KD Kristen K Dang

AB Andrew Browne

CL Cong Lu

LX Lu Xie

BR Ben Readhead

ES Eli A Stahl

MP Mahsa Parvizi

TH Tymor Hamamsy

JF John F Fullard

YW Ying-Chih Wang

MM Milind C Mahajan

JD Jonathan M J Derry

JD Joel Dudley

SH Scott E Hemby

BL Benjamin A Logsdon

KT Konrad Talbot

TR Towfique Raj

DB David A Bennett

PJ Philip L De Jager

JZ Jun Zhu

BZ Bin Zhang

PS Patrick F Sullivan

AC Andrew Chess

SP Shaun M Purcell

LS Leslie A Shinobu

LM Lara M Mangravite

HT Hiroyoshi Toyoshiba

RG Raquel E Gur

CH Chang-Gyu Hahn

DL David A Lewis

VH Vahram Haroutunian

MP Mette A Peters

BL Barbara K Lipska

JB Joseph D Buxbaum

ES Eric E Schadt

This method is extracted from research article: Nat Neurosci, Sep 2016

Gene Expression Elucidates Functional Impact of Polygenic Risk for Schizophrenia

DOI: 10.1038/nn.4399

Request a Protocol

Ask a question

Favorite

For the 16,423 genes with above-threshold expression, gene-level eQTL (gene expression quantitative trait loci) were derived using the N = 467 genetically-inferred Caucasian samples (209 SCZ cases, 206 Controls, and 52 AFF cases), across the 6.4 million genotyped and imputed markers with imputation score (INFO) ≥ 0.8 and estimated minor allele frequency (MAF) ≥ 0.05. eQTL were computed using a linear model on the imputed genotype dosages using MatrixEQTL⁶⁹. The gene expression data were adjusted for the covariate model, although without adjusting for ancestry vectors. In addition, the estimated Dx effect was added back to the residuals because we wanted to allow for an effect of diagnosis on gene expression. The 5 ancestry vectors were included instead in the eQTL model to control for ancestry differences in SNP allele frequencies. Thus, the final regression model for eQTL discovery in the full Caucasian CMC cohort was:

FDR was estimated separately for cis-eQTL (defined as ≤ 1 MB between SNP marker and gene position) and trans-eQTL (> 1 MB between marker and gene), controlling for FDR one chromosome at a time. The regression modeling was performed for SNPs on the X chromosome in the same manner as for those on the autosomes (i.e., with a dosage scaling between 0 and 2 for both males and females); this gender-neutral model was appropriate here since the gene expression was already adjusted for gender.

Additionally, eQTL were generated separately in SCZ cases and controls, and the combination of those samples (excluding AFF cases). However, permutation of disease status indicated that the overlaps between case-derived eQTL and control-derived eQTL were similar to the amount expected for two homogeneous sets of these sample sizes, and there was limited evidence for condition-specific eQTL. Nevertheless, to potentially identify eQTL that differ by disease state, a disease-genotype interaction term was also explicitly tested, but only a handful of such associations were found to be significant after controlling for FDR.

Lastly, per-gene permutations were performed to identify genes with at least one significant eQTL after correcting for multiple marker testing¹³. 1000 permutations were performed per gene and FDR was estimated on the permutation P values using the qvalue R package (Dabney A and Storey JD. qvalue: Q-value estimation for false discovery rate control. R package version 1.43.0).

Using similar techniques to derive isoform expression quantitative trait loci (isoQTLs), we identified 3,355,111 significant cis-isoQTLs at FDR ≤ 5%, representing 27,691 isoforms of 10,779 genes. IsoQTLs and gene-level eQTLs overlapped substantially; 58% of isoQTLs were cis-eQTLs for the parent gene at FDR ≤ 5%; conversely, 71% of cis-eQTLs for genes with at least one represented isoform were isoQTLs at FDR ≤ 5%. There were, however, 1,584 genes having no cis-eQTL (FDR ≤ 5%) that nevertheless had at least one significant isoQTL. At the isoform level, there were 39,414 significant trans-isoQTLs, representing 964 isoforms (836 genes), of which 61% were also trans-eQTLs for the same gene.

Since there exist a number of previous brain eQTL studies, we wanted to assess the overlap of the eQTL derived here from CMC with those existing databases. To that end, eQTL for the DLPFC from the (i) Braincloud¹⁶ (GEO accession number GSE30272, n samples = 108), (ii) NIH¹⁷ (GEO accession number GSE15745, n samples = 145), and (iii) Harvard Brain Tissue Resource Center (HBTRC) / Harvard Brain Bank (HBB)¹⁵ (GEO accession number GSE44772, n samples = 146) datasets were generated as previously described²¹. In addition, eQTL for the frontal cortex from the (iv) UKBEC data¹⁸ (GEO accession number GSE46706, n samples = 134) were generated in a similar manner using imputed genotypes obtained directly from the study authors. eQTL for a (v) meta-analysis of brain cortical regions (N = 424) were also obtained from the supplementary materials included with the publication¹⁹; note that this meta-analysis included some of the individual studies above. For each of these 5 datasets, an FDR threshold of 5% was used to declare significance of cis-eQTL, and those associated pairs were carried forward for testing. For RNA-seq-based eQTLs from DLPFC (Brodmann area 9, n samples = 92) that are part of the Genotype-Tissue Expression (GTEx) Project¹³, we utilized those eQTLs significant after permutation (as performed by the GTEx Consortium); these data were downloaded from the GTEx Portal (www.gtexportal.org), corresponding to dbGaP accession number phs000424.v6.p1.

Next, before performing any comparison analyses, the database eQTL were first filtered, removing all eQTL involving: a) array probes that mapped to more than one gene, b) genes not expressed above the minimum threshold in our cohort (and thus would necessarily be missing from our results), c) genes that could not be uniquely mapped to Ensembl (v70) genes, or d) SNPs not included in our analysis.

Then, because the data herein are substantially larger than existing brain eQTL datasets that were therefore more limited in power for eQTL discovery, we focused on testing the sensitivity of our eQTL towards recapitulating publicly available eQTL. To robustly assess this sensitivity, we considered the eQTL P values from our CMC cohort with regards to the eQTL associations described in each public database we had curated. We scored the overlap using π₁, the proportion of non-null hypotheses (as estimated by the ‘qvalue’ package in R) among the distribution of CMC P values for the database eQTL SNP-gene association pairs.

For another comparison with genome-wide eQTL, we also used the unpublished, but publicly available HBCC microarray cohort (dbGAP ID: 000979.v1.p1) described below to generate a large set of eQTL better powered for replication of the CMC-derived eQTL. Genotypes were obtained using the HumanHap650Yv3 or Human1MDuov3 chips, and ancestry components were subsequently inferred as above for CMC. Next, eQTL were generated using N = 279 genetically-inferred Caucasian samples (76 controls, 72 SCZ, 43 BP, 88 MDD) in an analogous manner to CMC, adjusting for diagnosis and 5 ancestry components.

Lastly, we employed the eQTL derived from the unpublished ROS/MAP study (https://www.synapse.org/#!Synapse:syn3219045, details on the study given below) in a limited way to replicate the 5 single-gene QTL associations we detected as having strong overlap with GWAS risk variants. The subset of the ROS/MAP cohort currently RNA-sequenced and analyzed (N = 461 DLPFC samples) was used to derive eQTL. To account for non-genetic factors such as batch effects, age, gender, and technical artifacts in the gene expression data, PEER⁷⁰ was applied. The optimal numbers of PEER factors for association analysis were determined based on the factors that resulted in the maximal number of cis-eQTLs. This procedure identified between 30 and 40 factors in this DLFPC dataset; here, we used 30 PEER factors. We regressed out these factors from the gene expression levels and used the residuals as phenotypes for all eQTL association analyses. The ROS/MAP study ²⁶ takes advantage of data and biological specimens from more than 1000 persons from two prospective, longitudinal clinical-pathologic studies of older subjects that are non-demented at the time of recruitment (the religious order study and the memory and aging project). The subjects have detailed clinical and phenotypic data such as detailed annual cognitive function testing, clinical evaluations for dementia, and a detailed neuropathologic examination.

From January 1994 through June of 2010, 1,148 persons agreed to annual detailed clinical evaluation and brain donation at the time of death. Of these, 1,139 have completed their baseline clinical evaluation: 68.9% were women; 88.0% were white, non-Hispanic; their mean age was 75.6 years; and mean education was 18.1 years. To date, there have been 287 cases of incident dementia and 273 cases of incident AD with or without a coexisting condition.

From October 1997 through June 2010, 1,403 persons agreed to annual detailed clinical evaluation and donation of brain, spinal cord, nerve, and muscle at the time of death. Of these, 1,372 have completed their baseline clinical evaluation: 72.7% were women; 86.9% were white, non-Hispanic; their mean age was 80.0 years; and mean education was 14.3 years, with 34.0% with 12 or fewer years of education. To date, there have been 250 cases of incident dementia and 238 cases of incident AD with or without a coexisting condition. At this time, over 900 subjects from either ROS or MAP are deceased and have frozen brain tissue available for data generation. To avoid population stratification artifacts in the genetic analyses, the first 500 subjects were randomly selected from among those subjects that are self-reported to be of white, non-Hispanic ancestry, and have genome-wide genotype data (N = 1,709 for the entire ROS and MAP studies) that confirm this self-reported ancestry.

To assess how cis- and trans-eQTLs relate to known enhancer sequences, we tested for overlap between eQTLs and enhancer sequences from the Roadmap Epigenomics Consortium⁷¹. More specifically, we used chromatin states for enhancer sequences (active, genic, and weak enhancers), derived from a recent joint analysis that the Roadmap Epigenomics Consortium applied in different chromatin immunoprecipitation sequencing (ChIP-seq) data across 98 human tissues and cell lines. We included tissues that were assayed for 6 different chromatin marks (H3K4me1, H3K4me3, H3K27ac, H3K36me3, H3K27me3, and H3K9me3). We tested for enrichment of significant eQTLs at FDR ≤ 5%, using as an index eQTL SNP (eSNP) the most significantly associated SNP per gene (“max-eQTL”), which resulted in 13,137 and 851 cis- and trans-eSNPs, respectively. For each tissue or cell line, we counted the number of index eSNPs that lie within enhancer sequences respectively found in that tissue or cell line. To assess if this overlap is higher than expected by chance, we generated 1,000 sets of random SNPs matched with the index cis- and trans-eSNPs, in terms of allele frequency, gene density, distance from TSS, and density of tagSNPs arising from genomic variability of linkage disequlibrium. Z scores were estimated as:

Where observed is the number of index eSNPs that lie within enhancers, and mean_null and SD_null are the mean and standard deviation of the null distribution of overlap, as estimated using the sets of permuted SNPs.

Users may view, print, copy, and download text and data-mine the content in such documents, for the purposes of academic research, subject always to the full Conditions of use:http://www.nature.com/authors/editorial_policies/license.html#terms

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol