Tissue Specific Gene Expression Imputation

Carlo Maj
Tiago Azevedo
Valentina Giansanti
Oleg Borisov
Giovanna Maria Dimitri
Simeon Spasov
Pietro Liò
Ivan Merelli

Data used for the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). ADNI was launched in 2003 as a public-private partnership led by Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). In the present work, we analyzed the ADNI1-GWAS dataset, which includes gene array genotyping data for 808 samples available on the ADNI portal.

We performed rigorous quality control. Samples were checked for sex consistency, for a missing genotype rate lower than 0.05 and for heterozygosity levels (F < 0.2), while variants with a Hardy–Weinberg p-value < 1e-10 were removed. Then, using the tool by W. Rayner (footnote 5), we checked SNPs for strand consistency, allele names, position, Ref/Alt assignments and minor allele frequency (MAF) against the reference panel. To increase the available genetic information, we imputed our data using the Sanger Imputation Server (footnote 6), with Eagle2 for phasing (Loh et al., 2016) and the Positional Burrows–Wheeler Transform for imputation (Durbin, 2014), and with the Haplotype Reference Consortium version 1.1 (McCarthy et al., 2016) as the reference panel. As post-imputation quality control, we removed variants with an info quality score < 0.6 and set genotype calls with posterior probability < 0.9 to missing. The post-QC imputed data were used to estimate gene expression regulation across the different samples.
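The two post-imputation filters described above can be sketched in a few lines. This is only an illustration of the thresholds stated in the text, not the code used by the authors: the function name `post_imputation_qc`, the input layout (one info score per variant, and per-sample posterior probabilities for the three genotypes) and the use of `None` for a missing call are all assumptions made for the example.

```python
def post_imputation_qc(info_scores, genotype_probs, info_min=0.6, prob_min=0.9):
    """Apply the two post-imputation filters described in the text.

    info_scores    : list of per-variant imputation info scores.
    genotype_probs : genotype_probs[v][s] is a list of posterior probabilities
                     for the genotypes (0, 1, 2 alternate alleles) of sample s
                     at variant v. This layout is an assumption of the sketch.
    Returns the indices of the retained variants and, for each retained
    variant, the hard genotype call per sample (None = set to missing).
    """
    kept_variants = []
    calls = []
    for v, (info, per_sample) in enumerate(zip(info_scores, genotype_probs)):
        # Filter 1: drop variants whose info score is below the threshold.
        if info < info_min:
            continue
        row = []
        for probs in per_sample:
            # Most likely genotype for this sample at this variant.
            best = max(range(len(probs)), key=lambda g: probs[g])
            # Filter 2: low-confidence calls are set to missing.
            row.append(best if probs[best] >= prob_min else None)
        kept_variants.append(v)
        calls.append(row)
    return kept_variants, calls
```

In practice these filters would normally be applied with standard genotype-processing tools on the imputed VCF files; the sketch only makes the logic of the two thresholds explicit.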

To predict the genetic component of gene expression, we used PrediXcan, which evaluates the aggregate effects of cis-regulatory variants (within 1 Mb upstream or downstream of the genes of interest) on gene expression via an elastic net regression method (Gamazon et al., 2015). PrediXcan needs a reference dataset in which both genome variation and gene expression levels have been measured in order to build prediction models for gene expression. We used already available models trained on GTEx data (footnote 7) to impute tissue-specific transcriptomic profiles in a total of 42 tissues (we excluded sex-specific tissues, e.g., prostate, ovary, etc.). The imputed transcriptomic profiles were subsequently analyzed using different machine learning approaches (Figure 1). Unsupervised machine learning methods were used to analyze the structure of the data, whereas supervised methods were used to test for the presence of “signal” with respect to AD-related phenotypes.
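Once a model has been trained, imputing expression amounts to a weighted sum of genotype dosages over the cis-variants selected by the elastic net. The sketch below illustrates that step only. It is not PrediXcan's own code: the function name `predict_expression`, the dictionary layout of the weights and the SNP and gene identifiers are hypothetical, and the sketch assumes that dosages already count the same effect allele as the model weights (real pipelines must align alleles first).

```python
def predict_expression(dosages, snp_ids, gene_weights):
    """Impute the genetically regulated expression of each gene.

    dosages      : dosages[s][j] is the effect-allele dosage (0..2) of
                   sample s at SNP snp_ids[j].
    snp_ids      : list of SNP identifiers, one per dosage column.
    gene_weights : {gene: {snp_id: weight}} for one tissue, i.e. the non-zero
                   elastic net coefficients of the cis-variants of each gene
                   (layout assumed for this sketch).
    Returns {gene: [predicted expression per sample]}.
    """
    column = {snp: j for j, snp in enumerate(snp_ids)}
    predicted = {}
    for gene, weights in gene_weights.items():
        values = [0.0] * len(dosages)
        for snp, weight in weights.items():
            j = column.get(snp)
            if j is None:
                # SNP used by the model but absent from the genotype data.
                continue
            for s, sample in enumerate(dosages):
                values[s] += weight * sample[j]
        predicted[gene] = values
    return predicted


# Toy example with two samples, three SNPs and two hypothetical genes.
example = predict_expression(
    dosages=[[0, 1, 2], [2, 0, 1]],
    snp_ids=["rs1", "rs2", "rs3"],
    gene_weights={
        "GENE_A": {"rs1": 0.5, "rs3": -0.25},
        "GENE_B": {"rs2": 1.0, "rs_missing": 2.0},
    },
)
```

Repeating this calculation with the weight set of each tissue yields one imputed expression matrix (individuals by genes) per tissue.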

Figure 1. Framework of the integrative analysis of multi-tissue expression profiles. Starting from genotyping data (m individuals by n variants), we imputed tissue-specific transcriptomic profiles (for each tissue T_i, where i = 1, …, k) by means of cis-eQTL PrediXcan models trained on GTEx data. A variational autoencoder followed by a support vector machine (SVM) latent dimension-tissue match on the imputed gene expression matrices (m individuals by z genes) is used as a feature selection step to identify the most relevant genes per tissue (T_i = gene_1, …, gene_s, where i is the i-th tissue and s is the number of prioritized genes), which are provided as input to the recurrent neural network classifier.
