NPH sample RNA sequencing and data preprocessing

WH Wenrui Huang
AB Anne Marie Bartosch
HX Harrison Xiao
SM Suvrajit Maji
EY Elliot H. H. Youth
XF Xena Flowers
SL Sandra Leskinen
ZT Zeljko Tomljanovic
GI Gail Iodice
DB Deborah Boyett
ES Eleonora Spinazzi
VM Vilas Menon
RM Robert A. McGovern
GM Guy M. McKhann
AT Andrew F. Teich
request Request a Protocol
ask Ask a question
Favorite

RNA was extracted from biopsy samples using the miRNeasy Mini Kit (QIAGEN; Cat No./ID: 217004), which purifies total RNA including miRNA (note that our library prep protocol selects poly(A) coding mRNA). RNA integrity was measured on an Agilent Bioanalyzer, and samples with RNA integrity number (RIN) values ≥6 were selected for sequencing. RNAs were prepared for sequencing using the Illumina TruSeq mRNA Library Prep Kit, and samples were sequenced with Illumina HiSeq 2000, 2500, and 4000 (potential batch effect attributable to different sequencers was regressed out along with other confounding variables using surrogate variable analysis; see below), and all samples underwent single-end sequencing to 30 M read depth. The quality of all fastq files was confirmed with FastQC v 0.11.864. Fastq files that passed quality check were further mapped to Genome Reference Consortium Human Build 37 (GRCh37) reference genome with STAR65. Output BAM files from STAR were further processed with featureCounts66 to obtain raw counts for each gene of all the sequenced samples.

Genes with <5 counts in at least 90% of all samples were first filtered out from the raw count matrix, followed by variance stabilizing transformation (VST) on filtered counts utilizing the varianceStabilizingTransformation function from DESeq2 R package67. VST is a widely applied strategy that transforms data from a distribution of fluctuating variance into a new distribution with nearly constant variance in order to facilitate downstream analysis68. After VST, surrogate variable analysis (SVA)23 was used to identify variation in gene expression not attributable to β-amyloid load or tau load. Specifically, a full model (with primary variables to be kept) and a null model (with intercept only) were built using the model.matrix() function, and then sva() function from sva r package was applied to determine surrogate variables (SVs) from the VST-processed gene expression matrix, in which sva function parameter dat was assigned with the gene expression matrix, mod as the full model, mod0 as the null model and method = irw. The identified SVs, which represent known and unknown confounding variables in our dataset, were later regressed out with removeBatchEffect function from limma R package69. The filtered, VST-processed, and surrogate variable regressed count matrix was used for all downstream analyses in this manuscript. As an additional exercise, we also attempted to regress out confounding variables individually. To do this, the NPH expression matrix was first quantile-normalized, followed by log2 transformation, and then the batch effect was removed through the ComBat R function from the sva package, and finally age, gender, and RIN were regressed out using the removeBatchEffect R function from the limma package. This method did not yield any significant genes at an FDR threshold of 0.1, confirming the relatively minimal transcriptomic changes in this tissue cohort. This also highlights the need to regress out both known and hidden confounders in our dataset in order to detect the relatively minimal changes in gene expression that we do find. Finally, we also performed differential gene expression analysis on the NPH count data as an additional exercise and used DESeq267 to do the following comparisons: (1) No AD pathology (no β-amyloid or tau, n = 32) vs. any AD pathology (either β-amyloid and/or tau, n = 74), (2) No β-amyloid pathology (n = 49) vs. any β-amyloid pathology (n = 57), and (3) No tau (n = 42) vs. any tau (n = 64). These analysis yielded 2, 19, and 4 genes passing FDR 0.1, respectively (Supplementary Data 1). These results further confirm the overall consistency of gene expression signatures in these biopsies and supports our view that this tissue is from patients with the earliest stages of AD pathology, before more extensive pathophysiologic changes have occurred.

For our analysis below, we combine RNA-seq data from all biopsies to achieve higher statistical power; this analysis is effectively investigating changes in gene expression shared by two areas of neocortex related to early AD pathology. This power is necessary to detect the relatively subtle changes in gene expression accompanying these early changes in AD pathology (see below). Analyzing frontal and parietal areas separately showed that both analyses trended in a way similar to the combined dataset (see Supplementary Data 1).

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A