RNA-seq and differential expression power analysis

BF Benjamin Jung Fair
LB Lauren E Blake
AS Abhishek Sarkar
BP Bryan J Pavlovic
CC Claudia Cuevas
YG Yoav Gilad
request Request a Protocol
ask Ask a question
Favorite

39 RNA-seq fastq files for human left ventricle were chosen at random from GTEx v7 (Figure 1—source data 1, see Acknowledgements). Additionally, the 10 human and 18 chimpanzee RNA-seq libraries previously generated (Pavlovic et al., 2018), and all the novel chimpanzee RNA-seq libraries generated in this study are described in Figure 1—source data 1. Fastq files for samples that were sequenced on multiple lanes were combined after confirmation that all gene expression profiles for all fastq files cluster primarily by sample and not by sequencing lane. GTEx-derived fastq files were trimmed to 75 bp single-end reads to match the non-GTEx sequencing data. Reads were aligned to the appropriate annotated genome (GRCh38.p13 or Pan_tro_3.0 from Ensembl release 95; Zerbino et al., 2018) using STAR aligner (Dobin et al., 2013) default parameters. Only chromosomal contigs were considered for read alignment throughout this work. That is, unplaced contigs were excluded from the reference genome. Gene counts were obtained with subread featureCounts (Liao et al., 2014) using a previously described annotation file of human-chimpanzee orthologous exons (Pavlovic et al., 2018). The gene expression matrix was converted to CountsPerMillion (CPM) with edgeR (Robinson et al., 2010) to normalize reads to library size. The mean-variance trend was estimated using limma-voom (Ritchie et al., 2015). Genes with less than 6 CPM in all samples were excluded from further analysis. This cutoff was chosen based on visual inspection of the voom mean-variance trend to identify where the trend becomes unstable. To normalize differences in orthologous exonic gene size between human and chimpanzee, we converted the log(CPM) matrix to log(RPKM) based on the species-specific length of orthologous exonic regions. We then visually inspected PCA plots and hierarchical clustering to identify potential outliers and batch effects (Figure 1). We note that although all the chimpanzee RNA isolation and sequencing were prepared separately from GTEx heart samples, PCA and clustering analysis suggest that the inter-species differences vastly outweigh the technical batch effects (based on the inter-species but within-batch samples sourced from Pavlovic et al., 2018). The dataset of 49 human samples and 39 chimpanzee samples was culled to 39 human and 39 chimpanzee samples based on exclusion of 5 obvious human outliers which did not cluster with the rest and had among the lowest read depths (Figure 1A–B). The remaining human samples to exclude to reach a balanced set of 39 human and 39 chimpanzee samples were chosen by excluding the remaining GTEx samples with the lowest mapped read depths. Differential expression was tested using limma (Ritchie et al., 2015), using the eBayes function with default parameters and applying Benjamini-Hochberg FDR estimation. For Figure 1—figure supplement 2, this process was repeated at varying sample size (sampling with replacement) and at various read depths. Sequencing depth subsampling analysis was performed at the level of bam files to obtain matched numbers of mapped reads across samples and differential expression analyses was repeated.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A