Raw sequencing data were received in FASTQ format. Reads were mapped using TopHat 2.0.6 [55] against the human reference genome assembly GRCh37 (http://feb2012.archive.ensembl.org/). The resulting SAM alignment files were processed using the HTSeq Python framework and the corresponding GTF gene annotation obtained from the Ensembl database [56]. Gene counts were further processed using the R programming language [57] and normalized to Reads Per Kilobase of transcript per Million mapped reads (RPKM) values. To examine the variance and the relationship of global gene expression across the samples, several correlation measures were computed, including Spearman’s correlation of gene counts and Pearson’s correlation of log2 RPKM values. The resulting correlation values were visualized using multidimensional scaling (MDS) plots and heatmaps (S2 Fig).
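The RPKM normalization and the Pearson correlation of log2 RPKM values described above can be illustrated with a minimal Python sketch. It is not the authors' R code. The function names, the use of the summed gene counts as the number of mapped reads, and the pseudocount of 1 added before the log2 transform are all assumptions made for illustration; the source states none of them.

```python
import math


def rpkm(counts, lengths_bp):
    """Convert raw gene counts of one sample to RPKM.

    counts:     dict gene -> read count
    lengths_bp: dict gene -> transcript length in base pairs
    Assumption: library size is the sum of the gene counts given here.
    """
    total_reads = sum(counts.values())
    if total_reads == 0:
        raise ValueError("sample has no counted reads")
    # RPKM = count * 10^9 / (mapped reads * transcript length in bp)
    return {g: c * 1e9 / (total_reads * lengths_bp[g]) for g, c in counts.items()}


def log2_transform(values, pseudocount=1.0):
    """log2(value + pseudocount); the pseudocount is an assumed choice."""
    return [math.log2(v + pseudocount) for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy data: two samples, three genes.
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}
sample1 = {"geneA": 500, "geneB": 1500, "geneC": 200}
sample2 = {"geneA": 450, "geneB": 1700, "geneC": 150}

genes = sorted(lengths)
r1, r2 = rpkm(sample1, lengths), rpkm(sample2, lengths)
r = pearson(log2_transform([r1[g] for g in genes]),
            log2_transform([r2[g] for g in genes]))
```

Computing `r` for every pair of samples would give the kind of correlation matrix that can then be passed to an MDS or heatmap routine.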
Subsequently, the Bioconductor packages DESeq [58] and edgeR [59] were used to identify differentially expressed genes (DEGs). Both packages test for differential expression in digital gene expression data using a model based on the negative binomial distribution. Non-normalized gene counts were used as input, since both packages include internal normalization procedures. The resulting p-values were adjusted using the Benjamini-Hochberg approach for controlling the false discovery rate (FDR) [60]. Genes with an adjusted p-value < 0.05 in both packages were considered differentially expressed.
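The multiple-testing adjustment and the two-package consensus rule can be sketched as follows. This is a plain-Python illustration of the standard Benjamini-Hochberg step-up adjustment, not the DESeq or edgeR implementation. The function names and the toy p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def consensus_degs(padj_a, padj_b, alpha=0.05):
    """Genes whose adjusted p-value is below alpha in both result sets.

    padj_a, padj_b: dict gene -> adjusted p-value (e.g. one per package).
    """
    return sorted(
        g for g in padj_a
        if g in padj_b and padj_a[g] < alpha and padj_b[g] < alpha
    )


# Hypothetical raw p-values from two methods for the same four genes.
genes = ["geneA", "geneB", "geneC", "geneD"]
padj_method1 = dict(zip(genes, bh_adjust([0.01, 0.04, 0.03, 0.005])))
padj_method2 = dict(zip(genes, bh_adjust([0.02, 0.20, 0.01, 0.001])))
degs = consensus_degs(padj_method1, padj_method2)
```

Requiring significance in both result sets is more conservative than using either package alone, since a gene called by only one method is discarded.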