For genes with multiple mapped probes, expression was taken as the average over all mapped probes. We also repeated the inter-dataset analysis using the maximum-variance probe instead of the average for multi-probe aggregation, and the results did not change appreciably. For classification, each sample was standardized to have mean 0 and standard deviation 1 across all genes. This standardization affects only the classification methods that are based on absolute expression, as GRAPE and DIRAC are invariant to any monotonic normalization. In the multi-dataset analyses, transcription profiles were restricted to the genes that occur in every dataset within the analysis, and standardization was performed over this common set of genes rather than over all of the genes in each dataset.
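The preprocessing steps described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`aggregate_probes`, `common_genes`, `standardize_sample`), the dictionary-based data layout, and the toy values are all hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import fmean, pstdev, pvariance


def aggregate_probes(probe_values, probe_to_gene, method="mean"):
    """Collapse probe-level rows (one value per sample) into one row per gene.

    method="mean"   averages all probes mapped to a gene, sample by sample.
    method="maxvar" keeps the single probe with the largest variance across samples.
    """
    rows_by_gene = {}
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # probes without a gene mapping are skipped
            rows_by_gene.setdefault(gene, []).append(values)

    result = {}
    for gene, rows in rows_by_gene.items():
        if method == "mean":
            result[gene] = [fmean(column) for column in zip(*rows)]
        elif method == "maxvar":
            result[gene] = list(max(rows, key=pvariance))
        else:
            raise ValueError(f"unknown aggregation method: {method}")
    return result


def common_genes(datasets):
    """Return the sorted genes present in every dataset (gene -> values dicts)."""
    return sorted(set.intersection(*(set(d) for d in datasets)))


def standardize_sample(values):
    """Scale one sample's expression values to mean 0 and standard deviation 1.

    Assumption: population standard deviation (divide by n).
    """
    mu = fmean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a sample with zero variance")
    return [(v - mu) / sd for v in values]


# Toy data: three probes measured in two samples; p1 and p2 both map to gene A.
probes = {"p1": [1.0, 3.0], "p2": [2.0, 6.0], "p3": [2.0, 2.0]}
mapping = {"p1": "A", "p2": "A", "p3": "B"}

by_mean = aggregate_probes(probes, mapping, method="mean")
by_maxvar = aggregate_probes(probes, mapping, method="maxvar")
```

In a multi-dataset analysis, `common_genes` would be applied first, and `standardize_sample` would then be called on each sample's values restricted to that shared gene list, which matches the order described above.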
All TCGA gene expression data were of type IlluminaHiSeq_RNASeqV2 and were downloaded using the R package “TCGA2STAT”. Two samples were excluded from the analysis because they were suspected outliers (see Additional file 1: Topic S2).