The HTSeq (version 0.6.p1)26 pipeline is used for calculating the number of reads that align to different genes in the genome. As mentioned previously, only reads that can be uniquely assigned to a gene are counted. The GDC ran HTSeq on all samples as unstranded libraries in order to maintain consistency for cross-sample comparisons.
The raw counts are normalized into Fragment per kilobase million mapped reads (FPKM) and Upper Quartile Normalized FPKM (FPKM-UQ) using all protein-coding genes as the denominator.
where,
RCg: Number of reads mapped to the gene
RCpc: Number of reads mapped to all protein-coding genes
RCg75: The 75th percentile read count value for genes in the sample
L: Length of the gene in base pairs
The GDC has also generated exon-level quantification using the DEXSeq27,46 pipeline. The first step in this pipeline is to create the flattened General Feature Format (GFF) file, which essentially collapses the information for multiple transcripts spanning the same exon into exon counting bins for that exon. Once the flattened GFF file is obtained, the number of reads that overlap with each exon counting bin are calculated. The result is a flat file which has raw counts for each exon. This data type is not currently available in the GDC data portal and will appear in a later data release.
The GDC miRNA harmonization pipeline begins with a realignment of TCGA and TARGET miRNA-Seq reads using a similar strategy of the GDC DNA-Seq alignment pipeline. Because reads of miRNA-Seq are typically short, only BWA Aln was used.
miRNA quantification is done with a modified version of the miRNA Profiling Pipeline v0.2.732 from BCGSC (British Columbia Genome Sequencing Center). In this pipeline, miRNA species and miRNA isoforms are counted differently, and normalized Reads Per Million values are also derived. The final results from each miRNA-Seq sample is a miRNA species quantification file and a miRNA isoform quantification file, in a human-readable format compatible to the original TCGA data.
The hg19-based probeset metadata were obtained from the Affymetrix website, and then lifted over to GRCh38. Probes with reference bases not matching between hg19 and GRCh38 were removed.
To generate Copy Number Segment file, all SNP and CNV probes are used for CBS calculation, with the only exception that probes in the Pseudo-Autosomal (PAR) regions were removed in males prior to calculation. To generate the Masked Copy Number Segment file from this result, all probesets in chromosome Y and in the frequent copy number variant regions in germlines obtained from GenePattern were also removed prior to calculation.
Using probe sequence information provided in the manufacturer’s manifest, HM27 and HM450 probes were remapped to the GRCh38 reference genome47. Type II probes with a mapping quality of <10, or Type I probes for which the methylated and unmethylated probes map to different locations in the genome, and/or had a mapping quality of <10, had an entry of ‘*’ for the ‘chr’ field, and ‘−1’ for coordinates47. These coordinates were then used to identify the associated transcripts from GENCODE v22, the associated CpG island, and the CpG sites’ distance from each of these features. Multiple transcripts overlapping the target CpG were separated with semicolons. Beta values were inherited from existing TCGA Level 3 DNA methylation data (hg19-based) based on Probe IDs.
The same genetic variant can be represented in VCF format in multiple different ways48, and many of these discrepancies can not be easily solved by existing normalization tools. In order to reduce false-positive annotations, GDC requires a strict matching of Chromosome, Position, and Alternative Alleles during implementation of MAF annotations. However, in various variant comparisons in this paper, we applied a loose matching strategy to regard two variants the same if they have overlapping regions between starting and ending positions. This is particularly useful when a non-INDEL caller, such as SomaticSniper or MuSE, represents INDEL sites as point mutations.
mRNA expression count, miRNA expression count, Copy Number Segmentation, and Methylation Beta Values were collected from the GDC Data Portal. For mRNA expression and miRNA expression, we removed low-expressed genes and miRNAs if 99% or more samples have less than or equal to 1 count. For Copy Number Segments, we computed average segmentation means on each gene weighted on overlapped length between segments and genes. For methylation data, we removed probes that are empty in more than 5% of the samples, and imputed the remaining empty values with the probeset mean. The methylation data is also randomly down-sampled by 25% of the probes to reduce computational burden.
To integrate these data together for t-SNE clustering, we first computed the top 200 Principal Components (PCs) from each of the four data types using the R function prcomp. The top 200, ranked by “variation explained”, from these 800 combined PCs were selected and further scaled to the arbitrary weights of 3:3:1:1 for mRNA:Methylation:miRNA:CNV, and finally used as input for the t-SNE analysis with the R package Rtsne. We gave less weights to miRNA and copy number because they don’t separate cancer types as well compared to the other two data types. We found that the 3:3:1:1 ratio gives the best visual performance in the scatter plot compared to the other combinations tested. We ran t-SNE 1000 times with random seeds and displayed the result that minimizes the cost function49.
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.