For our analyses, we used somatic mutations identified with MuTect [36] from Mutation Annotation Format (MAF) files for 18 cancer types (see Additional file 1: Table S1) in TCGA, retrieved using gdc-scan (v1.0.0) [37]. The MAF files were then converted to tumor-normal pair variant call format (VCF) files using the maf2vcf tool in the vcf2maf software package [38], with the GRCh38/hg38 genome build available from the Broad Institute resource bundle [39] used as the reference genome. Because these VCFs still contained data for both tumor and normal samples, they were then manipulated to remove data from the paired normal sample, leaving final, tumor-only VCF files for compatibility with pVAC-Seq, which accepts only single-sample VCFs. Each disease type consisted of a variable number of patients (see Additional file 1: Figure S1).
We then annotated these VCF files using Variant Effect Predictor (VEP, v88) [40]. VEP was run according to pVAC-Seq’s recommendations [41] with the Downstream and Wildtype VEP plugins [42] used, gene symbols added to output where available, and mutation consequence terms based on Sequence Ontology annotation guidelines [43]. We also used VEP’s GRCh38 annotation cache (rather than querying remotely) for efficiency.
As the TCGA data we obtained did not allow us to calculate patient-specific HLA types, we assumed each tumor could occur in the setting of any HLA allele type, allowing us to explore neoepitope distributions among a broader theoretical population. To do this, we generated a list of HLA alleles to use for subsequent analysis based on allele frequencies originating from the Allele Frequency Net Database [44] and summarized for use in the software POLYSOLVER (v1.0) [45]. The average frequencies across races (Asian, Black, and Caucasian) of alleles for each HLA gene (HLA-A, HLA-B, and HLA-C) were calculated and normalized to sum to 100%. We then selected the top 145 average-frequency HLA alleles for subsequent analysis (see Additional file 1: Tables S2 – S4), encompassing all HLA alleles among 99% of individuals in the general population.
pVAC-Seq (v4.0.8) was run for each patient and allele combination using 9mer epitopes generated from 17-mer peptides surrounding each missense mutation, and using MHC binding predictions generated by NetMHCpan (v2.8) [46]. For each resulting neoepitope from pVAC-Seq, additional metrics were applied as described below. Note that for the purposes of this study, only epitopes resulting from missense mutations were considered for further analysis, and all peptides were considered to be expressed at equal levels. Note also that neoepitopes from breast cancer (BRCA), cervical cancer (CESC) and melanoma (SKCM) were not assessed for protein sequence similarity against peptides other than their paired normal epitopes.
To assess the degree to which HLA alleles might have overlapping preference for putatively novel binding neoepitopes predicted for mutations across TCGA (see Neoepitope Prioritization Metrics), 1000 random sets of six of the previously described set of HLA alleles (two HLA-A, two HLA-B, and two HLA-C alleles) were chosen using the random.sample function (without replacement) from the random module in Python 2.7.13 [47] (the combinations tested are available in Additional file 2: Table S5). All unique amino acid sequences of neoepitopes that bound to one or more alleles within each random allele set were counted; separate counts were kept for neoepitopes that bound to one, two, three, four, five, or six of the six alleles (i.e. increasing levels of overlap). The script for randomly sampling allele sets and determining overlap is available in our GitHub repository [48].
For comparison, we assessed recurrence rates among 2,813,809 simulated neoepitopes (9mers) mirroring the size of the TCGA data set. These neoepitopes were drawn randomly from the GRCh38 peptidome, with subsequent introduction of a random single amino acid substitution at a random position along each 9mer. These simulated peptides were labeled by patient and disease site to produce a random set of peptides for each patient equivalent in size to that patient’s predicted neoepitope repertoire. We repeated this process again for a smaller set of 1000 simulated neoepitopes to assess trends in peptide similarity scores. The gene of origin of the random peptide and the gene corresponding to its closest peptide match in the human proteome were retained for protein sequence similarity analysis (see “Tumor vs. closest human peptide sequence similarity” below).
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.
 Tips for asking effective questions
+ Description
Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.