The first step in variant protein identification was the generation of protein sequence databases containing the modified amino-acid sequences (Additional file 1: Figure S1a). Briefly, protein-level outputs from the Variant Effect Predictor [39] were parsed into protein sequences containing single amino-acid variants, insertions, deletions, frameshifts, stop-loss mutations, and fusions. Variant peptides were filtered against a canonical human proteome from UniProt (20,187 non-redundant proteins) to remove peptides that also mapped to this reference database. Variant sequences longer than six amino acids and containing up to two missed tryptic cleavages on either side of the mutated site were then generated and added to the FASTA file.
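The peptide-generation step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`tryptic_fragments`, `variant_peptides`), the 0-based `mut_pos` argument, the "no cleavage before proline" trypsin rule, and the substring-based reference filter are all assumptions made for illustration. The sketch digests a variant protein in silico, keeps peptides that span the mutated residue with up to two missed cleavages on each side, requires a length greater than six, and discards any peptide that also occurs in a reference sequence.

```python
import re
from typing import Iterable, List


def tryptic_fragments(protein: str) -> List[str]:
    """Split after K or R, but not before P (a common trypsin rule; assumed here)."""
    return [f for f in re.split(r"(?<=[KR])(?!P)", protein) if f]


def variant_peptides(
    protein: str,
    mut_pos: int,                 # 0-based index of the mutated residue (hypothetical input)
    reference: Iterable[str],     # canonical protein sequences used for filtering
    max_missed: int = 2,          # missed cleavages allowed on each side of the site
    min_len: int = 7,             # "longer than six amino acids"
) -> List[str]:
    frags = tryptic_fragments(protein)
    reference = list(reference)

    # Find the tryptic fragment that contains the mutated residue.
    start, idx = 0, None
    for i, frag in enumerate(frags):
        if start <= mut_pos < start + len(frag):
            idx = i
            break
        start += len(frag)
    if idx is None:
        raise ValueError("mut_pos lies outside the protein sequence")

    peptides = set()
    for left in range(max_missed + 1):
        for right in range(max_missed + 1):
            lo, hi = idx - left, idx + right
            if lo < 0 or hi >= len(frags):
                continue  # not enough flanking fragments on this side
            pep = "".join(frags[lo:hi + 1])
            # Keep only sufficiently long peptides absent from the reference.
            if len(pep) >= min_len and not any(pep in ref for ref in reference):
                peptides.add(pep)
    return sorted(peptides)


if __name__ == "__main__":
    ref = "AAAAAAKCCCCCCKDDDDDDKEEEEEEK"
    var = "AAAAAAKCCWCCCKDDDDDDKEEEEEEK"  # toy C->W substitution at index 9
    for pep in variant_peptides(var, 9, [ref]):
        print(pep)
```

Because the digest is run on the variant sequence itself, substitutions that create or remove a K/R cleavage site are handled without special cases. A production workflow would more likely match peptides against an indexed digest of the reference proteome than use substring search.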
We explored variant-peptide detection with respect to proteogenomic database size and content. Variant proteins were obtained from five different sources: dbSNP [20]; COSMIC [21]; UniProt [22]; exome sequencing (exome-seq) [36]; and RNA-seq [37]. Augmented search databases were created in 23 different ways derived from combinations and subsets of these databases (Additional file 1: Figure S1b; Additional file 2). We defined community-based databases to include dbSNP, COSMIC, and variants annotated in UniProt. Four sub-databases of COSMIC and dbSNP were made to include, respectively, single nucleotide variants; indels; variants affecting genes in the COSMIC Cancer Gene Census; and frameshifts, stop losses, or fusions. For sample-specific database searches, all 59 NCI60 cell-lines with exome-seq data and 41 cell-lines with RNA-seq data were used. Three further databases restricted to subsets of variants were generated for a total of four sample-specific databases per cell-line and per analyte type. We combined sample-specific and community-based databases in two different ways: a sample-specific approach, and a general approach in which all RNA-seq and exome-seq datasets were merged. In total, the RNA-seq data characterized 675 cell-lines; these data were also included separately in their own database, as were all the exome-seq data. A total of 473 different database combinations (Additional file 3; Additional file 1: Figure S1b) were explored across all available cancer cell-lines.