Other driver prediction methods

Olivier Collier
Véronique Stoven
Jean-Philippe Vert

We compare the performance of LOTUS to five other state-of-the-art methods: MutSigCV [21], a frequency-based method; TUSON [26] and 20/20+ [28], which combine frequency and functional information; MUFFINN [32], which propagates mutation information over a functional gene network; and DiffMut, which analyses mutation profiles on genes.

MutSigCV searches for driver genes among significantly mutated genes while adjusting for known covariates of mutation rates. The method estimates a background mutation rate for each gene and patient, based on the observed silent mutations in the gene and the noncoding mutations in the surrounding regions. By incorporating mutational heterogeneity, MutSigCV eliminates implausible driver genes that are often predicted by simpler frequency-based models. For each gene, the mutational signal from the observed non-silent counts is compared to the mutational background. The output of the method is a list of all considered genes, ordered by a p-value that reflects how likely each gene is to be a driver gene.
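The comparison of observed non-silent counts with a background rate can be illustrated with a minimal sketch. It assumes a single background rate per gene and a plain binomial upper-tail test, which is far simpler than the per-gene, per-patient background model described above; the function names, gene names, and toy numbers are hypothetical and are not taken from MutSigCV.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def rank_genes(observed, background_rate, n_sites):
    """Rank genes by increasing p-value of their observed non-silent count.

    Assumption: one background rate per gene (the real method also
    models patient-level heterogeneity and covariates).
    """
    pvalues = {
        gene: binomial_upper_tail(count, n_sites[gene], background_rate[gene])
        for gene, count in observed.items()
    }
    return sorted(pvalues, key=pvalues.get), pvalues


# Hypothetical toy data: non-silent counts, background rates, mutable sites.
observed = {"GENE_A": 9, "GENE_B": 2, "GENE_C": 4}
background_rate = {"GENE_A": 0.01, "GENE_B": 0.01, "GENE_C": 0.03}
n_sites = {"GENE_A": 200, "GENE_B": 200, "GENE_C": 200}

ranking, pvalues = rank_genes(observed, background_rate, n_sites)
print(ranking)
```

In this toy example GENE_C has more mutations than GENE_B but ranks lower, because its assumed background rate is three times higher. This is the effect that adjusting for the background is meant to capture.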

TUSON uses gene features that encode mutation frequency and the functional impact of mutations. The underlying idea is that the proportions of mutation types observed in a given gene can be used to predict the likelihood that this gene is a cancer driver. After identifying the most predictive parameters for OGs and TSGs on a training set (called the TUSON train set in the present paper), TUSON uses a statistical model that derives, for each gene, a p-value characterizing its potential to be an OG or a TSG. It then scores all genes in the COSMIC database to obtain two lists of genes ranked by increasing p-value, one for OGs and one for TSGs.

The 20/20+ method encodes genes using mutation frequency, mutation types, and other biological information. It uses a training set of OGs and TSGs (called the 20/20 train set in the present paper) to train a random forest. The random forest is then applied to the COSMIC database, and the output of the method is again a list of genes ranked by their predicted score of being a driver gene [28]. We did not implement this method, so we evaluated its performance only on its original training set, the 20/20 dataset. Moreover, we computed the CE in the same way as for MutSigCV and TUSON, which should actually give an advantage to 20/20+, since it is harder to make predictions in a cross-validation loop using a smaller set of known driver genes.

DiffMut uses a dataset of somatic mutations and a dataset of healthy genomes, but no training set of known driver genes. It compares the mutation profile of each gene in the mutation dataset with the nucleotide variation profile of the same gene in the healthy genomes, and computes for every gene a score that is used to rank all genes according to their potential as OG or TSG.
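The idea of scoring a gene by the difference between its tumour and healthy profiles can be sketched as follows. The sketch assumes that each profile is a histogram over the same bins (for example, the number of samples carrying 0, 1, 2, or 3+ variants in the gene) and uses a one-dimensional earth mover's distance as the comparison. Both the binning and the choice of distance are illustrative assumptions, not necessarily the score that DiffMut computes, and all names and numbers are hypothetical.

```python
def normalize(hist):
    """Scale a histogram of counts so that it sums to 1."""
    total = sum(hist)
    return [h / total for h in hist]


def earth_movers_distance(p, q):
    """1-D earth mover's distance between two histograms on identical bins."""
    distance, carried = 0.0, 0.0
    for pi, qi in zip(p, q):
        carried += pi - qi        # mass that still has to move to later bins
        distance += abs(carried)  # cost of moving it one bin further
    return distance


def rank_by_profile_difference(tumour, healthy):
    """Rank genes by decreasing distance between tumour and healthy profiles."""
    scores = {
        gene: earth_movers_distance(normalize(tumour[gene]), normalize(healthy[gene]))
        for gene in tumour
    }
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical toy profiles: samples with 0, 1, 2, 3+ variants in the gene.
tumour = {"GENE_A": [10, 20, 30, 40], "GENE_B": [60, 30, 10, 0]}
healthy = {"GENE_A": [70, 20, 10, 0], "GENE_B": [70, 20, 10, 0]}

ranking, scores = rank_by_profile_difference(tumour, healthy)
print(ranking)
```

No list of known driver genes enters this computation, which mirrors the unsupervised nature of the approach described above.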

MUFFINN uses a dataset of somatic mutations and extracts the number of non-synonymous mutations per gene. Then, it computes a score (either DNmax or DNsum) and propagates these scores on a functional gene network (either HumanNet or STRING). The final scores yield four different rankings of all genes, one for each combination of score and network. Among these four possibilities, we systematically used the version that yields the best result for MUFFINN. Note, however, that in practice the user would not know which version should be preferred.
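The two propagation variants can be illustrated with a simplified sketch. It assumes that DNmax adds to each gene's own mutation count the count of its most mutated direct neighbour, and that DNsum adds the counts of all direct neighbours. It also assumes an unweighted network, whereas the published method may weight or normalize neighbour contributions using edge scores. The function name, gene names, and network are hypothetical.

```python
def propagate(mutation_counts, network, mode="DNmax"):
    """Combine each gene's own count with counts of its direct neighbours.

    Assumption: unweighted edges; `network` maps a gene to its neighbours.
    """
    scores = {}
    for gene, own in mutation_counts.items():
        neighbour_counts = [mutation_counts.get(n, 0) for n in network.get(gene, ())]
        if mode == "DNmax":
            contribution = max(neighbour_counts, default=0)
        elif mode == "DNsum":
            contribution = sum(neighbour_counts)
        else:
            raise ValueError(f"unknown mode: {mode}")
        scores[gene] = own + contribution
    return scores


# Hypothetical toy data: non-synonymous mutation counts and a small network.
mutation_counts = {"G1": 5, "G2": 1, "G3": 0, "G4": 2}
network = {"G1": ["G2"], "G2": ["G1", "G3", "G4"], "G3": ["G2"], "G4": ["G2"]}

dnmax = propagate(mutation_counts, network, mode="DNmax")
dnsum = propagate(mutation_counts, network, mode="DNsum")
print(dnmax)
print(dnsum)
```

In this example G2 is rarely mutated itself but sits next to the frequently mutated G1, so both variants raise its score. Under DNsum it becomes the top-ranked gene, which shows how the choice of variant can change the final ranking.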
