Two simulated control sequence sets, the shuffle control and the genomic control, were used to estimate the expected frequency of affinity-changing substitutions. The relative size of the simulated control sets depended on the total number of mutation calls in a particular cancer type. A lower relative size of the control data sets was used for the larger mutation sets (see Additional file 1: Table S1). Higher relative sizes of the control data sets were used for the cancers with lower numbers of mutation calls to provide stable estimates of expected frequencies.
The shuffle control set was obtained by shuffling the flanking sequences within [−50;+50] bp around the mutated base while keeping the mutation context (the immediate 5' and 3' nucleotides) and the substitution itself intact. Multiple shuffles were generated for each mutation (Additional file 1: Table S1). This was the only step where the window length was explicitly used.
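The shuffling step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `shuffle_control` and its parameters are hypothetical, and the sketch assumes that the 5' and 3' flanks are shuffled jointly, which the text does not specify.

```python
import random

def shuffle_control(window, alt, flank=50, rng=None):
    """Return (reference, mutated) shuffled windows for one mutation.

    window: mutation-centered sequence of length 2*flank + 1
    alt:    alternative (mutated) nucleotide
    The central triplet (5' neighbor, mutated base, 3' neighbor) stays fixed.
    """
    rng = rng or random.Random()
    if len(window) != 2 * flank + 1:
        raise ValueError("window must span [-flank; +flank] around the mutation")
    context = window[flank - 1:flank + 2]        # triplet kept intact
    left = list(window[:flank - 1])              # 5' flank without the neighbor
    right = list(window[flank + 2:])             # 3' flank without the neighbor
    pool = left + right                          # assumption: flanks shuffled jointly
    rng.shuffle(pool)
    new_left = "".join(pool[:len(left)])
    new_right = "".join(pool[len(left):])
    ref_window = new_left + context + new_right
    alt_window = new_left + context[0] + alt + context[2] + new_right
    return ref_window, alt_window
```

Calling the function several times per mutation with different random states would yield the multiple shuffles mentioned above.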
The windows for the genomic control were sampled from intronic and promoter regions so that they did not overlap the cancer mutation-centered windows. In each [−50;+50] bp segment, the central base and its neighboring 5' and 3' nucleotides were identical to the mutation context of a given somatic mutation locus, and the respective alternative nucleotide was added. For each somatic mutation several genomic control windows were extracted; their number depended on the total number of mutations for a particular cancer type (Additional file 1: Table S1).
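A rough sketch of the genomic control sampling is given below. It is only an illustration under stated assumptions: `sequence` stands for one intronic or promoter region given as a plain string, `excluded` lists inclusive coordinates of the cancer mutation-centered windows in that region, and the function name and signature are hypothetical.

```python
import random

def genomic_control_windows(sequence, context, alt, excluded, n, flank=50, rng=None):
    """Sample up to n control windows whose central triplet equals `context`.

    excluded: list of (start, end) inclusive intervals that control windows
              must not overlap (the cancer mutation-centered windows).
    Returns a list of (reference_window, mutated_window) pairs.
    """
    rng = rng or random.Random()
    candidates = []
    for pos in range(flank, len(sequence) - flank):
        if sequence[pos - 1:pos + 2] != context:
            continue                              # triplet must match the mutation context
        start, end = pos - flank, pos + flank     # inclusive window bounds
        if any(start <= e_end and e_start <= end for e_start, e_end in excluded):
            continue                              # overlaps a cancer window
        candidates.append(pos)
    chosen = rng.sample(candidates, min(n, len(candidates)))
    return [
        (sequence[p - flank:p + flank + 1],
         sequence[p - flank:p] + alt + sequence[p + 1:p + flank + 1])
        for p in chosen
    ]
```

The sketch returns fewer than `n` windows when the region does not contain enough matching, non-overlapping positions.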
Both the shuffle and genomic controls were used to predict transcription factor binding sites in the same way as for the cancer data. For each binding motif, the windows with binding site predictions for the germline alleles were used to evaluate the statistical significance of affinity loss. Likewise, the windows with binding sites predicted for the simulated mutated alleles were used to evaluate the statistical significance of affinity gain. The windows with predictions for both alleles participated in both types of analysis (Fig. 1), and the windows without predictions were discarded.
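The assignment of windows to the two analyses can be written as a small decision rule. The function below is a hypothetical helper for illustration; the two boolean inputs are assumed to come from the binding site prediction step.

```python
def assign_analyses(ref_has_site, alt_has_site):
    """Decide which analyses a window enters for one binding motif.

    ref_has_site: a binding site is predicted for the germline allele
    alt_has_site: a binding site is predicted for the mutated allele
    An empty result means the window is discarded.
    """
    analyses = set()
    if ref_has_site:
        analyses.add("loss")   # affinity loss is tested on germline predictions
    if alt_has_site:
        analyses.add("gain")   # affinity gain is tested on mutated-allele predictions
    return analyses
```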
Since binding site predictions depended on the nucleotide composition and, consequently, on the mutation contexts (the 5' and 3' nucleotides proximal to the mutated base), we equalized the mutation context distributions of the test and control data for each particular cancer type before the statistical evaluation. To achieve this, we sampled the windows with binding site predictions in the control data (both shuffle and genomic) to match the mutation context distribution of a particular cancer, separately for each binding motif.
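One possible way to perform such context-matched sampling is sketched below. The names `equalize_contexts` and `ratio` are hypothetical, contexts are represented by arbitrary string labels, and sampling without replacement is an assumption rather than a documented detail.

```python
import random
from collections import Counter, defaultdict

def equalize_contexts(test_contexts, control_windows, ratio=1, rng=None):
    """Sample control windows to match the context distribution of the test data.

    test_contexts:   context label of every test window with a prediction
    control_windows: list of (context, window_id) pairs with predictions
    ratio:           requested number of control windows per test window
    Returns (sampled control windows, number of windows that could not be drawn).
    """
    rng = rng or random.Random()
    target = Counter(test_contexts)
    by_context = defaultdict(list)
    for ctx, window_id in control_windows:
        by_context[ctx].append(window_id)
    sampled, missing = [], 0
    for ctx, n_test in target.items():
        need = n_test * ratio
        pool = by_context.get(ctx, [])
        take = min(need, len(pool))               # cannot draw more than available
        sampled.extend((ctx, w) for w in rng.sample(pool, take))
        missing += need - take                    # shortfall for this context
    return sampled, missing
```

The second return value reports how many control predictions were unavailable, so that any shortfall can be handled explicitly in later steps.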
In a limited number of cases there were not enough control data to completely equalize the context distributions (see Additional file 2: Table S2). Yet, even for cancer types with low numbers of mutation calls, where the required relative size of the control data sets was extremely large, no less than 95 % of predictions with matching contexts were successfully sampled from the control data. Importantly, for cancer types with abundant mutation calls, context equalization was almost perfect (99.9–100 % match of the context distributions, with a non-perfect match only for exceptional motifs, see Additional file 2: Table S2), since a lower relative size of the control data set was generally required (see Additional file 1: Table S1). During significance evaluation (see below), the "missing" control predictions were treated conservatively, as if they made the contingency tables more uniform (i.e. reducing the difference and thus its possible statistical significance).
Thus, for each binding motif we obtained the final sets of mutation-centered windows with binding sites overlapping or located in the close vicinity of mutations, for the test and control data with equalized mutation context distributions. This eliminated possible bias from the non-randomness of mutational signatures and made it possible to compare the frequencies of binding site alterations in cancer versus control data.
The events of mutation-induced motif changes were counted for each cancer type and for the control data sets (shuffle and genomic) using the same procedure. For each binding motif, Fisher's exact test was computed on 2×2 contingency tables (substantial affinity loss or gain versus non-substantial affinity change or no change; cancer mutations versus the control data, separately for the shuffle and genomic controls; see Fig. 1 for a scheme).
Only cases with a Fisher's exact test P-value passing the 0.05 threshold after FDR correction (for 278 tested binding motifs) in both comparisons (versus the shuffle and versus the genomic controls) were considered significant for a particular cancer type.
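The significance evaluation can be sketched in plain Python as follows. Two details are assumptions and not stated in the text: the test is taken as two-sided, and the FDR correction is taken to be the Benjamini-Hochberg procedure. All function names are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Rows: cancer mutations, control data.
    Columns: substantial affinity change, non-substantial or no change.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):                                  # hypergeometric probability
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # sum over all tables at least as extreme as the observed one
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                  # from largest to smallest P-value
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def significant_motifs(p_shuffle, p_genomic, alpha=0.05):
    """Indices of motifs passing the threshold against both controls."""
    q_shuffle = benjamini_hochberg(p_shuffle)
    q_genomic = benjamini_hochberg(p_genomic)
    return [i for i in range(len(p_shuffle))
            if q_shuffle[i] <= alpha and q_genomic[i] <= alpha]
```

In this sketch each list of P-values would hold one entry per tested binding motif (278 in the study), computed separately for the shuffle and the genomic control.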
For selected motifs we also assessed the localization of mutations relative to the binding motif predictions (see the specific section in Results; the workflow is shown in Fig. 4).