Once variant annotation information is gathered, all sequence effects caused by the variant are calculated and the features are extracted. Below is a high-level description of the extracted features. The equations used to compute each of the 20 features used in the TraP model are given in the Supplementary Methods.
Splice site changes: any change to the splice site motif is calculated using a Position Specific Scoring Matrix (PSSM) based on all human exons. Splice site strength before the substitution is calculated for both the 3′ splice site (3′ss) and the 5′ splice site (5′ss) of the harboring exon (or of the nearest exon if the variant is in an intron). Next, if the variant is within the splice site region, the splice site is scored again after the substitution. The 3′ss is regarded as the 20 nt upstream of the intron-exon junction plus the first 3 nt of the exon. The 5′ss is regarded as the last 3 nt of the exon plus the first 6 nt of the downstream intron.
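The before/after scoring described above can be sketched as follows. This is a minimal illustration, not the TraP implementation: the five training sequences are a toy set (the actual PSSM is built from all human exons), and the log-odds form with a uniform background and a pseudocount of 1 is an assumption. Only the 5′ss window definition (last 3 nt of the exon plus first 6 nt of the intron) is taken from the text; all function names are hypothetical.

```python
import math

def build_pssm(aligned_sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM from equal-length aligned sites (assumed scoring form)."""
    length = len(aligned_sites[0])
    pssm = []
    for pos in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for site in aligned_sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pssm.append({b: math.log2((counts[b] / total) / background) for b in "ACGT"})
    return pssm

def score_site(pssm, window):
    """Sum the per-position log-odds for a window of the same length as the PSSM."""
    if len(window) != len(pssm):
        raise ValueError("window length must match PSSM length")
    return sum(column[base] for column, base in zip(pssm, window))

def five_ss_window(exon_seq, downstream_intron):
    """5'ss window as defined in the text: last 3 nt of exon + first 6 nt of intron."""
    return exon_seq[-3:] + downstream_intron[:6]

# Toy training set (assumption; stands in for sites from all human exons).
TOY_5SS = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "AAGGTAAGT", "CAGGTGAGT"]
pssm = build_pssm(TOY_5SS)

exon = "ATGCCTCAG"
ref_window = five_ss_window(exon, "GTAAGTCCTT")
alt_window = five_ss_window(exon, "GCAAGTCCTT")  # T>C at intron position +2

ref_score = score_site(pssm, ref_window)
alt_score = score_site(pssm, alt_window)
print(f"5'ss score before: {ref_score:.2f}, after: {alt_score:.2f}")
```

The same two functions would apply to the 3′ss with a 23 nt window (20 intronic plus 3 exonic positions) and a PSSM trained on acceptor sites.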
Cryptic splice site creation/disruption: if the variant creates or disrupts a canonical dinucleotide (AG/GT), the flanking sequence is scored for its similarity to a splice site motif. The cryptic splice site PSSM score is calculated both with and without the variant.
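The first step of this check, detecting whether a substitution creates or disrupts an AG or GT dinucleotide, might look like the sketch below. It assumes a single-nucleotide substitution at a 0-based index on the transcribed strand; the function names are hypothetical, and the subsequent PSSM scoring of the flanking sequence is not shown.

```python
CANONICAL = ("AG", "GT")  # acceptor and donor dinucleotides named in the text

def canonical_dinucleotides(seq, pos):
    """Return (start, dinucleotide) pairs for AG/GT dinucleotides that overlap index pos."""
    hits = set()
    for start in (pos - 1, pos):
        if start >= 0 and start + 2 <= len(seq):
            dinuc = seq[start:start + 2]
            if dinuc in CANONICAL:
                hits.add((start, dinuc))
    return hits

def cryptic_site_changes(seq, pos, alt):
    """Compare canonical dinucleotides around pos before and after the substitution."""
    mutated = seq[:pos] + alt + seq[pos + 1:]
    before = canonical_dinucleotides(seq, pos)
    after = canonical_dinucleotides(mutated, pos)
    return {"created": sorted(after - before), "disrupted": sorted(before - after)}

# C>G at index 3 turns CCGCTACC into CCGGTACC, creating a GT.
created_example = cryptic_site_changes("CCGCTACC", 3, "G")
# G>C at index 3 turns TTAGCC into TTACCC, disrupting an AG.
disrupted_example = cryptic_site_changes("TTAGCC", 3, "C")
print(created_example, disrupted_example)
```

Each reported position would then be scored with the splice site PSSM on both the reference and the mutated flanking sequence.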
Interactions between splice sites: differences between existing and new splice sites are also calculated. These differences are later used as splicing effect weight factors to calculate more complex features such as the Splice Site Overall Score and the Variant Splice Score (F11 and F20 in Supplementary Methods and Supplementary Data 1). This follows the logic that an exon with a weak splice site will be highly affected by a variant creating a strong splice site, while a strong existing splice site will have no such effect.
Splicing regulatory binding sites: the TraP construction pipeline loads four datasets of major splicing regulatory proteins, SRSF1 (ref. 43), SRSF2 (ref. 44), SRSF5 (ref. 43) and SRSF6 (ref. 43), and one set of splicing silencer sequences calculated in silico (ref. 45). The variant is then tested for disruption or creation of binding site sequences in each of these regulatory sets.
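One way to test a variant against several motif sets is to compare the motif occurrences that overlap the variant position before and after the substitution. The sketch below assumes each dataset can be represented as a set of fixed k-mers; the two motif sets shown are made-up placeholders, not the published SRSF or silencer sequences, and the function names are hypothetical.

```python
def motif_hits(seq, pos, motifs):
    """Return (start, motif) for every motif occurrence whose span covers index pos."""
    hits = set()
    for motif in motifs:
        k = len(motif)
        first = max(0, pos - k + 1)
        last = min(pos, len(seq) - k)
        for start in range(first, last + 1):
            if seq[start:start + k] == motif:
                hits.add((start, motif))
    return hits

def binding_site_changes(seq, pos, alt, motif_sets):
    """Count motif occurrences created and disrupted by the substitution, per motif set."""
    mutated = seq[:pos] + alt + seq[pos + 1:]
    changes = {}
    for name, motifs in motif_sets.items():
        before = motif_hits(seq, pos, motifs)
        after = motif_hits(mutated, pos, motifs)
        changes[name] = {"created": len(after - before),
                         "disrupted": len(before - after)}
    return changes

# Placeholder motif sets (assumption; real sets would be loaded from the datasets).
TOY_MOTIF_SETS = {
    "toy_enhancer": {"GAAGAA"},
    "toy_silencer": {"TGAACC"},
}

# A>T at index 4 turns CCGAAGAACC into CCGATGAACC.
changes = binding_site_changes("CCGAAGAACC", 4, "T", TOY_MOTIF_SETS)
print(changes)
```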
CpG effects: DNA methylation changes can occur if the variant creates or disrupts a CpG dinucleotide. Recent studies show that DNA methylation affects transcription by changing the rate of RNA polymerase II and also affects exon recognition, and thus might contribute to the damaging effect of a variant (ref. 46). This feature was ultimately not incorporated into the final model because it did not improve the model's ability to distinguish pathogenic variants.
Overall, 14 general properties of the variant (such as coordinate, gene name, etc.) and 32 features are either collected in the information acquisition process or calculated by the feature extraction pipeline for each variant, of which 20 features are used in the TraP model (Supplementary Data 1).
The independent ability of each of the 20 selected features to differentiate between TraP-predicted pathogenic and benign variants in the ExAC dataset of 1.46 M synonymous variants was also examined using frequency distributions (Supplementary Figs. 5–24). This was done separately for TraP-predicted pathogenic variants (TraP ≥ 0.459) and TraP-predicted benign variants (TraP < 0.459). As the values of the TraP features are not always normally distributed in the ExAC dataset, we used a non-parametric Mann–Whitney U-test to test the null hypothesis that the distribution of the values of each feature for TraP-predicted pathogenic variants is equal to the distribution for TraP-predicted benign variants (Supplementary Figs. 5–24, bottom line). 18 of the 20 TraP features show a significant difference (i.e., ability to discriminate) between high and low TraP variants (P-value < 2.2 × 10⁻¹⁶). Of interest, the two remaining features are related to splicing regulatory functions: the ‘combined ESR Score’ and ‘Negated ESR Score’ features achieve P-values of 0.06 and 0.09, respectively. This suggests that the contribution to TraP originating from cis-acting elements of regulatory proteins is not as straightforward as that of the other features. We also provide a Spearman correlation matrix for the 20 features and the TraP score itself, using the same ExAC 1.46 M variants dataset, to help highlight the independent information that each feature provides to the TraP model (Supplementary Fig. 25).
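The per-feature comparison can be illustrated with a small, dependency-free sketch. It splits variants at the 0.459 threshold stated above and computes a two-sided Mann–Whitney U-test using the normal approximation without tie or continuity correction, which is a simplification and may differ from the statistical software used for the reported P-values. The eight (score, feature value) pairs are invented toy data, and the function names are hypothetical.

```python
import math

TRAP_THRESHOLD = 0.459  # pathogenic/benign cut-off stated in the text

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided P-value) via the normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return u1, math.erfc(abs(z) / math.sqrt(2))

# Toy (TraP score, feature value) pairs; not real ExAC data.
toy_variants = [(0.90, 8.0), (0.75, 6.0), (0.60, 7.0), (0.50, 5.0),
                (0.30, 2.0), (0.20, 4.0), (0.10, 1.0), (0.05, 3.0)]

pathogenic = [f for score, f in toy_variants if score >= TRAP_THRESHOLD]
benign = [f for score, f in toy_variants if score < TRAP_THRESHOLD]

u_stat, p_value = mann_whitney_u(pathogenic, benign)
print(f"U = {u_stat}, P = {p_value:.4f}")
```

For datasets on the scale of 1.46 M variants, an established statistics library with tie correction would be the more appropriate choice.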