RESA-jLR

Tianyun Zhang
Hanying Jia
Tairan Song
Lin Lv
Doga C. Gulhan
Haishuai Wang
Wei Guo
Ruibin Xi
Hongshan Guo
Ning Shen

The input to RESA-jLR combined all variants in the putative positive and negative sets of somatic SNVs. This input dataset was split into training and test sets at a 3:1 ratio: three quarters of the data were used to train the model, and one quarter was held out as an independent test set to evaluate model performance. Because class sizes in the training set were often imbalanced, random oversampling was applied to replicate observations in the minority classes, thereby rebalancing the dataset.

RESA-jLR, the joint logistic regression variant of RESA, was composed of two logistic regression models. One model used quality-based features such as variant quality, read depth, variant allele fraction, normalized genotype probabilities, and allele depth. The other model derived its features from sequence-based attributes such as mutation types, sequence contexts, and mutation signature components. Quality-based features were generally weighted similarly across datasets, whereas mutation sequence composition can be more sample-specific; the two feature groups were therefore modeled separately.

We trained the joint logistic regression model using the liblinear library, with L1 regularization applied to the quality-based model and L2 regularization applied to the sequence-based model, whose features were one-hot encoded. Each logistic regression model returned probabilities for the positive and negative classes, and users could specify their own thresholds on these probabilities. We then combined the two models into an integrated classifier with the following equation:

Where $w = 1$ if $P$ passes 0.5 and $w = 0$ otherwise, and $P_{seq}$ and $P_{pos}$ were the positive-class probabilities from the two regression models. RESA-jLR uses 0.5 as the default threshold, meaning that an SNV was called positive if $P_{pos}^{classifier}$ passed 0.5. The probability threshold was also exposed as a parameter so that users could modify it.
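
The data preparation and thresholding steps described above can be sketched with the Python standard library alone. This is an illustrative sketch, not the RESA-jLR implementation: the function names (`split_3_to_1`, `random_oversample`, `call_positive`), the toy variant records, the fixed random seed, and the inclusive `>=` comparison are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_3_to_1(records, seed=0):
    """Shuffle the records and split them 3:1 into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = (3 * len(shuffled)) // 4
    return shuffled[:cut], shuffled[cut:]


def random_oversample(records, label_of, seed=0):
    """Replicate randomly chosen minority-class records until every class
    is as large as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw with replacement from the same class to fill the gap.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    rng.shuffle(balanced)
    return balanced


def call_positive(prob, threshold=0.5):
    """Call a variant positive when its probability reaches the threshold.
    The inclusive comparison is an assumption of this sketch."""
    return prob >= threshold


# Hypothetical toy input: (variant_id, label) with 20 positives, 80 negatives.
variants = [(i, 1 if i < 20 else 0) for i in range(100)]

train, test = split_3_to_1(variants)
# Only the training set is rebalanced; the test set stays untouched.
balanced = random_oversample(train, label_of=lambda rec: rec[1])

print(len(train), len(test), len(balanced))
```

Oversampling only the training portion keeps the held-out quarter independent, so replicated minority-class records never leak into the evaluation.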

After training, we assessed the model's performance on the test set using the AUC score, which, unlike accuracy, is appropriate for imbalanced datasets. We also applied the model to the candidate SNVs in the unsure set to refine and extend the putative somatic variant set. This step recovered some true somatic variants that had been filtered out by the stringent criteria described earlier, enhancing sensitivity while maintaining high precision.
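
To illustrate why AUC is preferred over accuracy for imbalanced data, the sketch below computes AUC as the probability that a randomly chosen positive scores higher than a randomly chosen negative. It assumes that "AUC" here means the area under the ROC curve. The function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the protocol.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and
    negatives; tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced test set: 2 positives, 8 negatives.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1, 0.35, 0.25, 0.15, 0.05]

model_auc = roc_auc(labels, scores)

# A model that gives every variant the same score carries no information,
# yet calling everything negative is still right for most variants.
constant_auc = roc_auc(labels, [0.1] * len(labels))
majority_accuracy = sum(1 for y in labels if y == 0) / len(labels)

print(model_auc, constant_auc, majority_accuracy)
```

In this toy case the uninformative constant scorer reaches an AUC of only 0.5, while always predicting the majority class gives a misleadingly high accuracy of 0.8.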
