RESA-jLR

Tianyun Zhang; Hanying Jia; Tairan Song; Lin Lv; Doga C. Gulhan; Haishuai Wang; Wei Guo; Ruibin Xi; Hongshan Guo; Ning Shen

This method is extracted from research article: Genome Med, Dec 2023

De novo identification of expressed cancer somatic mutations from single-cell RNA sequencing data

DOI: 10.1186/s13073-023-01269-1

RESA combines all putative positive and negative sets of somatic SNVs, and the variants in these combined sets serve as the input to RESA-jLR. The input dataset was split into training and test sets at a 3:1 ratio: three quarters of the data were used to train the model, and one quarter was held out as an independent test set to evaluate model performance. Because the class sizes in the training set were often imbalanced, random oversampling was applied to replicate observations in the minority classes and thereby rebalance the dataset.

RESA-jLR is the joint composition of two logistic regression models. One model used quality-based features such as variant quality, read depth, variant allele fraction, normalized genotype probabilities, and allele depth. The other model used sequence-based features such as mutation types, sequence contexts, and mutation signature components. Quality-based features were generally weighted similarly across datasets, whereas mutation sequence composition can be more sample-specific; hence, the two feature groups were modeled separately. We trained the joint logistic regression model using the liblinear library, with L1 regularization applied to the quality-based model and L2 regularization applied to the one-hot-encoded sequence-based model. Each logistic regression model returned probability values for the positive and negative classes, and users can specify their own thresholds on these probabilities. We then combined the two models into an integrated classifier with the following equation:

Where $w = \begin{cases} 1, & \text{if } P \geq 0.5 \\ 0, & \text{otherwise} \end{cases}$, and $P_{seq}$ and $P_{pos}$ were the positive-class probabilities of the two regression models. RESA-jLR uses 0.5 as the default threshold, meaning that an SNV is defined as positive if $P(pos_{classifier}) \geq 0.5$. The probability threshold is also exposed as a parameter so that users can modify it.
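The training steps described above can be sketched in Python. This is a minimal illustration rather than the RESA-jLR code: it assumes scikit-learn as the interface to liblinear, uses synthetic stand-in data, and introduces hypothetical names (`random_oversample`, `X_qual`, `X_seq`, `p_qual`, `p_seq`). The stratified split and the fixed random seeds are also assumptions. The sketch stops at the two per-model probabilities and does not attempt the integrated classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder


def random_oversample(y, rng):
    """Return row indices that replicate minority-class rows up to the majority count."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    picked = []
    for cls, count in zip(classes, counts):
        rows = np.flatnonzero(y == cls)
        extra = rng.choice(rows, size=target - count, replace=True)
        picked.append(np.concatenate([rows, extra]))
    return np.concatenate(picked)


# Synthetic stand-in data (hypothetical): 400 candidate SNVs, roughly 20% positives.
rng = np.random.default_rng(0)
n = 400
y = (rng.random(n) < 0.2).astype(int)
# Quality-based features, e.g. variant quality, read depth, VAF (made-up values).
X_qual = rng.normal(size=(n, 3)) + y[:, None]
# Sequence-based features, e.g. mutation type and sequence context (made-up labels).
X_seq = np.column_stack([
    rng.choice(["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"], size=n),
    rng.choice(["ACA", "CCG", "TCT", "GTA"], size=n),
])

# 3:1 split: three quarters for training, one quarter held out for testing.
idx_train, idx_test = train_test_split(
    np.arange(n), test_size=0.25, stratify=y, random_state=0
)

# Rebalance the training set only; the test set keeps its natural class ratio.
idx_bal = idx_train[random_oversample(y[idx_train], rng)]

# Quality-based model: L1-regularized logistic regression fitted with liblinear.
qual_model = LogisticRegression(penalty="l1", solver="liblinear")
qual_model.fit(X_qual[idx_bal], y[idx_bal])

# Sequence-based model: one-hot encoding, then L2-regularized logistic regression.
encoder = OneHotEncoder(handle_unknown="ignore")
seq_model = LogisticRegression(penalty="l2", solver="liblinear")
seq_model.fit(encoder.fit_transform(X_seq[idx_bal]), y[idx_bal])

# Each model returns a positive-class probability for every held-out variant.
p_qual = qual_model.predict_proba(X_qual[idx_test])[:, 1]
p_seq = seq_model.predict_proba(encoder.transform(X_seq[idx_test]))[:, 1]
```

Oversampling is applied after the split so that replicated minority-class rows cannot appear in both the training and test sets.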

After training, we assessed the model's performance on the test set using the AUC score, which, unlike accuracy, is appropriate for imbalanced datasets. Additionally, we applied the model to the candidate SNVs in the unsure set to refine and extend the putative somatic variant set. This step recovered some true somatic variants that had been filtered out by the stringent criteria described earlier, enhancing sensitivity while maintaining high precision.
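To illustrate why AUC is preferred over accuracy on imbalanced data, the following sketch computes both on a hypothetical test set with 95 negatives and 5 positives. The functions `auc_score` and `accuracy` and all values are illustrative assumptions written in plain Python, not part of RESA-jLR.

```python
def auc_score(labels, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of variants classified correctly at a given probability threshold."""
    predicted = [int(s >= threshold) for s in scores]
    return sum(p == l for p, l in zip(predicted, labels)) / len(labels)


# Hypothetical imbalanced test set: 95 negatives followed by 5 positives.
labels = [0] * 95 + [1] * 5

# An uninformative classifier that assigns every variant the same probability.
constant_scores = [0.1] * 100
# A classifier that ranks every positive above every negative, but below 0.5.
ranked_scores = [0.2] * 95 + [0.4] * 5

print(accuracy(labels, constant_scores))  # 0.95
print(auc_score(labels, constant_scores))  # 0.5
print(accuracy(labels, ranked_scores))  # 0.95
print(auc_score(labels, ranked_scores))  # 1.0

# Lowering the user-specified threshold recovers the positives in the second case.
print(accuracy(labels, ranked_scores, threshold=0.3))  # 1.0
```

At the default threshold both classifiers reach the same accuracy, while AUC separates the uninformative one from the one that ranks the variants correctly.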

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
