ML Task 2: predicting the values of randomly missing DPs

Habib Bashour, Eva Smorodina, Matteo Pariset, Jahn Zhong, Rahmad Akbar, Maria Chernigovskaya, Khang Lê Quý, Igor Snapkow, Puneet Rawat, Konrad Krawczyk, Geir Kjetil Sandve, Jose Gutierrez-Marcos, Daniel Nakhaee-Zadeh Gutierrez, Jan Terje Andersen, Victor Greiff

For this task (Fig. 6a, c and Supplementary Fig. 17B), we randomly deleted either 2% or 4% of the DP values from subsamples of the human VH antibody sequences (the 683,534 sequences defined as the training set in ML Task 1). We then predicted the deleted (missing) values using the multivariate imputation by chained random forests (MICRF) algorithm (ref. 79) via the missRanger R package (ref. 172). We repeated this step 20 times for each subsample size (50, 100, 500, 1,000, 10,000 and 20,000 antibodies) and reported the mean R².
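The mask, impute and score loop described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: all function names are hypothetical, a column-mean imputer stands in for missRanger (it only marks where the real imputer would plug in), and pooling R² over all masked cells is an assumption, since the text does not say whether R² was computed per DP or overall.

```python
import random


def r_squared(true_vals, pred_vals):
    """Coefficient of determination between held-out and imputed values."""
    mean_true = sum(true_vals) / len(true_vals)
    ss_res = sum((t - p) ** 2 for t, p in zip(true_vals, pred_vals))
    ss_tot = sum((t - mean_true) ** 2 for t in true_vals)
    return 1.0 - ss_res / ss_tot


def mask_random_cells(table, fraction, rng):
    """Return a copy of `table` with a random `fraction` of cells set to None,
    plus the (row, column) positions that were masked."""
    cells = [(i, j) for i in range(len(table)) for j in range(len(table[0]))]
    n_masked = max(1, round(fraction * len(cells)))
    masked = rng.sample(cells, n_masked)
    holed = [row[:] for row in table]
    for i, j in masked:
        holed[i][j] = None
    return holed, masked


def impute_column_means(holed):
    """Stand-in imputer (NOT chained random forests): fill gaps with the
    mean of the observed values in the same column."""
    means = []
    for j in range(len(holed[0])):
        observed = [row[j] for row in holed if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in holed]


def mean_r2(table, fraction, n_repeats=20, seed=0, impute=impute_column_means):
    """Repeat mask -> impute -> score `n_repeats` times and average R²."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        holed, masked = mask_random_cells(table, fraction, rng)
        filled = impute(holed)
        true_vals = [table[i][j] for i, j in masked]
        pred_vals = [filled[i][j] for i, j in masked]
        scores.append(r_squared(true_vals, pred_vals))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy table: 100 "antibodies" x 5 made-up numeric parameters.
    data_rng = random.Random(42)
    toy = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(100)]
    for fraction in (0.02, 0.04):
        print(fraction, mean_r2(toy, fraction))
```

Passing a different callable as `impute` is the intended extension point. In the protocol that role is played by missRanger, and the loop would be run once per subsample size.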

For both ML tasks, and for both embedding types implemented in ML Task 1, we performed ablation studies by randomly permuting the column values in the input datasets for the ML models, and confirmed that prediction accuracy was abolished (R² ≤ 0).
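One way to read the permutation step is that the values within each column are shuffled independently. This keeps every column's marginal distribution but breaks the relationships between columns that a model could learn from. The sketch below follows that reading, which is an assumption, and the function name `permute_columns` is hypothetical.

```python
import random


def permute_columns(rows, rng):
    """Shuffle the values within each column independently.

    Each column keeps exactly the same set of values, but the pairing of
    values across columns within a row is randomized.
    """
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy input in which the second column is fully determined by the first.
    rows = [[x, 2 * x] for x in range(10)]
    print(permute_columns(rows, rng))
```

A model trained on the permuted table would then be scored in the same way as the unpermuted one. Under this reading, an R² at or below zero indicates that the original accuracy depended on the cross-column structure.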
