Dataset processing

Mengnan Jiang, Zilan Yu, Xun Lan

To train the model, we collected experimentally verified CDR3β-peptide pairs from IEDB and McPAS as positive samples. Data from VDJdb served as the testing dataset for validating the model. Each negative sample was generated by pairing the peptide of a positive sample with a randomly selected CDR3β sequence from a healthy donor in TCRdb (ref. 33). We restricted the data to CDR3β sequences of 10-20 amino acids (AAs) that start with 'C' and end with 'F' or 'W', and to peptides of 8-12 AAs presented by human MHC class I molecules. This yielded an HLA-A*02:01-restricted training set of 38,712 data points before cluster-based filtering with iSMART and 28,584 data points after it; we used one or the other according to our experimental needs. The testing set contained 5,250 data points. Additionally, TCR sequences of CD8+ T cells from healthy human donors (18,331 cells from Donor 1 and 8,337 cells from Donor 2) were used to measure how well VitTCR generalizes, by calculating correlations between the predicted binding probabilities and the clonal fractions. To demonstrate the reliability of VitTCR, 165 experimentally validated CDR3β-peptide pairs from COVID-19 recoverees were also used to show a positive correlation between the predicted binding probabilities and the activation percentages of T cells.
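The sequence filters and the mismatch-based negative sampling described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function names `passes_filters` and `make_negatives` are hypothetical, the sequences are assumed to use only the 20 standard amino-acid letters, and excluding known positive pairs when drawing a negative is an added assumption that the text does not state. The HLA restriction and the iSMART clustering step are not covered.

```python
import random
import re

AA = "ACDEFGHIKLMNPQRSTVWY"  # assumption: only the 20 standard amino acids

# CDR3β: 10-20 residues, starting with C and ending with F or W.
CDR3B_PATTERN = re.compile(rf"C[{AA}]{{8,18}}[FW]")
# Peptide: 8-12 residues.
PEPTIDE_PATTERN = re.compile(rf"[{AA}]{{8,12}}")


def passes_filters(cdr3b, peptide):
    """Return True if both sequences satisfy the length and motif criteria."""
    return (CDR3B_PATTERN.fullmatch(cdr3b) is not None
            and PEPTIDE_PATTERN.fullmatch(peptide) is not None)


def make_negatives(positives, background_cdr3b, seed=0):
    """Build one negative per positive pair by mismatching its peptide.

    positives: list of (cdr3b, peptide) tuples.
    background_cdr3b: CDR3β sequences from healthy donors (e.g. from TCRdb).
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    # Keep only background sequences that meet the CDR3β criteria.
    pool = [c for c in background_cdr3b if CDR3B_PATTERN.fullmatch(c)]
    negatives = []
    for _, peptide in positives:
        # Assumption: never reuse a pair that is a known positive.
        candidates = [c for c in pool if (c, peptide) not in positive_set]
        if not candidates:
            raise ValueError(f"no background CDR3β available for {peptide}")
        negatives.append((rng.choice(candidates), peptide))
    return negatives
```

Rebuilding the candidate list for every peptide is quadratic in the worst case; for a background pool the size of TCRdb, rejection sampling from `pool` would be the more practical choice.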