We trained machine learning classifiers to predict thrombin inhibitory activity. To evaluate model performance, we randomly split the data into three sets, 60% for training, 20% for validation, and 20% for testing, a split that kept an adequate number of positive samples in every set. The training set consisted of 528 samples, the validation set of 176 samples, and the testing set of 176 samples; the testing set served as out-of-sample test data. For classification, we employed support vector machines (SVMs) with linear and radial basis function kernels, logistic regression, random forest, k-nearest neighbors (kNN), and extreme gradient boosting (XGBoost). The performance of these models was compared to determine the most suitable one for our final inference. We implemented these models using scikit-learn [31], a widely used machine learning library for Python.
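The 60/20/20 split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the data are a synthetic stand-in generated with `make_classification`, the class ratio and random seeds are arbitrary, and the use of `stratify` is an assumption made here to keep positives in every subset (the text only states that the split was random). XGBoost is omitted because it ships in a separate package, and kNN is left unweighted because scikit-learn's `KNeighborsClassifier` has no `class_weight` parameter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the 880 labeled peptides (assumption).
X, y = make_classification(
    n_samples=880, n_features=20, weights=[0.85, 0.15], random_state=0
)

# Hold out 176 samples (20%) for testing, then 176 of the remaining 704
# for validation, leaving 528 (60%) for training.
# stratify=... is an assumption: it preserves the class ratio in each subset.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=176, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=176, stratify=y_rest, random_state=0
)

# Candidate classifiers named in the text (default hyperparameters here).
models = {
    "svm_linear": SVC(kernel="linear", class_weight="balanced"),
    "svm_rbf": SVC(kernel="rbf", class_weight="balanced"),
    "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "random_forest": RandomForestClassifier(class_weight="balanced", random_state=0),
    "knn": KNeighborsClassifier(),  # no class_weight parameter available
}

for name, labels in [("train", y_train), ("validation", y_val), ("test", y_test)]:
    print(f"{name}: {len(labels)} samples, {int(labels.sum())} positives")
```

Passing integer sizes to `test_size` avoids any rounding ambiguity and reproduces the 528/176/176 counts exactly.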
First, all the baseline models were tuned by choosing the hyperparameters that gave the best Matthews correlation coefficient (MCC) on the validation set. We used scikit-learn's RandomizedSearchCV for hyperparameter tuning and performed 5-fold cross-validation on the combined training and validation sets. The SVM models were tuned for the hyperparameters C and γ; the random forest and XGBoost models for the number of estimators, maximum depth, minimum samples per leaf, and minimum samples per split; the logistic regression model for C; and the kNN model for the number of nearest neighbors. Class imbalance was accounted for by setting the ‘class_weight’ parameter of the classifiers to ‘balanced’, which weights each class inversely to its frequency in the dataset. The final models were tested on the labeled out-of-sample test set, and their test MCC was compared with the average MCC obtained during cross-validation to ensure that the models did not have high generalization error. Based on these criteria, the best-performing model was chosen and used to predict thrombin inhibitory activity in peptides collected from protein databases. Model performance was also assessed using accuracy and the F1 score (the harmonic mean of precision and recall). The three performance measures are defined as:
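$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN}$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.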
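The tuning and evaluation procedure can be sketched for one of the models, the RBF-kernel SVM. This is a hedged illustration under stated assumptions rather than the authors' pipeline: the data are synthetic, and the search ranges, `n_iter`, and random seeds are arbitrary choices made here. The sketch tunes C and γ with `RandomizedSearchCV` using MCC as the score, compares the mean cross-validated MCC with the test MCC, and recomputes MCC from the counts TP, TN, FP, and FN using its standard definition.

```python
import math

from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    make_scorer,
    matthews_corrcoef,
)
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the labeled peptide data (assumption).
X, y = make_classification(
    n_samples=880, n_features=20, weights=[0.85, 0.15], random_state=0
)

# X_dev plays the role of the combined training + validation sets (704 samples).
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=176, stratify=y, random_state=0
)

# Randomized search over C and gamma, scored by MCC with 5-fold CV.
# The log-uniform ranges and n_iter are illustrative assumptions.
search = RandomizedSearchCV(
    SVC(kernel="rbf", class_weight="balanced"),
    param_distributions={
        "C": loguniform(1e-2, 1e2),
        "gamma": loguniform(1e-4, 1e0),
    },
    n_iter=10,
    scoring=make_scorer(matthews_corrcoef),
    cv=5,
    random_state=0,
)
search.fit(X_dev, y_dev)  # refits the best configuration on all of X_dev

# Evaluate the refit model on the held-out test set.
y_pred = search.predict(X_test)
cv_mcc = search.best_score_  # mean MCC across the 5 folds
test_mcc = matthews_corrcoef(y_test, y_pred)
test_acc = accuracy_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred)

# Recompute MCC directly from the confusion-matrix counts.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel())
denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
manual_mcc = (tp * tn - fp * fn) / denom if denom else 0.0

print("best params:", search.best_params_)
print(f"CV MCC={cv_mcc:.3f}  test MCC={test_mcc:.3f}  "
      f"accuracy={test_acc:.3f}  F1={test_f1:.3f}")
```

A large gap between `cv_mcc` and `test_mcc` would point to the kind of generalization error the comparison in the text is meant to catch.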