Model evaluation is based on QTG-Finder (Lin et al. 2019) and QTG-Finder2 (Lin et al. 2020). Given the low number of known eQTGs, we use known QTGs and Arabidopsis orthologs of QTGs found in other species as positives and all other genes as negatives, similar to QTG-Finder2. We tune hyperparameters (the number of trees, the minimum number of samples per split, and the maximum number of features) by grid search, assessing the area under the curve (AUC) of the receiver-operating characteristic (ROC) curve in an extended version of the 5-fold cross-validation framework. In this framework, the positives are iteratively re-split at random into a training and a validation set in a 4:1 ratio, and each set is then combined with randomly selected negatives; the ratio of positives to negatives is itself an optimized hyperparameter. The splitting of positives is repeated 50 times, and for each positive split the random selection of negatives is repeated 50 times. This extensive procedure (2,500 evaluations) ensures that, with high probability, every positive co-occurs with every negative at least once. All machine-learning model training and testing in this study is performed with Python's scikit-learn library, version 1.0.2.
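The repeated-resampling evaluation described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: the synthetic feature matrices, the fixed hyperparameter values, and the reduced repetition counts (5 instead of 50) are all assumptions made for brevity.

```python
# Hedged sketch of the repeated positive-split / negative-draw evaluation.
# Synthetic data and parameter values are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: 40 "known QTG" positives, 1000 candidate negatives.
X_pos = rng.normal(1.0, 1.0, size=(40, 5))
X_neg = rng.normal(0.0, 1.0, size=(1000, 5))

neg_ratio = 5      # negatives per positive (a tuned hyperparameter in the protocol)
n_pos_splits = 5   # 50 in the protocol; reduced here for speed
n_neg_draws = 5    # 50 in the protocol

aucs = []
for _ in range(n_pos_splits):
    # Re-split the positives 4:1 into training and validation sets.
    idx = rng.permutation(len(X_pos))
    cut = int(0.8 * len(X_pos))
    pos_tr, pos_va = X_pos[idx[:cut]], X_pos[idx[cut:]]
    for _ in range(n_neg_draws):
        # Pair each positive set with freshly drawn random negatives.
        neg_idx = rng.permutation(len(X_neg))
        n_tr, n_va = neg_ratio * len(pos_tr), neg_ratio * len(pos_va)
        neg_tr = X_neg[neg_idx[:n_tr]]
        neg_va = X_neg[neg_idx[n_tr:n_tr + n_va]]
        X_tr = np.vstack([pos_tr, neg_tr])
        y_tr = np.concatenate([np.ones(len(pos_tr)), np.zeros(len(neg_tr))])
        X_va = np.vstack([pos_va, neg_va])
        y_va = np.concatenate([np.ones(len(pos_va)), np.zeros(len(neg_va))])
        # In the protocol, n_estimators, min_samples_split, and max_features
        # are chosen by grid search; fixed values are used here for brevity.
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        clf.fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1]))

mean_auc = float(np.mean(aucs))
print(f"mean ROC AUC over {len(aucs)} evaluations: {mean_auc:.3f}")
```

With the full 50 × 50 repetition counts, this loop yields the 2,500 evaluations mentioned above; wrapping it in a grid over the random-forest hyperparameters and the positive-to-negative ratio reproduces the tuning procedure.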