Boosted regression tree (BRT) models were constructed to predict the county-level distribution of 21 dominant blood-sucking mite species. Briefly, a case-control study design was applied to build the predictive models: counties with at least one occurrence record served as “cases”, and counties that were surveyed but yielded no evidence of occurrence served as “controls” [28]. The remaining counties, where either no survey was conducted or surveys did not lead to conclusive findings, were excluded from model building but included in risk mapping. To counterbalance potential sampling bias among surveyed counties, we estimated the sampling probability of every county by fitting a logistic regression with county-level mite-survey history (1: yes, 0: no) as the response and ecoclimatic and socioenvironmental variables as predictors. Predictors were selected using a backward procedure at a significance level of 0.05. The reciprocals of the predicted sampling probabilities of all surveyed counties, rescaled to have a mean of 1, were used as weights in the BRT models [29–31]. This weighting scheme creates a balanced pseudo-sample population when there is a sampling bias related to the outcome of interest [29], and it has been used in several ecological modeling studies [32–34]. In provincial-level administrative divisions (PLADs) where mite investigations were scarce, e.g., Guangxi, counties were mostly assigned higher weights because they were under-sampled (Additional file 3: Fig. S2).
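The weighting step can be illustrated with a minimal sketch. It assumes that the sampling probabilities have already been predicted by the logistic regression; the function name and the four probabilities are hypothetical and serve only to show the reciprocal and the rescaling to a mean of 1. The analysis itself was run in R, so this Python fragment is an illustration rather than the code used in the study.

```python
def inverse_probability_weights(sampling_probs):
    """Return reciprocal weights (1 / p) rescaled so that their mean equals 1."""
    if any(p <= 0 or p > 1 for p in sampling_probs):
        raise ValueError("sampling probabilities must lie in (0, 1]")
    raw = [1.0 / p for p in sampling_probs]   # under-sampled counties get larger weights
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]        # rescale to a mean of 1


# Hypothetical predicted sampling probabilities for four surveyed counties.
probs = [0.8, 0.5, 0.25, 0.1]
weights = inverse_probability_weights(probs)
for p, w in zip(probs, weights):
    print(f"sampling probability {p:.2f} -> weight {w:.3f}")
```

Rescaling leaves the relative weights unchanged, so a county with a sampling probability of 0.1 still counts eight times as much as one with a probability of 0.8.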
As ecoclimatic predictors are often highly correlated with each other, we performed a clustering analysis on these predictors based on their pairwise correlation coefficients using the “NbClust” package of the R 4.0.3 software (Lucent Technologies, Jasmine Mountain, USA). Specifically, a binary distance matrix was formed in which the distance between any pair of ecoclimatic variables was 0 if the absolute value of their correlation coefficient was greater than 0.8 and 1 otherwise. The optimal number of clusters was chosen using the Krzanowski and Lai index [35]. This clustering analysis identified eight clusters of ecoclimatic variables (Additional file 3: Table S4). A continuous distance matrix, in which the distance is one minus the absolute value of the correlation coefficient, identified the same clusters. Only one predictor from each cluster was used for model fitting (Additional file 3: Table S4). For each BRT model, a total of 40 variables, comprising 30 environmental factors, 8 ecoclimatic factors, 1 economic factor and 1 demographic factor (Additional file 3: Materials and Methods, Tables S4 and S5), were used as predictors.
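The two distance matrices described above can be sketched as follows. This is an illustration under stated assumptions, not the NbClust workflow: the variable names and values are hypothetical, and the grouping step simply links variables whose binary distance is 0 (connected components) instead of choosing the number of clusters with the Krzanowski and Lai index.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def distance_matrices(columns, threshold=0.8):
    """Binary distance: 0 if |r| > threshold, else 1. Continuous distance: 1 - |r|."""
    names = list(columns)
    binary, continuous = {}, {}
    for a in names:
        for b in names:
            r = 1.0 if a == b else abs(pearson(columns[a], columns[b]))
            binary[a, b] = 0 if r > threshold else 1
            continuous[a, b] = 1.0 - r
    return names, binary, continuous


def group_by_zero_distance(names, binary):
    """Stand-in for the clustering step: connected components of zero-distance pairs."""
    clusters, unassigned = [], list(names)
    while unassigned:
        stack, cluster = [unassigned.pop(0)], []
        while stack:
            v = stack.pop()
            cluster.append(v)
            linked = [u for u in unassigned if binary[v, u] == 0]
            for u in linked:
                unassigned.remove(u)
            stack.extend(linked)
        clusters.append(sorted(cluster))
    return clusters


# Hypothetical ecoclimatic variables measured in six counties.
data = {
    "temp_mean": [1, 2, 3, 4, 5, 6],
    "temp_max": [2, 3, 5, 5, 7, 8],
    "rainfall": [5, 1, 4, 2, 6, 3],
}
names, binary, continuous = distance_matrices(data)
clusters = group_by_zero_distance(names, binary)
print(clusters)
```

One predictor would then be retained from each resulting group, as in the procedure described above.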
The BRT models were fitted with a tree complexity of 5, a learning rate of 0.005 and a bagging fraction of 75%, based on the satisfactory performance of these settings in our previous studies [36, 37]. The output of each BRT model consists of both predicted probabilities of occurrence and the relative contributions (or influences) of the predictors. A training set containing 75% of the data points was randomly sampled without replacement, and the remaining 25% served as a test set. A BRT model was built using the training set and then applied to the test set for validation if needed. The model-fitted risks were plotted against each predictor. Furthermore, receiver-operating characteristic (ROC) curves and areas under the curve (AUC) were produced to assess the predictive power of the models. Considering the possibility of false-negative and false-positive counties in the observed data, we also calculated the partial AUC with a tolerance level of 0.2 for omission error, as described in Peterson’s study [38].
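The 75%/25% split and the AUC calculation can be sketched in a few lines. The function names, the random seed, and the labels and scores below are hypothetical; the AUC is computed as the proportion of case-control pairs in which the case receives the higher predicted probability (ties count as one half), which is equivalent to the area under the ROC curve. The partial AUC with a 0.2 omission tolerance is not reproduced here.

```python
import random


def train_test_split(n, train_fraction=0.75, seed=0):
    """Randomly split indices 0..n-1 without replacement into training and test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * n))
    return idx[:cut], idx[cut:]


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both cases and controls are required")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels (1 = case, 0 = control) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

train_idx, test_idx = train_test_split(len(labels))
test_auc = auc(labels, scores)
print(len(train_idx), len(test_idx), test_auc)
```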
Model-predicted probabilities of occurrence were mapped to show the risk distribution of each of the 21 blood-sucking mite species. For each final BRT model, we chose as the cut-off value the point along the ROC curve that maximizes the sum of sensitivity and specificity [39, 40]. Counties with predicted probabilities above the cut-off value for a given model were considered to be at high risk of harboring the corresponding blood-sucking mite species. We further estimated the population sizes in high-risk regions. For each mite species, the number, area and population size of model-predicted high-risk counties were compared with the corresponding quantities for counties with observed occurrence (Table 1). All statistical analyses were performed using the dismo and gbm packages of the R 4.0.3 software (Lucent Technologies, Jasmine Mountain, USA).
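The cut-off selection can be illustrated with a short sketch that scans candidate thresholds and keeps the one with the largest sum of sensitivity and specificity. The function name and the example data are hypothetical. The sketch also assumes that a county whose predicted probability equals the cut-off is classed as high risk, and that the lowest threshold is kept when several are tied; the text above does not specify either convention.

```python
def best_cutoff(labels, scores):
    """Return (cutoff, sensitivity, specificity) maximizing sensitivity + specificity.

    A county is classed as high risk when its score is >= cutoff (an assumption).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for c in sorted(set(scores)):  # every observed score is a candidate cut-off
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= c)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < c)
        sens, spec = tp / n_pos, tn / n_neg
        if best is None or sens + spec > best[1] + best[2]:
            best = (c, sens, spec)  # strict '>' keeps the lowest cut-off among ties
    return best


# Hypothetical labels (1 = observed occurrence) and model-predicted probabilities.
labels = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.45, 0.3, 0.2]

cutoff, sensitivity, specificity = best_cutoff(labels, scores)
high_risk = [s >= cutoff for s in scores]
print(cutoff, sensitivity, specificity)
```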
The average testing areas-under-curve (AUC) of the BRT models at the county level and model-predicted numbers, areas and population sizes of affected counties for the 21 most prevalent mite species in China
The predicted numbers are compared with the actual observations from field surveys and the relative differences (%) are given in parentheses
a Top 5 mite species affecting the largest numbers of counties
b Top 5 mite species affecting the largest areas
c Top 5 mite species affecting the largest population sizes