(b) Statistical analyses

YL Yuanheng Li
CD Christian Devenish
MT Marie I. Tosa
ML Mingjie Luo
DB David M. Bell
DL Damon B. Lesmeister
PG Paul Greenfield
MP Maximilian Pichler
TL Taal Levi
DY Douglas W. Yu

We converted the sample × species table to presence-absence data (1/0), and we only included species present at six or more sampling sites across the 121 samples. Our species dataset was thus reduced to 190 species in two classes, Insecta and Arachnida (figure 1b).
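The conversion and prevalence filter described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the function names, the toy table and the `site_ids` vector are hypothetical, and the sketch assumes that prevalence is counted over distinct sampling sites rather than over samples.

```python
def to_presence_absence(counts):
    """Convert a sample x species count table to 1/0 incidences."""
    return [[1 if value > 0 else 0 for value in row] for row in counts]


def filter_by_site_prevalence(incidence, site_ids, species_names, min_sites=6):
    """Keep species detected at `min_sites` or more distinct sampling sites."""
    kept_columns = []
    for j in range(len(species_names)):
        # Distinct sites with at least one detection of species j.
        sites_present = {site for row, site in zip(incidence, site_ids) if row[j] == 1}
        if len(sites_present) >= min_sites:
            kept_columns.append(j)
    filtered = [[row[j] for j in kept_columns] for row in incidence]
    return filtered, [species_names[j] for j in kept_columns]


# Toy data (hypothetical): 4 samples from 3 sites, 3 species.
counts = [
    [12, 0, 3],  # site A
    [0, 0, 7],   # site A (repeat sample)
    [5, 0, 0],   # site B
    [9, 1, 0],   # site C
]
site_ids = ["A", "A", "B", "C"]
species = ["sp1", "sp2", "sp3"]

incidence = to_presence_absence(counts)
# The threshold is lowered to 2 so that the toy table shows the filter at work.
filtered, kept = filter_by_site_prevalence(incidence, site_ids, species, min_sites=2)
print(kept)  # ['sp1']
```

In the toy table, sp3 occurs in two samples but only at one site, so it is dropped; that is the difference between counting sites and counting samples.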

The general idea behind species distribution modelling is to ‘predict a species’ distribution’. We use each species’ observed incidences (1/0) at all sampling points, plus the environmental-covariate values at those points, to ‘fit’ a model that predicts the species’ incidences from the covariate values. Once we have a fitted model, we use it to predict the species’ probability of presence over the rest of the sampling area, where the environmental-covariate values are known but the species’ incidences are not. Spatial autocorrelation was accounted for by a trend-surface component. Joint species distribution models (JSDMs) extend individual species distribution models by additionally accounting for co-occurrences of species (see the electronic supplementary material: Joint Species Distribution Model).

The statistical challenge is to avoid overfitting, which occurs when the fitted model predicts the species’ incidences well at the sampling points that were used to fit the model in the first place but predicts them poorly over the rest of the landscape. Overfitting is likely in our dataset because many of our species are rare, there are many candidate remote-sensing covariates, and we expect that any relationships between remote-sensing-derived covariates and arthropod incidences are indirect and thus complex, necessitating the use of flexible mathematical functions.

To minimize overfitting, we used regularization and cross-validation. Regularization uses penalty terms during model fitting to favour a relatively simple set of covariates, and cross-validation finds the best values for those penalty terms (tuning). First, we randomly split the species incidence data from the 121 samples (collected at 89 sampling points) into 75% training data (n = 91) and 25% test data (n = 30) (electronic supplementary material, figure S1). The training data were used to try 1000 different hyperparameter combinations, some of which are the penalty terms, in a fivefold cross-validation design, to find the combination that achieves the highest cross-validated predictive performance within the training data (see the electronic supplementary material: Tuning and Testing, figure S1). The model with this combination was then applied to the 25% test data to measure true predictive performance. To fit the model, we used the JSDM R package sjSDM 1.0.5 [42], with the deep neural network (DNN) option to account for complex, nonlinear effects of environmental covariates (the DNN outperformed a linear model; see the electronic supplementary material, figure S11). This package suits our dataset of many species with few data points and many covariates.
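The tuning-and-testing design can be sketched in plain Python as below. This is an illustration under stated assumptions, not the sjSDM workflow: individual samples are split at random, the hyperparameter combinations are drawn by random search, and `toy_score` and `sample_hyperparams` are hypothetical stand-ins for fitting a model on the fitting folds and scoring it on the validation fold.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.25, seed=0):
    """Randomly split sample indices into training and held-out test sets."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(n_samples * test_fraction)  # 121 samples -> 30 test, 91 training
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def k_fold_indices(indices, k=5, seed=0):
    """Return k (fit, validation) index pairs covering `indices` exactly once."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    return [(sorted(set(shuffled) - set(fold)), sorted(fold)) for fold in folds]


def random_search(train_idx, score_fn, sample_hyperparams, n_trials=1000, k=5, seed=0):
    """Pick the hyperparameters with the best mean cross-validated score."""
    rng = random.Random(seed)
    folds = k_fold_indices(train_idx, k=k, seed=seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        params = sample_hyperparams(rng)
        mean_score = sum(score_fn(params, fit, val) for fit, val in folds) / k
        if mean_score > best_score:
            best_score, best_params = mean_score, params
    return best_params, best_score


# Hypothetical stand-ins: a real score_fn would fit the model on `fit_idx`
# and return its predictive performance (e.g. AUC) on `val_idx`.
def sample_hyperparams(rng):
    return {"penalty": rng.uniform(0.0, 1.0)}


def toy_score(params, fit_idx, val_idx):
    return -(params["penalty"] - 0.1) ** 2  # pretend the optimum penalty is 0.1


train_idx, test_idx = train_test_split_indices(121)
best_params, best_score = random_search(train_idx, toy_score, sample_hyperparams)
```

The test indices never enter `random_search`, which is what makes the final evaluation on them a pure holdout.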

In addition, to estimate how OTU incidence affects the variability of predictive accuracy, we also tuned a model to the whole dataset in a fivefold cross-validation, found optimal hyperparameters, and used them in another fivefold cross-validation on the entire dataset to estimate the variability of predictive area under the curve (AUC) values by OTU (see the electronic supplementary material: Variability in Predictive AUC by OTU Incidence). We emphasize that this method is useful only for estimating variability in predictive performance, because it potentially overestimates predictive performance itself, which is what we avoided by using a pure holdout in the main analysis.

The mathematical functions used in neural network models are unknown, but it would be useful to identify the covariates that contribute the most to explaining each species’ incidences. We therefore carried out an ‘explainable artificial intelligence’ (xAI) analysis, using the R package flashlight 0.8.0 [52]. In short, for each environmental covariate, we shuffled its values in the dataset and estimated the drop in explanatory performance on the training data. The most important covariate is the one that, when permuted, degrades explanatory performance the most (see the electronic supplementary material: Variable importance with explainable AI (xAI)).
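The shuffling procedure is a permutation-importance calculation, which can be sketched generically as below. This is not the flashlight API: the function names, the accuracy metric and the toy model (which depends only on the first covariate) are assumptions chosen to keep the example self-contained.

```python
import random


def accuracy(y_true, y_prob):
    """Share of samples whose thresholded prediction matches the observation."""
    hits = sum(1 for y, p in zip(y_true, y_prob) if (p >= 0.5) == (y == 1))
    return hits / len(y_true)


def permutation_importance(predict, X, y, metric, n_repeats=10, seed=0):
    """Mean drop in `metric` after shuffling each covariate column in turn."""
    rng = random.Random(seed)
    baseline = metric(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between covariate j and y
            X_perm = [row[:j] + [column[i]] + row[j + 1:] for i, row in enumerate(X)]
            drops.append(baseline - metric(y, predict(X_perm)))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data (hypothetical): incidence is driven by covariate 0 only.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(40)]
y = [1 if row[0] > 0.5 else 0 for row in X]


def toy_predict(rows):
    """Stand-in for a fitted model: probability of presence from covariate 0."""
    return [1.0 if row[0] > 0.5 else 0.0 for row in rows]


importances = permutation_importance(toy_predict, X, y, accuracy)
```

Here the second covariate receives an importance of exactly zero because the toy model ignores it, whereas permuting the first covariate lowers accuracy.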

After applying the final model to the test dataset, we identified 76 species that had moderate to high predictive performance (AUC ≥ 70%). We used the fitted model and the environmental covariates to predict the probability of each species’ incidence in each grid cell of the study area (‘filling in the blanks’ between the sampling points). The output of this one model is 76 individual and continuous species distribution maps, which we combined to carry out three landscape analyses. First, we counted the number of species predicted to be present (probability of presence ≥ 50%) in each grid cell to produce a species richness map. Second, we carried out a dimension-reduction analysis, also known as ordination, using the t-distributed stochastic neighbour embedding (T-SNE) method [53,54] to summarize species compositional change across the landscape. Pixels that have similar species compositions receive similar T-SNE values, which can be visualized. Third, we calculated Baisero et al.’s [55] site-irreplaceability index for every pixel. This index is the probability that loss of that pixel would prevent achieving the conservation target for at least one of the 76 species, where the conservation target is set to be 50% of the species’ total incidence.
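The species selection and the richness map can be sketched as below. This is a minimal illustration with hypothetical names and toy values; it assumes a rank-based AUC and inclusive thresholds (AUC of at least 0.70, probability of presence of at least 0.5), and it does not cover the T-SNE or irreplaceability analyses.

```python
def auc(y_true, scores):
    """Rank-based AUC: chance that a presence outscores an absence (ties = 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when the test set holds only one class
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def well_predicted_species(y_test, p_test, species_names, min_auc=0.70):
    """Names of species whose test-set AUC is at least `min_auc`."""
    kept = []
    for j, name in enumerate(species_names):
        score = auc([row[j] for row in y_test], [row[j] for row in p_test])
        if score is not None and score >= min_auc:
            kept.append(name)
    return kept


def richness_map(prob_maps, threshold=0.5):
    """Count, per grid cell, the species with predicted probability >= threshold."""
    n_rows, n_cols = len(prob_maps[0]), len(prob_maps[0][0])
    return [
        [sum(1 for grid in prob_maps if grid[r][c] >= threshold) for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Toy test set (hypothetical): 4 samples x 2 species.
y_test = [[0, 1], [0, 0], [1, 1], [1, 0]]
p_test = [[0.1, 0.2], [0.4, 0.6], [0.35, 0.3], [0.8, 0.7]]
kept = well_predicted_species(y_test, p_test, ["spA", "spB"])
print(kept)  # ['spA']  (AUC 0.75 for spA, 0.0 for spB)

# Toy predicted-probability grids (hypothetical): 2 species on a 2 x 2 landscape.
prob_maps = [
    [[0.9, 0.2], [0.5, 0.4]],
    [[0.6, 0.7], [0.1, 0.5]],
]
richness = richness_map(prob_maps)
print(richness)  # [[2, 1], [1, 1]]
```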

Finally, we carried out post hoc analyses by plotting site irreplaceability, composition (T-SNE), and species richness against elevation, old-growth structural index [56] and inside/outside HJA.
