Species distribution models (SDMs) fit statistical associations between species’ observed presences and environmental variables; i.e., they estimate a species’ realized ecological niche (44). SDMs provide a useful framework to explore large-scale distributions of phytoplankton species, based on two major assumptions: (i) species are not dispersal limited in the open ocean (51, 52) [but see (53)], a trait consistent with the generally wide geographic ranges of the species in our data; (ii) species are primarily controlled by environmental factors in their global distribution (36, 47, 52) and rapidly proliferate when conditions become suitable (18, 27). The application of SDMs to marine phytoplankton distributions has emerged relatively recently (45). Since distribution patterns of phytoplankton species might change seasonally (30), reflecting the generally short generation times of phytoplankton owing to their largely microbial nature, we used a monthly matchup between species’ presences and environmental variables to calibrate the niches. We then projected niches onto global environmental data fields at 1° and monthly resolution to obtain maps of species’ presence. We developed SDMs that address three principal sources of uncertainty: (i) biases in sampling effort, (ii) predictor selection, and (iii) algorithm choice. All steps used to build the SDMs are described below.

Data binning. We used phytoplankton presence data, rather than abundance data, as the former are less sensitive to differences in sampling methods and are more widely available. We binned species’ presence observations into monthly 1° latitude × 1° longitude cells to match the resolution of the environmental predictors. Multiple observations per species and 1° cell that stemmed from the same month, but potentially from different years, thus counted as a single presence, resulting in a total of 245,322 species presences. The monthly data binning may have removed signals of temporal changes in species’ distributions across years. However, since data originated predominantly from a few decades between 1950 and 2000 (1984 ± 17; mean ± SD) and since climatic changes during this period were much smaller than the current global amplitudes of environmental factors (for example, sea surface temperature spans ~−1.8 to ~32°C) (48), we expect such changes to have only a minor impact on global SDM projections.
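The binning step can be illustrated with a minimal Python sketch. It is not the authors' code: the record layout, the function name `bin_presences`, and the toy records are assumptions made for illustration, and cells are identified here by the floor of latitude and longitude.

```python
import math

def bin_presences(records):
    """Collapse raw observations into unique (species, lat cell, lon cell, month) presences.

    records: iterable of (species, lat, lon, year, month) tuples (hypothetical layout).
    The year is deliberately ignored, so repeated observations in the same
    month and 1-degree cell count once, whatever year they come from.
    """
    cells = set()
    for species, lat, lon, _year, month in records:
        # Identify the 1-degree cell by its south-west corner.
        cells.add((species, math.floor(lat), math.floor(lon), month))
    return cells

# Toy records (made up for illustration).
records = [
    ("sp_A", 45.2, -30.7, 1975, 6),
    ("sp_A", 45.9, -30.1, 1988, 6),  # same cell and month, different year
    ("sp_A", 45.2, -30.7, 1975, 7),  # same cell, different month
]
binned = bin_presences(records)
```

In this toy case the three raw observations reduce to two presences, because the first two fall into the same cell and calendar month.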

Environmental background data. Since absence data for phytoplankton are unreliable on the basis of traditional sampling methods (20) but required by our presence-absence SDMs, we selected background data (also termed pseudoabsences) for each species, using a so-called target-group approach (54). This approach addresses spatial and temporal sampling biases in field-based presence data of species via the selection of the number and location of pseudoabsences. We defined large groups of species as target groups, assuming that variation in sampling effort applied to the entire target group reflected variation in sampling applied to each species within the target group (54). The sampling of species’ background data from the target group served two purposes: (i) Background sampling followed a sampling scheme similar to that of the species’ presence data (and thus received similar bias), thereby balancing presence data bias when fitting SDMs; (ii) extensive ocean areas, which lacked sampling, were not misclassified as areas of species’ absences. We used the Bacillariophyceae, Dinoflagellata, and Haptophyta separately to define “group-specific target groups” for their constituent species, as these taxa had different global sampling schemes. For the remaining taxa, the number of species was insufficient to build group-specific target groups. For these taxa, we used the total species as the target group, excluding Bacillariophyceae, as presence data of the latter were strongly north-south imbalanced.

In parallel, we used the total species as the target group to sample the background for each species, which we refer to as the “total target group” approach. We found that richness results were robust to the use of total versus group-specific target groups (fig. S3B).

We sampled background data in a stratified manner from the target group, dividing both the T and the MLD gradient (spanned by the target group) into nine equally spaced intervals, yielding 81 strata (T × MLD combinations). Sampling data from each stratum separately ensured that the breadth of these two key environmental factors was reflected in the backgrounds of species. The target group’s presence data were gridded at monthly 1° resolution before backgrounds were sampled from it. We tested whether the density of these monthly 1° cells of the target group reflected original sampling efforts (approximated by the number of samples in the raw data) and found that the two measures were highly correlated (Spearman’s ρ = 0.94 for latitude; Spearman’s ρ = 0.99 for longitude; binning data at 1° latitude or 1° longitude, respectively). For each species, we sampled 10 times more background data than the species had presences (55). Within each of the 81 strata, background data were randomly sampled. The amount of background data sampled from a specific stratum was proportional to the number of monthly 1° cells provided by the target group in this stratum, thereby reflecting original sampling efforts.
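The stratified, effort-proportional sampling described above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the authors' implementation: the cell layout (dicts with keys `T` and `MLD`), the function names, sampling without replacement, and the rounding of per-stratum quotas are all choices made here.

```python
import random
from collections import defaultdict

def stratum_index(value, lo, hi, n_bins=9):
    """Index of the equally spaced interval containing value (hi falls into the last bin)."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)

def sample_background(target_cells, n_presences, ratio=10, n_bins=9, seed=0):
    """Draw about ratio * n_presences background cells, proportionally per T x MLD stratum."""
    rng = random.Random(seed)
    t_vals = [c["T"] for c in target_cells]
    m_vals = [c["MLD"] for c in target_cells]
    t_lo, t_hi = min(t_vals), max(t_vals)
    m_lo, m_hi = min(m_vals), max(m_vals)

    # Assign every monthly 1-degree cell of the target group to one of 9 x 9 strata.
    strata = defaultdict(list)
    for c in target_cells:
        key = (stratum_index(c["T"], t_lo, t_hi, n_bins),
               stratum_index(c["MLD"], m_lo, m_hi, n_bins))
        strata[key].append(c)

    n_total = ratio * n_presences
    background = []
    for cells in strata.values():
        # Quota proportional to the stratum's share of target-group cells.
        quota = round(n_total * len(cells) / len(target_cells))
        # Assumption of this sketch: sample without replacement, capped at stratum size.
        background.extend(rng.sample(cells, min(quota, len(cells))))
    return background

# Toy target group: 300 cold, shallow cells and 100 warm, deep cells (made-up values).
target_cells = ([{"T": 0.0, "MLD": 10.0} for _ in range(300)]
                + [{"T": 30.0, "MLD": 200.0} for _ in range(100)])
background = sample_background(target_cells, n_presences=4)
```

Because quotas are rounded per stratum, the total drawn can deviate slightly from exactly 10 times the number of presences when many strata are occupied.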

Statistical complexity. Statistical algorithm choice represents a key source of uncertainty in SDMs (56). We constructed SDMs based on either GLM (using the R package stats), GAM (R package mgcv), or RF (R package randomForest), as three algorithms of increasing statistical response shape complexity (57). We considered the GAM as our standard algorithm because of its intermediate complexity. We used comparably few predictors (n = 4) in models and fitted simple response shapes to account for the relatively few presences of most phytoplankton species (57). GLM included linear and quadratic terms and a stepwise bidirectional predictor selection procedure. GAM used smoothing terms with five basis dimensions, estimated by penalized regression splines without penalization to zero for single variables. To equalize the overall weight of presences versus background data per species, background data in GAM and GLM were weighted by the ratio of the species’ presence to background data points. RFs included 4000 trees, simple terms, and a terminal node size of one. The weighting of data in individual RF trees was balanced by randomly subsampling the same amount of background data as the species had presences.
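The two balancing schemes can be made concrete with a short Python sketch. It only shows how the weights and the balanced subsample could be derived; the function names are hypothetical, and the actual model fitting (done in R by the authors) is not reproduced.

```python
import random

def observation_weights(labels):
    """Weight 1 for presences and n_presences / n_background for background points.

    With these weights the summed weight of the background equals the summed
    weight of the presences, as described for GLM and GAM.
    """
    n_pres = sum(1 for y in labels if y == 1)
    n_bg = len(labels) - n_pres
    if n_pres == 0 or n_bg == 0:
        raise ValueError("need at least one presence and one background point")
    w_bg = n_pres / n_bg
    return [1.0 if y == 1 else w_bg for y in labels]

def balanced_subsample(background, n_presences, rng):
    """Random background subsample of the same size as the presences (per RF tree)."""
    return rng.sample(background, n_presences)

# Toy example: 3 presences and 30 background points (the 1:10 ratio from the text).
labels = [1] * 3 + [0] * 30
weights = observation_weights(labels)
subsample = balanced_subsample(list(range(30)), 3, random.Random(0))
```

Each background point here receives a weight of 0.1, so the 30 background points together carry the same weight as the 3 presences.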

Single predictor skill tests. In addition to algorithm choice, predictor choice represents a major source of uncertainty in phytoplankton SDMs, as these organisms are not well studied regarding their most important niche factors. To select powerful predictors for SDMs, we assessed the individual skill of an extensive number of candidate predictors (n = 25) in discriminating species’ presences versus background data. The results of this test also served to identify the key environmental drivers of species’ distributions independently of the SDM analysis. We fitted single-factor GLM, GAM, and RF models to the presences versus background data of each species, for each candidate predictor. The species (n = 567) considered for the predictor analyses generally had at least 24 presences, the same lower threshold that was used for species in the SDMs. Model explanatory skill was evaluated using the adjusted D² (for GLM and GAM) and the out-of-bag error (for RF) statistic. For each species, predictors were ranked according to these statistics, and the mean variable ranks obtained across GLM, GAM, and RF served as a basis for predictor selection. We performed several sensitivity tests to evaluate the robustness of the predictor ranking. We compared rarely versus more frequently sampled species (i.e., ≥15, ≥24, and ≥50 presences), used different variables for the stratification of background sampling, and applied spatial thinning of species’ presences to a distance of ≥300 or ≥600 km (using the R package spThin), which reduces potentially confounding effects of autocorrelation. None of these modifications changed the result that temperature was the top-ranked predictor across total species. However, the rank of predictors other than temperature tended to vary between setups.
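The rank-averaging step for one species can be sketched in Python as follows. The skill values and the predictor name `X3` are made up for illustration, and the sketch assumes that every statistic has already been expressed so that higher means better (for RF, for example, one minus the out-of-bag error); the source does not state how the two kinds of statistic were put on a common footing before ranking.

```python
def rank_descending(scores):
    """Rank predictors 1..n, with rank 1 for the highest skill (ties broken by name)."""
    ordered = sorted(scores, key=lambda p: (-scores[p], p))
    return {p: i + 1 for i, p in enumerate(ordered)}

def mean_ranks(skill_by_algorithm):
    """Average each predictor's rank across the algorithms."""
    ranks = [rank_descending(s) for s in skill_by_algorithm.values()]
    predictors = list(ranks[0])
    return {p: sum(r[p] for r in ranks) / len(ranks) for p in predictors}

# Toy single-predictor skill values for one species (hypothetical numbers).
skill = {
    "GLM": {"T": 0.30, "MLD": 0.10, "X3": 0.20},
    "GAM": {"T": 0.35, "MLD": 0.22, "X3": 0.15},
    "RF":  {"T": 0.80, "MLD": 0.70, "X3": 0.75},
}
mean_rank = mean_ranks(skill)
# Predictors ordered from best (lowest mean rank) to worst.
ranking = sorted(mean_rank, key=mean_rank.get)
```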

Predictor choice for models. To capture predictor-based uncertainties, we fitted five member models, each using a different set of four predictors, for building an SDM ensemble. The species (n = 567) considered for modeling contained ≥24 presences, which corresponds to a presence-to-predictor ratio ≥6 per species. We used a randomization approach to select the four predictors per member model, using the test-based predictor ranking (see above) of each species as a basis. For the first member model, we selected four predictors at random, without replacement, from those predictors that ranked among the 10 most powerful predictors per species. Within each predictor set, we excluded combinations of predictors with an absolute Spearman’s rank correlation greater than 0.7 (computed from the predictor data at global monthly 1° resolution). The predictor sets of the four other members were composed using the same criteria. However, we allowed each predictor to be selected at most twice among the five members to avoid biases due to overrepresentation of individual predictors in SDMs. If sampling among the top 10 predictors did not provide a sufficient number of predictors for the 5 sets × 4 predictors (given the correlation criterion), candidate predictors that ranked >10 were selected. The same predictor sets were used in GAM, GLM, and RF.
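One way to implement this randomized selection is sketched below in Python. It is an interpretation rather than the authors' procedure: the rejection-sampling loop, the one-rank-at-a-time fallback beyond the top 10, the `max_tries` limit, and the dictionary layout of the correlation table are assumptions of this sketch.

```python
import random
from itertools import combinations

def pick_predictor_sets(ranked, corr, n_sets=5, set_size=4, top_n=10,
                        max_uses=2, r_max=0.7, seed=0, max_tries=2000):
    """Draw n_sets predictor sets for one species.

    ranked: predictor names ordered from best to worst.
    corr:   dict mapping frozenset({a, b}) to Spearman's rho (missing pairs count as 0).
    """
    rng = random.Random(seed)
    uses = {p: 0 for p in ranked}
    sets = []
    pool_size = top_n
    while len(sets) < n_sets:
        # Only predictors that have not yet been used max_uses times are eligible.
        pool = [p for p in ranked[:pool_size] if uses[p] < max_uses]
        found = None
        if len(pool) >= set_size:
            for _ in range(max_tries):
                cand = rng.sample(pool, set_size)
                # Reject sets that contain a strongly correlated pair.
                if all(abs(corr.get(frozenset(pair), 0.0)) <= r_max
                       for pair in combinations(cand, 2)):
                    found = cand
                    break
        if found is None:
            if pool_size >= len(ranked):
                raise RuntimeError("no admissible predictor set left")
            pool_size += 1  # fall back to the next lower-ranked candidate
            continue
        for p in found:
            uses[p] += 1
        sets.append(found)
    return sets

# Toy input: 25 placeholder predictors, with p0 and p1 strongly correlated.
ranked = [f"p{i}" for i in range(25)]
corr = {frozenset(("p0", "p1")): 0.9}
predictor_sets = pick_predictor_sets(ranked, corr)
```

With 10 top-ranked predictors and at most two uses each, the top 10 supply exactly the 20 slots needed, so the fallback to lower-ranked predictors is triggered whenever the random draws or the correlation criterion leave too few eligible predictors.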

Monte Carlo simulation. We used a Monte Carlo simulation to quantify uncertainty in our results emerging from the choice of different predictor sets. For each species modeled, we randomly selected one of the five predictor sets prepared (see above) to fit the SDM of the species and then calculated richness and turnover of total species. This procedure was repeated (n = 1000 runs) and we present the SD across the runs (Figs. 1, C and D, and 3B).
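The resampling logic can be sketched in Python as follows. To keep the example small, it assumes that the projection of each species under each predictor set has already been computed (in the text, the SDM is fitted within each run), and it covers richness only, computed as the sum of species' presence values per cell. Function and variable names are hypothetical.

```python
import random
import statistics

def monte_carlo_richness(projections, n_runs=1000, seed=0):
    """Mean and SD of per-cell richness across runs with randomly chosen predictor sets.

    projections[species] is a list of member projections (one per predictor set);
    each projection is a list of per-cell presence values.
    """
    rng = random.Random(seed)
    n_cells = len(next(iter(projections.values()))[0])
    runs = []
    for _ in range(n_runs):
        richness = [0.0] * n_cells
        for members in projections.values():
            chosen = rng.choice(members)  # one predictor set per species and run
            for i, value in enumerate(chosen):
                richness[i] += value
        runs.append(richness)
    mean = [statistics.mean([r[i] for r in runs]) for i in range(n_cells)]
    sd = [statistics.stdev([r[i] for r in runs]) for i in range(n_cells)]
    return mean, sd

# Toy input: two species, two cells, two member projections each (made-up values).
projections = {
    "sp_A": [[1, 0], [1, 0]],  # members agree in both cells
    "sp_B": [[1, 1], [0, 1]],  # members disagree in the first cell
}
mean_richness, sd_richness = monte_carlo_richness(projections)
```

In this toy case the SD is zero in the second cell, where all members agree, and positive in the first cell, where the choice of predictor set changes the result.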

Evaluation of model members and ensemble construction. For each species, we evaluated the predictive skill of each member model based on a repeated (4×) split-sample cross-validation test. In this test, the species’ presences and background data are randomly split into four parts. The SDM member model is iteratively trained on three parts (75%) of the data and used to predict the remaining part (25%) of the data. The predicted values are then compared against the true values. We calculated the true skill statistic (TSS) (58) of this test. TSS ranges from −1 to +1, with values greater than zero indicating models that perform better than random. We retained member models with a TSS score of at least 0.35 for the construction of our SDM ensembles. Successful member models were then projected globally onto monthly (n = 12 months) environmental data fields, yielding probabilistic maps of species’ presence. Presence probabilities were generally higher for high-latitude species than for lower-latitude species, as ecological niches at high latitudes were readily captured by SDMs. To avoid spatial biases in multispecies analyses, we therefore binarized the projected probabilities to presence-absence using thresholds that maximize the TSS (R package PresenceAbsence). For each month, we averaged the successful member models of each particular species to obtain monthly ensemble mean projections. Each species thus obtained a value between zero and one per monthly 1° cell. We did not further binarize the ensemble projections to presence-absence, as binarization tends to overestimate the species’ presence toward the edges of the projected presence area, relative to its center. We hence argue that our ensemble mean projections characterize species’ distribution patterns at a higher level of detail compared to 0/1 projections and are better suited for multispecies analyses, in line with previous work showing that the sum of overlapping 0/1 projections tended to overestimate species richness (59).
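The evaluation and ensemble steps can be illustrated with a small Python sketch. It uses the standard definition TSS = sensitivity + specificity − 1 and searches candidate thresholds among the predicted probabilities. The function names and toy values are assumptions, and the sketch does not reproduce the R routines used by the authors.

```python
def tss(observed, predicted):
    """True skill statistic for binary observations and binary predictions."""
    tp = sum(1 for o, p in zip(observed, predicted) if o == 1 and p == 1)
    fn = sum(1 for o, p in zip(observed, predicted) if o == 1 and p == 0)
    tn = sum(1 for o, p in zip(observed, predicted) if o == 0 and p == 0)
    fp = sum(1 for o, p in zip(observed, predicted) if o == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need both presences and background points")
    return tp / (tp + fn) + tn / (tn + fp) - 1.0

def binarize(probs, threshold):
    """Presence (1) where the probability reaches the threshold, else absence (0)."""
    return [1 if p >= threshold else 0 for p in probs]

def best_threshold(observed, probs):
    """Threshold (taken from the predicted values) that maximizes the TSS."""
    return max(sorted(set(probs)), key=lambda t: tss(observed, binarize(probs, t)))

def ensemble_mean(member_maps):
    """Per-cell average of the binarized maps of the retained member models."""
    n = len(member_maps)
    return [sum(values) / n for values in zip(*member_maps)]

# Toy held-out data for one member model (made-up values).
observed = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.6, 0.5, 0.2, 0.1]
threshold = best_threshold(observed, probs)
score = tss(observed, binarize(probs, threshold))
retained = score >= 0.35  # retention criterion from the text

# Two retained members, three cells: the ensemble mean lies between 0 and 1.
ensemble = ensemble_mean([[1, 0, 1], [1, 1, 0]])
```

In the toy ensemble, a cell predicted present by only one of the two members receives 0.5, which is the kind of graded value that a further 0/1 binarization would discard.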
