Statistical analysis

Nathan J. Steiger, Jason E. Smerdon, Benjamin I. Cook, Richard Seager, A. Park Williams, Edward R. Cook

This protocol is extracted from the research article:

Oceanic and radiative forcing of medieval megadroughts in the American Southwest

*Sci Adv*, Jul 24, 2019; DOI: 10.1126/sciadv.aax0087


Procedure

*Linear regression analysis*. We performed a decadal-scale linear regression analysis to assess the relative importance of climate drivers in explaining decadal hydroclimate over the period 800–1925 (Fig. 2A); for this panel, decadal-scale PDSI was the predictand and decadal-scale NINO3.4, AMO, and total forcing were the predictors. We performed the univariate and multivariate regressions within a bootstrap sampling framework that sampled both the choice of averaging and the values in the time series: The ensemble mean time series shown in Fig. 1 were averaged at 9-, 10-, and 11-year block averages and randomly sampled with replacement such that 999 (3 × 333) total regressions were performed for each predictor/predictand(s) option. The bootstrap sampling used here (and for all the panels in Fig. 2) allowed us to incorporate the uncertainty associated with the choice of averaging time and to more accurately estimate regression parameters with a relatively small sample size (*n* ≈ 110 at decadal time scales). Before computing the regressions, the averaged time series were detrended to remove autocorrelation (the removal of which was confirmed). The adjusted *r*^{2} values (adjusted for the number of predictors) for all 999 bootstrap regressions are summarized with box plots in Fig. 2A.
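The bootstrap framework above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names (`block_average`, `bootstrap_regression`) and the synthetic stand-in series are assumptions; the structure (9-, 10-, and 11-year block averages, detrending, 3 × 333 = 999 resampled regressions, adjusted *r*^{2}) follows the description in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def block_average(x, block):
    """Average x into non-overlapping blocks of the given length."""
    n = (len(x) // block) * block
    return x[:n].reshape(-1, block).mean(axis=1)

def detrend(x):
    """Remove a linear least-squares trend."""
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, 1)
    return x - (slope * t + intercept)

def adjusted_r2(y, X):
    """Adjusted r^2 of an OLS fit of y on the columns of X."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    n, k = len(y), X.shape[1]
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def bootstrap_regression(y_annual, preds_annual, blocks=(9, 10, 11), n_per_block=333):
    """999 bootstrap adjusted-r^2 values: 333 resamples per averaging choice."""
    out = []
    for block in blocks:
        y = detrend(block_average(y_annual, block))
        X = np.column_stack([detrend(block_average(p, block)) for p in preds_annual])
        n = len(y)
        for _ in range(n_per_block):
            idx = rng.integers(0, n, n)  # sample with replacement
            out.append(adjusted_r2(y[idx], X[idx]))
    return np.array(out)

# synthetic stand-ins for the annual PDSI, NINO3.4, AMO, and forcing series
years = 1126  # 800-1925
pdsi = rng.standard_normal(years)
nino, amo, forcing = (rng.standard_normal(years) for _ in range(3))
r2s = bootstrap_regression(pdsi, [nino, amo, forcing])
print(len(r2s))  # 999 bootstrap adjusted r^2 values
```

The resulting 999 values are what the box plots in Fig. 2A summarize for each choice of predictors.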

For Fig. 2 (B and C), we performed a similar decadal-scale, bootstrap regression analysis but with NINO3.4, AMO, and NASW averaged temperature as predictands and the total forcing as the predictor. We also showed the distribution of regression parameter values β_{1} (Fig. 2C), as in the regression equation *y* = β_{0} + β_{1}*x* + ε. These bootstrap-generated distributions provide an estimate of the confidence in β_{1}, although they are not identical to formal linear regression confidence intervals, which rely on assumptions that we did not make with the bootstrap approach. Note that the predictors here have not been standardized because we used the magnitude of the regression parameters in the main text; thus, the relative importance of the predictors cannot be deduced from Fig. 2C. Note also that because this analysis only captures linear relationships, there may be further nonlinear causal factors that are not accounted for.

*Logistic regression analysis*. We performed a logistic regression analysis to directly assess the role of climate drivers in explaining megadroughts in the Southwest (Fig. 2, D and E). For this analysis, the predictand is the binary variable of being in a megadrought or not being in a megadrought, and the predictors are the ensemble mean NINO3.4, AMO, and total forcing time series as in Fig. 1C (local temperature is not included in the regression model because PDSI is explicitly dependent on temperature). Because of the strong autocorrelation in the annual megadrought time series, we binned all the relevant annual time series according to the boundaries of the start and end years of the megadroughts. Each megadrought period is thus a megadrought bin (*n* = 14), and the surrounding periods are nonmegadrought bins that are divided according to the following rule: If a given nondrought period is larger than the average megadrought bin size of 22 years, then that period is divided into the largest possible equal (or equal ±1)–sized bins such that no bin is larger than 22 years; the continuous nondrought periods at the beginning and end of the time series also follow this rule. The predictor variables of NINO3.4, AMO, and total forcing are averaged over each bin segment. The logistic regression for the probability of megadrought during a given bin, π = Pr (*Y* = 1 ∣ *X* = *x*), thus takes the form logit(π) = log [π/(1 − π)] = β_{0} + β_{1}*x*_{1} + … + β_{*k*}*x*_{*k*} for the *k* predictors.
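The binning rule for the nondrought periods can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `split_period` is hypothetical, and the rule implemented (split any period longer than 22 years into the fewest possible bins, with sizes differing by at most 1 and none exceeding 22) is our reading of the description above.

```python
import numpy as np

def split_period(length, max_bin=22):
    """Split a nondrought period of the given length into the largest
    possible near-equal bins (sizes equal or equal +/- 1), none larger
    than max_bin years."""
    if length <= max_bin:
        return [length]
    n_bins = int(np.ceil(length / max_bin))  # fewest bins that respect max_bin
    base, extra = divmod(length, n_bins)
    # 'extra' bins get one additional year so sizes differ by at most 1
    return [base + 1] * extra + [base] * (n_bins - extra)

print(split_period(50))  # -> [17, 17, 16]
```

Each resulting bin then contributes one row to the logistic regression, with the predictors averaged over that bin.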

A further complication is that a few very large volcanic eruptions in the forcing time series introduce outliers that can influence the regression. Thus, in Fig. 2 (D and E), we showed two logistic regression analyses with one using the full time series and another where four outlying bins were removed for all regression variables (these corresponded to two megadrought bins and two nondrought bins); the outliers in the total forcing time series were considered as those values that were more than three scaled median absolute deviations away from the median (the default outlier detection method of MATLAB).
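The outlier criterion can be reproduced outside MATLAB as follows. A hedged sketch: `mad_outliers` is a hypothetical name and the toy data are invented, but the rule (more than three scaled median absolute deviations from the median, with the ~1.4826 consistency factor MATLAB's default `isoutlier` uses) matches the text.

```python
import numpy as np

def mad_outliers(x, n_mad=3.0):
    """Flag values more than n_mad scaled median absolute deviations
    from the median (mirrors MATLAB's default isoutlier rule)."""
    med = np.median(x)
    # 1.4826 makes the MAD a consistent estimator of sigma for normal data
    scaled_mad = 1.4826 * np.median(np.abs(x - med))
    return np.abs(x - med) > n_mad * scaled_mad

x = np.array([0.1, -0.2, 0.05, 0.3, -0.1, 8.0])  # 8.0 mimics a large eruption
print(np.flatnonzero(mad_outliers(x)))  # -> [5]
```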

*SOM analysis*. A SOM is a neural network-based cluster analysis (*30*) that has been used recently in the climate sciences to describe the continuum of climate patterns within a dataset [e.g., (*50*, *51*)]. An analysis of climate modes using SOMs is conceptually similar to traditional principal component analysis: The “nodes” of a SOM represent the primary modes of the underlying climate field. A critical advantage of a SOM analysis compared to a principal component analysis is that it does not impose linearity and orthogonality on states that may be related in nonlinear, nonorthogonal ways; practically, this feature makes SOMs more likely to correspond to physically plausible patterns [e.g., (*52*)].

Here, we used the SOM algorithm to assign annual SST anomaly fields to spatial patterns of a preset number. The SOM algorithm creates spatial patterns that maximize their similarity to the underlying SST fields (by minimizing their Euclidean distance) and then assigns each annual SST field to the best matching pattern. The SOM patterns are approximately the mean of the assigned SST fields and are thus approximately a composite of relatively similar SST fields (*50*). In addition, the SOM analysis organizes the patterns such that similar patterns are assigned to nearby locations within a regular two-dimensional grid. Thus, this full process allows one to visualize a reduced space continuum of patterns in the dataset (Fig. 3 and figs. S4 and S5).
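The assignment step can be sketched as follows. This is only the best-matching-unit assignment and compositing stage of a SOM, with toy stand-in fields; the full algorithm also trains the patterns iteratively with a neighborhood function, which is omitted here, and the function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def assign_bmu(fields, patterns):
    """Assign each flattened SST field to its best matching pattern
    (minimum Euclidean distance), as in the SOM assignment step."""
    # squared Euclidean distances between every field and every pattern
    d2 = ((fields[:, None, :] - patterns[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# toy stand-ins: 100 annual fields on a 10-point grid, 4 trained patterns
# (in the actual analysis the grid points would be cos-latitude weighted)
fields = rng.standard_normal((100, 10))
patterns = rng.standard_normal((4, 10))
bmu = assign_bmu(fields, patterns)
print(bmu.shape)  # (100,)
```

Given the `bmu` labels, compositing JJA PDSI over the years assigned to each pattern is then a per-label mean, as in Fig. 3.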

For the SOM analysis, the SST fields are preprocessed by detrending (removing the linear trend at each grid point over the full analysis period, 800–1925; cf. fig. S2); this detrending prevents spurious trends in the pattern occurrence (*51*), although it somewhat reduces the influence of the warm AMO in the early centuries of the analysis. As in (*50*), we applied an area weighting of the SST fields according to the cosine of latitude because PHYDA is on a regular latitude-longitude grid. The SOM algorithm keeps track of which year’s SST field has been assigned to which best matching pattern (or “best matching unit”), and we made composites of JJA PDSI over the years assigned to each pattern, as shown in Fig. 3 and figs. S4 and S5.

For the bootstrap resampling for Fig. 3B, we performed a 500-iteration bootstrap resampling of the same number of years as in the megadroughts, drawn from the reference period; the corresponding ranges for the nondrought years are very similar and are thus not shown for clarity.
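This resampling can be sketched as below. A hedged illustration: the field dimensions, the 300 megadrought years, and the function name `bootstrap_composites` are assumptions; the structure (500 draws, each of the same number of years as the megadroughts, sampled with replacement from the reference period, then composited) follows the text.

```python
import numpy as np

rng = np.random.default_rng(2)

def bootstrap_composites(field, ref_idx, n_years, n_iter=500):
    """Composite a field over n_years randomly drawn (with replacement)
    from the reference-period indices, repeated n_iter times."""
    return np.array([
        field[rng.choice(ref_idx, size=n_years, replace=True)].mean(axis=0)
        for _ in range(n_iter)
    ])

# toy stand-ins: 1126 years of a 10-point PDSI field, 300 megadrought years
field = rng.standard_normal((1126, 10))
ref_idx = np.arange(1126)  # reference-period indices
comps = bootstrap_composites(field, ref_idx, n_years=300)
print(comps.shape)  # (500, 10)
```

The spread across the 500 composites gives the bootstrap range plotted in Fig. 3B.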
