Once the database of BC and environmental pollution settings was obtained, the CrimeStat program was used to perform the Nearest Neighbor Index (NNI) and Nearest Neighbor Hierarchical Clustering (NNHC) analyses. The SaTScan software was used for the spatial scan statistical analysis. We used interpolation techniques to estimate the values of contaminants emitted by the polluting companies and to estimate PM10 values from the environmental monitoring stations. Likewise, the Getis-Ord cluster analysis was used to generate hot spot maps.
The Getis-Ord statistic was used as a spatial statistical technique to identify significant concentrations of high and low values of the number of deaths associated with BC at the municipal level (Figure 1). This statistic is a local indicator of spatial autocorrelation that allows the identification and visualization of local patterns of association (hot spots) and local instabilities of the global spatial association (Anselin, 2010; Getis and Ord, 2010). This pattern analysis technique was implemented with the ArcGIS spatial statistics tools. The statistic is defined as follows:
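Equation 1. Spatial autocorrelation
$$G_i^* = \frac{\sum_{j=1}^{n} w_{ij}(d)\, x_j - \bar{X} \sum_{j=1}^{n} w_{ij}(d)}{S \sqrt{\dfrac{n \sum_{j=1}^{n} w_{ij}(d)^2 - \left(\sum_{j=1}^{n} w_{ij}(d)\right)^2}{n - 1}}}$$
where $x_j$ is the value of AGEB $j$ (that is, the value of the same variable in another geographic unit), $w_{ij}(d)$ is the spatial weight between AGEB $i$ and AGEB $j$, $n$ represents the total number of geographic areas, and $\bar{X} = \frac{1}{n}\sum_{j=1}^{n} x_j$ is the mean of the variable. $S$ is calculated as follows:
$$S = \sqrt{\frac{\sum_{j=1}^{n} x_j^2}{n} - \bar{X}^2}$$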
Assuming $G_i^*$ has a normal distribution, the results can be interpreted as Z scores along a normal curve. Positive Z scores above 1.96 are statistically significant at the 0.05 level, which indicates that location $i$ is surrounded by relatively high values. Conversely, when $G_i^*$ is significant and negative, location $i$ is surrounded by relatively low values.
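As an illustration, the Z scores described above can be computed directly from a vector of values and a spatial weights matrix. The following is a minimal sketch, assuming the standard Gi* form used by the ArcGIS Hot Spot Analysis tool and a binary distance-band weights matrix in which every unit counts as its own neighbor. The function name `getis_ord_gi_star` and the toy data are hypothetical and are not part of the original analysis.

```python
import numpy as np


def getis_ord_gi_star(x, w):
    """Return the Gi* z-score of every areal unit.

    x : (n,) values of the variable (e.g. BC deaths per unit)
    w : (n, n) spatial weights; w[i, j] > 0 when unit j lies within the
        distance band of unit i (the diagonal is included for Gi*)
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n = x.size
    x_bar = x.mean()
    s = np.sqrt((x ** 2).mean() - x_bar ** 2)
    w_sum = w.sum(axis=1)            # sum_j w_ij
    w_sq_sum = (w ** 2).sum(axis=1)  # sum_j w_ij^2
    numerator = w @ x - x_bar * w_sum
    denominator = s * np.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
    return numerator / denominator


# Toy data (assumption): six units along a line, neighbors within one step.
values = np.array([10, 12, 11, 1, 2, 1])
idx = np.arange(values.size)
weights = (np.abs(idx[:, None] - idx[None, :]) <= 1).astype(float)
z_scores = getis_ord_gi_star(values, weights)
print(np.round(z_scores, 2))
```

Units whose score exceeds 1.96 would be flagged as hot spots and units below -1.96 as cold spots at the 0.05 level. The sketch does not handle a unit whose neighborhood covers the whole study area, where the denominator becomes zero.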
After processing the information, the degree of concentration of the companies with information on polluting substances and of the BC cases was determined using the NNI proposed by Clark and Evans (1954). Several studies have used this technique to determine the degree of concentration of points over space and to identify clusters (Boix et al., 2015; Gasca-Sanchez et al., 2019; Meyer, 2006).
This technique consists of measuring the distance from each point to its closest neighbor, averaging these distances, and comparing the result with the expected average distance of a hypothetical random distribution. If the mean distance is less than the average of the random distribution, the distribution of points follows a pattern of agglomeration (see Figure 3). On the other hand, if the mean distance is larger than that of the random distribution, the points are considered to follow a dispersed pattern (Levine, 2002).
Figure 3. Example of agglomerated and dispersed distributions of BC cases with their respective NNI values. The NNI values show the intensity of the agglomeration of the points in space: points with NNI values greater than 1 are more dispersed than points with NNI values close to 0. Source: Authors' elaboration.
The mean distance to the nearest neighbor (NND, Nearest Neighbor Distance) is obtained as follows:
Equation 3. Nearest Neighbor Distance
$$d(NN) = \frac{1}{N}\sum_{i=1}^{N} \min_{j \neq i} d_{ij}$$
where $d(NN)$ is the observed mean distance between each point and its nearest neighbor, $d_{ij}$ is the distance between points $i$ and $j$, and $d(ran)$ is the expected mean distance for the points in a random distribution pattern, calculated as:
Equation 4. Random distribution pattern
$$d(ran) = 0.5\sqrt{\frac{A}{N}}$$
where $A$ is the area (in square meters) of the minimum rectangle enclosing all the points, and $N$ is the number of points. In general terms, the NNI is the ratio of the observed mean nearest neighbor distance, $d(NN)$, to the expected mean random distance, $d(ran)$:
Equation 5. Nearest Neighbor Index
$$NNI = \frac{d(NN)}{d(ran)}$$
Consequently, if the result generates coefficients higher than 1, the points are considered dispersed, whereas they are agglomerated if it is less than 1. Coefficients closer to 0 indicate higher concentration in the point cloud.
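The index can be reproduced with a few lines of array code. The sketch below is illustrative only: it assumes planar coordinates in meters, uses the bounding rectangle of the points as the area when none is given, and takes the expected random distance as 0.5 times the square root of area over number of points. The function name `nearest_neighbor_index` and the sample coordinates are hypothetical.

```python
import numpy as np


def nearest_neighbor_index(points, area=None):
    """Ratio of the observed mean nearest-neighbor distance to the
    expected mean distance under a random pattern."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Pairwise Euclidean distances; the diagonal is masked so that a
    # point is never its own nearest neighbor.
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    d_obs = dist.min(axis=1).mean()
    if area is None:
        # Assumption: area of the minimum axis-aligned bounding rectangle.
        width, height = pts.max(axis=0) - pts.min(axis=0)
        area = width * height
    d_ran = 0.5 * np.sqrt(area / n)
    return d_obs / d_ran


# Two tight clumps far apart (agglomerated) versus a regular grid (dispersed).
clumped = [(0, 0), (0.1, 0), (0, 0.1), (100, 100), (100.1, 100), (100, 100.1)]
grid = [(x, y) for x in range(5) for y in range(5)]
nni_clumped = nearest_neighbor_index(clumped)
nni_grid = nearest_neighbor_index(grid)
print(nni_clumped, nni_grid)
```

With these toy inputs the clumped pattern yields an index well below 1 and the regular grid an index above 1, which matches the interpretation given above.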
The NNHC technique identifies groups of points that are spatially close. It compares the distance between pairs of points with the expected distance of a hypothetical random distribution in a given area and groups the pairs that are unusually close (Levine, 2002). This generates first-order clusters; the analysis is then applied to the first-order clusters, enclosing in circles those that are unusually close to one another and generating second-order clusters. The procedure continues at higher levels until no further clusters can be found. Normally, the hierarchical clustering procedure generates groups up to the third order. To generate first-order clusters, the software was set so that each cluster contained a minimum of five cases.
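The first-order step can be pictured with a deliberately simplified sketch. It is not the CrimeStat routine: it assumes that the threshold distance is supplied by the user (CrimeStat derives it from the random nearest-neighbor distance) and it groups points by chaining every pair closer than that threshold, keeping only groups with at least `min_size` members. The name `first_order_clusters` and the coordinates are hypothetical.

```python
import numpy as np


def first_order_clusters(points, threshold, min_size=5):
    """Group points linked by pairwise distances <= threshold and keep
    the groups that contain at least min_size points."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    parent = list(range(n))  # union-find forest, one tree per group

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i, j] <= threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [members for members in groups.values() if len(members) >= min_size]


# Five tightly packed cases plus three isolated ones (toy data).
cases = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (0.05, 0.05),
         (10, 10), (20, 0), (0, 20)]
clusters = first_order_clusters(cases, threshold=0.5, min_size=5)
print(clusters)
```

Higher-order clusters would be obtained by applying the same idea to the centers of the first-order groups, which this sketch leaves out.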
The technique used to analyze the spatial distribution of pollutants in the sample of companies was inverse distance weighting (IDW). This technique assumes that things that are close to one another are more similar than those that are far apart, so to predict a value at a location it takes as reference the nearest neighbors within a given radius. There is empirical evidence on the use of interpolation techniques applied to environmental pollution, using both IDW and Kriging Density Approach (Dhiman and Singh Sandhu, 2017; Duc et al., 2000; Shi et al., 2013). According to Cañada (2008), spatial interpolation by IDW is defined as follows:
Equation 6. Spatial interpolation by IDW
$$\hat{Z}(s_0) = \sum_{i=1}^{n} \lambda_i Z(s_i)$$
where $\hat{Z}(s_0)$ is the predicted value at location $s_0$, $n$ is the total number of sample points (pollutant firms) near the point to be predicted, $\lambda_i$ is the weight assigned to each sample point used for the prediction, which diminishes with distance, and $Z(s_i)$ is the value observed at location $s_i$. Although there are other interpolation methods such as Kriging, IDW interpolation is the one that best fits the database of polluting companies in this research, since the distribution of the points generated greater errors with other techniques.
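A short sketch may help to see how the weighted prediction works in practice. It assumes weights proportional to the inverse of the distance raised to a power (2 by default) and normalized to sum to one; the source does not state the exponent or the search radius, so `power`, `radius`, the function name `idw_predict`, and the sample values are all hypothetical.

```python
import numpy as np


def idw_predict(sample_xy, sample_values, target_xy, power=2.0, radius=None):
    """Predict the value at target_xy as a distance-weighted mean of the
    sample values; closer samples receive larger weights."""
    sample_xy = np.asarray(sample_xy, dtype=float)
    sample_values = np.asarray(sample_values, dtype=float)
    target_xy = np.asarray(target_xy, dtype=float)
    dist = np.sqrt(((sample_xy - target_xy) ** 2).sum(axis=1))
    if radius is not None:
        keep = dist <= radius  # only neighbors inside the search radius
        sample_values, dist = sample_values[keep], dist[keep]
    if dist.size == 0:
        return float("nan")  # no sample point within the radius
    exact = dist == 0
    if exact.any():
        # The target coincides with a sample point: return its value.
        return float(sample_values[exact].mean())
    raw = dist ** -power
    lam = raw / raw.sum()  # normalized weights, sum to 1
    return float((lam * sample_values).sum())


# Two emitting firms 2 km apart with different pollutant values (toy data).
firms = [(0.0, 0.0), (2.0, 0.0)]
emissions = [10.0, 20.0]
midpoint = idw_predict(firms, emissions, (1.0, 0.0))
near_first = idw_predict(firms, emissions, (0.5, 0.0))
print(midpoint, near_first)
```

At the midpoint both firms weigh the same, while a location closer to the first firm is pulled toward that firm's value.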
On the other hand, Empirical Bayesian Kriging (EBK) interpolation was used to distribute the PM10 values of the nine monitoring stations, employing ArcGIS Geostatistical Analyst 10.4.1, since it provides a better fit for distributing air pollution over continuous space. With a structure similar to the previous formula, the results were generalized by calculating the mean squared error of interpolation (RMSR), described as follows:
Equation 7. Mean squared error of Interpolation
where $\hat{Z}(s_i)$ is the value after interpolation and $Z(s_i)$ is the measured value at point $s_i$. For the PM10 interpolation, EBK was chosen because with this technique the mean error was considerably reduced (0.6530) relative to the IDW technique (1.314). Figure 4 shows the fit of the data by means of the semivariogram.
Figure 4. Semivariogram for EBK interpolation of monitoring stations showing PM10 values.
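The comparison between interpolation methods rests on an error summary of interpolated versus measured values. The sketch below assumes that RMSR denotes the square root of the mean squared difference between the two; the source does not show the expression, so this reading, the function name `interpolation_error`, and the numbers are assumptions for illustration only.

```python
import numpy as np


def interpolation_error(interpolated, measured):
    """Root of the mean squared difference between interpolated and
    measured values (assumed meaning of RMSR)."""
    interpolated = np.asarray(interpolated, dtype=float)
    measured = np.asarray(measured, dtype=float)
    return float(np.sqrt(np.mean((interpolated - measured) ** 2)))


# Hypothetical station readings and two sets of interpolated values.
measured_pm10 = [50.0, 62.0, 71.0]
method_a = [51.0, 61.0, 71.0]
method_b = [54.0, 58.0, 75.0]
error_a = interpolation_error(method_a, measured_pm10)
error_b = interpolation_error(method_b, measured_pm10)
print(error_a, error_b)
```

The method with the smaller error would be preferred, which is the criterion the text applies when choosing EBK over IDW for PM10.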
Similarly, kernel density estimation was used to identify the areas of the MMA where BC cases are concentrated; according to Kelsall & Diggle (1995), it is denoted as follows:
Equation 8. Kernel density
where $g(x_j)$ is the density of cell $j$, $d_{ij}$ is the distance between cell $j$ and the location of BC case $i$, $h$ is the standard deviation of the normal distribution, $W_i$ is a weight at the location of a BC case, and $I_i$ is an intensity at the location of a BC case; the expression also includes a constant. The density of BC cases helps to identify the areas where the sample is concentrated, as well as spatial patterns.
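For readers who want to experiment, the density surface can be approximated as follows. This is a sketch under stated assumptions: it uses a bivariate normal kernel with bandwidth `h` and normalizing factor 1 / (2πh²), and unit weights and intensities when none are supplied. The source does not show the exact expression, and the function name `normal_kernel_density` and the coordinates are hypothetical.

```python
import numpy as np


def normal_kernel_density(cell_xy, case_xy, h, weights=None, intensities=None):
    """Density at each grid cell as the sum of normal kernels centered
    on the case locations."""
    cell_xy = np.asarray(cell_xy, dtype=float)   # shape (m, 2)
    case_xy = np.asarray(case_xy, dtype=float)   # shape (k, 2)
    k = len(case_xy)
    w = np.ones(k) if weights is None else np.asarray(weights, dtype=float)
    inten = np.ones(k) if intensities is None else np.asarray(intensities, dtype=float)
    # Squared distance between every cell j and every case i.
    diff = cell_xy[:, None, :] - case_xy[None, :, :]
    d_sq = (diff ** 2).sum(axis=2)               # shape (m, k)
    kernel = np.exp(-d_sq / (2 * h ** 2)) / (2 * np.pi * h ** 2)
    return (kernel * (w * inten)).sum(axis=1)


# One case at the origin, evaluated at the origin and three units away.
density = normal_kernel_density([[0, 0], [3, 0]], [[0, 0]], h=1.0)
print(density)
```

A larger `h` spreads each case over a wider area and produces a smoother surface, so the bandwidth largely determines how local the detected concentrations are.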
Kulldorff's (1997) spatial scan statistic was used to identify risk areas for BC cases, using information on the population of each AGEB. This method has been widely used in health research as a tool to identify groupings of phenomena associated with health and sociodemographic factors (Kihal-Talantikite et al., 2013; Kulldorff and Nagarwalla, 1995; Rao et al., 2017). Statistical scanning was performed using a discrete Poisson model, identifying high-risk groups of BC cases in the AGEBs in relation to their population. The expected number of BC cases in each AGEB is calculated as:
Equation 9. Discrete Poisson Model for high-risk groups
$$E[c] = p \cdot \frac{C}{P}$$
where $c$ is the observed number of BC cases, $p$ is the AGEB population, and $C$ and $P$ are the total number of BC cases and the total population, respectively. A relative risk of BC cases for each AGEB is calculated by dividing the observed number of BC cases by the expected number of BC cases. The alternative hypothesis is that there is a higher risk of BC cases within the exploration cluster than outside it.
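The expected counts and the relative risk described above amount to simple proportional arithmetic. A minimal sketch, with hypothetical function names and made-up counts for three areal units:

```python
import numpy as np


def expected_cases(population, total_cases):
    """Expected cases per unit if cases were spread in proportion to
    population (the null hypothesis of equal risk everywhere)."""
    population = np.asarray(population, dtype=float)
    return population * total_cases / population.sum()


def relative_risk(observed, population):
    """Observed cases divided by expected cases, per unit."""
    observed = np.asarray(observed, dtype=float)
    return observed / expected_cases(population, observed.sum())


# Toy data: 10 cases among 400 inhabitants in three units.
observed_cases = [2, 6, 2]
unit_population = [100, 100, 200]
expected = expected_cases(unit_population, sum(observed_cases))
risk = relative_risk(observed_cases, unit_population)
print(expected, risk)
```

A unit with a ratio above 1 has more cases than its population share would suggest; whether that excess is statistically significant is what the scan statistic evaluates.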
Under the Poisson assumption, the likelihood function for a specific window is proportional to:
Equation 10. Relative risk for high-risk groups
$$\left(\frac{c}{E[c]}\right)^{c} \left(\frac{C - c}{C - E[c]}\right)^{C - c} I()$$
where $C$ is the total number of BC cases, $c$ is the observed number of BC cases within the window, and $E[c]$ is the expected number of BC cases within the window under the null hypothesis that there is no difference. Because the analysis is conditioned on the total number of observed cases, $C - E[c]$ is the expected number of cases outside the window. $I()$ is an indicator function, with $I() = 1$ when the window has more cases than expected under the null hypothesis and 0 otherwise. The most likely cluster was determined through the maximum log likelihood ratio (LLR), generating 999 Monte Carlo simulations to determine the statistical significance of the identified clusters.
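The logic of the scan can be sketched in a few lines, although SaTScan itself builds circular windows from the unit centroids and is far more complete. The sketch assumes that the candidate windows are given as lists of unit indices, that the null hypothesis redistributes the total number of cases in proportion to population, and that the p-value is the rank of the observed maximum among the simulated maxima. The names `poisson_llr` and `scan_p_value`, the windows, and the counts are hypothetical.

```python
import numpy as np


def poisson_llr(c, e, total):
    """Log likelihood ratio of one window under the discrete Poisson
    model; zero unless the window holds more cases than expected."""
    if c <= e:
        return 0.0
    inside = c * np.log(c / e)
    # When every case falls inside the window the outside term vanishes.
    outside = (total - c) * np.log((total - c) / (total - e)) if c < total else 0.0
    return float(inside + outside)


def scan_p_value(cases, population, windows, n_sim=999, seed=0):
    """Maximum LLR over the candidate windows and its Monte Carlo p-value."""
    rng = np.random.default_rng(seed)
    cases = np.asarray(cases)
    population = np.asarray(population, dtype=float)
    total = int(cases.sum())
    probs = population / population.sum()
    expected = total * probs

    def max_llr(counts):
        return max(poisson_llr(counts[w].sum(), expected[w].sum(), total)
                   for w in windows)

    observed = max_llr(cases)
    # Null replicates: the same total number of cases spread by population.
    simulated = [max_llr(rng.multinomial(total, probs)) for _ in range(n_sim)]
    rank = 1 + sum(s >= observed for s in simulated)
    return observed, rank / (n_sim + 1)


# Toy data: five equally populated units, the first with a clear excess.
toy_cases = [30, 1, 1, 1, 1]
toy_population = [1000, 1000, 1000, 1000, 1000]
toy_windows = [[0], [1], [2], [3], [4], [0, 1], [1, 2], [2, 3], [3, 4]]
llr, p_value = scan_p_value(toy_cases, toy_population, toy_windows, n_sim=99)
print(llr, p_value)
```

With 999 replicates, as in the analysis described above, the smallest attainable p-value is 0.001.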