3.2. Model Results and Interpretation
This protocol is extracted from research article:
Predicting Lung Cancer in the United States: A Multiple Model Examination of Public Health Factors
Int J Environ Res Public Health, Jun 6, 2021; DOI: 10.3390/ijerph18116127

Our modelling approach was the same regardless of the specific method used. We (a) randomly partitioned the dataset into training (80%) and testing (20%) subsets, and (b) checked for outliers, multicollinearity, and target leakage [25]. Model accuracy was assessed by performance on both the training partition and the test partition.
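The 80/20 random partition described above can be sketched as follows. This is a minimal illustration in Python, not the authors' code: the function name split_train_test, the fixed seed, and the use of row indices rather than an actual dataset are all assumptions made for the example.

```python
import random

def split_train_test(n_rows, test_fraction=0.2, seed=42):
    """Randomly partition row indices 0..n_rows-1 into train and test sets."""
    indices = list(range(n_rows))
    rng = random.Random(seed)  # fixed seed (an assumption) so the split is reproducible
    rng.shuffle(indices)
    n_test = round(n_rows * test_fraction)
    test = sorted(indices[:n_test])    # first 20% of the shuffled rows
    train = sorted(indices[n_test:])   # remaining 80%
    return train, test

# Hypothetical dataset size, for illustration only
train_idx, test_idx = split_train_test(1000)
```

Fitting each model on the training rows and then scoring it on both partitions gives the two accuracy figures the text refers to. A large gap between them would suggest overfitting.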

We fitted several models, starting with two separate layers of variables: (1) adult smoking and (2) states. We included adult smoking because it is well established as the number one contributing cause of lung cancer. We included geographic states because we expected differences by state in ambient emissions, emission regulations, culture, and baseline population health. The geographic states model contains data for forty-five states, with Alabama as the baseline (reference) state for the dummy variables. The remaining five states (Alaska, Kansas, Michigan, Minnesota, and Nevada), five territories (American Samoa, Guam, Northern Mariana Islands, Puerto Rico, Virgin Islands), and Washington D.C. had insufficient data and were therefore excluded from the analysis. We then examined models that include (3) only the EQI domain variables and (4) only the ambient emission variables. See Figure 6 for the regression results of the four models.

Figure 6. All Regression Models. Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1.

Smoking is a very strong predictor of lung cancer. For every percentage increase in adult smoking, the number of lung cancer cases increases by 164.583 per 100,000 citizens. The variance explained (adj. R²) is 0.3141.

The state model in Figure 6 covers 45 US states, sorted by t-value to show the relative magnitude of each state's effect. Thirty states are statistically significant at p < 0.05, and all of these except Georgia are also significant at p < 0.01. Some states have a positive coefficient estimate, indicating a higher lung cancer rate than Alabama, the arbitrary baseline state, whereas others have a negative coefficient estimate, indicating a lower rate. The variance explained (adj. R²) is 0.5304.

Kentucky has the largest positive coefficient, indicating that its residents have a higher tendency to have lung cancer: 29.893 more cases per 100,000 residents than Alabama. Seven other states are statistically significant and higher risk: Arkansas, West Virginia, Illinois, Indiana, Missouri, Mississippi, and Georgia. Conversely, Utah has the most negative coefficient, indicating a lower tendency to have lung cancer: 41.404 fewer cases per 100,000 residents than Alabama. Twenty-one other states are statistically significant (p < 0.05) and lower risk: Maryland, Pennsylvania, New Jersey, Virginia, North Dakota, Iowa, Tennessee, Hawaii, Wisconsin, South Dakota, Arizona, Montana, Texas, Washington, Nebraska, Oregon, Wyoming, Idaho, New Mexico, California, and Colorado. Massachusetts is borderline statistically significant (p = 0.077).
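To illustrate how coefficients "vs. Alabama" arise from dummy coding with a baseline state, the sketch below uses a property of ordinary least squares: in a model containing only an intercept and one dummy per non-baseline state, the intercept equals the baseline state's mean and each state coefficient equals that state's mean minus the baseline mean. The function name state_contrasts and all rate values are hypothetical and are not taken from the study data.

```python
from statistics import mean

def state_contrasts(rates_by_state, baseline):
    """Return (intercept, {state: coefficient}) for an OLS model with an
    intercept and one dummy variable per non-baseline state.

    In such a dummy-only model the intercept is the baseline mean and each
    coefficient is that state's mean minus the baseline mean.
    """
    intercept = mean(rates_by_state[baseline])
    coefs = {
        state: mean(rates) - intercept
        for state, rates in rates_by_state.items()
        if state != baseline
    }
    return intercept, coefs

# Hypothetical lung cancer rates per 100,000 (illustrative values only)
example = {
    "Alabama": [70.0, 74.0, 66.0],
    "Kentucky": [98.0, 102.0, 100.0],
    "Utah": [28.0, 30.0, 32.0],
}
intercept, coefs = state_contrasts(example, baseline="Alabama")
```

With these made-up numbers the Kentucky coefficient is positive and the Utah coefficient is negative, mirroring the direction of the reported estimates. Choosing a different baseline state would shift every coefficient by a constant without changing the model's fit.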

The macro model consists of only the five EQI variables, one per domain: air, water, land, built, and sociodemographic. We model these by themselves to assess their macro-level association with lung cancer without confounding by smoking, state, or ambient emissions. A higher value of each index indicates worse environmental quality [23]. Figure 6 shows the EQI domains sorted by t-statistic. A positive coefficient therefore means that worse environmental quality is associated with more lung cancer. An EQI_Air coefficient of 6.409 indicates that for every unit of worse air quality, there are 6.409 more lung cancer cases per 100,000 people. The water quality coefficient is also positive and statistically significant, but smaller: 0.846 more lung cancer cases per 100,000 people.

According to the regression coefficients, the land, socio-demographic, and built domains show countervailing, counterintuitive effects: areas with worse environmental quality in these domains have a lower incidence of lung cancer. Unequal access and socio-economic disparities could partially explain these paradoxical results. To try to resolve them, we added higher-order terms, i.e., the squared terms EQI_Land², EQI_Built², and EQI_SocioD², as well as the interaction terms EQI_Land*EQI_Built, EQI_Land*EQI_SocioD, and EQI_Built*EQI_SocioD. None of these terms improved the interpretability of the coefficients, and they increased the variance explained by only a small amount (0.005) while increasing collinearity, so they were dropped. The variance explained (adj. R²) is 0.2146.
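The squared and interaction terms described above can be constructed as in the following sketch. It is an illustration only: the helper add_higher_order_terms, the "^2" naming convention, and the EQI values are hypothetical, and the authors' actual tooling is not stated in this section.

```python
from itertools import combinations

def add_higher_order_terms(columns):
    """Given {name: list of values}, return a new dict that also contains
    a squared term for each column and a product term for each pair."""
    expanded = dict(columns)
    for name, values in columns.items():
        expanded[f"{name}^2"] = [v * v for v in values]
    for a, b in combinations(columns, 2):
        expanded[f"{a}*{b}"] = [x * y for x, y in zip(columns[a], columns[b])]
    return expanded

# Hypothetical EQI domain scores for three observations
eqi = {
    "EQI_Land": [0.5, -1.0, 2.0],
    "EQI_Built": [1.0, 2.0, -0.5],
    "EQI_SocioD": [-2.0, 1.5, 1.0],
}
expanded = add_higher_order_terms(eqi)
```

Three base columns yield three squared terms and three pairwise interactions. Because each new column is built from existing ones, it tends to correlate with them, which is consistent with the increase in collinearity noted in the text.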

Figure 6 also shows the micro-level variables, nine ambient emissions: Cyanide compounds, Carbon Monoxide, Carbon Disulfide, Diesel Exhaust, Nitrogen Dioxide, Tropospheric Ozone, Coarse Particulate Matter, Fine Particulate Matter, and Sulfur Dioxide. Six of these have data from both timeframes: Carbon Monoxide, Nitrogen Dioxide, Tropospheric Ozone, Coarse Particulate Matter, Fine Particulate Matter, and Sulfur Dioxide. Five of the ambient emissions are statistically significant in both timeframes: Nitrogen Dioxide, Tropospheric Ozone, Coarse Particulate Matter, Fine Particulate Matter, and Sulfur Dioxide. The higher the level of Fine Particulate Matter or Sulfur Dioxide, the higher the rate of lung cancer. Fine Particulate Matter is the most hazardous in both time periods, T1 and T2, and Sulfur Dioxide is almost as hazardous. The variance explained (adj. R²) for this model is 0.3256, which is higher than that of the adult smoking model by itself (0.3141).

Paradoxically, the higher the level of Nitrogen Dioxide, Tropospheric Ozone, or Coarse Particulate Matter, the lower the rate of lung cancer. Coarse Particulate Matter is particulate matter up to four times as large as Fine Particulate Matter but still respirable. It is not healthful, but a larger presence of it could mean that Fine Particulate Matter levels have decreased, amounting to an indirectly positive effect. The negative coefficients of Nitrogen Dioxide and Tropospheric Ozone are also paradoxical, but more difficult to understand. These negative coefficients may indicate countervailing, confounded effects or indirect effects. That is, Nitrogen Dioxide and Tropospheric Ozone may not be the factors directly causing lung cancer. According to Witschi (1988), “there is little evidence to implicate ozone or Nitrogen Dioxide directly as pulmonary carcinogens, but that they might modify and influence the carcinogenic process in the lung.” Overall, Nitrogen Dioxide and Tropospheric Ozone have shown mixed associations with lung cancer, implicated only as co-carcinogens, exacerbating lung disease [26,27,28]. Figure 7 shows the results of a model testing Tropospheric Ozone and Nitrogen Dioxide in both timeframes with interaction terms.

Figure 7. Testing for Interaction.

The coefficients of Tropospheric Ozone and Nitrogen Dioxide become positive (in both timeframes) in their relationship to lung cancer. The interaction terms are negative, and only the Nitrogen Dioxide interaction term is statistically significant, indicating a dampening multiplicative effect over time. This effect from the Nitrogen Dioxide interaction disappears when the other ambient emissions variables are added back in, so we drop it for the sake of simplicity. We attribute the negative coefficients to complex relationships among the various ambient emissions and possibly other variables not included in our model. These paradoxes notwithstanding, the micro-level model is more comprehensive than the macro-level EQI model. It seems that accounting for exposure to specific carcinogenic ambient emissions is more accurate, capturing more of the variance, than the simpler macro-level model.

The four models described thus far show significant explanatory and predictive power. We consider the adult smoking and state models to be foundational because adult smoking is obviously crucial to include, and the state model explains the most variance. We therefore combine adult smoking and geographic state to form the foundation for all multi-layer models. We examine the Foundation + EQI model results, grouped by variable layer (left side) and sorted by t-statistic (right side) in Figure 8.

Figure 8. Foundation + Environmental Quality Index. Residual standard error: 10.5 on 2236 degrees of freedom; Multiple R-squared: 0.6197, Adjusted R-squared: 0.6114; F-statistic: 74.36 on 49 and 2236 DF, p-value: < 2.2 × 10⁻¹⁶; Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1.
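The summary statistics reported for the Foundation + EQI model are internally consistent, which can be checked with the standard formulas for adjusted R-squared and the overall F-statistic of a linear model with an intercept. The sketch below is only a cross-check using the reported values (R-squared 0.6197, 49 model and 2236 residual degrees of freedom); the function names are hypothetical.

```python
def adjusted_r2(r2, df_model, df_resid):
    """Adjusted R-squared for a linear model with an intercept."""
    n_minus_1 = df_model + df_resid  # total degrees of freedom, n - 1
    return 1.0 - (1.0 - r2) * n_minus_1 / df_resid

def f_statistic(r2, df_model, df_resid):
    """Overall F-statistic computed from R-squared and degrees of freedom."""
    return (r2 / df_model) / ((1.0 - r2) / df_resid)

# Values taken from the reported model summary
adj = adjusted_r2(0.6197, 49, 2236)
f = f_statistic(0.6197, 49, 2236)
```

Rounded to the reported precision, these give an adjusted R-squared of 0.6114 and an F-statistic of about 74.36, matching the figures in the summary.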

Many states are positively associated with lung cancer, with Kentucky even more hazardous than adult smoking, according to their t-statistics. The next ten states are more hazardous than EQI_Air: Illinois, Arkansas, Indiana, Ohio, Missouri, New York, Georgia, Maine, West Virginia, and North Carolina. Note that all of these states are in the Eastern, Southern, or Midwestern regions of the United States. On the other hand, the environmental quality indexes for the sociodemographic, land, built environment, and water domains are negatively associated with lung cancer, which is paradoxical. This could indicate a confounding of unhealthful environmental quality with otherwise healthful city living. For example, a lower-quality environment (vehicle exhaust) may be experienced near high-quality healthcare systems, which can detect lung cancer early. Interspersed among those environmental domain variables are the states negatively associated with lung cancer: Utah, New Mexico, Colorado, Arizona, Wyoming, California, Tennessee, and Idaho. Note that all but Tennessee are states in the Western region of the United States.
