3.1. Data Cleaning
Predicting Lung Cancer in the United States: A Multiple Model Examination of Public Health Factors
Int J Environ Res Public Health, Jun 6, 2021; DOI: 10.3390/ijerph18116127

After examining descriptive statistics for each variable, we centered, scaled, and log-transformed variables that were not normally distributed. These steps bring the variables closer to the assumptions of multiple regression and reduce multicollinearity. We append the suffix “_log” to a variable name to indicate a log transformation, e.g., SO2_T1_log and CS2_log. We then checked each variable for outliers and missing values; if the proportion of outliers and missing values was less than 10%, we replaced them with the median value of the corresponding state. If all counties of a state were missing values, those values remained NA. The final sample size is 2,862 observations.
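The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `clean_variable`, the `state` column name, the 1.5 × IQR outlier rule, and the exact order of operations (log, then center and scale, then impute) are all assumptions, since the text does not specify them.

```python
import numpy as np
import pandas as pd


def clean_variable(df, col, state_col="state", log=False, max_bad=0.10):
    """Return a cleaned copy of df[col] as a float Series (hypothetical helper)."""
    x = df[col].astype(float)

    # Optional log transform for skewed, strictly positive variables.
    if log:
        x = np.log(x)

    # Center and scale to mean 0 and standard deviation 1 (NaN values are skipped).
    x = (x - x.mean()) / x.std()

    # Flag outliers with the 1.5 * IQR rule (an assumption; the paper gives no rule).
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    fence = 1.5 * (q3 - q1)
    bad = x.isna() | (x < q1 - fence) | (x > q3 + fence)

    # Impute only when fewer than 10% of values are outliers or missing.
    if bad.mean() < max_bad:
        x = x.mask(bad)
        state_median = x.groupby(df[state_col]).transform("median")
        # A state whose counties are all missing has a NaN median, so it stays NaN.
        x = x.fillna(state_median)

    return x.rename(f"{col}_log" if log else col)
```

Calling `clean_variable(df, "SO2_T1", log=True)` on a county-level data frame would return a Series named `SO2_T1_log`, matching the naming convention above.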

Table 3 and Table 4 show the final versions of the variables after cleaning (imputation and/or transformation). Nitrogen Dioxide in 2006–2010 had too many missing values and was therefore excluded from all models.

Table 3. Variables and Data Cleaning.

Table 4. Descriptive Statistics.

We show a matrix plot of the EQI variables in Figure 3 and matrix plots of the ambient emissions variables in Figure 4 and Figure 5 to display the correlations at the macro and micro levels. Most correlations are significant, which suggests that a predictive model can be fit, but also that we must check for collinearity.
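One common way to follow up on the collinearity concern is the variance inflation factor (VIF). The sketch below is illustrative only: the text does not state which diagnostic was used, and the function name `vif_from_correlations` is hypothetical. It relies on the standard identity that the VIFs of a set of predictors are the diagonal entries of the inverse of their correlation matrix.

```python
import numpy as np
import pandas as pd


def vif_from_correlations(df):
    """Variance inflation factor for each column of df (hypothetical helper)."""
    # Pearson correlation matrix of the predictors, using complete rows only.
    corr = df.dropna().corr().to_numpy()
    # The j-th diagonal entry of the inverse equals 1 / (1 - R_j^2), where
    # R_j^2 comes from regressing predictor j on all the other predictors.
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=df.columns, name="VIF")
```

A VIF near 1 indicates little shared variance with the other predictors; values above roughly 5 to 10 are often treated as a sign of problematic collinearity, though that threshold is a rule of thumb rather than something taken from the text.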

Figure 3. Matrix Plot of Lung Cancer, Adult Smoking, and Environmental Quality Index, all domains. Significance codes: 0 ‘***’ 0.001 ‘.’ 0.1 ‘ ’ 1.

Figure 4. Matrix Plot of Lung Cancer with Variables in Time 1: Particulate Matter 2.5 and 10, Carbon Disulfide, Cyanide compounds, Carbon Monoxide, Diesel Exhaust, Nitrogen Dioxide, Tropospheric Ozone, Sulfur Dioxide. Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1.

Figure 5. Matrix Plot of Lung Cancer with Micro Variables in Time 2: Particulate Matter 2.5, Particulate Matter 10, Carbon Monoxide, Tropospheric Ozone, Sulfur Dioxide. Significance codes: 0 ‘***’ 0.001 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1.

