Let us discuss the details of the two datasets used in this work.
Our dataset has been carefully curated from several open sources to examine the possible factors that may affect the COVID-19 related infection and death numbers in the 50 states of the USA. The individual open-access data sources as well as the integrated (curated) dataset have been shared on GitHub (https://github.com/satunr/COVID-19/tree/master/US-COVID-Dataset). Below, we summarize the features and output labels of the integrated dataset.
Gross Domestic Product (in terms of million US dollars) for US states [31] (filename: source/GDP.xlsx, feature name: GDP).
Distance from one state to another, measured not in miles but as the Euclidean distance between the latitude-longitude coordinates of the pair of states [32] (filename: source/Data_distance.xlsx, feature name: d(state1, state2)).
Gender features are the fractions of the total population made up of male and female individuals [33] (filename: source/Data_gender.csv, feature name: Male, Female).
Ethnicity feature(s) are the fraction of total population representing white, black, Hispanic and Asian individuals (we leave out other smaller ethnic groups) [34] (filename: source/Data_ethnic.csv, feature name: White, Black, Hispanic and Asian).
Healthcare index is measured by the Agency for Healthcare Research and Quality (AHRQ) on the basis of (1) type of care (such as preventive or chronic), (2) setting of care (such as nursing homes or hospitals), and (3) clinical areas (such as care for patients with cancer or diabetes) [35] (filename: source/Data_health.xlsx, feature name: Health).
Homeless feature is the number of homeless individuals in a state [36] (filename: source/Data_homeless.xlsx, feature name: Homeless). The normalized homeless population of each state is the ratio of its homeless population to its total population.
Total cases (and deaths) of COVID-19 are the numbers of individuals who tested positive (and who died) [37] (filename: source/Data_covid_total.xlsx, feature name: Total Cases and Total Death). The normalized infected/death count is the ratio of the infected/death count to the total population of the given state.
Infected score and death score are obtained by rounding the normalized total cases and deaths to discrete values between 0–6 (feature name: Infected Score, Death Score).
Death-to-Infected is a feature measuring impact of death in terms of the difference between death and infected scores. It is calculated as max(Death Score – Infected Score, 0).
Lockdown type is a feature capturing the type of lockdown (shelter in place: 1 and stay at home: 2) in a given state [37, 38] (filename: source/Data_lockdown.csv, feature name: Lockdown).
Day of lockdown captures the number of days between 1st January 2020 and the date on which lockdown was imposed in a region [39] (filename: source/Data_lockdown.csv, feature name: Day Lockdown).
Population density is the ratio between the population and area of a region [40] (filename: source/Data_population.csv, feature name: Population, Area, Population Density).
Traffic/activity of airport measures the passenger traffic (also normalized by the total traffic across all the states of the USA) [41] (filename: source/Data_airport.xlsx, feature name: Busy airport score, Normalized busy airport).
Age groups (0 to 80+) in brackets of 4 years (also normalized by total population) [40] (filename: source/Data_age.xlsx, feature name: age_to_, Norm_to_, e.g., age4to8); we later group them in brackets of 20 years for the purposes of analysis.
Peak infected (and peak death) measures the duration between the date of the first infection and the date on which daily infections (and deaths) peak [40] (feature name: Peak Infected, Peak Death).
Testing measures the number of individuals tested for COVID-19 (total number, before and after imposition of lockdown) [38, 42] (filename: source/Data_testing.xlsx, feature name: Testing, Pre-lockdown testing, Post-lockdown testing).
Pre- and post-infected and death counts measure the number of individuals infected and dead before and after the lockdown date (feature name: Testing, Pre-infected count, Pre-death count, Post-infected count, Post-death count).
Days between first infected and lockdown date (feature name: First-Inf-Lockdown).
The above features, their abbreviations and summary statistics (i.e., mean, standard deviation, maximum and minimum) are listed in Table 1. Note that, for gender and ethnicity, we report the fraction of the total state population falling in each category.
The features in the order shown under “Feature name” are: GDP, inter-state distance based on lat-long coordinates, gender, ethnicity, quality of health care facility, number of homeless people, total infected and death, population density, airport passenger traffic, age group, days for infection and death to peak, number of people tested for COVID-19, and days elapsed between the first reported infection and the imposition of lockdown measures in a given state.
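Several of the features above are simple arithmetic transformations of raw counts. The following is a minimal Python sketch of those transformations, not the authors' code: the function names and the sample numbers are hypothetical, and only the formulas (Euclidean distance on latitude-longitude pairs, count divided by population, population divided by area, and max(Death Score – Infected Score, 0)) come from the descriptions above.

```python
import math


def latlong_distance(coord_a, coord_b):
    """Euclidean distance between two (latitude, longitude) pairs, in degrees."""
    return math.hypot(coord_a[0] - coord_b[0], coord_a[1] - coord_b[1])


def normalized_count(count, population):
    """Ratio of a raw count (cases, deaths, homeless) to the total population."""
    return count / population


def population_density(population, area):
    """Ratio of the population to the area of a region."""
    return population / area


def death_to_infected(death_score, infected_score):
    """max(Death Score - Infected Score, 0), as defined in the feature list."""
    return max(death_score - infected_score, 0)


# Made-up values, chosen only so the arithmetic is easy to follow.
print(latlong_distance((3.0, 4.0), (0.0, 0.0)))  # 5.0
print(normalized_count(500, 100_000))            # 0.005
print(population_density(1_000_000, 50_000))     # 20.0
print(death_to_infected(4, 2))                   # 2
print(death_to_infected(1, 3))                   # 0
```

Because the distance is computed on raw coordinates, it is expressed in degrees rather than miles, which matches the description of the distance feature.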
The New York City (NYC) datasets (https://github.com/satunr/COVID-19/blob/master/US-COVID-Dataset/NYC_dist_mob.xlsx) show the inter-borough distance and mobility as well as COVID-19 infected (https://github.com/satunr/COVID-19/blob/master/US-COVID-Dataset/NYC-Inf.xlsx) and death counts (https://github.com/satunr/COVID-19/blob/master/US-COVID-Dataset/NYC-Dth.xlsx) for the 5 boroughs of NYC, namely, Manhattan, Queens, Brooklyn, Bronx and Staten Island.
Mobility data (based on traffic volume counts collected by DOT for New York Metropolitan Transportation Council (NYMTC) [43]) shows the number of trips from one borough to another.
COVID-19 data shows the COVID-19 infected and death counts for each borough [44].
We acquire the daily infected and testing counts across the US from January to July 2020 [45]. This dataset is part of the COVID Tracking Project, which collects COVID-19 statistics on the numbers of tests, cases, hospitalizations, and patient outcomes from every US state and territory through voluntary public participation.
We use the Scikit-learn library KBinsDiscretizer to group the continuous feature values into discrete values by creating balanced clusters using the quantile strategy [46].
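The discretization step can be sketched as follows. This is an illustration under assumptions rather than the authors' configuration: the input values are made up, and n_bins=3 and encode="ordinal" are choices made only for this example. The source specifies only the KBinsDiscretizer class and the quantile strategy.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical values of one continuous feature (one row per sample).
values = np.arange(1.0, 13.0).reshape(-1, 1)

# strategy="quantile" places bin edges at quantiles of the data, so each
# bin receives roughly the same number of samples.
# n_bins=3 and encode="ordinal" are assumptions made for this sketch.
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
labels = discretizer.fit_transform(values).ravel().astype(int)

# Count how many samples fall in each bin to check the balance.
counts = np.bincount(labels)
print(labels)
print(counts)
```

With twelve evenly spaced values and three bins, each bin receives four samples, which is the balance that the quantile strategy aims for.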
Supervised machine learning algorithms learn a function that maps the input training data (i.e., features) to some output labels [47]. In this work, we consider the following supervised learning techniques. (Refer to [48–54] for details on these ML approaches.)
Support Vector Machine (SVM) is used for classification and regression problems and maps the inputs to high-dimensional feature spaces. SVM operates on hyperplanes, i.e., decision boundaries that help classify the data points. The objective is to maximize the margin, that is, the separation between the hyperplane and the data points nearest to it. SVM is memory efficient and effective for datasets with fewer data samples [55].
Stochastic Gradient Descent (SGD) is an iterative approach that fits the data to an objective function [56]. As the name suggests, it is a stochastic variant of the popular gradient descent (GD) optimization model [57]. In GD, the optimizer starts at a random point in the search space and moves toward the lowest point of the function by traversing along the slope. Unlike GD, which requires calculating the partial derivative for each feature at each data point, SGD achieves computational efficiency by computing derivatives on randomly chosen data points.
Nearest Centroid (NC) is a simple classification model that represents each class by the centroid of its members. Subsequently, it assigns each data point to the class whose centroid is the closest to it. NC has no additional model parameters to tune, although it tends to struggle when classes are non-convex [58].
Decision Trees (DTs) are a classification and regression technique that assigns target labels based on decision rules inferred from data features [59]. DT maintains the decision rules using a tree. A data point is assigned to a class by starting at the root and repeatedly comparing one of its feature values against the rule at the current node to branch to a child node, until a leaf is reached.
Gaussian Naive Bayes (NB) belongs to a class of fast, probabilistic learning techniques that apply Bayes’ theorem to assign labels to the data points [60].
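The contrast between GD and SGD described above can be illustrated with a small NumPy sketch. It is not taken from the source: the synthetic line y = 2x + 1, the learning rate, the step counts, and the helper names gradient and descend are all assumptions made for the example. GD uses every data point for each gradient, while SGD uses a single randomly chosen point per step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noise-free data on the line y = 2x + 1 (an assumption for this sketch).
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x + 1.0
features = np.column_stack([x, np.ones_like(x)])  # columns: slope term, intercept term


def gradient(params, rows):
    """Gradient of the mean squared error over the selected rows."""
    residual = features[rows] @ params - y[rows]
    return 2.0 * features[rows].T @ residual / len(rows)


def descend(pick_rows, steps, learning_rate=0.1):
    """Start at a random point and repeatedly step against the gradient."""
    params = rng.normal(size=2)
    for _ in range(steps):
        params = params - learning_rate * gradient(params, pick_rows())
    return params


all_rows = np.arange(len(x))

# GD: every step uses all 50 data points.
gd_params = descend(lambda: all_rows, steps=500)

# SGD: every step uses one randomly chosen data point.
sgd_params = descend(lambda: rng.integers(0, len(x), size=1), steps=2000)

print("GD  slope, intercept:", gd_params)
print("SGD slope, intercept:", sgd_params)
```

Each SGD step is cheaper because it touches one sample, at the cost of a noisier path toward the minimum. On this noise-free toy problem both variants recover a slope near 2 and an intercept near 1.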
While supervised ML approaches generally yield reliable prediction accuracy, they often suffer from overfitting or convergence issues [47, 61]. Each of the above approaches has its own advantages and disadvantages. SVM works well when the underlying distribution of the data is not known. However, it is prone to overfitting when the number of features is much greater than the number of samples. SGD converges quickly on large datasets, but it may require tuning a number of hyperparameters. Conversely, DT involves almost no hyperparameters, but often entails slightly higher training time. Unlike DT, NB requires less training time but works on the implicit assumption that all the attributes are mutually independent. Finally, NC is a fast method but is not robust to outliers or missing data. In the context of our work, we expect that the discriminatory feature(s) will yield a high accuracy irrespective of the underlying supervised ML algorithm used.
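One way to probe whether accuracy holds up across algorithms is to fit all five classifiers on the same features and compare cross-validated accuracy. The sketch below assumes the Scikit-learn implementations with default settings and a synthetic dataset from make_classification; neither the data nor the hyperparameters are the authors', and the 5-fold split is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import NearestCentroid
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a feature matrix and discrete labels (an assumption).
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# The five supervised techniques discussed above, with default settings.
models = {
    "SVM": SVC(),
    "SGD": SGDClassifier(random_state=0),
    "NC": NearestCentroid(),
    "DT": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
}

# Mean accuracy over 5 cross-validation folds for each classifier.
accuracy = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in accuracy.items():
    print(f"{name}: {score:.2f}")
```

If a feature subset is truly discriminatory, the five scores should be uniformly high; a large spread between classifiers would suggest that the result depends on the choice of algorithm.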