Estimating infections

Liam J. Revell

This method is extracted from research article: PeerJ, Aug 2021

covid19.Explorer: a web application and R package to explore United States COVID-19 data

DOI: 10.7717/peerj.11489

Since the beginning of this pandemic, it has been widely understood that confirmed COVID-19 cases underestimate the true number of infections, sometimes vastly (Al-Sadeq & Nasrallah, 2020; Wu et al., 2020). This underestimation has multiple causes. One important factor is that testing capacity has been limited throughout much of the SARS-CoV-2 pandemic in the United States, particularly in its earliest days (Rosenberg et al., 2020). A second significant factor behind the disconnect between observed cases and true infections is that, in the United States, SARS-CoV-2 testing is voluntary, population surveillance testing has been relatively scarce, and many cases of SARS-CoV-2 infection present asymptomatically or with mild symptoms (Oran & Topol, 2020). As such, I consider confirmed COVID-19 deaths to be a much more reliable indicator of disease burden than confirmed cases. Deaths, however, are a lagging indicator of infections.

The key parameter that relates daily COVID-19 deaths to the number of infections is the infection fatality ratio (also called the infection fatality rate or IFR). IFR, normally expressed as a percent, is defined as the fraction of deaths among all infected individuals, taking into account both observed infections (‘cases’) and asymptomatic or unobserved infections (O’Driscoll et al., 2020). An IFR value of 1.5%, for example, would mean that, on average, for every 1,000 infections in a specified population, there would be 15 deaths.

I modeled the number of new SARS-CoV-2 infections on the ith day by taking the number of observed COVID-19 deaths on day i + k (in which k is the average lag period between initial infection and death, where death is the outcome of infection), and then dividing this quantity by the IFR. In other words, given 50 COVID-19 deaths on day i + k, and an IFR of 0.5%, we would predict that 10,000 new SARS-CoV-2 infections had occurred on day i. Both k, the average lag time from infection to death (in cases of SARS-CoV-2 infections resulting in death), and the IFR are to be specified by the user.
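The calculation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the covid19.Explorer implementation (which is an R package). The function name `estimate_infections` and its arguments are hypothetical, and the choice to apply the IFR of the infection day (day i) rather than of the death day is an assumption.

```python
def estimate_infections(daily_deaths, ifr, lag_days):
    """Estimate new infections on day i as deaths on day i + lag_days over IFR.

    daily_deaths: list of daily COVID-19 death counts (index = day).
    ifr: infection fatality ratio as a fraction (0.005 means 0.5%); either a
         single number or a list with one value per estimated day.
    lag_days: k, the assumed average lag from infection to death.

    Returns len(daily_deaths) - lag_days estimates. The most recent k days
    cannot be estimated from deaths alone.
    """
    n = len(daily_deaths) - lag_days
    if lag_days < 0 or n < 0:
        raise ValueError("lag_days must be between 0 and len(daily_deaths)")
    if isinstance(ifr, (int, float)):
        ifr = [ifr] * n  # constant IFR through time
    if len(ifr) < n:
        raise ValueError("need one IFR value per estimated day")
    # Assumption: the IFR in force on the infection day (day i) is used.
    return [daily_deaths[i + lag_days] / ifr[i] for i in range(n)]


# The example from the text: 50 deaths on day i + k with an IFR of 0.5%
# implies roughly 10,000 new infections on day i (here k = 21).
deaths = [0] * 21 + [50]
print(estimate_infections(deaths, 0.005, 21))
```

Note that the returned series is shorter than the input by k days, which is why a separate step is needed for the most recent period.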

A reasonable lag time between infection and death might be approximately three weeks. For example, during a large outbreak in Melbourne, Australia, the time difference between peak recorded cases and peak confirmed COVID-19 deaths was around 17 days. Infected persons normally test negative for the first few days following exposure (Kucirka et al., 2020), so this more or less corresponds with a three-week lag. Likewise, Wilson et al. (2020) report a median time from symptom onset to death of 13 days, and a meta-analysis by Dhouib et al. (2021) showed an incubation period of approximately 5–7 days; together these also correspond to a lag time of approximately 18–20 days.

Likewise, IFR values ranging from about 0.2% to over 1.0% have been reported over the course of the pandemic. For instance, a study based on an early, super-spreader event in Germany estimated an IFR (corrected to the demographic distribution of the local population) of 0.36% (Streeck et al., 2020). Other researchers have reported higher estimated IFR (e.g., Rinaldi & Paradisi, 2020). In a large meta-analysis O’Driscoll et al. (2020) estimated IFR of SARS-CoV-2 infection across 45 different countries and obtained median estimates ranging from 0.24% to 1.49%, with higher IFRs typically reported for countries with older populations. In general, it is probably reasonable to suppose that IFR has fallen through time as treatment of severely ill patients has improved (Fan et al., 2020). Likewise, even within the U.S., IFR is unlikely to be precisely the same at a given date in different jurisdictions, due to differences in demographic structure between areas as well as other factors.

I suspect that it is within reason for users of covid19.Explorer to specify an IFR that is no greater than about 1.5% and that declines gradually from the start of the pandemic towards the present, with a current IFR that is perhaps around 0.3–0.5% (O’Driscoll et al., 2020; Blackburn et al., 2021). Nonetheless, covid19.Explorer permits the user to specify a time-varying IFR by fixing the IFR at each quarter (on the website), or at any arbitrary time interval (using the R package directly), and then interpolating daily IFR between each period using local regression smoothing (LOESS; Cleveland, 1979). As such, it is also possible to build a model for IFR through time that both falls and rises, perhaps as stresses on local healthcare resources increase or decrease through time with rising and falling COVID-19 case numbers.
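To make the idea of a time-varying IFR concrete, the sketch below builds a daily IFR series from a few user-fixed values. It deliberately substitutes plain linear interpolation for the LOESS smoothing described above, so that the example stays dependency-free. It is not how covid19.Explorer computes its curve, and the function name, the anchor days, and the anchor values are all hypothetical.

```python
def interpolate_ifr(anchor_days, anchor_ifr, n_days):
    """Build a daily IFR series from IFR values fixed at a few anchor days.

    Linear interpolation is used here as a simple stand-in for LOESS.
    Days before the first anchor or after the last anchor take the value
    of the nearest anchor.
    """
    if not anchor_days or len(anchor_days) != len(anchor_ifr):
        raise ValueError("need one IFR value per anchor day")
    points = sorted(zip(anchor_days, anchor_ifr))
    days = [d for d, _ in points]
    if len(set(days)) != len(days):
        raise ValueError("anchor days must be distinct")

    daily = []
    for day in range(n_days):
        if day <= points[0][0]:
            daily.append(points[0][1])
        elif day >= points[-1][0]:
            daily.append(points[-1][1])
        else:
            # Find the pair of anchors that brackets this day.
            for (d0, v0), (d1, v1) in zip(points, points[1:]):
                if d0 <= day <= d1:
                    weight = (day - d0) / (d1 - d0)
                    daily.append(v0 + weight * (v1 - v0))
                    break
    return daily


# Hypothetical anchors: IFR declining from 1.5% to 0.5% over 90 days.
ifr_by_day = interpolate_ifr([0, 90], [0.015, 0.005], 100)
```

Because the anchors are arbitrary, the same function can describe an IFR that falls and later rises, for example by adding a third anchor with a higher value.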

Reporting can vary through time, including regularly over the course of the week (for instance, fewer COVID-19 deaths tend to be reported on weekends than from Monday through Friday; e.g., Fig. 1A). To take these reporting artifacts into account, I used both moving averages and local regression (LOESS) smoothing. Both the window for the moving average and the LOESS smoothing parameter are controlled by the user.
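As a rough illustration of how a moving average removes a weekly reporting cycle, consider the following Python sketch. The text does not say whether the moving average is centered or trailing, so the centered window used here is an assumption, as are the function name and the toy data.

```python
def moving_average(values, window=7):
    """Centered moving average with an odd window.

    Near the ends of the series the window is truncated to the available
    days, so the first and last few values are averaged over fewer points.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number of days")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Toy data: five weekdays with 10 reported deaths, two weekend days with 3.
reported = [10, 10, 10, 10, 10, 3, 3] * 4
smoothed = moving_average(reported, window=7)
```

With a 7-day window every full window contains exactly one week of data, so away from the ends the weekday/weekend pattern averages out to a constant.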

The approach of using only confirmed COVID-19 deaths—though robust—does not permit us to estimate the true number of infections between k days ago and the present. To do this, I assumed a sigmoidal relationship (by default) between time and the ratio of daily confirmed cases over the estimated true number of infections—a quantity called the case detection rate or CDR (Fig. 1B). Since the number of confirmed cases cannot exceed the true number of new infections, logic dictates that the CDR should have a value that falls between 0 and 1.

I decided on a sigmoidal relationship between the case detection rate and time because it seemed reasonable to presume the ratio was very low early in the pandemic when confirming a new infection was limited primarily by testing capacity, but that CDR has probably risen (in many localities) to a more or less consistent value as testing capacity increased. Since getting tested is voluntary, and since many infections of SARS-CoV-2 are asymptomatic or only mildly symptomatic, this ratio seems unlikely to rise to very near 1.0 in the U.S. regardless of the availability of testing. Fig. 1, created using covid19.Explorer, shows daily confirmed cases/daily estimated infections (under our model) for all U.S. data over the entire course of the pandemic to date (Fig. 1B), given observed daily deaths (red bars) and assumed IFR evolution through time (blue curved line; Fig. 1A). Our plot seems to indicate a CDR of about 0.42 at the present; however, the reader should keep in mind that in practice this value is estimated separately for each jurisdiction that is being analyzed, and as such might be lower in some states and higher in others, even for a constant IFR value or function.
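The quantities discussed above can be written compactly as follows. The notation is introduced here for illustration and does not come from the original article: C_i is the number of confirmed cases on day i, D_{i+k} the number of confirmed deaths on day i + k, and \hat{I}_i the estimated number of infections on day i. The three-parameter logistic curve is one common sigmoid and is shown only as an assumed example of the functional form. The text specifies a sigmoidal relationship but not its parameterization.

```latex
\hat{I}_i = \frac{D_{i+k}}{\mathrm{IFR}_i},
\qquad
\mathrm{CDR}_i = \frac{C_i}{\hat{I}_i},
\qquad
0 \le \mathrm{CDR}_i \le 1

% Assumed example of a sigmoid in time t (a = asymptote, b = steepness, c = midpoint)
\mathrm{CDR}(t) \approx \frac{a}{1 + e^{-b\,(t - c)}},
\qquad 0 < a \le 1,\; b > 0
```

Under this form, the asymptote a plays the role of the "more or less consistent value" that the CDR approaches as testing capacity increases.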

In the event that a sigmoid function cannot be fit to the implied daily CDR for a given state or jurisdiction, the software automatically substitutes the mean CDR from the last 30 days of data. Since I only used the CDR to estimate daily infections for the most recent time period of our data (see below), and since CDR tended to increase asymptotically towards a more or less constant value in most jurisdictions (e.g., Fig. 1), this seemed fairly reasonable. When using the covid19.Explorer in R (rather than through the web interface), this option can also be selected explicitly by the user. An important point to make in this context is that I intend the sigmoidal functional form to be a heuristic (rather than literal) means of capturing the approximate relationship between CDR and time since the start of the pandemic—and thus estimate the CDR for the most recently reported cases. If users are unsatisfied with the fit of the sigmoid curve to CDR, they are encouraged to substitute the mean implied CDR from the last 30 days of data. The reason I chose the sigmoid fit to begin with was primarily to avoid distortions driven by so-called ‘data dumps,’ in which a state or jurisdiction releases a large number of previously misclassified or unreported cases or deaths on a single day. In practice, using the mean implied CDR from the past 30 days or the fitted value of CDR from a sigmoid fit will not make much of a difference in the majority of jurisdictions represented in our data.

After fitting this sigmoidal curve to the observed cases and estimated infections through day now − k (or calculating the mean implied CDR from the most recent 30 days), we must then turn to the final k days. To obtain estimated infections for these days, we simply divide the observed cases from the last k days of data by the fitted CDR values of the curve. Figure 2 shows the result of this analysis applied to data for the U.S. state of Massachusetts.
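The final step can be sketched in Python using the simpler of the two options described above, the mean implied CDR from the most recent 30 days, since fitting a sigmoid would require an optimization library. The function names are hypothetical, and taking the mean of daily ratios (rather than a ratio of 30-day totals) is an assumption. With a fitted sigmoid, a separate CDR value would be used for each day instead of a single constant.

```python
def mean_recent_cdr(cases, infections, window=30):
    """Mean implied CDR (cases / estimated infections) over the last
    `window` days for which death-based infection estimates exist."""
    n = len(infections)
    ratios = [
        cases[i] / infections[i]
        for i in range(max(0, n - window), n)
        if infections[i] > 0  # skip days with no estimated infections
    ]
    if not ratios:
        raise ValueError("no days with positive estimated infections")
    return sum(ratios) / len(ratios)


def extend_with_cdr(cases, infections, cdr):
    """Fill in the most recent days, which have confirmed cases but no
    death-based estimate yet, as cases / CDR."""
    if not 0 < cdr <= 1:
        raise ValueError("CDR must lie in (0, 1]")
    recent = [c / cdr for c in cases[len(infections):]]
    return list(infections) + recent


# Toy data: 35 days of cases, but death-based estimates for only the
# first 30 days (that is, k = 5).
cases = [40] * 35
infections = [100.0] * 30
cdr = mean_recent_cdr(cases, infections)
all_infections = extend_with_cdr(cases, infections, cdr)
```

Here `all_infections` has one value per day of case data, with the last k entries derived from cases rather than deaths.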

In addition to computing the raw number of daily infections, this method can also be used to estimate infections as a percentage of the total population. To make this calculation, I obtained state populations through time from the U.S. Census Bureau. Data were available only through 2019 at the time of writing, so to estimate state-level 2020 population sizes, I used a total mid-year 2020 U.S. population estimate of 331,002,651 to ‘correct’ each 2019 state population size to a 2020 level.
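One plausible reading of this correction is a proportional rescaling: each 2019 state population is multiplied by the ratio of the 2020 national total to the sum of the 2019 state values. The text does not spell out the exact procedure, so the Python sketch below is an assumption, and the function name and toy figures are hypothetical.

```python
def scale_to_2020(state_pop_2019, us_total_2020=331_002_651):
    """Rescale 2019 state populations so that they sum to a 2020 U.S. total.

    Assumes every state grew by the same factor between 2019 and 2020 and
    that the supplied jurisdictions together make up the whole U.S.
    """
    total_2019 = sum(state_pop_2019.values())
    if total_2019 <= 0:
        raise ValueError("2019 populations must sum to a positive number")
    factor = us_total_2020 / total_2019
    return {state: pop * factor for state, pop in state_pop_2019.items()}


# Toy example with made-up numbers (not real census data).
toy_2019 = {"State A": 100, "State B": 300}
toy_2020 = scale_to_2020(toy_2019, us_total_2020=440)
```

Per capita infections then follow by dividing cumulative estimated infections by the rescaled population of the jurisdiction.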

Finally, CDC mortality data splits New York City (NYC) from the rest of New York state. Since this contrast is interesting (e.g., Gonzalez-Reiche et al., 2020), I maintained the separation and used a mid-2019 population estimate of 8,336,817 for NYC, then simply assumed that the population of NYC has changed between 2015 and 2020 in proportion to the rest of the state. Since the two have a part-to-whole relationship, this seemed reasonable. In fact, according to the U.S. Census Bureau, from 2010 to 2019 the fraction of New York State residents living in New York City is estimated to have grown by around 0.1 percentage points per year, from 41.8% in 2010 to about 42.8% in 2019. If this trend continued through 2020, then I may have underestimated the population of New York City by about 0.2%. Since this is only relevant when considering per capita SARS-CoV-2 infections and COVID-19 deaths, I suspect it is a relatively minor source of error compared to other simplifying assumptions of this software.

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
