COVID-19 Rapid Case Growth Prediction Technical Approach

Kinsa has developed a method to identify periods of rapid case growth of COVID-19 infections (“outbreak events”) in real time using Kinsa’s county-level illness signals and COVID-19 case data. Building on that method, we’ve developed a model to predict outbreak events 14 days in advance at the state level.


Data collection
Kinsa uses a proprietary smart thermometer and mobile application to collect real-time syndromic monitoring data from a network of millions of households across the United States. This allows us to measure the onset and duration of symptoms, transmission rates and the overall incidence of illness with high geographic precision. Kinsa has collected over 14.8 million temperature readings since March 15, 2020.

Table 1
Shows the population representation of the Kinsa user base

As outlined in Chamberlain et al. 2020, we use this information to calculate a variety of illness metrics, including overall incidence of influenza-like illness (ILI) and effective transmission rate (Rt), as well as two metrics that capture seasonally unusual illness levels (Atypical ILI) and transmission (Atypical Rt). We’ve shown previously that Atypical Rt is highly correlated with aggregated mobility data during the initial lockdown period (Figure 1a) and that Atypical ILI is highly correlated with subsequent rises in COVID-19 cases (Figure 1b).1

Figure 1a
Shows the correlation between Atypical Rt and mobility-based interventions

Figure 1b
Shows the correlation between Atypical ILI and COVID-19 Case counts

Though we consistently see correlations between raw fever counts in Kinsa data and subsequent COVID-19 infections, the temporal relationship between the two has shifted over the course of the 2020 pandemic. Early in the pandemic, when testing was highly constrained and often restricted to the in-patient setting, we typically saw an 18-day lag between Kinsa data and COVID-19 case counts. More recently, as testing availability has increased substantially (with drive-through testing, changes to CDC guidance around testing eligibility and improved supply chains), we’ve seen the lead time of our data decline as more people are able to be tested earlier in the course of an infection.

Figure 1c
Shows the changing temporal relationship between Kinsa and COVID-19 data

For COVID-19 case data, we use the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.2

Feature engineering
We selected several classes of features to power our model. (1) Mean Atypical Rt over the preceding 7, 14 and 21 days captures the aggregate deviation above expected seasonal transmission. (2) We also include lagged and averaged values of Atypical ILI over 7, 14 and 21 days. We fit exponential growth curves to both (3) Kinsa Atypical ILI and (4) COVID-19 case incidence and use the fitted coefficients as inputs. Lastly, we calculate (5) the day-over-day increase in COVID-19 case rate.
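As a rough illustration of how features of this kind could be computed from daily series, here is a minimal sketch in plain Python. The function names, the 14-day fitting window and the log-linear least-squares fit are assumptions made for illustration, not Kinsa's implementation; lagged values are omitted for brevity.

```python
import math

def trailing_mean(series, window):
    """Mean of the most recent `window` daily values (e.g. 7, 14 or 21 days)."""
    if len(series) < window:
        raise ValueError("series is shorter than the window")
    return sum(series[-window:]) / window

def exp_growth_rate(series):
    """Fit y = a * exp(b * t) by least squares on log(y) and return b.

    Assumption: every value is strictly positive, so log(y) is defined.
    """
    n = len(series)
    ts = range(n)
    logs = [math.log(y) for y in series]
    t_mean = sum(ts) / n
    l_mean = sum(logs) / n
    cov = sum((t - t_mean) * (l - l_mean) for t, l in zip(ts, logs))
    var = sum((t - t_mean) ** 2 for t in ts)
    return cov / var

def day_over_day(series):
    """Increase from the previous day to the latest day."""
    return series[-1] - series[-2]

def build_features(atypical_rt, atypical_ili, case_rate, fit_window=14):
    """Assemble one feature row from three daily series (hypothetical layout)."""
    feats = {}
    for w in (7, 14, 21):
        feats[f"rt_mean_{w}"] = trailing_mean(atypical_rt, w)
        feats[f"ili_mean_{w}"] = trailing_mean(atypical_ili, w)
    feats["ili_growth"] = exp_growth_rate(atypical_ili[-fit_window:])
    feats["case_growth"] = exp_growth_rate(case_rate[-fit_window:])
    feats["case_dod"] = day_over_day(case_rate)
    return feats
```

Each call to `build_features` yields one row of model inputs for a given state and day; a series containing zeros or negative values would need a different growth fit than the log-linear one assumed here.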

We used odds ratios to guide our feature selection. Intuitively, an odds ratio compares the odds that an outcome will occur given a particular exposure with the odds of that outcome absent the exposure.3 We used this concept to assess both whether a feature was informative of coming outbreaks and the predictive time horizon over which it typically operates. For example, when we looked at the likelihood of a COVID-19 outbreak over various time horizons relative to an initial incidence of Atypical fever transmission (Rt), we saw a significant increase in the risk of an outbreak occurring over a 2-3 week window. Transmission rates calculated from the COVID-19 data directly (COVID-19 Rt in Figure 2), on the other hand, were predictive of a coming outbreak over a much shorter period, typically only a few days.
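To make the odds-ratio idea concrete, the sketch below tabulates a 2x2 table for one time horizon and computes the ratio. The function names and the boolean-per-day inputs are hypothetical; how Kinsa actually defined "exposure" for each feature is not stated here.

```python
def contingency_counts(signal, outbreak, horizon):
    """Count (a, b, c, d) for a 2x2 table at a given horizon in days.

    signal[t]   -- True if the feature was elevated on day t (the "exposure")
    outbreak[t] -- True if day t falls inside an outbreak event
    a: exposed, outbreak `horizon` days later    b: exposed, no outbreak
    c: unexposed, outbreak `horizon` days later  d: unexposed, no outbreak
    """
    a = b = c = d = 0
    for t in range(len(signal) - horizon):
        future = outbreak[t + horizon]
        if signal[t] and future:
            a += 1
        elif signal[t]:
            b += 1
        elif future:
            c += 1
        else:
            d += 1
    return a, b, c, d

def odds_ratio(a, b, c, d):
    """Odds ratio (a / b) / (c / d), written as (a * d) / (b * c)."""
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)
```

Repeating the tabulation for a range of horizons (for example 1 to 28 days) would give a curve of odds ratios per feature, which is one way to read off the window over which a feature is most informative.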

Figure 2
Shows the time horizon and contribution of various features

Machine Learning Classification
We used a standard logistic regression model for this problem. It is a relatively simple and well-understood method, and its output is commonly interpreted as the probability that an outbreak will occur.
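For readers unfamiliar with the method, the following is a bare-bones logistic regression fitted by batch gradient descent. It is a teaching sketch under simple assumptions (no regularization, fixed learning rate, hypothetical function names) and says nothing about the library or settings actually used.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, x):
    """Modelled probability of the positive class (an outbreak) for one row."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Minimize mean log loss with plain batch gradient descent."""
    n, k = len(X), len(X[0])
    weights, bias = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            grad_b += err
            for j in range(k):
                grad_w[j] += err * xi[j]
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias
```

The value returned by `predict_proba` lies between 0 and 1, which is what allows a cutoff such as 0.5 to be read as "an outbreak is more likely than not".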

We defined outbreak events quantitatively. First, we flagged any day where the rate of increase exceeded 2 cases per 1 million people. Second, we applied a smoothing function to identify outbreak events where (1) at least 5 of the previous 15 days exceeded the case growth threshold and (2) contiguous events are grouped together. We then trained the model on the daily time series of these aggregated events.
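The labeling rule can be sketched as three small steps. Two readings are assumptions here: the "rate of increase" is taken as the day-over-day change in cases per million, and the "previous 15 days" is taken as a 15-day window ending on the current day. Function names are illustrative.

```python
def flag_rapid_growth(case_rate_per_million, threshold=2.0):
    """Flag days whose day-over-day increase exceeds the threshold."""
    flags = [False] * len(case_rate_per_million)
    for t in range(1, len(case_rate_per_million)):
        increase = case_rate_per_million[t] - case_rate_per_million[t - 1]
        flags[t] = increase > threshold
    return flags

def smooth_flags(flags, window=15, min_days=5):
    """Mark day t as an outbreak day if at least `min_days` of the
    `window` days ending at t were flagged."""
    labels = []
    for t in range(len(flags)):
        start = max(0, t - window + 1)
        labels.append(sum(flags[start:t + 1]) >= min_days)
    return labels

def group_events(labels):
    """Merge contiguous outbreak days into (start, end) index pairs, inclusive."""
    events, start = [], None
    for t, is_outbreak in enumerate(labels):
        if is_outbreak and start is None:
            start = t
        elif not is_outbreak and start is not None:
            events.append((start, t - 1))
            start = None
    if start is not None:
        events.append((start, len(labels) - 1))
    return events
```

The smoothed daily labels are what a classifier would be trained on, while the grouped events give the onset dates used when scoring predictions.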

Figure 3a
Shows the relationship between case velocity and outbreak events

We trained our model to predict outbreak events 14 days in the future. To evaluate the accuracy of the predictions, we performed multiple rounds of cross-validation, training on a random sample of 33 states and testing on the remaining 17.
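One way to produce such repeated train/test splits is shown below. The number of rounds, the seed and the function name are assumptions; only the 33/17 split comes from the text.

```python
import random

def state_splits(states, n_train=33, n_rounds=5, seed=0):
    """Yield (train_states, test_states) pairs from repeated random shuffles.

    Splitting by state keeps every day of a given state on one side of the
    split, so a model is never tested on a state it was trained on.
    """
    rng = random.Random(seed)
    for _ in range(n_rounds):
        shuffled = list(states)
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]
```

With 50 state identifiers as input, each round yields 33 training states and 17 held-out states.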

We compared the first day that our algorithm flagged an outbreak probability greater than 0.5 to the first date of the outbreak event as defined previously. We consider any prediction that is followed by the onset of an outbreak within 28 days to be a true positive.
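The scoring rule described above could be expressed as follows. The function names are hypothetical, and counting an onset on the same day as the alert as a hit is an assumption.

```python
from datetime import date, timedelta

def first_alert(dates, probs, threshold=0.5):
    """Return the first date whose predicted probability exceeds the threshold."""
    for d, p in zip(dates, probs):
        if p > threshold:
            return d
    return None

def is_true_positive(alert_date, outbreak_onsets, max_lead_days=28):
    """True if some outbreak begins within `max_lead_days` after the alert."""
    limit = timedelta(days=max_lead_days)
    return any(timedelta(0) <= onset - alert_date <= limit
               for onset in outbreak_onsets)
```

For a correct prediction, the difference between the matched onset date and the alert date is the lead time of that warning.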

Figure 3b
Kinsa HealthWeather showing outbreak predictions in Florida on March 13th and June 12th

We compared three feature sets: Kinsa features alone, COVID features alone, and the two combined. As shown below, the model with access to the combined feature set shows significant improvement over the model using COVID features alone.

Table 2
Shows the comparative analysis of different feature sets

Feature Set    Precision    Recall    F1
Kinsa Only     0.64         0.62      0.63
COVID Only     0.65         0.74      0.69
Combined       0.74         0.79      0.76
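As a quick consistency check, F1 is the harmonic mean of precision and recall, so the F1 column can be recomputed from the other two. The dictionary keys below are illustrative labels for the three rows of Table 2.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as reported in Table 2
table_2 = {
    "kinsa_only": (0.64, 0.62),
    "covid_only": (0.65, 0.74),
    "combined": (0.74, 0.79),
}

for name, (p, r) in table_2.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
# kinsa_only: F1 = 0.63
# covid_only: F1 = 0.69
# combined: F1 = 0.76
```

The recomputed values match the reported F1 column to two decimal places.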

Lastly, we looked at the time between our initial outbreak prediction and the start of the outbreak event. Typically, our outbreak detection provides 11 days of advance warning of rapid increases in COVID-19 cases, with 50 percent of warnings providing between 8 and 15 days of advance notice.

Figure 3c
Shows the frequency distribution of lead time for correct predictions



1 S. Chamberlain, I. Singh, C. Ariza, A. Daitch, P. Philips, B. Dalziel. Real-time detection of COVID-19 epicenters within the United States using a network of smart thermometers. medRxiv 2020.04.06.20039909; doi: https://doi.org/10.1101/2020.04.06.20039909

2 https://github.com/CSSEGISandData/COVID-19

3 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/