Tuesday, June 23, 2020

A deeper dive into COVID data: what do those daily numbers actually mean? (Or: why growing hot spots are already worse than you know)

Every day we can go online and see a news report that today our state announced x new COVID cases, y new hospitalizations, and z new deaths. Have you ever wondered what those numbers actually mean?

Someone who was very new to all of this might think, oh, if those are the newly released numbers today, that means that yesterday there were x new cases, y new hospitalizations, and z new deaths.

Someone who has been following the numbers for a while would probably know by now that that's not quite true. They probably would have noticed that every week the number of daily deaths spikes in the middle of the week and plummets on the weekend. And they might know that this is because reporting slows down on the weekend and this creates a backlog that is cleared during the ensuing week.

Based on this knowledge, my general assumption was that most of the deaths reported on any given day were deaths that had occurred over the last few days, with some slowing down on the weekend and catching up during the week.

Turns out, it's more complicated than that!

Warning: this is going to be a very nerdy post full of graphs and number crunching, but if that sounds interesting to you, read on. The purpose of this post is largely educational, but if you don't feel like wading through the whole thing and want a take-home message related to what's going on in the country right now, it's this: in places like Arizona, Texas, and several other southern states where numbers are now spiking, the real current numbers are likely already substantially higher than the reported numbers, and it might already be too late to avert disaster.

I've been following the numbers from covidtracking.com for quite some time. That site is the source for most visualizations of COVID daily numbers for the U.S. and its states that you'll see online. Every day the site updates, and the numbers of newly reported cases, deaths, and (for many states) hospitalizations become available. What we are seeing there, and what is going into almost all the graphs that you might see, are the numbers by report date. Using the numbers from covidtracking.com, there's no way to see the actual dates on which those deaths, hospitalizations, or positive COVID tests occurred.

I recently started also looking at the numbers the state of Ohio provides on its COVID-19 dashboard. Here, the test date as well as (if applicable) hospitalization date and death date for each reported COVID case in Ohio are available. Here we can see on what dates all these events actually occurred, not just on what dates they were reported.
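
To make the rest of this post a little more concrete, here's roughly the kind of per-case structure I'm describing, sketched in Python. The rows are made up and the column names are just placeholders, not the dashboard's actual field names:

```python
# Made-up illustration of the per-case structure described above. Each row is
# one reported case; the hospitalization and death dates are missing when
# those events never happened.
import pandas as pd

cases = pd.DataFrame(
    {
        "onset_date":     pd.to_datetime(["2020-05-20", "2020-05-22", "2020-05-23"]),
        "admission_date": pd.to_datetime(["2020-05-24", pd.NaT, pd.NaT]),
        "death_date":     pd.to_datetime(["2020-06-01", pd.NaT, pd.NaT]),
    }
)
print(cases)
```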

I've spent some time analyzing all these numbers and learned some things I found very interesting, so I wanted to share them with anyone else who'd like to learn more about this topic.

For the first set of graphs I looked at the numbers Ohio reported on each day of the one-week period from May 27 through June 2. This is a graph looking at all of the COVID cases Ohio reported during that week, by the case onset date (x-axis), with color coding indicating on which day of that week the cases were reported:


So, for example, new cases that occurred on May 26 were reported in large numbers on May 27, 28, and 29, and continued to be reported in decreasing numbers on May 30, 31, and June 1.

You can see from this graph that the bulk of the new cases reported between May 27 and June 2 happened relatively close in time to that week, but a quite substantial number of the cases occurred up to three weeks earlier, with smaller numbers going all the way back to March.

If you look closely, you might also notice that some of the colored bars are slightly negative. This means that on the report date corresponding to that color, the number of cases with that onset date was revised downward, presumably as a result of corrections to make previously reported numbers more accurate.
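
If you're curious how a graph like this can be assembled, here's a rough sketch of the underlying bookkeeping. It assumes you've saved one snapshot of the cases-by-onset-date counts for each report date; the variable and function names are just for illustration, not the exact code I used:

```python
# Sketch: new cases broken out by onset date (rows) and report date (columns),
# computed by differencing consecutive daily snapshots of cumulative counts.
import pandas as pd

def cases_by_onset_and_report_date(snapshots):
    """`snapshots` maps each report date to a Series of cumulative case counts
    indexed by onset date; include one snapshot from the day before the window
    so the first day's increment is well defined."""
    report_dates = sorted(snapshots)
    cols = {}
    for prev_rd, rd in zip(report_dates, report_dates[1:]):
        # Change since the previous snapshot; this can go negative when
        # previously reported counts are revised downward.
        cols[rd] = snapshots[rd].sub(snapshots[prev_rd], fill_value=0)
    return pd.DataFrame(cols).fillna(0)

# Plotting the result with DataFrame.plot(kind="bar", stacked=True) gives a
# stacked bar chart like the one above, color-coded by report date.
```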

Let's next look at the same graph but with hospitalizations:


It looks fairly similar to the new cases graph, although the peak at the right side is somewhat broader, indicating that hospitalization reporting tended to lag a bit more than case reporting, at least over this time period.

And now the same graph for deaths:


Here the trends differ more from the other two graphs. The top of the peak is farther to the left, and the peak is not as tall relative to the trailing bars to its left, indicating that a greater proportion of deaths had reporting delays of more than a few days.

So those are examples of what the reporting delays can look like for cases, hospitalizations and deaths. Next let's look at the delays in a more quantitative manner. Here I used a longer period of time, from May 27 through June 10. For every case, hospitalization, or death reported in that time, I calculated the difference between reporting date and event date, and then I plotted the cumulative distributions of those values:


This graph shows, for any number of days after the day an event occurred, what fraction of events of that type are reported within that many days. So, for example, if you move right along the x-axis to 2 days, you can draw a line straight up and see that about 30% (0.3) of deaths are reported within two days of the death date, whereas for hospitalizations and cases it's more like 40%. Or say you want to see how many days it takes before half of the events of a given type are reported: go up the y-axis to 0.5, then draw a line straight to the right, and you find that it takes 3 days for hospitalizations and cases but 5 days for deaths.

The distribution for deaths lags the distributions for cases and hospitalizations, meaning deaths tend to have longer reporting delays, at least out to about 10 days (roughly 75% reporting). Note that for all three types of event it takes about nine or ten days before you even get to three-quarters of full reporting! Beyond that point the trends reverse, and it takes longer to pick up the most lagging cases and hospitalizations than the most lagging deaths; I'm not sure why that is. Also, the distributions for cases and hospitalizations are quite similar to each other.
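
For anyone who wants to reproduce this kind of calculation, here's a minimal sketch in Python. It assumes you've already built a table with one row per event, containing the event date and the date the event first showed up in the data (which you can get by diffing daily snapshots, as in the earlier sketch); the column names are mine:

```python
# Sketch of the reporting-delay distribution: for each number of days d,
# the fraction of events that were reported within d days of the event date.
import numpy as np
import pandas as pd

def delay_cdf(events, max_days=30):
    delays = (events["report_date"] - events["event_date"]).dt.days
    days = np.arange(max_days + 1)
    return pd.Series([(delays <= d).mean() for d in days], index=days)

# e.g. delay_cdf(deaths)[2] would be the ~0.3 figure mentioned above: the
# fraction of deaths reported within two days of the death date.
```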

You might wonder whether these distributions are affected by the day of the week on which the event occurred (or at least, I wondered that!).

The answer is yes. Here's a figure of the same thing except separated out by the day of week of each event (day of the event, not day the event was reported):


You can see that the delays are much more pronounced for deaths that occurred on a Tuesday, Wednesday, or Thursday, which fits with what I mentioned earlier about reporting of deaths slowing down on weekends. If a death isn't reported by Friday, there's a good chance it will take several additional days before it's reported. It takes a full seven days before there's even a 50% chance that a death occurring on a Wednesday will have been reported.

Hospitalizations and new cases generally don't show as pronounced a day-of-week effect on reporting delays. I did notice that, for some reason, new cases occurring on Sundays apparently tend to have much longer delays.

So what does this all mean when it comes to interpreting the daily numbers that we see?

If you've read any of my posts about COVID you've seen graphs that look like this one:


The gray curve shows the number of deaths that were reported each day and displays the weekly oscillations I was talking about; you'll see that pattern on any COVID death graph showing the daily reported numbers with no smoothing. The black curve is a seven-day moving average of the gray curve, which is another very common way of displaying COVID data. (Note: the dates for the black curve are the center dates of each seven-day period; sometimes you'll see the same sort of graph with the dates being the end dates instead.) Because each point is the average of seven days, day-of-week effects are eliminated and the weekly oscillations go away.
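
In case it's helpful, here's what that smoothing looks like in code, using a toy stand-in for the daily reported numbers (the real series would come from the sources described above):

```python
# Centered seven-day moving average of a daily series.
import numpy as np
import pandas as pd

# Toy stand-in for deaths reported each day (index = report date).
dates = pd.date_range("2020-04-01", periods=28, freq="D")
daily = pd.Series(np.random.default_rng(0).poisson(40, len(dates)), index=dates)

# With center=True, each point is the mean of the three days before, the day
# itself, and the three days after, plotted at the middle date of the window.
smoothed = daily.rolling(window=7, center=True).mean()
```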

Similar graphs could be generated for cases and hospitalizations.

Now let's see what it looks like when we compare the numbers by their report date (which is the way they are almost always displayed) with the numbers by their event date.

First, new cases:


As I always do with the graph of daily cases for Ohio, I'll point out that the weird spike in mid-April is a result of a huge number of positive tests in prisons all being reported in a big bunch.

I will also point out, as usual, that reported case counts have to be put in the context of total test numbers. If cases are going up, it might just be because more testing is being done; you also have to pay attention to the percent positive. If cases and percent positive are both going up, that's when you really know there's a problem. This is now happening in a number of states, mostly southern ones; it hasn't been happening in Ohio, although the last few days unfortunately show signs we might be starting to head back in that direction...

You can see throughout the graph the delay between when positive tests happened and when they were reported. It does look like the delay was larger in March than it is now, which doesn't surprise me. But the delay still exists.

Next, hospitalizations:


And last, deaths:



So the theme of these three graphs is that when the numbers are rising, the average daily reported numbers run lower than the actual daily numbers by event date. When the numbers are roughly flat, the two are about the same. And when the numbers are falling, the average daily reported numbers run higher than the actual numbers by event date.

Another thing about these graphs: in all three, the numbers by event date (red) take a fairly sharp downward turn toward the end. That downturn isn't real; it comes from most of the recent numbers not having been reported yet. More on that in a moment.

Because with this pandemic the most rapid daily changes have always come from the growth phase, not the decline phase, the biggest deviation between the daily reported numbers and the real daily numbers always comes during times when the numbers are sharply rising. See, for example, around March 25-27 on the hospitalization graph. The real daily numbers were about twice the size of the numbers being reported at that time.

What does this mean in regard to what's going on today? There are a number of states, such as Arizona, Texas, Florida, and South Carolina, where numbers are currently rising sharply. Things already look bad, but the reality is that things are already worse than the numbers we're seeing!

And this doesn't take into account the delays from infection to onset of symptoms, from onset of symptoms to hospitalization, hospitalization to death, etc. I might explore those in another post. I think people are generally aware of these time lags and the fact that the numbers currently being reported reflect people who might have gotten infected a couple weeks ago. But the reporting delays explored in this post can create an even bigger time lag between "events on the ground" and when their effects really show up in the numbers we see.

This post is mainly about sharing some things I found interesting and trying to educate people, but if there's a big take-home message, it's that if the reported numbers are rising sharply, things are already worse than they look, so action needs to be taken ASAP. Fortunately, in Ohio we did that in March, early enough to avert disaster. In places like New York, government officials waited too long, and catastrophic death tolls were the result. Now, despite much greater advance warning and much greater knowledge of the disease, it looks like the same mistakes might be in the process of being made in places like Arizona.

So that's the big takeaway, and now I also want to share some other analyses I found interesting.

Remember the thing I mentioned about how the graphs of events by event date bend downward at the end, and how that bend isn't real? So the graphs by report date show outdated numbers, and the graphs by event date are badly incomplete for recent dates. Is there a way we can see what's really been going on, say, in the last week?

I took a stab at this by creating a function that takes the number of currently reported deaths or hospitalizations for each date along with how many days in the past each date was and returns an estimate of the "real" number of deaths or hospitalizations. The estimate is based on the cumulative distributions of reporting delays that I showed earlier.

So, for example, let's say that for a day five days in the past, ten deaths have currently been reported, and that the cumulative distribution shows that by day five, 50% of deaths for a given day have been reported on average. Then the estimate for the "real" number of deaths for that day is doubled from ten to twenty. (The actual function is slightly more complicated, because it uses the separate distributions for each day of the week, which seemed to add a little accuracy.)
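
Here's a simplified sketch of that kind of adjustment, leaving out the day-of-week refinement; the function and variable names are mine for illustration, not my exact code. It scales the count for each date up by the fraction of events expected to have been reported so far, using a cumulative distribution like the one computed earlier:

```python
# Sketch of the delay adjustment: scale reported counts up by the expected
# reporting completeness for however many days ago each event date was.
import pandas as pd

def estimate_true_counts(reported, cdf, as_of):
    """`reported`: Series of counts indexed by event date, as known on `as_of`.
    `cdf`: Series mapping days-of-delay to the fraction reported by then."""
    estimates = {}
    for event_date, count in reported.items():
        days_ago = (as_of - event_date).days
        frac_reported = cdf[days_ago] if days_ago in cdf.index else 1.0
        # e.g. 10 deaths at 50% expected completeness -> an estimate of 20.
        estimates[event_date] = count / frac_reported if frac_reported > 0 else float("nan")
    return pd.Series(estimates)
```

(As I describe next, the most recent day gets dropped, because its correction factor is the largest and noisiest.)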

How do I know whether this method is effective? I can take the deaths that had been reported by a date several weeks in the past, calculate the estimated daily deaths from those, and then compare them to what the numbers look like several weeks later, now that reporting is much closer to complete. I did find that the most recent day has too much variability, because the correction being applied to it is so large, so I drop the most recent day from the analysis; other than that, the results are pretty good. Let's take a look.

Here's a graph that shows the deaths by death date, both the most recent numbers and those that were reported by May 26, along with the estimated deaths from my calculations:


As you can see, the deaths that were reported by May 26 (gray curve) show the same downward bend at the end that we saw earlier in the graphs of the most recent numbers. The numbers for that same time range as of the most recent report (black) show that the downward bend was not real; the real numbers were basically flat at that time. But on May 26, using my estimation function, we would have seen that downward bend transformed into something basically flat (orange), which turns out to be quite close to the more complete numbers that we now have for those dates.

The closer you get to the current date, the more uncertainty there is in these estimates, but all in all the method seems to work well. I also checked this with the function I made for hospitalizations as well as with the deaths/hospitalizations that were reported by a couple of other dates over the course of the last month, and I was quite pleased with how the results turned out.

Given the dual problems of the numbers by report date (what we usually see) being outdated, and the numbers by event date being very incomplete for more recent dates, this adjustment method seems like a decent way to better see what's really been going on recently. I'm not claiming that it provides any huge utility for informing decision making, but it does provide at least a little insight about the recent trends. Whether or not anyone else finds it useful, it was an interesting exercise for me to tackle!

Now let's update the graphs I showed earlier comparing events by report day to events by event day to include these estimates. Here's the updated graph for deaths:


You can see that, as we have been in a period of declining deaths, the estimated deaths from my calculation (blue) are in between the deaths by date reported and the deaths by event date.

The hospitalization graph is more interesting:


We have also been in a period of declining hospitalizations. If you are looking at a graph of new hospitalizations by their report date, which is the graph you'd normally see, it looks like the curve is still trending downward. But in the estimated curve, it appears that the downward trend has recently leveled off and (although the most recent estimated numbers contain the most uncertainty) we may be starting to head back up.

This is concerning. I was worried about what would happen when bars and restaurants reopened. We didn't see an immediate effect from that, but we may be seeing one now. The numbers will have to be closely watched in the coming days. If this is the start of a real upward trend, it suggests we need to back off from reopening for the types of businesses (such as bars and indoor dining at restaurants) with the most transmission risk.

It also reinforces the importance of everyone being careful and of wearing masks, for which there is now a lot of compelling evidence of benefit in slowing disease spread.

One last interesting tidbit from this analysis. We know that in the reported numbers there's a strong day-of-week effect, with higher numbers midweek and lower numbers on the weekend. Are there day-of-week effects in the actual numbers by event date?
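
This is an easy question to check once you have the daily counts by event date. Something along these lines is all it takes (with a toy series standing in for the real data):

```python
# Average daily count for each day of the week, grouped by event date.
import numpy as np
import pandas as pd

# Toy stand-in: deaths by death date (the real series comes from the dashboard data).
dates = pd.date_range("2020-04-01", "2020-06-15", freq="D")
deaths = pd.Series(np.random.default_rng(1).poisson(30, len(dates)), index=dates)

weekday_means = (
    deaths.groupby(deaths.index.day_name())
    .mean()
    .reindex(["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"])
)
print(weekday_means)
```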

First, cases:


The numbers are much lower on the weekend, especially on Sunday. This isn't surprising. If someone gets tested as a result of, say, a doctor's appointment, that would clearly happen less on the weekend. (Note: in making this particular graph I excluded the very high single day totals from prison testing, which would bump up the Thursday and Friday numbers if they were included.)

Next, hospitalizations:


Similar to cases, but with a less pronounced decline on the weekend. Monday also appears to have a slight bump compared to other weekdays. This all makes sense as well: people are somewhat less likely to be admitted to a hospital on the weekend, and the Monday bump partly makes up for admissions that were delayed over the weekend.

The deaths graph is what surprised me:

Friday really stands out from the other days, and I wonder why that is. Now, I don't know for sure that all of the death dates are completely accurate. Perhaps there is some tendency for the dates to occasionally be in error in a way that biases them toward Fridays.

If it is a real trend, I do have a somewhat morbid hypothesis for why it might exist. Perhaps when someone is clearly going to die soon, there's a little bit of a tendency - whether it be in the dying person, in the family, or in the hospital staff - to want to "get it out of the way" before the weekend.

Just a thought. And it might not be a real trend.

So that wraps up this deeper dive into Ohio's COVID data. I hope you found it interesting and informative. I may do other posts like this exploring other aspects of the data. I've loved numbers, analyzing them, and making charts and graphs ever since I was a child. Now, between posts like this and the analysis I'm doing of things directly related to my research job, I spend a huge amount of my time thinking about and working with COVID-related numbers. It can be a distressing topic to spend so much time on, but I think doing all this helps me feel a little less powerless, because I'm actively working on things to try to make a difference.

Stay safe, everyone.
