The Importance of Conflicting Data
Having multiple data sources helps us spot errors and misleading figures in our data sets
When we evaluate scientific conclusions, the starting point of the evaluation is the data we are using. In an ideal world, the data we use to run an analysis or do a regression should be a clear representation of the world around us.
Yet there is still some distance between the events as they exist and the data as it has been gathered. Things can go wrong within that distance, and our data may poorly describe the world. But, if the data itself were bad, how would we know? How do we know when to be skeptical about the core data in a story or study and when to accept that it is a clear representation of the world around us?
Skepticism toward and outright dismissal of core data are an important component of the authority crisis. If people look at a data set and simply dismiss it out of hand, there is not much that can be done to convince them of the conclusions one might draw from that data. I want to talk about some recent stories in which the core data has been incorrect, how the errors were discovered, and how we can bend toward a world of trustworthy data.
For most data sets, people's starting point is credulity. People generally assume that the data came from somewhere, and that somewhere must have had some coherent path from a countable, recordable reality to the data as it is being presented. The actual path of that data may be quite convoluted.
Side note: If you want to deep dive into how we go from “counting a COVID case” to “data on a dashboard,” I wrote last year about how the Florida Health Department manages its data.
Many things can strain the default credulity that we tend to give to data. We may see data that doesn’t seem to match with the world we see around us. The data could be pointing to a reality that we find implausible or unlikely, causing us to question the nature of that data.
But the most obvious reason to question a data set is when there is a second source for that data and it tells a different story. When this happens, it causes us to step back and ask more foundational questions like, "Where the hell did this data come from anyway, and how did it get here?"
Over the last six months, there has been no small amount of turmoil regarding COVID deaths among children. For months, the CDC’s Data Tracker was giving a substantially inflated number of pediatric COVID deaths. At one point, the Data Tracker (the reference point for most journalists and politicians) was overcounting pediatric deaths by over 90%.
How did we know this? Because the CDC also tracked this data through the much more accurate National Center for Health Statistics (NCHS). Even after the CDC made a correction in March, the two data sets remained substantially divergent.
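The check that surfaced the problem is simple in principle: compare each source against a reference and flag any that diverge beyond some tolerance. Here is a minimal sketch of that idea in Python. The function names, the 10% threshold, and all counts are made up for illustration; they are not the actual CDC or NCHS figures.

```python
def overcount_pct(count: int, reference_count: int) -> float:
    """Percent by which `count` exceeds `reference_count`."""
    return 100 * (count - reference_count) / reference_count

def flag_divergent(sources: dict[str, int], reference: str,
                   threshold_pct: float = 10.0) -> list[str]:
    """Names of sources whose counts diverge from the reference
    source by more than `threshold_pct` in either direction."""
    ref = sources[reference]
    return [
        name
        for name, count in sources.items()
        if name != reference and abs(overcount_pct(count, ref)) > threshold_pct
    ]

# Hypothetical counts chosen only to mirror a ~90% overcount.
counts = {"data_tracker": 1900, "nchs": 1000}
print(overcount_pct(counts["data_tracker"], counts["nchs"]))  # 90.0
print(flag_divergent(counts, reference="nchs"))  # ['data_tracker']
```

The point of the sketch is that the comparison only exists because a second, independent pipeline was counting the same thing; with a single source, there is no reference to diverge from.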