One Data Format to Rule Them All – Why Data Management Matters

Like many of you, I've been following news of the pandemic very closely. Unlike many of you (I'm assuming), I have been simultaneously amazed and saddened by the impressive data visualizations that many news outlets have compiled. My two current favorites are the NYT Find Your Place in the Vaccination Line (https://www.nytimes.com/interactive/2020/12/03/opinion/covid-19-vaccine-timeline.html) and NPR's visualization of how full hospitals are by county (https://www.npr.org/sections/health-shots/2020/12/09/944379919/new-data-reveal-which-hospitals-are-dangerously-full-is-yours). Then again, I shouldn't assume: given that you're reading a blog from the leader of a data team, perhaps you have been equally enthralled. These graphics, underpinned by data, make for compelling stories of the pandemic and interesting ways of conveying complex information.

While the data visualizations are the success stories of data and COVID-19, some recent stories highlight just how challenging managing the data is. When debating people online, I've often heard some version of the line "you can make statistics say whatever you want them to say." I'm always rather insulted by this line. Firstly, because I work with statisticians, and they are the first people to try to disprove conclusions and quadruple-check their data. They are absolutely scrupulous. Secondly, the line is not only dismissive; it relieves the speaker of having to learn how statistics are generated, how to find the assumptions made in an analysis, what to look for when interpreting analyses, and how to engage with data more broadly. And as big a proponent as I am of everyone having a basic knowledge of statistics, I'm an even bigger proponent of everyone having a basic knowledge of data: how it's generated, how it's curated, and how it's used. While there are time-tested and robust ways of dealing with issues such as missing data, the rule of better data = better analysis usually stands. So how are we doing with generating and using "better data" for COVID?

A glaring illustration of the challenges with COVID data comes from a recent NYT article (https://www.nytimes.com/2020/11/23/opinion/coronavirus-testing.html?searchResultPosition=1). The article discusses one of the important statistics used to track the progress (or not) of the pandemic: percent positivity. Percent positivity is what it sounds like, the percentage of those tested who are positive. "The metric tells us whether we are testing enough or if the transmission of the virus is outpacing our efforts to slow it." The challenge is that there are currently no federal standards for reporting this information. The CDC has developed guidelines for reporting new cases and deaths, but there are no standards for reporting testing information. This leads to states reporting testing results in very different ways (e.g. including antigen-positive results or not), which in turn makes test positivity much harder to use as a measure of how the virus is spreading. Anyone looking at trends over time or between states would have to know exactly how each state calculated test positivity at that point in time to draw robust conclusions from the data.
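To make the stakes concrete, here is a toy calculation (all counts below are invented, purely for illustration) showing how two plausible definitions of percent positivity diverge depending on whether antigen results are counted:

```python
# Toy illustration of how the definition changes percent positivity.
# All counts are invented for demonstration purposes only.
pcr_tests, pcr_positive = 10_000, 800          # hypothetical PCR volumes
antigen_tests, antigen_positive = 5_000, 150   # hypothetical antigen volumes

# Definition A: PCR results only
positivity_pcr_only = pcr_positive / pcr_tests

# Definition B: PCR and antigen results combined
positivity_combined = (pcr_positive + antigen_positive) / (pcr_tests + antigen_tests)

print(f"PCR only: {positivity_pcr_only:.1%}")   # 8.0%
print(f"Combined: {positivity_combined:.1%}")   # 6.3%
```

Same tests, same day, two different headline numbers. A trend line stitched together from states using different definitions isn't measuring one consistent thing.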

Did you notice how many times I used the word "standard" in the previous paragraph? That wasn't just a lack of a thesaurus. One of the tenets of good data management is data standardization. What is data standardization? A definition that I really like comes from the USGS, of all places. They define data standards as "the rules by which data are described and recorded. In order to share, exchange and understand data, we must standardize the format as well as the meaning." (https://www.usgs.gov/products/data-and-tools/data-management/data-standards). On my team, we incorporate data standards into our work frequently to ensure that datasets are uniform for analysis. This is especially important for data that is cumulative, i.e. coming into an organization in batches and not all at once. Imagine receiving test results one week with the antigens listed in one format, and in another format the next week. It would be almost impossible to merge the datasets for analysis, or to confirm that you had results for a particular antigen for all the specimens tested. So how can you ensure the data standards are met? In our work, we use Data Transfer Plans (DTPs) to keep the data standardized. The DTPs outline the input format for transfer from the lab to us, including formatting standards and other conventions such as file naming. When the data is transferred to us, we check it against the data format definitions in the DTP and ask the submitting labs to make corrections if something is submitted counter to the definition.
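Our real DTP checks live in internal tooling, but a minimal sketch of the idea, using hypothetical field names and rules rather than our actual data definitions, might look like this:

```python
import re

# Hypothetical data definition of the kind a DTP might specify.
# Each field maps to a validation rule; real DTPs are far more detailed.
FORMAT_RULES = {
    "specimen_id": lambda v: re.fullmatch(r"SP-\d{6}", v) is not None,
    "test_date":   lambda v: re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None,  # ISO 8601
    "antigen":     lambda v: v in {"SARS-CoV-2-N", "SARS-CoV-2-S"},              # controlled vocabulary
    "result":      lambda v: v in {"positive", "negative", "indeterminate"},
}

def check_record(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = []
    for field, rule in FORMAT_RULES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not rule(record[field]):
            problems.append(f"bad value for {field}: {record[field]!r}")
    return problems

# A submission with a date in the wrong format gets flagged and sent back
# to the lab for correction instead of being silently merged.
print(check_record({"specimen_id": "SP-000123", "test_date": "12/03/2020",
                    "antigen": "SARS-CoV-2-N", "result": "positive"}))
# ["bad value for test_date: '12/03/2020'"]
```

The point isn't the particular rules; it's that the checks are written down once, applied to every incoming batch, and failures go back to the submitter before the data enters the pipeline.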

While enacting DTPs for all the COVID reporting laboratories may not be feasible, it is still possible to publish guidelines to Departments of Health for collecting and reporting data like test positivity. While guidelines may not seem like a strong enough incentive or statement regarding uniform data collection, my experience has been that most people in the testing labs want the data to be useful and are willing and able to follow guidelines and recommendations. Such guidelines would go a long way toward making the data easily comparable across counties and states and ensuring that important demographic information is included in all reporting.

Another article, published in the Atlantic around Thanksgiving, points to the importance of having adequate data management in place (https://www.theatlantic.com/health/archive/2020/11/thanksgiving-makes-covid-19-data-weird/617226/). The article discusses the backlogs in getting test results out given the increase in testing and holiday/weekend staffing shortages. While the article emphasizes staffing shortages and other resourcing issues, one line stuck out to me: "Less than two-thirds of those lab reports flow automatically into the health department's electronic system, according to Mitts. Another 35 percent arrive in digital form but must be imported into the city's database, and the remainder arrive via fax." It stuck out because we have recently (and finally) completed the transition from fax-based data entry to direct, electronic data entry, and the gains in efficiency and traceability have been incredible. We've gotten feedback from the clinical sites that participant visits that used to take the better part of a day now only take a few hours. On the data management side, having all the data submitted electronically and present in a single database (or data warehouse, in our case) not only increases efficiency but also allows for easy queries against the data and full traceability. In the Atlantic's example, electronic data transfer would clearly improve the situation from a data flow perspective.
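As a small illustration of what those query and traceability gains look like, here is a sketch using SQLite as a stand-in for a real data warehouse (the schema and values are invented for the example):

```python
import sqlite3

# SQLite standing in for a real data warehouse; schema and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE lab_results (
    specimen_id TEXT, lab TEXT, result TEXT, received_at TEXT)""")
conn.executemany(
    "INSERT INTO lab_results VALUES (?, ?, ?, ?)",
    [("SP-000123", "Lab A", "positive", "2020-12-07T09:14:00"),
     ("SP-000124", "Lab B", "negative", "2020-12-07T11:02:00")])

# Because every record is stamped on arrival, "when did we receive
# Lab B's results?" is a one-line query rather than a hunt through fax piles.
for row in conn.execute(
        "SELECT specimen_id, received_at FROM lab_results WHERE lab = ?",
        ("Lab B",)):
    print(row)  # ('SP-000124', '2020-12-07T11:02:00')
```

None of this is possible for results that arrive as paper and have to be keyed in by hand.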

So what are the takeaways here?

  1. Basic data management tenets are critical for all data, and especially for high-volume, high-impact data.
  2. Tenet: data standardization. Standardization of the data formats and definitions allows for robust comparisons and analyses. This can be accomplished with relatively simple steps such as publishing guidelines, enacting transfer plans, and performing checks on incoming data to ensure compliance with the standard.
  3. Tenet: electronic data submission. While there are data security risks to consider with electronic data submission, there are many benefits as well. Electronic data submission allows for real-time data transfer, logging of data received, and automatic quality checks on incoming data (see the sketch after this list). It also removes a step of human error in the transfer process.
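
To tie these tenets together, here is a sketch of what an electronic submission endpoint might do on receipt of a batch: log the transfer, run automatic quality checks, and reject non-conforming files before they touch the database. The names, columns, and rules are illustrative, not a description of any real system.

```python
import csv
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest")

# Illustrative required columns; a real standard would specify much more.
REQUIRED_COLUMNS = {"specimen_id", "test_date", "antigen", "result"}

def ingest_batch(path: Path) -> bool:
    """Log receipt of a file, run automatic quality checks, and accept or reject it."""
    log.info("received %s", path.name)  # every transfer is logged on arrival
    with path.open(newline="") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            log.error("rejected %s: missing columns %s", path.name, sorted(missing))
            return False  # the submitting lab is asked to correct and resubmit
        row_count = sum(1 for _ in reader)
    log.info("accepted %s: %d rows pass structural checks", path.name, row_count)
    return True  # hand off to the load step

# Usage (hypothetical file): ingest_batch(Path("lab_a_2020-12-07.csv"))
```

Compare that with a fax: no automatic log of what arrived when, no machine-readable check of what's missing, and a human transcription step between the lab and the database.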

Hopefully, we will soon see the level of investment in data collection and management within the US public health system that is warranted to ensure accurate and uniform data. Meanwhile, do you know how your organization collects and manages data? I guarantee that it does: even if you're not a "traditional" data organization, you are gathering data, and it's important to think about how that happens.
