Advice from a data scientist: Careful while crosswalking

The volume of available data is growing exponentially, and most of it is more than the human mind can absorb. Ongoing efforts to tie together data siloed in different systems, departments or external databases require building “crosswalks” between data sets, for example when enriching an electronic health record with other relevant data or evaluating clinical outcomes based on the type of device implanted in patients. Crosswalking is particularly challenging when it is done between different code sets (e.g., ICD-9 to ICD-10 or vice versa) rather than between different versions of the same one.

In short, a crosswalk is a way to meaningfully unite disparate pieces of information in a common digital habitat. The basic process has existed for as long as anybody has tried to collect, group and align data in some way to make new discoveries. Broadly defined, this would include the way hand-drawn maps of cholera outbreaks in mid-19th century England traced the epidemic to contaminated drinking water. It would also include more recent, albeit unsuccessful, efforts by Google to “nowcast” flu epidemics based on people’s searches for flu-related information.

A certain level of subjectivity is unavoidable on the front end when gargantuan amounts of data have to be sorted into buckets with other equivalent, identical or similar information. Categorization enables data to be “mapped” from source data fields to their “assigned” target data fields, whenever possible following established standards—e.g., HL7, SNOMED CT, ANSI, and ICD-10. There are hundreds of standards available, including ones specific to medical, pharmacy, surgical and device-related data, which can be both complementary and overlapping. But when there is no clear, one-to-one match, not everyone is going to agree on what goes where. While perspectives tend to split at departmental lines, the crosswalk has to address the issues and needs of end users, which may not give everyone what they want and may require compromise. Some items in one data set have no direct translation in another data set, much as some American words have no French or German equivalent.
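The three situations described above (a clean match, several candidate matches, and no match at all) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CROSSWALK` table, the `SRC-`/`TGT-` codes and the function names are hypothetical placeholders, not entries from HL7, SNOMED CT, ICD-10 or any other standard.

```python
from dataclasses import dataclass

# Hypothetical crosswalk table: source code -> candidate target codes.
# The codes are placeholders, not real entries from any standard.
CROSSWALK = {
    "SRC-001": ["TGT-A01"],             # clean one-to-one match
    "SRC-002": ["TGT-B01", "TGT-B02"],  # one-to-many: someone has to choose
    "SRC-003": [],                      # no direct translation in the target set
}


@dataclass
class MappingResult:
    source: str
    targets: list
    status: str  # "exact", "ambiguous" or "unmapped"


def map_code(source, crosswalk=CROSSWALK):
    """Look up one source code and classify how cleanly it maps."""
    targets = crosswalk.get(source, [])
    if len(targets) == 1:
        status = "exact"
    elif len(targets) > 1:
        status = "ambiguous"
    else:
        status = "unmapped"
    return MappingResult(source, list(targets), status)


def review_queue(sources, crosswalk=CROSSWALK):
    """Collect every code that still needs a human decision."""
    results = (map_code(s, crosswalk) for s in sources)
    return [r for r in results if r.status != "exact"]
```

Keeping the ambiguous and unmapped codes in an explicit review queue, rather than silently picking the first candidate, is one way to make the unavoidable judgment calls visible to the people who have to agree on them.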

Health systems have been legislatively mandated to crosswalk from ICD-9 (up to five characters and primarily numeric) to ICD-10 (up to seven characters and alphanumeric) diagnosis and procedure code sets as a condition of reimbursement by the Centers for Medicare & Medicaid Services. As a practical matter, they need to assign the roughly 68,000 ICD-10 diagnosis codes to clinically meaningful groupings and then merge them with other information sources to facilitate clinical and financial benchmarking. Whether to look only at primary diagnosis and procedure, or also include other codified information, remains a subject of debate. Often, for purposes of predictive analytics and machine learning, data groupings ultimately depend on the business questions being asked, so that patterns can be accurately detected and used to drive incremental improvements.
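As a rough sketch of the grouping step, the Python below assigns diagnosis codes to groupings by the first letter of the code and counts cases per group. It is deliberately simplified: real groupers are far more granular, the two-entry table is only a stand-in, and the function names are assumptions made for illustration. It also looks only at a single (primary) diagnosis per case, which is one of the choices under debate.

```python
from collections import Counter

# Simplified, illustrative grouping keyed on the first letter of an
# ICD-10 diagnosis code. A production grouper would use far finer rules.
GROUP_BY_FIRST_LETTER = {
    "I": "Circulatory system",
    "J": "Respiratory system",
}


def assign_group(code):
    """Return the grouping for one diagnosis code, or 'Unassigned'."""
    code = code.strip().upper()
    if not code:
        return "Unassigned"
    return GROUP_BY_FIRST_LETTER.get(code[0], "Unassigned")


def case_volume_by_group(primary_diagnoses):
    """Count cases per grouping from a list of primary diagnosis codes."""
    return Counter(assign_group(code) for code in primary_diagnoses)
```

Routing anything unrecognized to an explicit "Unassigned" bucket keeps gaps in the grouping logic countable instead of hidden.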

Since crosswalked information is frequently used to make mission-critical decisions, including how to optimize reimbursement and where to make particular capital investments, health systems need to proceed with caution. Here are a few of my recommendations:

Establish policies and procedures. Some crosswalked data sets are more or less bulletproof, capable of surviving an upgrade to either of the two disparate systems being bridged because something sits between them and serves as a translator. Others can be a Pandora’s box, exposing organizations to migraines on future data and system conversions. If these files are to withstand the test of time and software upgrades, stakeholder teams need to document data lineage and dependencies. Healthcare data governance best practices include attestation of everything done and impacted by any sort of interoperability exercise. Whenever data is being merged or migrated, consider what may change or occur in the future that will require a roadmap back to how, why and when the original crosswalk was constructed.
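One lightweight way to keep that roadmap is to store a small lineage record next to each crosswalk file. The sketch below is an assumption about what such a record might hold, not a governance standard; the class name, field names and example values are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CrosswalkLineage:
    """Hypothetical record of how, why and when a crosswalk was built."""
    name: str
    source_system: str
    target_system: str
    created_on: str   # ISO date string, e.g. "2024-01-31"
    author: str
    rationale: str    # the "why": the business question being served
    depends_on: list = field(default_factory=list)  # upstream files or tables


def to_json(record):
    """Serialize the lineage record so it can be stored beside the crosswalk."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

A record like this costs little to write at build time and gives a future upgrade team somewhere to start when one of the bridged systems changes.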

Dogmatically manage your data. Do an audit monthly or quarterly to ensure the integrity of your crosswalk. Run queries as data gets processed and assigned to clinically meaningful groupings, and regularly monitor reports and dashboards for any unexplained deviations, such as a big jump in neurosurgeries on a case volume pie chart. An aberrant pattern could be the signpost of a poor crosswalk of data, so check your work lest you respond to a false positive signal.
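A periodic audit of this kind can be as simple as comparing case volumes per grouping against a prior period and flagging large swings. The following is a minimal sketch, assuming volumes are already available as dictionaries of counts; the function name and the 50 percent default threshold are arbitrary choices for illustration, not recommended values.

```python
def flag_deviations(baseline, current, threshold=0.5):
    """Flag groupings whose case volume moved by more than `threshold`.

    `baseline` and `current` map grouping name -> case count. The result
    maps each flagged grouping to its fractional change, or to None when
    the grouping had no baseline cases to compare against.
    """
    flagged = {}
    for group in sorted(set(baseline) | set(current)):
        before = baseline.get(group, 0)
        after = current.get(group, 0)
        if before == 0:
            # A grouping that appears from nothing is always worth a look.
            if after > 0:
                flagged[group] = None
            continue
        change = (after - before) / before
        if abs(change) > threshold:
            flagged[group] = change
    return flagged
```

For example, with invented counts, `flag_deviations({"Neurosurgery": 40, "Cardiology": 100}, {"Neurosurgery": 90, "Cardiology": 105})` flags only neurosurgery. A flag is a prompt to inspect the crosswalk, not proof that anything is wrong: the jump may be real.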

The proliferation of multiple data pools, such as GHX and FSEnet, should heighten demand for robust data management. The services effectively crowdsource decisions about how data gets defined and items get assigned to categories. Capturing the full value of these big-data sources will require individuals who understand the nuances of the data, how to comply with varying data pool requirements, and the tricky access and ownership issues that may arise.

Collaborate and communicate. Effective crosswalking requires teamwork among people, including analytics and IT staff as well as clinicians, who may not often work together and who view circumstances through different lenses. Refraining from making assumptions and getting on the same page, including on how to troubleshoot outliers and exceptions, minimizes the risk of derailing the insights needed to survive in a patient-centered, prevention-focused, value-based healthcare world.

Don’t expect perfection. Accept and acknowledge the limitations of the crosswalking process. Empirical researchers do this all the time, conceding the limitations of their studies because a certain population of patients couldn’t be identified or there was variation in a model that they couldn’t adequately explain. Certain data elements cannot be ideally reconciled. Data mapping is often an imprecise science, so be flexible if you hope to reach practical conclusions.

Ed Hickey is assistant vice president of clinical data and analytics at HealthTrust, and his expertise includes quality and outcomes measurement, program evaluation and benchmarking. He was previously vice president of data science with naviHealth, where he was focused on predictive modeling. Earlier, Hickey was director of research and analytics at Vanderbilt University Medical Center in Nashville. He holds master’s degrees from the University of Michigan and Vanderbilt University’s Owen Graduate School of Business.

The views, opinions and positions expressed within these guest posts are those of the author alone and do not represent those of Becker’s Hospital Review/Becker’s Healthcare. The accuracy, completeness and validity of any statements made within this article are not guaranteed. We accept no liability for any errors, omissions or representations. The copyright of this content belongs to the author and any liability with regards to infringement of intellectual property rights remains with them.
