Why health system AI predictions can fail

Health systems are exploring a wide range of applications for artificial intelligence and generative AI, but the success or failure of those projects depends on much more than the technology.

"One of the things with AI that's really nuanced is you have to ensure that the dataset used to train the model is representative of the population that model is intended to serve," said Jared Antzcak, chief digital officer of Sanford Health in Sioux Falls, S.D., during a keynote session at the Becker's 14th Annual Meeting on April 10. "In many cases, having the internal capabilities to develop these models or test and validate these models is absolutely essential to ensure it's representing our population, especially in the upper rural Midwest."

An AI model developed in an urban center may not translate well to a rural population; its accuracy and precision can degrade when applied to a demographic it wasn't trained on. Mr. Antczak and his team need the ability to test, train and validate AI models before integrating them into clinical workflows.
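
In practice, that validation can be as simple as scoring a local holdout set and comparing the model's discrimination subgroup by subgroup rather than only in aggregate. A minimal sketch in Python, where the trained classifier and the "region" and "label" column names are illustrative assumptions, not anything Sanford has described:

```python
# A minimal sketch of subgroup validation: score a trained model on a local
# holdout set and compare discrimination (AUROC) across demographic subgroups.
# The classifier and the "region"/"label" column names are hypothetical.
import pandas as pd
from sklearn.metrics import roc_auc_score

def auroc_by_subgroup(model, holdout: pd.DataFrame, group_col: str = "region"):
    """Return per-subgroup AUROC so degradation on, say, rural patients is visible."""
    features = holdout.drop(columns=["label", group_col])
    scored = holdout.assign(score=model.predict_proba(features)[:, 1])
    results = {}
    for group, rows in scored.groupby(group_col):
        if rows["label"].nunique() < 2:  # AUROC is undefined with a single class
            continue
        results[group] = roc_auc_score(rows["label"], rows["score"])
    return results
```

A subgroup whose AUROC trails the aggregate figure by a meaningful margin is exactly the kind of degradation Mr. Antczak describes.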

Sanford has built a sophisticated data science team and, over the years, has developed AI applications for multiple use cases from internal data. The system has also moved into predictive analytics and is expanding the scope of its governance process to cover other types of AI, including generative and image-based AI, as well as information sourced externally through partners.

The system recently developed an AI model to balance supply and demand for colonoscopies after the U.S. Preventive Services Task Force released updated guidelines recommending colorectal cancer screening begin at age 45 instead of 50.

"That significantly changed our ability to meet the demand within our system with a limited number of GI docs and colonoscopy slots available," said Mr. Antczak. "We developed a tool that helps screen our population based on risk, and instead of having first come first serve like we traditionally always have, we're able to risk stratify our population and offer to those who are at lower risk other screening methods, like stool-based samples that are FDA approved to meet that need."

Higher-risk patients then get access to colonoscopy first, so the system can provide the most efficient and effective care possible to the population.
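
Stripped to its essentials, the triage logic described above is a sort by predicted risk followed by a capacity cutoff. A minimal sketch, where the risk scores, the slot count and the data structure are all hypothetical:

```python
# A minimal sketch of risk-stratified triage: fill limited colonoscopy slots
# with the highest-risk patients and route the rest to an alternative screen.
# The Patient type, risk scores and slot count are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Patient:
    patient_id: str
    risk_score: float  # output of the risk model; higher means higher risk

def triage(patients: list[Patient], colonoscopy_slots: int):
    """Rank by predicted risk, fill available slots, offer the rest a stool test."""
    ranked = sorted(patients, key=lambda p: p.risk_score, reverse=True)
    colonoscopy = ranked[:colonoscopy_slots]   # highest risk get the procedure
    stool_based = ranked[colonoscopy_slots:]   # offered FDA-approved stool-based test
    return colonoscopy, stool_based
```

The design choice is the cutoff: available slots go to the top of the risk ranking, and everyone below it is offered a stool-based test instead of waiting in a first-come, first-served queue.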

Training AI on population-specific data is important, and health systems should then focus on workflows, said Michael Pfeffer, MD, CIO of Stanford (Calif.) Health Care and Stanford School of Medicine.

"When we think about traditional AI, which is more than predictive, we're going to predict that this patient is at a higher risk for cancer, and whatever the clinical workflows and operations workflows are behind it are as important if not more important than the actual prediction itself," said Dr. Pfeffer. "It's understanding the overall utility and usefulness of the prediction, because if you don't have enough GI doctors to do the screening, then the prediction fails because bias is introduced. If you can't actually use the prediction in the right way, people are going to make decisions around the prediction that you wouldn't have predicted."

Health systems generally don't yet know how to monitor generative AI for possible workflow bias, Dr. Pfeffer said. Asking the model the same question repeatedly can elicit different responses each time.
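
One rudimentary way to quantify that variability is to replay the same prompt many times and measure how similar the responses are to one another. A minimal sketch, where `generate` is a placeholder for whatever model call a system actually makes and simple token overlap stands in for a real similarity metric:

```python
# A minimal sketch of drift monitoring for generative AI: ask the same question
# n times and compute the mean pairwise similarity of the answers.
# `generate` is a hypothetical callable; token overlap is a stand-in metric.
from itertools import combinations

def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased tokens; 1.0 means identical vocab."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def consistency_score(generate, prompt: str, n: int = 10) -> float:
    """Mean pairwise similarity of n responses; lower means more drift."""
    responses = [generate(prompt) for _ in range(n)]
    pairs = list(combinations(responses, 2))
    return sum(token_jaccard(a, b) for a, b in pairs) / len(pairs)
```

Tracking a score like this over time would at least surface when a model's answers to a fixed clinical question start to scatter.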

"We have a lot of learning to do on the generative AI side and how we're going to monitor and go forward with this," Dr. Pfeffer said. His team at Stanford recently published findings in the Journal of Internal Medicine from a study on using AI to generate draft responses to patient questions. The researchers thought it would save time, but it didn't.

Was the pilot a failure?

"When we looked at burnout scores pre and post, the burnout scores went down," said Dr. Pfeffer. "We thought a lot about this because we didn't expect the main measurement not to actually come out positive, but what was interesting is it reduced cognitive burden. It presented the clinician with a starting point, and that was actually really helpful, even though the potential measure we were looking to see improve didn't. If we had not measured burnout scores, we would have thought this was a failure."

Health systems need a robust team with diverse clinical, IT, ethics and strategy backgrounds to set the parameters for AI and build for the future.

"Learning what to measure in terms of success in deploying these kinds of things is really important, and I think it's going to be a lot of fun to dive into this," said Dr. Pfeffer. "You've got to get all the different stakeholders in the room…this is not an IT problem. This is an enterprise opportunity to really get it right."
