While healthcare buzzes with talk of EHR tools that can forecast a patient’s mortality or disease progression, the industry hasn’t reached its “MedGPT moment,” according to Stanford (Calif.) University researchers.
Current generative AI models trained on EHR data are roughly where ChatGPT was between GPT-2 and GPT-3 (the AI chatbot is now at GPT-5.2), the experts wrote in a Jan. 7 Nature Medicine commentary. The applications learn patterns from historical data to “generate plausible patient timelines — sequences of diagnoses, procedures, medication codes, lab values, and their timing,” said the authors, including Nigam Shah, MD, PhD, chief data scientist of Palo Alto, Calif.-based Stanford Health Care.
“If 60 out of 100 simulated timelines show a readmission, the model reports a 60% risk,” they wrote. “However, these frequencies are derived from simulated patterns, not necessarily real-world probabilities. Treating a simulation as an ‘oracle’ prediction can lead to unsafe clinical decisions, such as overtreating low-risk patients or missing high-risk ones.”
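The arithmetic behind that 60% figure can be illustrated with a minimal Python sketch. It is not the researchers' code: the function name `simulated_risk`, the event labels, and the made-up timelines are all hypothetical, and a real model would sample each timeline rather than repeat a fixed list.

```python
from typing import Sequence


def simulated_risk(timelines: Sequence[Sequence[str]], event: str) -> float:
    """Return the fraction of simulated timelines that contain `event`.

    Hypothetical helper: each timeline is just a list of event labels.
    """
    if not timelines:
        raise ValueError("need at least one simulated timeline")
    hits = sum(1 for timeline in timelines if event in timeline)
    return hits / len(timelines)


# Made-up data: 100 simulated timelines, 60 of which include a readmission.
timelines = (
    [["discharge", "readmission"]] * 60
    + [["discharge", "follow_up"]] * 40
)

risk = simulated_risk(timelines, "readmission")
print(f"Simulated readmission risk: {risk:.0%}")
```

The number reflects only how often the event shows up in the model's own simulations, which is why the authors caution against reading it as a real-world probability.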
To reach their potential, these tools should be evaluated against five criteria: performance by frequency, calibration, timeline completion, shortcut audits, and out-of-distribution validation, the researchers wrote. Better calibration, for example, would ensure a “30% predicted risk actually corresponds to 30% of patients experiencing that outcome.”
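One common way to check calibration, sketched here in Python as an assumption rather than as the method described in the commentary, is to group patients by predicted risk and compare the average prediction in each group with the share of patients who actually had the outcome. The function name `calibration_by_bin`, the bin count, and the sample data are hypothetical.

```python
from typing import List, Sequence, Tuple


def calibration_by_bin(
    predicted: Sequence[float],
    observed: Sequence[int],
    n_bins: int = 10,
) -> List[Tuple[float, float, int]]:
    """Compare mean predicted risk with the observed event rate per bin.

    Returns one (mean_predicted, observed_rate, count) tuple for each
    non-empty bin. `observed` holds 1 if the outcome occurred, else 0.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")

    # Per bin: [sum of predictions, number of events, number of patients]
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for p, y in zip(predicted, observed):
        idx = min(int(p * n_bins), n_bins - 1)  # a risk of 1.0 goes in the top bin
        bins[idx][0] += p
        bins[idx][1] += y
        bins[idx][2] += 1

    return [
        (pred_sum / count, events / count, count)
        for pred_sum, events, count in bins
        if count > 0
    ]


# Made-up data: 10 patients each given a 30% predicted risk, 3 with the outcome.
rows = calibration_by_bin([0.3] * 10, [1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
for mean_pred, obs_rate, count in rows:
    print(f"predicted {mean_pred:.2f}  observed {obs_rate:.2f}  n={count}")
```

In a well-calibrated model the two columns track each other closely in every bin; large gaps indicate the predicted risks overstate or understate what patients actually experience.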