The AI Adverse Event Problem Nobody Is Talking About

The guide wire felt wrong. My hands, trained over two decades and thousands of procedures, sensed a resistance the navigation screen denied. The system’s three-dimensional reconstruction, a confident render of the patient’s sinus anatomy, showed a clear path. My proprioception, however, suggested I was about to perforate bone. I paused, withdrew the instrument, and re-evaluated with conventional fluoroscopy. The screen had been mistaken. My hands had been right. This time, it was a near miss. For the next patient, with another surgeon, it could be a cerebrospinal fluid leak or a punctured skull base.

This is the new reality of the artificially intelligent operating theatre. As a cardiac surgeon who also validates these algorithmic tools, I am not an AI sceptic. I am an AI realist, and I am profoundly concerned. We are adopting powerful surgical AI without acknowledging its unique and poorly understood failure modes. The adverse events are already occurring, yet the systems we have for detecting and learning from them are not fit for purpose.

The scale of our ambition has outpaced the rigour of our science. Regulatory bodies are struggling to keep up. The number of US Food and Drug Administration authorised AI and machine learning devices now exceeds 700, yet the evidence base underpinning this revolution is perilously thin. A 2023 analysis in Nature Medicine revealed that fewer than 10% of these devices were supported by any form of clinical trial data, let alone randomised controlled trials. Most are cleared through pathways, such as the 510(k) process, that require only “substantial equivalence” to a previous device. This framework, designed for incremental updates to physical hardware, is dangerously unsuited to the novel complexities of adaptive software, creating a veneer of safety that is not substantiated by high-quality evidence.

The consequences are being recorded, albeit quietly. Consider the Johnson & Johnson Acclarent TruDi navigation system. Investigative journalism by Reuters, while not peer-reviewed, has highlighted a pattern of malfunctions and injuries that appears to be corroborated by trends in the FDA’s own post-market surveillance data. This is not an isolated case. A recent investigation in JAMA Health Forum found that of 60 FDA-authorised AI devices involved in 182 recalls, 43% of those recalls occurred within the first year of approval. This suggests a systemic failure of pre-market assessment. These are not abstract statistics. For the patient on the table, they represent an unquantified risk of an algorithm making a mistake, a subtle miscalculation that no human would make.

The current regulatory framework was not designed for this challenge. In the US, the FDA is developing a “Total Product Lifecycle” approach to manage algorithms that learn and evolve. In the UK, the MHRA’s ‘Software and AI as a Medical Device Change Programme’ is a step in the right direction. Its focus, however, remains on pre-market conformity, not post-deployment performance drift. This leaves a critical gap. CQC inspectors, for instance, currently lack the tools to assess the digital safety of a hospital, even as these AI systems become integral to care. The European Union’s landmark AI Act provides a glimpse of a more robust future, classifying surgical navigation aids as high-risk systems that will require far more stringent evidence for CE marking.

The debate cannot remain focused on “black box” algorithms or the distant threat of surgeon skill degradation. The danger is more immediate. It is the quiet miscalibration, the subtle dataset drift, the algorithmic assumption that does not hold true for a patient whose anatomy lies three standard deviations from the mean. When an AI-driven error occurs, who is responsible? The surgeon who trusted the screen? The hospital that procured the system? The manufacturer whose liability is often shielded by opaque user agreements?

We would not accept this evidentiary deficit for a new pharmaceutical or a prosthetic heart valve. The world of aviation, however, offers a powerful model for moving forward. For decades, pilots have used confidential, non-punitive near-miss reporting systems like the UK’s CHIRP or the US’s ASRS. These programmes succeeded because they fostered a cultural shift, moving the focus from individual blame to system-level analysis. They have created a vast, shared repository of safety intelligence, allowing the entire industry to learn from one crew’s close call. Medicine has no equivalent for AI-driven surgical errors. We are all flying blind, with each near miss remaining a private lesson rather than a public warning.

The solution is not to abandon innovation but to embrace a more sophisticated model of safety engineering. Part of the answer lies in comprehensive in silico verification and validation. Leading research centres are now building digital twins of patients and pathologies. These are not mere simulations but high-fidelity virtual patients, constructed from real clinical data, upon which an AI navigation system can be tested against millions of anatomical variations and surgical scenarios. These virtual proving grounds can probe for rare edge cases, induce instrument stress, and validate performance against a breadth of demographic diversity that a conventional clinical trial could never achieve. This is not a supplement to a trial. It is a rigorous, exhaustive pre-market evaluation designed to find and fix failures before an algorithm ever enters a human operating theatre.

The path forward requires three urgent actions. First, we must demand that our regulators, from the MHRA to the FDA, treat robust in silico validation not as an option but as a moral prerequisite for approving any high-risk navigational or autonomous AI. Second, we must build a mandatory, confidential adverse event registry for AI, an equivalent to aviation’s near-miss systems, where honesty is protected and every error becomes a lesson for all, not a liability for one. Third, this new ecosystem of oversight cannot be delegated. It must be surgeon-led. We who bear the final responsibility for a patient’s life must be the ones who define the standards and dissect the failures.

The promise of AI in surgery is immense, offering a future of greater precision, improved access, and better outcomes. But this promise is conditional. It depends entirely on our willingness to confront the truth that our current approach to safety is failing. We are navigating a new and complex landscape with powerful tools. It is time we admitted that without a better map, our most advanced instruments may be leading us astray.

Zain Khalpey MD PhD FACS | Director, Khalpey AI Lab | Senior AI Editor, CTSNet | HonorHealth, Scottsdale, Arizona

Competing interests: None declared.

