AI’s accuracy fluctuates on radiology exam: 4 study takeaways

OpenAI’s GPT-4 large language model showed the highest accuracy when tasked with answering diagnostic radiology questions, according to a study published Nov. 20 in the European Journal of Radiology.

Researchers at the University of Texas at Austin evaluated the performance of large language models, a type of generative AI, on the American College of Radiology’s Diagnostic In-Training exam between November 2023 and January 2024. The four models included in the study were GPT-4, GPT-3.5, Claude and Google Bard.

Here are four study takeaways:

  1. Questions were distributed across radiology disciplines but did not include image-based questions.
  2. GPT-4 had the highest accuracy at 78% with a 4.1% fluctuation, followed by Google Bard at 73% with a 2.9% fluctuation, Claude at 71% with a 1.5% fluctuation and GPT-3.5 at 63% with a 6.9% fluctuation.
  3. During the study period, GPT-4’s accuracy decreased from 82% to 74%, while Claude’s accuracy increased from 70% to 73%.
  4. The findings “[underscore] the need for continuous, radiology-specific standardized benchmarking metrics to gauge large language model reliability before clinical use,” the authors of the study wrote. 
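
The study pairs each model’s accuracy with a “fluctuation” figure, but this article does not spell out how fluctuation was computed. Below is a minimal sketch, assuming fluctuation refers to the spread of a model’s accuracy across repeated administrations of the same question set; the run_accuracies values are illustrative placeholders, not the study’s data.

```python
import statistics

# Hypothetical per-run accuracy scores (fraction of questions answered
# correctly) for one model across repeated exam sessions. Illustrative
# numbers only; these are not the study's measurements.
run_accuracies = [0.82, 0.78, 0.76, 0.74]

mean_accuracy = statistics.mean(run_accuracies)

# One plausible "fluctuation" measure: the range between the best and
# worst runs. Standard deviation across runs is another common choice.
fluctuation_range = max(run_accuracies) - min(run_accuracies)
fluctuation_std = statistics.stdev(run_accuracies)

print(f"mean accuracy: {mean_accuracy:.1%}")
print(f"fluctuation (range): {fluctuation_range:.1%}")
print(f"fluctuation (std dev): {fluctuation_std:.1%}")
```

Whichever definition the authors used, tracking a spread statistic alongside mean accuracy is what lets a benchmark flag models whose performance drifts over time, as GPT-4’s did during the study period.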