AI's accuracy fluctuates on radiology exam: 4 study takeaways

OpenAI's GPT-4 large language model showed the highest accuracy when tasked with answering diagnostic radiology questions, according to a study published Nov. 20 in the European Journal of Radiology.

Researchers at the University of Texas at Austin evaluated the performance of large language models, a type of generative AI, on the American College of Radiology's Diagnostic In-Training exam between November 2023 and January 2024. The four models included in the study were GPT-4, GPT-3.5, Claude and Google Bard.

Here are four study takeaways:

  1. Questions were distributed across radiology disciplines but did not include image-based questions.

  2. GPT-4 had the highest accuracy at 78% with a 4.1% fluctuation, followed by Google Bard at 73% with a 2.9% fluctuation, Claude at 71% with a 1.5% fluctuation and GPT-3.5 at 63% with a 6.9% fluctuation (a sketch of how such metrics can be computed follows this list).

  3. During the study period, GPT-4's accuracy decreased from 82% to 74%, while Claude's accuracy increased from 70% to 73%.

  4. The findings "[underscore] the need for continuous, radiology-specific standardized benchmarking metrics to gauge large language model reliability before clinical use," the authors of the study wrote. 
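For readers curious how accuracy and fluctuation figures like these are produced, below is a minimal Python sketch of exam-style benchmarking. Two assumptions are baked in: `ask_model` is a hypothetical wrapper around a chat API (not from the study), and "fluctuation" is taken here as the range of accuracies across repeated runs; the study's authors may define it differently.

```python
# Minimal sketch of exam-style LLM benchmarking (illustrative only).
# Assumptions: `ask_model(prompt)` is a hypothetical function that returns
# the model's letter choice (e.g. "B"); "fluctuation" is treated as the
# range (max - min) of accuracies across repeated runs of the same exam.

from statistics import mean

def score_run(questions: list[dict], ask_model) -> float:
    """Accuracy for one run: fraction of multiple-choice answers matched."""
    correct = 0
    for q in questions:
        answer = ask_model(q["prompt"])           # model's letter choice
        if answer.strip().upper() == q["key"]:    # compare to the answer key
            correct += 1
    return correct / len(questions)

def benchmark(questions: list[dict], ask_model, runs: int = 3) -> dict:
    """Repeat the exam several times to expose run-to-run variability."""
    accuracies = [score_run(questions, ask_model) for _ in range(runs)]
    return {
        "mean_accuracy": mean(accuracies),
        "fluctuation": max(accuracies) - min(accuracies),  # spread across runs
    }
```

Repeating the same question set over several sessions, as this sketch does, is one way a benchmark can surface the month-to-month swings the study observed.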
