OpenAI's GPT-4 large language model was the most accurate at answering diagnostic radiology questions, according to a study published Nov. 20 in the European Journal of Radiology.
Researchers at the University of Texas at Austin evaluated the performance of large language models, a type of generative AI, on the American College of Radiology's Diagnostic Radiology In-Training exam between November 2023 and January 2024. The four models included in the study were GPT-4, GPT-3.5, Claude and Google Bard.
Here are four study takeaways:
- Questions were distributed across radiology disciplines but did not include image-based questions.
- GPT-4 had the highest accuracy at 78% (±4.1% fluctuation), followed by Google Bard at 73% (±2.9%), Claude at 71% (±1.5%) and GPT-3.5 at 63% (±6.9%).
- During the study period, GPT-4's accuracy decreased from 82% to 74%, while Claude's accuracy increased from 70% to 73%.
- The findings "[underscore] the need for continuous, radiology-specific standardized benchmarking metrics to gauge large language model reliability before clinical use," the authors of the study wrote.