OpenAI, the company behind ChatGPT, has introduced a new evaluation framework to assess how artificial intelligence systems perform in healthcare settings.
Here are six things to know about the new tool, called HealthBench:
- HealthBench was developed with input from 262 physicians to ensure it reflects real medical needs.
- The tool evaluates AI performance in realistic healthcare scenarios built around priorities those physicians identified.
- HealthBench features 5,000 simulated medical conversations, according to a May 12 news release.
- Each AI-generated response is scored against a physician-written rubric that evaluates accuracy, clarity and helpfulness.
- The conversations are categorized into seven themes, such as emergency care, dealing with uncertainty, and global health. Each theme includes its own rubric for grading.
- GPT-4.1, one of OpenAI’s models, assists with grading by assessing how well each response meets the rubric criteria and producing overall scores (a simplified sketch of this model-graded approach appears at the end of this article).
The tool also compares AI responses with those written by physicians to gauge how closely the AI aligns with expert judgment. OpenAI said HealthBench’s grading closely matches how real physicians would assess the same responses.
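To make the grading mechanics concrete, here is a minimal Python sketch of model-graded rubric scoring, the general approach described above. The rubric criteria, point weights, prompt wording and the `grade_response` helper are illustrative assumptions, not OpenAI's actual HealthBench implementation; in HealthBench, the physician-written rubrics are specific to each conversation rather than fixed.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical rubric: criteria with point weights. Real HealthBench rubrics
# are physician-written and tailored to each simulated conversation.
RUBRIC = [
    {"criterion": "Urges the user to seek emergency care for red-flag symptoms", "points": 5},
    {"criterion": "Acknowledges uncertainty instead of giving a definitive diagnosis", "points": 3},
    {"criterion": "Explains next steps in plain, non-technical language", "points": 2},
]

def grade_response(conversation: str, response: str) -> float:
    """Ask a grader model whether the response meets each rubric criterion,
    then return the fraction of available points earned."""
    earned = 0
    for item in RUBRIC:
        judgment = client.chat.completions.create(
            model="gpt-4.1",  # the grader model named in the release
            messages=[{
                "role": "user",
                "content": (
                    "You are grading an AI medical assistant's reply.\n\n"
                    f"Conversation:\n{conversation}\n\n"
                    f"Reply being graded:\n{response}\n\n"
                    f"Criterion: {item['criterion']}\n"
                    "Answer with exactly one word: MET or UNMET."
                ),
            }],
        )
        verdict = judgment.choices[0].message.content.strip().upper()
        if verdict.startswith("MET"):
            earned += item["points"]
    return earned / sum(item["points"] for item in RUBRIC)

# Example: grade one simulated emergency-care exchange.
score = grade_response(
    "User: I have crushing chest pain radiating down my left arm.",
    "Call emergency services (911) right now; these symptoms can signal a heart attack.",
)
print(f"Rubric score: {score:.0%}")
```

Scoring each criterion separately, rather than asking for a single holistic grade, is what lets a framework like this report where a model falls short (say, on emergency referrals) and not just how it scores overall.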