Stanford scholars reviewed 80 different clinical foundation models, built from a variety of healthcare datasets, and found the following limitations:
- General-purpose language models such as ChatGPT can be useful for clinical tasks, but they underperform compared with language models trained specifically on clinical data.
- The best-performing clinical language models are often trained on large-scale private EHR datasets.
- There is no common mechanism for distributing foundation models for EHRs to the broader research community.
- Many foundation models for EHRs do not have their model weights published, meaning researchers must re-train these models from scratch to use them or to validate their performance (see the sketch after this list).
- The researchers called for better evaluation paradigms, datasets, and shared tasks to determine the value of clinical foundation models to health systems.
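To make the weight-availability point concrete, here is a minimal sketch of what validation looks like when weights are published. It assumes a Hugging Face-style model hub and uses a hypothetical model identifier; the review itself does not name a specific distribution mechanism.

```python
# Minimal sketch of why published weights matter (hypothetical model ID).
# When weights are released, loading a clinical foundation model for
# validation takes a few lines with the `transformers` library:
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "example-org/clinical-foundation-model"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Encode a synthetic, non-identifiable clinical note and inspect the output.
inputs = tokenizer("Pt presents with chest pain, r/o MI.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

# When weights are NOT released, none of the above is possible: a team must
# first assemble a comparable EHR corpus (often private) and re-train from
# scratch before any validation can begin.
```

The contrast is the point: publishing weights turns validation into a download-and-run exercise, while withholding them puts replication out of reach for most groups, since the private EHR data behind these models cannot be reassembled elsewhere.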