Overview
- The Science study, published Thursday, found OpenAI’s o1-preview outperformed physician baselines on multiple clinical reasoning tests, including early emergency triage at 67.1% accuracy versus 55.3% and 50.0% for two attending physicians.
- Across New England Journal of Medicine case benchmarks, the model included the correct diagnosis in 78.3% of its differentials and ranked the right answer first 52% of the time.
- On a validated reasoning rubric from the NEJM Healer curriculum, it earned perfect scores in 78 of 80 cases, far above GPT-4 and practicing doctors.
- All evaluations were text-only and centered on emergency and internal medicine, so the results do not generalize to imaging, audio, or other specialties.
- Study authors and outside editorialists called for prospective trials and stronger governance to address risks like hallucinations, brittle reasoning under uncertainty, and unclear accountability.