Overview
- The Science study, published Thursday, found OpenAI’s o1-preview outperformed physician baselines on multiple clinical reasoning tests, including early emergency triage at 67.1% accuracy versus 55.3% and 50.0% for two attending physicians.
- Across New England Journal of Medicine case benchmarks, the model included the correct diagnosis in 78.3% of its differentials and ranked the right answer first 52% of the time.
- On a validated reasoning rubric from the NEJM Healer curriculum, it earned perfect scores in 78 of 80 cases, far above GPT-4 and practicing doctors.
- All evaluations were text-only and centered on emergency and internal medicine, so the results do not generalize to imaging, audio, or other specialties.
- Study authors and outside editorialists called for prospective trials and stronger governance to address risks like hallucinations, brittle reasoning under uncertainty, and unclear accountability.