Particle.news

AI Reasoning Model Tops Doctors on Text-Based Diagnosis in Science Study

Experts say it needs rigorous trials before bedside use.

Overview

  • The Science study, published Thursday, found OpenAI’s o1-preview beat physician baselines on multiple clinical reasoning tests, including early emergency triage at 67.1% accuracy versus 55.3% and 50.0% for two attendings.
  • Across New England Journal of Medicine case benchmarks, the model included the correct diagnosis in 78.3% of its differentials and ranked it first 52% of the time.
  • On a validated reasoning rubric from the NEJM Healer curriculum, it earned perfect scores in 78 of 80 cases, far above GPT-4 and practicing doctors.
  • All evaluations were text-only and centered on emergency and internal medicine, so the results do not extend to imaging, audio, or other specialties.
  • Study authors and outside editorialists called for prospective trials and stronger governance to address risks like hallucinations, brittle reasoning under uncertainty, and unclear accountability.