Overview
- The peer-reviewed paper in Science, released Thursday, reports that OpenAI’s o1-preview beat physician baselines across six clinical reasoning tests.
- In Boston emergency room triage cases, the model reached 67.1% exact or near-exact diagnostic accuracy on 76 charts, topping two attending physicians (55.3% and 50.0%), and blinded reviewers could not distinguish AI from human write-ups.
- On challenging New England Journal of Medicine cases, the system included the correct diagnosis in 78.3% of its differentials, outperforming prior models such as GPT-4, which it beat 88.6% to 72.9% on a key vignette set.
- All evaluations were text-only: the model did not read imaging, hear heart or lung sounds, or see patient distress, all routine cues in real care.
- Study authors and outside experts said large language models can hallucinate and jump to conclusions under uncertainty, and they urged randomized trials, human oversight, and ongoing safety and equity checks before deployment.