Overview
- Using real emergency-room records, the o1-preview model produced an exact or near-exact diagnosis in about 67% of cases, versus roughly 50–55% for two attending physicians.
- The head-to-head tests covered multiple ER stages, including triage, first doctor interaction, and admission, with blinded attending physicians grading the outputs.
- OpenAI’s o1-preview also beat earlier systems on curated New England Journal of Medicine case sets and surpassed GPT-4 and GPT-4o in several reasoning tasks.
- Study authors and outside experts cautioned that the evaluation was limited to written electronic health records and that the model can recommend unnecessary tests, so physicians must stay in the loop.
- The team called for prospective clinical trials and tightly supervised pilots to see how AI can safely support bedside decisions in real workflows.