Overview
- A peer-reviewed Science paper tested OpenAI’s o1-preview on real emergency room records and found that it produced the exact diagnosis, or a close match, about 67% of the time.
- Two attending physicians given the same text reached roughly 50% to 55%, and blinded reviewers scored all outputs without knowing whether each came from a physician or the AI.
- The evaluation used records from 76 Beth Israel patients, presented at three stages (triage, first physician contact, and hospital admission) to reflect messy, unedited clinical notes.
- Researchers stressed that the study used text alone and warned that the model can suggest unnecessary tests or state incorrect details with confidence.
- The team plans prospective clinical trials and is pushing for accountability rules as more clinicians try AI tools to support diagnosis.