Overview
- Two research teams published related results late this month: a Penn State Diagnose‑a‑thon found that consumer LLMs gave valid answers roughly 76% of the time, with error rates above 20%, and a Binghamton team released a reproducible protocol that cut hallucinations in controlled tests.
- Penn State researchers ran a crowdsourced competition using 212 real and imagined prompts with ChatGPT‑4o, ChatGPT‑3.5, Gemini‑1.5 Pro and Llama3‑8b and had board‑certified physicians rate answers for accuracy and harm.
- The Penn State study found that performance varied by specialty: obstetrics/gynecology and otolaryngology scored best, while internal medicine, neurology and dermatology scored worst. It also found that fine‑tuning on medical texts did not reliably improve clinical appropriateness.
- Binghamton’s STAR Protocols workflow required seven models to use retrieval‑augmented generation (RAG) against authoritative medical sources, then applied a vote across the seven models; the team reported no unmatched terms or hallucinations across about 10,000 experiments.
- Both teams stress these are research‑stage findings: LLMs may augment trained clinicians if their outputs are verified, but current error rates, specialty gaps and the need for broader validation mean chatbots are not yet safe for unvetted patient‑facing diagnosis.
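
The vote-and-check step described for the Binghamton workflow can be illustrated with a minimal sketch. This is not the team's published code: the function names, the simple-majority threshold of four out of seven, the lowercase normalization, and the sample answers and vocabulary are all assumptions made for illustration, and the retrieval step is omitted.

```python
from collections import Counter


def normalize(term):
    """Lowercase and trim a term so trivially different spellings compare equal."""
    return term.strip().lower()


def majority_vote(answers, min_votes=4):
    """Return the most common answer if at least min_votes models agree, else None.

    min_votes=4 is an assumed simple majority of seven models.
    """
    if not answers:
        return None
    term, count = Counter(normalize(a) for a in answers).most_common(1)[0]
    return term if count >= min_votes else None


def unmatched_terms(answers, vocabulary):
    """List answers that do not appear in the reference vocabulary (assumed check)."""
    vocab = {normalize(v) for v in vocabulary}
    return sorted({normalize(a) for a in answers} - vocab)


# Hypothetical outputs from seven models for one prompt.
answers = [
    "Migraine", "migraine", "Migraine ", "tension headache",
    "migraine", "cluster headache", "migraine",
]
# Hypothetical reference vocabulary standing in for an authoritative source.
vocabulary = ["migraine", "tension headache"]

consensus = majority_vote(answers)
flagged = unmatched_terms(answers, vocabulary)
```

Here five of the seven answers normalize to the same term, so a consensus is returned, while the one answer missing from the vocabulary is flagged for review rather than silently accepted.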