Particle.news

New Studies Find Chatbots Give Mostly Correct Health Answers but Still Make Harmful Errors

Researchers report that large language models answer about three‑quarters of everyday medical queries correctly, while a protocol that combines retrieval with multi‑model voting sharply reduces fabricated claims.

Overview

  • Two research teams published related results late this month: a Penn State Diagnose‑a‑thon found that consumer LLMs gave roughly 76% valid answers but had error rates above 20%, and a Binghamton team released a reproducible protocol that cut hallucinations in controlled tests.
  • Penn State researchers ran a crowdsourced competition using 212 real and imagined prompts with ChatGPT‑4o, ChatGPT‑3.5, Gemini‑1.5 Pro and Llama3‑8b and had board‑certified physicians rate answers for accuracy and harm.
  • The Penn State study found that performance varied by specialty, with obstetrics/gynecology and otolaryngology scoring best and internal medicine, neurology and dermatology scoring worst, and that fine‑tuning on medical texts did not reliably improve clinical appropriateness.
  • Binghamton’s STAR Protocols workflow required seven models to use retrieval‑augmented generation (RAG) against authoritative medical sources and then applied a seven‑model vote; the team reported no unmatched terms or hallucinations across about 10,000 experiments.
  • Both teams stress these are research‑stage findings: LLMs may augment trained clinicians if their outputs are verified, but current error rates, specialty gaps and the need for broader validation mean chatbots are not yet safe for unvetted patient‑facing diagnosis.
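
The voting step in the Binghamton workflow can be illustrated with a minimal sketch. The summary above does not give the team's actual voting rule, agreement threshold or term‑matching method, so everything below is an assumption made for illustration: the function name `vote_on_term`, the simple majority threshold of four out of seven, the exact‑match check against a set of retrieved terms, and the sample data are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def vote_on_term(
    answers: Iterable[str],
    allowed_terms: Iterable[str],
    min_agree: int = 4,  # assumed threshold: simple majority of seven models
) -> Optional[str]:
    """Return the term most models agree on, or None if no term qualifies.

    An answer counts only if it matches a term found in the retrieved
    sources (hypothetical stand-in for the "no unmatched terms" check).
    """
    allowed = {t.strip().lower() for t in allowed_terms}
    normalized = [a.strip().lower() for a in answers]

    # Discard answers that do not appear in the retrieved vocabulary.
    grounded = [a for a in normalized if a in allowed]
    if not grounded:
        return None

    # Accept the top answer only if enough models produced it.
    term, count = Counter(grounded).most_common(1)[0]
    return term if count >= min_agree else None


# Hypothetical outputs from seven models for a single query.
outputs = [
    "Hypertension", "hypertension", "hypertension ", "Hypotension",
    "hypertension", "hypertension", "Hypertension",
]
# Hypothetical terms extracted from the retrieved sources.
retrieved = {"hypertension", "hypotension"}

print(vote_on_term(outputs, retrieved))
```

In this sketch an answer that no retrieved source contains is never accepted, however many models repeat it, and a split vote yields no answer rather than a guess.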