New Studies Find Medical Chatbots Fail Patients and Repeat False Claims

Researchers urge real‑world trials with safeguards before broader patient use.

Overview

  • A randomized Nature Medicine trial of 1,298 UK participants found that using GPT‑4o, Llama 3 or Command R+ did not help people make better health decisions than traditional resources such as internet search or the NHS website.
  • When the models were fed the full clinical scenarios directly, they identified the relevant conditions in about 94.9% of cases; with real users, however, the systems identified relevant conditions in under 34.5% of cases and recommended the correct course of action in under 44.2%.
  • Researchers documented inconsistent and sometimes unsafe guidance, including cases where similar brain‑bleed symptoms received opposite advice and chatbots supplied erroneous details, such as partial US phone numbers or the Australian emergency number.
  • A Lancet Digital Health analysis testing 20 models on more than a million prompts found that LLMs accepted fabricated medical claims roughly 32% of the time, rising to about 46–47% when errors were embedded in hospital discharge notes and dropping to about 9% for social‑media‑style content.
  • Authoritative framing increased susceptibility to misinformation, with GPT models among the least prone and some other systems accepting false claims in up to around 63–64% of cases, prompting calls for evidence‑checking guardrails, stress tests and regulatory scrutiny.