Overview
- A randomized Nature Medicine trial of 1,298 UK participants found that using GPT‑4o, Llama 3 or Command R+ did not help people make better health decisions than traditional resources such as internet search or the NHS website.
- When the models were fed full clinical scenarios, they identified the relevant conditions in about 94.9% of cases, yet with real users the systems identified relevant conditions in under 34.5% of interactions and recommended the correct course of action in under 44.2%.
- Researchers documented inconsistent and sometimes unsafe guidance, including cases where similar brain‑bleed symptoms received opposite advice and chatbots supplied erroneous details such as partial US phone numbers or the Australian emergency number.
- A Lancet Digital Health analysis testing 20 models on more than a million prompts found LLMs accepted fabricated medical claims roughly 32% of the time, rising to about 46–47% when errors were embedded in hospital discharge notes and dropping to about 9% for social‑media‑style content.
- Authoritative framing increased susceptibility to misinformation; GPT models were among the least prone, while some other systems accepted false claims in up to around 63–64% of cases, prompting calls for evidence‑checking guardrails, stress tests and regulatory scrutiny.