Overview
- A randomized Nature Medicine trial of 1,298 UK participants found that using GPT‑4o, Llama 3 or Command R+ did not help people make better health decisions than traditional resources such as internet search or the NHS website.
- When the models were fed full clinical scenarios, they identified the relevant conditions in about 94.9% of cases, yet with real users the systems identified relevant conditions in under 34.5% of interactions and recommended the correct course of action in under 44.2%.
- Researchers documented inconsistent and sometimes unsafe guidance, including cases where similar brain‑bleed symptoms received opposite advice and chatbots supplied erroneous details such as partial US phone numbers or the Australian emergency number.
- A Lancet Digital Health analysis testing 20 models on more than a million prompts found LLMs accepted fabricated medical claims roughly 32% of the time, rising to about 46–47% when errors were embedded in hospital discharge notes and dropping to about 9% for social‑media‑style content.
- Authoritative framing increased susceptibility to misinformation; GPT models were among the least prone, while some other systems accepted false claims in up to around 63–64% of cases, prompting calls for evidence‑checking guardrails, stress tests and regulatory scrutiny.