
Studies Find AI Chatbots Don’t Improve Patient Decisions and Can Echo False Medical Claims

Researchers urge real‑world trials and safeguards before using general‑purpose models for direct patient care.

Overview

  • A randomized Nature Medicine trial of 1,298 UK participants found that using GPT‑4o, Llama 3, or Command R+ did not help people identify conditions or choose safer next steps any better than an internet search or their usual resources.
  • Without human users, the models identified relevant conditions in about 94.9% of test cases and chose correct actions 56.3% of the time; with real users, relevant conditions were identified in fewer than 34.5% of cases and correct actions chosen in fewer than 44.2%.
  • Researchers documented dangerous inconsistencies, including two near‑identical descriptions of a subarachnoid hemorrhage receiving opposite guidance, with one user told to rest in a dark room and the other urged to seek emergency care.
  • A separate Lancet Digital Health study from Mount Sinai found that LLMs accepted fabricated medical claims roughly 32% of the time overall, rising to about 46–47% when the falsehoods appeared in hospital‑style discharge notes and falling to about 9% for social‑media‑style posts.
  • Susceptibility varied widely across models: GPT‑based systems were among the least likely to accept false claims, while some models agreed with up to about 63.6% of them, leading the authors to call for evidence‑grounding checks, stress tests using clinical notes, and regulatory caution.