Overview
- An international study published Tuesday in Radiology tested 17 radiologists on 264 X-rays, half real and half generated by ChatGPT and the RoentGen diffusion model.
- Before being told that fakes were present, only 41% of readers suspected any images were synthetic; after disclosure, their mean accuracy reached about 75%, with individual scores ranging from 58% to 92%.
- Top multimodal models also struggled: reported detection accuracy ranged roughly from 57% to 85% on GPT-4o–generated images and from 52% to 89% on RoentGen chest images, with GPT-4o the strongest detector but still missing fakes.
- Detection performance did not track with years of experience, though musculoskeletal specialists fared better, and the team flagged cues such as overly smooth bones, unnaturally straight spines, and uniform vessel patterns to help spot forgeries.
- The authors released a curated deepfake dataset with quizzes and called for invisible watermarks and cryptographic signatures applied at image capture, warning that fabricated images threaten diagnoses, medical records, legal evidence, and hospital cybersecurity, and that synthetic CT and MRI are likely next.
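To illustrate the capture-time signing the authors call for, here is a minimal sketch of tamper evidence for raw image bytes. It uses a keyed MAC (HMAC-SHA256) with a hypothetical shared device key purely for illustration; a real deployment would use an asymmetric signature scheme (e.g., Ed25519) with the tag stored in the image's DICOM metadata, and all names here are assumptions, not any vendor's API.

```python
import hashlib
import hmac

# Hypothetical per-device key, for illustration only; a real system
# would keep a private signing key in the capture device's secure store.
DEVICE_KEY = b"example-device-key"

def sign_at_capture(image_bytes: bytes) -> str:
    """Return a hex tag computed over the raw pixel bytes at capture time."""
    return hmac.new(DEVICE_KEY, image_bytes, hashlib.sha256).hexdigest()

def verify(image_bytes: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = sign_at_capture(image_bytes)
    return hmac.compare_digest(expected, tag)

# Any modification to the bytes, even one flipped byte, invalidates the tag.
original = b"\x00\x01\x02" * 1000        # stand-in for pixel data
tag = sign_at_capture(original)
tampered = b"\xff" + original[1:]
print(verify(original, tag))   # True
print(verify(tampered, tag))   # False
```

The point of the sketch is that verification binds the displayed image to the acquiring device: a synthetic image, however realistic, carries no valid tag.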