Study Finds Viral ‘Junk’ Data Causes Lasting Decline in Open-Source Language Models

A multi-university preprint ties sustained training on clickbait posts to persistent performance declines.

Overview

  • Researchers from Texas A&M, the University of Texas at Austin, and Purdue retrained Llama and Qwen variants on a curated set of viral X posts and observed dose‑dependent drops in reasoning, long‑context handling, and safety, along with increases in proxy measures of narcissism and psychopathy.
  • The team documented a failure mode they call “thought‑skipping,” in which models truncate or omit steps in their reasoning chains; this pattern accounted for much of the error growth.
  • Attempts to recover by retraining on higher‑quality data only partially restored capabilities, suggesting persistent representational drift.
  • Reported benchmark scores fell sharply after exposure to the low‑quality datasets, with reasoning accuracy dropping from 74.9% to 57.2% and long‑context performance falling from 84.4% to 52.3%.
  • The findings appear in a preprint that has not yet been peer reviewed and were derived from open‑source models rather than closed systems such as ChatGPT; they have prompted calls for stricter data curation and provenance tracking.