
UC San Diego Puts Top AI Models Through Dungeons & Dragons, Finds Strengths—and Slippage Over Time

The rules-driven simulator exposes persistent weaknesses in long-horizon planning and memory.

Overview

  • Researchers ran language models as both players and monsters inside a tool-grounded D&D engine that enforced game rules and map constraints to curb hallucinations (a minimal sketch of such a tool layer follows this list).
  • Models were evaluated on six metrics and compared against more than 2,000 human players in combat-only scenarios.
  • All systems showed progressive degradation over longer sessions, with rising errors in tracking health, positions, and status effects.
  • Claude 3.5 Haiku proved the most tactically reliable, GPT-4o blended vivid narration with meta tactics, and DeepSeek-V3 favored short action beats and repeated taunts.
  • Smaller open-source models performed inconsistently.
  • The study was presented at NeurIPS 2025 and posted on OpenReview; the team plans to extend the tests to full campaigns.
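The team's engine is not public, so the following is only a rough sketch of what "tool-grounded" means in this setting: the model proposes actions, and a rules layer validates each proposal against authoritative game state before applying it, so an illegal move comes back as a tool error instead of slipping into the narration. All names here (GridMap, Combatant, validate_move) are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of a rules-enforcing tool layer for LLM-driven D&D combat.
# The model never mutates state directly; it only emits proposals, and the
# validator decides whether each one is legal on the current map.

from dataclasses import dataclass

@dataclass
class Combatant:
    name: str
    hp: int
    pos: tuple[int, int]
    speed: int = 6  # maximum squares moved per turn (assumed rule)

class GridMap:
    def __init__(self, width: int, height: int, walls: set[tuple[int, int]]):
        self.width, self.height, self.walls = width, height, walls

    def passable(self, pos: tuple[int, int]) -> bool:
        x, y = pos
        return 0 <= x < self.width and 0 <= y < self.height and pos not in self.walls

def validate_move(actor: Combatant, target: tuple[int, int], grid: GridMap) -> str | None:
    """Return an error string if the proposed move breaks the rules, else None."""
    if not grid.passable(target):
        return f"{target} is out of bounds or blocked"
    dist = abs(target[0] - actor.pos[0]) + abs(target[1] - actor.pos[1])
    if dist > actor.speed:
        return f"move of {dist} squares exceeds speed {actor.speed}"
    return None

def apply_action(actor: Combatant, proposed: dict, grid: GridMap) -> str:
    """Apply a model-proposed action only if the rules layer accepts it."""
    if proposed.get("type") == "move":
        error = validate_move(actor, tuple(proposed["to"]), grid)
        if error:
            # Rejected proposals go back to the model as tool errors rather
            # than being narrated into existence.
            return f"REJECTED: {error}"
        actor.pos = tuple(proposed["to"])
        return f"{actor.name} moves to {actor.pos}"
    return "REJECTED: unknown action type"

if __name__ == "__main__":
    grid = GridMap(10, 10, walls={(3, 3)})
    goblin = Combatant("Goblin", hp=7, pos=(2, 2))
    # A plausible-sounding but illegal proposal, as a model might emit:
    print(apply_action(goblin, {"type": "move", "to": [9, 9]}, grid))  # too far
    print(apply_action(goblin, {"type": "move", "to": [4, 2]}, grid))  # legal
```

Because only the validator's accepted results re-enter the game state, fabricated positions or hit points cannot accumulate in the transcript, which is the mechanism the summary credits with reducing hallucinations.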