Overview
- Researchers ran language models as both players and monsters inside a tool-grounded D&D engine that enforced game rules and tracked battle maps to reduce hallucinations (a minimal code sketch of this kind of setup follows the list).
- Models were evaluated on six metrics and compared against more than 2,000 human players in combat-only scenarios.
- All systems showed progressive degradation over longer sessions, with rising errors in tracking health, positions, and status effects.
- Claude 3.5 Haiku proved the most tactically reliable, GPT-4o blended vivid narration with meta tactics, and DeepSeek-V3 favored short action beats and repeated taunts.
- Smaller open-source models performed inconsistently. The study was presented at NeurIPS 2025 and is posted on OpenReview, and the team plans to extend testing from combat-only scenarios to full campaigns.
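
To make the first bullet concrete, here is a minimal Python sketch of what a tool-grounded combat engine can look like: the engine, not the model, owns hit points, grid positions, and status effects, validates every move and attack, and hands back an authoritative state string each turn. All names here (`CombatEngine`, `Combatant`, `move`, `attack`) are illustrative assumptions, not the study's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Combatant:
    # Engine-owned state: the model never edits these fields directly.
    name: str
    hp: int
    position: tuple[int, int]
    statuses: set[str] = field(default_factory=set)


class CombatEngine:
    """Hypothetical tool layer: every model action becomes a validated call."""

    def __init__(self, combatants, grid_size=(10, 10)):
        self.combatants = {c.name: c for c in combatants}
        self.grid_size = grid_size

    def move(self, name: str, to: tuple[int, int], speed: int = 6) -> str:
        c = self.combatants[name]
        # Reject moves that leave the map or exceed the creature's speed,
        # instead of trusting the model's free-form narration of movement.
        if not (0 <= to[0] < self.grid_size[0] and 0 <= to[1] < self.grid_size[1]):
            return f"Invalid move: {to} is off the map."
        if max(abs(to[0] - c.position[0]), abs(to[1] - c.position[1])) > speed:
            return f"Invalid move: {to} exceeds speed {speed}."
        c.position = to
        return f"{name} moves to {to}."

    def attack(self, attacker: str, target: str, damage: int) -> str:
        t = self.combatants[target]
        t.hp = max(0, t.hp - damage)
        if t.hp == 0:
            t.statuses.add("unconscious")
        return f"{attacker} hits {target} for {damage}; {target} is at {t.hp} HP."

    def state(self) -> str:
        # Authoritative state string fed back to the model every turn, so
        # health/position/status tracking never relies on model memory.
        return "; ".join(
            f"{c.name}: {c.hp} HP at {c.position}, statuses={sorted(c.statuses)}"
            for c in self.combatants.values()
        )


# Example turn: the model's chosen actions arrive as tool calls.
engine = CombatEngine([Combatant("Fighter", 28, (0, 0)), Combatant("Goblin", 7, (3, 3))])
print(engine.move("Fighter", (2, 2)))
print(engine.attack("Fighter", "Goblin", 8))
print(engine.state())
```

Under this kind of design, the state-tracking errors the study reports surface as rejected or contradictory tool calls rather than silently corrupted narration, which is what makes them measurable over long sessions.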