Overview
- ARC Prize Foundation released ARC-AGI-3 this week, and public results show humans solving the environments reliably while leading models stay under 1%: Gemini 3.1 Pro at 0.37%, GPT-5.4 at 0.26%, Claude Opus 4.6 at 0.25%, and Grok-4.20 at 0%.
- ARC-AGI-3 drops agents into simple, game-like worlds with no instructions: they must explore, infer the goals and rules, form a plan, and adapt across steps, targeting the fluid generalization that training-data recall cannot provide.
- The benchmark scores Relative Human Action Efficiency, which rewards solving tasks in roughly as few actions as skilled humans and heavily penalizes wandering and guesswork, so slow trial-and-error is a losing strategy (see the efficiency sketch after this list).
- To blunt overfitting and brute-force memorization, most environments are private and runs are replayable, and the toolkit offers a standard API and UI so researchers can run transparent evaluations against their own agents (see the agent-loop sketch after this list).
- A methodological dispute erupted after a Duke-built custom harness drove Claude from 0.25% to 97.1% on one variant (TR87), but ARC maintainers said the input format is not the limiting factor and kept official scores tied to the standard setup (see the rendering sketch after this list).
- Coverage also diverged on prize terms: reports cited either a $1 million purse or $2 million across Kaggle tracks with open-sourcing requirements, underscoring the confusion even as the foundation pushes for reproducible, public solutions.
- Models' weak showing is already being read as a reality check on sweeping AGI claims and is expected to push labs toward agents that build simple world models and learn continuously, with the aim of closing the gap in 2026.
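
How a metric like Relative Human Action Efficiency could work can be made concrete with a minimal sketch. The per-task formula below (a clamped ratio of human to agent action counts, zero for unsolved tasks) and the names `action_efficiency` and `benchmark_score` are assumptions for illustration, not the foundation's published definition.

```python
def action_efficiency(human_actions: int, agent_actions: int, solved: bool) -> float:
    """Hypothetical per-task efficiency: 1.0 when the agent matches the
    skilled-human action count, decaying toward 0 as it takes more
    actions, and 0 for unsolved tasks, so wandering and guesswork are
    penalized directly."""
    if not solved or agent_actions <= 0:
        return 0.0
    # Clamp at 1.0 so beating the human baseline is not over-rewarded.
    return min(1.0, human_actions / agent_actions)


def benchmark_score(task_results: list[tuple[int, int, bool]]) -> float:
    """Mean per-task efficiency, reported as a percentage."""
    if not task_results:
        return 0.0
    return 100 * sum(action_efficiency(h, a, s) for h, a, s in task_results) / len(task_results)
```

Under this reading, an agent that solves almost nothing, or solves tasks only after hundreds of exploratory actions, lands near 0%, consistent with the sub-1% scores above.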
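
The standard evaluation loop has a similarly simple shape. The sketch below shows an agent driving an environment through a reset/step interface; the `ToyEnv` class, its observation fields, and the action names are hypothetical stand-ins, not the actual ARC-AGI-3 toolkit API.

```python
import random


class ToyEnv:
    """Stand-in environment with a reset/step interface; the real
    toolkit's API surface is not reproduced here."""

    def __init__(self, goal: int = 5):
        self._pos = 0
        self._goal = goal

    def reset(self) -> dict:
        self._pos = 0
        return {"pos": self._pos, "solved": False}

    def step(self, action: str) -> dict:
        # Only movement changes state; "interact" is a no-op here.
        self._pos += {"right": 1, "left": -1}.get(action, 0)
        return {"pos": self._pos, "solved": self._pos >= self._goal}


def run_episode(env: ToyEnv, max_actions: int = 200) -> int:
    """Observe, act, and stop on success; returns the number of actions
    taken, which is exactly what an action-efficiency metric scores."""
    obs = env.reset()
    for taken in range(1, max_actions + 1):
        # A real agent would infer goals and rules from obs; random
        # choice stands in for that exploration policy.
        action = random.choice(["left", "right", "interact"])
        obs = env.step(action)
        if obs["solved"]:
            return taken
    return max_actions


if __name__ == "__main__":
    print(run_episode(ToyEnv()))
```

Because every action counts against the score, a random explorer like this one illustrates why trial-and-error agents fare poorly under the metric.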
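
On the harness dispute: what a custom harness typically changes is how raw observations are rendered before they reach the model. The snippet below is purely illustrative of that idea; the grid encoding, symbol palette, and prompt wording are invented here and bear no relation to the Duke harness or ARC's standard format.

```python
# Illustrative palette; not the real observation encoding.
SYMBOLS = {0: ".", 1: "#", 2: "@"}


def render_grid(grid: list[list[int]]) -> str:
    """Re-render a raw integer grid as compact text, the kind of
    observation reformatting a custom harness might apply before
    prompting a model."""
    return "\n".join("".join(SYMBOLS.get(cell, "?") for cell in row) for row in grid)


raw_obs = [[0, 0, 1], [0, 2, 1], [0, 0, 0]]
prompt = f"Current world state:\n{render_grid(raw_obs)}\nChoose an action."
print(prompt)
```

ARC's stated position is that this kind of input formatting is not the limiting factor, which is why the official scores stay tied to the standard setup.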