Overview
- ARC Prize Foundation released ARC-AGI-3 this week, and public results show humans solving the environments reliably while leading models stay under 1%: Gemini 3.1 Pro at 0.37%, GPT-5.4 at 0.26%, Claude Opus 4.6 at 0.25%, and Grok-4.20 at 0%.
- ARC-AGI-3 drops agents into simple, game-like worlds with no instructions: they must explore, infer the goals and rules, form a plan, and adapt across steps, targeting the fluid generalization that training-data recall cannot provide.
- The benchmark scores Relative Human Action Efficiency, which rewards solving tasks in roughly as few actions as skilled humans and heavily penalizes wandering and guesswork, so slow trial-and-error is a losing strategy (see the efficiency sketch after this list).
- To blunt overfitting and brute-force memorization, most environments are private and runs are replayable, and the toolkit offers a standard API and UI so researchers can run transparent evaluations against their own agents (see the agent-loop sketch after this list).
- A methodological dispute erupted after a Duke-built custom harness drove Claude from 0.25% to 97.1% on one variant (TR87), but ARC maintainers said the input format is not the limiting factor and kept official scores tied to the standard setup (see the rendering sketch after this list).
- Coverage also diverged on prize terms: reports cited either a $1 million purse or $2 million across Kaggle tracks with open-sourcing requirements, underscoring the confusion even as the foundation pushes for reproducible, public solutions.
- Models' weak showing is already being read as a reality check on sweeping AGI claims and is expected to push labs toward agents that build simple world models and learn continuously, with the aim of closing the gap in 2026.
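
How a metric like Relative Human Action Efficiency could work can be made concrete with a minimal sketch. The per-task formula below (a clamped ratio of human to agent action counts, zero for unsolved tasks) and the names `action_efficiency` and `benchmark_score` are assumptions for illustration, not the foundation's published definition.

```python
def action_efficiency(human_actions: int, agent_actions: int, solved: bool) -> float:
    """Hypothetical per-task efficiency: 1.0 when the agent matches the
    skilled-human action count, decaying toward 0 as it takes more
    actions, and 0 for unsolved tasks, so wandering and guesswork are
    penalized directly."""
    if not solved or agent_actions <= 0:
        return 0.0
    # Clamp at 1.0 so beating the human baseline is not over-rewarded.
    return min(1.0, human_actions / agent_actions)


def benchmark_score(task_results: list[tuple[int, int, bool]]) -> float:
    """Mean per-task efficiency, reported as a percentage."""
    if not task_results:
        return 0.0
    return 100 * sum(action_efficiency(h, a, s) for h, a, s in task_results) / len(task_results)
```

Under this reading, an agent that solves almost nothing, or solves tasks only after hundreds of exploratory actions, lands near 0%, consistent with the sub-1% scores above.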
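
The standard evaluation loop has a similarly simple shape. The sketch below shows an agent driving an environment through a reset/step interface; the `ToyEnv` class, its observation fields, and the action names are hypothetical stand-ins, not the actual ARC-AGI-3 toolkit API.

```python
import random


class ToyEnv:
    """Stand-in environment with a reset/step interface; the real
    toolkit's API surface is not reproduced here."""

    def __init__(self, goal: int = 5):
        self._pos = 0
        self._goal = goal

    def reset(self) -> dict:
        self._pos = 0
        return {"pos": self._pos, "solved": False}

    def step(self, action: str) -> dict:
        # Only movement changes state; "interact" is a no-op here.
        self._pos += {"right": 1, "left": -1}.get(action, 0)
        return {"pos": self._pos, "solved": self._pos >= self._goal}


def run_episode(env: ToyEnv, max_actions: int = 200) -> int:
    """Observe, act, and stop on success; returns the number of actions
    taken, which is exactly what an action-efficiency metric scores."""
    obs = env.reset()
    for taken in range(1, max_actions + 1):
        # A real agent would infer goals and rules from obs; random
        # choice stands in for that exploration policy.
        action = random.choice(["left", "right", "interact"])
        obs = env.step(action)
        if obs["solved"]:
            return taken
    return max_actions


if __name__ == "__main__":
    print(run_episode(ToyEnv()))
```

Because every action counts against the score, a random explorer like this one illustrates why trial-and-error agents fare poorly under the metric.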
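
On the harness dispute: what a custom harness typically changes is how raw observations are rendered before they reach the model. The snippet below is purely illustrative of that idea; the grid encoding, symbol palette, and prompt wording are invented here and bear no relation to the Duke harness or ARC's standard format.

```python
# Illustrative palette; not the real observation encoding.
SYMBOLS = {0: ".", 1: "#", 2: "@"}


def render_grid(grid: list[list[int]]) -> str:
    """Re-render a raw integer grid as compact text, the kind of
    observation reformatting a custom harness might apply before
    prompting a model."""
    return "\n".join("".join(SYMBOLS.get(cell, "?") for cell in row) for row in grid)


raw_obs = [[0, 0, 1], [0, 2, 1], [0, 0, 0]]
prompt = f"Current world state:\n{render_grid(raw_obs)}\nChoose an action."
print(prompt)
```

ARC's stated position is that this kind of input formatting is not the limiting factor, which is why the official scores stay tied to the standard setup.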