Overview
- Google published the initial leaderboard, which places Gemini 3.1 Pro Preview first at about 72.4%, ahead of Claude Opus 4.6 (66.6%) and GPT‑5.2 Codex (62.5%).
- The first release evaluates models without external tools; success rates range from 16.1% (Gemini 2.5 Flash) to roughly 72%.
- Tasks are drawn from real Android issues and pull requests in public GitHub projects and are validated by unit or instrumentation tests for practical correctness.
- Coverage includes Android‑specific areas such as Jetpack Compose, Coroutines and Flows, Room, Hilt, navigation migrations, Gradle configurations, and SDK breaking‑change handling.
- Google open‑sourced the methodology, dataset, and test harness with contamination controls and external validation, and it plans to expand the task set in future releases.
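
To make the percentages above concrete, here is a minimal sketch of how a per-model success rate could be derived from test-validated tasks. It assumes that a task counts as solved only when its validating tests pass and that the score is the share of solved tasks. The names `TaskResult` and `success_rate` and the sample data are hypothetical illustrations, not part of Google's published harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one benchmark task (hypothetical structure)."""
    task_id: str        # illustrative identifier, not a real task name
    tests_passed: bool  # True only if the task's validating tests all passed


def success_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks whose validating tests passed."""
    if not results:
        raise ValueError("no task results to score")
    solved = sum(1 for r in results if r.tests_passed)
    return 100.0 * solved / len(results)


# Made-up outcomes for illustration only; these are not benchmark data.
results = [
    TaskResult("task-a", True),
    TaskResult("task-b", False),
    TaskResult("task-c", True),
    TaskResult("task-d", False),
]
print(f"{success_rate(results):.1f}%")  # prints "50.0%"
```

Under this assumption, a score of 72.4% would mean that the model's changes passed the validating tests on roughly 72 of every 100 tasks.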