Particle News: Google Launches Android Bench to Rank AI Models on Real Android Coding Tasks

Overview

Google's initial leaderboard places Gemini 3.1 Pro Preview first at about 72.4%, ahead of Claude Opus 4.6 (66.6%) and GPT‑5.2 Codex (62.5%).
The first release measures models without external tools; success rates range from 16.1% for Gemini 2.5 Flash at the bottom to about 72% at the top.
Tasks are drawn from real Android issues and pull requests in public GitHub projects and are validated by unit or instrumentation tests for practical correctness.
Coverage includes Android‑specific areas such as Jetpack Compose, Coroutines and Flows, Room, Hilt, navigation migrations, Gradle configurations, and SDK breaking‑change handling.
Google open‑sourced the methodology, dataset, and test harness with contamination controls and external validation, and it plans to expand the task set in future releases.
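
For readers unfamiliar with how a test-validated success rate is typically derived, the following is a minimal sketch, assuming each task is scored pass/fail by its tests and the reported figure is the share of tasks passed. The names `TaskResult` and `success_rate` and the sample task IDs are hypothetical illustrations, not part of Google's harness, and the benchmark's actual scoring may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str        # hypothetical identifier, not taken from the benchmark
    tests_passed: bool  # True if the task's unit/instrumentation tests passed


def success_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks whose tests passed, to one decimal place."""
    if not results:
        raise ValueError("no results to score")
    passed = sum(1 for r in results if r.tests_passed)
    return round(100 * passed / len(results), 1)


# Made-up outcomes for illustration only; these are not benchmark data.
sample = [
    TaskResult("compose-migration", True),
    TaskResult("room-schema-fix", False),
    TaskResult("gradle-config", True),
    TaskResult("flow-collection", True),
]
print(success_rate(sample))  # 75.0
```

Under this assumption, a model listed at 72.4% would have produced changes that passed the validating tests on roughly 72 of every 100 tasks.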