Particle.news

Google Launches Android Bench to Rank AI Models on Real Android Coding Tasks

The public benchmark scores models on transparent, GitHub-sourced tasks to help developers choose effective AI coding assistance.

Overview

  • Google published the initial leaderboard listing Gemini 3.1 Pro Preview at about 72.4%, ahead of Claude Opus 4.6 (66.6%) and GPT‑5.2 Codex (62.5%).
  • The first release measures models without external tools, with success rates spanning roughly 16% to 72% and a low of 16.1% for Gemini 2.5 Flash.
  • Tasks are drawn from real Android issues and pull requests in public GitHub projects and are validated by unit or instrumentation tests for practical correctness.
  • Coverage includes Android‑specific areas such as Jetpack Compose, Coroutines and Flows, Room, Hilt, navigation migrations, Gradle configurations, and SDK breaking‑change handling.
  • Google open‑sourced the methodology, dataset, and test harness with contamination controls and external validation, and plans to expand the task set in future releases.
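The leaderboard percentages above are pass rates over test-validated tasks: a model's attempt at a task counts as a success only if the task's unit or instrumentation tests pass afterward. A minimal sketch of that kind of scoring is below; all names (`TaskResult`, `success_rate`, the task IDs) are illustrative assumptions, not Android Bench's actual open-sourced harness API.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical structure)."""
    task_id: str
    passed: bool  # did the validating unit/instrumentation tests pass?

def success_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks whose validating tests passed."""
    if not results:
        return 0.0
    return 100.0 * sum(r.passed for r in results) / len(results)

# Example with made-up task IDs: 3 of 4 tasks solved -> 75.0
demo = [
    TaskResult("compose-1", True),
    TaskResult("room-2", True),
    TaskResult("hilt-3", False),
    TaskResult("gradle-4", True),
]
print(success_rate(demo))  # 75.0
```

A leaderboard figure like 72.4% is then just this rate computed over the full task set for one model.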