Overview
- Oumi, analyzing 4,326 queries for The New York Times using the SimpleQA benchmark, found Google’s AI Overviews were 85% accurate with Gemini 2 and 91% with Gemini 3.
- Extrapolating the implied error rates (9–15%) to roughly five trillion searches per year suggests tens of millions of wrong answers every hour.
- Google called the methodology flawed, pointing to issues in the OpenAI-built benchmark and noting that its ranking and safety systems screen out low-quality sources.
- Separate reporting cites a Google internal test that found Gemini 3 outputs were incorrect 28% of the time, though Google says AI Overviews perform better than the raw model.
- Oumi also reported weak grounding in the citations: 37% of Gemini 2 responses and 56% of Gemini 3 responses contained unsubstantiated links, with frequent references to Facebook and Reddit.