Particle News: Study Finds Top AI Models Disagree on Two‑Thirds of Real‑World Fact Checks

Overview

Lenz Research tested GPT‑5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, and Sonar Pro on 1,000 user‑submitted fact‑check claims and found disagreement on 672 items, or 67 percent.
The study measured inter‑model reliability with Krippendorff’s alpha at 0.639, below the 0.667 floor conventionally cited for even tentative conclusions and well below the 0.80 level that social scientists generally treat as dependable agreement.
Roughly half of the splits were substantive: in about 343 of the 672 disputed claims, models differed by two or more verdict categories on the four‑bucket rubric (True, Mostly True, Misleading, False).
The models reached unanimous agreement on only 328 claims and produced almost no unanimous middle‑ground verdicts, showing strong convergence at the extremes but fracture on ambiguous cases.
The findings warn journalists, fact‑checkers, platform engineers, and investors that single‑model outputs are not reliable ground truth and that human review or multi‑model validation will be needed for high‑risk uses.
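
For readers who want to see how figures like these are derived, the following is a minimal Python sketch, not the study's own code. It rests on several assumptions: the rubric is treated as an ordered scale from True to False, "differing by two or more categories" is read as the gap between the most and least favorable verdict on a claim, and alpha is computed in its nominal form with no missing ratings (the article does not say which variant Lenz Research used). The names `verdict_spread`, `summarize`, and `krippendorff_alpha_nominal` are hypothetical, and `toy_claims` is invented illustration data.

```python
from collections import Counter
from itertools import permutations

# Assumed ordering of the four-bucket rubric, most true to least true.
RUBRIC = ["True", "Mostly True", "Misleading", "False"]
RANK = {verdict: i for i, verdict in enumerate(RUBRIC)}


def verdict_spread(verdicts):
    """Gap, in rubric steps, between the most and least favorable verdict."""
    ranks = [RANK[v] for v in verdicts]
    return max(ranks) - min(ranks)


def summarize(claims):
    """Count unanimous, disputed, and substantively disputed claims."""
    spreads = [verdict_spread(v) for v in claims]
    disagreed = sum(s >= 1 for s in spreads)
    return {
        "total": len(spreads),
        "unanimous": len(spreads) - disagreed,
        "disagreed": disagreed,
        # Assumption: "substantive" means a spread of two or more categories.
        "substantive": sum(s >= 2 for s in spreads),
    }


def krippendorff_alpha_nominal(claims):
    """Krippendorff's alpha for nominal labels (each claim needs >= 2 ratings)."""
    # Coincidence matrix: each ordered pair of ratings on the same claim
    # contributes 1 / (m - 1), where m is the number of ratings on that claim.
    coincidences = Counter()
    for labels in claims:
        m = len(labels)
        if m < 2:
            continue  # a claim with one rating cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # only one label ever used: alpha is undefined
    return 1 - observed / expected


# Invented example: four claims, five model verdicts each.
toy_claims = [
    ["True", "True", "True", "True", "True"],
    ["False", "False", "False", "False", "False"],
    ["True", "Mostly True", "True", "True", "Mostly True"],
    ["Mostly True", "Misleading", "False", "True", "Misleading"],
]
print(summarize(toy_claims))
print(round(krippendorff_alpha_nominal(toy_claims), 3))
```

An ordinal variant of alpha would penalize a True versus False split more heavily than a True versus Mostly True split, so it could yield a different number from the nominal version sketched here.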