Overview
- Lenz Research tested GPT‑5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, and Sonar Pro on 1,000 user‑submitted fact‑check claims and found that the models disagreed on 672 of them (67.2 percent).
- The study measured inter‑model reliability with Krippendorff’s alpha and obtained 0.639, below the 0.667 floor that Krippendorff suggests for even tentative conclusions and well under the 0.80 that social scientists usually expect for dependable agreement.
- A large share of the splits were substantive: on about 343 claims, roughly half of the 672 contested ones, the models differed by two or more verdict categories on the four‑bucket rubric (True, Mostly True, Misleading, False).
- The models reached unanimous agreement on only 328 claims and produced almost no unanimous middle‑ground verdicts (Mostly True or Misleading), which shows strong convergence at the extremes but fracture on ambiguous cases.
- The findings warn journalists, fact‑checkers, platform engineers, and investors that a single model's output is not reliable ground truth and that high‑risk uses will need human review or multi‑model validation.
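
The figures above (contested claims, splits of two or more categories, unanimous verdicts, and Krippendorff's alpha) can be reproduced from a table of per-claim verdicts. The following is a minimal sketch, not the study's actual code: the function names `summarize` and `krippendorff_alpha_nominal` are hypothetical, the sample data is a toy example, and it assumes every model rated every claim. It also uses the nominal form of alpha, which treats all disagreements as equally severe; the summary does not say which distance metric the study used, so this choice is an assumption.

```python
from collections import Counter
from itertools import permutations

# Verdict scale from the rubric, ordered from one extreme to the other.
SCALE = ["True", "Mostly True", "Misleading", "False"]
RANK = {label: i for i, label in enumerate(SCALE)}


def summarize(ratings):
    """Count contested, widely split, and unanimous claims.

    `ratings` is a list of tuples, one tuple of verdicts per claim.
    """
    contested = 0
    wide_splits = 0
    unanimous = Counter()
    for row in ratings:
        if len(set(row)) == 1:
            unanimous[row[0]] += 1
            continue
        contested += 1
        ranks = [RANK[v] for v in row]
        # A "wide" split spans two or more steps on the four-bucket scale.
        if max(ranks) - min(ranks) >= 2:
            wide_splits += 1
    return {
        "contested": contested,
        "wide_splits": wide_splits,
        "unanimous": dict(unanimous),
    }


def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha with the nominal metric (assumed, see lead-in)."""
    # Coincidence matrix: each ordered pair of verdicts within a claim
    # contributes 1 / (m - 1), where m is the number of raters on that claim.
    coincidences = Counter()
    for row in ratings:
        m = len(row)
        if m < 2:
            continue  # a claim with a single rating carries no pair information
        for a, b in permutations(row, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        return float("nan")  # alpha is undefined when only one category occurs
    return 1.0 - observed / expected


# Toy data: four claims rated by three hypothetical models.
sample = [
    ("True", "True", "True"),
    ("False", "False", "False"),
    ("True", "Mostly True", "True"),
    ("True", "Misleading", "False"),
]
stats = summarize(sample)
alpha = krippendorff_alpha_nominal(sample)
```

With real data, each tuple would hold the five models' verdicts for one claim. Because the rubric is ordered, an ordinal distance metric would penalize a True versus False split more than a True versus Mostly True split, and would likely give a different alpha than the nominal form sketched here.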