Particle News: OpenAI and Paradigm Release EVMbench as an Open-Source Test for AI on EVM Smart-Contract Security

Overview

EVMbench evaluates AI agents across three tasks—detecting vulnerabilities, patching code, and executing controlled exploits—using a reproducible harness with containerized local EVMs and deterministic replay.
The dataset comprises 120 high-severity cases drawn from about 40 professional audits, largely from Code4rena competitions, with additional scenarios from audits of the payments-focused Tempo blockchain.
OpenAI reported that GPT-5.3-Codex achieved a 72.2% success rate in exploit mode versus 31.9% for GPT-5, while scores for detection and patching lagged across models.
Leaderboards showed strong third-party performance on some detection metrics; in one report, Anthropic’s Claude Opus 4.6 topped an average “detect award” measure.
OpenAI published the code and data and committed $10 million in API credits to support security work. It cautioned that the benchmark omits timing-based and multi-chain attacks and does not fully capture real-world complexity, even as smart contracts secure $100B+ in assets and DeFi hacks persist.