OpenAI, together with crypto firm Paradigm, has launched EVMbench, a benchmark that assesses AI agents' ability to identify, fix, and exploit serious vulnerabilities in smart contracts.
EVMbench includes 120 curated vulnerabilities from 40 security audits, primarily from open code audit competitions on platforms like Code4rena.
Three Evaluation Modes:
EVMbench tests AI agents in three different capability modes, each focused on a unique stage of the smart contract security lifecycle.
Detect: Agents audit a smart contract repository and are scored on recall of ground-truth vulnerabilities and associated audit rewards.
Patch: Agents modify vulnerable contracts and must preserve intended functionality while eliminating exploitability, verified through automated tests and exploit checks.
Exploit: Agents execute end-to-end fund-draining attacks against deployed contracts on a sandboxed blockchain environment, with grading performed programmatically via transaction replay and on-chain verification.
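To make the Detect scoring concrete, here is a minimal Python sketch of recall over ground-truth vulnerabilities alongside a reward-weighted share. The exact formula EVMbench uses is not given here, so the `Finding` class, the `detect_scores` function, the vulnerability IDs, and the dollar amounts are all hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    vuln_id: str       # hypothetical ID of a ground-truth vulnerability
    reward_usd: float  # hypothetical audit reward attached to it


def detect_scores(ground_truth, reported_ids):
    """Return (recall, reward_share) for a set of reported vulnerability IDs.

    Assumption: recall is the fraction of ground-truth findings reported,
    and reward_share is the fraction of total audit reward those cover.
    """
    found = [f for f in ground_truth if f.vuln_id in reported_ids]
    recall = len(found) / len(ground_truth) if ground_truth else 0.0
    total = sum(f.reward_usd for f in ground_truth)
    reward_share = sum(f.reward_usd for f in found) / total if total else 0.0
    return recall, reward_share


# Made-up data: three known issues; the agent reports two of them plus
# one ID ("X-99") that matches nothing in the ground truth.
ground_truth = [
    Finding("H-01", 6000.0),
    Finding("H-02", 3000.0),
    Finding("H-03", 1000.0),
]
recall, reward_share = detect_scores(ground_truth, {"H-01", "H-03", "X-99"})
print(f"recall={recall:.3f} reward_share={reward_share:.3f}")
```

In this sketch an unmatched report neither helps nor hurts, since only recall is measured; a real grader may treat false positives differently.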
For consistent evaluation, OpenAI built a Rust-based tool that deploys contracts in a controlled manner and blocks unsafe RPC methods. All exploitation tasks run in a secure local Anvil environment, not on live networks.
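One way to picture blocking unsafe RPC methods is a small gate that inspects each JSON-RPC request before it reaches the local node. The sketch below is written in Python for brevity and is not OpenAI's Rust tool; the blocked prefixes and the `is_allowed` and `gate` helpers are assumptions chosen for illustration, since the actual list of restricted methods is not stated here.

```python
from typing import Optional

# Assumption: node-control namespaces (for example Anvil's `anvil_*`
# methods such as anvil_setBalance) are treated as unsafe, because they
# would let an agent alter chain state without a real exploit.
BLOCKED_PREFIXES = ("anvil_", "hardhat_", "evm_", "debug_")


def is_allowed(method: str) -> bool:
    """Return True if the JSON-RPC method name is not in a blocked namespace."""
    return not method.startswith(BLOCKED_PREFIXES)


def gate(request: dict) -> Optional[dict]:
    """Return None to forward the request, or a JSON-RPC error to reject it."""
    method = request.get("method")
    if isinstance(method, str) and is_allowed(method):
        return None
    return {
        "jsonrpc": "2.0",
        "id": request.get("id"),
        # -32601 is the JSON-RPC 2.0 "Method not found" code.
        "error": {"code": -32601, "message": f"method not allowed: {method}"},
    }


ok = gate({"jsonrpc": "2.0", "id": 1, "method": "eth_sendRawTransaction", "params": []})
blocked = gate({"jsonrpc": "2.0", "id": 2, "method": "anvil_setBalance", "params": []})
print(ok, blocked)
```

A prefix denylist is the simplest form of such a filter; an explicit allowlist of permitted methods would be stricter and is an equally plausible design.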
EVMbench shows that frontier models perform differently across the three modes. In exploit mode, GPT-5.3-Codex scored 72.2%, up 40.3 percentage points from the 31.9% that GPT-5 scored six months earlier.
Limitations:
OpenAI stated that EVMbench does not fully capture the challenges of real-world smart contract security. The vulnerabilities it includes come from Code4rena audits. Although they are realistic and severe, many heavily used crypto contracts face even more scrutiny and may be harder to exploit.
OpenAI has allocated $10 million in API credits for its Cybersecurity Grant Program to boost research in defensive security, focusing on open-source software and critical infrastructure.
InfoSecBulletin Cybersecurity for mankind
