SWE-bench Verified leaderboard: Mythos 93.9%, GPT-5.5 88.7%, Opus 4.7 87.6%, Cursor 86%
The May 2026 SWE-bench Verified leaderboard now lists 44 evaluated models. Claude Mythos Preview leads at 93.9%, making it the first model to clear 90% on the canonical benchmark for fixing real GitHub issues. GPT-5.5 follows at 88.7%, Claude Opus 4.7 (Adaptive) at 87.6%, Cursor's Composer 2.5 at around 86%, and GPT-5.3-Codex at 85.0%.
The 90% threshold is the inflection point. Below 90%, agent failures on real bug fixes were common enough that human review was structurally required. Above 90%, the pattern flips: agents are right by default, and human review becomes an audit rather than authorship.
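To put the threshold in concrete terms, here is a minimal sketch that converts the reported scores into expected failed fixes per 100 attempted patches. It assumes the benchmark resolve rate carries over unchanged to real-world work, which is a simplification; the helper name `expected_failures` is made up for illustration.

```python
# Scores as reported in this article (percent of benchmark issues resolved).
scores = {
    "Claude Mythos Preview": 93.9,
    "GPT-5.5": 88.7,
    "Claude Opus 4.7 (Adaptive)": 87.6,
    "GPT-5.3-Codex": 85.0,
}


def expected_failures(score_pct: float, patches: int = 100) -> float:
    """Expected failed fixes out of `patches` attempts.

    Assumption: the benchmark resolve rate transfers directly to the
    workload in question, which will not hold exactly in practice.
    """
    return round((100.0 - score_pct) / 100.0 * patches, 1)


for model, score in scores.items():
    print(f"{model}: ~{expected_failures(score)} failed fixes per 100 patches")
```

Under that assumption, the leader fails roughly 6 fixes in 100, while the next models fail roughly 11 to 15, about twice as many.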
Expect dev workflows to be re-architected downstream over Q3 2026. Companies that built their CI around "agent proposes, human merges" will start asking whether the human-in-the-loop step can be replaced with an automated test gate. See our analysis of what 93%-class agents change about software supply chains.
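What an "automated test gate" might look like is easiest to see in code. The following is a hypothetical sketch, not a description of any vendor's CI product: the `PatchResult` fields, the `gate_decision` function, and the 200-line limit are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class PatchResult:
    """Hypothetical summary of a CI run on an agent-authored patch."""
    tests_passed: int
    tests_failed: int
    lines_changed: int
    touches_protected_paths: bool  # e.g. auth, billing, deploy config


def gate_decision(result: PatchResult, max_lines_changed: int = 200) -> str:
    """Return "auto_merge", "human_review", or "reject".

    The thresholds are illustrative; a real team would tune them.
    """
    # Any failing test, or no passing tests at all, blocks the patch outright.
    if result.tests_failed > 0 or result.tests_passed == 0:
        return "reject"
    # Large or sensitive changes still go to a human, as an audit step.
    if result.touches_protected_paths or result.lines_changed > max_lines_changed:
        return "human_review"
    return "auto_merge"


small_fix = PatchResult(tests_passed=120, tests_failed=0,
                        lines_changed=35, touches_protected_paths=False)
print(gate_decision(small_fix))  # prints "auto_merge"
```

In this sketch the human is not removed entirely. Review is reserved for the patches where a passing test suite is the weakest evidence, which matches the "audit rather than authorship" framing.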
MarkTechPost — Best AI agents for software dev May 2026 → · Morphllm — 14 best AI coding agents 2026 →