// news · agents · 2026-06-20 · source: decodethefuture / aiagentsquare / prosus

Agent evaluation consolidates around six benchmarks for 2026 — GAIA, SWE-Bench Verified, OSWorld, Tau²-Bench, WebArena, METR HCAST

The agent-evaluation literature has consolidated around six benchmarks for production decisions through H2 2026: GAIA (general assistant), SWE-Bench Verified (real GitHub bug fixes), OSWorld (computer use), Tau²-Bench (tool use, user interaction, and policy adherence), WebArena (multi-step browser tasks), and METR HCAST / Time Horizons (the longest task length an agent completes 50% of the time). The consolidation matters more than any single number: it lets vendors and buyers speak the same evaluation language.

The substantive piece is that agent benchmarks now mean something consistent across vendor pitches. Through 2024-2025, every vendor reported on a different evaluation suite, often custom-built, which made cross-vendor comparison effectively impossible without running your own evals. The six-benchmark consolidation lets procurement teams ask 'show me your scores on GAIA / SWE-Bench Verified / OSWorld' and get directly comparable numbers. The benchmarks also segment cleanly by workload shape: GAIA and Tau²-Bench for general assistants, SWE-Bench Verified and OSWorld for engineering and computer use, WebArena for browser automation, and METR HCAST for the long-horizon capability frontier.
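The workload segmentation above can be written down as a simple lookup, which is roughly what a procurement checklist would encode. This is a minimal sketch: the dictionary keys and the `benchmarks_for` helper are hypothetical names chosen for illustration and do not come from any of the cited reports.

```python
# Hypothetical lookup table: workload labels are paraphrased from the
# paragraph above, not taken from any vendor or benchmark API.
BENCHMARKS_BY_WORKLOAD = {
    "general_assistant": ["GAIA", "Tau²-Bench"],
    "engineering_and_computer_use": ["SWE-Bench Verified", "OSWorld"],
    "browser_automation": ["WebArena"],
    "long_horizon": ["METR HCAST"],
}


def benchmarks_for(workload: str) -> list[str]:
    """Return the benchmark scores to request for a given workload shape."""
    try:
        # Return a copy so callers cannot mutate the shared table.
        return list(BENCHMARKS_BY_WORKLOAD[workload])
    except KeyError:
        known = ", ".join(sorted(BENCHMARKS_BY_WORKLOAD))
        raise ValueError(
            f"unknown workload {workload!r}; expected one of: {known}"
        ) from None


if __name__ == "__main__":
    for name in BENCHMARKS_BY_WORKLOAD:
        print(f"{name}: {' + '.join(benchmarks_for(name))}")
```

A buyer evaluating a browser-automation agent would, under this mapping, ask only for WebArena results rather than the full set of six.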

The competitive read against the May 2026 leaderboard is that the spread among the top three models is narrow: Claude Mythos Preview at 68.7%, GPT-5.4 Pro at 65.8%, Claude Opus 4.6 at 64.5%. The 4.2-point spread between #1 and #3 suggests vendor selection for general-agent workloads is now driven more by pricing, latency, and ecosystem fit than by raw capability ranking. The comparison that matters is the 14% baseline of two years ago: measured against it, the top score on this leaderboard has risen by roughly 55 percentage points in 24 months.
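The arithmetic behind the spread and the two-year gain can be checked directly. The sketch below uses only the figures quoted in the paragraph above; the variable names are illustrative, and the 14% baseline is treated as comparable to the current top score, which is an assumption the source does not spell out.

```python
# Scores (percent) as quoted for the May 2026 leaderboard.
scores = {
    "Claude Mythos Preview": 68.7,
    "GPT-5.4 Pro": 65.8,
    "Claude Opus 4.6": 64.5,
}
# Baseline (percent) quoted for two years earlier; assumed comparable.
baseline = 14.0

# Rank models from highest to lowest score.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Gap between #1 and #3, rounded to one decimal to hide float noise.
top3_spread = round(ranked[0][1] - ranked[2][1], 1)

# Gain of the current top score over the baseline, in percentage points.
gain_over_baseline = round(ranked[0][1] - baseline, 1)

if __name__ == "__main__":
    print(f"#1 to #3 spread: {top3_spread} points")
    print(f"gain over baseline: {gain_over_baseline} points")
```

Both results are differences in percentage points, not relative changes: in relative terms the top score is nearly five times the baseline.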


Decode The Future — AI Agent Benchmarks 2026: 6 Tests That Matter → · AI Agent Square — AI Agent Benchmarks 2026: Performance, Accuracy & Cost Compared → · Prosus — State of AI Agents 2026: Autonomy is Here →