// news · agents2026-06-29source: arxiv / voltagent

WorkBench Revisited re-ran benchmark on 21 models released 2023-2026 — single modern agent harness using native tool-calling, Claude Opus 4.8 reaches 89% completion with only 2.5% unintended harm

WorkBench Revisited arXiv paper (2606.13715) re-runs the benchmark on 21 models released 2023-2026 spanning four vendors and both proprietary + open-weight options, under a single modern agent harness using native tool-calling. Claude Opus 4.8 reaches 89% task completion with only 2.5% unintended harmful action — capability and safety co-rise rather than trade off.

The substantive piece is the standardized-harness cross-vendor evaluation methodology. Pre-WorkBench-Revisited model comparisons typically used inconsistent harnesses + scaffolding choices that confounded capability-versus-methodology attribution. The single modern agent harness using native tool-calling across 21 models provides consistent methodology for direct cross-model comparison.

The competitive read against the earlier coverage of Claude Opus 4.8's 89% completion is that today's expanded coverage adds the 2.5% unintended-harm metric. Combined capability + safety co-rise empirical finding substantively contradicts the alignment-tax narrative that has shaped procurement discussions for years.

See our analysis →

arXiv — WorkBench Revisited: Workplace Agents Two Years On (2606.13715) → · VoltAgent — Awesome AI Agent Papers 2026 →