OSWorld leaderboard reveals dramatic computer-use vendor capability spread — OpenAI Operator at 38%, Claude Sonnet 4.6 at 72.5%, Coasty at 82% (above human baseline)
Recent OSWorld benchmark runs expose a wider-than-expected capability spread among computer-use agents. OpenAI Operator scored 38%, Claude Sonnet 4.6 reached 72.5%, and the specialized agent Coasty hit 82%, above the human baseline. The 44-point gap between the top and bottom commercial offerings (82% minus 38%) is the largest observed in any 2026 agent benchmark, and it forces procurement teams to compare products on measured results instead of defaulting to brand recognition.
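The arithmetic behind the headline gap can be checked with a short sketch. The scores are the three quoted above; the function names `spread_points` and `ranked` are illustrative only and not part of any OSWorld tooling.

```python
# OSWorld task-success scores (percent) as quoted in this brief.
scores = {
    "OpenAI Operator": 38.0,
    "Claude Sonnet 4.6": 72.5,
    "Coasty": 82.0,
}


def spread_points(results: dict[str, float]) -> float:
    """Gap in percentage points between the best and worst score."""
    return max(results.values()) - min(results.values())


def ranked(results: dict[str, float]) -> list[tuple[str, float]]:
    """Agents sorted from highest to lowest score."""
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    leader_name, leader_score = ranked(scores)[0]
    print(f"Top-to-bottom spread: {spread_points(scores):.1f} points")
    for name, score in ranked(scores):
        # Distance of each agent from the leader, in percentage points.
        print(f"{name}: {score:.1f}% ({leader_score - score:.1f} behind {leader_name})")
```

The gap is in percentage points, not a relative percentage: 82 minus 38 gives 44 points, and Claude Sonnet 4.6 sits 9.5 points behind the leader.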
The substantive point is that capability has bifurcated in computer use specifically. The 2026 six-benchmark consolidation includes OSWorld for computer-use workloads, and the spread on this benchmark is significantly wider than the spreads on GAIA, SWE-Bench, and Tau²-Bench. Computer use rewards tight integration between model capability, scaffolding architecture, and screen-understanding pipelines. The 44-point spread separates vendors that have invested in a computer-use-specific stack from those that have shipped general-purpose agents that underperform on this workload.
The procurement read for H2 2026 computer-use deployments is that vendor selection matters on this workload class. A team that defaults to the headline frontier-model vendor (OpenAI Operator at 38%) gets markedly lower task success than one that picks a specialized offering (Coasty at 82%). The complication is that Coasty is a vertical agent product, not a general-purpose model, so the comparison isn't apples-to-apples. For context, Stanford's AI Index Report shows OSWorld task success across the field improving from 12% to 66% over 2025-2026.
Coasty — OSWorld Benchmark 2026: 82% Real, 73% Exploited — Why Your Computer Use Agent Choice Matters · O-Mega — 2025-2026 AI Computer-Use Benchmarks & Top AI Agents · RapidClaw — AI Agent Benchmarks 2026: SWE-bench, GAIA