// news · research-papers · 2026-06-20 · source: arxiv / magazine.sebastianraschka

'End of Software Engineering' arXiv paper claims small models with agent scaffolding outperform Llama 3.1 405B by 22.76% on automated SWE benchmarks

A June 2026 arXiv paper (2606.05608) reports that small models inside well-designed agent scaffolds achieve a 22.76% relative improvement over Llama 3.1 405B on automated software engineering benchmarks. The framing, that a small model with scaffolding beats a monolithic large model, has implications for compute economics, vendor selection, and the agent-vs-foundation-model investment thesis.
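A relative improvement is easy to misread as an absolute one, so the arithmetic is worth spelling out. The sketch below is illustrative only: the baseline score of 40.0 is a hypothetical placeholder, not a figure from the paper, and the function names are invented for this example. Only the 22.76% figure comes from the reporting above.

```python
def relative_improvement(baseline: float, candidate: float) -> float:
    """Percentage by which candidate exceeds baseline, relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


def score_from_relative(baseline: float, relative_pct: float) -> float:
    """Score implied by a given relative improvement over baseline."""
    return baseline * (1.0 + relative_pct / 100.0)


# HYPOTHETICAL baseline score (percent of tasks resolved); not from the paper.
hypothetical_baseline = 40.0
reported_relative_pct = 22.76  # headline figure from the article

implied_score = score_from_relative(hypothetical_baseline, reported_relative_pct)
absolute_gain = implied_score - hypothetical_baseline

print(f"implied score: {implied_score:.2f}")
print(f"absolute gain: {absolute_gain:.2f} points")
```

Under that assumed baseline, a 22.76% relative gain works out to roughly nine percentage points, not twenty-two. The actual absolute gap depends on the baseline scores the paper reports.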

The core of the paper is the scaffolding-over-scale claim. The dominant 2024-2025 thesis held that scale was the primary lever for software-engineering capability: a bigger model meant better SWE-Bench scores. The paper argues that a smaller model inside a well-designed agent loop (planning, code execution, error feedback, retry) outperforms a much larger model used in single-shot mode. The 22.76% relative improvement over Llama 3.1 405B is the headline number; the broader claim is that, for SWE workloads, optimizing agent architecture is now a higher-leverage investment than optimizing model scale.
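To make the loop concrete, here is a minimal sketch of the execute, feed-back-errors, retry cycle described above. It is not the paper's scaffold. The `Model` interface, `toy_model`, and `agent_loop` are hypothetical names invented for illustration, the planning step is folded into the model call, and candidate code runs in-process via `exec` with no sandboxing, which a real agent would not do.

```python
import traceback
from typing import Callable, Optional, Tuple

# Hypothetical model interface: (task, feedback) -> Python source code.
Model = Callable[[str, Optional[str]], str]


def execute(code: str, check: str) -> Optional[str]:
    """Run candidate code, then the check. Return None on success, else error text."""
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the candidate solution
        exec(check, namespace)  # run the acceptance check against it
    except Exception:
        return traceback.format_exc(limit=1)
    return None


def agent_loop(
    model: Model, task: str, check: str, max_attempts: int = 3
) -> Tuple[Optional[str], int]:
    """Ask the model for code, run it, and retry with the error as feedback."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = model(task, feedback)
        feedback = execute(code, check)
        if feedback is None:
            return code, attempt
    return None, max_attempts


def toy_model(task: str, feedback: Optional[str]) -> str:
    """Stand-in for an LLM: wrong on the first try, corrected once it sees an error."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


solution, attempts = agent_loop(
    toy_model, "write add(a, b)", "assert add(2, 3) == 5"
)
print(f"solved after {attempts} attempt(s)")
```

A single-shot call corresponds to `max_attempts=1`: the toy model's first answer fails the check and nothing corrects it. The loop lets execution feedback do work that a larger model would otherwise have to get right on the first try.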

Read alongside OpenCode's adoption (160K stars, 7.5M MAU), the result suggests that the open-source agent-scaffolding category is moving fast precisely because the scaffolding-vs-scale tradeoff favors investment in scaffolding. Anthropic's Claude Code, OpenAI's Codex, and Cursor's IDE agent all benefit from the same dynamic: the bottleneck on capability is not the lab providing the model but the agent-loop architecture around it.


arXiv, "The End of Software Engineering: How AI Agents Are Fundamentally Restructuring the Software Paradigm" · Sebastian Raschka, "LLM Research Papers: The 2026 List (January to May)"