// news · agents · 2026-06-26 · source: kilitechnology / arxiv

Holistic Agent Leaderboard (Kapoor et al. 2026) required $40K to evaluate agents on 9 benchmarks with limited scaffold variation: the structural cost problem of comprehensive agent evaluation

The Holistic Agent Leaderboard (HAL; Kapoor et al. 2026) required approximately $40,000 to evaluate agents on 9 benchmarks, despite considering at most 2 scaffolds per benchmark and only 1 run per scaffold–model configuration. This structural cost problem will shape agent-evaluation infrastructure economics from H2 2026 into 2027, with direct consequences for research-organization and procurement-evaluation budgets.

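The substantive piece is the economics of $40K per comprehensive evaluation. Pre-HAL agent benchmarks operated at smaller scope (a single benchmark per evaluation) and at lower cost. Comprehensive evaluation across 9 benchmarks at $40K per cycle sets the operational-cost floor for systematic agent procurement evaluation, because that figure already reflects a minimal configuration. The cost compounds multiplicatively, since every added scaffold, repeated run, and model variant multiplies the number of configurations to evaluate. Assuming cost scales roughly linearly with the number of runs, doubling the runs per configuration and tripling the number of configurations already implies about $240K per cycle, and broader sweeps approach $1M.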
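The compounding can be made concrete with a back-of-the-envelope model. This is a minimal sketch, not HAL's accounting: it assumes cost is proportional to the total number of agent runs, and the function names and `config_multiplier` parameter are hypothetical, introduced only for illustration.

```python
# Illustrative cost model. Linear scaling with run count is an assumption,
# not a result reported by HAL.
BASELINE_COST_USD = 40_000  # approximate HAL total across 9 benchmarks
BASELINE_RUNS_PER_CONFIG = 1  # HAL used 1 run per scaffold-model configuration


def scaled_cost(runs_per_config: int, config_multiplier: float = 1.0) -> float:
    """Estimate one evaluation cycle's cost under linear scaling.

    config_multiplier is a hypothetical factor for how many times more
    scaffold-model configurations are evaluated than in the baseline.
    """
    run_factor = runs_per_config / BASELINE_RUNS_PER_CONFIG
    return BASELINE_COST_USD * run_factor * config_multiplier


def multiplier_to_reach(target_usd: float) -> float:
    """Overall expansion factor needed for the baseline to hit target_usd."""
    return target_usd / BASELINE_COST_USD


if __name__ == "__main__":
    print(f"2 runs per configuration: ${scaled_cost(2):,.0f}")
    print(f"2 runs, 3x configurations: ${scaled_cost(2, 3):,.0f}")
    print(f"Expansion factor for $1M: {multiplier_to_reach(1_000_000):.0f}x")
```

Under these assumptions, two runs per configuration cost $80,000, adding three times as many configurations brings the cycle to $240,000, and reaching $1M takes a 25-fold expansion of the baseline.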

Read against the 44-70% task reduction reported in Efficient Benchmarking of AI Agents (arXiv 2603.23749), the implication is that cost-reduction methodology becomes economically necessary in H2 2026 if procurement evaluation is to keep pace. Vendors and procurement teams cannot afford $40K or more per comprehensive evaluation cycle; cost-efficient methods such as intermediate-pass-rate protocols enable more frequent vendor re-evaluation against a changing capability landscape.
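To illustrate the idea behind an intermediate-pass-rate protocol, the sketch below keeps only tasks whose historical pass rate falls in a middle band, then estimates the cost of the reduced cycle. It is a hedged illustration and not the paper's implementation: the 0.3 and 0.7 thresholds, the task names, the pass rates, and the assumption that cost is proportional to the number of tasks are all hypothetical.

```python
def select_intermediate_tasks(
    pass_rates: dict[str, float], low: float = 0.3, high: float = 0.7
) -> list[str]:
    """Keep tasks that are neither nearly always solved nor nearly always failed.

    The low/high thresholds are assumed values chosen for illustration.
    """
    return sorted(task for task, rate in pass_rates.items() if low <= rate <= high)


def reduced_cost(full_cost_usd: float, task_reduction: float) -> float:
    """Estimate cycle cost after dropping a fraction of tasks.

    Assumes cost is proportional to the number of tasks evaluated.
    """
    return round(full_cost_usd * (1.0 - task_reduction), 2)


if __name__ == "__main__":
    # Hypothetical historical pass rates across previously evaluated agents.
    history = {
        "task_a": 0.05,
        "task_b": 0.45,
        "task_c": 0.95,
        "task_d": 0.60,
        "task_e": 0.30,
    }
    print(select_intermediate_tasks(history))
    for reduction in (0.44, 0.70):
        print(f"{reduction:.0%} fewer tasks: ${reduced_cost(40_000, reduction):,.0f}")
```

If cost did track task count linearly, the reported 44-70% reduction would bring a $40K cycle down to roughly $12,000 to $22,400. Real savings depend on how per-task cost varies, which this sketch does not model.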

Kili Technology: AI Benchmarks 2026: Top Evaluations and Their Limits · arXiv: Efficient Benchmarking of AI Agents (2603.23749)