// news · research-papers · 2026-05-30 · source: arxiv / arc prize / nips

ARC-AGI-2 reaches 24% via hundreds of thousands of synthetic examples: perception bottleneck accounts for roughly 80% of failures, reasoning remains knowledge-bound

The ARC Prize 2025 winners needed hundreds of thousands of synthetic examples to reach 24% on ARC-AGI-2, a result that suggests abstract reasoning benchmarks remain knowledge-bound rather than reasoning-bound. A separate study finds that approximately 80% of model failures on abstract reasoning benchmarks stem from perception errors rather than reasoning shortcomings, which suggests that ARC-style benchmarks conflate the two challenges.

The perception-vs-reasoning decomposition is the key methodological insight. Treating ARC-AGI scores as a pure reasoning measure attributes roughly 80% of model failures to a reasoning-capability gap when they may actually reflect a perception-capability gap. A model that fails to extract the correct grid representation from the puzzle image will fail no matter how well it reasons over the misperceived input.
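To make the decomposition concrete, here is a minimal back-of-the-envelope sketch. It is not taken from either paper. It assumes, purely for illustration, that the roughly 80% perception share from the separate study applies to the 24% ARC-AGI-2 result, which neither source claims, and that every solved task was perceived correctly. The names `FailureBreakdown` and `decompose` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureBreakdown:
    """Shares of all tasks, each expressed as a fraction in [0, 1]."""

    solved: float               # tasks answered correctly
    perception_failures: float  # tasks failed at the perception step
    reasoning_failures: float   # tasks failed despite correct perception

    @property
    def accuracy_given_correct_perception(self) -> float:
        # ASSUMPTION: every solved task was also perceived correctly,
        # so the correctly perceived pool is solved + reasoning failures.
        perceived = self.solved + self.reasoning_failures
        return self.solved / perceived if perceived > 0 else 0.0


def decompose(accuracy: float, perception_share: float) -> FailureBreakdown:
    """Split the failure mass (1 - accuracy) by the perception share."""
    if not (0.0 <= accuracy <= 1.0 and 0.0 <= perception_share <= 1.0):
        raise ValueError("inputs must be fractions in [0, 1]")
    failures = 1.0 - accuracy
    return FailureBreakdown(
        solved=accuracy,
        perception_failures=failures * perception_share,
        reasoning_failures=failures * (1.0 - perception_share),
    )


# ASSUMPTION: combining the two reported figures is illustrative only.
breakdown = decompose(accuracy=0.24, perception_share=0.80)
print(f"perception failures: {breakdown.perception_failures:.1%}")
print(f"reasoning failures:  {breakdown.reasoning_failures:.1%}")
print(
    "accuracy given correct perception: "
    f"{breakdown.accuracy_given_correct_perception:.1%}"
)
```

Under these assumptions, about 61% of all tasks would be lost to perception and about 15% to reasoning, so accuracy on correctly perceived tasks would be near 61% rather than 24%. The sketch says nothing about how many perception failures would become successes if perception were fixed; that remains unknown.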

The implication for measuring AGI progress is structural. The living survey "The ARC of Progress towards AGI" captures the field as of February 2026, and the 24% score on ARC-AGI-2 is the highest publicly reported. If the perception-bottleneck finding carries over, that number understates pure-reasoning capability and overstates how much reasoning ground remains before human-level performance. It also weakens the benchmark's power to discriminate among the next capability jumps, since a score gain could come from better perception rather than better reasoning.

arXiv — Your Reasoning Benchmark May Not Test Reasoning → · arXiv — The ARC of Progress towards AGI Living Survey →