// blog · analysis · research-papers · 2026-05-25 · 6 min read

Less is More — LIMO and the reasoning-efficiency thesis as the cumulative case for research automation

The LIMO paper showed that 800 curated reasoning examples beat 100,000 mixed examples. The reproduction at frontier-model scale confirms the result holds. Combined with four concrete AI math milestones in 35 days, the cumulative empirical case for AI as research infrastructure is no longer aspirational; it's measured.

The data-quality vs data-quantity debate has been one of the longest-running unresolved questions in deep-learning research. The pretraining intuition is that more is better — train on every token available, let the model figure out which patterns matter. The fine-tuning intuition is that quality matters — give the model a small number of high-signal examples and let it generalize. Both have empirical support across different tasks.

The LIMO paper made the strong-form claim for reasoning specifically: 800 carefully curated high-quality reasoning examples produce better downstream reasoning performance than 100,000+ mixed-quality examples. The original paper used smaller open-source models. The May 2026 reproduction at frontier-model scale (Gemini 3.1 Pro and Claude Opus 4.7 ablations) confirms the result holds at the largest model sizes — it's not a quirk of small-model training dynamics.
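To make the 800-versus-100,000 ratio concrete, here is a minimal sketch of budgeted curation: keep only the highest-scoring examples from a large mixed pool. The `Example` type, the `quality` field, and the synthetic scores are all assumptions for illustration; the LIMO paper's actual selection criteria are not reproduced here.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    prompt: str
    solution: str
    quality: float  # hypothetical quality score in [0, 1]; not from the paper


def curate(pool, budget):
    """Return the `budget` highest-quality examples from `pool`."""
    return heapq.nlargest(budget, pool, key=lambda ex: ex.quality)


def reduction_factor(pool_size, budget):
    """How many times smaller the curated set is than the original pool."""
    return pool_size / budget


# Synthetic pool of 100,000 examples with made-up quality scores.
pool = [
    Example(prompt=f"q{i}", solution=f"a{i}", quality=(i % 1000) / 1000)
    for i in range(100_000)
]

curated = curate(pool, budget=800)
ratio = reduction_factor(len(pool), len(curated))  # 125.0
```

The hard part in practice is the scoring function, not the selection step; this sketch assumes the scores already exist.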

Why this matters beyond reasoning

The narrow conclusion is that reasoning-capability training has a data-curation moat rather than a data-collection moat. Labs that figure out the curation methodology produce better reasoners with less data. That's directly useful for the next 12 months of reasoning-model development.

The broader conclusion is that the data-quantity-scaling era for capability development is approaching diminishing returns for the specific capabilities frontier-AI customers care about. Through 2023-2024 every capability improvement looked scale-driven. Through 2025-2026 the improvements look increasingly curation-driven, architecture-driven (sparse MoE), and methodology-driven (constitutional-AI feedback loops, integer-code interpretability). The labs that compete on "we have more training data" are racing to a position that may not matter anymore.

The cumulative research-automation case

LIMO plus four concrete AI math milestones between April 21 and May 25 — AlphaEvolve's production records, FrontierMath Tier 4, WorldReasonBench, OpenAI's Erdős disproof with Tim Gowers's companion paper — produces a five-milestone empirical case for AI as research infrastructure. Five independent results from five different labs in 35 days is not coincidence. It's a phase transition in what frontier models can do.

The relevant question is no longer whether AI can do research-grade mathematics; it's how fast the trajectory accelerates. Jack Clark's Cosmos Lecture (covered in earlier cycles) put a 60%+ probability on recursive self-improvement by end-2028. The math milestones are exactly the data points that probability is conditioned on. LIMO is the methodology that makes that recursive cycle compute-efficient — the model can generate and curate its own next training set without requiring exponentially more data.

What's still uncertain

The honest gap is whether the LIMO finding generalizes beyond reasoning. The 800-vs-100,000 efficiency gap was measured on reasoning-style tasks where there's a clear logical structure to evaluate. Other capability surfaces (multimodal grounding, agent tool-use, code generation in unfamiliar codebases, creative writing) may not show the same data-quality-dominates-quantity dynamic. The labs that try to apply LIMO-style curation to those surfaces will discover whether it generalizes; the answer isn't obvious.

The second gap is reproducibility of the curation methodology. LIMO's 800 examples were carefully filtered by domain experts. Scaling that filtering to thousands-of-examples regimes across multiple capability surfaces requires methodology automation: using one model to filter examples for training another model. That introduces feedback loops the field doesn't have good characterizations of yet. The Anthropic constitutional-AI feedback loop work (40% alignment-failure reduction) is the closest analog, but applied to safety rather than capability.
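The idea of one model filtering examples for another can be sketched as follows. This is an illustration under stated assumptions, not LIMO's pipeline: `filter_with_judge`, the `threshold` value, and `toy_judge` (a keyword heuristic standing in for a real judge model) are all hypothetical names introduced here.

```python
from typing import Callable, Iterable, List, Tuple


def filter_with_judge(
    candidates: Iterable[str],
    judge: Callable[[str], float],
    threshold: float = 0.8,
) -> Tuple[List[str], float]:
    """Keep candidates the judge scores at or above `threshold`.

    Also returns the acceptance rate, a crude signal to log across
    curation rounds when watching for drift in the filter.
    """
    kept = []
    total = 0
    for text in candidates:
        total += 1
        if judge(text) >= threshold:
            kept.append(text)
    acceptance_rate = len(kept) / total if total else 0.0
    return kept, acceptance_rate


def toy_judge(text: str) -> float:
    """Hypothetical stand-in for a judge model: rewards explicit reasoning words."""
    steps = text.count("because") + text.count("therefore")
    return min(1.0, steps / 2)


candidates = [
    "x is even because it is 2k, therefore x squared is even",
    "the answer is 4",
    "because a",
]
kept, rate = filter_with_judge(candidates, toy_judge)
```

Logging the acceptance rate per round does not characterize the feedback loop, but a rate that climbs steadily while held-out evaluation stays flat would be one hint that the judge and the trained model are converging on each other's preferences.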

The bigger picture

What May 2026 shows is that AI research has shifted from "scale solves everything" to "the methodology you wrap around scale matters more than the scale itself." That's a methodology revolution as much as a capability one. The labs that internalize the shift by investing in curation infrastructure, interpretability tooling, and evaluation methodology outperform the labs still racing on parameter count. Open-weight labs like Mistral (at frontier parity with closed models at smaller model sizes) have shown that the methodology shift is real and accessible. The next year will show whether closed-frontier labs adapt to the new game or keep playing the old one.

Sources: arXiv, LIMO Less is More for Reasoning · Tech Jacks Solutions, AI Math Reasoning Milestones 30 Days · Phys.org, AI breakthrough in math problem decades