// news · alignment · research · 2026-05-21 · source: arXiv 2605.06241

New arXiv work argues RL for LLM reasoning is sparse policy selection, not capability learning — only 1-3% of tokens shift

An arXiv paper out this month, 'Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning', reports that RL fine-tuning of frontier reasoning models shifts the chosen token at only 1-3% of positions, and that the tokens RL promotes nearly always lie within the base model's top-5 alternatives at that position. On this reading, a 'reasoning model' is a base model with a sparsely modified token-selection policy, not a model that has acquired new reasoning capability.
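The two headline numbers can be made concrete with a minimal sketch. It is an illustration only, not the paper's measurement code: it assumes both models are scored on the same token prefix, that "shifted" means the greedy (top-1) token differs, and that per-position logits are available as plain lists. The function names `top_k_indices` and `sparsity_stats` are hypothetical.

```python
from typing import Dict, List, Sequence


def top_k_indices(logits: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k largest logits, highest first.

    Ties are broken in favour of the lower token index.
    """
    return sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]


def sparsity_stats(
    base_logits: Sequence[Sequence[float]],
    tuned_logits: Sequence[Sequence[float]],
    k: int = 5,
) -> Dict[str, float]:
    """Compare greedy token choices of a base and an RL-tuned model.

    Assumption: row i of each input holds the logits both models assign
    at position i of the *same* prefix (teacher forcing).
    """
    if len(base_logits) != len(tuned_logits):
        raise ValueError("both models must be scored on the same positions")

    shifted = 0       # positions where the top-1 token changed
    within_top_k = 0  # shifted positions whose new token was in the base top-k

    for base_row, tuned_row in zip(base_logits, tuned_logits):
        base_top1 = top_k_indices(base_row, 1)[0]
        tuned_top1 = top_k_indices(tuned_row, 1)[0]
        if tuned_top1 != base_top1:
            shifted += 1
            if tuned_top1 in top_k_indices(base_row, k):
                within_top_k += 1

    n = len(base_logits)
    return {
        "positions": float(n),
        "shifted_fraction": shifted / n if n else 0.0,
        "within_top_k_fraction": within_top_k / shifted if shifted else 0.0,
    }
```

Under these assumptions, the reported result corresponds to `shifted_fraction` of roughly 0.01 to 0.03 and `within_top_k_fraction` close to 1 for `k=5`.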

If the finding generalizes, the cost-benefit math of the entire RLHF-and-DPO pipeline shifts. The reasoning gains that labs attribute to RL stages would mostly reflect latent capability the base model already has, surfaced rather than learned. On the 'sparse policy selection' view, much cheaper inference-time selection techniques could replicate most of the gain without the multi-day RL training run.
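One way to picture an inference-time selection technique is the sketch below. It is a hypothetical illustration, not a method taken from the paper: the base model proposes its top-k candidates, and an external scorer (here the assumed `score` callable, which could stand in for a verifier or a small learned head) picks among them. Restricting the choice to the base top-k mirrors the reported observation that promoted tokens almost always come from that set.

```python
from typing import Callable, Sequence


def select_token(
    base_logits: Sequence[float],
    score: Callable[[int], float],
    k: int = 5,
) -> int:
    """Choose a token id from the base model's top-k candidates.

    `score` is a hypothetical external scorer mapping a token id to a
    preference value. Candidates are ordered by base logit, so when the
    scorer is indifferent the base model's greedy token is returned.
    """
    candidates = sorted(
        range(len(base_logits)), key=lambda i: (-base_logits[i], i)
    )[:k]
    # max() returns the first maximal element, i.e. the candidate with
    # the highest base logit among those the scorer rates equally.
    return max(candidates, key=score)
```

Because the scorer only ever sees the base model's own candidates, it cannot promote a token the base model ranks outside the top-k, which is the constraint the sparse-selection view implies.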

For alignment, the implication is more subtle. If RL moves only 1-3% of tokens, and those tokens stay within the base model's top-5 alternatives, then whatever safety behavior the alignment stage is supposed to introduce is also concentrated in that 1-3% of decisions, assuming alignment-stage training is as sparse as reasoning RL. That would make circuit-level safety audits substantially more tractable. It would also make the modified decision boundary substantially more legible to adversaries who want to identify and target it.

arXiv 2605.06241 (sparse policy selection) · arXiv 2605.04065 (free energy RL)