Sparse policy and the audit surface — what the 1-3% finding does to alignment economics
If RL training of reasoning models affects only 1-3% of token positions, then the safety properties that come from RL-based alignment training plausibly concentrate in the same 1-3% of decisions. That makes audits more tractable, and it makes the safety-critical decisions easier for adversaries to locate.
The finding
A new arXiv paper (arXiv 2605.06241) argues that RL fine-tuning of frontier reasoning models is sparse policy selection, not capability learning. Specifically: RL training changes the chosen token at only 1-3% of token positions, and the promoted tokens are nearly always within the base model's top-5 alternatives. The implication: reasoning models are base models with sparsely modified token-selection policies, not models with new capabilities.
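The two quantities in that claim can be made concrete with a small sketch. It assumes you already have, for each token position, the base model's next-token probabilities and the token the RL-tuned model actually chose. The names `SparsityReport`, `top_k_ids` and `sparsity_report`, the "differs from base top-1" definition of a modified position, and the toy numbers are all illustrative assumptions, not the paper's method.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SparsityReport:
    modified_positions: List[int]  # positions where the RL choice differs from base top-1
    modified_fraction: float       # share of all positions that are modified
    top_k_containment: float       # share of modified positions whose RL token is in base top-k


def top_k_ids(probs: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-probability tokens, highest first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def sparsity_report(
    base_probs: Sequence[Sequence[float]],
    rl_choices: Sequence[int],
    k: int = 5,
) -> SparsityReport:
    """Compare RL-chosen tokens with the base model's ranking at each position.

    Assumption for this sketch: a position is "modified" when the RL model's
    token is not the base model's top-1 token.
    """
    if len(base_probs) != len(rl_choices):
        raise ValueError("need exactly one RL choice per position")
    modified: List[int] = []
    contained = 0
    for pos, (probs, rl_tok) in enumerate(zip(base_probs, rl_choices)):
        ranked = top_k_ids(probs, k)
        if rl_tok != ranked[0]:
            modified.append(pos)
            if rl_tok in ranked:
                contained += 1
    n = len(rl_choices)
    return SparsityReport(
        modified_positions=modified,
        modified_fraction=len(modified) / n if n else 0.0,
        top_k_containment=contained / len(modified) if modified else 0.0,
    )


# Toy data: 4 positions over a 6-token vocabulary (made-up numbers).
toy_base_probs = [
    [0.70, 0.10, 0.08, 0.06, 0.04, 0.02],
    [0.05, 0.60, 0.20, 0.08, 0.04, 0.03],
    [0.40, 0.30, 0.15, 0.09, 0.05, 0.01],
    [0.10, 0.05, 0.50, 0.20, 0.12, 0.03],
]
toy_rl_choices = [0, 1, 1, 2]  # only position 2 departs from the base top-1
print(sparsity_report(toy_base_probs, toy_rl_choices))
```

On the toy data, one of four positions is modified, and the promoted token there is the base model's second choice. Run over real traces, a modified fraction near 0.01-0.03 and a containment near 1.0 would be consistent with the paper's claim.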
What it changes about audit economics
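If the safety properties from alignment training concentrate in 1-3% of decisions, then audits can concentrate proportionally. In principle, pre-deployment safety evaluations no longer need to probe the entire decision surface with equal depth; they can target the modified decision boundary. If probing cost scales with the number of positions examined, covering 1-3% of positions instead of all of them cuts the deep-probing workload by roughly 33x to 100x, although the modified positions first have to be located by comparing the tuned model against its base.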
Combined with circuit-level identification of test-awareness features, the audit toolkit gets sharper. Identify the 1-3% of modified decisions, identify test-awareness features that activate during evaluation, and the safety attestation surface becomes much more legible.
The audit surface just shrank from 'the whole policy' to '1-3% of the policy.' That's good news for evaluators.
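A back-of-the-envelope version of that economics argument is sketched below. It assumes audit cost is linear in the number of positions probed, and that finding the modified positions takes one cheaper comparison pass over every position. Both are simplifications, and the function names and cost figures are hypothetical.

```python
def full_audit_cost(total_positions: int, probe_cost: float) -> float:
    """Deep-probe every position."""
    return total_positions * probe_cost


def targeted_audit_cost(
    total_positions: int,
    modified_fraction: float,
    probe_cost: float,
    locate_cost: float,
) -> float:
    """One cheap pass to locate modified positions, then deep probes on those only."""
    if not 0.0 <= modified_fraction <= 1.0:
        raise ValueError("modified_fraction must be between 0 and 1")
    locate = total_positions * locate_cost
    probe = total_positions * modified_fraction * probe_cost
    return locate + probe


def savings_ratio(
    total_positions: int,
    modified_fraction: float,
    probe_cost: float,
    locate_cost: float,
) -> float:
    """How many times cheaper the targeted audit is than the full one."""
    targeted = targeted_audit_cost(
        total_positions, modified_fraction, probe_cost, locate_cost
    )
    return full_audit_cost(total_positions, probe_cost) / targeted


# Hypothetical figures: 1M positions, deep probe = 1.0 unit, locate pass = 0.01 unit.
for fraction in (0.01, 0.03):
    free = savings_ratio(1_000_000, fraction, probe_cost=1.0, locate_cost=0.0)
    paid = savings_ratio(1_000_000, fraction, probe_cost=1.0, locate_cost=0.01)
    print(f"f={fraction:.2f}  ideal x{free:.1f}  with locate pass x{paid:.1f}")
```

The locate pass caps the gain: with these made-up costs, the ideal 100x saving at 1% drops to about 50x once the comparison pass is counted. The improvement is real, but it depends on how cheaply modified positions can be found.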
The adversary problem
Smaller audit surfaces are also smaller adversary targets. If safety guarantees concentrate in 1-3% of decisions and those decisions are within the base model's top-5 alternatives, then adversarial prompts that nudge the model's selection back toward the base-model neighborhood effectively defeat the alignment-stage safety guarantee at low cost.
This is a relative of the alignment-tax problem. RL training adds safety by overriding the base model's preferred token at a small number of positions, and that override is a gap adversaries can try to close. The sparse-policy framing makes the exploit pathway more concrete: target the 1-3% of modified positions, push the policy back toward the base model's natural alternative, and watch the safety guarantee dissolve.
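For defenders, the same framing suggests a robustness measurement: at each modified position, how far ahead is the RL-promoted token of the base model's preferred token under the tuned model's own distribution? A small gap means a small perturbation could revert the choice. The sketch below computes that log-odds margin on toy probabilities. The function names, the 0.5-nat "fragile" threshold, and the numbers are assumptions for illustration, not something the paper specifies.

```python
import math
from typing import List, Sequence


def reversion_margin(rl_probs: Sequence[float], base_top_token: int) -> float:
    """Log-odds gap (in nats) between the tuned model's top token and the
    base model's top token, measured under the tuned model's distribution.

    Returns 0.0 when both models agree, and infinity when the tuned model
    assigns the base token zero probability.
    """
    rl_top = max(range(len(rl_probs)), key=lambda i: rl_probs[i])
    if rl_top == base_top_token:
        return 0.0
    if rl_probs[base_top_token] <= 0.0:
        return math.inf
    return math.log(rl_probs[rl_top]) - math.log(rl_probs[base_top_token])


def fragile_positions(
    rl_probs_by_pos: Sequence[Sequence[float]],
    base_top_by_pos: Sequence[int],
    threshold: float = 0.5,
) -> List[int]:
    """Modified positions whose reversion margin falls below the threshold."""
    fragile: List[int] = []
    for pos, (probs, base_top) in enumerate(zip(rl_probs_by_pos, base_top_by_pos)):
        margin = reversion_margin(probs, base_top)
        # margin == 0.0 means the position was not modified, so skip it.
        if 0.0 < margin < threshold:
            fragile.append(pos)
    return fragile


# Toy tuned-model distributions over a 3-token vocabulary (made-up numbers).
toy_rl_probs = [
    [0.45, 0.40, 0.15],  # modified, narrow lead over base token 1
    [0.90, 0.05, 0.05],  # modified, wide lead over base token 1
    [0.20, 0.70, 0.10],  # unmodified: tuned and base models agree on token 1
]
toy_base_top = [1, 1, 1]
print(fragile_positions(toy_rl_probs, toy_base_top))
```

A benchmark built on this idea would report the distribution of margins across modified positions, since a policy whose safety-relevant choices all sit on narrow margins is easier to push back toward base behavior than one with wide margins.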
What the alignment community should do with this
- Publish methodology updates explicitly tagged to sparse-policy findings. The existing interpretability and red-teaming pipelines assumed dense modification; they need recalibration.
- Develop adversarial benchmarks targeting modified positions. The 1-3% finding gives the red-team community a clear attack surface to test against; benchmarks should reflect that.
- Re-frame 'reasoning model' marketing. If RL is selecting from the base model, then the capability story is mostly about the base model. Reasoning-model branding should reflect the underlying source of the capability.
For the broader debate about DPO and the methodology gap, the sparse-policy finding reinforces the message: production alignment is concentrated in a small decision surface, and the methodology research has to catch up to that reality.
arXiv 2605.06241: sparse policy selection · arXiv 2605.02073: search-driven RL · Zylos: AI safety 2026