// blog · analysis · alignment · 2026-05-21 · 5 min read

Sparse policy and the audit surface — what the 1-3% finding does to alignment economics

If RL training of reasoning models affects only 1-3% of token positions, then the safety properties that come from alignment training also concentrate in 1-3% of decisions. That makes audits more tractable — and more legible to adversaries.

The finding

A new arXiv paper (arXiv 2605.06241) argues that RL fine-tuning of frontier reasoning models is sparse policy selection, not capability learning. Specifically: RL training affects only 1-3% of token positions, and the tokens it promotes are nearly always within the base model's top-5 alternatives. The implication: reasoning models are base models with sparsely modified token-selection policies, not models with new capabilities.
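To make the two numbers concrete, here is a minimal sketch of how one might measure them on a toy example. It assumes that "affected" means the tuned model's top token differs from the base model's top token at that position; the paper may define it differently. The function names and the toy logits are hypothetical, not taken from the paper.

```python
from typing import List, Sequence, Tuple


def top_k_indices(logits: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k largest logits, highest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def sparse_policy_stats(
    base: Sequence[Sequence[float]],
    tuned: Sequence[Sequence[float]],
    k: int = 5,
) -> Tuple[float, float]:
    """Compare per-position logits of a base model and an RL-tuned model.

    Returns (changed_fraction, within_top_k_fraction):
      changed_fraction      -- share of positions where the top token differs
      within_top_k_fraction -- among changed positions, share where the tuned
                               top token is inside the base model's top-k
    Assumption: "affected by RL" is operationalized as a changed argmax.
    """
    if len(base) != len(tuned):
        raise ValueError("base and tuned must cover the same positions")
    changed = 0
    within = 0
    for base_logits, tuned_logits in zip(base, tuned):
        base_top = top_k_indices(base_logits, k)
        tuned_choice = top_k_indices(tuned_logits, 1)[0]
        if tuned_choice != base_top[0]:
            changed += 1
            if tuned_choice in base_top:
                within += 1
    changed_fraction = changed / len(base) if len(base) else 0.0
    within_fraction = within / changed if changed else 0.0
    return changed_fraction, within_fraction


# Toy data: 4 positions, vocabulary of 6 tokens (illustrative values only).
toy_base = [[5, 4, 3, 2, 1, 0], [5, 4, 3, 2, 1, 0], [0, 1, 2, 3, 4, 5], [5, 4, 3, 2, 1, 0]]
toy_tuned = [[5, 4, 3, 2, 1, 0], [4, 5, 3, 2, 1, 0], [0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 0, 9]]
changed_fraction, within_fraction = sparse_policy_stats(toy_base, toy_tuned, k=5)
```

On real models the inputs would be logits from both checkpoints over the same token sequences. The sparse-policy claim predicts a changed fraction of about 0.01 to 0.03 and a within-top-5 fraction close to 1.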

What it changes about audit economics

If the safety properties from alignment training concentrate in 1-3% of decisions, then audits can concentrate proportionally. Pre-deployment safety evaluations no longer need to probe the entire decision surface; they can target the modified decision boundary. If the claim holds, an audit restricted to modified positions inspects roughly 33 to 100 times fewer token positions while still covering every RL-induced change.

Combined with circuit-level identification of test-awareness features, the audit toolkit gets sharper. Identify the 1-3% of modified decisions, identify test-awareness features that activate during evaluation, and the safety attestation surface becomes much more legible.
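One way to picture "identify the modified decisions" is to rank positions by how far the tuned distribution has moved from the base distribution, then spend a fixed audit budget on the top of that ranking. The sketch below uses KL divergence as the ranking score. That choice, the function names, and the toy logits are all assumptions made for illustration, not the paper's method.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_tuned_vs_base(base_logits: Sequence[float], tuned_logits: Sequence[float]) -> float:
    """KL(tuned || base) at a single token position."""
    log_p_base = log_softmax(base_logits)
    log_p_tuned = log_softmax(tuned_logits)
    return sum(math.exp(t) * (t - b) for t, b in zip(log_p_tuned, log_p_base))


def audit_targets(
    base: Sequence[Sequence[float]],
    tuned: Sequence[Sequence[float]],
    budget: int,
) -> List[int]:
    """Return up to `budget` position indices, most-shifted first."""
    scores = [kl_tuned_vs_base(b, t) for b, t in zip(base, tuned)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


# Toy data: position 0 unchanged, position 1 reversed, position 2 nudged.
toy_base = [[2.0, 1.0, 0.0], [2.0, 1.0, 0.0], [2.0, 1.0, 0.0]]
toy_tuned = [[2.0, 1.0, 0.0], [0.0, 1.0, 2.0], [2.0, 1.5, 0.0]]
targets = audit_targets(toy_base, toy_tuned, budget=2)
```

A divergence score also catches positions where the distribution shifted without flipping the top token, which a pure argmax comparison would miss. Whether those positions matter for safety is an open question that this sketch does not settle.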

The audit surface just shrank from 'the whole policy' to '1-3% of the policy.' That's good news for evaluators.

The adversary problem

Smaller audit surfaces are also smaller adversary targets. If safety guarantees concentrate in 1-3% of decisions and those decisions are within the base model's top-5 alternatives, then adversarial prompts that nudge the model's selection back toward the base-model neighborhood effectively defeat the alignment-stage safety guarantee at low cost.

This resembles the classic alignment-tax problem in a new form. RL training adds safety by steering the model away from the token the base model would otherwise pick, and that gap is something adversaries can exploit. The sparse-policy framing makes the exploit pathway more concrete: target the 1-3% of modified positions, push the policy back toward the base model's natural alternative, and watch the safety guarantee dissolve.
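From the defender's side, the same idea can be turned into a robustness metric: at each modified position, how large is the logit gap between the token the tuned model now picks and the token the base model would have picked? A small gap suggests the position could revert under a small perturbation. The sketch below is a hypothetical measurement, with illustrative names and toy values; it is not a method from the paper.

```python
from typing import List, Optional, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def reversion_margin(
    base_logits: Sequence[float], tuned_logits: Sequence[float]
) -> Optional[float]:
    """Logit gap, under the tuned model, between its own top token and the
    base model's top token. Returns None if RL did not change the top token."""
    base_choice = argmax(base_logits)
    tuned_choice = argmax(tuned_logits)
    if base_choice == tuned_choice:
        return None
    return tuned_logits[tuned_choice] - tuned_logits[base_choice]


def fragile_positions(
    base: Sequence[Sequence[float]],
    tuned: Sequence[Sequence[float]],
    threshold: float,
) -> List[int]:
    """Modified positions whose reversion margin is below `threshold`."""
    fragile = []
    for i, (b, t) in enumerate(zip(base, tuned)):
        margin = reversion_margin(b, t)
        if margin is not None and margin < threshold:
            fragile.append(i)
    return fragile


# Toy data: position 0 unchanged, position 1 barely flipped, position 2 firmly flipped.
toy_base = [[3, 2, 1], [3, 2, 1], [3, 2, 1]]
toy_tuned = [[3, 2, 1], [3, 4, 1], [0, 5, 1]]
weak_spots = fragile_positions(toy_base, toy_tuned, threshold=2)
```

Positions flagged this way are natural candidates for the adversarial benchmarks discussed in the next section, since they are where the smallest push would restore the base model's choice.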

What the alignment community should do with this

Publish methodology updates explicitly tagged to sparse-policy findings. The existing interpretability and red-teaming pipelines assumed dense modification; they need recalibration.
Develop adversarial benchmarks targeting modified positions. The 1-3% finding gives the red-team community a clear attack surface to test against; benchmarks should reflect that.
Re-frame 'reasoning model' marketing. If RL is selecting from the base model, then the capability story is mostly about the base model. Reasoning-model branding should reflect the underlying source of the capability.

For the broader debate about DPO and the methodology gap, the sparse-policy finding reinforces the message: production alignment is concentrated in a small decision surface, and the methodology research has to catch up to that reality.

arXiv 2605.06241: sparse policy selection · arXiv 2605.02073: search-driven RL · Zylos: AI safety 2026