RLHF 2.0 methodology cuts alignment-tax performance penalty by 60% vs first-generation RLHF
Recent results show RLHF 2.0 — the iteration that combines preference modeling with constitutional self-play and process supervision — reduces the alignment-tax penalty by approximately 60% compared to first-generation methods. The structural implication: safety training no longer requires substantial capability concessions.
The alignment-tax framing has been load-bearing for the frontier-vs-safety tradeoff narrative since 2023. If the tax shrinks by 60%, the false-choice framing of "more aligned or more capable" collapses for most evaluations. Labs can ship both at once without the trade-off that justified deferring safety work to post-deployment.
For policy debate, the result undercuts one of the major counter-arguments to mandatory red-team and capability-hold requirements. "Safety training degrades capability" has been the load-bearing claim. If RLHF 2.0 holds at scale, that claim has to be retired. See our analysis → on the constitutional self-play substrate.