// news · research · alignment · theory · 2026-05-18 · source: arxiv

'Alignment Waste' paper formalizes why safety doesn't transfer between architectures

A new arXiv preprint formalizes a phenomenon researchers had observed informally: alignment artifacts (RLHF policies, constitutional rules, refusal heuristics) do not transfer to new model architectures and cannot be corrected without expensive retraining.

The paper proves that under reasonable assumptions on feedback-based alignment methods, the artifacts produced are tightly coupled to the substrate model — meaning a constitution that works for Claude 4.5 cannot be directly ported to a different base architecture without losing most of its safety properties.

The implications are practical: every new foundation model release effectively forces alignment work to restart. This is a non-trivial cost, and one that industry roadmaps assuming linear safety progress have underweighted.

arXiv preprint server → · Zylos Research — alignment survey 2026 →