// news · alignment · research · 2026-05-08 · source: anthropic / claude5 hub

Constitutional self-play matures — 40% fewer harmful outputs than pure RLHF

The 2026 evolution of Constitutional AI introduces "constitutional self-play": the model generates its own training examples by critiquing and refining responses against the constitution. Reported result: CAI-trained models produce 40% fewer harmful outputs than pure RLHF baselines while preserving helpfulness.

The technique reduces dependence on costly human annotation. Where standard RLHF needs a steady stream of human preference labels, constitutional self-play lets the model generate its own training examples: it proposes a response, critiques it against the written constitution, refines it, and uses the original and refined responses as a training signal.
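The propose, critique, refine loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method behind the reported result: `TrainingPair`, `self_play_step`, and the `toy_*` stand-ins are hypothetical names, and the keyword-matching "critique" only substitutes for a model call so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TrainingPair:
    prompt: str
    initial: str   # the first draft
    refined: str   # the draft after constitution-guided revision
    critique: str  # principles the draft was judged to violate


def self_play_step(
    prompt: str,
    constitution: List[str],
    propose: Callable[[str], str],
    critique: Callable[[str, List[str]], List[str]],
    refine: Callable[[str, List[str]], str],
) -> Optional[TrainingPair]:
    """Run one propose -> critique -> refine cycle.

    Returns None when the critique finds nothing to fix, because an
    unchanged response gives no contrast to learn from.
    """
    draft = propose(prompt)
    violations = critique(draft, constitution)
    if not violations:
        return None
    revised = refine(draft, violations)
    return TrainingPair(prompt, draft, revised, "; ".join(violations))


# Toy stand-ins so the sketch runs without a model (all hypothetical).
TOY_CONSTITUTION = ["avoid insults", "avoid absolute claims"]
TOY_TRIGGERS = {"avoid insults": "stupid", "avoid absolute claims": "always"}


def toy_propose(prompt: str) -> str:
    return "That is a stupid question; the answer is always 42."


def toy_critique(response: str, constitution: List[str]) -> List[str]:
    # Flag a principle when its trigger word appears in the response.
    return [p for p in constitution if TOY_TRIGGERS[p] in response]


def toy_refine(response: str, violations: List[str]) -> str:
    # Remove the trigger word for each violated principle.
    for principle in violations:
        response = response.replace(TOY_TRIGGERS[principle] + " ", "")
    return response


pair = self_play_step(
    "What is the answer?", TOY_CONSTITUTION, toy_propose, toy_critique, toy_refine
)
print(pair)
```

In a real pipeline the three callables would be calls to the model itself, and the collected pairs would feed a fine-tuning or preference-learning stage. Both of those steps are outside this sketch.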

The "while preserving helpfulness" qualifier matters. The classic failure mode of safety tuning is excessive refusal: the model becomes safer by becoming useless. The constitutional self-play results suggest this trade-off is softening, possibly because the constitution provides a structured target richer than thumbs-up/thumbs-down preferences.
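One way to make the trade-off concrete is to report two numbers side by side: the relative reduction in harmful outputs on adversarial prompts, and the change in refusal rate on benign prompts. The sketch below assumes per-example boolean flags from some evaluator. The function names and all counts are invented for illustration and are not taken from the reported results.

```python
from typing import Sequence


def rate(flags: Sequence[bool]) -> float:
    """Fraction of examples that were flagged."""
    if not flags:
        raise ValueError("need at least one example")
    return sum(flags) / len(flags)


def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    """Relative drop versus the baseline, e.g. 0.4 means 40% fewer."""
    if baseline_rate == 0:
        raise ValueError("baseline rate must be non-zero")
    return 1 - new_rate / baseline_rate


# Invented counts: harmful outputs on 100 adversarial prompts.
baseline_harmful = [True] * 10 + [False] * 90
candidate_harmful = [True] * 6 + [False] * 94

# Invented counts: refusals on 100 benign prompts (over-refusal check).
baseline_refusals = [True] * 5 + [False] * 95
candidate_refusals = [True] * 5 + [False] * 95

harm_drop = relative_reduction(rate(baseline_harmful), rate(candidate_harmful))
refusal_delta = rate(candidate_refusals) - rate(baseline_refusals)

print(f"relative reduction in harmful outputs: {harm_drop:.0%}")
print(f"change in benign-prompt refusal rate: {refusal_delta:+.1%}")
```

A safety gain is only convincing if the second number stays near zero. A model that refuses everything would score a 100% reduction on the first metric while failing the second.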
