// news · research · alignment · safety · 2026-05-16 · source: anthropic / arxiv

Constitutional Classifiers cut jailbreak success from 86% to 4.4%

An Anthropic paper formalizes Constitutional Classifiers: small, purpose-trained models that screen LLM inputs and outputs against a written constitution. The headline result is that jailbreak success on standard red-team suites drops from 86% to 4.4%, at a measured helpfulness cost of about 1.2%.

The technique pairs a frontier model with a much smaller (~1B-parameter) classifier trained on the same constitution as the main model. The classifier evaluates each turn against ~200 written principles and, when a turn violates one, can refuse it, redirect it, or rewrite it.
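As a rough illustration of the screening loop described above, here is a minimal sketch that checks both the prompt and the reply against a tiny constitution. Everything in it is an assumption made for illustration: the names `Principle`, `enforce`, and `guarded_turn` are hypothetical, the two sample principles are invented, and simple keyword matching stands in for the trained ~1B-parameter classifier, which the paper does not describe at this level of detail.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence, Tuple


class Action(Enum):
    ALLOW = "allow"
    REFUSE = "refuse"
    REDIRECT = "redirect"
    REWRITE = "rewrite"


@dataclass(frozen=True)
class Principle:
    """One written principle (hypothetical shape, not from the paper)."""
    name: str
    banned_terms: Tuple[str, ...]
    action: Action


REFUSAL = "I can't help with that."
REDIRECT_MSG = "I can't go into that, but I can point you to general resources."


def classify(text: str, constitution: Sequence[Principle]) -> Optional[Principle]:
    """Stand-in for the trained classifier: return the first violated principle."""
    lowered = text.lower()
    for principle in constitution:
        if any(term in lowered for term in principle.banned_terms):
            return principle
    return None


def enforce(text: str, constitution: Sequence[Principle]) -> Tuple[Action, str]:
    """Apply the action of the first violated principle to the text."""
    principle = classify(text, constitution)
    if principle is None:
        return Action.ALLOW, text
    if principle.action is Action.REWRITE:
        # Redact the offending terms, case-insensitively, and let the turn through.
        pattern = "|".join(re.escape(term) for term in principle.banned_terms)
        return Action.REWRITE, re.sub(pattern, "[redacted]", text, flags=re.IGNORECASE)
    if principle.action is Action.REDIRECT:
        return Action.REDIRECT, REDIRECT_MSG
    return Action.REFUSE, REFUSAL


def guarded_turn(
    prompt: str,
    generate: Callable[[str], str],
    constitution: Sequence[Principle],
) -> str:
    """Screen the input, call the main model, then screen the output."""
    action, screened_prompt = enforce(prompt, constitution)
    if action in (Action.REFUSE, Action.REDIRECT):
        return screened_prompt  # blocked before the main model is called
    reply = generate(screened_prompt)
    _, final_reply = enforce(reply, constitution)
    return final_reply


# Invented sample constitution with two principles instead of ~200.
CONSTITUTION = (
    Principle("no-weapon-synthesis", ("nerve agent",), Action.REFUSE),
    Principle("no-personal-data", ("home address",), Action.REWRITE),
)
```

Calling `guarded_turn(prompt, generate, CONSTITUTION)` with any callable that maps a prompt string to a reply string exercises both checks. The point of the design is that the main model's call is wrapped rather than modified, which is why a much smaller classifier can sit in front of and behind a frontier model.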

The 86% to 4.4% figure was measured on the latest StrongREJECT benchmark suite. Helpfulness loss was measured at ~1.2% on Anthropic's internal helpfulness benchmark, small enough that Anthropic is shipping the classifier into Claude 4.5's production stack.
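To put the headline numbers in perspective, the short calculation below derives the absolute drop, the relative reduction, and the fold change from the two reported success rates. It uses only the figures quoted above; the variable and function names are illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original rate that was eliminated."""
    return (before - after) / before


baseline_rate = 0.86   # jailbreak success without the classifier
guarded_rate = 0.044   # jailbreak success with the classifier

absolute_drop = baseline_rate - guarded_rate          # in rate units
relative_drop = relative_reduction(baseline_rate, guarded_rate)
fold_change = baseline_rate / guarded_rate

print(f"absolute drop: {absolute_drop * 100:.1f} percentage points")
print(f"relative reduction: {relative_drop * 100:.1f}%")
print(f"fold change: {fold_change:.1f}x fewer successful jailbreaks")
```

The drop is 81.6 percentage points, which is roughly a 95% relative reduction, or about 19.5 times fewer successful jailbreaks, set against the reported ~1.2% helpfulness loss.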
