// news · alignment · interpretability · 2026-05-26 · source: anthropic / alignment.anthropic.com / claude5

Anthropic Fellows publishes detailed track descriptions for AI control and model welfare — scope statements signal where lab safety research moves next

Anthropic's alignment site this week published detailed scope statements for the six Fellows tracks: scalable oversight, adversarial robustness, AI control, model organisms, mechanistic interpretability, and model welfare. The track descriptions are the most precise public signal of what the lab considers the open problems in safety — and which methodological bets it's making for the next two model generations.

The AI control track is the most strategically revealing. Control research starts from the assumption that you may not be able to verify a system's values, and asks how to constrain a potentially misaligned system so that it cannot cause catastrophic harm even if it tries. The detailed scope includes containment mechanisms, capability elicitation under adversarial pressure, and monitoring that remains trustworthy when the monitored system behaves deceptively. That Anthropic now invests in a structured external-researcher track on control, rather than treating control as a backup plan to alignment, signals the lab's read on how the next 18 months will unfold.

The model welfare track is the most institutionally novel. The scope statement treats the potential moral status of frontier models as a research question that warrants methodological investment, not as a philosophical curiosity. Specific problem areas include measuring proxies for welfare-relevant states, developing principled frameworks for training and deployment decisions that account for those states, and operational interventions (model preferences, retirement protocols, training-condition safeguards). Whether the field accepts the underlying premise is contested, but the institutional investment is now real, and it will produce outputs that the broader research community will have to engage with on methodology even if it rejects the premise.


Anthropic Alignment — Fellows Program 2026 → · Anthropic Alignment — Automated Weak-to-Strong Researcher → · Claude 5 Hub — AI Safety 2026 Alignment Breakthroughs →