// news · interpretability · alignment · 2026-05-24 · source: openai / alignment.anthropic / arxiv

Anthropic discloses 10-million-feature evaluation pipeline — dictionary learning at scale becomes the production safety surface

The joint Anthropic-OpenAI alignment-evaluation findings, published last summer and now showing up in 2026 production processes, disclose that Anthropic monitors approximately 10 million neural features during stress-test evaluations using dictionary learning. Features map to human-interpretable concepts: deception, sycophancy, bias, power-seeking, concealment. The dictionary-learning-at-scale approach is now the production safety surface, not a research demo.
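As a rough illustration of what feature-level monitoring means, here is a minimal sketch of a dictionary (sparse autoencoder) readout on a single activation vector. Everything in it is assumed for illustration: the weights, the three feature labels, the 4-dimensional activation, and the threshold are made up, and it is not Anthropic's pipeline. Training the dictionary (reconstruction loss plus a sparsity penalty) is omitted; only the encode-and-flag step is shown.

```python
# Toy sketch: read sparse-autoencoder (dictionary) features off one activation
# vector and flag the ones that fire strongly. All numbers are hypothetical.

def encode(activation, w_enc, b_enc):
    """Feature activations f = ReLU(W_enc @ x + b_enc)."""
    pre = [
        sum(w * x for w, x in zip(row, activation)) + b
        for row, b in zip(w_enc, b_enc)
    ]
    return [max(0.0, p) for p in pre]

def flag_features(features, labels, threshold):
    """Return {label: activation} for every feature above the threshold."""
    return {label: f for label, f in zip(labels, features) if f > threshold}

# Hypothetical 3-feature dictionary over a 4-dimensional activation.
LABELS = ["deception", "sycophancy", "power-seeking"]
W_ENC = [
    [1.0, 0.0, 0.0, 0.0],  # feature 0 reads dimension 0
    [0.0, 1.0, 0.0, 0.0],  # feature 1 reads dimension 1
    [0.0, 0.0, 1.0, 1.0],  # feature 2 reads dimensions 2 and 3
]
B_ENC = [-0.5, -0.5, -0.5]  # negative bias keeps weak inputs at zero (sparsity)

activation = [2.0, 0.25, 0.0, 0.0]  # stand-in for one residual-stream vector
features = encode(activation, W_ENC, B_ENC)
flags = flag_features(features, LABELS, threshold=1.0)

print(features)  # [1.5, 0.0, 0.0]
print(flags)     # {'deception': 1.5}
```

In a production setting the same readout would run over millions of features and many token positions; the toy keeps only the shape of the computation.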

The contrast with OpenAI's chain-of-thought monitoring is the architectural news. OpenAI's deception classifier flags 0.17% of o1-preview responses with 92% accuracy on internal validation. It operates at the trajectory level, watching the model's reasoning trace. Anthropic's 10M-feature dictionary works at the activation level, looking inside the model rather than at what it produces. The Hinton/Sutskever joint paper covered in this morning's AM cycle argues that CoT monitoring is fragile under capability scaling; Anthropic's feature-level approach is the alternative that does not depend on a faithful reasoning trace.

The scaling question is whether dictionary learning still finds clean features at model sizes beyond Claude 3 Sonnet. Anthropic's March 2025 circuit-tracing work introduced cross-layer transcoders that read from one layer's residual stream but can output to all subsequent MLP layers — the architectural piece that lets SAE-derived features remain stable as model depth increases. If that scales through the Claude 5 / Opus successor generation, feature-level interpretability becomes a real production capability rather than a research curiosity.
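The read-one-layer, write-to-all-later-layers structure described above can be sketched in a few lines. This is a toy under stated assumptions, not Anthropic's implementation: the function name `clt_forward`, the dictionary keyed by `(source_layer, target_layer)`, the plain ReLU nonlinearity, and all weights are hypothetical choices made for illustration.

```python
# Toy sketch of a cross-layer transcoder forward pass (hypothetical names/weights).
# Features at layer `src` read that layer's residual stream, then contribute to
# the reconstructed MLP output of every layer `tgt >= src`.

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def clt_forward(residuals, encoders, decoders):
    """
    residuals[l]      : residual-stream vector at layer l (length d_model)
    encoders[l]       : n_features x d_model matrix reading layer l
    decoders[(l, t)]  : d_model x n_features matrix writing layer-l features
                        into layer t's MLP output, defined for t >= l
    Returns (features per layer, reconstructed MLP output per layer).
    """
    n_layers = len(residuals)
    d_model = len(residuals[0])
    feats = [
        [max(0.0, z) for z in matvec(encoders[l], residuals[l])]
        for l in range(n_layers)
    ]
    mlp_hat = [[0.0] * d_model for _ in range(n_layers)]
    for src in range(n_layers):
        for tgt in range(src, n_layers):  # only the same or later layers
            contrib = matvec(decoders[(src, tgt)], feats[src])
            mlp_hat[tgt] = [a + b for a, b in zip(mlp_hat[tgt], contrib)]
    return feats, mlp_hat

# Two layers, d_model = 2, one feature per layer.
residuals = [[1.0, 0.0], [0.0, 2.0]]
encoders = [[[1.0, 0.0]], [[0.0, 1.0]]]
decoders = {
    (0, 0): [[1.0], [0.0]],  # layer-0 feature -> layer-0 MLP output
    (0, 1): [[0.5], [0.0]],  # layer-0 feature -> layer-1 MLP output
    (1, 1): [[0.0], [1.0]],  # layer-1 feature -> layer-1 MLP output
}

feats, mlp_hat = clt_forward(residuals, encoders, decoders)
print(feats)    # [[1.0], [2.0]]
print(mlp_hat)  # [[1.0, 0.0], [0.5, 2.0]]
```

Layer 1's reconstruction mixes contributions from both layers' features, while layer 0's sees only its own. That asymmetry is the cross-layer part of the design.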

OpenAI — Findings from pilot Anthropic-OpenAI alignment evaluation → · Alignment Anthropic — Anthropic-OpenAI alignment evaluation findings → · Medium / IntuitionLabs — Understanding Mechanistic Interpretability in AI Models →