// blog · analysis · interpretability · 2026-05-24 · 5 min read

Ten million features, and the safety surface that still has no conclusions

Anthropic monitors approximately 10 million neural features during stress-test evaluations. OpenAI runs chain-of-thought monitoring at the trajectory level. An ICLR 2026 survey sums up mechanistic interpretability as "credible wins but still early, fragile, and incomplete." The methodology is real; the production-grade conclusions are not there yet.

Mechanistic interpretability had a narrative problem through 2024-2025: the same circuits-and-features story was recycled at three successive top-tier conferences without a discernible step change in real-world utility. In 2026 that story has shifted. Anthropic's deployment of 10-million-feature monitoring in production safety evals is the methodology running at scale, and the ICLR 2026 survey is the field's first honest accounting of how far that methodology actually goes.

Anthropic's 10M-feature evaluation pipeline works at the activation level: dictionary learning extracts human-interpretable features (deception, sycophancy, bias, power-seeking, concealment) from model activations and monitors their firing rates under stress-test conditions. OpenAI's chain-of-thought monitoring works at the trajectory level: a classifier reads the model's reasoning trace and flags potentially deceptive patterns. Both methods produce signal. Neither produces conclusions.
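To make the activation-level idea concrete, here is a minimal sketch of firing-rate monitoring over a toy sparse-autoencoder encoder. It is not Anthropic's pipeline: the two-feature dictionary, the weights, the `min_ratio` threshold, and every function name are hypothetical, and the feature labels are just human-assigned tags borrowed from the list above.

```python
# Toy sketch of activation-level monitoring. All names, weights and
# thresholds are hypothetical; a real dictionary has millions of features.

FEATURE_LABELS = ["deception", "sycophancy"]  # human-assigned labels (assumed)

def sae_encode(activation, w_enc, b_enc):
    """Sparse feature vector: ReLU(w_enc @ activation + b_enc)."""
    return [
        max(0.0, sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(w_enc, b_enc)
    ]

def firing_rates(activations, w_enc, b_enc, threshold=0.0):
    """Fraction of activations on which each feature exceeds `threshold`."""
    counts = [0] * len(w_enc)
    for act in activations:
        for i, value in enumerate(sae_encode(act, w_enc, b_enc)):
            if value > threshold:
                counts[i] += 1
    return [c / len(activations) for c in counts]

def flag_elevated(baseline, stress, min_ratio=2.0, eps=1e-6):
    """Indices of features firing at least `min_ratio` times more than baseline."""
    return [
        i for i, (b, s) in enumerate(zip(baseline, stress))
        if s / (b + eps) >= min_ratio
    ]

# Two features over 2-dimensional "activations" (toy data).
W_ENC = [[1.0, 0.0], [0.0, 1.0]]
B_ENC = [-0.5, -0.5]
baseline_acts = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
stress_acts = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]

baseline_rates = firing_rates(baseline_acts, W_ENC, B_ENC)
stress_rates = firing_rates(stress_acts, W_ENC, B_ENC)
elevated = [FEATURE_LABELS[i] for i in flag_elevated(baseline_rates, stress_rates)]
print(elevated)  # ['deception']
```

The output is a list of labels whose firing rate rose under stress. That is signal in exactly the sense used here: it says the feature fired more, not why.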

What "signal but no conclusions" means in practice

You can detect that a feature for "deception" fires more than baseline on a class of prompts. You cannot yet say with certainty that the model is being deceptive, as opposed to modeling a deceptive character or activating on lexically related material that isn't deception. You can flag a CoT trace as "potentially deceptive" at a 0.17% rate with 92% accuracy. You cannot yet tell whether that 92% is high enough for the operational decision the safety case requires.
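Part of the difficulty is that the two headline numbers admit more than one reading. The sketch below works through two of them. Both readings are assumptions made for illustration; the figures quoted above do not say which metric "accuracy" refers to, and the function names are hypothetical.

```python
def flag_breakdown(n_traces, flag_rate, precision):
    """Split flagged traces into correct flags and false alarms."""
    flagged = n_traces * flag_rate
    true_flags = flagged * precision
    return flagged, true_flags, flagged - true_flags

def precision_from_rates(prevalence, sensitivity, specificity):
    """Bayes' rule: P(actually deceptive | flagged)."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)

# Reading A (assumption): 0.17% is the flag rate and 92% is precision.
flagged, true_flags, false_alarms = flag_breakdown(1_000_000, 0.0017, 0.92)
print(round(flagged), round(true_flags), round(false_alarms))  # 1700 1564 136

# Reading B (assumption): 0.17% is how often traces really are deceptive,
# and 92% is both the sensitivity and the specificity of the classifier.
ppv = precision_from_rates(0.0017, 0.92, 0.92)
print(f"{ppv:.3f}")  # 0.019
```

Under reading A, a million traces yield about 1,700 flags with roughly 136 false alarms. Under reading B, fewer than one flag in fifty is correct. The same "92%" supports very different operational decisions depending on which quantity it measures.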

The ICLR 2026 survey's framing of the field as "credible wins but still early, fragile, and incomplete" is the honest version of where the methodology actually is. The induction heads, IOI circuits, greater-than circuits, and SAE-based feature discoveries are real and reproducible. The production-scale promise — "we can audit any model behavior to its causal circuit" — has not been delivered.

Why this matters for regulation

The EU AI Act's gating of high-risk systems, expected to follow the AI Office consultation that closes this summer, will require evidence-based risk management for designated high-risk AI systems. "Evidence" in the regulatory framework includes interpretability documentation. If the field is still "credible wins but fragile," regulators have to choose: gate approval on incomplete methodology (and either approve too freely or block too aggressively), or wait for the methodology to mature (and have nothing to enforce against in the interim). Neither option is comfortable.

Anthropic's bet is to deploy 10M-feature monitoring in production now, on the assumption that the methodology will mature into a real audit surface by the time enforcement starts. It is a wager that the gap between "signal" and "conclusions" closes before regulators set the threshold. If the bet pays out, mechanistic interpretability becomes the production safety surface. If it doesn't, the regulatory regime gets built on the methodology that exists, with all its known fragility.

What to watch in the next twelve months

Two things. First: whether cross-layer transcoders and the Anthropic March 2025 circuit-tracing methodology scale through the Claude 5 / Opus successor generation. The current SAE-based feature stability is good through Claude 3 Sonnet scale; it has not been demonstrated to hold through 10×-scale model successors. Second: whether the "feature for X fires" signal can be combined with the "CoT trace looks like Y" signal in a way that produces actionable safety calls rather than just two independent monitoring layers. The integration question is where the field needs to go; it's not where it is yet.
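As a rough illustration of what integration could look like at its simplest, here is a sketch of a triage rule that reads both monitors together. Everything in it is hypothetical: the `MonitorReading` fields, the `ratio_threshold` default, and the three outcome labels are invented for the example and do not describe any lab's system.

```python
from dataclasses import dataclass

@dataclass
class MonitorReading:
    feature_ratio: float  # stress-test firing rate / baseline for a labeled feature
    cot_flagged: bool     # trajectory-level classifier flagged the reasoning trace

def triage(reading, ratio_threshold=2.0):
    """Combine both monitors into one call (hypothetical policy)."""
    feature_hit = reading.feature_ratio >= ratio_threshold
    if feature_hit and reading.cot_flagged:
        return "escalate"  # both layers agree
    if feature_hit or reading.cot_flagged:
        return "review"    # layers disagree; needs a human look
    return "pass"

print(triage(MonitorReading(feature_ratio=3.1, cot_flagged=True)))   # escalate
print(triage(MonitorReading(feature_ratio=3.1, cot_flagged=False)))  # review
print(triage(MonitorReading(feature_ratio=1.0, cot_flagged=False)))  # pass
```

A rule like this only stacks the two layers. Agreement between them raises priority, but it still does not establish that the model was being deceptive, which is the part of the integration question that remains open.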

OpenAI: Anthropic-OpenAI alignment evaluation findings · arXiv: ICLR 2026 Mechanistic Interpretability · Medium / Adnan Masood: Mechanistic Interpretability Explained