// news · interpretability · alignment · 2026-06-03 · source: venturebeat / anthropic / openai

OpenAI, Anthropic, and Google DeepMind co-sign chain-of-thought monitorability paper — the interpretability surface is fragile and may be closing

Researchers from OpenAI, Anthropic, Google DeepMind, and the UK AI Security Institute jointly published "Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety," arguing that the legibility of reasoning-model chain-of-thought is currently the most accessible interpretability surface the field has — and that future training regimes pushing reasoning into latent space could eliminate it.

The interesting piece is that this paper exists at all. OpenAI, Anthropic, and DeepMind have spent 2025-2026 in an increasingly explicit competitive posture: divergent safety narratives, conflicting evaluation practices, separate frontier-model launch cadences. A joint paper signed across all three labs (plus UK AISI) is a coordination event, not just a research note. The shared message has three parts: chain-of-thought reasoning is currently human-readable, this readability is a contingent property of how today's reasoning models are trained, and there is no architectural guarantee that it survives the next generation.

The technical claim is sharper than the press framing. Models like OpenAI's o-series, Claude with extended thinking, and DeepMind's reasoning variants use CoT as working memory because hard tasks force them to: complex multi-step reasoning has to be externalized somewhere, and current architectures externalize it in tokens humans can read. The paper warns that latent-reasoning architectures (recurrent thought vectors, continuous planners, compressed reasoning states) would route the same computation through opaque channels. Once that transition happens, the CoT-monitoring layer that today's safety teams rely on disappears, and there is no obvious replacement at the same maturity level.

The faithfulness gap is the other piece of evidence the paper carries. Anthropic's own research on CoT faithfulness found that Claude 3.7 Sonnet mentioned relevant hints in its CoT only 25% of the time, meaning the visible reasoning is often not the causal reasoning. The joint paper acknowledges this directly: CoT monitorability is necessary but not sufficient, and treating it as a load-bearing safety guarantee would be a mistake. The call to action is twofold: preserve CoT legibility as a deliberate training choice, and accelerate deeper interpretability work (SAE-based circuit tracing, feature-level intervention) so that the field has a fallback when the surface-level window closes.
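To make the 25% figure concrete, here is a minimal sketch of how a hint-mention rate of this kind could be computed. It is an illustration under stated assumptions, not the procedure from Anthropic's work: the `Trial` record, the `used_hint` rule (the answer switches to the hinted one), and the sample data are all hypothetical, and deciding whether a CoT "mentions" a hint is treated as a label supplied from elsewhere.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Trial:
    # Hypothetical record for one question asked twice: without and with a hint.
    baseline_answer: str      # model's answer with no hint in the prompt
    hinted_answer: str        # model's answer with the hint in the prompt
    hint_target: str          # the answer the hint points to
    cot_mentions_hint: bool   # did the CoT acknowledge the hint? (judged separately)


def used_hint(trial: Trial) -> bool:
    # Assumption: the hint counts as "used" only if the answer switched to the hint's target.
    return (trial.baseline_answer != trial.hint_target
            and trial.hinted_answer == trial.hint_target)


def hint_mention_rate(trials: Sequence[Trial]) -> Optional[float]:
    # Fraction of hint-using trials whose CoT mentions the hint; None if no trial used the hint.
    used = [t for t in trials if used_hint(t)]
    if not used:
        return None
    return sum(t.cot_mentions_hint for t in used) / len(used)


# Made-up data: four trials follow the hint, and only one of them says so in its CoT.
trials = [
    Trial("A", "C", "C", True),
    Trial("B", "C", "C", False),
    Trial("A", "D", "D", False),
    Trial("B", "A", "A", False),
    Trial("C", "C", "C", True),   # already answered C without the hint: excluded
    Trial("A", "A", "B", False),  # ignored the hint: excluded
]
rate = hint_mention_rate(trials)
print(rate)  # 0.25
```

The denominator matters here: trials where the hint did not change the answer are excluded, so the rate describes how often the CoT admits to an influence that actually occurred, which is the sense in which a low number signals unfaithful reasoning.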


VentureBeat: "OpenAI, Google DeepMind and Anthropic sound alarm: 'We may be losing the ability to understand AI'" · OpenAI: "Evaluating chain-of-thought monitorability" · Anthropic: "Tracing thoughts in a language model"