// news · interpretability · alignment · 2026-05-10 · source: anthropic research

Anthropic uses mechanistic interpretability in Claude Sonnet 4.5 pre-deployment safety review

Anthropic's interpretability team is now part of the pre-deployment review pipeline. For Claude Sonnet 4.5, researchers used the open-source circuit tracer and feature-level inspection to look for dangerous capabilities, deceptive tendencies, and undesired goals before model release.

The shift is meaningful: previously, interpretability work fed back into research timelines but not into release gates. With Sonnet 4.5, it sits on the actual release checklist, as a feature-level audit alongside the standard red-team evals. The microscope tool reportedly "reveals whole sequences of features," moving past layer-by-layer sparse autoencoder (SAE) decomposition and toward causal path analysis.

DeepMind's Gemma Scope 2, also part of the broader 2026 interpretability landscape, follows a similar trajectory: automated mechanistic tooling that can keep up with the pace of model deployment. Both suggest the field is past its artisanal phase. If interpretability is going to function as a deployment gate, the tooling has to scale, and in 2026 it appears that it finally does.

Anthropic Interpretability →