// blog · analysis · interpretability · 2026-06-13 · source: analysis / ai-blogs.org

Developmental interpretability and the post-mechinterp era — the methodological pivot that follows DeepMind's SAE deprioritization

Developmental interpretability, which studies how circuits form during training rather than dissecting frozen models, is emerging as the methodological successor to the phase of mechanistic interpretability dominated by sparse autoencoders (SAEs). The pivot is structural, not contested.

The interpretability community is in transition. SAE-based mechanistic methods drove the field for two years, but the methodology has scaling limits that became visible at frontier scale. The review of developmental interpretability listed under the sources at the end of this post frames the answer.

What developmental interpretability is

Rather than extracting features from a frozen model's activations, developmental interpretability tracks how circuits and representations form during training. The unit of analysis is the training trajectory — how features emerge, when they consolidate, how they relate to data distribution shifts at each checkpoint. The hypothesis is that understanding formation reveals more about safety-relevant properties than dissecting the static endpoint.
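To make the "training trajectory as unit of analysis" idea concrete, here is a minimal sketch. It assumes you already have some per-checkpoint score for a feature (for example, a probe accuracy); the scores, the threshold, the tolerance, and the function names are all hypothetical illustrations, not taken from the review. It finds when a feature first emerges and from which checkpoint it stays close to its final value.

```python
from typing import List, Optional, Tuple

# One entry per saved checkpoint: (training step, score for one feature).
Trajectory = List[Tuple[int, float]]


def emergence_step(traj: Trajectory, threshold: float) -> Optional[int]:
    """Return the first checkpoint step whose score reaches the threshold."""
    for step, score in traj:
        if score >= threshold:
            return step
    return None


def consolidation_step(traj: Trajectory, tolerance: float) -> Optional[int]:
    """Return the earliest step from which every later score stays within
    `tolerance` of the final checkpoint's score."""
    if not traj:
        return None
    final = traj[-1][1]
    result = None
    # Walk backwards from the final checkpoint until the score drifts away.
    for step, score in reversed(traj):
        if abs(score - final) <= tolerance:
            result = step
        else:
            break
    return result


# Hypothetical scores for one feature across six checkpoints.
trajectory = [
    (0, 0.02), (1000, 0.05), (2000, 0.40),
    (3000, 0.85), (4000, 0.91), (5000, 0.92),
]

emerged = emergence_step(trajectory, threshold=0.5)
consolidated = consolidation_step(trajectory, tolerance=0.05)
print(emerged, consolidated)
```

A real analysis would replace the hand-written scores with measurements taken at each saved checkpoint, and would need to justify the threshold and tolerance rather than pick them by eye.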

Why this matters for safety

If you understand how a circuit forms, you can reason about whether the formation is stable, whether different data distributions would produce different circuits, and whether the circuit will persist under continued training. That's a different category of safety guarantee than "this feature was present at the end of training." The methodological gain matters for the test-environment distinction problem, where a model behaves differently once it detects that it is being evaluated: formation analysis doesn't depend on the model failing to detect the test context.

The Anthropic vs DeepMind divergence

Anthropic continues to invest in mechanistic interpretability (the microscope-as-procurement-deliverable bet); DeepMind deprioritized SAE work last month. Developmental interpretability is the third path — neither doubling down on SAE nor abandoning mechanistic framing. For the 2027 interpretability landscape, this is the methodological program that's likely to anchor the field.

The composability bet

Combine developmental interpretability with layer-importance ranking via token prediction refinement: identify essential layers, study how those layers' circuits form, validate against deployed model behavior. That's a complete methodological stack — and it's the stack the next 12-18 months of research will build out.
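The first stage of that stack can be sketched in a few lines. The post does not say how token prediction refinement is measured, so this sketch makes an assumption: each layer gets a next-token loss from some intermediate readout (a logit-lens-style probe is one option), and a layer's importance is how much it lowers that loss relative to the layer before it. The loss values and function names are hypothetical.

```python
from typing import Dict, List


def refinement_scores(per_layer_loss: List[float]) -> Dict[int, float]:
    """Loss reduction attributed to each layer.

    per_layer_loss[0] is the loss read out from the embeddings;
    per_layer_loss[i] is the loss read out after layer i (assumed setup).
    """
    return {
        layer: per_layer_loss[layer - 1] - per_layer_loss[layer]
        for layer in range(1, len(per_layer_loss))
    }


def essential_layers(per_layer_loss: List[float], top_k: int) -> List[int]:
    """Pick the top_k layers by loss reduction, returned in depth order."""
    scores = refinement_scores(per_layer_loss)
    ranked = sorted(scores, key=lambda layer: scores[layer], reverse=True)
    return sorted(ranked[:top_k])


# Hypothetical readout losses: embeddings first, then layers 1 through 6.
losses = [10.0, 9.5, 7.0, 6.8, 4.0, 3.9, 3.5]

selected = essential_layers(losses, top_k=2)
print(selected)
```

The selected layers would then be the ones whose circuits you track across checkpoints, with the final validation step run against the deployed model's behavior. Ranking by raw loss drop is only one possible importance measure; ablation-based scores are an alternative with different failure modes.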

arXiv: A Review of Developmental Interpretability in Large Language Models · Zylos Research: AI Safety, Alignment, and Interpretability in 2026