Developmental interpretability review crystallizes a post-mechinterp methodology: training-trajectory analysis is framed as the answer to SAE deprioritization
A review of developmental interpretability in large language models, published this cycle, frames the methodology as the answer to mechanistic interpretability's scaling limits. Rather than dissecting a frozen model's activations, developmental interpretability tracks how circuits and representations form during training. The framing positions developmental methods as the natural successor to the sparse autoencoder (SAE) based mechanistic work that DeepMind deprioritized last month.
The substantive piece is the methodological pivot. Mechanistic interpretability spent years using sparse autoencoders to extract features from the activations of frozen models; developmental interpretability instead studies the training trajectory: how features emerge, when they consolidate, and how they relate to shifts in the data distribution. The hypothesis is that understanding how a model forms reveals more about safety-relevant properties than dissecting the static endpoint.
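To make the contrast concrete, here is a minimal sketch of trajectory analysis on a toy problem. It is not taken from the review: the model (a two-weight logistic regression), the metric (mean probability assigned to the correct label), the 0.8 threshold, and every function name are assumptions chosen for illustration. The point is only the shape of the workflow: save checkpoints during training, score each one, and locate the step where a behavior consolidates, instead of inspecting the final weights alone.

```python
import math
import random


def make_data(n=200, seed=0):
    """Toy dataset (hypothetical): the label depends only on the sign of x[0]."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        data.append((x, 1 if x[0] > 0 else 0))
    return data


def predict(w, x):
    """Probability of label 1 under a two-weight logistic model."""
    return 1.0 / (1.0 + math.exp(-(w[0] * x[0] + w[1] * x[1])))


def train_with_checkpoints(data, steps=200, lr=0.5, every=10):
    """Full-batch gradient descent; returns [(step, weights), ...] snapshots."""
    w = [0.0, 0.0]
    checkpoints = [(0, list(w))]
    for step in range(1, steps + 1):
        grad = [0.0, 0.0]
        for x, y in data:
            err = predict(w, x) - y
            for i in range(2):
                grad[i] += err * x[i] / len(data)
        w = [w[i] - lr * grad[i] for i in range(2)]
        if step % every == 0:
            checkpoints.append((step, list(w)))
    return checkpoints


def mean_confidence(w, data):
    """Average probability the model assigns to the correct label."""
    total = 0.0
    for x, y in data:
        p = predict(w, x)
        total += p if y == 1 else 1.0 - p
    return total / len(data)


def emergence_step(trajectory, threshold=0.8):
    """First checkpoint step whose metric reaches the threshold, else None."""
    for step, value in trajectory:
        if value >= threshold:
            return step
    return None


data = make_data()
checkpoints = train_with_checkpoints(data)
# The trajectory, not the final checkpoint, is the object of study.
trajectory = [(step, mean_confidence(w, data)) for step, w in checkpoints]
print("metric first reaches threshold at step:", emergence_step(trajectory))
```

A frozen-model analysis would look only at the last entry of `checkpoints`; the developmental framing asks what the whole `trajectory` looks like and what changed in the data or optimization around the emergence step.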
The competitive frame is that Anthropic's microscope-as-procurement-deliverable bet and DeepMind's SAE deprioritization sit on opposite sides of the same question: does mechanistic interpretability scale to production safety value? Developmental interpretability is the methodological third path: it treats training-time analysis as primary instead of committing to a mechanistic or a behavioral approach at evaluation time.
ArXiv — A Review of Developmental Interpretability in Large Language Models → · Zylos Research — AI Safety, Alignment, and Interpretability in 2026 → · ArXiv — Mechanistic Interpretability for AI Safety — A Review →