Developmental interpretability arXiv review (2508.15841) formalizes the subfield's distinction from mechanistic interpretability; two-pronged methodology consensus emerges
The arXiv review 'A Review of Developmental Interpretability in Large Language Models' (2508.15841) formalizes developmental interpretability, the study of how internal representations form during training, as a subfield distinct from mechanistic interpretability. The resulting two-pronged methodology consensus (developmental plus mechanistic) gives the discipline a clear research-question taxonomy through 2027.
The substantive contribution is the formalized distinction between methodologies. Pre-2026, 'interpretability research' was a single category that mixed two structurally different approaches: (a) what is inside the model now (mechanistic: circuit tracing, attention-pattern analysis, feature visualization), and (b) how the model got that way (developmental: training-trajectory studies, capability-emergence analysis, scaling-law interpretability). The review separates the two, giving the field a cleaner research-question taxonomy that lets it pursue both questions in parallel rather than conflating them.
The broader implication concerns how disciplines formalize. Mature scientific fields develop sub-disciplinary distinctions as their research output grows, and the developmental-vs-mechanistic split is the kind of taxonomy an emerging field does not yet need. Its arrival therefore signals that the field is maturing, not fragmenting.
arXiv: A Review of Developmental Interpretability in Large Language Models · arXiv: Mechanistic Interpretability for AI Safety -- A Review · arXiv: An Approach to Technical AGI Safety and Security