// news · interpretability · 2026-06-23 · source: arxiv

'Residual Stream Analysis with Multi-Layer SAEs' arXiv 2409.04185 — methodology paper introduces multi-layer SAE pattern for cross-layer feature tracing

The arXiv paper Residual Stream Analysis with Multi-Layer SAEs (2409.04185) introduces a methodology for cross-layer feature tracing using multi-layer sparse autoencoders (SAEs). Earlier SAE work applied autoencoders to single layers in isolation; the multi-layer pattern makes it possible to trace how features propagate across transformer layers.

The core contribution is the cross-layer feature-tracing methodology. Single-layer SAE analysis identifies features at one transformer layer, but it does not show how those features propagate to later layers or where they originated in earlier ones. The multi-layer SAE pattern addresses both directions, letting researchers trace how a feature evolves across the full depth of the transformer.
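To make the tracing idea concrete, here is a minimal sketch, not the paper's code. It assumes the multi-layer pattern means one SAE encoder shared across the residual stream at every layer, with a plain ReLU encoder. The names `encode` and `layer_profile` are hypothetical, and the weights and activations are random stand-ins rather than a trained model. For each latent, it computes what share of that latent's total activation falls at each layer, which is one simple way to see where a feature appears and how far it persists.

```python
import numpy as np

def encode(x, W_enc, b_enc):
    """Assumed ReLU SAE encoder: (..., d_model) -> (..., n_latents)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def layer_profile(resid, W_enc, b_enc, eps=1e-12):
    """Share of each latent's activation mass per layer.

    resid: residual stream activations, shape (n_layers, n_tokens, d_model).
    Returns an array of shape (n_latents, n_layers); each row sums to 1,
    or is all zeros if the latent never fires.
    """
    latents = encode(resid, W_enc, b_enc)      # (n_layers, n_tokens, n_latents)
    mass = latents.sum(axis=1).T               # (n_latents, n_layers)
    total = mass.sum(axis=1, keepdims=True)    # (n_latents, 1)
    return np.where(total > eps, mass / np.maximum(total, eps), 0.0)

# Toy stand-ins: random activations and untrained encoder weights.
rng = np.random.default_rng(0)
n_layers, n_tokens, d_model, n_latents = 6, 32, 16, 64
resid = rng.normal(size=(n_layers, n_tokens, d_model))
W_enc = rng.normal(size=(d_model, n_latents)) / np.sqrt(d_model)
b_enc = np.zeros(n_latents)

profile = layer_profile(resid, W_enc, b_enc)
peak_layer = profile.argmax(axis=1)            # layer where each latent is strongest
```

With a trained SAE and real activations, a row of `profile` concentrated on one layer would suggest a layer-specific feature, while a flat row would suggest a feature that persists across depth. With the random inputs used here the rows carry no such meaning.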

Read against DeepMind's deprioritization of SAEs, methodology refinements like multi-layer SAEs may address some of the limitations that motivated that decision. Interpretability methodology in H2 2026 needs more such refinements: single-layer SAE baselines may underperform on safety-relevant tasks, but multi-layer or domain-specific SAE variants may close the gap. The split between deprioritization and continued investment will sharpen as these variants are evaluated systematically.

arXiv — Residual Stream Analysis with Multi-Layer SAEs (2409.04185) → · arXiv — A Survey on Sparse Autoencoders →