Zylos Research publishes 2026 mech interp landscape survey
Zylos Research released a comprehensive survey of mechanistic interpretability progress through Q2 2026. Headline finding: sparse autoencoders are now reliably extracting interpretable circuits at the scale of frontier models, but downstream uses in alignment remain mostly speculative.
The survey catalogs ~340 papers and tools from the year. SAE-based feature extraction has matured to the point where a frontier model can be decomposed into hundreds of thousands of named features within hours of compute, down from weeks in early 2025.
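For readers unfamiliar with the technique, the decomposition described above can be sketched as a sparse autoencoder forward pass: a model activation vector is encoded into a wider, mostly-zero feature vector, then decoded back. This is a minimal illustration only, not the survey's or any lab's implementation. The function names, the toy weights, and the 2-dimensional activation are all assumptions made for the example.

```python
# Toy sparse autoencoder (SAE) forward pass in pure Python.
# All weights and sizes are hypothetical; real SAEs are trained and far larger.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_encode(activation, w_enc, b_enc):
    """Map a model activation to non-negative feature activations (ReLU)."""
    pre = matvec(w_enc, activation)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]

def sae_decode(features, w_dec, b_dec):
    """Reconstruct the activation as a weighted sum of feature directions."""
    recon = list(b_dec)
    for f, direction in zip(features, w_dec):
        for j, d in enumerate(direction):
            recon[j] += f * d
    return recon

# Hypothetical 2-dimensional activation and a 3-feature dictionary.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]  # one direction per feature
b_dec = [0.0, 0.0]

activation = [2.0, 0.0]
features = sae_encode(activation, w_enc, b_enc)
reconstruction = sae_decode(features, w_dec, b_dec)

# Indices of features that fired; in practice each index would carry a label.
active = [i for i, f in enumerate(features) if f > 0.0]
```

Here only one of the three features fires for the given activation. That sparsity is what makes each feature a candidate for a human-readable name.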
The gap remains: extracting features is not the same as using them. Most production safety stacks still rely on output-level filtering, not circuit-level intervention. The Anthropic SAE-based pre-deployment gate (May 10) is one of the first counterexamples.
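The distinction between the two approaches can be made concrete with a small sketch. The first function inspects only the generated text. The other two operate on SAE feature activations inside the model. Everything here is hypothetical: the function names, the blocked-term list, the flagged feature index, and the threshold are assumptions for illustration, and nothing below describes how the Anthropic gate or any production stack actually works.

```python
# Contrast: output-level filtering vs. feature-level checks and intervention.
# All names, indices, and thresholds are hypothetical.

def output_level_filter(text, blocked_terms):
    """Return True if the generated text contains any blocked term."""
    lowered = text.lower()
    return any(term in lowered for term in blocked_terms)

def feature_level_gate(features, flagged_indices, threshold):
    """Return True if any flagged SAE feature activates above the threshold."""
    return any(features[i] > threshold for i in flagged_indices)

def ablate_feature(features, index):
    """Return a copy of the feature vector with one feature zeroed out."""
    patched = list(features)
    patched[index] = 0.0
    return patched

# Hypothetical feature activations for one forward pass.
features = [0.1, 3.2, 0.0, 0.4]

text_blocked = output_level_filter("A harmless reply.", ["forbidden"])
gate_tripped = feature_level_gate(features, flagged_indices=[1], threshold=1.0)
patched = ablate_feature(features, 1)
gate_after_patch = feature_level_gate(patched, flagged_indices=[1], threshold=1.0)
```

In this toy case the text filter sees nothing wrong while the feature gate trips, which is the kind of case where an internal check could add signal beyond output filtering.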