ICLR 2026's code-correctness SAE paper establishes the domain-specific interpretability template — where mech-interp goes after the general-purpose SAE deprioritization
DeepMind's general-purpose SAE deprioritization closed one research direction. The De La Salle University ICLR 2026 paper on code-correctness SAEs opens another: domain-specific interpretability with a concrete methodology and claims scoped to a single capability domain. The template looks portable to other domains, and the split between the two research directions is now visible.
The paper's value as a template lies in its methodology, which has four steps: t-statistics to select candidate directions, separation scores to validate them, steering and attention analysis to see what they do, and weight orthogonalization to verify that they are causally involved. Each step is reproducible, and none is specific to code, so the same pipeline should carry over to any capability domain where the goal is to identify and test a direction.
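The first and last of those steps are simple enough to sketch. The snippet below is a minimal illustration on synthetic data, not the paper's implementation: it ranks hypothetical SAE features by a Welch t-statistic between "correct" and "incorrect" activations, then orthogonalizes a weight matrix against the selected feature's decoder direction. The array shapes, the planted feature, the random `decoder` and `W`, and the function names `welch_t` and `orthogonalize` are all assumptions made for illustration. The separation-score and steering steps are omitted because this article does not define them.

```python
import numpy as np


def welch_t(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """Per-feature Welch t-statistic between two (samples x features) matrices."""
    mean_diff = pos.mean(axis=0) - neg.mean(axis=0)
    var_pos = pos.var(axis=0, ddof=1)
    var_neg = neg.var(axis=0, ddof=1)
    std_err = np.sqrt(var_pos / pos.shape[0] + var_neg / neg.shape[0])
    return mean_diff / np.maximum(std_err, 1e-12)


def orthogonalize(W: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component along `direction` from everything W writes.

    W has shape (d_model, d_in); `direction` has shape (d_model,).
    """
    d = direction / np.linalg.norm(direction)
    return W - np.outer(d, d @ W)


# Synthetic stand-ins for SAE feature activations (assumption, not real data).
rng = np.random.default_rng(0)
n_samples, n_features, d_model = 200, 16, 8
correct = rng.normal(size=(n_samples, n_features))
incorrect = rng.normal(size=(n_samples, n_features))
correct[:, 3] += 2.0  # plant one feature that separates the two classes

# Step 1: pick the feature with the largest absolute t-statistic.
t_stats = welch_t(correct, incorrect)
top_feature = int(np.argmax(np.abs(t_stats)))

# Step 4: ablate that feature's decoder direction from a weight matrix.
decoder = rng.normal(size=(n_features, d_model))  # hypothetical SAE decoder
direction = decoder[top_feature]
W = rng.normal(size=(d_model, d_model))           # hypothetical model weight
W_ablated = orthogonalize(W, direction)
```

After orthogonalization, `direction @ W_ablated` is zero up to floating-point error, which is the property that lets the ablation serve as a causal check: if behavior changes once the model can no longer write along that direction, the direction was doing work.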
The research-direction bifurcation
DeepMind's deprioritization rested on a finding that SAEs underperform simpler baselines on general safety-relevant tasks such as detecting harmful intent. The De La Salle code-correctness paper reports that SAEs work well for analysis within a narrow capability domain. The two results do not conflict, because they test different things: the split is between general-purpose and domain-specific applications of the same underlying method.
What this means for safety-engineering procurement
Safety teams deciding where to invest interpretability-tooling effort should match the method to the use case. For general-purpose harmful-intent detection, favor non-SAE methods (red-teaming, capability evals, formal methods). For domain-specific capability analysis (code correctness, mathematical reasoning, multilingual behavior), favor SAE-based interpretability. The narrow-versus-broad split is the practical lens for H2 2026 interpretability-tooling investment decisions.
The talent-market implication
The interpretability-engineer talent market, which grew rapidly through H1 2026, is likely to split the same way. Generalist mech-interp expertise becomes a less obvious hiring priority, while domain-specific interpretability skills (code, math, multilingual, safety-relevant) become more valuable. Safety-team hiring should match candidate skill profiles to the specific interpretability sub-domains the team actually needs.
ICLR 2026: Mechanistic Interpretability of Code Correctness via Sparse Autoencoders
arXiv: SAFER: Probing Safety in Reward Models with Sparse Autoencoder