// news · interpretability · 2026-06-22 · source: iclr / arxiv

ICLR 2026 publishes 'Mechanistic Interpretability of Code Correctness in LLMs via Sparse Autoencoders' from De La Salle University — domain-specific interpretability template

The De La Salle University paper on code-correctness SAEs, by Kriz Tahimic and Charibeth Cheng and published at ICLR 2026, establishes a template for domain-specific interpretability. Its methodology (t-statistics for direction selection, separation scores, steering analysis, attention analysis, and weight orthogonalization) generalizes to other capability domains.

What matters most is the template itself. The paper uses t-statistics to identify capability-relevant directions in LLM representations, validates those directions with steering and attention analysis, and applies weight orthogonalization for causal verification. None of these steps is specific to code correctness, so the same pattern gives interpretability researchers a clear starting point for other capability domains (mathematical reasoning, multilingual ability, safety-relevant behavior, and so on).
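As a rough illustration of two of these steps, the sketch below ranks features with a Welch t-statistic and then orthogonalizes a weight matrix against the selected direction. It is a minimal sketch on synthetic data, not the paper's implementation: the function names `welch_t` and `orthogonalize`, the `decoder` and `W_out` matrices, the shapes, and the planted signal in feature 3 are all assumptions made for the example.

```python
import numpy as np

def welch_t(pos, neg):
    """Per-feature Welch t-statistic between two activation matrices.

    Rows are samples, columns are features (e.g. SAE latents).
    """
    m_pos, m_neg = pos.mean(axis=0), neg.mean(axis=0)
    v_pos, v_neg = pos.var(axis=0, ddof=1), neg.var(axis=0, ddof=1)
    se = np.sqrt(v_pos / pos.shape[0] + v_neg / neg.shape[0])
    return (m_pos - m_neg) / np.maximum(se, 1e-12)  # guard against zero variance

def orthogonalize(W, direction):
    """Remove the component along `direction` from every row of W.

    Assumes W has shape (d_in, d_model), so each row is a vector the
    layer can write into the model's hidden space.
    """
    d = direction / np.linalg.norm(direction)
    return W - np.outer(W @ d, d)

rng = np.random.default_rng(0)
n_features, d_model = 16, 32

# Synthetic stand-ins for feature activations on correct vs. incorrect code.
correct = rng.normal(size=(200, n_features))
incorrect = rng.normal(size=(200, n_features))
correct[:, 3] += 2.0  # planted signal: feature 3 fires more on correct code

# Step 1: pick the feature with the largest absolute t-statistic.
t = welch_t(correct, incorrect)
top = int(np.argmax(np.abs(t)))

# Step 2: take that feature's (hypothetical) decoder row as the direction.
decoder = rng.normal(size=(n_features, d_model))
direction = decoder[top]

# Step 3: orthogonalize a (hypothetical) output weight matrix against it,
# so the layer can no longer write along that direction.
W_out = rng.normal(size=(8, d_model))
W_ablated = orthogonalize(W_out, direction)
```

In a real study the activations would come from a trained SAE on model hidden states, and the causal check would compare model behavior before and after the orthogonalization rather than inspecting the weights alone.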

Read against DeepMind's deprioritization of SAEs, the paper suggests that SAE methodology is splitting into two tracks: general-purpose interpretability (where DeepMind concluded SAEs underperform baselines) and domain-specific interpretability (where the De La Salle paper demonstrates clear value). Interpretability research in H2 2026 will likely sustain investment in domain-specific applications even as general-purpose SAE work declines.

See our analysis →

ICLR 2026 — Mechanistic Interpretability of Code Correctness via Sparse Autoencoders → · arXiv — Mechanistic Interpretability with Sparse Autoencoder Neural Operators →