// news · interpretability · 2026-06-23 · source: openreview / arxiv

ICLR 2026 code-correctness SAE paper, implementation details: pre-trained GemmaScope autoencoders decompose activations into interpretable latents, with general language patterns filtered out

The ICLR 2026 paper 'Mechanistic Interpretability of Code Correctness in LLMs via Sparse Autoencoders' uses pre-trained GemmaScope autoencoders to decompose the model's activations at each layer into interpretable latents, then filters out latents that track general language patterns. The work reveals interpretable causal mechanisms underlying natural language processing: entity recognition mechanisms, specialized extraction heads, and structured circuits for factual recall.
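To make the decomposition step concrete, here is a minimal sketch of how a JumpReLU-style sparse autoencoder (the architecture GemmaScope uses) maps a batch of activations to sparse latents and back. The weights below are random stand-ins at toy sizes, not real GemmaScope parameters, and the function names are hypothetical; the paper's own code may be organized differently.

```python
import numpy as np

def jumprelu_encode(x, W_enc, b_enc, threshold):
    """Map activations (n_tokens, d_model) to sparse latents (n_tokens, d_sae)."""
    pre = x @ W_enc + b_enc
    # JumpReLU: keep a pre-activation only where it exceeds its per-latent threshold.
    return pre * (pre > threshold)

def decode(latents, W_dec, b_dec):
    """Reconstruct activations from latents as a weighted sum of decoder directions."""
    return latents @ W_dec + b_dec

# Toy sizes and random weights (assumptions for illustration only).
rng = np.random.default_rng(0)
d_model, d_sae, n_tokens = 16, 64, 5
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
threshold = np.full(d_sae, 4.0)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=(n_tokens, d_model))        # stand-in for one layer's activations
latents = jumprelu_encode(x, W_enc, b_enc, threshold)
recon = decode(latents, W_dec, b_dec)
active_per_token = (latents > 0).sum(axis=1)    # how many latents fire on each token
```

With a real pre-trained autoencoder, each nonzero column of `latents` would correspond to one candidate interpretable feature at that layer.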

The substantive contribution is the GemmaScope-based implementation pattern, which works as a reusable methodology. The De La Salle University paper uses pre-trained GemmaScope autoencoders rather than training new ones from scratch, which substantially reduces the compute cost of applying SAE interpretability to a new domain. The pattern (pre-trained scope autoencoders + domain-specific filtering + causal-mechanism analysis) should generalize to any capability domain where GemmaScope autoencoders are available.
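The filtering step in that pattern can be sketched as follows. This is one plausible reading, not the paper's confirmed procedure: latents that fire often on a general-text baseline are dropped, and the survivors are ranked by how differently they activate on correct versus incorrect code. The function name, the `max_general_rate` cutoff, and the synthetic data are all assumptions.

```python
import numpy as np

def select_domain_latents(correct, incorrect, general,
                          max_general_rate=0.02, top_k=1):
    """Rank latents by correct-vs-incorrect separation, skipping general-language ones.

    Each input is an (n_samples, d_sae) array of nonnegative latent activations.
    """
    # Fraction of general-text samples on which each latent fires.
    general_rate = (general > 0).mean(axis=0)
    keep = general_rate <= max_general_rate
    # Absolute gap in mean activation between correct and incorrect code.
    separation = np.abs(correct.mean(axis=0) - incorrect.mean(axis=0))
    separation = np.where(keep, separation, -np.inf)
    ranked = np.argsort(-separation)
    return [int(i) for i in ranked[:top_k] if keep[i]]

# Synthetic activations (illustrative only).
n, d_sae = 200, 8
correct = np.zeros((n, d_sae))
incorrect = np.zeros((n, d_sae))
general = np.zeros((n, d_sae))
incorrect[:, 2] = 3.0   # latent 2 fires only on incorrect code: domain-specific
correct[:, 5] = 5.0     # latent 5 has a large gap between correct and incorrect...
incorrect[:, 5] = 1.0
general[:, 5] = 2.0     # ...but also fires on general text, so it is filtered out

selected = select_domain_latents(correct, incorrect, general)
```

Here latent 5 has the larger raw gap but is discarded because it is active on the general-text baseline, leaving latent 2 as the selected candidate for causal analysis.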

Read against Anthropic's emotion-vectors approach, the field now has two complementary families of causal-interpretability methodology: Anthropic's concept-vector identification with causal-steering validation, and the De La Salle pattern of pre-trained scope autoencoders plus filtering. Both produce causal interpretability results, but they require different infrastructure investments. Interpretability research from H2 2026 through 2027 will likely pursue both families in parallel.
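For readers unfamiliar with the concept-vector family, a generic difference-of-means sketch shows the basic idea: estimate a direction from activations with and without a concept, then add a scaled copy of it to test its causal effect. This is a common textbook formulation on synthetic data, not Anthropic's actual pipeline; all names and numbers are illustrative assumptions.

```python
import numpy as np

def concept_vector(acts_with, acts_without):
    """Unit-norm difference of mean activations (with concept minus without)."""
    v = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(acts, direction, alpha):
    """Shift every activation by alpha along the concept direction."""
    return acts + alpha * direction

# Synthetic activations: the "concept" adds 4.0 along coordinate 3 (an assumption).
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[3] = 1.0
acts_without = rng.normal(size=(500, d_model))
acts_with = rng.normal(size=(500, d_model)) + 4.0 * true_dir

v = concept_vector(acts_with, acts_without)
steered = steer(acts_without, v, alpha=4.0)
# Mean projection onto v moves by alpha, since v has unit norm.
shift = (steered @ v).mean() - (acts_without @ v).mean()
```

In a real validation run, the steered activations would be written back into the model's forward pass and the change in behavior measured, which is where the infrastructure cost of this family comes from.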


OpenReview — Mechanistic Interpretability of Code Correctness in LLMs via Sparse Autoencoders → · arXiv ICLR 2026 — ICLR 2026 Mechanistic Interpretability paper →