Sparse Crosscoders enable cross-layer and cross-model feature comparison: the same features tracked across Claude 3 Sonnet, Claude 3 Opus, and earlier checkpoints
Sparse Crosscoders extend the sparse-autoencoder (SAE) methodology so that activations from multiple layers, and from multiple models, map into a single shared feature space. Researchers can now ask whether a given feature exists in Claude 3 Sonnet, in Claude 3 Opus, and in an earlier checkpoint, and how it has changed across the model family. That makes interpretability longitudinal: features can be followed across training cycles and model versions.
The technical advance is training against a shared feature space. Standard SAEs are trained per layer and per model, so features in layer 12 of Claude 3 Sonnet are not directly comparable to features in layer 12 of Claude 3 Opus: the underlying activation spaces differ. A sparse crosscoder instead trains a single autoencoder with separate encoder and decoder weights for each layer or model, but one shared vector of latent feature activations. Every source's encoder contributes to the same latent vector, and every source's decoder reconstructs its own activations from that vector, which forces the feature dictionary to be common across all sources. Once the dictionary is shared, features can be cross-referenced. If, say, feature 42,318 fires on the same inputs in Sonnet layer 12 and Opus layer 14, that suggests it represents the same concept at different depths in the two models.
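A minimal sketch of the shared-latent idea may help make this concrete. It follows the published crosscoder formulation, in which each source (a layer or a model) has its own encoder and decoder weights and only the latent code is shared. The sizes, the random initialization, the `l1_coeff` value, and the assumption that every source has the same activation width are illustrative choices, not details from any production setup, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_sources = 2     # e.g. two layers, or one layer from each of two models
d_model = 16      # activation width per source (assumed equal for simplicity)
n_features = 64   # size of the shared feature dictionary

# One encoder and one decoder per source; only the latent code is shared.
W_enc = rng.normal(0.0, 0.1, size=(n_sources, d_model, n_features))
W_dec = rng.normal(0.0, 0.1, size=(n_sources, n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros((n_sources, d_model))


def encode(acts):
    """acts: (batch, n_sources, d_model) -> shared code: (batch, n_features)."""
    # Sum every source's encoder contribution into one pre-activation, then ReLU.
    pre = np.einsum("bsd,sdf->bf", acts, W_enc) + b_enc
    return np.maximum(pre, 0.0)


def decode(code):
    """code: (batch, n_features) -> reconstructions: (batch, n_sources, d_model)."""
    # Each source reconstructs its own activations from the same shared code.
    return np.einsum("bf,sfd->bsd", code, W_dec) + b_dec


def crosscoder_loss(acts, l1_coeff=1e-3):
    """Reconstruction error summed over sources plus a decoder-norm-weighted L1 term."""
    code = encode(acts)
    recon = decode(code)
    mse = np.sum((acts - recon) ** 2, axis=(1, 2))            # (batch,)
    dec_norms = np.linalg.norm(W_dec, axis=2).sum(axis=0)     # (n_features,)
    sparsity = code @ dec_norms                                # (batch,)
    return float(np.mean(mse + l1_coeff * sparsity))


# Stand-in activations; a real run would collect these from the models.
acts = rng.normal(size=(8, n_sources, d_model))
code = encode(acts)
recon = decode(code)
loss = crosscoder_loss(acts)
```

Because the sparsity penalty weights each feature by the sum of its per-source decoder norms, a feature that only one source needs can drive its other decoder vectors toward zero, which is what makes per-source comparison of a single feature index possible.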
The application to alignment research is what makes this matter beyond methodology. Anthropic's circuit-tracing work has shown that behaviors like deception, sycophancy, and concealment have identifiable feature signatures in production models. Crosscoders make those signatures versionable: when Claude 4.5 ships, the alignment team can check whether the deception feature found in Claude 3 Sonnet still exists in 4.5, whether it has been suppressed, whether it has been replaced by an analogous feature, or whether it now has different dependencies in the circuit graph. That longitudinal view is necessary infrastructure for any safety claim that compares model generations, and it has just become cheap enough to run continuously rather than as a one-off study.
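One way such a version check could look in practice is a comparison of per-model decoder norms for each shared feature, a model-diffing heuristic described in the crosscoder literature. The sketch below assumes a crosscoder already trained on an older and a newer model; the function names, the toy decoder matrices, and the 0.1 / 0.9 thresholds are hypothetical and would need tuning on real data.

```python
import numpy as np


def relative_decoder_norm(W_dec_old, W_dec_new, eps=1e-8):
    """Per-feature share of decoder norm that belongs to the newer model.

    W_dec_old, W_dec_new: (n_features, d_model) decoder matrices of one crosscoder.
    Returns values in [0, 1]: near 0 means the feature is written only into the
    older model, near 1 only into the newer one, near 0.5 into both.
    """
    norm_old = np.linalg.norm(W_dec_old, axis=1)
    norm_new = np.linalg.norm(W_dec_new, axis=1)
    return norm_new / (norm_old + norm_new + eps)


def classify_features(W_dec_old, W_dec_new, low=0.1, high=0.9):
    """Label each feature as 'old_only', 'new_only', or 'shared' (thresholds assumed)."""
    rel = relative_decoder_norm(W_dec_old, W_dec_new)
    labels = np.where(rel < low, "old_only",
                      np.where(rel > high, "new_only", "shared"))
    return rel, labels


# Toy decoders for three features in a 2-dimensional activation space.
W_old = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
W_new = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]])
rel, labels = classify_features(W_old, W_new)
```

In this toy case the first feature is shared, the second exists only in the older model, and the third only in the newer one. A label of `old_only` is only a starting point: it cannot by itself distinguish a feature that was suppressed from one that was replaced by an analogue, which would take follow-up analysis of activations and circuits.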
Medium / Adnan Masood: Mechanistic Interpretability Explained · arXiv: ICLR 2026 Mechanistic Interpretability · IntuitionLabs: Understanding Mechanistic Interpretability in AI Models