// news · interpretability · research · 2026-05-20 · source: interpretability research

Complete Replacement Models combine transcoders + Lorsas to fully sparsify language models

A new class of interpretability methods, Complete Replacement Models (CRMs), combines transcoders, which replace MLP layers, with Lorsas (Low-Rank Sparse Attention modules, which replace attention layers) to sparsify a transformer's entire computation. Where SAEs alone left residual dense pathways, CRMs aim to decompose the full forward pass into named, sparse circuits.

The methodological claim is significant. SAE-based interpretability has produced credible wins on induction heads, IOI circuits, and feature discovery, but the residual dense pathways meant any complete causal story for a transformer's behavior still had unexplained pieces. CRMs are an attempt to close that gap by replacing the dense components with sparse-by-construction substitutes.
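To make "sparse-by-construction substitute" concrete, here is a minimal sketch of a transcoder-style forward pass: it reads the MLP's input, routes it through a wide latent layer in which only a few features may be active, and emits an approximation of the MLP's output. Everything in it is an illustrative assumption rather than the CRM authors' implementation: the top-k activation is one common way to enforce sparsity, the names (`transcoder_forward`, `top_k_relu`) and the toy sizes are hypothetical, and the weights are random rather than trained. It covers only the transcoder (MLP) side; Lorsas are not sketched.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def top_k_relu(z, k):
    """Keep the k largest pre-activations (clamped at zero); zero out the rest."""
    keep = sorted(range(len(z)), key=lambda i: z[i], reverse=True)[:k]
    out = [0.0] * len(z)
    for i in keep:
        out[i] = max(z[i], 0.0)
    return out

def transcoder_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Map an MLP input x to an approximation of the MLP output,
    passing through a latent layer with at most k active features."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    latents = top_k_relu(pre, k)
    y_hat = [o + b for o, b in zip(matvec(W_dec, latents), b_dec)]
    return y_hat, latents

# Hypothetical toy sizes; real models are far larger.
d_model, n_features, k = 8, 64, 4
rng = random.Random(0)

# Untrained random weights, for shape illustration only (assumption).
W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(n_features)]
W_dec = [[rng.gauss(0, 0.1) for _ in range(n_features)] for _ in range(d_model)]
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model

x = [rng.gauss(0, 1) for _ in range(d_model)]
y_hat, latents = transcoder_forward(x, W_enc, b_enc, W_dec, b_dec, k)

# Indices of the features that fired; these are the units one would try to name.
active = [i for i, a in enumerate(latents) if a != 0.0]
print(len(y_hat), len(active))
```

In a trained transcoder, each active latent index would ideally correspond to an interpretable feature, and the per-feature decoder columns would show how that feature writes to the residual stream. That is the property that lets a circuit be described in terms of named components instead of dense activations.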

If CRMs hold up at scale, the practical implication is that mechanistic interpretability moves from a research-stage technique to a production safety tool. Companies could enumerate the circuits a model uses for a given task class, attest which circuits handle sensitive inputs, and audit the named components against safety policies. That is a different regulatory surface from today's black-box evaluation regime.

Sources: Medium (mechanistic interpretability 2026) · OpenMOSS (Language Model SAEs, GitHub)