When the Coffee Feature Activates on Coffins: new arXiv paper on feature extraction and steering exposes brittleness in residual-stream features
A new arXiv paper, "When the Coffee Feature Activates on Coffins," examines feature extraction and steering with sparse autoencoders (SAEs) trained on residual-stream activations. It exposes brittleness in feature identification: a feature labeled "coffee" can fire on conceptually distant inputs, which complicates intervention design. The paper matters methodologically because it characterizes failure modes that production-grade mech-interp deployment has to handle.
The technical finding is the substantive piece. Using concrete examples, the paper demonstrates that sparse-autoencoder (SAE) interpretability features, the workhorse of current mech-interp deployment, can fire on inputs that sit close to the labeled concept in latent space without being instances of that concept. A "coffee" feature firing on "coffins" is the catchy example. The broader pattern is that features encoded in the residual stream share structure across concepts that look distant on the surface, and this complicates the human-interpretation step of the pipeline.
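To make the failure mode concrete, here is a minimal toy sketch of how one SAE feature's activation is commonly computed, as a ReLU over an encoder direction dotted with the residual-stream vector plus a bias. Everything in it is an assumption for illustration: the three-dimensional vectors, the encoder row `COFFEE_ENC`, the bias `COFFEE_BIAS`, and the token list are made up and do not come from the paper.

```python
# Toy illustration only: all vectors and weights below are invented,
# not taken from the paper or from any real model.

def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def feature_activation(resid, enc_direction, enc_bias):
    """Activation of a single SAE feature: ReLU(w_enc . x + b_enc)."""
    pre = sum(w * x for w, x in zip(enc_direction, resid)) + enc_bias
    return relu(pre)


# Hypothetical 3-d residual-stream vectors for three tokens.
RESID = {
    "coffee": [0.9, 0.1, 0.0],
    "coffin": [0.7, 0.0, 0.6],
    "bicycle": [0.0, 0.9, 0.1],
}

# Hypothetical encoder row and bias for the feature a human labeled "coffee".
COFFEE_ENC = [1.0, -0.5, 0.0]
COFFEE_BIAS = -0.3

activations = {
    tok: feature_activation(vec, COFFEE_ENC, COFFEE_BIAS)
    for tok, vec in RESID.items()
}

# Tokens on which the "coffee" feature fires at all.
fired = sorted(tok for tok, act in activations.items() if act > 0.0)
print(fired)
```

In this toy setup the "coffin" vector overlaps with the encoder direction along its first coordinate, so the feature fires on it even though the label says "coffee". Real SAEs have thousands of dimensions, but the mechanism is the same: the feature responds to a direction in activation space, and the human-assigned label does not constrain it.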
The methodological consequence is what carries the paper beyond a curiosity finding. Production-grade mech-interp deployment, such as Anthropic's use of feature steering in the Sonnet 4.5 pre-deployment review, depends on feature labels accurately describing what each feature responds to. The paper finds that the label-to-activation correspondence is less robust than prior methodology assumed. That matters for intervention design: steering on a feature that is labeled "coffee" but actually fires on death-and-grief-adjacent contexts produces unintended intervention effects. The paper's contribution is a characterization of this failure mode plus a proposed methodology for tighter label-to-activation validation that targets the brittleness directly.
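One generic way to check a label against activations is to score it like a classifier on held-out text: of the examples where the feature fires, how many match the label (precision), and of the examples that match the label, how many fire (recall). The sketch below is not the paper's proposed methodology, whose details are not given here. The `Example` type, the `label_precision_recall` helper, the firing threshold, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Example:
    text: str
    on_concept: bool   # human judgment: does the text match the feature label?
    activation: float  # feature activation measured on this text


def label_precision_recall(examples, threshold=0.0):
    """Score a feature label against activations on held-out examples."""
    fired = [e for e in examples if e.activation > threshold]
    on_concept = [e for e in examples if e.on_concept]
    true_pos = [e for e in fired if e.on_concept]
    precision = len(true_pos) / len(fired) if fired else 0.0
    recall = len(true_pos) / len(on_concept) if on_concept else 0.0
    return precision, recall


# Made-up held-out examples for a feature labeled "coffee".
EXAMPLES = [
    Example("espresso machine", True, 0.8),
    Example("a cup of coffee", True, 0.6),
    Example("coffin at the funeral", False, 0.4),
    Example("decaf latte", True, 0.0),
    Example("bicycle repair", False, 0.0),
]

precision, recall = label_precision_recall(EXAMPLES)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A low precision flags a label that is too narrow for what the feature responds to, which is the case that makes steering risky. A low recall flags a label that promises more coverage than the feature delivers. A team could gate steering interventions on both scores clearing a chosen bar, with the bar itself being a judgment call.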
arXiv: "When the Coffee Feature Activates on Coffins" · arXiv: "Mechanistic Interpretability for AI Safety: A Review" · MDPI: "Survey on Mechanistic Interpretability in Generative AI"