Sparse feature circuit-finding scales to 30× larger models — in-context learning circuits now traceable
Recent work scales the sparse feature circuit-finding methodology to models with 30 times more parameters than prior demonstrations. At that scale, the method identifies the circuits that drive in-context learning, one of the previously opaque emergent behaviors of large transformers.
In-context learning (ICL) has been a black-box capability since GPT-3: the model adapts to new tasks from examples in the prompt without weight updates, and the mechanism was understood only at the phenomenological level. Identifying the specific circuit responsible converts ICL from a behavioral observation into a structural property that can be characterized, tested, and potentially adjusted.
For safety research, the 30× scaling result is the load-bearing part. Interpretability methods that work on 1B-parameter toy models but fail on 70B-parameter production models have limited deployment value. Scaling to production-relevant sizes is the prerequisite for interpretability becoming a real safety tool.
arXiv: scaling sparse feature circuit finding · IntuitionLabs: mechanistic interpretability