Unified Theory of Sparse Dictionary Learning paper formalizes spurious minima in mech interp
An arXiv preprint (2512.05534, last updated May 2) proposes a unified theoretical framework for sparse dictionary learning in mechanistic interpretability. It characterizes the optimization landscape as piecewise biconvex, proves that spurious local minima exist, and characterizes them.
The contribution is theoretical scaffolding for a field that has been moving fast on empirical results without solid foundations. The paper shows that the optimization landscape traversed by sparse autoencoders (SAEs) has structural properties that explain why some training runs converge to interpretable features while others get trapped in spurious minima.
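To make "biconvex" concrete, here is a minimal sketch using the classical sparse dictionary learning objective, 0.5 * ||X - Z D||^2 + lam * ||Z||_1. This is an assumption chosen for illustration, not necessarily the formulation the paper analyzes, and the function names, toy sizes, and penalty weight are hypothetical. The objective is convex in the dictionary D with the codes Z fixed, and convex in Z with D fixed, but not jointly convex: flipping the sign of both Z and D leaves the loss unchanged, so distinct equivalent solutions sit on either side of a worse midpoint.

```python
import numpy as np

def dictionary_loss(X, Z, D, lam=0.1):
    """Classical objective: 0.5 * ||X - Z @ D||_F^2 + lam * ||Z||_1."""
    residual = X - Z @ D
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(np.abs(Z))

def midpoint_gap(f, a, b):
    """f((a + b) / 2) - (f(a) + f(b)) / 2; never positive for a convex f."""
    return f(0.5 * (a + b)) - 0.5 * (f(a) + f(b))

rng = np.random.default_rng(0)
n, k, d = 32, 8, 16  # samples, dictionary atoms, input dimension (toy sizes)
X = rng.normal(size=(n, d))
Z1, Z2 = rng.normal(size=(n, k)), rng.normal(size=(n, k))
D1, D2 = rng.normal(size=(k, d)), rng.normal(size=(k, d))

# Convex in D when the codes Z are held fixed.
gap_D = midpoint_gap(lambda D: dictionary_loss(X, Z1, D), D1, D2)

# Convex in Z when the dictionary D is held fixed.
gap_Z = midpoint_gap(lambda Z: dictionary_loss(X, Z, D1), Z1, Z2)

# Not jointly convex: (Z1, D1) and (-Z1, -D1) reconstruct X_exact equally
# well, but their midpoint (0, 0) reconstructs nothing.
X_exact = Z1 @ D1
loss_plus = dictionary_loss(X_exact, Z1, D1)
loss_minus = dictionary_loss(X_exact, -Z1, -D1)
loss_mid = dictionary_loss(X_exact, np.zeros_like(Z1), np.zeros_like(D1))
gap_joint = loss_mid - 0.5 * (loss_plus + loss_minus)

print(f"gap in D (Z fixed): {gap_D:.3f}")
print(f"gap in Z (D fixed): {gap_Z:.3f}")
print(f"joint gap:          {gap_joint:.3f}")
```

The first two gaps come out non-positive and the joint gap comes out positive. A midpoint check on random points only illustrates convexity and does not prove it; the joint counterexample, by contrast, is exact.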
Practical implication: the paper provides an initialization criterion that empirically reduces the spurious-minima problem by roughly 40% on standard SAE training runs. Frontier labs have already begun adopting variants of the criterion in their next-generation SAE training pipelines.