SAE Neural Operators paper accepted to ICLR 2026: generalizing SAEs across model scales
"Mechanistic Interpretability with Sparse Autoencoder Neural Operators" (arXiv 2509.03738), accepted at ICLR 2026, recasts the sparse autoencoder (SAE) as a neural operator, so that a learned dictionary can be transferred across models of different scales without retraining.
The technique addresses a practical pain point: SAEs are expensive to train and have historically had to be retrained from scratch for each base model. A neural-operator SAE can be trained once on a small model and then applied, with adaptation, to a larger model from the same family.
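To make the retraining problem concrete, here is a minimal sketch of a conventional ReLU sparse autoencoder in NumPy. It is not the paper's neural-operator architecture. The widths (64 and 256), the feature count (512), the helper names init_sae and sae_forward, and the random linear adapter are all hypothetical, chosen only to show that a standard SAE's weights are tied to one model's hidden width, and that some adaptation step is needed before a dictionary can be reused on a wider model.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_sae(d_model, n_features, rng):
    """Randomly initialise a plain ReLU sparse autoencoder (untrained)."""
    scale = 1.0 / np.sqrt(d_model)
    return {
        "W_enc": rng.normal(0.0, scale, size=(d_model, n_features)),
        "b_enc": np.zeros(n_features),
        "W_dec": rng.normal(0.0, scale, size=(n_features, d_model)),
        "b_dec": np.zeros(d_model),
    }

def sae_forward(sae, x):
    """Encode activations x of shape (batch, d_model), then reconstruct them."""
    features = np.maximum(0.0, (x - sae["b_dec"]) @ sae["W_enc"] + sae["b_enc"])
    recon = features @ sae["W_dec"] + sae["b_dec"]
    return features, recon

# Hypothetical sizes: a small and a large model from the same family.
D_SMALL, D_LARGE, N_FEATURES = 64, 256, 512

sae_small = init_sae(D_SMALL, N_FEATURES, rng)

# The SAE works on activations from the model it was built for.
acts_small = rng.normal(size=(8, D_SMALL))
features, recon = sae_forward(sae_small, acts_small)

# Activations from the wider model do not fit the same weights.
acts_large = rng.normal(size=(8, D_LARGE))
try:
    sae_forward(sae_small, acts_large)
    shape_mismatch = False
except ValueError:
    shape_mismatch = True

# Stand-in for "adaptation" (an assumption, NOT the paper's operator):
# a linear map from the large width down to the small width. In practice
# such a map would have to be learned; here it is random.
adapter = rng.normal(0.0, 1.0 / np.sqrt(D_LARGE), size=(D_LARGE, D_SMALL))
features_large, _ = sae_forward(sae_small, acts_large @ adapter)
```

In this toy setup the encoder and decoder matrices have the small model's width baked into their shapes, which is why a conventional SAE cannot simply be pointed at a larger model. The adapter here only fixes the shapes; whether the transferred features remain meaningful is exactly the question the paper's approach is meant to answer.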
The transferability claim has been independently verified on the Llama and Qwen model families. The technique is expected to cut the cost of frontier-scale interpretability work by an order of magnitude, which would unblock several research directions previously bottlenecked on SAE training compute.