// news · interpretability · 2026-06-27 · source: arxiv

'Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts' (arXiv 2506.23845): position paper argues that SAEs should be repositioned toward discovery rather than steering

The arXiv paper 2506.23845 argues for repositioning how sparse autoencoders (SAEs) are used: apply them to discover unknown concepts rather than to act on known concepts through steering. The position paper challenges the dominant steering-via-SAE direction with a structural argument about where SAEs provide value and where they underperform alternatives.

The substantive piece is the positioning argument. Before the paper, SAE applications spanned both discovery (finding unknown features) and steering (modifying behavior via known features). The paper argues that SAEs are structurally suited to discovery, that is, finding interpretable directions in activation space, but underperform simpler methods at steering specific behaviors. Repositioning would align research investment with where SAEs actually provide value.
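To make the discovery/steering distinction concrete, here is a minimal toy sketch in plain Python. It is not the paper's method: the dictionary is random rather than trained, the encoder and decoder share weights for brevity, and every name (`DIRECTIONS`, `encode`, `discover_top_features`, `steer`) is hypothetical. Discovery ranks dictionary features by how strongly they fire on a batch of activations; steering adds a chosen feature's direction to an activation vector.

```python
import math
import random

random.seed(0)
D_MODEL, N_FEATURES = 8, 32

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Assumption: random unit-norm directions stand in for a trained SAE dictionary.
# Tied weights (encoder = decoder) are used only to keep the sketch short.
DIRECTIONS = [unit([random.gauss(0, 1) for _ in range(D_MODEL)])
              for _ in range(N_FEATURES)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def encode(act):
    """ReLU encoder: one non-negative activation per dictionary feature."""
    return [max(dot(act, d), 0.0) for d in DIRECTIONS]

def discover_top_features(acts, k=3):
    """Discovery-style use: rank features by mean activation over a batch."""
    totals = [0.0] * N_FEATURES
    for act in acts:
        for i, f in enumerate(encode(act)):
            totals[i] += f
    means = [t / len(acts) for t in totals]
    return sorted(range(N_FEATURES), key=lambda i: means[i], reverse=True)[:k]

def steer(act, feature_idx, strength):
    """Steering-style use: add a chosen feature's direction to an activation."""
    return [a + strength * d for a, d in zip(act, DIRECTIONS[feature_idx])]

# Toy batch: small noise plus a shared hidden component along feature 7.
acts = [steer([random.gauss(0, 0.1) for _ in range(D_MODEL)], 7, 3.0)
        for _ in range(50)]

top = discover_top_features(acts)               # which features fire most?
steered = steer(acts[0], top[0], strength=2.0)  # push along the top feature
```

In this toy setup the planted component along feature 7 should surface at the top of the ranking without being named in advance, which is the discovery use case. The steering call, by contrast, needs the target feature to be known beforehand.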

Read against SAE-LoRA targeted-alignment work, the picture is that SAE research in H2 2026 includes both discovery-focused and steering-focused applications. Whether the field consolidates around a discovery-only positioning or keeps both applications will substantially shape interpretability research from H2 2026 into 2027.


arXiv: Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts (2506.23845) · arXiv: Evaluating SAE Interpretability with Concept Annotations