'Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models' (arXiv 2505.17769): a methodology paper for scalable LLM interpretation
The ITDA paper (arXiv 2505.17769) introduces inference-time decomposition of activations as a scalable approach to interpreting large language models. It targets the scalability constraint imposed by training-time interpretability methods such as sparse autoencoders (SAEs) and other trained dictionary-learning methods, where each new model needs its own interpretability infrastructure trained at substantial compute cost. ITDA aims to provide interpretability without that per-model training cost.
The substantive contribution is the distinction between inference-time and training-time methodology. Earlier interpretability methods (sparse autoencoders, trained dictionary learning, concept-bottleneck approaches) require an up-front training investment: interpretability infrastructure has to be trained separately for each model. Inference-time decomposition instead operates on already-trained models, with no per-model interpretability training requirement. That scalability advantage matters substantially for deploying interpretability in production.
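To make the distinction concrete, here is a minimal sketch of what decomposing an activation at inference time could look like. The summary above does not describe the decomposition procedure, so this is an assumption: it uses matching pursuit, a generic greedy sparse-coding technique, over a fixed dictionary of stored activation vectors. The function names, the toy dictionary, and the `max_atoms` parameter are all hypothetical and are not taken from the paper's code.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so projection scores are comparable."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in v]


def matching_pursuit(activation, dictionary, max_atoms=2, tol=1e-9):
    """Greedily decompose `activation` over fixed, unit-norm dictionary atoms.

    Returns (coeffs, residual), where coeffs maps atom index to its weight.
    Nothing is learned here: the dictionary stays fixed and all work is
    done per input vector, which is the "inference-time" part.
    """
    residual = list(activation)
    coeffs = {}
    for _ in range(max_atoms):
        # Project the current residual onto every atom.
        scores = [dot(residual, atom) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda i: abs(scores[i]))
        if abs(scores[best]) <= tol:
            break  # nothing left that any atom can explain
        coeffs[best] = coeffs.get(best, 0.0) + scores[best]
        # Subtract the explained component from the residual.
        residual = [r - scores[best] * a
                    for r, a in zip(residual, dictionary[best])]
    return coeffs, residual


# Hypothetical toy vectors standing in for activations read from a trained model.
dictionary = [normalize(v) for v in ([1.0, 0.0, 0.0],
                                     [0.0, 1.0, 0.0],
                                     [0.0, 0.0, 1.0])]
activation = [3.0, 0.0, -2.0]
coeffs, residual = matching_pursuit(activation, dictionary)
print(coeffs)    # which atoms were selected, and with what weight
print(residual)  # the part of the activation left unexplained
```

The contrast with a training-time method is that there is no optimizer or loss in this sketch: the cost is a handful of dot products per activation, whereas an SAE would first need its encoder and decoder weights fitted for the specific model being studied.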
Read against the broader 2026 interpretability landscape, the paper arrives as scalability constraints have become a primary concern for production interpretability deployment. DeepMind's deprioritization of SAE work partly reflected scalability constraints. ITDA-class methods may address the scalability problem without abandoning interpretability altogether, which is a substantively different research direction from incrementally refining training-time SAE approaches.
arXiv: Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models (2505.17769) · arXiv: A Survey on Sparse Autoencoders