// news · research · interpretability · 2026-05-19 · source: arXiv 2605.13930

TopK Sparse Autoencoders extract interpretable clinical features from EEG foundation models

An arXiv preprint (2605.13930, submitted May 13) applies TopK Sparse Autoencoders (SAEs) to three EEG foundation models (SleepFM, REVE, and LaBraM) and extracts sparse feature dictionaries that align with clinical taxonomies, including abnormality, age, sex, and medication state.

The result is significant for two reasons. First, it demonstrates that the sparse-autoencoder interpretability machinery developed for LLMs transfers cleanly to a different modality (biomedical signals). Second, it benchmarks monosemanticity (how closely each learned feature tracks a single concept) across three architectures, which lets researchers compare which transformer designs produce more interpretable internal features.
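For readers unfamiliar with the machinery mentioned above, here is a minimal sketch of a generic TopK sparse autoencoder forward pass, assuming the usual formulation: encode an activation vector, keep only the k largest pre-activations, and decode. It is not the preprint's implementation. The class name `TopKSAE`, the helper names, the dimensions, and the tied random initialization are all hypothetical, and the weights here are untrained. In practice the input would be a hidden-layer activation from a model such as the EEG foundation models above, and the weights would be fit to minimize reconstruction error.

```python
import random


def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations (clipped at zero); zero out the rest."""
    top = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)[:k]
    out = [0.0] * len(pre_acts)
    for i in top:
        out[i] = max(pre_acts[i], 0.0)
    return out


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class TopKSAE:
    """Hypothetical, untrained TopK sparse autoencoder (illustration only)."""

    def __init__(self, d_model, n_features, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        # Encoder: n_features x d_model, small random weights (assumption).
        self.W_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        # Decoder: d_model x n_features, initialized as the encoder transpose.
        self.W_dec = [[self.W_enc[j][i] for j in range(n_features)]
                      for i in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # Subtract the decoder bias, project, then keep only the top-k features.
        centered = [xi - bi for xi, bi in zip(x, self.b_dec)]
        pre = [p + b for p, b in zip(matvec(self.W_enc, centered), self.b_enc)]
        return topk_activation(pre, self.k)

    def decode(self, z):
        # Reconstruct the activation from the sparse feature vector.
        return [r + b for r, b in zip(matvec(self.W_dec, z), self.b_dec)]


# Usage: a made-up 4-dimensional activation, 16 dictionary features, k = 3.
sae = TopKSAE(d_model=4, n_features=16, k=3)
features = sae.encode([1.0, -0.5, 0.25, 2.0])
reconstruction = sae.decode(features)
```

Nothing in this sketch depends on the input being text-derived: the autoencoder sees only activation vectors. The TopK step guarantees that at most k features are active per input, which is what makes the resulting dictionary sparse by construction.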

Findings: SleepFM produces more monosemantic features than REVE or LaBraM at matched model sizes, suggesting that architecture choices made for medical applications also happen to favor interpretability. The paper is the first widely noted application of SAE methods outside the LLM-mech-interp community.
