// news · alignment · research-papers · 2026-05-23 · source: anthropic / alignment.anthropic.com / lesswrong

Anthropic publishes introspection adapters — trains LLMs to self-report fine-tune-induced behavior change

Anthropic published introspection adapters (IA) this week, a technique that trains an LLM to self-report behaviors it picked up during fine-tuning. The team reports the trained adapters generalize to models fine-tuned in different downstream regimes — a property prior introspection work failed to deliver. IA arrives as the alignment-faking literature accelerates and as the UK AISI Methodology 2.0 starts treating internal probes as load-bearing for pre-deployment review.

The methodological move is subtle: rather than asking a model after the fact whether its behavior changed, IA trains a lightweight adapter alongside fine-tuning, which the model can then query to enumerate the policy shifts the training run actually induced. The adapter is auditable, the model's self-report is grounded in concrete activation deltas rather than verbal confabulation, and the technique generalizes across downstream fine-tuning recipes.
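To make "grounded in activation deltas" concrete, here is a toy sketch in plain Python. It is not Anthropic's implementation: the function names, the linear readout, the behavior labels, and the threshold are all hypothetical, chosen only to illustrate the general idea of reporting behavior changes from measured activation shifts rather than from the model's verbal say-so.

```python
from typing import Dict, List

Vector = List[float]


def mean_activation(acts: List[Vector]) -> Vector:
    """Average a list of per-prompt activation vectors, dimension by dimension."""
    n = len(acts)
    return [sum(col) / n for col in zip(*acts)]


def activation_delta(base: List[Vector], tuned: List[Vector]) -> Vector:
    """Mean activation shift between a base model and its fine-tuned version."""
    mean_base = mean_activation(base)
    mean_tuned = mean_activation(tuned)
    return [t - b for b, t in zip(mean_base, mean_tuned)]


def self_report(
    delta: Vector,
    readout: Dict[str, Vector],  # hypothetical: one direction per behavior label
    threshold: float = 0.5,      # hypothetical cutoff, not from the paper
) -> List[str]:
    """Return the behavior labels whose readout direction aligns with the delta."""
    scores = {
        name: sum(w * d for w, d in zip(direction, delta))
        for name, direction in readout.items()
    }
    return sorted(name for name, score in scores.items() if score > threshold)


# Toy activations on two probe prompts (3-dimensional for readability).
base_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
tuned_acts = [[1.0, 0.0, 1.0], [1.0, 0.2, 1.0]]

# Hypothetical readout directions standing in for a trained adapter.
readout = {
    "sycophancy": [1.0, 0.0, 0.0],
    "refusal_shift": [0.0, 1.0, 0.0],
}

delta = activation_delta(base_acts, tuned_acts)
report = self_report(delta, readout)
print(report)
```

In this toy setup the report is a function of the measured shift alone, so an auditor can recompute it from the same activations; whether and how the published adapters achieve something comparable is a matter for the paper itself.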

The implication for the post-EO regulatory environment is direct. If introspection adapters become a standard part of fine-tuning pipelines, the gap between "the model says it didn't learn X" and "the model demonstrably did or did not learn X" narrows. That is exactly the gap the 2026 International AI Safety Report flagged as breaking down. The political question becomes whether regulators treat IA-style self-reports as evidence or as more elaborate stagecraft.

See our analysis →

Anthropic Alignment — Alignment Science Blog → · Zylos — AI Safety, Alignment, and Interpretability in 2026 → · Claude5 — AI Safety 2026: Alignment Research Breakthroughs →