// blog · analysis · interpretability · 2026-05-23 · 6 min read

Circuits after the data loop — Anthropic's public circuits eval and Goodfire's sparse-autoencoder release close the interpretability-to-deployment gap

Two interpretability releases this week mark the field's transition from research curiosity to deployable tooling. Anthropic opened public access to its circuits evaluation pipeline; Goodfire shipped a production sparse-autoencoder (SAE) set trained on Claude. Together they signal that the gap between interpretability research and deployment is closing: a model's internal mechanisms are no longer something only the labs can read, but something a third party can probe.

What changed

Anthropic's release lets external researchers query feature activations on Claude 4.6 directly through the circuits evaluation pipeline. Goodfire's sparse-autoencoder set, trained on Claude, exposes feature decompositions as an inference-time API.
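
For readers who have not worked with sparse autoencoders, a feature decomposition can be sketched in a few lines. The toy below is not Goodfire's or Anthropic's API, neither of which is documented in this post; the weights, the function names, and the three-feature dictionary are all made up for illustration. It follows one common SAE formulation: encode an activation vector into sparse, non-negative feature activations, list which features fired, and decode back to the original space.

```python
# Toy sparse autoencoder (SAE) decomposition.
# ASSUMPTION: every weight and name here is hypothetical, chosen for
# illustration only. Nothing is taken from a released SAE or a real API.

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_encode(x, w_enc, b_enc):
    """Feature activations: f = ReLU(W_enc @ x + b_enc)."""
    pre = matvec(w_enc, x)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]

def sae_decode(f, w_dec, b_dec):
    """Reconstruction: x_hat = W_dec @ f + b_dec."""
    return [r + b for r, b in zip(matvec(w_dec, f), b_dec)]

def active_features(f, threshold=0.0):
    """(index, activation) pairs for features that fired, strongest first."""
    fired = [(i, a) for i, a in enumerate(f) if a > threshold]
    return sorted(fired, key=lambda pair: -pair[1])

# A 2-dimensional activation and a dictionary of 3 hypothetical features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]
B_DEC = [0.0, 0.0]

x = [2.0, 0.5]
f = sae_encode(x, W_ENC, B_ENC)        # [2.0, 0.5, 0.0]
fired = active_features(f)             # [(0, 2.0), (1, 0.5)]
x_hat = sae_decode(f, W_DEC, B_DEC)    # [2.0, 0.5]
```

In this toy the reconstruction is exact; a real SAE trades reconstruction error against sparsity, and its dictionary has far more features than the activation has dimensions.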

The deployable interpretability story

Through 2024 and early 2025, mechanistic interpretability was largely a publication track: papers showed what circuits looked like, but nobody could query them at scale. Anthropic's public circuits eval changes that, since the same tooling its internal alignment researchers use can now be invoked by third parties. Goodfire's SAE deployment goes further: features are exposed as steerable handles in production.

What this enables

Specific things that weren't possible six months ago and now are: querying which features fired during a given Claude response; steering a deployment to suppress a feature without retraining; running an SAE diff between two model checkpoints to surface feature-level capability shifts; auditing a model's response for the presence of a feature an alignment team flagged as concerning.
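
Two of these capabilities, suppressing a feature without retraining and auditing a response for a flagged feature, reduce to simple operations once feature activations are available. The sketch below is an assumption-laden illustration, not either vendor's method: it treats suppression as subtracting the feature's decoder direction from the activation (one common ablation-style approach), and the function names, the toy decoder matrix, and the per-token activations are all hypothetical.

```python
# Toy feature suppression and feature audit.
# ASSUMPTION: names, weights, and activations are hypothetical; suppression
# is modelled as subtracting the feature's decoder direction, which is one
# common approach and not a documented vendor API.

def suppress_feature(x, f, w_dec, k):
    """Remove feature k's contribution: x' = x - f[k] * d_k,
    where d_k is column k of the decoder matrix."""
    return [xi - f[k] * row[k] for xi, row in zip(x, w_dec)]

def audit(per_token_features, flagged, threshold=0.0):
    """Map each flagged feature index to the token positions where it fired."""
    hits = {}
    for pos, feats in enumerate(per_token_features):
        for k in flagged:
            if feats[k] > threshold:
                hits.setdefault(k, []).append(pos)
    return hits

# Decoder for a 2-dimensional activation and 3 hypothetical features.
W_DEC = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]

x = [2.0, 0.5]         # activation at one token
f = [2.0, 0.5, 0.0]    # its feature activations
steered = suppress_feature(x, f, W_DEC, k=0)    # [0.0, 0.5]

# Feature activations for a three-token response; features 0 and 2 are flagged.
RESPONSE = [
    [2.0, 0.5, 0.0],
    [0.0, 0.0, 1.5],
    [0.0, 0.3, 0.0],
]
hits = audit(RESPONSE, flagged=[0, 2])          # {0: [0], 2: [1]}
```

The steering step touches only activations at inference time, which is why no retraining is involved; whether the edited activation still produces sensible output is an empirical question this toy does not address.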

The interpretability stack just became part of the product stack. Steerable features, queryable circuits, and SAE diffs are the new operational primitives for deployed alignment work.
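
An SAE diff can be sketched the same way. The version below assumes, and this is a strong assumption, that one shared feature dictionary is applied to both checkpoints on the same prompts, so that feature index k means the same thing in each. The function names and activation values are hypothetical. It averages each feature's activation per checkpoint and ranks features by how much the mean moved.

```python
# Toy SAE diff between two checkpoints.
# ASSUMPTION: the same SAE dictionary is applied to both checkpoints on the
# same prompts; all names and numbers are hypothetical.

def mean_activation(feature_rows):
    """Per-feature mean over a list of feature-activation vectors."""
    n = len(feature_rows)
    return [sum(col) / n for col in zip(*feature_rows)]

def sae_diff(rows_a, rows_b, top=3):
    """(feature index, mean_b - mean_a) pairs, largest absolute shift first."""
    mean_a = mean_activation(rows_a)
    mean_b = mean_activation(rows_b)
    deltas = [(k, b - a) for k, (a, b) in enumerate(zip(mean_a, mean_b))]
    return sorted(deltas, key=lambda pair: -abs(pair[1]))[:top]

# Feature activations for the same two prompts under each checkpoint.
ROWS_A = [[1.0, 0.0, 0.0], [1.0, 0.0, 2.0]]   # means: [1.0, 0.0, 1.0]
ROWS_B = [[1.0, 0.5, 0.0], [1.0, 0.5, 0.0]]   # means: [1.0, 0.5, 0.0]

shifts = sae_diff(ROWS_A, ROWS_B, top=2)      # [(2, -1.0), (1, 0.5)]
```

Here feature 2 stopped firing in the second checkpoint and feature 1 started firing. A real diff would need many more prompts and some test of whether a shift exceeds noise.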

The throughline to alignment-faking

The Anthropic alignment-faking update earlier this week showed the behavior persists across post-training regimes. The natural next question is: what does it look like in circuit space? With the public circuits eval, that question is now answerable by researchers outside Anthropic. The field gets a shared substrate for arguing about what alignment-faking actually is at the mechanism level.

The forward read

  1. Three external papers on alignment-faking circuits land within 90 days. The shared eval substrate accelerates the work that single-lab access used to bottleneck.
  2. SAEs ship as a standard layer in two more frontier models. OpenAI and DeepMind announce comparable steerable-feature access by Q3.
  3. The first regulator-requested interpretability artifact appears. UK AISI or US AISI requires SAE-style feature audits as part of pre-deployment review by end of year.

Anthropic — Circuits public eval announcement → · Goodfire — SAE Claude production release → · Alignment Forum — Public circuits methodology discussion →