// news · interpretability · tools · 2026-05-23 · source: goodfire / anthropic / techcrunch

Goodfire ships sparse-autoencoder probes for Claude 4.6 via Anthropic API — first third-party interpretability tooling integrated at the model level

Goodfire AI shipped sparse-autoencoder probe access for Claude 4.6 Opus through the Anthropic API this week. It is the first time a third-party interpretability vendor has integrated activation-probe tooling at the model level on a closed frontier flagship model. The integration is technically narrow (six predefined probe domains) but architecturally important.

The integration runs the probes server-side at Anthropic and never exposes raw model activations to Goodfire. This is a confidential-compute pattern: external interpretability tooling can operate without breaching the privacy boundary around a closed model's internals. The six probe domains are hallucination detection, reward-hacking signatures, sycophancy patterns, deception-relevant feature firing, refusal-circuit activation, and contextual faking.
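To make the server-side pattern concrete, here is a minimal toy sketch in Python. It is an illustration under stated assumptions, not the Goodfire or Anthropic API: every name in it (`SparseAutoencoder`, `run_probes_server_side`, `PROBE_DOMAINS`, the random weights and dimensions) is hypothetical. It assumes a standard sparse-autoencoder encoder of the form ReLU(Wx + b) and a linear probe over the resulting features, and shows the key property described above: the caller gets back one score per probe domain, never the activation vector or the feature vector.

```python
from __future__ import annotations

import random
from dataclasses import dataclass

# Hypothetical domain labels, paraphrasing the six domains named in the article.
PROBE_DOMAINS = (
    "hallucination",
    "reward_hacking",
    "sycophancy",
    "deception",
    "refusal",
    "contextual_faking",
)


@dataclass(frozen=True)
class SparseAutoencoder:
    """Toy SAE encoder: features = ReLU(W @ activation + b)."""

    weights: list[list[float]]  # shape: n_features x d_model
    bias: list[float]           # shape: n_features

    def encode(self, activation: list[float]) -> list[float]:
        return [
            max(0.0, sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(self.weights, self.bias)
        ]


def run_probes_server_side(
    activation: list[float],
    sae: SparseAutoencoder,
    probe_weights: dict[str, list[float]],
) -> dict[str, float]:
    """Assumed to run inside the model host.

    Only per-domain scalar scores are returned; the raw activation and
    the SAE features stay local to this function.
    """
    features = sae.encode(activation)
    return {
        domain: sum(w * f for w, f in zip(weights, features))
        for domain, weights in probe_weights.items()
    }


# Demo with random stand-in values (assumption: real weights would be trained).
rng = random.Random(0)
d_model, n_features = 4, 8
sae = SparseAutoencoder(
    weights=[[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(n_features)],
    bias=[0.0] * n_features,
)
probe_weights = {
    domain: [rng.uniform(-1, 1) for _ in range(n_features)]
    for domain in PROBE_DOMAINS
}
activation = [rng.uniform(-1, 1) for _ in range(d_model)]

scores = run_probes_server_side(activation, sae, probe_weights)
print(sorted(scores))  # the six domain names; values are scalar probe scores
```

In this sketch the third party would supply only `probe_weights` (and possibly the SAE), while the host supplies `activation`. The response is a small dictionary of scalars, which is what keeps the model's internals on the host side of the boundary.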

The pattern matters because it shows a tractable path for the interpretability ecosystem to grow outside the model labs themselves. If Goodfire-style server-side probe integration becomes standard, smaller alignment-research shops could build commercial products on top of frontier-model deployments without hosting frontier models of their own.

Goodfire — Sparse-autoencoder Claude integration → · Anthropic — Third-party interpretability tooling on Claude → · TechCrunch — Goodfire ships interp tooling on Claude →