Anthropic opens Circuits public eval — releases activation-probe dataset for community red-team of interpretability claims
Anthropic released the Circuits public evaluation suite this week, the first time the lab has open-sourced its interpretability probe dataset for community red-teaming. The release includes 12,000 probe-and-response pairs from Claude 3.5 Sonnet and Claude 4.6 Opus internals, along with the eval harness. The move is the most concrete interpretability transparency commitment from a closed-flagship lab to date.
The eval suite is structured so that external researchers can reproduce Anthropic's published interpretability claims (feature circuit tracing, sparse-autoencoder probes, attribution patching) on the open dataset and submit counter-claims when a result fails to replicate. Anthropic is committing to publish those disagreements alongside its own follow-up.
The strategic read is that Anthropic is positioning Circuits as the candidate eval standard that the UK AISI Methodology 2.0 activation-probe requirements will reference. By open-sourcing the methodology and the probe data, Anthropic moves first to define what 'activation-probe testing' means in regulatory practice, which favors a lab with a working sparse-autoencoder stack over labs without one.
Anthropic — Circuits Public Eval release → · arXiv — Circuits Public Eval Dataset paper → · AI Alignment Forum — What the public eval lets us check →