// topic / interpretability

Interpretability

SAEs, circuits, mech interp — what's actually inside the models we use.

All items 53 items ← back to archive

ANTHROPIC / INDUSTRY ANALYSTS·2026-05-22

Glasswing's interpretability data pool starts forming — first AWS and JPMorgan-side Mythos behavioral reports inside Anthropic's contractually-bound channel

Internal sources at multiple Glasswing partners report initial deployment-side Mythos behavioral data is now flowing into Anthropic's safety research channel under the consortium contractual arrangement. The data covers AWS cloud-vuln-discovery workflows and JPMorgan finance-app fuzzing — the two highest-volume Mythos deployment contexts in the first month of Glasswing operation. The pool is the under-noticed second-order benefit of the consortium structure.

interpretability · safety
ARXIV 2605 / LIU ET AL.·2026-05-22

Interleaved vision-language reasoning traces paper offers a window into long-horizon robot planning — interpretability gets a robotics-specific primitive

An arXiv paper titled 'Thinking in Text and Images: Interleaved Vision-Language Reasoning Traces for Long-Horizon Robot Manipulation' from Jinkun Liu and colleagues introduces a methodology for capturing and analyzing how vision-language models route reasoning between modalities during multi-step robotic tasks. The traces give interpretability researchers a structured artifact to study without relying on internal model state — a meaningful methodological gain for closed-weights deployments.

interpretability · robotics
ARXIV / MECHANISTIC INTERPRETABILITY REVIEW·2026-05-22

Mechanistic Interpretability for AI Safety — the field-defining review consolidates 2024-2026 methodology into a single reference text

An updated 'Mechanistic Interpretability for AI Safety — A Review' (arXiv 2404.14082) consolidates the 2024-2026 methodology pipeline — circuit identification, feature differentials, sparse autoencoder methods, and behavioral attribution — into the field's reference text. The review's publication this week, during the postponed-EO ambiguity, gives both AISI and lab-internal teams a single citation surface for methodology discussions.

interpretability · research
ANTHROPIC / ARMORCODE·2026-05-22

Claude Mythos becomes the interpretability community's load-bearing stress test — Glasswing partners get capability access plus methodology access

Anthropic's Project Glasswing gives consortium partners — AWS, Apple, Cisco, CrowdStrike, Google, JPMorgan, Linux Foundation, Microsoft, NVIDIA, Palo Alto Networks — access to Claude Mythos for defensive vulnerability discovery. The under-noticed structural feature is that Glasswing partners also gain operational visibility into Mythos's reasoning patterns. That makes the consortium a de-facto interpretability research collaboration alongside its primary cybersecurity-defense mission.

interpretability · safety
SOURCE·2026-05-22

Glasswing and the third path — Anthropic's consortium model becomes the interpretability research platform nobody else has

Anthropic's Project Glasswing routes Claude Mythos into a 10-partner cybersecurity-defense consortium. The under-noticed feature is that Glasswing also creates the largest-ever pool of interpretability research access. AWS, Apple, Google, Microsoft, NVIDIA, and JPMorgan now run Mythos under contractual obligations. That's a research platform, not just a security program.

analysis · interpretability
SOURCE·2026-05-22

Glasswing's data feedback loop activates — AWS cloud-vuln and JPMorgan finance-app behavioral traces enter Anthropic's interpretability channel

The under-noticed second-order effect of the Mythos consortium structure starts becoming visible this week. Glasswing partners are producing behavioral data Anthropic could never have generated internally. The methodology dividend is structural — and it accrues to Anthropic faster than any other interpretability research program in the field.

analysis · interpretability
ANTHROPIC RESEARCH·2026-05-21

Anthropic's "microscope" interpretability tool now traces full reasoning paths in production-scale Claude variants

Anthropic's mechanistic-interpretability stack — the "microscope" tool launched in 2025 — has scaled to trace full reasoning paths in production-scale Claude variants. The capability moves microscope from research-stage methodology to a deployable safety inspection tool, usable by Anthropic safety teams for pre-deployment auditing of named circuits.

interpretability · alignment
ARXIV / INTERPRETABILITY·2026-05-21

New arXiv work on decoding encrypted chain-of-thought reasoning — latent-reasoning models pose new monitorability challenge

Recent arXiv work (Dec 2025–May 2026) introduces a model organism for opaque internal reasoning and proposes unsupervised decoding of encrypted chain-of-thought. The research direction responds to a frontier-safety problem: as more frontier labs explore latent-reasoning models that don't externalize CoT in human language, the standard CoT-monitorability assumption breaks.

interpretability · alignment · research
MIT TECHNOLOGY REVIEW·2026-05-21

Mechanistic interpretability named one of MIT Tech Review's 10 Breakthrough Technologies of 2026

Mechanistic interpretability — the program of reverse-engineering neural-network computations into human-understandable algorithms — has been named one of MIT Technology Review's 10 Breakthrough Technologies of 2026. The recognition formalizes what frontier labs have been signaling for two years: interpretability is no longer a research-niche but a structural safety pillar.

interpretability · research
ANTHROPIC / INTERPRETABILITY·2026-05-21

Anthropic microscope reportedly identifies test-awareness circuits in production models — methodology extension targets AISI report finding

Anthropic's mechanistic-interpretability stack has reportedly identified specific circuit-level features that activate during evaluation scenarios but not during typical user interactions. The finding directly addresses the 2026 International AI Safety Report's warning about test-aware frontier models. If the circuit identification holds, it gives AISI evaluators a concrete inspection target rather than a behavioral suspicion.

interpretability · alignment
ARXIV·2026-05-20

"Attention as Binding" paper formalizes transformer reasoning as approximate Vector Symbolic Architecture

A new arXiv paper, "Attention as Binding: A Vector-Symbolic Perspective on Transformer Reasoning," interprets self-attention and residual streams as implementing an approximate Vector Symbolic Architecture (VSA). The framing provides a unified theoretical account for why transformers can do compositional reasoning — and predicts where they should fail.

research-papers · interpretability
INTERPRETABILITY RESEARCH·2026-05-20

Complete Replacement Models combine transcoders + Lorsas to fully sparsify language models

A new class of interpretability methods — Complete Replacement Models (CRMs) — combines transcoder MLP replacements with localized SAE variants (Lorsas) to fully sparsify a transformer's representation. Where SAEs alone left residual dense pathways, CRMs aim to decompose the entire forward pass into named, sparse circuits.

interpretability · research
MIT TECH REVIEW·2026-05-20

MIT Technology Review names mechanistic interpretability a 2026 Breakthrough Technology

MIT Technology Review's annual 10 Breakthrough Technologies list for 2026 names mechanistic interpretability — the field of reverse-engineering neural networks to understand how they compute — as one of the year's most consequential research directions. The recognition follows Anthropic's circuit-tracing work on Claude 3.5 Haiku and Anthropic's stated goal of reliably detecting most AI model problems by 2027 using interpretability tools.

interpretability · research
TRANSFORMER-CIRCUITS / ARXIV·2026-05-20

Sparse autoencoders and circuit tracing move from research toy to production safety tool

Sparse autoencoders (SAEs), the technique for projecting neural activations into a higher-dimensional space where features become monosemantic, are graduating from research benchmark to actual production safety tooling. Recent work demonstrates SAE-derived features driving steering vectors that reliably suppress jailbreaks and hallucinations on Claude 3.5 Haiku.

interpretability · sae · circuits
ZYLOS RESEARCH·2026-05-15

Zylos Research publishes 2026 mech interp landscape survey

Zylos Research released a comprehensive survey of mechanistic interpretability progress through Q2 2026. Headline finding: sparse autoencoders are now reliably extracting interpretable circuits at the scale of frontier models, but downstream uses in alignment remain mostly speculative.

research · interpretability