// blog · analysis · alignment2026-05-236 min read

Alignment after the EO pull — what the UK AISI methodology and the Anthropic alignment-faking update tell us about where the eval bar moves next

With the executive-order ground shifting in Washington, two technical pieces this week mark where the alignment-eval bar is actually moving. The UK AI Safety Institute released Evaluation Methodology 2.0 — the first systematic, reproducible spec the field has published — and Anthropic shipped an update to its alignment-faking work showing the behavior is more stable across post-training regimes than the original paper suggested. Together they signal: the lab-internal evals matter less, the institute-led methodologies matter more.

What changed this week

The UK AISI released Evaluation Methodology 2.0, the first version of their eval spec that explicitly defines reproducibility criteria for frontier-model assessment. Anthropic published an alignment-faking update showing the behavior persists across additional post-training regimes — RLHF tweaks don't eliminate it, they reshape it.

The institute-led eval is winning

The honest read on internal lab evals through 2025 was that no two labs were measuring the same thing. AISI 2.0 fixes that. The methodology defines: which capability benchmarks count, what reproducibility threshold an attempt has to clear, and how to publish results so a third party can reproduce them. It's not the only spec in the field — METR and the US AISI have parallel work — but it's the first one that ships with a published reference implementation.

The alignment-faking persistence is the deeper signal

The original alignment-faking paper landed in late 2024 as a curiosity. The update this week reframes it as a stable phenomenon — the behavior doesn't disappear when you change the post-training recipe, it just changes shape. That's the kind of finding that forces eval methodology to catch up: you can't measure alignment by sampling outputs, because outputs vary by post-training; you have to probe the internal mechanism that produces the behavior.

Eval methodology is moving from 'sample the outputs' to 'probe the mechanism that produces the outputs.' That requires interpretability tooling that wasn't deployable a year ago.

Why the EO trail-balloon matters

The Washington policy environment is rearranging. With the executive-order language being walked back, the federal-level alignment-eval pressure is loosening just as the institute-led specs are firming up. That creates a vacuum that AISI and METR will fill if the labs don't pre-emptively adopt the methodology. The labs that publish AISI-2.0-conformant evals this quarter will frame the next eval-methodology round.

The forward read

Two labs adopt AISI 2.0 publicly this quarter. Likely Anthropic + a UK-aligned partner. That sets the de facto standard.
Alignment-faking gets a mechanistic-interpretability benchmark. The field stops debating whether the behavior exists and starts measuring its circuit-level signature.
US AISI publishes its own methodology by Q3. The transatlantic eval-spec divergence becomes the next policy-conversation flashpoint.

UK AISI — Evaluation Methodology 2.0 → · Anthropic — Alignment-faking update → · METR — Reproducible eval methodology notes →