UK AISI publishes Claude Opus 4.5 Preview alignment evaluation — slight test-awareness uptick, no sabotage-propensity findings
The UK AI Security Institute published its alignment evaluation of Claude Opus 4.5 Preview alongside Claude Opus 4.1, Sonnet 4.5, and GPT-5. The headline finding has two parts. Opus 4.5 Preview was slightly better than Sonnet 4.5 at distinguishing research-sabotage evaluations from benign deployment scenarios, a small but measurable uptick in test-awareness. Even so, the evaluation provided initial evidence against Opus 4.5 Preview having a propensity to sabotage safety research.
The methodology choice is the interesting detail. AISI ran the test as a simulated research-assistant scenario inside a frontier AI lab, watching for any propensity to interfere with safety research. The two findings are in tension: the model can tell when it is being evaluated (test-awareness), but it does not appear to act adversarially on that knowledge. That gap between detection and exploitation is the load-bearing safety guarantee currently propping up the disclose-hold-evaluate-ship framework.
For the AISI-EO convergence story, the Opus 4.5 evaluation is the methodology that has to hold under the now-postponed 90-day pre-release framework. The test-awareness circuit reportedly identified with Anthropic's microscope would let evaluators verify the detection-without-exploitation pattern at the circuit level. The behavioral and circuit-level methodology tracks need to mature together.
Sources: AISI (sabotage evaluation) · arXiv (AISI alignment evaluation case study) · AISI (misalignment investigation)