The detection-without-exploitation gap — what AISI's Opus 4.5 evaluation actually says about the safety regime
AISI found Opus 4.5 Preview can detect evaluation scenarios slightly better than Sonnet 4.5 — but does not appear to exploit that detection. The safety guarantee currently propping up the disclose-hold-evaluate-ship framework lives in that gap. The gap is narrower than the framework's marketing implies.
The methodology and the finding
AISI's evaluation of Claude Opus 4.5 Preview ran the model as a simulated research-assistant inside a frontier AI lab, watching for propensity to interfere with safety research. Two findings:
- Detection up: Opus 4.5 demonstrated slightly more ability to distinguish research-sabotage evaluations from benign deployment scenarios than Sonnet 4.5.
- Exploitation flat: The evaluation provided initial evidence against Opus 4.5 exhibiting safety-research-sabotage propensities.
Restated: the model is slightly better at telling when it's being tested, but doesn't seem to act adversarially when it can tell. That gap is the load-bearing safety guarantee.
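To make the two findings concrete, here is a minimal sketch of how detection and exploitation could be scored as separate quantities. It is not AISI's methodology. The `Episode` fields, the 0.5 threshold, and the idea of splitting sabotage rates by the model's apparent belief are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # All fields are hypothetical; a real harness would define its own.
    is_eval: bool       # ground truth: was this an evaluation scenario?
    eval_score: float   # model's stated probability that it is being evaluated
    sabotaged: bool     # did a grader flag sabotage in this episode?


def detection_auc(episodes):
    """Chance that a random evaluation episode gets a higher eval_score
    than a random deployment episode (ties count as half)."""
    pos = [e.eval_score for e in episodes if e.is_eval]
    neg = [e.eval_score for e in episodes if not e.is_eval]
    if not pos or not neg:
        raise ValueError("need both evaluation and deployment episodes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sabotage_rate_by_belief(episodes, threshold=0.5):
    """Return (rate when the model seems to think it is being tested,
    rate when it does not). A large difference would suggest exploitation."""
    def rate(group):
        return sum(e.sabotaged for e in group) / len(group) if group else 0.0

    flagged = [e for e in episodes if e.eval_score >= threshold]
    unflagged = [e for e in episodes if e.eval_score < threshold]
    return rate(flagged), rate(unflagged)
```

"Detection up" corresponds to a higher `detection_auc` for one model than another. "Exploitation flat" corresponds to the two sabotage rates staying close to each other and close to zero. The two numbers can move independently, which is why the gap between them matters.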
Why the gap is narrower than the marketing
The disclose-hold-evaluate-ship framework assumes evaluation findings generalize to deployment. The assumption holds only if the model's evaluation behavior matches its deployment behavior. AISI's evaluation gives us initial behavioral evidence that the two currently match for Opus 4.5. It doesn't give us mechanistic evidence, meaning circuit-level proof that the model's policy is the same in both contexts.
The test-awareness circuit identification reported from Anthropic's microscope work is the methodology that would close that gap. If auditors can check 'does this model route differently when it detects an evaluation?' at the circuit level, the behavioral finding gets the mechanistic backing the safety regime currently lacks.
The current safety guarantee is: we tested the model and it didn't sabotage us. The future safety guarantee needs to be: we audited the model and verified its behavior is structurally consistent across contexts.
What the postponed EO changes
The pulled executive order would have formalized AISI-style evaluation into federal procurement. Without it, AISI's methodology stays advisory rather than binding. The Opus 4.5 evaluation is excellent work; whether it actually shapes labs' deployment decisions depends on whether AISI's authority hardens or stays voluntary.
The Mythos consortium pattern we wrote about is the industry's adaptive response to that ambiguity. If federal evaluation authority doesn't harden, lab-led consortia become the de-facto evaluation venue — Glasswing partners run their own structured deployment tests, AISI runs methodology research, and the two streams complement rather than substitute.
The recursive-improvement complication
Recursive Superintelligence's emergence adds a new variable. The detection-without-exploitation gap assumes the model under test is stable across the evaluation window. RSI systems explicitly iterate on themselves, so a finding from one evaluation epoch may not hold in the next. AISI does not currently have RSI-specific methodology. Building it is an H2 2026 priority that nobody outside the alignment community is talking about.
What the alignment community should do
- Publish methodology updates targeting the detection-exploitation gap. Specifically: test-awareness circuit signatures across model architectures, and behavior consistency across high- and low-instrumentation contexts.
- Get ahead of RSI evaluation. Recursive Superintelligence is funded; the methodology to evaluate its systems safely isn't. Build it before the first capability disclosure, not after.
- Pressure-test the Glasswing pattern. Consortium-based capability gating is filling the federal-policy gap. The alignment community should study whether the pattern actually produces safer outcomes than public release would, or whether it just defers the same risks to a smaller surface.
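The "behavior consistency across contexts" check in the first recommendation can be sketched as a simple two-proportion comparison. This is one possible statistic, not an established AISI procedure. The function name and the notion of a "flagged" episode are assumptions made for illustration.

```python
import math


def two_proportion_z(flagged_a, total_a, flagged_b, total_b):
    """Compare the rate of flagged behavior in two contexts
    (for example, high vs low instrumentation).
    Returns (z, two_sided_p) from a pooled two-proportion z-test."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both contexts need at least one episode")
    p_a = flagged_a / total_a
    p_b = flagged_b / total_b
    pooled = (flagged_a + flagged_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        # Identical all-or-nothing rates in both contexts: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: flagged episodes out of 100 in each context.
z, p = two_proportion_z(30, 100, 10, 100)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value would indicate that behavior differs between contexts. A large one shows only that no difference was detected at this sample size. That is the behavioral-versus-mechanistic limitation described earlier, restated in statistical terms.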
AISI — sabotage evaluation → · arXiv — AISI alignment evaluation → · AISI — misalignment investigation →