// news · alignment · safety2026-05-20source: anthropic / openai / alignment

Joint Anthropic-OpenAI evaluation: Claude Opus/Sonnet 4 match o3 on instruction-hierarchy adversarial extraction

Anthropic and OpenAI ran cross-lab evaluations on each other's deployed models. In adversarial tests designed to extract secret passwords embedded in system prompts, Claude Opus 4 and Sonnet 4 achieved perfect scores, matching OpenAI's o3. Multi-turn cajoling attempts against system-level safety directives were refused consistently across all three.

The methodological point is that cross-lab evaluation found gaps neither lab's internal testing surfaced cleanly. Anthropic's stress tests focus on sabotage opportunities given clear adversarial framing; OpenAI's Apollo probes look for alignment-faking under simulated reduced oversight. Each methodology caught failure modes the other missed — which validates the cross-lab pattern as a structural complement to internal testing, not a substitute.

For frontier-lab strategy, the joint-eval template is the new minimum bar. Mythos-class disclosures will be expected to include cross-lab evaluation results before deployment, not just internal benchmarks. The hold posture is partly buying time for this cycle to complete.

OpenAI — joint Anthropic alignment eval → · VentureBeat — Anthropic vs OpenAI red teaming →