// news · alignment · 2026-06-20 · source: alignment.anthropic / openai

Anthropic-OpenAI cross-lab evaluation finds OpenAI's o3 and o4-mini aligned at or above Anthropic's own models in simulated safeguard-disabled settings

The pilot Anthropic-OpenAI cross-lab alignment evaluation finds that, in simulated test settings with some model-external safeguards disabled, OpenAI's o3 and o4-mini reasoning models are aligned as well as or better than Anthropic's own models overall. The cross-lab evaluation pattern itself, in which each lab runs its own safety suite against the other lab's models, is more consequential than the specific result.

The substantive piece is the cross-lab evaluation precedent. Before 2026, the dominant safety-evaluation pattern was 'lab evaluates its own models against its own internal evals', which is useful but trivially gameable by selecting evaluation criteria that flatter the model. The Anthropic-OpenAI cross-eval pattern inverts this: each lab runs its own internal evals against the other lab's public models, and the two labs publish their findings in parallel. The pattern is robust to single-lab gaming because the evaluating lab has no incentive to flatter the other lab's models.

The o3 and o4-mini result is interesting but secondary to the methodology's adoption. It is also bounded: 'in simulated test settings with some model-external safeguards disabled' describes an aggressive evaluation environment that does not reflect production deployment conditions. The substantive longitudinal question is whether the cross-lab evaluation pattern persists past the pilot phase into permanent operational practice. Two cycles of cross-lab evaluation per year, with published results, would materially restructure the field's understanding of relative model safety.


Sources: Anthropic Alignment, "Findings from a Pilot Anthropic-OpenAI Alignment Evaluation Exercise" · OpenAI, "Findings from a pilot Anthropic–OpenAI alignment evaluation exercise: OpenAI Safety Tests"