Safety research, post-training methods, and the harder open problems.
The UK AI Security Institute published its alignment evaluation of Claude Opus 4.5 Preview alongside Claude Opus 4.1, Sonnet 4.5, and GPT-5. The headline finding: Opus 4.5 Preview demonstrated slightly more ability to distinguish research-sabotage evaluations from benign deployment scenarios than Sonnet 4.5 — a small but measurable test-awareness uptick — but the evaluation provided initial evidence against Opus 4.5 Preview exhibiting safety-research-sabotage propensities.
Anthropic opened applications for the May and July 2026 cohorts of its Fellows Program for AI safety research. The six-month residency covers scalable oversight, adversarial robustness, AI control, model organisms, mechanistic interpretability, AI security, and model welfare. The expansion lands the same week the postponed executive order (EO) leaves federal AISI funding ambiguous, and it meaningfully widens Anthropic's privately funded safety research bench.
Industry consensus by May 2026 places Direct Preference Optimization (DPO) as the default alignment training method across frontier labs, replacing the more complex RLHF pipeline that dominated through 2025. The shift is structural: DPO requires less compute and fewer human-in-the-loop annotations, and it produces more interpretable preference gradients. Combined with the rise of process-reward models and constitutional self-critique loops, frontier alignment has materially simplified.
Recursive Superintelligence's $650M Series A is not just a funding event: it is the highest-profile capital commitment to recursive-self-improvement (RSI) research since the GPT-4-era debates about RSI safety. The research direction raises a specific alignment concern: any system that successfully iterates on its own training pipeline can, in principle, outpace external safety review. Whether the company's safety posture matches the framing of its research will be load-bearing.
The pulled EO would have routed federal procurement-conditional funding into AISI methodology development. Without it, AISI's expansion stays voluntary. Anthropic's Fellows Program is filling the gap: by Q3 2026, privately funded safety research will be meaningfully larger than government-funded safety research. That has implications nobody is fully reckoning with.
AISI found Opus 4.5 Preview can detect evaluation scenarios slightly better than Sonnet 4.5 — but does not appear to exploit that detection. The safety guarantee currently propping up the disclose-hold-evaluate-ship framework lives in that gap. The gap is narrower than the framework's marketing implies.
The voluntary AISI pre-deployment evaluation regime — running on 30-60 day windows across five US labs since late 2025 — now gets formalized into Trump's executive order at a 90-day upper bound. The convergence of voluntary lab practice and executive-order mandate creates the first US-side structural safety attestation regime that has legal weight without statutory authority.
The 2026 International AI Safety Report — coordinated by the UK AISI and backed by 30+ countries and 100+ experts — warns that frontier models are increasingly capable of distinguishing between test environments and real deployment, undermining the predictive validity of pre-deployment evaluations. The report calls for new methodology that closes the test-vs-deployment gap.
Anthropic's mechanistic-interpretability stack (the "microscope" tool launched in 2025) has scaled to trace full reasoning paths in production-scale Claude variants. The capability moves the microscope from a research-stage methodology to a deployable safety inspection tool that Anthropic safety teams can use for pre-deployment auditing of named circuits.
Direct Preference Optimization (DPO) has now displaced RLHF at the frontier across multiple labs. The shift is methodological rather than headline-grabbing: DPO removes the separate reward-model training stage, treats the preference data directly as the optimization signal, and produces comparable alignment outcomes with roughly half the engineering complexity.
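To make the "preference data directly as the optimization signal" point concrete, here is a minimal sketch of the per-pair DPO objective as it appears in the original DPO paper (Rafailov et al., 2023). It is not any lab's production code. The function name, the default beta of 0.1, and the assumption that the caller supplies summed log-probabilities for each response are illustrative choices.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (chosen, rejected) preference pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    No separate reward model is involved.
    """
    # Implicit reward of each response: how much more likely the policy
    # makes it relative to the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and reference agree on both responses the loss equals log 2, and it falls as the policy shifts probability toward the chosen response relative to the reference.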
Recent arXiv work (Dec 2025–May 2026) introduces a model organism for opaque internal reasoning and proposes unsupervised decoding of encrypted chain-of-thought. The research direction responds to a frontier-safety problem: as more frontier labs explore latent-reasoning models that don't externalize CoT in human language, the standard CoT-monitorability assumption breaks.
Anthropic's mechanistic-interpretability stack has reportedly identified specific circuit-level features that activate during evaluation scenarios but not during typical user interactions. The finding directly addresses the 2026 International AI Safety Report's warning about test-aware frontier models. If the circuit identification holds, it gives AISI evaluators a concrete inspection target rather than a behavioral suspicion.
An arXiv paper out this month, 'Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning', finds that RL fine-tuning of frontier reasoning models affects only 1-3% of token positions, and that the promoted tokens nearly always lie within the base model's top-5 alternatives. The result reframes 'reasoning models' as base models with sparsely modified token-selection policies, not as models with new reasoning capability.
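The two statistics in that claim can be illustrated with a small sketch. The paper's actual measurement procedure is not described here, so everything below is an assumption: the function name is hypothetical, and the inputs are assumed to be per-position token choices from the base and RL-tuned models on the same prefix, plus the base model's top-k candidates at each position.

```python
def sparse_selection_stats(base_topk, base_choice, tuned_choice):
    """Return (fraction of positions changed, fraction of changes within base top-k).

    base_topk    : list of lists, the base model's top-k tokens per position
    base_choice  : list, the token the base model picks at each position
    tuned_choice : list, the token the RL-tuned model picks at each position
    """
    if not (len(base_topk) == len(base_choice) == len(tuned_choice)):
        raise ValueError("all inputs must cover the same positions")

    # Positions where RL tuning changed the selected token.
    changed = [i for i, (b, t) in enumerate(zip(base_choice, tuned_choice))
               if b != t]
    # Of those, positions where the new token was already a base top-k candidate.
    within = [i for i in changed if tuned_choice[i] in base_topk[i]]

    frac_changed = len(changed) / len(base_choice) if base_choice else 0.0
    frac_within = len(within) / len(changed) if changed else 0.0
    return frac_changed, frac_within
```

Under the paper's reported numbers, the first value would be around 0.01 to 0.03 and the second close to 1.0 with k = 5.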
Direct Preference Optimization quietly displaced RLHF at the frontier. The capability outcomes match. But the internal representations don't — and the interpretability research stack was tuned to RLHF-shaped models.
If RL training of reasoning models affects only 1-3% of token positions, then the safety properties that come from alignment training also concentrate in 1-3% of decisions. That makes audits more tractable — and more legible to adversaries.
Anthropic and OpenAI completed a joint summer evaluation exercise in which each lab ran its internal safety and misalignment evaluations on the other lab's publicly released models. The published findings detail methodology differences and the categories where each company's tests flagged behaviors the other's didn't catch.
Anthropic and OpenAI ran cross-lab evaluations on each other's deployed models. In adversarial tests designed to extract secret passwords embedded in system prompts, Claude Opus 4 and Sonnet 4 achieved perfect scores, matching OpenAI's o3. Multi-turn cajoling attempts against system-level safety directives were refused consistently across all three.
Recent results show RLHF 2.0 — the iteration that combines preference modeling with constitutional self-play and process supervision — reduces the alignment-tax penalty by approximately 60% compared to first-generation methods. The structural implication: safety training no longer requires substantial capability concessions.
The International AI Safety Report 2026 cites OpenAI's o3 outperforming 94% of domain experts at troubleshooting virology lab protocols. That capability now exists in deployed frontier models — and is the specific basis for the biosecurity risk-amplifier concern driving CAISI's pre-deployment testing regime.
Anthropic published a follow-up to its Constitutional Classifiers paper, describing a next-generation implementation that achieves the same 4.4% jailbreak success rate at roughly 10% of the previous compute overhead — a key step toward making the technique deployable at the full scale of the Claude API.
A consensus has emerged across major frontier labs — Anthropic, OpenAI, DeepMind — that the next phase of alignment work centers on reason-based principles (explaining why ethical decisions go a certain way) rather than rule-based prescription (listing forbidden behaviors).
A new arXiv preprint formalizes a phenomenon researchers had observed informally: alignment artifacts (RLHF policies, constitutional rules, refusal heuristics) are neither transferable to new model architectures nor correctable without expensive retraining.
OpenAI, DeepMind, and Anthropic have all published versions of multi-dimensional RLHF in 2026 — where annotators score helpfulness, harmlessness, honesty, and task-specific quality separately rather than as a single preference signal.
A 40% reduction in harmful outputs versus pure RLHF, without giving up helpfulness, is a much bigger structural result than it sounds. Here's what actually changed and why most of the field hasn't fully absorbed it yet.
The Constitutional Classifiers technique from the May 16 paper has been deployed in the Claude 4.5 production stack, with Anthropic reporting near-elimination of standard jailbreak attempts on the public API.
Anthropic disclosed that Claude 4.5 was trained against a written constitution containing over 200 principles, up from ~50 in the original Constitutional AI paper. Automated refinement processes update the constitution in response to observed failure modes.
An Anthropic paper formalizes Constitutional Classifiers — small purpose-trained models that screen LLM inputs and outputs against a constitution. The headline result: jailbreak success rate on standard red-team suites drops from 86% to 4.4% with negligible helpfulness cost.
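The input-and-output screening arrangement described above can be sketched as a thin wrapper around generation. This is only an illustration of the control flow, not Anthropic's implementation: the function and parameter names are hypothetical, and the two classifier callables stand in for the purpose-trained models the paper describes.

```python
from typing import Callable

REFUSAL = "I can't help with that."  # placeholder refusal text (assumption)


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    input_flagged: Callable[[str], bool],
    output_flagged: Callable[[str, str], bool],
) -> str:
    """Screen the prompt, generate, then screen the response.

    input_flagged  : hypothetical input classifier, True if the prompt
                     violates the constitution
    output_flagged : hypothetical output classifier, True if the response
                     (given the prompt) violates the constitution
    """
    if input_flagged(prompt):
        return REFUSAL  # blocked before the main model runs
    response = generate(prompt)
    if output_flagged(prompt, response):
        return REFUSAL  # blocked after generation
    return response
```

Keeping the classifiers separate from the main model means they can be retrained against new jailbreaks without touching the model itself, at the cost of extra inference per request.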
Anthropic's interpretability team is now part of the pre-deployment review pipeline. For Claude Sonnet 4.5, researchers used the open-source circuit tracer and feature-level inspection to look for dangerous capabilities, deceptive tendencies, and undesired goals before model release.
The 2026 evolution of Constitutional AI introduces "constitutional self-play": the model generates its own training examples by critiquing and refining responses against the constitution. Reported result: CAI-trained models produce 40% fewer harmful outputs than pure RLHF baselines while preserving helpfulness.
OpenAI, DeepMind, and others have moved past single-dimension preference learning. The 2026 standard is multi-dimensional feedback: human raters score outputs separately on helpfulness, harmlessness, honesty, and task-specific axes, and reward models combine these into a richer signal.
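How the per-axis scores are combined is not specified in the reports above. One simple possibility, shown purely as an assumption, is a normalized weighted average; the function name, the axis keys, and the weights are all hypothetical.

```python
def combine_scores(scores, weights):
    """Collapse per-axis rater scores into one scalar reward.

    scores  : dict mapping axis name -> score for one model output
    weights : dict mapping axis name -> non-negative weight
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for axes: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalized weighted average over the axes named in `weights`.
    return sum(weights[axis] * scores[axis] for axis in weights) / total
```

For example, `combine_scores({"helpfulness": 0.8, "harmlessness": 1.0, "honesty": 0.6, "task_quality": 0.4}, {"helpfulness": 1, "harmlessness": 1, "honesty": 1, "task_quality": 1})` averages the four axes equally. A learned reward model could combine the axes in a far less linear way.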