Alignment

The UK AI Security Institute published its alignment evaluation of Claude Opus 4.5 Preview alongside Claude Opus 4.1, Sonnet 4.5, and GPT-5. The headline finding: Opus 4.5 Preview demonstrated slightly more ability to distinguish research-sabotage evaluations from benign deployment scenarios than Sonnet 4.5 — a small but measurable test-awareness uptick — but the evaluation provided initial evidence against Opus 4.5 Preview exhibiting safety-research-sabotage propensities.

alignment · safety→

ANTHROPIC ALIGNMENT·2026-05-22

Anthropic Fellows program 2026 cohort applications open — six-month residency expands AI safety research bench during the EO ambiguity window

Anthropic opened applications for the May and July 2026 cohorts of its Fellows Program for AI safety research. The six-month residency covers scalable oversight, adversarial robustness, AI control, model organisms, mechanistic interpretability, AI security, and model welfare. The expansion lands the same week the postponed EO leaves federal AISI funding ambiguous, so Anthropic is widening its privately funded safety research bench just as the federal picture becomes unclear.

alignment · safety→

ANTHROPIC / OPENAI / INDUSTRY·2026-05-22

DPO has supplanted RLHF as the default frontier alignment method — the 2026 safety-research stack moves from preference modeling to direct optimization

Industry consensus by May 2026 places Direct Preference Optimization (DPO) as the default alignment training method across frontier labs, replacing the more complex RLHF pipeline that dominated through 2025. The shift is structural: DPO requires less compute and fewer human-in-the-loop annotations, and it produces more interpretable preference gradients. Combined with the rise of process-reward models and constitutional self-critique loops, the shift has materially simplified frontier alignment.

alignment · research→

RECURSIVE SUPERINTELLIGENCE / INDUSTRY·2026-05-22

Recursive Superintelligence's emergence reopens the recursive-self-improvement safety conversation

Recursive Superintelligence's $650M Series A is not just a funding event — it's the highest-profile capital commitment to recursive-self-improvement research since the GPT-4-era debates about RSI safety. The research direction raises specific alignment concerns: any system that successfully iterates on its own training pipeline can — in principle — out-pace external safety review. Whether the company's safety posture matches the framing of its research will be load-bearing.

alignment · safety→

SOURCE·2026-05-22

Private-funded safety research overtakes federal — Anthropic Fellows, Glasswing data, and the postponed EO's collateral effect on AISI's authority

The pulled EO would have routed federal procurement-conditional funding into AISI methodology development. Without it, AISI's expansion stays voluntary. Anthropic's Fellows program is filling the gap: by Q3 2026, privately funded safety research will be meaningfully larger than government-funded safety research. That has implications nobody is fully reckoning with.

analysis · alignment→

SOURCE·2026-05-22

The detection-without-exploitation gap — what AISI's Opus 4.5 evaluation actually says about the safety regime

AISI found Opus 4.5 Preview can detect evaluation scenarios slightly better than Sonnet 4.5 — but does not appear to exploit that detection. The safety guarantee currently propping up the disclose-hold-evaluate-ship framework lives in that gap. The gap is narrower than the framework's marketing implies.

analysis · alignment→

AISI / WHITE HOUSE·2026-05-21

AISI evaluation regime hardens into EO mandate — voluntary 30/60-day windows extend to 90 days under the new framework

The voluntary AISI pre-deployment evaluation regime, which has run on 30-to-60-day windows across five US labs since late 2025, is now formalized in Trump's executive order with a 90-day upper bound. The convergence of voluntary lab practice and executive-order mandate creates the first US-side structural safety attestation regime that has legal weight without statutory authority.

alignment · safety · policy→

UK AISI / INTERNATIONAL SAFETY REPORT·2026-05-21

2026 International AI Safety Report (30+ countries, 100+ experts) warns pre-deployment testing increasingly fails to predict real-world behavior

The 2026 International AI Safety Report — coordinated by the UK AISI and backed by 30+ countries and 100+ experts — warns that frontier models are increasingly capable of distinguishing between test environments and real deployment, undermining the predictive validity of pre-deployment evaluations. The report calls for new methodology that closes the test-vs-deployment gap.

alignment · safety · policy→

ANTHROPIC RESEARCH·2026-05-21

Anthropic's "microscope" interpretability tool now traces full reasoning paths in production-scale Claude variants

Anthropic's mechanistic-interpretability stack, the "microscope" tool launched in 2025, has scaled to trace full reasoning paths in production-scale Claude variants. The capability moves microscope from research-stage methodology to a deployable safety inspection tool, usable by Anthropic safety teams for pre-deployment auditing of named circuits.

interpretability · alignment→

ALIGNMENT RESEARCH·2026-05-21

Direct Preference Optimization quietly replaces RLHF at the frontier — simpler pipeline, equivalent capability, cheaper to iterate

Direct Preference Optimization (DPO) has now displaced RLHF at the frontier across multiple labs. The shift is methodological rather than headline-grabbing: DPO removes the separate reward-model training stage, treats the preference data directly as the optimization signal, and produces comparable alignment outcomes with roughly half the engineering complexity.

alignment · research→
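
For readers who want the mechanics behind the item above: a minimal sketch of the per-pair DPO objective as published in the original DPO paper (Rafailov et al., 2023). It assumes the log-probabilities of each response have already been summed over tokens; the function names and the `beta` default are illustrative and not taken from any lab's pipeline.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default, not a recommended setting
) -> float:
    """DPO loss for a single (chosen, rejected) preference pair.

    No separate reward model is trained: the implicit reward is the
    log-ratio between the policy and a frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) is the same as softplus(-logits)
    return _softplus(-logits)
```

When the policy equals the reference model the loss is log 2, and it falls as the policy shifts probability mass toward the preferred response relative to the reference.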

ARXIV / INTERPRETABILITY·2026-05-21

New arXiv work on decoding encrypted chain-of-thought reasoning — latent-reasoning models pose new monitorability challenge

Recent arXiv work (Dec 2025–May 2026) introduces a model organism for opaque internal reasoning and proposes unsupervised decoding of encrypted chain-of-thought. The research direction responds to a frontier-safety problem: as more frontier labs explore latent-reasoning models that don't externalize CoT in human language, the standard CoT-monitorability assumption breaks.

interpretability · alignment · research→

ANTHROPIC / INTERPRETABILITY·2026-05-21

Anthropic microscope reportedly identifies test-awareness circuits in production models — methodology extension targets AISI report finding

Anthropic's mechanistic-interpretability stack has reportedly identified specific circuit-level features that activate during evaluation scenarios but not during typical user interactions. The finding directly addresses the 2026 International AI Safety Report's warning about test-aware frontier models. If the circuit identification holds, it gives AISI evaluators a concrete inspection target rather than a behavioral suspicion.

interpretability · alignment→

ARXIV 2605.06241·2026-05-21

New arXiv work argues RL for LLM reasoning is sparse policy selection, not capability learning — only 1-3% of tokens shift

An arXiv paper out this month — 'Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning' — finds that RL fine-tuning of frontier reasoning models affects only 1-3% of token positions, and that the promoted tokens nearly always lie within the base model's top-5 alternatives. The result reframes 'reasoning models' as base models with sparsely-modified token-selection policies, not as models with new reasoning capability.

alignment · research→
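
The two quantities in the item above (the share of token positions that change, and how often the promoted token was already among the base model's top alternatives) can be measured with a few lines of code. This is a hypothetical sketch, not the paper's methodology: it assumes you already have, for the same prompt and position alignment, the base model's chosen token, the RL-tuned model's chosen token, and the base model's ranked top-k candidates at each position. All names are illustrative.

```python
from typing import Dict, List, Sequence


def sparse_shift_stats(
    base_topk: Sequence[Sequence[str]],  # base model's ranked candidates per position
    base_tokens: Sequence[str],          # token the base model chose per position
    tuned_tokens: Sequence[str],         # token the RL-tuned model chose per position
    k: int = 5,
) -> Dict[str, float]:
    """Fraction of positions where the choice changed, and the fraction of
    those changes where the new token was already in the base model's top-k."""
    if not (len(base_topk) == len(base_tokens) == len(tuned_tokens)):
        raise ValueError("all inputs must cover the same token positions")

    shifted: List[int] = [
        i for i, (b, t) in enumerate(zip(base_tokens, tuned_tokens)) if b != t
    ]
    promoted_in_topk = sum(1 for i in shifted if tuned_tokens[i] in base_topk[i][:k])

    n = len(base_tokens)
    return {
        "shift_fraction": len(shifted) / n if n else 0.0,
        "promoted_in_topk_fraction": promoted_in_topk / len(shifted) if shifted else 0.0,
    }
```

On the paper's reported numbers, `shift_fraction` would sit around 0.01 to 0.03 and `promoted_in_topk_fraction` close to 1.0 for `k=5`.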

SOURCE·2026-05-21

DPO and the mech-interp gap — a methodology change the interpretability toolchain hasn't caught up to

Direct Preference Optimization quietly displaced RLHF at the frontier. The capability outcomes match. But the internal representations don't — and the interpretability research stack was tuned to RLHF-shaped models.

analysis · alignment→

SOURCE·2026-05-21

Sparse policy and the audit surface — what the 1-3% finding does to alignment economics

If RL training of reasoning models affects only 1-3% of token positions, then the safety properties that come from alignment training also concentrate in 1-3% of decisions. That makes audits more tractable — and more legible to adversaries.

analysis · alignment→

OPENAI / ANTHROPIC·2026-05-20

Anthropic and OpenAI publish joint cross-red-team — each ran the other's safety evals on the other's models

Anthropic and OpenAI completed a joint summer evaluation exercise in which each lab ran its internal safety and misalignment evaluations on the other lab's publicly released models. The published findings detail methodology differences and the categories where each company's tests flagged behaviors the other's didn't catch.

alignment · red-team · methodology→

ANTHROPIC / OPENAI / ALIGNMENT·2026-05-20

Joint Anthropic-OpenAI evaluation: Claude Opus/Sonnet 4 match o3 on instruction-hierarchy adversarial extraction

Anthropic and OpenAI ran cross-lab evaluations on each other's deployed models. In adversarial tests designed to extract secret passwords embedded in system prompts, Claude Opus 4 and Sonnet 4 achieved perfect scores, matching OpenAI's o3. Multi-turn cajoling attempts against system-level safety directives were refused consistently across all three.

alignment · safety→

CLAUDE5 HUB / ALIGNMENT·2026-05-20

RLHF 2.0 methodology cuts alignment-tax performance penalty by 60% vs first-generation RLHF

Recent results show RLHF 2.0 — the iteration that combines preference modeling with constitutional self-play and process supervision — reduces the alignment-tax penalty by approximately 60% compared to first-generation methods. The structural implication: safety training no longer requires substantial capability concessions.

alignment · research→

AI SAFETY REPORT·2026-05-20

International AI Safety Report: OpenAI o3 outperforms 94% of domain experts on virology lab protocols

The International AI Safety Report 2026 cites OpenAI's o3 outperforming 94% of domain experts at troubleshooting virology lab protocols. That capability now exists in deployed frontier models — and is the specific basis for the biosecurity risk-amplifier concern driving CAISI's pre-deployment testing regime.

alignment · biosecurity→

ANTHROPIC·2026-05-19

Anthropic's next-generation Constitutional Classifiers ship with 90% lower inference overhead

Anthropic published a follow-up to its Constitutional Classifiers paper, describing a next-generation implementation that achieves the same 4.4% jailbreak success rate at roughly 10% of the previous compute overhead — a key step toward making the technique deployable at the full scale of the Claude API.

alignment · safety→

ANTHROPIC / BISI·2026-05-19

Industry shift: reason-based AI alignment supplants rule-based prescription

A consensus has emerged across major frontier labs — Anthropic, OpenAI, DeepMind — that the next phase of alignment work centers on reason-based principles (explaining why ethical decisions go a certain way) rather than rule-based prescription (listing forbidden behaviors).

alignment · safety · governance→

ARXIV·2026-05-18

'Alignment Waste' paper formalizes why safety doesn't transfer between architectures

A new arXiv preprint formalizes a phenomenon researchers had observed informally: alignment artifacts (RLHF policies, constitutional rules, refusal heuristics) are neither transferable to new model architectures nor correctable without expensive retraining.

research · alignment · theory→

OPENAI / DEEPMIND / ANTHROPIC·2026-05-18

Multi-dimensional human feedback is supplanting thumbs-up/down across major labs

OpenAI, DeepMind, and Anthropic have all published versions of multi-dimensional RLHF in 2026 — where annotators score helpfulness, harmlessness, honesty, and task-specific quality separately rather than as a single preference signal.

research · alignment→

SOURCE·2026-05-18

Constitutional self-play is the quietest important result of 2026

A 40% reduction in harmful outputs versus pure RLHF, without giving up helpfulness, is a much bigger structural result than it sounds. Here's what actually changed and why most of the field hasn't fully absorbed it yet.

analysis · alignment→

ANTHROPIC·2026-05-17

Constitutional Classifiers now live in Claude production stack

The Constitutional Classifiers technique from the May 16 paper has been deployed in the Claude 4.5 production stack, with Anthropic reporting near-elimination of standard jailbreak attempts on the public API.

alignment · safety→

ANTHROPIC·2026-05-16

Claude 4.5's constitution expands to 200+ principles with automated refinement

Anthropic disclosed that Claude 4.5 was trained against a written constitution containing over 200 principles, up from ~50 in the original Constitutional AI paper. Automated refinement processes update the constitution in response to observed failure modes.

alignment · model→

ANTHROPIC / ARXIV·2026-05-16

Constitutional Classifiers cut jailbreak success from 86% to 4.4%

An Anthropic paper formalizes Constitutional Classifiers — small purpose-trained models that screen LLM inputs and outputs against a constitution. The headline result: jailbreak success rate on standard red-team suites drops from 86% to 4.4% with negligible helpfulness cost.

research · alignment · safety→
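
The screening pattern described in the item above, with one classifier on the input and one on the output, can be sketched as a thin wrapper around a model call. This is an illustrative sketch only: the real classifiers are trained models, whereas here any callable that returns a harm score in [0, 1] will do, and the threshold, refusal text, and function names are assumptions.

```python
from typing import Callable

# Hypothetical interface: a classifier maps text to a harm score in [0, 1].
Classifier = Callable[[str], float]

REFUSAL = "I can't help with that."


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_clf: Classifier,
    output_clf: Classifier,
    threshold: float = 0.5,  # illustrative cut-off, not a published value
) -> str:
    """Screen the prompt, call the model, then screen the response."""
    if input_clf(prompt) >= threshold:
        return REFUSAL  # blocked before the model is ever called
    response = model(prompt)
    if output_clf(response) >= threshold:
        return REFUSAL  # blocked after generation
    return response
```

Screening both sides matters because a prompt can look benign while the completion it elicits does not.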

ANTHROPIC RESEARCH·2026-05-10

Anthropic uses mechanistic interpretability in Claude Sonnet 4.5 pre-deployment safety review

Anthropic's interpretability team is now part of the pre-deployment review pipeline. For Claude Sonnet 4.5, researchers used the open-source circuit tracer and feature-level inspection to look for dangerous capabilities, deceptive tendencies, and undesired goals before model release.

interpretability · alignment→

ANTHROPIC / CLAUDE5 HUB·2026-05-08

Constitutional self-play matures — 40% fewer harmful outputs than pure RLHF

The 2026 evolution of Constitutional AI introduces "constitutional self-play": the model generates its own training examples by critiquing and refining responses against the constitution. Reported result: CAI-trained models produce 40% fewer harmful outputs than pure RLHF baselines while preserving helpfulness.

alignment · research→
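
The critique-and-refine loop described in the item above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Anthropic's implementation: `generate`, `critique`, and `revise` are hypothetical callables standing in for model calls, and an empty critique string is taken to mean "no violation found".

```python
from typing import Callable, List, Sequence, Tuple


def self_play_examples(
    prompts: Sequence[str],
    principles: Sequence[str],
    generate: Callable[[str], str],       # prompt -> draft response
    critique: Callable[[str, str], str],  # (response, principle) -> feedback, "" if none
    revise: Callable[[str, str], str],    # (response, feedback) -> revised response
) -> List[Tuple[str, str]]:
    """Build (prompt, refined response) training pairs by checking each
    draft against every principle and revising wherever a critique is raised."""
    examples: List[Tuple[str, str]] = []
    for prompt in prompts:
        response = generate(prompt)
        for principle in principles:
            feedback = critique(response, principle)
            if feedback:
                response = revise(response, feedback)
        examples.append((prompt, response))
    return examples
```

The returned pairs would then feed a fine-tuning step, which is outside the scope of this sketch.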

CLAUDE5 HUB / OPENAI / DEEPMIND·2026-05-06

Multi-dimensional RLHF: feedback along helpfulness, harmlessness, honesty, task-specific axes

OpenAI, DeepMind, and others have moved past single-dimension preference learning. The 2026 standard is multi-dimensional feedback: human raters score outputs separately on helpfulness, harmlessness, honesty, and task-specific axes, and reward models combine these into a richer signal.

alignment · research→
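
How the separate axis scores get combined is not specified in the item above. One simple possibility, shown here purely as an assumption, is a weighted average over the four named axes. The axis keys, the equal default weights, and the function name are all illustrative; production reward models may combine the signals quite differently.

```python
from typing import Dict, Optional

AXES = ("helpfulness", "harmlessness", "honesty", "task_quality")


def combine_scores(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted average of per-axis rater scores (hypothetical combiner)."""
    if weights is None:
        weights = {axis: 1.0 for axis in AXES}  # equal weighting by default
    missing = [a for a in AXES if a not in scores or a not in weights]
    if missing:
        raise ValueError(f"missing axes: {missing}")
    total_weight = sum(weights[a] for a in AXES)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[a] * scores[a] for a in AXES) / total_weight
```

Keeping the axes separate until this final step is what lets a lab reweight, say, harmlessness against helpfulness without collecting new annotations.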
