METR's cross-lab evaluation protocol and the internal-agent misalignment frontier
METR's pilot misalignment assessments of internal-developer AI agents at Anthropic, Google, Meta, and OpenAI is the first cross-lab framework for a risk category nobody was tracking 12 months ago. Combined with Anthropic's formal recognition of automated-R&D risks, the field is operationalizing internal-tooling alignment evaluation faster than the capability inflection arrives.
METR completing pilot misalignment evaluations of internal-developer AI agents at all four major US frontier labs is the kind of structural alignment infrastructure that doesn't generate cycle-day headlines but defines the field's operational readiness for an entire risk class.
What internal-developer agents are and why they matter
Frontier labs run substantial portions of their research and engineering work through AI assistants — Anthropic uses Claude Code, OpenAI uses Codex, similar patterns at Google and Meta. These internal-tooling agents operate outside the externally-shipped model evaluation perimeter. They get access to private codebases, internal documents, and frontier-model training pipelines. The misalignment risk profile is distinct from external-model risk because the deployment context is structurally higher-stakes.
The evaluation-perimeter gap METR closed
Through 2024-2025, no systematic cross-lab framework evaluated internal-tooling alignment risk. Each lab ran its own internal evaluation methodology with no comparability across labs. METR's pilot established a cross-lab protocol with all four labs participating; the cumulative effect is that internal-agent misalignment becomes a measurable, comparable risk category for the first time.
The Anthropic risk-report alignment signal
Anthropic's 2026 Risk Report formalizing 'Risks from automated R&D' as a discrete category with external METR review is the institutional procedural anchor. Anthropic's risk-category formalization signals the lab considers automated R&D a load-bearing operational risk through H2 2026; METR's external review establishes the cross-lab evaluation infrastructure if the risk materializes.
The race-condition framing for H2 2026
The field is operationalizing internal-tooling risk evaluation faster than the capability inflection in that domain is arriving. This is the correct direction — the alignment community lagging capability would be the failure mode. The current setup gives the field operational evaluation capacity ahead of automated-R&D capability becoming materially load-bearing; that's the structural alignment win H2 2026 can deliver.
What this connects to in the broader cycle
The IASR 2026 funding-pool operationalization (AM cycle) plus METR's cross-lab pilot plus Anthropic's risk-category formalization together produce the most coordinated alignment-infrastructure quarter of the past five years. The H2 2027 alignment-research output should reflect the cumulative infrastructure investment landing now.
METR — Model Evaluation and Threat Research → · Anthropic — Research →