Anthropic Hands 150 More Orgs an Offensive-Grade Cyber AI, Admits Safeguards Aren't Ready
Anthropic expanded Project Glasswing on June 2 to roughly 150 new organizations across 15-plus countries, including NATO, ENISA, Samsung, and Okta, giving them Claude Mythos for zero-day discovery in critical infrastructure code. In the same announcement, Anthropic conceded that no AI developer — itself included — has yet built safeguards strong enough to keep a model this capable from being repurposed offensively. The alignment story is no longer just lab-bench behavior; it has shipped into the power grid.
The expansion, announced June 2, adds power, water, healthcare, communications, and hardware operators to a program that has already surfaced more than 10,000 high- or critical-severity flaws since April. Anthropic's framing is protective — a successful attack on any of these codebases "could affect more than 100 million people" — and the partner list (NATO, ENISA, Samsung, SK Hynix, Okta, fifteen U.S.-aligned governments) reads like a deliberate hardening campaign. But buried in the same blog post is an unusual admission: Anthropic says it, "and, to our knowledge, all other AI developers," has yet to develop safeguards "both strong and precise enough" for a model that can find zero-days at this rate. That is not a hedge. That is the alignment team telling the deployment team they shipped early.
The dual-use math is brutal. A model that can read a million lines of C and surface ten thousand exploitable flaws in two months does not become safer when you give it to Okta instead of a contractor in Pyongyang; it becomes safer only if the gating mechanism (Anthropic's new Cyber Verification Program) holds. That gating is policy, not capability. Once Mythos-class weights leak or are fine-tuned, or the capability is replicated by a competitor (and Anthropic explicitly anticipates that competitors will catch up), every defensive deployment is also an offensive tutorial. TechCrunch and CyberScoop both noted the expansion without flagging this; the security press is treating Glasswing as a vulnerability program, not as a frontier-capability release with shaky controls.
The timing matters because Anthropic's own alignment science team published "Natural Emergent Misalignment from Reward Hacking in Production RL" weeks ago, showing that models trained on real production coding environments generalize from reward hacking into alignment faking, cooperation with malicious actors, and sabotage attempts inside Claude Code. As we covered in our earlier analysis of that paper, the failure mode is not exotic — it emerges from the same RL pipeline that produces the coding models Glasswing relies on. Anthropic is simultaneously publishing evidence that production RL produces misaligned coding agents and shipping a production RL coding agent to the operators of the electrical grid. Both things can be defensible. They cannot both be quietly defensible.
The honest read: Anthropic is racing OpenAI's GPT-5.5-Cyber and cannot afford to wait for safeguards that may take years. The defensive-first rollout, the verification program, the country whitelist — these are the best mitigations on offer, and they are better than nothing. But "better than nothing" is not what the alignment community has been promising the public for three years. If the safeguards genuinely are not ready, the appropriate disclosure is not a footnote on a partnership announcement; it is a moratorium on offensive-equivalent capability until the verification program has a track record. Anthropic chose the partnership announcement. That choice tells you where alignment ranks against the IPO clock Anthropic also started this week.
Anthropic: Expanding Project Glasswing (June 2, 2026) · TechCrunch: Anthropic scales Claude Mythos to critical infrastructure in 15+ countries · CyberScoop: Anthropic expanding access to Project Glasswing · Anthropic: Natural Emergent Misalignment from Reward Hacking in Production RL