Agents

Google DeepMind shares progress on AlphaEvolve, a Gemini-powered coding agent, with applications now extending across multiple scientific and technical domains.

agents→

CURSOR / SHAREUHACK·2026-05-22

Cursor Composer 2.5 becomes the in-IDE default — Build in Parallel + cloud agent dev environments + MS Teams clear the procurement bar

Cursor's Composer 2.5 (May 18 release) matched Opus 4.7 and GPT-5.5 on coding benchmarks at $0.50/M input / $2.50/M output. The new version added cloud agent dev environments, Microsoft Teams integration, and Build in Parallel — concurrent sub-agent execution on the same git working tree. The combination is the strongest model-agnostic in-IDE offer currently available.

agents · tools→

COGNITION / LUSHBINARY·2026-05-22

Devin 3 hits 90% on SWE-bench Verified — Cognition completes Windsurf acquisition at $250M and bundles Devin inside the IDE

Cognition's Devin 3 model now clears 90% on SWE-bench Verified, the first score consistently above the 90% threshold from any autonomous engineering agent. Cognition has completed its acquisition of Windsurf (the remaining stake after Google's earlier $2.4B acqui-hire of the founders) for $250M. The combination bundles Devin Cloud and Devin Terminal CLI inside the Windsurf IDE; Windsurf Pro rises to $20/month, and a new Max tier costs $200/month.

agents · tools→

GOOGLE / TECHCRUNCH·2026-05-22

Gemini 3.5 Flash becomes default in the Gemini app and AI Mode in Search — Google bets the next wave on agents, not chatbots

Google flipped Gemini 3.5 Flash to default across both the Gemini app and AI Mode in Search globally this week. The model outperforms 3.1 Pro on coding and agentic benchmarks while running 4× faster on output tokens per second. The default-tier flip is the operational signal Google has been telegraphing since I/O — the new product surface is agentic, and Flash is the price point Google wants users to inhabit.

frontier-models · agents→

GOOGLE / BLOG.GOOGLE·2026-05-22

Gemini Spark runs on dedicated cloud VMs — the persistent personal agent moves from local extension to always-on cloud service

Google's Gemini Spark, the personal AI agent introduced at I/O, runs on dedicated virtual machines in Google Cloud and stays available 24/7 — even when the user's device is off. Spark is powered by Gemini 3.5 Flash via the full Antigravity pipeline, has cross-app access to the user's Gmail, Calendar, Drive, Photos, and YouTube history, and autonomously runs multi-step tasks on the user's behalf.

agents · frontier-models→

INDUSTRY / MCP ECOSYSTEM·2026-05-22

MCP server registry explosion continues — over 800 production MCP servers indexed as the agent-tool integration protocol consolidates

The Model Context Protocol (MCP) server registry now indexes over 800 production-quality MCP servers across enterprise SaaS, devtools, cloud infrastructure, and internal tooling integrations. The 2026 H1 cadence has been roughly 100-150 new servers per month — MCP has effectively become the OAuth-for-AI-agents standard, with most enterprise software vendors now shipping or planning an MCP integration as the default agent-access surface.

tools · agents→

COGNITION / WINDSURF / TOOLRADAR·2026-05-22

Windsurf 2.0 + Devin bundling clarifies — quota-priced autonomous engineering vs per-token model routing now the defining IDE-tools dichotomy

Windsurf 2.0 ships with Devin Cloud and Devin Terminal CLI bundled inside the IDE. Pro rises from $15 to $20/month, and a new Max tier at $200/month includes unlimited Devin Cloud agent runs. The Adaptive Model Router auto-selects between Devin and the IDE's standard coding models based on task complexity. The Cognition-Windsurf integration is the cleanest 'autonomous engineering as a bundled SKU' offer currently on the market.

agents · tools→

SOURCE·2026-05-22

The default agent tier shifts — Gemini 3.5 Flash becomes the always-on model behind Spark, Search, and Antigravity

Google flipped Gemini 3.5 Flash to default in the Gemini app and AI Mode in Search globally. Spark runs on dedicated cloud VMs powered by 3.5 Flash. Antigravity 2.0 already ships Flash as default backend. Three product surfaces, one model — Google's bet is that the agent layer wins by making the cheapest model the universal default.

analysis · agents→

SOURCE·2026-05-22

The two-vendor coding-agent split is now real — quota-bundled autonomous engineering vs per-token model routing

Devin 3 hits 90% SWE-bench Verified. Cognition completes Windsurf at $250M. Cursor Composer 2.5 ships Build in Parallel. The agent-IDE market just settled into a clean two-vendor split with materially different pricing models. Both are defensible. Procurement teams can finally pick on operating model, not capability.

analysis · agents→

GOOGLE / ANTIGRAVITY·2026-05-21

Google Antigravity 2.0 bundles Gemini 3.5 Flash by default — Google enters the in-IDE agent category seriously

Google's Antigravity 2.0 release bundles Gemini 3.5 Flash as the default backend and lands as a credible third entrant to the in-IDE agent category alongside Cursor and Windsurf. The pairing of Antigravity's IDE workflow with Flash-tier pricing makes Google the first major-lab vendor to package model and IDE as a single subscription rather than as separate procurement decisions.

tools · agents · industry→

GOOGLE / ANTIGRAVITY·2026-05-21

Google Antigravity 2.0 wires Gemini 3.5 Flash as default backend — first major-lab IDE-plus-model bundled SKU

Google's Antigravity 2.0 IDE now ships with Gemini 3.5 Flash as the default backend, bundling model and IDE under a single Google AI subscription. The pairing makes Google the first major-lab vendor to integrate model and IDE as one procurement decision rather than two. With Flash hitting 76.2% Terminal-Bench, the bundling is no longer a capability compromise.

tools · agents→

CURSOR·2026-05-21

Cursor 2.5 ships Build in Parallel + Microsoft Teams integration — coding-agent UX consolidates around concurrent execution

Cursor's 2.5 release added Build in Parallel (concurrent sub-agent execution on the same code state), Microsoft Teams integration, and matched Opus 4.7 and GPT-5.5 on benchmarks at $0.50/M input / $2.50/M output. The Teams integration is the procurement-friendly part of the release — enterprise buyers running M365 get IDE collaboration without a separate identity layer.

agents · tools→

CURSOR·2026-05-21

Cursor Composer 2.5 ships multi-agent orchestration — parallel sub-agents for refactor, test, doc generation in one IDE session

Cursor's Composer 2.5 update adds multi-agent orchestration: a planner agent decomposes a task into sub-tasks, then dispatches parallel sub-agents for refactor, test-writing, and documentation generation against the same code state. The update lands as a direct competitive response to Claude Code's terminal-native multi-agent workflows and Devin's cloud-agent pattern.
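
The planner-plus-parallel-sub-agents pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not Cursor's implementation: `plan`, `run_sub_agent`, `orchestrate`, and the `SubTask` type are hypothetical names, and the sub-agent is a stub that returns a string where a real one would call a model and edit files.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class SubTask:
    role: str          # e.g. "refactor", "test", "docs"
    instruction: str


def plan(task: str) -> list[SubTask]:
    """Hypothetical planner: split one task into role-specific sub-tasks."""
    return [
        SubTask("refactor", f"Refactor code for: {task}"),
        SubTask("test", f"Write tests for: {task}"),
        SubTask("docs", f"Document changes for: {task}"),
    ]


def run_sub_agent(sub_task: SubTask, code_state: str) -> str:
    """Stub sub-agent (assumption): a real one would call a model and edit files."""
    return f"[{sub_task.role}] done on {code_state}: {sub_task.instruction}"


def orchestrate(task: str, code_state: str) -> list[str]:
    sub_tasks = plan(task)
    # Every sub-agent receives the same code_state snapshot and runs concurrently.
    with ThreadPoolExecutor(max_workers=len(sub_tasks)) as pool:
        # pool.map returns results in the order the sub-tasks were planned.
        return list(pool.map(lambda st: run_sub_agent(st, code_state), sub_tasks))
```

Calling `orchestrate("rename User.id", "commit abc123")` returns one result string per sub-agent. A production system would also need to reconcile conflicting edits against the shared code state, which this sketch leaves out.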

agents · tools→

GOOGLE / CNBC·2026-05-21

Gemini Spark personal agent enters beta — Google launches 24/7 task-running agent across connected apps

Google launched Gemini Spark, a 24/7 personal AI agent that can reason across connected Google apps, into beta this week alongside Gemini 3.5 Flash. Initial availability is restricted to Google AI Ultra subscribers and a small trusted-tester cohort. Spark joins OpenAI's Operator and Anthropic's Claude Cowork in the same-week launch cadence — the personal-agent tier is now a saturated market.

agents · frontier-models→

MCP ECOSYSTEM·2026-05-21

MCP server registry crosses 4,000 published servers — protocol-level lock-in compounds

The Model Context Protocol server registry crossed 4,000 published servers in May 2026 — roughly a 6× growth since the start of the year. The vast majority are open-source and community-maintained, covering everything from cloud-provider APIs to enterprise SaaS integrations. The growth confirms MCP as the de facto integration standard for agentic tooling.

tools · agents→

COGNITION / WINDSURF·2026-05-21

Windsurf 2.0 Cascade agents + Spaces task management mature — pricing pivots to quota-based at $20/mo Pro, $200/mo Max

Cognition's Windsurf 2.0 — launched April 15 and refined through May — now ships Cascade agents and Spaces task management as the default workflow surface. The pricing model also pivoted from credit-based to quota-based on March 19: $20/month Pro (up from $15), with a new $200/month Max tier. Devin Cloud and Devin Terminal CLI ship bundled into every paid tier.

tools · agents→

COGNITION / WINDSURF·2026-05-21

Windsurf 2.0 bundles Devin Cloud + Devin Terminal CLI into the IDE — autonomous agents become a default IDE feature

Cognition's Windsurf 2.0 release bundles Devin Cloud and Devin Terminal CLI inside the IDE itself. The change makes autonomous cloud agents a first-class IDE feature rather than a separate product. After Devin's price drop to $20/month Core + ACU usage, the bundled experience eliminates the friction that kept most developers on Cursor's editing-first workflow.

agents · tools · industry→

SOURCE·2026-05-21

Agent orchestration becomes the moat — the model layer is no longer where lock-in lives

When Cursor and Windsurf both ship multi-agent IDE workflows in the same week, the strategic question stops being "which model is best" and starts being "which orchestration layer captures the developer."

analysis · agents→

SOURCE·2026-05-21

Agent surface bifurcation — three distinct moats, three different races

Gemini Spark ships personal agents to consumers. Cursor 2.5 ships parallel sub-agents to IDEs. Windsurf 2.0 ships autonomous cloud agents bundled with Devin. Three product categories, three different moats, three different races. The 'agent market' is becoming three markets.

analysis · agents→

COGNITION·2026-05-20

Cognition slashes Devin price from $500/mo to $20/mo Core + $2.25/ACU — autonomous coding tier pricing resets

Cognition cut Devin's entry price from $500/month Team to $20/month Core plus $2.25 per Agent Compute Unit. The previous floor was the cleanest moat in autonomous coding agents; the new floor is competitive with Copilot/Cursor's $20 tier. The category just collapsed from premium to mass-market pricing in a single move.
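
To see what the new pricing means in practice, the arithmetic can be sketched as follows. The figures ($20 Core, $2.25 per ACU, $500 Team) come from the item above; the assumptions are that ACUs are billed linearly in whole units with no included allotment, and the function names are illustrative only.

```python
CORE_BASE_CENTS = 2_000    # $20/month Core
ACU_CENTS = 225            # $2.25 per Agent Compute Unit
OLD_TEAM_CENTS = 50_000    # previous $500/month Team price


def core_monthly_cents(acus: int) -> int:
    """Monthly Core bill in cents, assuming linear per-ACU billing (assumption)."""
    if acus < 0:
        raise ValueError("acus must be non-negative")
    return CORE_BASE_CENTS + ACU_CENTS * acus


def max_acus_within_old_price() -> int:
    """Largest whole number of ACUs for which Core costs no more than $500."""
    return (OLD_TEAM_CENTS - CORE_BASE_CENTS) // ACU_CENTS
```

Under these assumptions, a user can consume up to 213 ACUs in a month before the Core bill exceeds the old $500 floor. Working in integer cents avoids floating-point rounding in the comparison.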

agents · pricing · industry→

GITHUB / MICROSOFT·2026-05-20

GitHub Copilot agent mode reaches GA on JetBrains — multi-IDE agentic coding now baseline

GitHub Copilot's agent mode is now generally available on JetBrains in addition to VS Code, completing the multi-IDE rollout that started in late 2025. Combined with the March 2026 agentic code review release, Copilot now spans context-gathering, autonomous PR drafting, and review-stage gating across the two largest IDE ecosystems.

agents · tools · industry→

INDUSTRY ANALYSIS·2026-05-20

The 2026 default developer stack: Cursor for editing + Claude Code for autonomous tasks

Professional-developer survey data converges on a clear 2026 default: Cursor for in-IDE editing, Claude Code as a terminal-native agent for complex multi-file tasks. The single-tool-rules-all framing has dissolved into a multi-tool workflow where each agent owns a different surface area.

tools · agents · industry→

GOOGLE / CNBC·2026-05-20

Google ships Gemini 3.5 Flash and Spark agent — finally a credible answer to ChatGPT and Claude

Google used the May 19-20 I/O keynote to ship Gemini 3.5 Flash (half-to-one-third the price of frontier peers, now default in the Gemini app and AI Mode in Search globally) plus Gemini Spark, a general-purpose agent that reasons across connected apps and takes action on the user's behalf. Spark is in beta for Google AI Ultra subscribers and trusted testers starting next week.

frontier-models · agents · google→

INDUSTRY / MCP ECOSYSTEM·2026-05-20

MCP-native becomes the new baseline for agent tooling — Claude Code, Cursor, Codex all support; Copilot partial

Model Context Protocol (MCP) support has become the baseline qualifier for serious agent tooling in 2026. Claude Code is fully MCP-native; Cursor and Codex support MCP servers via config; GitHub Copilot has partial support; most autonomous agents (Devin, Replit Agent) are still building their MCP layers. The protocol is consolidating into a de facto standard.

tools · agents→

MULTIPLE LABS·2026-05-20

Multi-agent orchestration becomes table stakes — 8 majors shipped parallel-agent modes in one cycle

Within a two-week window in February 2026, every major coding agent shipped multi-agent capabilities: Grok Build (8 parallel agents), Windsurf (5 parallel agents), Claude Code Agent Teams, Codex CLI (Agents SDK), Devin (parallel cloud sessions). May 2026 followups: GPT-5.3-Codex-Spark on Cerebras WSE-3 hits 1,000+ tokens/second per agent.

agents · orchestration→

SWE-BENCH / AGGREGATED·2026-05-20

SWE-bench Verified leaderboard: Mythos 93.9%, GPT-5.5 88.7%, Opus 4.7 87.6%, Cursor 86%

The May 2026 SWE-bench Verified leaderboard now has 44 evaluated models. Claude Mythos Preview leads at 93.9%, making it the first model to clear 90% on the canonical real-GitHub-issue-fix benchmark. GPT-5.5 follows at 88.7%, Claude Opus 4.7 (Adaptive) at 87.6%, Cursor's Composer 2.5 at around 86%, and GPT-5.3-Codex at 85.0%.

agents · benchmark→

SOURCE·2026-05-20

Agent-merge automation: what 93%-class agents change about software supply chains

When SWE-bench Verified clears 90%, the failure pattern flips. Agents are right by default; the human review step becomes audit rather than authorship. The CI redesign that follows is bigger than the model release.

analysis · agents · tools→

ARXIV 2510.06261·2026-05-19

AlphaApollo: deep agentic reasoning system decomposes complex tasks via foundation-model interleaving

AlphaApollo, described in a new arXiv preprint, presents a deep agentic reasoning architecture in which foundation models interleave explicit reasoning steps, tool queries, and tool outputs in a single unified loop. Initial benchmarks suggest substantial gains on long-horizon scientific reasoning tasks.
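
The interleaving of reasoning steps, tool queries, and tool outputs in one loop can be illustrated with a toy sketch. This is not the paper's implementation: the `scripted_model` stand-in, the `calc` tool, and the trace format are all assumptions invented for illustration, with a hard-coded script in place of a real foundation model.

```python
from typing import Callable

# Hypothetical tool registry; "calc" only sums integers separated by "+".
TOOLS: dict[str, Callable[[str], str]] = {
    "calc": lambda expr: str(sum(int(x) for x in expr.split("+"))),
}


def scripted_model(trace: list[tuple[str, str]]) -> tuple[str, str]:
    """Stand-in for a foundation model: picks the next step from the trace so far."""
    kinds = [kind for kind, _ in trace]
    if "reason" not in kinds:
        return ("reason", "I need the sum of 2 and 3.")
    if "tool_output" not in kinds:
        return ("tool_query", "calc:2+3")
    return ("answer", f"The sum is {trace[-1][1]}.")


def run(max_steps: int = 10) -> list[tuple[str, str]]:
    trace: list[tuple[str, str]] = []
    for _ in range(max_steps):
        kind, content = scripted_model(trace)
        trace.append((kind, content))
        if kind == "tool_query":
            # Execute the tool and feed its output back into the same trace.
            name, _, arg = content.partition(":")
            trace.append(("tool_output", TOOLS[name](arg)))
        elif kind == "answer":
            break
    return trace
```

The single `trace` list is the point of the sketch: reasoning, queries, and tool results share one history that the model reads on every step, and that same list doubles as an inspectable record of the run.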

research · agents · reasoning→

ANTHROPIC·2026-05-19

Anthropic raises Claude Code weekly limits 50% through July 13 — fueled by SpaceX/Colossus capacity

Anthropic announced a temporary 50% increase in Claude Code weekly usage limits through July 13, 2026. The expansion stacks on top of the earlier doubling of the 5-hour limits (May 6) and is fueled by the SpaceX/Colossus 1 compute deal that came online in late April.

agents · tools→

ANTHROPIC / LEADERBOARDS·2026-05-19

Claude Code holds 78.4% SWE-bench Verified lead over Codex, Cursor, Devin, Replit

Updated SWE-bench Verified leaderboards put Claude Code at 78.4%, ahead of OpenAI Codex at 71.0%, Cursor agent at 67.2%, Devin at 60.8%, and Replit Agent 3 at 54.1%. The 7.4-point gap to second place is the widest single-agent lead the benchmark has seen.

frontier-models · agents · benchmark→

GITHUB / MICROSOFT·2026-05-19

GitHub Copilot Pro and Pro+ move to AI Credits flex billing on June 1

GitHub Copilot Pro and Pro+ will move to AI Credits-based flex billing on June 1, 2026 — preserving the $10/month Pro and $39/month Pro+ price points but switching from unlimited usage to credit pools that draw against a monthly allocation.

tools · agents→

CURSOR·2026-05-19

Cursor Composer 2.5 ships May 18 — Opus 4.7 / GPT-5.5 parity at $0.50 input / $2.50 output per M tokens

Cursor released Composer 2.5 on May 18 — its own in-house coding model that benchmarks at parity with Claude Opus 4.7 and GPT-5.5 on SWE-bench Verified, at prices of $0.50 per million input tokens and $2.50 per million output. The release confirms Cursor as a vertically-integrated model builder, not just a tooling wrapper.
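
As a rough guide to what those per-token prices mean for a single request, the cost can be computed as below. The two rates come from the item above; the function name and the example token counts are illustrative assumptions, and any caching discounts or plan-level quotas are ignored.

```python
from decimal import Decimal

INPUT_USD_PER_M = Decimal("0.50")    # USD per million input tokens
OUTPUT_USD_PER_M = Decimal("2.50")   # USD per million output tokens
MILLION = Decimal(1_000_000)


def request_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Cost of one request in USD at the quoted per-million-token rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = INPUT_USD_PER_M * input_tokens + OUTPUT_USD_PER_M * output_tokens
    return total / MILLION
```

For example, a request with 200,000 input tokens and 20,000 output tokens costs $0.15 at these rates: $0.10 for input plus $0.05 for output.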

agents · tools · model→

WINDSURF·2026-05-19

Windsurf raises Pro to $20/month, ships new $200/month Max plan bundling Devin Cloud and CLI

Windsurf raised Pro from $15 to $20 per month and launched a new Max tier at $200/month that bundles Devin Cloud, the Devin Terminal CLI, and an Adaptive model router. The Max tier positions Windsurf as the only IDE bundling a full autonomous agent product at the high end.

tools · agents→

CURSOR / BLOOMBERG·2026-05-18

Cursor's revenue doubles in 90 days; $50B valuation trajectory emerging

Bloomberg reports that Cursor's revenue doubled in the most recent 90-day window, with active subscription seats well into the seven figures. Internal projections cited by sources suggest a $50B valuation in any 2026 fundraise, which would make Cursor the highest-valued private dev tools company.

agents · industry · tools→

CURSOR·2026-05-18

Cursor's long-running background agents reach scale with multi-repo workspaces

Cursor's long-running background agents — first shipped in early 2026 — have reached the scale where multi-repo agentic workspaces are routine. Users report running 8-16 concurrent agents across separate codebases for several hours unattended.

tools · agents→

COGNITION / REPLIT / CURSOR·2026-05-17

Devin, Replit Agent, and Cursor all converge on MCP-native architecture

The major autonomous coding agents have all shipped MCP-native support within the last 30 days: Devin (Cognition Labs), Replit Agent 3, and Cursor. Claude Code remains the reference implementation.

agents · tools · partnership→

REPLIT·2026-05-16

Replit Agent 3 ships 200-minute autonomous runs that deploy full-stack apps to a live URL

Replit shipped Agent 3 with a headline feature: 200-minute autonomous build sessions that culminate in a full-stack app deployed to a live URL — auth, database, frontend, and hosting all configured automatically.

tools · agents→

ARXIV 2512.14474·2026-05-12

Model-First Reasoning — explicit problem modeling cuts hallucinations in LLM agents

A May 2026 arXiv preprint introduces Model-First Reasoning (MFR): a paradigm where an LLM agent is required to construct an explicit problem model before proposing a solution. The reported effect is a sharp drop in hallucinated steps and a more inspectable trace.

research · research-papers · agents→

BLOGS.NVIDIA.COM·2026-05-12

NVIDIA and SAP partner on specialized enterprise agents

Joint effort to build specialized AI agents for enterprise workflows, with a stated emphasis on trustworthiness and reliability — the practical blockers slowing real production agent deployment.

agents · enterprise→

ANTHROPIC.COM·2026-05-05

Anthropic ships 10 financial-services agents + Claude Opus 4.7, plus $1.5B Blackstone-led JV

Anthropic launched a 10-agent finance pack deployable as Claude Cowork plugins, Claude Code, or headless Managed Agents — paired with Claude Opus 4.7 (64.37% on Vals AI Finance Agent benchmark, ahead of GPT-5.5's 59.96% and Gemini 3.1 Pro's 59.72%). One day earlier: a $1.5B JV with Blackstone, Hellman & Friedman, and Goldman Sachs.

agents · industry→

BLOGS.NVIDIA.COM·2026-04-28

NVIDIA ships Nemotron 3 Nano Omni — 30B hybrid Mamba-Transformer MoE (3B active), multimodal for agents

Nemotron 3 Nano Omni (April 28) unifies vision, audio, language, and text into one open multimodal model. The architecture is the interesting bit: a hybrid Mamba-Transformer MoE with 30B parameters and only 3B activated per forward pass.

open-source · models · agents→

NVIDIA DEVELOPER·2026-04-28

NVIDIA Nemotron 3 Super — 120B hybrid MoE (12B active) tuned for local agent deployment

NVIDIA's open Nemotron 3 Super lands as a 120B-parameter hybrid MoE with 12B active and a 1M-token context window. The explicit design target: local agent deployment with tool-augmented coding workloads.

open-source · models · agents→

CURSOR.COM·2026-04-02

Cursor 3 ships Agents Window — parallel multi-agent across multiple repos

Cursor 3 (April 2, 2026) introduces a dedicated Agents Window. Instead of one agent in one file, developers can run multiple agents across multiple repositories at the same time — each operating on its own task in its own context.

agents · tools→

OPENAI / MORPHLLM·2026-03-14

OpenAI Codex subagents reach GA — manager-worker model, up to 8 parallel

Codex's subagent feature went GA on March 14, 2026 with a manager-worker model supporting up to 8 parallel workers per task. As of May 2026 Codex still holds the top spot on the most-cited coding benchmark.

agents · tools→

NIST·2026-02-15

NIST launches dedicated standards initiative for autonomous AI agents

In February 2026, NIST opened a dedicated initiative to develop standards for autonomous AI agents — systems that take real-world actions without continuous human oversight. The framing is a direct response to incidents involving autonomous agents creating security vulnerabilities at scales existing frameworks weren't designed for.

policy · agents→
