Vision, audio, video, embodied — beyond text-only.
An arXiv paper titled 'Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models' (arXiv 2504.13351) introduces a prompting strategy where Vision Language Models progressively integrate information from each modality to refine task plans for robotic manipulation. The structural innovation is that the methodology works without retraining — it's a prompting protocol that elicits multimodal reasoning from existing VLMs.
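A minimal sketch of what a modality-by-modality prompting loop could look like, assuming each modality has already been reduced to a text summary. The function name `chain_of_modality`, the `query_vlm` callable, and the prompt wording are all hypothetical illustrations, not the prompts or interfaces used in the paper.

```python
from typing import Callable, Dict, List


def chain_of_modality(
    query_vlm: Callable[[str], str],   # hypothetical wrapper around any VLM call
    modality_summaries: Dict[str, str],
    order: List[str],
    task: str,
) -> str:
    """Refine a task plan by showing the model one modality at a time."""
    plan = ""
    for name in order:
        # Each prompt carries the plan so far plus one new modality,
        # so later modalities refine rather than replace earlier reasoning.
        prompt = (
            f"Task: {task}\n"
            f"Current plan: {plan or '(none yet)'}\n"
            f"New {name} evidence: {modality_summaries[name]}\n"
            "Revise the plan using this evidence."
        )
        plan = query_vlm(prompt)
    return plan
```

No model weights are touched anywhere in the loop; the only state carried between steps is the plan text, which is what makes this a prompting protocol rather than a training method.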
Google's Gemini Omni (officially launched on or around May 19-20) is the first top-tier AI foundation model to ship native video generation paired with chat-based editing. The integration delivers a substantially different UX from the standalone-model pattern (Veo 3.1, Sora 2, Kling 3.0): users can iterate on video output through chat without switching to a separate generation tool.
Kuaishou's Kling 3.0 (released earlier in May, with the storyboard mode update arriving this week) formalizes multi-shot narrative video generation through a structured storyboard interface. Users specify shot sequences with per-shot prompts and continuity constraints; the model generates a connected narrative video that maintains character and setting consistency across the sequence. The capability sets the production-tier baseline for narrative video generation.
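To make the storyboard idea concrete, here is a sketch of how a shot sequence with per-shot prompts and continuity constraints might be represented before it is sent to a model. The `Shot` and `Storyboard` classes and their fields are hypothetical and do not reflect Kling's actual interface or request format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shot:
    prompt: str                       # per-shot text prompt
    duration_s: float                 # intended shot length in seconds
    characters: List[str] = field(default_factory=list)
    setting: str = ""


@dataclass
class Storyboard:
    cast: List[str]                   # characters that must stay consistent
    shots: List[Shot]

    def total_duration(self) -> float:
        return sum(shot.duration_s for shot in self.shots)

    def continuity_issues(self) -> List[str]:
        """List shots with empty prompts or characters missing from the cast."""
        issues = []
        for i, shot in enumerate(self.shots, start=1):
            if not shot.prompt.strip():
                issues.append(f"shot {i}: empty prompt")
            for name in shot.characters:
                if name not in self.cast:
                    issues.append(f"shot {i}: unknown character {name!r}")
        return issues
```

Checking continuity on the client side like this is one design option; a storyboard-aware model would presumably enforce consistency during generation as well.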
An MIT CSAIL paper by Yichen Li and Antonio Torralba (arXiv 2510.02287) introduces a multimodal action-conditioned video generation approach that captures proprioception, kinesthesia, force haptics, and muscle activation as control signals. The architecture lets users condition video generation on fine-grained physical interaction signals rather than just text prompts — a meaningful step beyond the Sora/Veo/Kling text-to-video pattern.
Seedance 2.0 (released Feb 9, 2026) accepts up to twelve mixed inputs in a single generation, drawn from per-type limits of nine images, three video clips, and three audio files. The multi-input architecture is structurally different from Veo 3.1, Sora 2, and Kling 3.0's predominantly text-to-video framing, and the model holds the #1 spot on the Artificial Analysis Video Arena leaderboard for both text-to-video and image-to-video.
Gemini Omni ships native video plus chat editing in a single conversational surface. Seedance 2.0 accepts nine images, three video clips, and three audio files in a single generation. Two different architectural bets, two different production-creative outcomes, both reinforcing the consumer-vs-production bifurcation.
Kling 3's storyboard mode update formalizes multi-shot narrative video. The MIT action-conditioned video paper extends multimodal conditioning into physical-control signals. The production-creative video stack has settled into three tiers serving distinct workflow stages. Pipelining across them is increasingly the default, not the exception.
Google's Gemini 3.5 Flash hit 76.2% on Terminal-Bench 2.1, 1656 Elo on GDPval-AA, and 83.6% on MCP Atlas at launch this week. The numbers put Flash within striking distance of full-Pro frontier models on coding and agentic benchmarks while shipping at Flash-tier pricing. It's the first explicit demonstration that 'Flash' no longer means 'small/cheap/limited' — it means 'frontier capability with latency-and-cost optimizations.'
Google began rolling out Gemini Omni Flash to AI Plus, Pro, and Ultra subscribers on May 19 via the Gemini app and Flow creative studio. The Flash tier of Google's unified multimodal model marks the first time a single model that natively accepts text, image, audio, and video in one prompt has been delivered as a consumer subscription product rather than a research preview.
Kuaishou's Kling 3.0 added a multi-shot storyboard mode in May 2026, with native audio sync maintained across cuts. The release positions Kling as the first model to support an end-to-end short-film generation pipeline (multiple shots, continuous audio, scene continuity) inside a single model rather than as an orchestration of single-shot calls.
ByteDance's Seedance 2.0 holds the #1 spot on the Artificial Analysis Video Arena leaderboard with Elo 1269 text-to-video and Elo 1351 image-to-video — ahead of Kling 3.0, Google Veo 3, and OpenAI Sora 2 across both axes. The result lands as Sora's web product shuts down and as Kling 3.0 ships multi-shot storyboard mode.
OpenAI confirmed it is discontinuing the Sora web and app experiences, with the Sora API scheduled to follow later in 2026. The announcement clears product surface for a presumed unified-multimodal successor and concedes the standalone-video-generator product category to Veo, Kling, and Seedance.
Gemini 3.5 Flash hits 76.2% Terminal-Bench at Flash pricing. Seedance 2.0 takes the #1 spot on the Artificial Analysis video leaderboard. Two different labs, two different modalities, same architectural move: the cheap tier now ships frontier capability.
Google's Gemini Omni Flash shipped to subscribers. OpenAI killed Sora's web product. Kling 3.0 added multi-shot storyboard mode. Three signals, one architectural shift: unified-multimodal owns the consumer tier, pipeline-orchestration owns the production-creative tier.
Google announced Gemini Omni at I/O 2026 (May 19) — a unified multimodal model that accepts text, image, audio, and video in a single prompt and reasons across all four modalities to produce a video output. The release positions Google as the lead in the all-in-one-model approach to multimodal generation.
ByteDance's Seedance 2.0 (February 2026) accepts up to nine images, three video clips, and three audio files in a single generation, capped at twelve mixed inputs in total. By comparison, Sora 2 and Kling 3.0 take one to two image references; Veo 3.1 takes one to two images plus one to two video clips. Multimodal-input depth is the new differentiation axis.
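The per-type maxima (nine, three, three) sum to fifteen, while the stated total is twelve. One reading is that both limits apply at once. The sketch below encodes that reading as a pre-flight check; the `InputLimits` type, the `check_inputs` helper, and the combined-cap interpretation are assumptions for illustration, not a documented Seedance API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InputLimits:
    max_images: int
    max_videos: int
    max_audio: int
    max_total: int


# Assumed reading of the figures above: per-type caps plus an overall cap.
SEEDANCE_2_LIMITS = InputLimits(max_images=9, max_videos=3, max_audio=3, max_total=12)


def check_inputs(limits: InputLimits, images: int, videos: int, audio: int) -> List[str]:
    """Return a list of limit violations; an empty list means the mix is allowed."""
    problems = []
    if images > limits.max_images:
        problems.append(f"too many images: {images} > {limits.max_images}")
    if videos > limits.max_videos:
        problems.append(f"too many video clips: {videos} > {limits.max_videos}")
    if audio > limits.max_audio:
        problems.append(f"too many audio files: {audio} > {limits.max_audio}")
    total = images + videos + audio
    if total > limits.max_total:
        problems.append(f"too many inputs overall: {total} > {limits.max_total}")
    return problems
```

Under this reading, nine images plus two clips and one audio file passes, while maxing out every type at once (fifteen inputs) fails only the overall cap.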
OpenAI announced in March 2026 that the Sora web and app experiences would be discontinued on April 26, 2026, with the API following on September 24. The shutdown reflects a shift in OpenAI's strategy away from standalone video generation and toward integrating video capabilities into ChatGPT and its successors.
Google's Veo 3.1 generates true 4K (3840×2160) video at up to 60fps with synchronized audio — dialogue, ambient sound, and effects — generated alongside the video in a single pass. ByteDance's Seedance 2.0 raises the multimodal bar further: up to 9 images, 3 video clips, and 3 audio files as inputs to a single generation, plus native lip-sync in 8+ languages.
1X Technologies started shipping NEO units to early adopter customers at $20,000 outright or $499/month subscription. The deliveries follow the Hayward factory opening (May 15) and the publicly disclosed first-year production target of 10,000 units.
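For a rough sense of how the two price points compare, the arithmetic below finds the month at which cumulative subscription fees first reach the purchase price. It is a simplification that assumes flat pricing and ignores financing, service terms, and resale value, none of which are stated above; `breakeven_months` is an illustrative helper.

```python
import math


def breakeven_months(purchase_price: float, monthly_fee: float) -> int:
    """Smallest whole number of months where total fees >= purchase price."""
    return math.ceil(purchase_price / monthly_fee)


months = breakeven_months(20_000, 499)
print(months)          # 41
print(499 * months)    # 20459 paid in subscription fees by that point
```

At the listed prices, the subscription costs less in total for the first 40 months (499 × 40 = 19,960) and more from month 41 onward.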
ByteDance's Seedance 2.0 currently sits at #1 on the Artificial Analysis Video Arena leaderboard across both text-to-video (Elo 1,269) and image-to-video (Elo 1,351) — ahead of Kling 3.0, Veo 3.1, and the now-deprecated Sora 2.
Google's Veo 3.1 ships native true-4K (3840×2160) output at up to 60fps, with synchronized audio (ambient sound, dialogue, and sound effects) generated alongside the video in a single forward pass. This is the highest combination of native resolution, frame rate, and audio from any production video model.
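The scale of that output is easier to see as raw pixel throughput. The snippet below compares 3840×2160 at 60fps with 1920×1080 at 30fps; the 1080p30 baseline is an assumption chosen for comparison, not a figure from any vendor.

```python
def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Raw pixels a generator must produce per second of video."""
    return width * height * fps


uhd_60 = pixels_per_second(3840, 2160, 60)   # 497,664,000
fhd_30 = pixels_per_second(1920, 1080, 30)   # 62,208,000 (assumed baseline)

print(uhd_60 // fhd_30)  # 8
```

Four times the pixels per frame and twice the frames per second works out to eight times the pixels per second of the baseline.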
Google released Veo 3.1, the latest evolution of its Veo video generation line. The headline feature: 1-2 image references plus 1-2 video clip references per generation, optimized for conversion-oriented production rather than raw realism.
Seedance 2.0 ships unified multimodal video generation with up to twelve mixed inputs per generation, drawn from per-type limits of 9 images, 3 video clips, and 3 audio files. The flexibility makes it the most controllable video model on the market.
1X Technologies opened its NEO Factory in Hayward, California — described as America's first vertically-integrated humanoid robot factory. The 58,000-sq-ft facility targets 10,000 units in year one, scaling to 100,000 by end of 2027.
NVIDIA's open Nemotron 3 Nano Omni unifies vision, audio, and language processing in a single model, claiming up to 9x efficiency improvement for agent workloads versus equivalent stacks of specialist models.
Qwen 3.5 Omni (released March 30) is a native multimodal model handling text, audio, video, and real-time interaction. Real-time audio time-to-first-token comes in below 300ms with 95%+ ASR accuracy — the relevant numbers for actual voice-assistant deployment.