Multimodal

An arXiv paper titled 'Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models' (arXiv 2504.13351) introduces a prompting strategy where Vision Language Models progressively integrate information from each modality to refine task plans for robotic manipulation. The structural innovation is that the methodology works without retraining — it's a prompting protocol that elicits multimodal reasoning from existing VLMs.
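
As a rough illustration of what a prompting protocol of this kind might look like, the sketch below feeds one modality at a time to a model and asks it to revise the running plan. It is not the paper's implementation: the function name `chain_of_modality`, the prompt wording, the example modalities, and the `stub_vlm` stand-in for a real VLM client are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def chain_of_modality(
    vlm: Callable[[str], str],
    task: str,
    modalities: Sequence[tuple[str, str]],
) -> str:
    """Refine a task plan one modality at a time.

    `vlm` is any text-in/text-out callable (a hypothetical stand-in for a
    real VLM client). `modalities` is an ordered list of
    (modality name, observation summary) pairs.
    """
    plan = ""
    for name, observation in modalities:
        # Each prompt carries the plan so far plus exactly one new modality.
        prompt = (
            f"Task: {task}\n"
            f"Current plan: {plan or '(none yet)'}\n"
            f"New {name} observation: {observation}\n"
            "Revise the plan using this observation."
        )
        plan = vlm(prompt)
    return plan


# Stub model for demonstration only: it records prompts and returns a label.
calls: list[str] = []


def stub_vlm(prompt: str) -> str:
    calls.append(prompt)
    return f"plan v{len(calls)}"


final_plan = chain_of_modality(
    stub_vlm,
    "open the jar",
    [
        ("video", "hand grasps the lid"),      # illustrative observations,
        ("audio", "click as the lid loosens"), # not taken from the paper
        ("force", "grip force peaks during the twist"),
    ],
)
print(final_plan)
```

No weights change anywhere in this loop, which is the point the item makes: the refinement lives entirely in how the prompts are sequenced.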

research-papers · multimodal→

GOOGLE / JXP·2026-05-22

Gemini Omni positions itself as the first frontier foundation model with native video generation plus chat-based editing: Veo, Sora, and Kling get a new competitor with deeper integration

Google's Gemini Omni (officially launched on or around May 19-20) becomes the first top-tier AI foundation model to ship native video generation paired with chat-based editing capabilities. The integration delivers a substantially different UX from the standalone-model pattern (Veo 3.1, Sora 2, Kling 3.0): users can iterate on video output through chat without re-routing to a separate generation tool.

multimodal · video→

KUAISHOU / AIMLAPI·2026-05-22

Kling 3 storyboard mode formalizes multi-shot narrative video — multi-shot consistency becomes the production-tier baseline

Kuaishou's Kling 3 (released earlier in May with the storyboard mode update this week) formalizes multi-shot narrative video generation through a structured storyboard interface. Users specify shot sequences with per-shot prompts and continuity constraints; the model generates a connected narrative video maintaining character and setting consistency across the sequence. That cross-shot consistency is now the baseline expected of production-tier narrative video generation.

multimodal · video→

ARXIV 2510 / MIT CSAIL·2026-05-22

MultiModal Action Conditioned Video Generation — MIT CSAIL paper opens fine-grained multimodal control beyond text-to-video

An MIT CSAIL paper by Yichen Li and Antonio Torralba (arXiv 2510.02287) introduces a multimodal action-conditioned video generation approach that captures proprioception, kinesthesia, force haptics, and muscle activation as control signals. The architecture lets users condition video generation on fine-grained physical interaction signals rather than just text prompts — a meaningful step beyond the Sora/Veo/Kling text-to-video pattern.

multimodal · research-papers→

BYTEDANCE / AIMLAPI·2026-05-22

ByteDance Seedance 2.0's twelve-input multimodal architecture defines the production-creative ceiling: up to 9 images, 3 video clips, and 3 audio files, twelve inputs at most per generation

Seedance 2.0 (released Feb 9, 2026) accepts up to twelve mixed inputs in a single generation, drawn from as many as nine images, three video clips, and three audio files. The multi-input architecture is structurally different from the predominantly text-to-video framing of Veo 3.1, Sora 2, and Kling 3.0, and the model holds the #1 spot on the Artificial Analysis Video Arena leaderboard for both text-to-video and image-to-video.

multimodal · video→

SOURCE·2026-05-22

The consumer-pipeline fork — Gemini Omni picks the unified path, Seedance 2.0 picks twelve-input multimodality

Gemini Omni ships native video plus chat editing in a single conversational surface. Seedance 2.0 accepts up to twelve mixed inputs in a single generation, drawn from as many as nine images, three video clips, and three audio files. Two different architectural bets, two different production-creative outcomes, both reinforcing the consumer-vs-production bifurcation.

analysis · multimodal→

SOURCE·2026-05-22

The three-tier video stack settles — Kling 3 for narrative, Seedance 2.0 for multi-input, Gemini Omni for consumer iteration

Kling 3's storyboard mode update formalizes multi-shot narrative video. The MIT action-conditioned video paper extends multimodal conditioning into physical-control signals. The production-creative video stack has settled into three tiers serving distinct workflow stages. Pipelining across them is increasingly the default, not the exception.

analysis · multimodal→

GOOGLE / ANTIGRAVITY·2026-05-21

Gemini 3.5 Flash hits 76.2% Terminal-Bench 2.1 and 1656 GDPval Elo — frontier-class capability at Flash-tier price

Google's Gemini 3.5 Flash hit 76.2% on Terminal-Bench 2.1, 1656 Elo on GDPval-AA, and 83.6% on MCP Atlas at launch this week. The numbers put Flash within striking distance of full-Pro frontier models on coding and agentic benchmarks while shipping at Flash-tier pricing. It's the first explicit demonstration that 'Flash' no longer means 'small/cheap/limited' — it means 'frontier capability with latency-and-cost optimizations.'

multimodal · frontier-models→

GOOGLE / DEEPMIND·2026-05-21

Gemini Omni Flash begins rolling out to AI Plus/Pro/Ultra subscribers: unified multimodal reaches general consumer availability

Google began rolling out Gemini Omni Flash to AI Plus, Pro, and Ultra subscribers on May 19 via the Gemini app and Flow creative studio. The rollout of the Flash tier of Google's unified multimodal model marks the first time a single model that natively accepts text, image, audio, and video in one prompt has shipped as a consumer subscription product rather than a research preview.

frontier-models · multimodal→

KUAISHOU / KLING·2026-05-21

Kling 3.0 multi-shot storyboard mode lands native audio sync across cuts — first end-to-end short-film pipeline in one model

Kuaishou's Kling 3.0 added a multi-shot storyboard mode in May 2026, with native audio sync maintained across cuts. The release positions Kling as the first model to support an end-to-end short-film generation pipeline (multiple shots, continuous audio, scene continuity) inside a single model rather than as an orchestration of single-shot calls.

multimodal · video→

BYTEDANCE / ARTIFICIAL ANALYSIS·2026-05-21

ByteDance Seedance 2.0 takes #1 on Artificial Analysis video-arena leaderboard — Elo 1351 image-to-video beats Kling, Veo, Sora

ByteDance's Seedance 2.0 holds the #1 spot on the Artificial Analysis Video Arena leaderboard with Elo 1269 text-to-video and Elo 1351 image-to-video — ahead of Kling 3.0, Google Veo 3, and OpenAI Sora 2 across both axes. The result lands as Sora's web product shuts down and as Kling 3.0 ships multi-shot storyboard mode.

multimodal · video→

OPENAI·2026-05-21

OpenAI discontinues Sora web/app experiences — API to follow in 2026, clearing surface for unified multimodal successor

OpenAI confirmed it is discontinuing the Sora web and app experiences, with the Sora API scheduled to follow later in 2026. The announcement clears product surface for a presumed unified-multimodal successor and concedes the standalone-video-generator product category to Veo, Kling, and Seedance.

multimodal · industry→

SOURCE·2026-05-21

The Flash multimodal tier arrives — Gemini 3.5 Flash and Seedance 2.0 redefine what 'cheap' delivers

Gemini 3.5 Flash hits 76.2% Terminal-Bench at Flash pricing. Seedance 2.0 takes the #1 spot on the Artificial Analysis video leaderboard. Two different labs, two different modalities, same architectural move: the cheap tier now ships frontier capability.

analysis · multimodal→

SOURCE·2026-05-21

Unified-vs-pipeline — the multimodal architecture bifurcation gets clearer

Google's Gemini Omni Flash shipped to subscribers. OpenAI killed Sora's web product. Kling 3.0 added multi-shot storyboard mode. Three signals, one architectural shift: unified-multimodal owns the consumer tier, pipeline-orchestration owns the production-creative tier.

analysis · multimodal→

GOOGLE / DEEPMIND·2026-05-20

Gemini Omni announced at Google I/O 2026 — unified multimodal model accepts text + image + audio + video in one prompt

Google announced Gemini Omni at I/O 2026 (May 19) — a unified multimodal model that accepts text, image, audio, and video in a single prompt and reasons across all four modalities to produce a video output. The release positions Google as the lead in the all-in-one-model approach to multimodal generation.

multimodal · frontier-models→

BYTEDANCE / SEEDANCE·2026-05-20

Seedance 2.0 accepts 12 mixed inputs per generation — multimodal-input depth is the new benchmark

ByteDance's Seedance 2.0 (February 2026) accepts up to twelve mixed inputs in a single generation, drawn from as many as nine images, three video clips, and three audio files. By comparison, Sora 2 and Kling 3.0 take one to two image references; Veo 3.1 takes one to two images plus one to two video clips. Multimodal-input depth is the new differentiation axis.
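
The figures in this item (a twelve-input total alongside per-type maximums of nine, three, and three) can be read as two separate limits, since the per-type maximums alone sum to fifteen. The sketch below checks a proposed input mix against both. Treating twelve as an overall cap is this sketch's reading of the numbers, and `validate_inputs` is a hypothetical helper, not part of any Seedance API.

```python
# Limits as quoted in the item; the overall cap is an assumed interpretation.
MAX_TOTAL = 12
MAX_PER_TYPE = {"image": 9, "video": 3, "audio": 3}


def validate_inputs(counts: dict[str, int]) -> list[str]:
    """Return a list of problems with an input mix; empty means it fits."""
    problems = []
    for kind, n in counts.items():
        if kind not in MAX_PER_TYPE:
            problems.append(f"unknown input type: {kind}")
        elif n > MAX_PER_TYPE[kind]:
            problems.append(f"{kind}: {n} exceeds cap of {MAX_PER_TYPE[kind]}")
    total = sum(counts.values())
    if total > MAX_TOTAL:
        problems.append(f"total: {total} exceeds cap of {MAX_TOTAL}")
    return problems


# Maxing out every type at once breaks the overall cap.
print(validate_inputs({"image": 9, "video": 3, "audio": 3}))
# A mix that respects both limits.
print(validate_inputs({"image": 6, "video": 3, "audio": 3}))
```

Under this reading, a request that maxes out images leaves room for only three further inputs across video and audio combined.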

multimodal · video→

OPENAI / EWEEK·2026-05-20

OpenAI shuts down Sora — web/app gone April 26, API ending September 24

OpenAI announced in March 2026 that the Sora web and app experiences would be discontinued on April 26, 2026, with the API following on September 24. The shutdown reflects OpenAI's shifting strategy away from standalone video generation and toward integration of video capabilities into ChatGPT and its successors.

multimodal · video · openai→

GOOGLE / BYTEDANCE·2026-05-20

Google Veo 3.1 ships true 4K at 60fps with native audio; ByteDance Seedance 2.0 lands 12-input fusion

Google's Veo 3.1 generates true 4K (3840×2160) video at up to 60fps with synchronized audio — dialogue, ambient sound, and effects — generated alongside the video in a single pass. ByteDance's Seedance 2.0 raises the multimodal bar further: up to 9 images, 3 video clips, and 3 audio files as inputs to a single generation, plus native lip-sync in 8+ languages.

multimodal · video→

1X TECHNOLOGIES·2026-05-19

1X NEO begins delivery to early adopters at $20,000 outright or $499/month subscription

1X Technologies started shipping NEO units to early adopter customers at $20,000 outright or $499/month subscription. The deliveries follow the Hayward factory opening (May 15) and the publicly disclosed first-year production target of 10,000 units.
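
For a sense of how the two price points compare, the arithmetic below finds the month in which cumulative subscription payments first exceed the outright price. It is a simplified sketch: it ignores financing, maintenance, contract terms, and resale value, none of which the item describes, and the function name is hypothetical.

```python
PURCHASE_PRICE = 20_000  # USD, outright
MONTHLY_FEE = 499        # USD per month, subscription


def first_month_subscription_costs_more(price: int, monthly: int) -> int:
    """Smallest whole number of months where monthly * months > price."""
    return price // monthly + 1


months = first_month_subscription_costs_more(PURCHASE_PRICE, MONTHLY_FEE)
print(months)                      # 41
print((months - 1) * MONTHLY_FEE)  # 19960, still below the outright price
print(months * MONTHLY_FEE)        # 20459, now above it
```

On these numbers alone, the subscription is the cheaper option for roughly the first three years and four months.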

robotics · multimodal→

ARTIFICIAL ANALYSIS·2026-05-19

Seedance 2.0 holds #1 on Artificial Analysis Video Arena across text-to-video and image-to-video

ByteDance's Seedance 2.0 currently sits at #1 on the Artificial Analysis Video Arena leaderboard across both text-to-video (Elo 1,269) and image-to-video (Elo 1,351) — ahead of Kling 3.0, Veo 3.1, and the now-deprecated Sora 2.

multimodal · video→

GOOGLE DEEPMIND·2026-05-19

Veo 3.1 outputs true 4K at 60fps with synchronized audio in a single pass

Google's Veo 3.1 ships native true-4K (3840×2160) output at up to 60fps, with synchronized audio — ambient sound, dialogue, sound effects — generated alongside the video in a single forward pass. This is the highest native resolution + framerate + audio combination from any production video model.
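
To put the resolution and frame-rate figures in perspective, the arithmetic below works out how many pixels per second that output implies. The 1080p at 30fps baseline is chosen here only for comparison and does not come from the item.

```python
# Figures quoted in the item.
WIDTH, HEIGHT, FPS = 3840, 2160, 60

pixels_per_frame = WIDTH * HEIGHT
pixels_per_second = pixels_per_frame * FPS

# Comparison baseline (an assumption for illustration): 1080p at 30fps.
baseline_pixels_per_second = 1920 * 1080 * 30
ratio = pixels_per_second / baseline_pixels_per_second

print(pixels_per_frame)   # 8294400
print(pixels_per_second)  # 497664000
print(ratio)              # 8.0
```

Raw pixel rate says nothing about how the model generates the frames, but it does show that 4K at 60fps is eight times the output volume of 1080p at 30fps.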

multimodal · video→

GOOGLE DEEPMIND·2026-05-18

Google Veo 3.1 ships with image + video reference inputs for conversion workflows

Google released Veo 3.1, the latest evolution of its Veo video generation line. The headline feature: 1-2 image references plus 1-2 video clip references per generation, optimized for conversion-oriented production rather than raw realism.

multimodal · video · model→

SEEDANCE·2026-05-17

Seedance 2.0 accepts twelve mixed inputs (images + video clips + audio) per generation

Seedance 2.0 ships unified multimodal video generation with up to twelve mixed inputs per generation, drawn from as many as 9 images, 3 video clips, and 3 audio files. The flexibility makes it the most controllable video model on the market.

multimodal · video · model→

1X TECHNOLOGIES·2026-05-15

1X NEO factory opens in Hayward — first vertically-integrated US humanoid plant

1X Technologies opened its NEO Factory in Hayward, California — described as America's first vertically-integrated humanoid robot factory. The 58,000-sq-ft facility targets 10,000 units in year one, scaling to 100,000 by end of 2027.

robotics · multimodal→

NVIDIA BLOG·2026-04-28

NVIDIA Nemotron 3 Nano Omni — unified vision, audio, language for agents

NVIDIA's open Nemotron 3 Nano Omni unifies vision, audio, and language processing in a single model, claiming up to 9x efficiency improvement for agent workloads versus equivalent stacks of specialist models.

multimodal · open-source · models→

ALIBABA QWEN / MARKTECHPOST·2026-03-30

Alibaba Qwen 3.5 Omni — native multimodal text/audio/video with sub-300ms TTFT

Qwen 3.5 Omni (released March 30) is a native multimodal model handling text, audio, video, and real-time interaction. Real-time audio time-to-first-token comes in below 300ms with 95%+ ASR accuracy — the relevant numbers for actual voice-assistant deployment.

multimodal · open-source · models→
