// news · multimodal · frontier-models · 2026-05-26 · source: google / techcrunch / blog.google

Google Gemini Omni unifies image, audio, video, and text input in a single video-generating model; video-generation API ships in coming weeks

Google's Gemini Omni, announced at I/O on May 19, is the first frontier model with unified multimodal generation: it accepts image, audio, video, and text as input and produces full video as its primary output. The video-generation API ships in the coming weeks. Paired with Veo 3.1's 4K, 60 fps fidelity, it gives Google the most complete creator-targeted multimodal stack any lab has shipped.

The architecture is what sets Omni apart from prior text-to-video systems. Where Veo 3.1 produces a single kind of output (audiovisual generation from a text or image prompt), Omni is multi-modality in, multi-modality out: a creator can drop an audio reference, a still image, a partial video clip, and a text instruction into a single conversation and get a coherent video back. That is the input pattern professional creators actually have. They start with reference material across formats, not with text alone, and Omni is the first model whose interface matches that reality.
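
To make that input pattern concrete, here is a minimal sketch of what a mixed-reference request might look like as a data structure. Everything in it is an assumption for illustration: the API has not shipped, and `Part`, `VideoRequest`, the `kind` values, and the file names are hypothetical, not taken from any Google documentation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical modality labels, mirroring the four input types named in the article.
INPUT_KINDS = {"text", "image", "audio", "video"}


@dataclass
class Part:
    """One piece of reference material in a request (hypothetical shape)."""
    kind: str     # one of INPUT_KINDS
    payload: str  # text content, or a file path/URI for media

    def __post_init__(self) -> None:
        if self.kind not in INPUT_KINDS:
            raise ValueError(f"unsupported part kind: {self.kind!r}")


@dataclass
class VideoRequest:
    """A single request that mixes modalities and asks for video back."""
    parts: List[Part] = field(default_factory=list)
    output: str = "video"  # the article describes video as the primary output

    def add(self, kind: str, payload: str) -> "VideoRequest":
        self.parts.append(Part(kind, payload))
        return self  # returning self allows chained calls

    def kinds(self) -> List[str]:
        # Modalities present, in first-seen order, without duplicates.
        seen: List[str] = []
        for part in self.parts:
            if part.kind not in seen:
                seen.append(part.kind)
        return seen

    def to_dict(self) -> Dict[str, object]:
        return {
            "output": self.output,
            "parts": [{"kind": p.kind, "payload": p.payload} for p in self.parts],
        }


# The creator scenario from the paragraph above: four references, one request.
request = (
    VideoRequest()
    .add("audio", "reference_track.wav")
    .add("image", "storyboard_frame.png")
    .add("video", "rough_cut.mp4")
    .add("text", "Extend the rough cut by ten seconds, matching the track's tempo.")
)
```

The point of the sketch is the shape, not the names: one ordered list of typed parts replaces the text-only prompt, so the reference material and the instruction travel together in a single call.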

The video-API release timing matters for the third-party tools ecosystem. Adobe's Firefly Video integration is the obvious downstream beneficiary; Adobe gets to advertise "powered by Gemini Omni" while keeping the Adobe creative-suite surface as the customer-facing layer. Smaller video-creation startups (Runway, Pika, Luma) face a harder strategic question: integrate Omni and compete on workflow value-add, or differentiate against Omni on style, control, or specialized use cases. The independent-tool space is about to fragment along that axis through Q3 2026.

TechCrunch — Google Gemini Omni at I/O 2026 → · Google Blog — Search and I/O 2026 announcements → · ResultSense — Google launches Gemini Omni for multimodal AI video →