// news · multimodal · industry · 2026-05-23 · source: google / techcrunch / vo3ai

Google ships Gemini Omni at I/O 2026 — unified multimodal model accepts text, image, audio, video in one prompt, outputs video

Google unveiled Gemini Omni at Google I/O on May 19, a unified multimodal model that takes text, image, audio, and video in a single prompt and produces video, edited photos, and custom digital avatars as output. Gemini Omni Flash started rolling out the same day to AI Plus, Pro, and Ultra subscribers via the Gemini app and Google's Flow creative studio.

The architectural bet is unification. Where Veo 3.1 generates video, Sora used to generate video, and other models generate images, Omni treats the modalities as interchangeable on both input and output. The implication for content pipelines is that the prompt becomes the storyboard: you can hand the model a script, a reference photo, an audio track, and a rough animatic, and have it produce the finished cut.

The competitive consequence is that the "best video model" framing is becoming obsolete. Veo 3.1 still wins on synchronized-audio generation; Kling still wins on 2-minute clip length. But Omni's unified-input design shifts the comparison from generation quality alone to whether the rest of the production workflow can be folded into a single model surface. That's the bet Google is making.

TechCrunch — Google's Gemini Omni turns images, audio, and text into video → · Vo3AI — Gemini Omni — Google's Unified Multimodal Video Model → · Big News Network — Veo 4 vs Gemini Omni: Decoding Google's Video AI Strategy →