// news · multimodal · video · 2026-05-20 · source: google / bytedance

Google Veo 3.1 ships true 4K at 60fps with native audio; ByteDance Seedance 2.0 lands 12-input fusion

Google's Veo 3.1 generates true 4K (3840×2160) video at up to 60fps with synchronized audio (dialogue, ambient sound, and effects) produced alongside the video in a single pass. ByteDance's Seedance 2.0 raises the multimodal bar further: a single generation accepts up to 12 reference inputs, drawn from as many as 9 images, 3 video clips, and 3 audio files. It also offers native lip-sync in 8+ languages.

Veo 3.1's single-pass audio synthesis is the architectural advance. Earlier video models generated silent footage and added audio in a separate step, which is where sync errors crept in. Veo's joint training collapses that two-stage pipeline: one inference call returns matched audio and video.

Seedance's 12-input fusion is the workflow advance. Creative directors mixing reference images, voice samples, and existing footage now have a single endpoint that accepts all of it. The output stays consistent in character identity, scene continuity, and audio style, which is the consistency production workflows have been waiting for.
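
To make the input limits concrete, here is a minimal pre-flight check a pipeline might run before submitting references. It is a sketch under stated assumptions, not ByteDance's API: `ReferenceBundle` and `validate` are hypothetical names, the per-type caps (9 images, 3 videos, 3 audio files) are the figures quoted in this item, and treating 12 as a cap on the combined total is this sketch's reading of "12-input fusion".

```python
from dataclasses import dataclass, field

# Limits as quoted in this news item. MAX_TOTAL is an assumption:
# it reads "12-input fusion" as a cap on the combined number of files.
MAX_IMAGES = 9
MAX_VIDEOS = 3
MAX_AUDIO = 3
MAX_TOTAL = 12


@dataclass
class ReferenceBundle:
    """Hypothetical container for the reference files of one generation."""
    images: list[str] = field(default_factory=list)
    videos: list[str] = field(default_factory=list)
    audio: list[str] = field(default_factory=list)

    def total(self) -> int:
        return len(self.images) + len(self.videos) + len(self.audio)


def validate(bundle: ReferenceBundle) -> list[str]:
    """Return limit violations; an empty list means the bundle fits."""
    problems = []
    if len(bundle.images) > MAX_IMAGES:
        problems.append(f"too many images: {len(bundle.images)} > {MAX_IMAGES}")
    if len(bundle.videos) > MAX_VIDEOS:
        problems.append(f"too many videos: {len(bundle.videos)} > {MAX_VIDEOS}")
    if len(bundle.audio) > MAX_AUDIO:
        problems.append(f"too many audio files: {len(bundle.audio)} > {MAX_AUDIO}")
    if bundle.total() > MAX_TOTAL:
        problems.append(f"too many inputs in total: {bundle.total()} > {MAX_TOTAL}")
    return problems


if __name__ == "__main__":
    # Every per-type cap is respected, but 9 + 3 + 3 = 15 exceeds the total.
    maxed = ReferenceBundle(
        images=["ref.png"] * 9, videos=["clip.mp4"] * 3, audio=["voice.wav"] * 3
    )
    print(validate(maxed))
```

Under that reading, maxing out all three types at once is rejected, so a workflow has to trade image references against video and audio slots.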

Pixflow — Best AI video generators 2026 → · InVideo — Kling vs Sora vs Veo vs Runway →