// news · multimodal · video · 2026-05-20 · source: bytedance / seedance

Seedance 2.0 accepts up to 12 mixed inputs per generation: multimodal-input depth is the new benchmark

ByteDance's Seedance 2.0 (February 2026) accepts up to nine images, three video clips, and three audio files in a single generation, with the combined total capped at twelve mixed inputs. By comparison, Sora 2 and Kling 3.0 take one to two image references; Veo 3.1 takes one to two images plus one to two video clips. Input depth, meaning how many references of how many types a model accepts per generation, is the new axis of differentiation.
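The per-type maxima sum to fifteen, so the twelve-input figure is best read as a combined cap, meaning not every type can be maxed out at once. A minimal sketch of that reading as a validation check follows. The names `ReferenceBundle`, `validate`, `MAX_PER_TYPE`, and `MAX_TOTAL` are hypothetical and are not part of any Seedance API; only the numbers come from the article.

```python
from dataclasses import dataclass

# Limits as quoted in the article. All names are illustrative assumptions,
# not identifiers from any ByteDance or Seedance SDK.
MAX_PER_TYPE = {"image": 9, "video": 3, "audio": 3}
MAX_TOTAL = 12


@dataclass(frozen=True)
class ReferenceBundle:
    """Counts of reference files attached to one generation request."""
    images: int = 0
    videos: int = 0
    audio: int = 0

    def total(self) -> int:
        return self.images + self.videos + self.audio


def validate(bundle: ReferenceBundle) -> list[str]:
    """Return a list of limit violations; an empty list means the bundle fits."""
    counts = {"image": bundle.images, "video": bundle.videos, "audio": bundle.audio}
    problems = []
    for kind, count in counts.items():
        if count < 0:
            problems.append(f"{kind}: count cannot be negative")
        elif count > MAX_PER_TYPE[kind]:
            problems.append(f"{kind}: {count} exceeds per-type limit of {MAX_PER_TYPE[kind]}")
    # The combined cap is checked separately from the per-type caps.
    if bundle.total() > MAX_TOTAL:
        problems.append(f"total: {bundle.total()} exceeds combined limit of {MAX_TOTAL}")
    return problems
```

Under these assumptions, `validate(ReferenceBundle(images=6, videos=3, audio=3))` returns an empty list, while maxing out every type at once (nine, three, three) passes each per-type check but fails the combined cap.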

The shift matters because it changes what is possible for character consistency, scene continuity, and reference-driven generation. Twelve mixed inputs are enough to encode a full mood board, voice samples, and motion references: the kind of brief a video producer would assemble for a human VFX team. Seedance is becoming the first model where production-grade briefs translate cleanly into generation prompts.

For Sora 2 and Veo 3.1, the response will be input-depth parity. Expect the next-generation releases from both to ship with at least eight-input support. The competitive bar has moved.
