// news · multimodal · video · model · 2026-05-17 · source: seedance

Seedance 2.0 accepts twelve mixed inputs (images + video clips + audio) per generation

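Seedance 2.0 ships unified multimodal video generation with up to twelve mixed inputs per generation, drawn from at most 9 images, 3 video clips, and 3 audio files. The per-type caps sum to fifteen, so a single generation cannot max out all three at once. That flexibility makes it the most controllable video model on the market.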
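As a rough illustration of how those limits interact, here is a minimal Python sketch that checks a proposed mix of inputs. It assumes the figures above are per-type caps (9 images, 3 video clips, 3 audio files) under an overall cap of twelve. The names `MAX_TOTAL`, `MAX_PER_KIND`, and `validate_inputs` are hypothetical and are not part of any Seedance API.

```python
from collections import Counter

# Assumed reading of the limits quoted above; not taken from official API docs.
MAX_TOTAL = 12
MAX_PER_KIND = {"image": 9, "video": 3, "audio": 3}


def validate_inputs(kinds):
    """Return a list of problems for a proposed mix of inputs (empty if valid).

    `kinds` is a list of strings, one per input: "image", "video", or "audio".
    """
    problems = []
    counts = Counter(kinds)
    # Check each input type against its own cap.
    for kind, count in counts.items():
        if kind not in MAX_PER_KIND:
            problems.append(f"unknown input kind: {kind}")
        elif count > MAX_PER_KIND[kind]:
            problems.append(
                f"too many {kind} inputs: {count} > {MAX_PER_KIND[kind]}"
            )
    # Check the combined count against the overall cap.
    if len(kinds) > MAX_TOTAL:
        problems.append(
            f"too many inputs in total: {len(kinds)} > {MAX_TOTAL}"
        )
    return problems
```

Under these assumptions, nine images with two clips and one audio file would pass, while nine images, three clips, and three audio files together would be rejected on the total even though each type is within its own cap.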

The headline use case is "match this style, transition to this scene, with this voice" in a single generation, a workflow that previously required multiple stages of generation and editing. The model handles temporal continuity across the inputs.

Pricing is competitive with Sora 2 and undercuts Veo 3.1 for unified workflows. The twelve-input cap suggests the model is running up against capacity limits in the underlying compute infrastructure.
