// news · multimodal · open-source · models · 2026-03-30 · source: alibaba qwen / marktechpost

Alibaba Qwen 3.5 Omni — native multimodal text/audio/video with sub-300ms TTFT

Qwen 3.5 Omni (released March 30) is a native multimodal model handling text, audio, video, and real-time interaction. Real-time audio time-to-first-token (TTFT) comes in below 300ms with 95%+ ASR accuracy, the two numbers that matter most for actual voice-assistant deployment.

The model's design point is real-time multimodal interaction. The sub-300ms TTFT is what separates "voice assistant that feels alive" from "voice assistant with a perceptible delay": users start treating the system as broken once the delay reaches roughly 400-500ms. Qwen 3.5 Omni clears that threshold comfortably.
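
For builders who want to check the latency claim against their own stack, here is a minimal sketch of timing TTFT on a streamed response and comparing it with the thresholds quoted above. It assumes only that your client exposes the response as an iterable of chunks. The names `measure_ttft_ms`, `classify_ttft`, and the two constants are hypothetical and introduced for illustration; nothing here calls a real Qwen API.

```python
import time
from typing import Callable, Iterable, Optional

# Thresholds taken from the figures quoted in this post; rough guides, not a spec.
TARGET_TTFT_MS = 300.0        # the sub-300ms TTFT reported for the model
PERCEPTIBLE_DELAY_MS = 400.0  # lower end of the 400-500ms range

def measure_ttft_ms(
    stream: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Milliseconds from the start of iteration to the first non-empty chunk.

    Returns None if the stream ends without yielding any data.
    `stream` stands in for whatever streaming client you use (an assumption).
    """
    start = clock()
    for chunk in stream:
        if chunk:  # skip empty keep-alive chunks
            return (clock() - start) * 1000.0
    return None

def classify_ttft(ttft_ms: float) -> str:
    """Bucket a measured TTFT against the two thresholds above."""
    if ttft_ms < TARGET_TTFT_MS:
        return "within target"
    if ttft_ms < PERCEPTIBLE_DELAY_MS:
        return "borderline"
    return "perceptible delay"

def fake_stream(delay_s: float):
    """Hypothetical stand-in for a model response: waits, then yields one chunk."""
    time.sleep(delay_s)
    yield b"audio-chunk"

if __name__ == "__main__":
    ttft = measure_ttft_ms(fake_stream(0.05))
    if ttft is not None:
        print(f"TTFT {ttft:.0f} ms: {classify_ttft(ttft)}")
```

In a real measurement the clock should start when the user's audio ends, not when the response iterator is created, so network and capture latency are counted too.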

Multimodal in this generation has settled into a competitive trio: Gemini 3 Deep Think (closed, video leader at 78.4% Video-MME), Claude Opus 4.7 (closed, long-document OCR leader), and Qwen 3.5 Omni (open, real-time audio leader). For builders shipping voice-first products under open-weight constraints, Qwen 3.5 Omni is the new default.

MarkTechPost: Qwen 3.5 Omni →