Alibaba Launches Qwen3.5-Omni Multimodal Model, Claims Audio and Video Benchmark Gains

The launch underscores Alibaba's bid for leadership in real-time, multimodal AI.

Overview

Qwen3.5-Omni accepts text, images, audio and video within a single model. It can also generate fine-grained, time-stamped captions that turn long clips into searchable notes.
Alibaba says the Omni-Plus variant achieves 215 state-of-the-art results across audio and video tasks and outperforms Gemini 3.1 Pro on many audio benchmarks.
For live voice conversations, the model adds interruption handling, voice cloning and voice control, features intended to keep real-time dialogue smooth.
The model includes web search and function calling so it can pull live information and use tools to complete tasks.
Developers can access the model through an API on Alibaba Cloud’s Bailian platform in Plus, Flash and Light sizes. Separately, a Qwen 3.6 Plus preview is now available on OpenRouter.