Alibaba Releases Qwen3.5-Omni: The AI That Sees, Hears, and Speaks 36 Languages

[Image: audio waveforms and video frames flowing into an AI processor chip]

Alibaba just dropped Qwen3.5-Omni, an omnimodal AI model that simultaneously processes text, images, audio, and video, and talks back in real time across 36 languages. It can handle more than 10 hours of continuous audio input, clone voices, and search the web, all from a single model.

What Makes It Special

Qwen3.5-Omni is natively end-to-end omnimodal — meaning it doesn’t bolt separate models together for different input types. It processes everything through a single architecture with a maximum sequence length of 256,000 tokens.
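In practice, a single-architecture model like this is usually exposed through one endpoint that accepts mixed-modality messages. The sketch below shows what that could look like through an OpenAI-compatible API; the model id, base URL, and audio-input format here are assumptions for illustration, not confirmed details of Qwen3.5-Omni, so check Alibaba's DashScope documentation for the real identifiers.

```python
# Hypothetical sketch: sending text and audio to an omnimodal model in
# one request via an OpenAI-compatible endpoint. Model id, base URL,
# and audio-input support are assumptions, not confirmed specifics.
import base64
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # assumed credential
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

# Read a local recording and base64-encode it for the request body.
with open("meeting.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen3.5-omni",  # hypothetical model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text",
             "text": "Summarize this recording and answer in French."},
        ],
    }],
)
print(response.choices[0].message.content)
```

The point of the end-to-end design is visible in the request shape: the audio goes straight into the model's context alongside the text prompt, with no separate speech-to-text stage in between.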

Alibaba claims the Plus variant surpasses Gemini 3.1 Pro on general audio understanding, reasoning, and translation benchmarks, and reports 215 state-of-the-art results across audio, audio-video understanding, reasoning, and interaction benchmarks.

The Training Scale

The model was trained on over 100 million hours of audio-visual data — a scale that puts it in a different weight class from most competitors. For context, that’s roughly 11,400 years of continuous audio.
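That conversion is easy to verify:

```python
# Convert the claimed 100 million hours of training audio into years.
hours = 100_000_000
years = hours / (24 * 365)    # 8,760 hours in a non-leap year
print(f"{years:,.0f} years")  # -> 11,416 years, i.e. roughly 11,400
```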

Why This Matters

While the West focuses on coding assistants and workflow tools, China's AI labs are pushing hard on multimodal capabilities. Qwen3.5-Omni represents a significant leap in what a single model can do: understanding text, images, audio, and video simultaneously and responding in text and speech, in dozens of languages.

The Bottom Line

Alibaba is building the kind of omnimodal AI that most Western companies are still promising. Whether this translates into real-world products or remains a benchmark-beating research achievement depends on adoption. But with 36 languages, 10-plus-hour audio support, and voice cloning built in, Qwen3.5-Omni is one of the most capable multilingual models released this year.