Overview
- A newly posted arXiv report details a dual-track language-model design paired with two speech tokenizers to enable real-time synthesis and fine-grained control.
- The 12Hz tokenizer is built for ultra-low latency, with a reported first-packet latency of about 97 milliseconds.
- Training covers more than five million hours across ten languages to support multilingual and robust speech generation.
- The authors report state-of-the-art results on a multilingual TTS test set, InstructTTSEval, and a long-speech benchmark.
- Tongyi Lab says the full family is open-sourced under Apache 2.0, including weights, code, and demos. Variants include VoiceDesign, CustomVoice, and Base, at 0.6B and 1.7B parameters, with fine-tuning support.
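
To put the latency figure above in context, the following is a minimal arithmetic sketch, not anything taken from the report. It assumes that "12Hz" means the tokenizer emits roughly 12 token frames per second of audio; the exact frame rate, the function names, and the constants are illustrative assumptions.

```python
import math

def frame_duration_ms(frame_rate_hz: float) -> float:
    """Milliseconds of audio covered by one token frame at the given rate."""
    if frame_rate_hz <= 0:
        raise ValueError("frame_rate_hz must be positive")
    return 1000.0 / frame_rate_hz

def frames_for_duration(seconds: float, frame_rate_hz: float) -> int:
    """Number of token frames needed to cover `seconds` of audio (rounded up)."""
    return math.ceil(seconds * frame_rate_hz)

# Assumption: nominal 12 frames per second, read off the tokenizer's name.
NOMINAL_RATE_HZ = 12.0
# Reported first-packet latency from the summary above (approximate).
REPORTED_FIRST_PACKET_MS = 97.0

frame_ms = frame_duration_ms(NOMINAL_RATE_HZ)
print(f"one frame covers about {frame_ms:.1f} ms of audio")
print(f"reported first packet: about {REPORTED_FIRST_PACKET_MS:.0f} ms")
print(f"frames for 10 s of speech: {frames_for_duration(10, NOMINAL_RATE_HZ)}")
```

Under this assumption, one frame spans a little over 83 ms, so the reported 97 ms is on the order of a single frame of audio. The summary does not say how that latency is divided between tokenization, model inference, and decoding, so the sketch makes no claim about it.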