Overview
- The paper introduces Principled Coarse-Graining, which forms Acoustic Similarity Groups to replace exact token matching with group-level verification.
- A two-model speculative-decoding setup has a small proposer suggest tokens and a larger judge accept them if they fall within the correct acoustic group.
- Experiments report roughly 40% faster speech generation with a human-rated naturalness score of 4.09, lower word error rates than prior speed-focused methods, and preserved speaker similarity.
- The approach changes only the decoding step, so it requires no retraining, and storing the acoustic groups adds about 37 MB, which supports practical on-device deployment.
- A stress test that swapped 91.4% of tokens for other tokens in the same group raised word error rate by only 0.007 and lowered speaker similarity by only 0.027.
- Coverage notes that the method could make Siri responses quicker, but no rollout has been confirmed.
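
The difference between exact token matching and group-level verification can be illustrated with a minimal sketch. It is not the paper's implementation: the names `TOKEN_TO_GROUP`, `group_mass`, `accept_by_group`, and `accept_by_exact_match` are hypothetical, the four-token vocabulary and its grouping are toy data, and the greedy rule used here (accept a proposed token when it lies in the group that holds the most judge probability) is a simplifying assumption rather than the acceptance criterion the paper uses.

```python
from typing import Dict, List, Sequence

# Toy assumption: tokens 0 and 1 sound alike, tokens 2 and 3 sound alike.
TOKEN_TO_GROUP: Dict[int, int] = {0: 0, 1: 0, 2: 1, 3: 1}


def group_mass(probs: Sequence[float], token_to_group: Dict[int, int]) -> Dict[int, float]:
    """Sum the judge's token probabilities within each acoustic group."""
    mass: Dict[int, float] = {}
    for token, p in enumerate(probs):
        group = token_to_group[token]
        mass[group] = mass.get(group, 0.0) + p
    return mass


def accept_by_group(draft: Sequence[int],
                    judge_probs: Sequence[Sequence[float]],
                    token_to_group: Dict[int, int]) -> List[int]:
    """Accept proposed tokens until one falls outside the judge's preferred group."""
    accepted: List[int] = []
    for token, probs in zip(draft, judge_probs):
        mass = group_mass(probs, token_to_group)
        best_group = max(mass, key=mass.get)
        if token_to_group[token] != best_group:
            break  # first group-level mismatch ends the accepted prefix
        accepted.append(token)
    return accepted


def accept_by_exact_match(draft: Sequence[int],
                          judge_probs: Sequence[Sequence[float]]) -> List[int]:
    """Baseline: accept only tokens that equal the judge's single top token."""
    accepted: List[int] = []
    for token, probs in zip(draft, judge_probs):
        best_token = max(range(len(probs)), key=probs.__getitem__)
        if token != best_token:
            break
        accepted.append(token)
    return accepted


# The proposer drafts three tokens; the judge gives one distribution per position.
draft = [1, 2, 0]
judge_probs = [
    [0.50, 0.40, 0.05, 0.05],  # judge prefers token 0, but token 1 shares its group
    [0.10, 0.10, 0.30, 0.50],  # judge prefers token 3, but token 2 shares its group
    [0.10, 0.10, 0.40, 0.40],  # judge prefers group 1; token 0 is in group 0
]

group_accepted = accept_by_group(draft, judge_probs, TOKEN_TO_GROUP)
exact_accepted = accept_by_exact_match(draft, judge_probs)
print(group_accepted)  # [1, 2]
print(exact_accepted)  # []
```

In this toy case exact matching rejects the very first proposed token, while group-level checking keeps two tokens before the first mismatch. Longer accepted runs of this kind are where a speed-up would come from.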