Overview
- Google introduced Multi‑Token Prediction (MTP) drafters for its open Gemma 4 models, claiming up to a threefold increase in tokens per second.
- The approach uses a small drafter to guess several future tokens, which the larger Gemma model then verifies in a single forward pass, so accepted guesses come essentially for free and compute is only wasted on rejected ones (a minimal sketch of the verify loop follows this list).
- The draft models reuse the target model's activations and share its key‑value cache, while the edge variants use an embedder‑clustering technique to cut the logit bottleneck.
- The release ships under the Apache 2.0 license, with weights available on Hugging Face and Kaggle, and it works with LiteRT‑LM, MLX, Hugging Face Transformers, vLLM, SGLang, Ollama, and Google's AI Edge Gallery (a hedged Transformers usage sketch also follows this list).
- Google notes that real‑world gains vary with hardware and batching: the 26B MoE model hits routing limits on Apple Silicon at batch size one, while larger batches unlock roughly 2.2x speedups, in line with results on an Nvidia A100.
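
The announcement describes the draft-then-verify mechanism but does not publish reference code, so the following is only a minimal greedy-verification sketch of speculative decoding under that assumption. The function name `speculative_step`, the toy `target_logits` and `draft_next` callables, and the parameter `k` are illustrative stand-ins, not Google's implementation or the Gemma API.

```python
# Minimal sketch of greedy speculative decoding: a cheap drafter proposes k
# tokens, the target model scores the whole draft in one forward pass, and we
# keep the longest prefix that matches the target's own greedy choices.
from typing import Callable, List

def speculative_step(
    target_logits: Callable[[List[int]], List[List[float]]],  # one pass over a sequence
    draft_next: Callable[[List[int]], int],                    # cheap drafter, one token at a time
    prefix: List[int],
    k: int = 4,
) -> List[int]:
    # 1. Drafter guesses k future tokens autoregressively (cheap).
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Target model verifies the whole draft in a single pass: it returns
    #    next-token logits for every position of prefix + draft.
    logits = target_logits(prefix + draft)

    # 3. Accept the longest prefix of the draft that matches the target's
    #    greedy choices; the first mismatch is replaced by the target's token.
    accepted = []
    for i, guess in enumerate(draft):
        pos = len(prefix) + i - 1  # logits at pos predict the token at pos + 1
        target_choice = max(range(len(logits[pos])), key=logits[pos].__getitem__)
        if guess == target_choice:
            accepted.append(guess)           # verified for free in the same pass
        else:
            accepted.append(target_choice)   # correction from the target model
            break
    else:
        # All k guesses accepted: the last logits row yields one bonus token.
        pos = len(prefix) + k - 1
        accepted.append(max(range(len(logits[pos])), key=logits[pos].__getitem__))
    return accepted

if __name__ == "__main__":
    VOCAB = 5
    # Toy target: always prefers token (previous + 1) % VOCAB.
    def target_logits(seq):
        return [[1.0 if v == (t + 1) % VOCAB else 0.0 for v in range(VOCAB)] for t in seq]
    # Toy drafter that happens to agree with the target.
    def draft_next(seq):
        return (seq[-1] + 1) % VOCAB
    print(speculative_step(target_logits, draft_next, prefix=[0], k=4))  # -> [1, 2, 3, 4, 0]
```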
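
On the serving side, Hugging Face Transformers already exposes drafter-assisted generation through the `assistant_model` argument of `generate()`. Whether Google's MTP drafters plug in exactly this way is not confirmed by the overview, and the checkpoint IDs below are placeholders rather than real model names, so treat this as a generic usage sketch.

```python
# Hedged sketch: assisted (speculative) generation with Transformers.
# TARGET_ID and DRAFTER_ID are hypothetical placeholders, not released checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET_ID = "google/gemma-target-placeholder"   # placeholder: the large Gemma model
DRAFTER_ID = "google/gemma-drafter-placeholder" # placeholder: the small drafter

tokenizer = AutoTokenizer.from_pretrained(TARGET_ID)
target = AutoModelForCausalLM.from_pretrained(
    TARGET_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
drafter = AutoModelForCausalLM.from_pretrained(
    DRAFTER_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Speculative decoding works by", return_tensors="pt").to(target.device)

# Passing assistant_model switches generate() into assisted decoding:
# the drafter proposes tokens and the target verifies them in one pass.
output = target.generate(**inputs, assistant_model=drafter, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```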