Particle.news

Google Releases Multi‑Token Prediction Drafters for Gemma 4 With Up to 3x Faster Inference

A paired drafter predicts several tokens at once to reduce latency with no reported loss in quality.

Overview

  • Google introduced Multi‑Token Prediction drafters for its open Gemma 4 models, claiming up to a threefold speed boost in tokens per second.
  • The approach uses a small drafter to guess multiple future tokens, which the larger Gemma model then verifies in a single forward pass, avoiding wasted compute.
  • Draft models reuse the target model’s activations and share its key‑value cache, while edge variants use an efficient embedder clustering technique to reduce the logit‑computation bottleneck.
  • The release ships under the Apache 2.0 license with weights available on Hugging Face and Kaggle, and it works with LiteRT‑LM, MLX, Hugging Face Transformers, vLLM, SGLang, Ollama, and Google’s AI Edge Gallery.
  • Google notes that real‑world gains vary by hardware and batching: the 26B MoE model hits expert‑routing limits on Apple Silicon at batch size one, while larger batches unlock roughly 2.2x speedups, in line with results on Nvidia A100 GPUs.
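The draft‑and‑verify loop described above can be sketched as follows. This is a minimal toy illustration of the general speculative‑decoding technique, not Google's implementation: `draft_model` and `target_model` are hypothetical stand‑ins (simple arithmetic functions) in place of real neural networks, and the verification is emulated token by token rather than in one batched forward pass.

```python
# Toy sketch of draft-and-verify (speculative) decoding.
# `draft_model` and `target_model` are hypothetical stand-ins, not Gemma APIs.

def draft_model(prefix, k):
    # Drafter cheaply proposes k future tokens (here: a fixed pattern).
    return [(prefix[-1] + 1 + i) % 100 for i in range(k)]

def target_model(prefix):
    # Target model's "correct" next token for a given prefix.
    return (prefix[-1] + 1) % 100

def speculative_step(prefix, k=4):
    """Propose k tokens with the drafter, then verify against the target.

    In a real system the target scores all k drafted positions in a single
    forward pass; here we emulate that check one token at a time.
    """
    proposal = draft_model(prefix, k)
    accepted = []
    ctx = list(prefix)
    for tok in proposal:
        if target_model(ctx) == tok:   # target agrees -> accept draft token
            accepted.append(tok)
            ctx.append(tok)
        else:                          # first mismatch: emit target's token, stop
            accepted.append(target_model(ctx))
            break
    return accepted

print(speculative_step([7], k=4))  # -> [8, 9, 10, 11]
```

When the drafter's guesses match the target, several tokens are committed per verification pass, which is where the tokens‑per‑second speedup comes from; on a mismatch, only the verified prefix is kept, so output quality matches the target model alone.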