Overview
- Google publicly released DiffusionGemma on June 10 with Apache 2.0 weights on Hugging Face, making the model available for developers and researchers to download.
- The model is a 26‑billion‑parameter mixture‑of‑experts that activates roughly 3.8 billion parameters at inference and denoises up to 256 tokens in parallel, and quantized checkpoints fit in about 18GB of VRAM.
- Google and NVIDIA report large single‑user speed gains—over 1,000 tokens per second on an H100 and 700+ on an RTX 5090—claiming up to roughly fourfold faster generation in low‑batch local runs, and NVIDIA shipped day‑one optimizations across RTX and DGX platforms.
- DeepMind frames DiffusionGemma as experimental and speed‑focused: it yields lower overall quality than standard Gemma 4, and key runtime pieces such as a lightweight drafter for speculative decoding and correct context defaults are not yet widely available in consumer runtimes.
- The release shifts local inference bottlenecks from memory bandwidth to compute, which could enable faster inline editing, code infilling, and agent loops, but broader adoption depends on community toolchain updates and independent benchmarks to validate performance and reliability.