Overview
- Google Research introduced TurboQuant to compress the key‑value cache that stores past token vectors during LLM inference.
- Community packages and forks already make it usable, including a pip-installable package and a llama.cpp fork with Metal support.
- Early tests report about 4–6x lower cache memory and 2–3x higher token throughput when VRAM is the bottleneck.
- The two‑stage design first rotates vectors so their coordinates become easier to quantize, then adds a 1‑bit sign sketch of the residual to correct the small remaining error.
- Google expects to ship an official implementation in Q2 2026, with a deeper technical presentation planned for ICLR 2026.
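The two-stage idea described above can be sketched in a toy form: rotate, quantize coarsely, then correct the residual with one sign bit per coordinate. This is an illustrative sketch only, not Google's implementation; the choice of a random orthogonal rotation, the 4-bit coarse stage, and the single shared magnitude for the sign sketch are all assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d: int) -> np.ndarray:
    # QR decomposition of a Gaussian matrix gives a random orthogonal matrix.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def quantize(v: np.ndarray, rot: np.ndarray, bits: int = 4):
    """Toy two-stage quantizer: rotate, coarse uniform quantization,
    then a 1-bit sign sketch of the residual (hypothetical scheme)."""
    r = rot @ v                                  # stage 1: rotation spreads energy
    scale = np.abs(r).max() / (2 ** (bits - 1) - 1)
    coarse = np.round(r / scale)                 # coarse integer codes
    residual = r - coarse * scale
    signs = np.sign(residual)                    # stage 2: 1 bit per coordinate
    mag = np.abs(residual).mean()                # one shared scalar magnitude
    return coarse, scale, signs, mag

def dequantize(coarse, scale, signs, mag, rot):
    r_hat = coarse * scale + signs * mag         # apply sign-sketch correction
    return rot.T @ r_hat                         # undo the rotation

d = 64
v = rng.standard_normal(d)
rot = random_rotation(d)
coarse, scale, signs, mag = quantize(v, rot)

v_coarse = rot.T @ (coarse * scale)              # stage 1 only
v_hat = dequantize(coarse, scale, signs, mag, rot)

err_coarse = np.linalg.norm(v - v_coarse) / np.linalg.norm(v)
err_two_stage = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

Because the rotation is orthogonal, norms are preserved, and the sign correction provably shrinks the squared error in the rotated space whenever the residual is nonzero, which is why the second stage helps even at one bit per coordinate.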