Overview
- Google Research introduced TurboQuant to compress the key‑value cache that stores past token vectors during LLM inference.
- Community packages and forks already make it usable, including a pip-installable package and a llama.cpp fork with Metal support.
- Early tests report about 4–6x lower cache memory and 2–3x higher token throughput when VRAM is the bottleneck.
- The two‑stage design first rotates vectors so their coordinates become easier to quantize, then adds a 1‑bit sign sketch of the residual to correct the small remaining error.
- Google expects to ship an official implementation in Q2 2026, with a deeper technical presentation planned for ICLR 2026.
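The two-stage idea described above can be sketched in a toy form: rotate, quantize coarsely, then correct the residual with one sign bit per coordinate. This is an illustrative sketch only, not Google's implementation; the choice of a random orthogonal rotation, the 4-bit coarse stage, and the single shared magnitude for the sign sketch are all assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d: int) -> np.ndarray:
    # QR decomposition of a Gaussian matrix gives a random orthogonal matrix.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def quantize(v: np.ndarray, rot: np.ndarray, bits: int = 4):
    """Toy two-stage quantizer: rotate, coarse uniform quantization,
    then a 1-bit sign sketch of the residual (hypothetical scheme)."""
    r = rot @ v                                  # stage 1: rotation spreads energy
    scale = np.abs(r).max() / (2 ** (bits - 1) - 1)
    coarse = np.round(r / scale)                 # coarse integer codes
    residual = r - coarse * scale
    signs = np.sign(residual)                    # stage 2: 1 bit per coordinate
    mag = np.abs(residual).mean()                # one shared scalar magnitude
    return coarse, scale, signs, mag

def dequantize(coarse, scale, signs, mag, rot):
    r_hat = coarse * scale + signs * mag         # apply sign-sketch correction
    return rot.T @ r_hat                         # undo the rotation

d = 64
v = rng.standard_normal(d)
rot = random_rotation(d)
coarse, scale, signs, mag = quantize(v, rot)

v_coarse = rot.T @ (coarse * scale)              # stage 1 only
v_hat = dequantize(coarse, scale, signs, mag, rot)

err_coarse = np.linalg.norm(v - v_coarse) / np.linalg.norm(v)
err_two_stage = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

Because the rotation is orthogonal, norms are preserved, and the sign correction provably shrinks the squared error in the rotated space whenever the residual is nonzero, which is why the second stage helps even at one bit per coordinate.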