
Google's TurboQuant Shrinks LLM KV Caches for Longer Contexts and Faster Inference

Community tools now let developers test the training‑free method to ease memory limits.

Overview

  • Google Research introduced TurboQuant to compress the key‑value cache that stores past token vectors during LLM inference.
  • Community packages and forks offer drop-in usage, including a pip-installable package and a llama.cpp fork with Metal support.
  • Early tests report about 4–6x lower cache memory and 2–3x higher token throughput when VRAM is the bottleneck.
  • The two‑stage design rotates vectors to make them easy to quantize, then adds a 1‑bit sign sketch to correct the small remaining error.
  • Google expects to ship an official implementation in Q2 2026, with a deeper technical presentation planned for ICLR 2026.
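The article does not include TurboQuant's actual algorithm, so the following is only a rough NumPy sketch of the two-stage idea the bullets describe: rotate each cached vector with a random orthogonal matrix so its coordinates are easier to quantize uniformly, quantize to a few bits, then keep a 1-bit sign sketch of the residual to correct part of the remaining error. The head dimension, 4-bit width, per-vector scale, and all function names here are illustrative assumptions, not the method Google will ship.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # illustrative head dimension, not from the article

# Stage 1 setup: a random orthogonal rotation (QR of a Gaussian matrix)
# spreads coordinate magnitudes so one uniform low-bit grid fits them all.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def quantize(v, bits=4):
    """Rotate, uniformly quantize to `bits`, and keep a 1-bit sign
    sketch of the residual. All names/choices here are hypothetical."""
    r = Q @ v
    levels = 2 ** bits
    scale = np.abs(r).max() / (levels / 2 - 1) + 1e-12
    codes = np.clip(np.round(r / scale), -(levels // 2), levels // 2 - 1)
    residual = r - codes * scale
    sign_sketch = np.sign(residual)        # Stage 2: 1 bit per coordinate
    resid_scale = np.abs(residual).mean()  # one extra scalar per vector
    return codes.astype(np.int8), scale, sign_sketch, resid_scale

def dequantize(codes, scale, sign_sketch, resid_scale):
    # Reconstruct in the rotated basis, then invert the rotation.
    r_hat = codes * scale + sign_sketch * resid_scale
    return Q.T @ r_hat

v = rng.standard_normal(d)
codes, s, sketch, rs = quantize(v)
v_hat = dequantize(codes, s, sketch, rs)
err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

Under this sketch, storage per vector drops from 32-bit floats to 4-bit codes plus 1 sign bit per coordinate and two scalars, which is in the same ballpark as the 4–6x cache savings the early tests report; the sign sketch recovers part of the quantization error at the cost of one extra bit.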