Particle.news

Google Research Unveils TurboQuant to Cut LLM Memory Use and Boost Speed

Early results point to lower data‑center memory demand, though production validation is still pending.

Overview

  • Google Research announced TurboQuant on Wednesday, saying the method shrinks a model’s key‑value cache by at least 6x with no measured accuracy loss.
  • TurboQuant pairs PolarQuant, which stores each vector as a magnitude plus a direction to eliminate extra stored constants, with QJL, which compresses the leftover quantization error into a single sign bit.
  • Tests on open models such as Gemma and Mistral across long‑context benchmark suites showed no degradation in task quality, and 4‑bit TurboQuant delivered up to 8x faster attention on NVIDIA H100 GPUs.
  • The technique targets inference memory rather than model weights, requires no retraining or dataset tuning, and is slated for ICLR 2026 and AISTATS 2026 presentations.
  • Shares of Micron, Western Digital, and SanDisk fell on the news, reflecting investor concern that widespread use could curb demand for memory hardware.
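To make the magnitude‑direction idea concrete, here is a minimal sketch of polar‑style vector quantization with a sign‑bit residual. This is an illustrative toy, not Google's TurboQuant implementation: the bit width, rounding scheme, and the `polar_quantize`, `polar_dequantize`, and `sign_bit_residual` helpers are all assumptions for demonstration.

```python
import numpy as np

def polar_quantize(v, bits=4):
    """Toy polar quantization (illustrative, not Google's method):
    store a vector as its magnitude plus a low-bit unit direction."""
    mag = float(np.linalg.norm(v))
    if mag == 0.0:
        return mag, np.zeros(len(v), dtype=np.int8), 1
    direction = v / mag                      # unit vector, entries in [-1, 1]
    levels = 2 ** (bits - 1) - 1             # symmetric int range, e.g. 7 for 4 bits
    q = np.round(direction * levels).astype(np.int8)
    return mag, q, levels

def polar_dequantize(mag, q, levels):
    """Rebuild an approximate vector from (magnitude, quantized direction)."""
    d = q.astype(np.float32) / levels
    n = np.linalg.norm(d)
    if n > 0:
        d = d / n                            # snap back onto the unit sphere
    return mag * d

def sign_bit_residual(v, v_hat):
    """One sign bit per coordinate of the leftover error (illustrative
    stand-in for the role the article ascribes to QJL)."""
    return np.signbit(v - v_hat)

# Usage: a 4-bit round trip keeps the reconstruction close to the original.
v = np.array([3.0, 4.0, 0.0])
mag, q, levels = polar_quantize(v)
v_hat = polar_dequantize(mag, q, levels)
bits = sign_bit_residual(v, v_hat)
```

The magnitude is stored once per vector at full precision, so only the bounded unit direction needs low‑bit codes; that separation is what removes the per‑vector scale constants the article alludes to.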