Particle.news

Google Research Unveils TurboQuant to Cut LLM Memory Use and Boost Speed

Early results point to lower data‑center memory demand, though production validation is still pending.

Overview

  • Google Research announced TurboQuant on Wednesday, saying the method shrinks a model’s key‑value cache by at least 6x with no measured accuracy loss.
  • TurboQuant pairs PolarQuant, which stores each vector as a magnitude plus a direction to eliminate extra stored constants, with QJL, which compresses the leftover quantization error into a single sign bit.
  • Tests on open models such as Gemma and Mistral across long‑context benchmark suites showed no degradation in task quality, and 4‑bit TurboQuant delivered up to 8x faster attention on NVIDIA H100 GPUs.
  • The technique targets inference memory rather than model weights, requires no retraining or dataset tuning, and is slated for ICLR 2026 and AISTATS 2026 presentations.
  • Shares of Micron, Western Digital, and SanDisk fell on the news, reflecting investor concern that widespread use could curb demand for memory hardware.
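To make the magnitude‑direction idea concrete, here is a minimal sketch of polar‑style vector quantization with a sign‑bit residual. This is an illustrative toy, not Google's TurboQuant implementation: the bit width, rounding scheme, and the `polar_quantize`, `polar_dequantize`, and `sign_bit_residual` helpers are all assumptions for demonstration.

```python
import numpy as np

def polar_quantize(v, bits=4):
    """Toy polar quantization (illustrative, not Google's method):
    store a vector as its magnitude plus a low-bit unit direction."""
    mag = float(np.linalg.norm(v))
    if mag == 0.0:
        return mag, np.zeros(len(v), dtype=np.int8), 1
    direction = v / mag                      # unit vector, entries in [-1, 1]
    levels = 2 ** (bits - 1) - 1             # symmetric int range, e.g. 7 for 4 bits
    q = np.round(direction * levels).astype(np.int8)
    return mag, q, levels

def polar_dequantize(mag, q, levels):
    """Rebuild an approximate vector from (magnitude, quantized direction)."""
    d = q.astype(np.float32) / levels
    n = np.linalg.norm(d)
    if n > 0:
        d = d / n                            # snap back onto the unit sphere
    return mag * d

def sign_bit_residual(v, v_hat):
    """One sign bit per coordinate of the leftover error (illustrative
    stand-in for the role the article ascribes to QJL)."""
    return np.signbit(v - v_hat)

# Usage: a 4-bit round trip keeps the reconstruction close to the original.
v = np.array([3.0, 4.0, 0.0])
mag, q, levels = polar_quantize(v)
v_hat = polar_dequantize(mag, q, levels)
bits = sign_bit_residual(v, v_hat)
```

The magnitude is stored once per vector at full precision, so only the bounded unit direction needs low‑bit codes; that separation is what removes the per‑vector scale constants the article alludes to.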