
Google’s TurboQuant Cuts KV Cache, Fuels Debate Over Memory Demand

The technique targets short‑term inference memory rather than training hardware.

Overview

  • Google says TurboQuant shrinks the key‑value cache to about one‑sixth of its original size and can speed inference by up to 8x on Nvidia H100 GPUs with near‑original accuracy.
  • The KV cache is the short‑term memory that stores prior token results during a chat, so a lighter cache lets models handle longer inputs, answer faster, and serve more users on the same hardware (see the sketch after this list).
  • Memory suppliers saw sharp share‑price drops after the blog post, with Samsung down 4.7%, SK hynix down 6.2%, Micron down 6.9%, and SanDisk down 11%, reflecting worries about chip demand.
  • Analysts and academics describe it as an efficiency gain that lowers inference costs and tends to expand AI use, including on‑device AI where memory is tight, rather than a reduction in long‑run memory demand.
  • The approach has not yet been proven on large proprietary models, with broader validation expected at ICLR 2026 and through a planned Q2 code release, while Nvidia readies a related cache method for the same venue.
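The coverage above explains what KV‑cache compression buys, not how TurboQuant computes it. The snippet below is therefore only a minimal sketch of generic KV‑cache quantization under assumed parameters, not Google's method: cached keys and values are stored as low‑bit symmetric integer codes with one scale per channel (here 3‑bit codes and a float16 scale, in NumPy, for a single attention head), and are dequantized on the fly when attention reads the cache. The names `quantize_per_channel` and `attention_with_quantized_cache` are illustrative, not from any released code.

```python
import numpy as np

def quantize_per_channel(x: np.ndarray, num_bits: int = 3):
    """Symmetric per-channel quantization of a (tokens, head_dim) array.

    Stores low-bit integer codes plus one float16 scale per channel. At
    ~3 bits per value the codes are roughly one-sixth the size of an fp16
    cache, though here they sit in int8 containers for simplicity; a real
    kernel would pack the bits.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max(axis=0, keepdims=True) / qmax   # one scale per channel
    scale = np.where(scale == 0.0, 1.0, scale)             # avoid divide-by-zero
    codes = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale.astype(np.float16)

def dequantize(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scale.astype(np.float32)

def attention_with_quantized_cache(query, k_codes, k_scale, v_codes, v_scale):
    """One attention read against a quantized KV cache: dequantize on the fly."""
    K = dequantize(k_codes, k_scale)              # (tokens, d)
    V = dequantize(v_codes, v_scale)              # (tokens, d)
    logits = K @ query / np.sqrt(query.size)      # (tokens,)
    weights = np.exp(logits - logits.max())       # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                            # (d,)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens, d = 128, 64
    K_fp = rng.standard_normal((tokens, d)).astype(np.float32)
    V_fp = rng.standard_normal((tokens, d)).astype(np.float32)
    q = rng.standard_normal(d).astype(np.float32)

    k_codes, k_scale = quantize_per_channel(K_fp)
    v_codes, v_scale = quantize_per_channel(V_fp)

    # Full-precision reference for comparison.
    logits = K_fp @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    exact = (w / w.sum()) @ V_fp

    approx = attention_with_quantized_cache(q, k_codes, k_scale, v_codes, v_scale)
    print("max abs error vs fp32 attention:", float(np.abs(exact - approx).max()))
```

Dequantizing at read time trades a small amount of extra compute for a much smaller cache, which is the basic mechanism behind the longer‑context and higher‑throughput claims in the overview; TurboQuant's actual quantizer and its reported 8x speedup are not reproduced by this toy example.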