Overview
- Google Research, which unveiled TurboQuant on Tuesday, says the method compresses the key‑value cache used during inference by at least sixfold with no measured loss in model outputs.
- TurboQuant pairs PolarQuant, which encodes each vector's direction and magnitude in polar form so that no extra per-vector normalization data needs to be stored, with a 1-bit Quantized Johnson-Lindenstrauss (QJL) step that corrects the residual quantization error without storing quantization constants.
- Tests on open-source models such as Gemma and Mistral across long-context suites showed 3- to 4-bit KV storage matching full precision, with up to 8x faster attention on NVIDIA H100 GPUs and higher recall in vector search than PQ and RaBitQ baselines.
- Memory stocks fell after the news: SK Hynix dropped about 6% and Samsung nearly 5% in South Korea on Thursday, and U.S. names including Micron and SanDisk traded lower. Some investors bet on reduced memory demand, though several analysts called the move profit-taking and noted that efficiency gains can expand total usage.
- The work targets inference working memory rather than model weights or training memory, and it remains research for now, with presentations scheduled for April at ICLR and AISTATS. That leaves real-world adoption, integration costs, and effects on chip demand as the key open questions.
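The 1-bit Quantized Johnson-Lindenstrauss idea mentioned above can be illustrated with a short sketch. This is not Google's implementation; it is a minimal, self-contained demonstration of the general technique, assuming the standard QJL construction: project a key vector through a random Gaussian matrix, keep only the sign bits plus the key's norm, and recover an unbiased inner-product estimate against a query. All function names and the dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def qjl_encode(key, S):
    """Compress a key to 1 bit per projection row plus its scalar norm.

    Only sign(S @ key) and ||key|| are stored; no quantization
    constants (scale/zero-point) are needed per vector.
    """
    return np.sign(S @ key), float(np.linalg.norm(key))

def qjl_inner_product(query, signs, key_norm, S):
    """Unbiased inner-product estimate from sign bits.

    For Gaussian rows s_i, E[sign(<s_i, k>) * <s_i, q>] equals
    sqrt(2/pi) * <q, k> / ||k||, so rescaling by sqrt(pi/2) * ||k||
    and averaging over rows recovers <q, k> in expectation.
    """
    m = S.shape[0]
    return np.sqrt(np.pi / 2) * key_norm * float(signs @ (S @ query)) / m

d = 64        # embedding dimension (illustrative)
m = 8192      # projection rows; more rows -> lower estimator variance
S = rng.standard_normal((m, d))

key = rng.standard_normal(d)
query = rng.standard_normal(d)

signs, key_norm = qjl_encode(key, S)
est = qjl_inner_product(query, signs, key_norm, S)
true = float(query @ key)
```

In a real KV-cache setting the projection dimension is chosen to balance memory against estimator variance; here `m` is deliberately large so the toy estimate lands close to the exact inner product.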