Overview
- Google published TurboQuant, a technique that compresses the key‑value cache—the short‑term memory that stores past tokens during inference—roughly sixfold without retraining.
- Lab results showed up to an eightfold speed boost on Nvidia H100 GPUs and stable accuracy at three‑bit cache precision on models like Gemma and Mistral.
- Memory stocks fell after the announcement, with companies such as SanDisk and Micron sliding as investors questioned whether future AI workloads would need fewer chips.
- Morgan Stanley and other analysts said the effect is confined to inference caches and kept Overweight ratings on Micron and SanDisk, noting that high‑bandwidth memory used for training should be unaffected.
- KAIST co‑developer Han In‑su said the efficiency gains will broaden AI adoption and could lift total memory demand over time. Developers have already ported TurboQuant to local inference frameworks, with wider validation and conference presentations expected this spring.
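To make the compression claim concrete, the sketch below shows a generic symmetric 3‑bit quantization of a key‑value cache slice, with one fp16 scale per row. This is not Google's actual TurboQuant algorithm, whose details are not given here; it only illustrates how storing 3‑bit codes instead of fp16 values yields a compression ratio in the ballpark the article reports (a bit over 5x once per‑row scales are counted, versus the quoted ~6x).

```python
import numpy as np

def quantize_3bit(x):
    """Symmetric per-row 3-bit quantization: map each value to one of
    8 integer levels in [-4, 3], with a single fp16 scale per row."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 4.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero rows
    q = np.clip(np.round(x / scale), -4, 3).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q, scale):
    """Recover approximate fp32 values from codes and per-row scales."""
    return q.astype(np.float32) * scale.astype(np.float32)

# Toy "KV cache" slice: 1024 cached tokens, head dimension 128 (made-up sizes)
rng = np.random.default_rng(0)
kv = rng.normal(size=(1024, 128)).astype(np.float32)
q, scale = quantize_3bit(kv)

# Memory accounting: fp16 baseline is 16 bits/value; the quantized form
# stores packed 3-bit codes plus one 16-bit scale per row.
fp16_bits = kv.size * 16
quant_bits = kv.size * 3 + scale.size * 16
print(f"compression ratio: {fp16_bits / quant_bits:.1f}x")  # → 5.1x
```

A real system would also pack the 3‑bit codes into bytes and fuse dequantization into the attention kernel, which is where speedups like the reported eightfold gain on H100s would come from.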