Overview
- Google detailed TurboQuant, unveiled Tuesday, saying the technique shrinks the key‑value cache (the store of past token calculations used during inference) by at least 6x and delivers up to 8x speed gains with no accuracy loss.
- TurboQuant pairs PolarQuant, which splits each vector into its magnitude and direction, with QJL, which compresses the residual error into a single sign bit, avoiding the extra constants that usually bloat quantized memory.
- In Thursday’s selloff, Samsung fell about 5% and SK Hynix about 6% in Seoul, while Kioxia dropped nearly 6% in Japan and U.S. peers Micron, SanDisk, and Western Digital also declined.
- Brokerages and researchers said the pullback looks like profit‑taking and argued that lower inference costs can increase throughput per chip and broaden adoption, a path that could lift long‑term memory demand.
- The technique applies to inference caches rather than model weights and has been validated on open‑source models so far, with a fuller technical presentation scheduled for ICLR 2026 in April.
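The magnitude/direction split plus sign-bit residual described above can be sketched in a few lines. This is an illustrative reconstruction, not Google's implementation: the function names, the Gaussian random projection, and the sign-bit inner-product estimator are assumptions based on published JL-style sign-bit quantization, used here only to show why storing one float (the norm) plus one bit per projected coordinate avoids extra quantization constants.

```python
import numpy as np

def quantize_key(v: np.ndarray, proj: np.ndarray) -> tuple[float, np.ndarray]:
    """Quantize a cached key vector: store its norm as a single float
    (the "magnitude") and its direction as sign bits of a random
    Gaussian projection (JL-style). Hypothetical sketch, not the paper's code."""
    norm = float(np.linalg.norm(v))
    direction = v / norm               # unit vector: magnitude/direction split
    signs = proj @ direction > 0       # one bit per projected coordinate
    return norm, signs

def inner_product_estimate(q: np.ndarray, norm: float, bits: np.ndarray,
                           proj: np.ndarray) -> float:
    """Estimate <q, k> from the quantized key. For Gaussian rows p,
    E[sgn(p . u)(p . q)] = sqrt(2/pi) * (u . q), so averaging and
    rescaling by sqrt(pi/2) recovers q . direction without stored constants."""
    m = proj.shape[0]
    signs = np.where(bits, 1.0, -1.0)
    est_cos = (signs @ (proj @ q)) * np.sqrt(np.pi / 2) / m
    return norm * est_cos
```

With enough projection rows the estimate concentrates around the true dot product, which is the quantity attention actually needs from the cache; only the norm is kept at full precision, so no per-block scale or zero-point constants are stored.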