Overview
- Vendors have cut prices and reported very low cached‑token rates: DeepSeek made a permanent 75% price cut on its V4‑Pro model and reported cached‑input costs as low as RMB 0.025 per 1M tokens, while 01.ai was reported at close to $0.14 per 1M tokens.
- Engineers are using sparse Mixture‑of‑Experts (MoE) designs and lower‑precision training such as FP8, so that only about 37 billion of a model's 671 billion total parameters are active per token, and they claim 90–97% reductions in inference compute per token.
- Trade and chip limits have shaped the shift: U.S. export controls on high‑end Nvidia hardware pushed Chinese teams to optimize on the export‑compliant H800 and prioritize software workarounds over raw chip access.
- Developers and platforms are adapting quickly by routing requests, caching inputs, and using multi‑key proxy setups to exploit cheaper models. This is driving rapid volume growth on open routing services, though frictions such as per‑key rate limits and key‑management costs remain.
- If reported cost claims hold up, the moves could weaken some capital‑expenditure advantages of U.S. incumbents and lower the cost of AI for crypto and Web3 use cases, but the coverage notes that many figures are preliminary and operational limits remain.
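
As a rough sanity check on the MoE figures in the overview, the sketch below computes what share of a 671‑billion‑parameter model is active when about 37 billion parameters run per token. It assumes per‑token compute scales roughly linearly with active parameters, which is a simplification and not a claim from the coverage; the function names are illustrative.

```python
# Figures taken from the overview; everything else is an illustrative assumption.
TOTAL_PARAMS = 671e9   # total parameters in the sparse MoE model
ACTIVE_PARAMS = 37e9   # parameters activated per token


def active_fraction(active: float, total: float) -> float:
    """Share of parameters that take part in one token's forward pass."""
    return active / total


def compute_reduction(active: float, total: float) -> float:
    """Estimated per-token compute saving versus a dense model of equal size.

    Assumption: per-token compute is proportional to active parameters.
    """
    return 1.0 - active_fraction(active, total)


if __name__ == "__main__":
    print(f"active share: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")
    print(f"estimated reduction: {compute_reduction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")
```

Under that linear assumption the active share is about 5.5% and the estimated reduction about 94.5%, which sits inside the reported 90–97% range. Real savings also depend on routing overhead, memory bandwidth, and precision, none of which this arithmetic models.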
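
The per‑key rate limits mentioned in the overview can be illustrated with a minimal client‑side sketch of a key pool that tracks a sliding window of requests per key. The `KeyPool` class, its parameters, and the limit values are hypothetical and do not describe any particular provider or routing service.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List, Optional


class KeyPool:
    """Hypothetical pool that hands out an API key with remaining quota."""

    def __init__(
        self,
        keys: List[str],
        limit: int,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._limit = limit        # assumed max requests per key per window
        self._window_s = window_s  # assumed sliding-window length in seconds
        self._clock = clock        # injectable clock, useful for testing
        self._sent: Dict[str, Deque[float]] = {k: deque() for k in keys}

    def acquire(self) -> Optional[str]:
        """Return the first key with spare capacity, or None if all are exhausted."""
        now = self._clock()
        for key, stamps in self._sent.items():
            # Drop timestamps that have aged out of the sliding window.
            while stamps and now - stamps[0] >= self._window_s:
                stamps.popleft()
            if len(stamps) < self._limit:
                stamps.append(now)
                return key
        return None
```

A caller would request a key before each upstream call and back off when `acquire()` returns `None`. The bookkeeping per key is the kind of key‑management cost the overview refers to: every additional key adds state to track and rotate.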