Overview
- Chinese firms have cut API prices and engineered models that cost roughly 90–97% less to run per token by using sparse mixture-of-experts architectures and lower-precision training.
- DeepSeek reports training its V3 model for about $5.58 million and has permanently slashed V4-Pro prices by roughly 75%, with cached input costs falling to near zero in local currency terms.
- Mixture-of-experts (MoE) architectures activate only a subset of a model's parameters for each token, which sharply cuts the compute needed per inference and accounts for the bulk of the cost gap.
- Developers are moving production traffic to these models through OpenAI-compatible endpoints, routing layers, multi-key proxies and aggressive caching to capture the savings, and some report large drops in their monthly bills.
- Adoption is limited by practical frictions such as per-key rate limits, peak-hour latency, content filters and data-residency concerns, and these frictions have already pushed companies to add caps and other cost controls.
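
To make the mixture-of-experts point concrete, here is a rough sketch of the per-token compute arithmetic. It assumes the common rule of thumb of about 2 FLOPs per active parameter for a forward pass. The parameter counts are illustrative placeholders, not figures for any model named above, and the function names are hypothetical.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb estimate: about 2 FLOPs per active parameter per token."""
    return 2.0 * active_params


def moe_compute_saving(total_params: float, active_params: float) -> float:
    """Fraction of per-token compute saved versus a dense model of the same total size."""
    if not 0 < active_params <= total_params:
        raise ValueError("active_params must be in (0, total_params]")
    dense = forward_flops_per_token(total_params)
    sparse = forward_flops_per_token(active_params)
    return 1.0 - sparse / dense


if __name__ == "__main__":
    # Illustrative placeholder sizes (assumptions, not reported figures).
    total = 600e9   # all parameters stored in the model
    active = 36e9   # parameters actually used for one token
    print(f"Per-token compute saving: {moe_compute_saving(total, active):.0%}")
    # prints "Per-token compute saving: 94%"
```

This estimate ignores attention cost, expert-routing overhead and memory bandwidth, so real-world savings will differ. It only shows why activating a small fraction of parameters can produce a cost gap of this order.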
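
The multi-key proxy and caching pattern mentioned above can be sketched roughly as follows. This is a minimal client-side illustration, not any vendor's implementation: `CachingKeyRouter`, `RateLimited` and the backend callable are hypothetical names, the cache is a simple exact-match lookup (distinct from the provider-side cached-input pricing), and the backend is a stand-in for whatever OpenAI-compatible client a team actually uses.

```python
import hashlib
import itertools
from typing import Callable, Dict, List

# A backend is any callable taking (api_key, prompt) and returning text.
# In practice it would wrap an OpenAI-compatible client (assumption).
Backend = Callable[[str, str], str]


class RateLimited(Exception):
    """Raised by a backend when a key has hit its per-key rate limit."""


class CachingKeyRouter:
    """Rotates requests across API keys and caches exact-match prompts."""

    def __init__(self, backend: Backend, api_keys: List[str]) -> None:
        if not api_keys:
            raise ValueError("at least one API key is required")
        self._backend = backend
        self._keys = list(api_keys)
        self._cycle = itertools.cycle(self._keys)
        self._cache: Dict[str, str] = {}
        self.backend_attempts = 0  # counts every call made to the backend

    def complete(self, prompt: str) -> str:
        cache_key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if cache_key in self._cache:
            return self._cache[cache_key]  # cache hit: no backend call

        # Try each key at most once per request, in round-robin order.
        for _ in range(len(self._keys)):
            key = next(self._cycle)
            self.backend_attempts += 1
            try:
                result = self._backend(key, prompt)
            except RateLimited:
                continue  # this key is throttled, move on to the next one
            self._cache[cache_key] = result
            return result
        raise RateLimited("all keys are rate limited")


if __name__ == "__main__":
    def fake_backend(api_key: str, prompt: str) -> str:
        # Stand-in backend: the first key is always throttled.
        if api_key == "key-1":
            raise RateLimited(api_key)
        return f"echo:{prompt}"

    router = CachingKeyRouter(fake_backend, ["key-1", "key-2"])
    print(router.complete("hello"))   # served by key-2 after key-1 is throttled
    print(router.complete("hello"))   # served from the cache
    print(router.backend_attempts)    # 2
```

A production proxy would also need cache expiry, per-key backoff and handling for the latency and data-residency concerns listed above. The sketch only shows how rotation and caching reduce the number of billable calls.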