Particle.news

Chinese AI Models Undercut Western APIs by More Than 90%

Sparse mixture-of-experts designs, combined with cheaper GPUs, training shortcuts and lower labor costs, have cut per-token inference costs and prompted developers to switch production traffic quickly.

Overview

  • Chinese firms have cut API prices and engineered models that cost roughly 90–97% less to run per token by using sparse mixture-of-experts architectures and lower-precision training.
  • DeepSeek reports training its V3 model for about $5.58 million and has permanently slashed V4-Pro prices by roughly 75%, with cached input costs falling to near zero in local currency terms.
  • Mixture-of-experts (MoE) designs activate only a subset of a model's parameters for each token, which sharply cuts the compute needed per inference and accounts for the bulk of the cost gap.
  • Developers are switching production traffic through OpenAI-compatible endpoints, routing layers, multi-key proxies and aggressive caching to capture savings, with some reporting large monthly bill drops.
  • Adoption is limited by practical frictions such as per-key rate limits, peak-hour latency, content filters and data-residency concerns, and these frictions have already pushed companies to add caps and other cost controls.
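
To make the mixture-of-experts point concrete, here is a minimal Python sketch of top-k expert routing and the share of parameters touched per token. The expert count, top-k value and parameter sizes are hypothetical numbers chosen for illustration, not figures from DeepSeek or any other vendor, and the helper names `top_k_route` and `active_fraction` are made up for this example.

```python
import math
import random


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and return (index, weight) pairs.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_fraction(num_experts, k, expert_params, shared_params):
    """Fraction of all parameters that one token actually uses."""
    total = shared_params + num_experts * expert_params
    active = shared_params + k * expert_params
    return active / total


# Hypothetical layer layout (assumed values, for illustration only).
NUM_EXPERTS = 64
TOP_K = 4
EXPERT_PARAMS = 1_000_000   # parameters per expert
SHARED_PARAMS = 8_000_000   # attention and other always-on parameters

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routes = top_k_route(logits, TOP_K)
frac = active_fraction(NUM_EXPERTS, TOP_K, EXPERT_PARAMS, SHARED_PARAMS)

print("experts used for this token:", [i for i, _ in routes])
print(f"active parameter share: {frac:.3f}")  # 12M / 72M -> 0.167
```

In this toy layout each token touches 12 million of 72 million parameters, about one sixth of the model. Per-token compute scales roughly with the active parameters rather than the total, which is the mechanism behind the cost gap described above. Real systems add load balancing and batching concerns that this sketch leaves out.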