Particle.news

Chinese AI Models Collapse Per‑Token Costs, Forcing Developers to Rearchitect How They Use LLMs

Major Chinese vendors cut API prices this week and developers are routing, caching, and unifying gateways to capture steep real‑world savings.

Overview

  • DeepSeek made a large V4 discount permanent on May 22, and Xiaomi cut MiMo‑V2.5 API prices on May 26, pushing cached‑input rates to near zero for some tiers and locking in much lower baseline charges.
  • Developers reported large, immediate bill reductions this week after changing default models and adding tiered routing and response caching; published examples show monthly bills falling from hundreds or thousands of dollars to tens of dollars.
  • Unified gateways and self‑hosted OpenAI‑compatible proxies now let teams access Chinese models with a single API key and non‑China billing, removing old payment barriers at the cost of extra key management and operational work.
  • Technical advances explain the cuts: matured mixture‑of‑experts architectures activate far fewer parameters per token, and hierarchical KV‑cache optimizations reduce the cost of inference over repeated context. Context window size and cache pricing, in turn, reshape costs for agentic and large‑context workloads.
  • The immediate market effect is a shift from vendor loyalty to a price×performance calculation that forces product teams to choose models by cost per useful token, though practical frictions remain, such as per‑key throughput limits, multi‑key management, and provider uptime differences.
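
To make the routing, caching, and cached‑input pricing described above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not any vendor's or developer's actual setup: the tier names, the per‑million‑token prices, the length‑based routing heuristic, and the `call_model` callback are all hypothetical placeholders, and real prices should be taken from each provider's current price list.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    input_price: float         # USD per million uncached input tokens (placeholder)
    cached_input_price: float  # USD per million cached input tokens (placeholder)
    output_price: float        # USD per million output tokens (placeholder)


# Hypothetical tiers; the numbers are illustrative, not published prices.
CHEAP = ModelTier("cheap-model", 0.10, 0.01, 0.20)
PREMIUM = ModelTier("premium-model", 3.00, 0.30, 15.00)


def request_cost(tier, input_tokens, cached_tokens, output_tokens):
    """Estimate the cost of one request in USD.

    Input tokens served from the provider's context cache are billed
    at the cached rate; the remaining input tokens at the normal rate.
    """
    uncached_tokens = input_tokens - cached_tokens
    total = (
        uncached_tokens * tier.input_price
        + cached_tokens * tier.cached_input_price
        + output_tokens * tier.output_price
    )
    return total / 1_000_000


def pick_tier(prompt, long_prompt_chars=2000):
    """Toy routing rule (an assumption): long prompts go to the premium tier."""
    return PREMIUM if len(prompt) > long_prompt_chars else CHEAP


class CachedRouter:
    """Routes each prompt to a tier and reuses responses for repeated prompts."""

    def __init__(self, call_model):
        # call_model(model_name, prompt) -> str is supplied by the caller,
        # e.g. a thin wrapper around an OpenAI-compatible gateway client.
        self.call_model = call_model
        self.cache = {}
        self.hits = 0

    def complete(self, prompt):
        tier = pick_tier(prompt)
        key = hashlib.sha256(f"{tier.name}\n{prompt}".encode()).hexdigest()
        if key in self.cache:
            self.hits += 1  # response cache hit: no API call, no charge
            return self.cache[key]
        response = self.call_model(tier.name, prompt)
        self.cache[key] = response
        return response
```

With these placeholder prices, `request_cost(CHEAP, 100_000, 90_000, 1_000)` prices 90% of the input at the cached rate, which is why cache pricing matters so much for agentic workloads that resend the same long context on every step. A production router would also need cache expiry, a better difficulty signal than prompt length, and handling for the per‑key throughput limits mentioned above.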