Overview
- DeepSeek released a V4 preview on Friday with a one million‑token context window across two models: V4‑Pro, with 1.6 trillion parameters (49 billion active), and V4‑Flash, with 284 billion parameters (13 billion active). Per‑token rates for input and output have been published.
- A hybrid attention design compresses the key‑value cache about ninefold, dropping memory at one million tokens from roughly 83.9 GiB to 9.62 GiB, which makes long‑context runs feasible on fewer GPUs.
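The memory figures above can be sanity‑checked with simple arithmetic. This sketch uses only the two cache sizes quoted in the summary; the 80 GiB per‑GPU HBM figure is an illustrative assumption, not something stated by DeepSeek.

```python
import math

# KV-cache sizes at one million tokens, as quoted in the summary.
full_cache_gib = 83.9        # without the hybrid attention design
compressed_gib = 9.62        # with the hybrid attention design

# Compression ratio works out to roughly ninefold.
ratio = full_cache_gib / compressed_gib
print(f"compression ratio: {ratio:.1f}x")

# Assumed HBM per accelerator (e.g. an 80 GiB card) -- an assumption
# for illustration only, to show why fewer GPUs suffice.
hbm_per_gpu_gib = 80.0
gpus_before = math.ceil(full_cache_gib / hbm_per_gpu_gib)
gpus_after = math.ceil(compressed_gib / hbm_per_gpu_gib)
print(f"GPUs needed for the cache alone: {gpus_before} -> {gpus_after}")
```

At these sizes the uncompressed cache alone overflows a single 80 GiB card, while the compressed cache fits comfortably, leaving headroom for weights and activations on the remaining devices.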
- Huawei confirmed its Ascend chips and supernode clusters fully support V4 at inference, while NVIDIA showed day‑zero Blackwell performance near 3,500 tokens per second per GPU.
- To pull in developers, DeepSeek is offering a 75% discount on V4‑Pro and cutting input cache‑hit fees to one‑tenth of prior levels, which lowers costs for repeated or similar requests.
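To see how the two discounts compound, here is a minimal cost sketch. The base rate is a hypothetical placeholder, not DeepSeek's published price; only the 75% discount and the one‑tenth cache‑hit multiplier come from the summary.

```python
# Hypothetical base input rate in dollars per million tokens.
base_input_per_mtok = 1.00

# 75% promotional discount on V4-Pro (from the summary).
discounted = base_input_per_mtok * 0.25

# Cache-hit input billed at one-tenth of the (discounted) rate (from the summary).
cache_hit_rate = discounted * 0.10

# Example request: a 900k-token prefix already in cache plus 100k fresh tokens.
cached_tokens, fresh_tokens = 900_000, 100_000
cost = (cached_tokens * cache_hit_rate + fresh_tokens * discounted) / 1_000_000
print(f"input cost: ${cost:.4f}")
```

Under these assumptions, the mostly‑cached request costs a small fraction of what the same million input tokens would at the undiscounted rate, which is why the pricing favors repeated or similar requests.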
- Independent benchmarks place V4‑Pro among leading open‑weight models yet still behind top closed systems such as GPT‑5.5, Gemini 3.1 Pro, and Claude Opus on composite scores.