Overview
- The model targets multi‑agent workflows, with 120 billion total parameters, 12 billion of them active at inference, and a 1‑million‑token context window to preserve task state.
- A hybrid Mamba–MoE design interleaves Mamba state‑space (SSM) layers with transformer attention layers for reasoning, adds Latent MoE for greater accuracy, and uses multi‑token prediction to speed inference; a toy sketch of sparse expert routing follows this list.
- Nvidia claims up to 5× higher throughput and up to 2× higher accuracy versus the prior Nemotron Super, with NVFP4 on Blackwell reported as up to 4× faster than FP8 on Hopper.
- Access is available on build.nvidia.com, Perplexity, OpenRouter, and Hugging Face, with enterprise routes via Google Cloud Vertex AI and Oracle Cloud (AWS Bedrock and Microsoft Azure listed as coming), plus Dell and HPE offerings and NIM packaging for on‑prem or cloud deployment; a minimal API‑access sketch follows this list.
- Nvidia cites leaderboard gains on Artificial Analysis and DeepResearch; Wccftech reports an 85.6% PinchBench score and notes some workloads may run on a single GPU, while The New Stack highlights 478 tokens per second in third‑party tracking.
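The 120‑billion‑total / 12‑billion‑active split described above is the signature of sparse mixture‑of‑experts routing, where a router activates only a few experts per token. The sketch below is a generic top‑k MoE forward pass in NumPy, not Nvidia's Latent MoE implementation; the expert count, top‑k value, and dimensions are invented purely for illustration.

```python
# Toy illustration of sparse mixture-of-experts routing: a large total parameter
# count, but only a small fraction of it runs per token (the idea behind a
# 120B-total / 12B-active configuration). Generic top-k router in NumPy,
# not Nvidia's implementation; all sizes here are made up.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2          # illustrative sizes only

# Each feed-forward "expert" is an up-projection and a down-projection.
experts = [
    (rng.standard_normal((d_model, 4 * d_model)) * 0.02,
     rng.standard_normal((4 * d_model, d_model)) * 0.02)
    for _ in range(n_experts)
]
router_w = rng.standard_normal((d_model, n_experts)) * 0.02


def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route each token to its top_k experts and mix their outputs."""
    logits = x @ router_w                                  # (tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)             # softmax over experts
    top_idx = np.argsort(probs, axis=-1)[:, -top_k:]       # chosen experts per token

    out = np.zeros_like(x)
    for t in range(x.shape[0]):                            # per token, for clarity
        chosen = top_idx[t]
        weights = probs[t, chosen] / probs[t, chosen].sum()  # renormalize gate weights
        for w, e in zip(weights, chosen):
            w_up, w_down = experts[e]
            out[t] += w * (np.maximum(x[t] @ w_up, 0.0) @ w_down)  # ReLU FFN expert
    return out


tokens = rng.standard_normal((4, d_model))
print(moe_forward(tokens).shape)   # (4, 64)

# Only top_k of n_experts run per token, so the active parameter count is
# roughly top_k / n_experts of the expert weights.
```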
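For the access routes listed above, hosted endpoints on build.nvidia.com and OpenRouter are generally OpenAI‑compatible, so a call looks roughly like the sketch below. The base URL, environment variable, and model id are assumptions to verify against the actual model card, since the article does not give them.

```python
# Minimal sketch of calling the hosted model through an OpenAI-compatible endpoint.
# The base_url, environment variable, and model id are placeholders -- check the
# model card on build.nvidia.com or OpenRouter for the real values.
import os
from openai import OpenAI   # pip install openai

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed NVIDIA-hosted endpoint
    api_key=os.environ["NVIDIA_API_KEY"],            # key issued via build.nvidia.com
)

response = client.chat.completions.create(
    model="nvidia/<model-id>",   # placeholder: use the id from the model card
    messages=[
        {"role": "user", "content": "Summarize the state of this multi-step task."}
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

For the NIM on‑prem route, the containers typically expose the same OpenAI‑compatible interface locally, so the call above would only need a different base_url.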