Delivering Massive Performance Leaps for Mixture of Experts Inference on NVIDIA Blackwell: GB200 NVL72 MoE inference performance

[Image: NVIDIA GB200 NVL72 rack-scale NVLink system with Blackwell GPUs]

By Agustin Giovagnoli / January 7, 2026

NVIDIA is positioning Mixture‑of‑Experts (MoE) as the default pattern for frontier open‑source models and is aligning its Blackwell generation around accelerating these workloads at rack scale. The company’s GB200 NVL72 system and software stack aim to deliver a step change in MoE inference performance, translating to higher throughput and lower cost per token for production deployments [1][3].

Why MoE is becoming the go‑to for frontier LLMs

MoE activates only a small subset of specialized experts per token, enabling model capability to scale without a proportional increase in compute and energy. This architecture underpins emerging frontier models, with NVIDIA highlighting MoE momentum as organizations seek greater efficiency at massive parameter counts [3]. As MoE adoption rises, Blackwell‑class systems target the communication, precision, and scheduling bottlenecks that previously limited real‑time, large‑scale inference [3].
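To make that sparsity concrete, here is a minimal sketch of top‑k expert routing in PyTorch; the expert count, hidden size, and gating scheme are illustrative assumptions rather than any particular model’s configuration.

```python
# Minimal top-k MoE layer: a router picks a few experts per token, so compute
# scales with top_k rather than with the total number of experts.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)          # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                    # x: [tokens, d_model]
        logits = self.router(x)                              # [tokens, n_experts]
        weights, idx = torch.topk(logits.softmax(-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel():                            # only selected experts run
                out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

print(TinyMoE()(torch.randn(16, 512)).shape)   # torch.Size([16, 512])
```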

For a broader architectural backdrop, see NVIDIA’s Blackwell platform overview on the company’s site.

How GB200 NVL72 MoE inference performance reshapes deployment

GB200 NVL72 integrates 36 Grace CPUs and 72 Blackwell GPUs into a single 72‑GPU NVLink domain that behaves like one large GPU. The rack‑scale NVLink topology, fifth‑generation NVLink, and liquid cooling are engineered to keep utilization high and latency low for expert routing and inter‑expert communication—core requirements for MoE serving at scale [1]. NVIDIA and partners showcase MoE‑tuned frameworks and runtimes as part of this rack‑aware approach [2].
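The inter‑expert communication that the NVLink domain accelerates is, at its core, an all‑to‑all token exchange between GPUs. The sketch below, assuming PyTorch with the NCCL backend and equal per‑rank token counts, shows that dispatch pattern in its simplest form; production serving stacks use fused, load‑balanced variants of it.

```python
# Hedged sketch of the all-to-all token exchange behind expert-parallel MoE
# serving; group size and tensor shapes are illustrative only.
import os
import torch
import torch.distributed as dist

def dispatch_tokens(local_tokens: torch.Tensor) -> torch.Tensor:
    """Each rank sends one equal slice of its tokens to every other rank,
    i.e., to the GPUs hosting the experts those tokens were routed to."""
    world = dist.get_world_size()
    assert local_tokens.shape[0] % world == 0
    received = torch.empty_like(local_tokens)
    # all_to_all_single splits `local_tokens` into `world` chunks along dim 0
    # and swaps chunk i with rank i -- the core MoE dispatch/combine pattern.
    dist.all_to_all_single(received, local_tokens)
    return received

if __name__ == "__main__":
    # Typically launched with: torchrun --nproc-per-node=<gpus> this_file.py
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    tokens = torch.randn(8 * dist.get_world_size(), 4096, device="cuda")
    print(dispatch_tokens(tokens).shape)
    dist.destroy_process_group()
```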

Blackwell features that enable 10x‑class gains

Blackwell adds a second‑generation Transformer Engine with FP4 support and new microscaling Tensor Cores. Together with aggressive low‑precision formats like NVFP4, these features target major throughput and efficiency improvements for inference. NVIDIA reports roughly 10x higher MoE inference performance per watt per GPU and about 10x lower cost per token versus the prior Hopper generation for large MoE workloads [1][3].
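As a rough illustration of what block‑scaled 4‑bit quantization does to weights, the emulation below assigns one scale per small block and snaps values to an FP4 (E2M1) grid. The block size and scale handling are simplifications of NVFP4, which additionally stores scales compactly and runs natively on Blackwell Tensor Cores.

```python
# Simplified emulation of block-scaled 4-bit (FP4-style) weight quantization.
import torch

# Representable magnitudes of an E2M1 (FP4) value.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blocked(x: torch.Tensor, block: int = 16):
    """Quantize a 1-D tensor in blocks: one scale per block, 4-bit values."""
    assert x.numel() % block == 0
    x = x.reshape(-1, block)
    scale = (x.abs().amax(dim=1, keepdim=True) / FP4_GRID.max()).clamp_min(1e-12)
    scaled = (x / scale).clamp(-6.0, 6.0)
    # Snap each scaled value to the nearest representable FP4 magnitude.
    idx = (scaled.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(-1)
    q = FP4_GRID[idx] * scaled.sign()
    return q, scale                      # 4-bit codes + per-block scales

def dequantize(q, scale):
    return (q * scale).reshape(-1)

w = torch.randn(4096)
q, s = quantize_fp4_blocked(w)
print("max abs error:", (w - dequantize(q, s)).abs().max().item())
```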

These hardware advances intersect with the growing set of MoE models—such as DeepSeek‑V3, DeepSeek‑R1, Kimi K2 Thinking, and Mistral Large 3—that are designed to exploit expert sparsity during inference [3].

Software stack: MoE‑aware runtimes, kernels, and scheduling

Full‑stack optimization is central. NVIDIA points to MoE‑aware kernels, runtimes, and scheduling in frameworks including TensorRT‑LLM, SGLang, and vLLM as levers for extracting rack‑scale performance from GB200 NVL72. Partner deployments combine these software elements with decoding and serving optimizations to maximize throughput and utilization [2][3].
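As one example of how such a runtime is driven in practice, the snippet below uses vLLM’s offline LLM API. The checkpoint name and parallelism degree are placeholders, and whether MoE‑specific kernels or a low‑precision path are engaged depends on the vLLM version and the hardware it detects.

```python
# Hedged example of serving an MoE checkpoint with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # any MoE checkpoint; placeholder here
    tensor_parallel_size=8,            # shard across the GPUs in one node
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(
    ["Explain mixture-of-experts routing in two sentences."], params
)
for out in outputs:
    print(out.outputs[0].text)
```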

This ecosystem approach is why NVIDIA describes the broader vision—spanning Blackwell and NVIDIA Dynamo—as turning clusters into “intelligent inference systems,” where hardware–software co‑design at rack scale becomes the key competitive advantage [3].

NVFP4 W4A4 on Blackwell: practical 4‑bit MoE inference

Early hands‑on work with Blackwell GB10 demonstrates practical NVFP4 W4A4 MoE inference, including long context windows and high GPU utilization. The feasibility of 4‑bit inference in real workloads is a critical signal for production‑scale MoE, where precision strategies directly impact both latency and cost per token [4]. As these techniques mature, they strengthen the case for migrating latency‑sensitive and cost‑constrained MoE services onto Blackwell‑class hardware [4].
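When evaluating a 4‑bit deployment, the first numbers to pin down are decode throughput and per‑token latency. The framework‑agnostic harness below is one simple way to compare precision settings on equal footing; `generate_fn` is a placeholder for whatever serving call is being benchmarked.

```python
# Small harness for comparing decode latency across precision settings
# (e.g., FP8 vs. a 4-bit W4A4 path). `generate_fn` is a placeholder.
import time
from statistics import median

def decode_stats(generate_fn, prompt: str, new_tokens: int, runs: int = 5):
    """Return median tokens/sec and ms/token over several timed runs."""
    generate_fn(prompt, new_tokens)                 # warm-up (compilation, caches)
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn(prompt, new_tokens)
        times.append(time.perf_counter() - start)
    t = median(times)
    return {"tokens_per_s": new_tokens / t, "ms_per_token": 1000 * t / new_tokens}

if __name__ == "__main__":
    # Dummy generator standing in for a real 4-bit MoE deployment (~10 ms/token).
    dummy = lambda prompt, n: time.sleep(0.01 * n)
    print(decode_stats(dummy, "hello", new_tokens=32))
```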

Performance and cost expectations for Blackwell MoE serving

NVIDIA cites up to roughly 10x MoE performance per watt per GPU and about 10x lower cost per token versus Hopper for large MoE inference. With GB200 NVL72 deployed alongside Quantum‑X800 InfiniBand or Spectrum‑X Ethernet and ConnectX‑8 SuperNICs, the company reports up to 30x real‑time performance for trillion‑parameter LLM inference and significant AI factory throughput gains [1][3]. These figures reflect the combination of rack‑scale NVLink, low‑precision inference (including FP4/NVFP4), and MoE‑tuned software stacks [1][3].
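To see how a throughput‑per‑watt gain flows into cost per token, the back‑of‑the‑envelope model below combines GPU‑hour pricing with energy cost. All inputs are illustrative assumptions, not measured GB200 or Hopper figures.

```python
# Toy cost-per-token model: a ~10x throughput gain at similar power shrinks
# both the GPU-hour and energy components by roughly the same factor.
def cost_per_million_tokens(tokens_per_s: float, gpu_power_w: float,
                            usd_per_kwh: float = 0.10,
                            gpu_hourly_usd: float = 3.0) -> float:
    hours_per_mtok = 1e6 / tokens_per_s / 3600
    energy_kwh = gpu_power_w / 1000 * hours_per_mtok
    return hours_per_mtok * gpu_hourly_usd + energy_kwh * usd_per_kwh

baseline = cost_per_million_tokens(tokens_per_s=1_000, gpu_power_w=700)
faster = cost_per_million_tokens(tokens_per_s=10_000, gpu_power_w=700)
print(f"baseline ${baseline:.2f}/Mtok vs. ~10x throughput ${faster:.2f}/Mtok")
```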

The momentum behind MoE includes leading open‑source frontier models—DeepSeek‑R1 and others—that emphasize specialized experts to amplify capability without linear compute growth [5].

Deployment architecture and networking

GB200 NVL72 is designed to be paired with next‑generation fabrics—Quantum‑X800 InfiniBand or Spectrum‑X Ethernet—and ConnectX‑8 SuperNICs to sustain cluster‑level throughput. This combination is positioned to reduce bottlenecks during expert routing and token exchange, helping the unified NVLink domain behave as a single, large accelerator for serving [1][3].

When to choose Blackwell for MoE (quick checklist)

  • You are consolidating large MoE inference onto a rack‑scale NVLink domain to reduce tail latency and cost per token [1][3].
  • You plan to leverage TensorRT‑LLM, vLLM, or SGLang for MoE‑aware kernels and scheduling on Blackwell [2][3].
  • Your precision roadmap includes FP4/NVFP4 and you are evaluating NVFP4 W4A4 for production [1][4].
  • Your networking stack can adopt Quantum‑X800 or Spectrum‑X with ConnectX‑8 SuperNICs for end‑to‑end throughput [1][3].

For teams building operational playbooks and benchmarking frameworks, you can also explore AI tools and playbooks.

Outlook

From models like DeepSeek‑V3 and Mistral Large 3 to rack‑scale deployments, the direction is clear: MoE is moving mainstream, and Blackwell’s hardware–software co‑design is built to meet it. Expect GB200 NVL72 MoE inference performance improvements to translate into faster time‑to‑value for real‑time and high‑throughput services, with 4‑bit inference becoming increasingly practical in production settings [1][3][4][5].

Sources

[1] GB200 NVL72 | NVIDIA
https://www.nvidia.com/en-us/data-center/gb200-nvl72/

[2] NVIDIA GB200 NVL72 Boosts Frontier AI Models with Mixture-of-Experts
https://www.linkedin.com/posts/dionharris_mixture-of-experts-powers-the-most-intelligent-activity-7402169546117328896-cXxg

[3] Mixture of Experts Powers the Most Intelligent Frontier Models
https://blogs.nvidia.com/blog/mixture-of-experts-frontier-models/

[4] NVFP4 W4A4 MoE Inference on NVIDIA Blackwell GB10
https://medium.com/@cogitatus/nvfp4-w4a4-moe-inference-on-nvidia-blackwell-gb10-1a83e85d0f9e

[5] NVIDIA – Mixture-of-Experts (MoE) models like DeepSeek-R1…
https://www.facebook.com/NVIDIA/posts/mixture-of-experts-moe-models-like-deepseek-r1-unlock-new-levels-of-capabilitybu/1284062583760497/
