Nvidia Vera Rubin platform: rack-scale design, NVL72 specs, and cost claims

[Image: NVL72 rack view of the Nvidia Vera Rubin platform showing MGX trays, Rubin GPUs, and pooled memory components]

By Agustin Giovagnoli / January 5, 2026

Nvidia is shifting AI infrastructure from single accelerators to rack-scale systems with the Nvidia Vera Rubin platform, a tightly integrated architecture designed to drive higher throughput and lower cost per token for training and inference at data center scale [1][2].

TL;DR: What Nvidia said about Vera Rubin and why it matters

Rubin extends Nvidia’s data center stack with a full-rack approach that co-designs compute, memory, and networking to improve throughput and efficiency. Headline claims include multi-exaFLOP NVFP4 performance in NVL72, modular MGX trays for serviceability, and cost-per-token reductions—especially for MoE workloads—positioning Rubin as an end-to-end platform for modern LLM training and inference [1][2].

What is the Vera Rubin platform? Rack-scale architecture explained

Rubin is a full-stack, rack-scale AI supercomputer architecture that tightly couples CPUs, GPUs, DPUs, and networking under a third-generation MGX design. The platform emphasizes modular, hot-swappable trays for compute and NVLink switching, enabling field service without draining racks or halting jobs—an operational shift from previous GPU-centric generations [1][2].

The six co-designed chips that make Rubin: an at-a-glance breakdown

Rubin revolves around six parts engineered to work as one platform [1]:

  • Vera CPU: Acts as a data engine tightly coupled to GPUs for memory sharing, scheduling, and synchronization.
  • Rubin GPU: Core accelerator with deep support for NVFP4 to boost arithmetic density and efficiency.
  • Rubin CPX GPU: A variant tuned for prefill-heavy, long-context workloads and FLOPs-per-dollar efficiency [3].
  • NVLink 6 switch: High-bandwidth fabric for intra-rack GPU/CPU connectivity.
  • ConnectX-9 SuperNIC and BlueField-4 DPU: Networking and offload to streamline data movement and system orchestration.
  • Spectrum-6 Ethernet switch: Rack-scale Ethernet fabric integration.

This chip-level integration shows up in complete systems such as NVL72 and larger NVL144 configurations, where compute, memory, and fabric are treated as a unified resource pool rather than piecemeal components [1][2].

NVFP4: the numeric change behind the performance gains

A key architectural lever is NVFP4, which increases arithmetic density and power efficiency while aiming to preserve model accuracy. Crucially, Nvidia integrates NVFP4 across hardware and software so that numeric advances translate directly into training and inference throughput, not just theoretical TOPS [1]. For official technical context, see Nvidia’s platform overview on the NVIDIA Developer Blog [1].
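To make the arithmetic-density point concrete, here is a minimal NumPy sketch of block-scaled 4-bit quantization in the spirit of NVFP4. The E2M1 value grid is standard for 4-bit floats, but the block size, scale handling, and rounding below are simplifying assumptions rather than Nvidia's exact format definition.

```python
# Illustrative only: block-scaled 4-bit quantization in the spirit of NVFP4.
# Block size and scale handling are simplified assumptions, not Nvidia's spec.
import numpy as np

# Representable magnitudes of a 4-bit E2M1 float (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_fp4_block_scaled(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Quantize a 1-D float32 tensor to the FP4 grid with one scale per block."""
    out = np.empty_like(x)
    for start in range(0, x.size, block):
        chunk = x[start:start + block]
        peak = np.abs(chunk).max()
        scale = peak / FP4_GRID[-1] if peak > 0 else 1.0  # map block max to 6.0
        # Snap each scaled magnitude to the nearest representable FP4 value.
        idx = np.abs(np.abs(chunk / scale)[:, None] - FP4_GRID).argmin(axis=1)
        out[start:start + block] = np.sign(chunk) * FP4_GRID[idx] * scale
    return out

x = np.random.randn(64).astype(np.float32)
xq = quantize_fp4_block_scaled(x)
print("mean absolute quantization error:", float(np.abs(x - xq).mean()))
```

Because each value occupies 4 bits plus a shared per-block scale, the same memory footprint and datapath width carry roughly four times as many operands as FP16, which is where the density and efficiency gains come from; the question a POC should answer is whether accuracy holds on your tasks.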

Real numbers: NVL72 and NVL144 performance and memory figures

Nvidia’s NVL72 rack aggregates the platform’s components into a turnkey AI supercomputer with:

  • About 3.6 exaFLOPS of NVFP4 inference and 2.5 exaFLOPS of NVFP4 training performance
  • 54 TB of LPDDR5X on Vera CPUs
  • 20.7 TB of HBM4 on Rubin GPUs
  • 1.6 PB/s of bandwidth

These metrics set a concrete baseline for rack-level planning and comparison with alternatives [1][2]. NVL144 configurations scale into multi-exaFLOP FP4 territory with very large pooled memory, underscoring the shift to rack-level units as the planning and procurement unit for enterprise AI [1][2].
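To turn those rack-level totals into planning numbers, here is a small back-of-the-envelope sketch. The 72-GPU-per-rack figure is an assumption implied by the NVL72 name, the decimal-TB interpretation is ours, and the 10-trillion-parameter model is purely hypothetical.

```python
# Back-of-the-envelope sizing from the rack-level figures quoted above.
# Assumes 72 Rubin GPU packages per NVL72 rack (implied by the name) and
# decimal TB; treat the outputs as rough planning numbers only.
HBM4_PER_RACK_TB = 20.7
GPUS_PER_RACK = 72

hbm_per_gpu_gb = HBM4_PER_RACK_TB * 1000 / GPUS_PER_RACK
print(f"HBM4 per GPU: ~{hbm_per_gpu_gb:.0f} GB")

# Racks needed just to hold the weights of a hypothetical 10T-parameter model
# stored at 4 bits (0.5 bytes) per parameter, ignoring KV cache and activations.
params = 10e12
weight_bytes = params * 0.5
racks_for_weights = weight_bytes / (HBM4_PER_RACK_TB * 1e12)
print(f"Racks needed for weights alone: {racks_for_weights:.2f}")
```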

Cost and efficiency claims: fewer GPUs for MoE, lower cost-per-token

For mixture-of-experts models, Nvidia claims Rubin can train with roughly one-quarter the GPUs required by Blackwell and reduce inference cost per token by up to 10x for many models. The company also cites up to 5x greater inference performance relative to Blackwell in certain scenarios [2]. Buyers should validate these deltas in POCs using representative batch sizes, sequence lengths, and expert routing, and verify system-level throughput, cost per token, and power draw under real workloads [2].
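For budgeting purposes, the headline factor is easy to translate into dollars once you have your own baseline. The sketch below uses invented baseline figures; only the up-to-10x factor comes from Nvidia's claim [2], and the "best case" line is an upper bound to test rather than assume.

```python
# Hypothetical illustration of how a claimed "up to 10x lower cost per token"
# would flow into a serving budget. Baseline cost and token volume are invented.
baseline_cost_per_million_tokens = 2.00   # USD, hypothetical Blackwell baseline
monthly_tokens = 500e9                    # 500B tokens per month, hypothetical

baseline_monthly = monthly_tokens / 1e6 * baseline_cost_per_million_tokens
best_case_rubin_monthly = baseline_monthly / 10  # claimed upper bound [2]
print(f"Baseline: ${baseline_monthly:,.0f}/month")
print(f"Rubin best case: ${best_case_rubin_monthly:,.0f}/month (validate in a POC)")
```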

Rubin CPX and workload targeting: who benefits most

Rubin CPX targets prefill-heavy and long-context workloads—classes that often dominate end-to-end latency and cost as contexts grow. For organizations serving large-context LLMs or retrieval-augmented applications with substantial prefill phases, CPX’s focus on FLOPs-per-dollar and attention efficiency could deliver outsized benefits within the Rubin ecosystem [3].
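A quick arithmetic sketch shows why prefill dominates at long context. It uses the common approximation of roughly 2 FLOPs per parameter per token and ignores the attention term; the model size and token counts are illustrative assumptions, not measurements.

```python
# Rough split of compute between prefill and decode for a long-context request,
# using the ~2 * parameters FLOPs-per-token approximation. Illustrative only.
params = 70e9            # hypothetical 70B-parameter dense model
prompt_tokens = 100_000  # long-context prompt
output_tokens = 1_000    # generated tokens

prefill_flops = 2 * params * prompt_tokens
decode_flops = 2 * params * output_tokens
prefill_share = prefill_flops / (prefill_flops + decode_flops)
print(f"Prefill share of compute: {prefill_share:.1%}")  # ~99% for this mix
```

Under a mix like this, nearly all of the compute sits in the prefill phase, which is the regime CPX is pitched at.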

Operational benefits: modular MGX design and field serviceability

Rubin’s third-generation MGX design features modular, hot-swappable trays for compute, NVLink switching, power, and cooling. The goal is to enable field service without draining the rack or interrupting active jobs, shrinking maintenance windows and reducing operational risk—practical advantages that can factor meaningfully into TCO [1][2].

Why the Nvidia Vera Rubin platform matters now

Beyond raw speed, the platform’s CPU-GPU “superchip” coupling turns the CPU into a first-class data engine for memory sharing and synchronization, while NVLink 6, Spectrum-6 Ethernet, ConnectX-9, and BlueField-4 knit the rack into a unified compute fabric. This rack-scale cohesion is central to scaling large models and multi-tenant serving without drowning in orchestration overhead [1].

Enterprise implications: procurement, TCO, and migration considerations

Rubin reframes procurement around rack-level units (e.g., NVL72, NVL144) rather than standalone GPU counts. As enterprises plan migrations from Blackwell-era stacks, key questions include:

  • How do NVFP4 gains translate to your models’ accuracy, throughput, and cost per token? [1]
  • What are the realized performance and cost deltas for MoE training and inference versus Blackwell under your data distributions? [2]
  • Which workloads map best to Rubin CPX for prefill-heavy and long-context needs? [3]
  • How does the MGX servicing model affect uptime, staffing, and spares strategy? [1][2]

How Rubin compares to Blackwell (and what to validate in a POC)

Nvidia positions Rubin as requiring roughly one-quarter the GPUs for MoE training versus Blackwell, with up to 10x lower inference cost per token for many models and up to 5x greater inference performance in some cases [2]. In a POC, validate the following (a minimal measurement sketch follows the list):

  • End-to-end tokens/sec, latency budgets (including prefill), and cost per token
  • NVFP4 accuracy parity on your target tasks
  • Memory pooling behavior across Vera CPU LPDDR5X and Rubin GPU HBM4
  • Stability of NVLink 6 fabric and DPU offloads under peak loads [1][2]
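A minimal measurement harness for the first two bullets might look like the sketch below. It assumes a hypothetical OpenAI-compatible HTTP endpoint and an invented all-in rack cost; adapt the request shape, concurrency, and pricing to your stack, and note that the cost-per-token line only holds if the measured load actually saturates the rack.

```python
# Minimal POC measurement sketch: tokens/sec, latency, and cost per token.
# The endpoint and rack cost below are hypothetical placeholders; real POCs
# should drive concurrent, representative load rather than this serial loop.
import time
import statistics
import requests

ENDPOINT = "http://rubin-poc.internal:8000/v1/completions"  # hypothetical
RACK_COST_PER_HOUR = 300.0  # hypothetical all-in $/hour for the rack under test

def run_trial(prompt: str, max_tokens: int) -> tuple[float, int]:
    """Send one request and return (wall-clock seconds, tokens generated)."""
    t0 = time.perf_counter()
    resp = requests.post(ENDPOINT, json={"prompt": prompt, "max_tokens": max_tokens})
    resp.raise_for_status()
    tokens = resp.json()["usage"]["completion_tokens"]  # adapt to your API schema
    return time.perf_counter() - t0, tokens

latencies, token_counts = [], []
for _ in range(20):  # use representative prompts and sequence lengths
    seconds, n_tokens = run_trial("representative prompt ...", max_tokens=512)
    latencies.append(seconds)
    token_counts.append(n_tokens)

tokens_per_sec = sum(token_counts) / sum(latencies)
# Cost per token is only meaningful if this throughput saturates the rack.
cost_per_million = RACK_COST_PER_HOUR / 3600 / tokens_per_sec * 1e6
print(f"p50 latency: {statistics.median(latencies):.2f} s")
print(f"throughput: {tokens_per_sec:,.0f} tok/s, ~${cost_per_million:.2f} per 1M tokens")
```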

Sources

[1] Inside the NVIDIA Rubin Platform: Six New Chips, One AI …
https://developer.nvidia.com/blog/inside-the-nvidia-rubin-platform-six-new-chips-one-ai-supercomputer/

[2] Nvidia launches Vera Rubin NVL72 AI supercomputer at CES
https://www.tomshardware.com/pc-components/gpus/nvidia-launches-vera-rubin-nvl72-ai-supercomputer-at-ces-promises-up-to-5x-greater-inference-performance-and-10x-lower-cost-per-token-than-blackwell-coming-2h-2026

[3] The Architecture of Dominance: NVIDIA’s Rubin CPX and the $254 …
https://shanakaanslemperera.substack.com/p/the-architecture-of-dominance-nvidias
