Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning

Figure: Nemotron 3 Super hybrid MoE model architecture, showing Mamba-Transformer layers and sparse MoE experts for million-token reasoning.


By Agustin Giovagnoli / March 11, 2026

Nemotron 3 is NVIDIA’s latest open family of large language models optimized for agentic reasoning, built around a hybrid Mamba‑Transformer Mixture‑of‑Experts (MoE) architecture. For teams evaluating the Nemotron 3 Super hybrid MoE model, the promise is higher token throughput, lower inference cost, and million‑token context windows—paired with permissive licensing and broad deployment options [1][2][4][5].

What Nemotron 3 Super Means for Businesses

Nemotron 3 combines Mamba‑2 sequence modeling, sparse MoE experts, and a small number of Transformer self‑attention layers to deliver efficient long‑range reasoning. NVIDIA reports up to 4x higher token throughput than Nemotron 2 Nano and about 60% lower cost per reasoning token, while supporting contexts up to 1 million tokens—enabling multi‑document workflows and persistent agent memory [1][2]. All models are released under the permissive NVIDIA Open Model License and are published on GitHub, Hugging Face, and as NIM microservices, making integration straightforward for enterprise teams [2][4][5].

What is Nemotron 3? Family, Availability, and Licensing

Nemotron 3 is an open family centered on agentic AI use cases. The initial publicly available model is Nemotron 3 Nano, with Super and Ultra variants planned to scale capacity for more demanding reasoning workloads while retaining MoE efficiency [1][2][4][5]. The entire line is released under the NVIDIA Open Model License, enabling modification and commercial use without attribution, and is available via GitHub, Hugging Face, and as NVIDIA NIM microservices [2][4][5].

Hybrid Mamba‑Transformer + MoE Architecture Explained

Nemotron 3’s backbone interleaves Mamba‑2 sequence models with MoE blocks and a limited number of Transformer self‑attention layers. Mamba handles long‑range sequence modeling with linear‑time complexity; MoE routing activates only a subset of experts per token, improving throughput and lowering inference costs compared with dense Transformers [1][2][3]. This design targets accurate agentic reasoning while maintaining efficiency at large context lengths [1][2].
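The sparse-routing idea can be illustrated with a minimal top-k gating sketch. This is a generic MoE router in NumPy with toy sizes and random gate weights, not Nemotron 3's actual routing implementation: each token scores all experts, but only its top-k experts are activated and their weights renormalized, so inactive experts contribute no compute.

```python
import numpy as np

def moe_route(tokens: np.ndarray, gate_w: np.ndarray, top_k: int = 2):
    """Sparse MoE routing: each token activates only its top-k experts.

    tokens: (n_tokens, d_model); gate_w: (d_model, n_experts).
    Returns per-token expert indices and softmax-normalized weights.
    """
    logits = tokens @ gate_w                       # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of the top-k experts
    picked = np.take_along_axis(logits, top, axis=-1)
    # Softmax over the selected experts only; unselected experts are never run.
    w = np.exp(picked - picked.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return top, w

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8))    # 4 tokens, hidden size 8 (toy sizes)
gate_w = rng.standard_normal((8, 16))   # 16 experts, 2 active per token
experts, weights = moe_route(tokens, gate_w, top_k=2)
print(experts.shape, weights.shape)     # (4, 2) (4, 2)
```

With 2 of 16 experts active per token, only a fraction of the expert parameters participate in each forward pass, which is the source of the throughput and cost advantages over dense Transformers.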

Why the Nemotron 3 Super Hybrid MoE Model Matters

The Super variant targets higher‑end reasoning workloads at increased parameter counts while preserving sparse MoE efficiency benefits. For operators balancing token throughput and inference cost, the architecture is designed to keep per‑token activation sparse without sacrificing performance on complex, multi‑step tasks [1][2]. The same design principles support million‑token context window scenarios, such as multi‑document analysis and long‑running agent memory [1][2][3].

Performance: Million‑Token Contexts, Throughput, and Cost

NVIDIA reports Nemotron 3 can operate natively at context lengths up to 1 million tokens, suiting large-corpus review, multi-document workflows, and persistent context in agents. Compared with Nemotron 2 Nano, the new family delivers up to 4x higher token throughput and as much as a 60% reduction in cost per reasoning token, driven by sparse MoE activation and Mamba’s linear‑time sequence modeling [1][2]. For production teams, these gains translate into faster interactive sessions, lower GPU utilization per unit of work, and greater concurrency on long‑running tasks [1][2][3].
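The reported figures translate directly into back-of-envelope capacity planning. The baseline throughput, baseline price, and daily workload below are hypothetical placeholders; only the 4x and 60% multipliers come from NVIDIA's reported numbers:

```python
# Back-of-envelope planning from the reported figures: up to 4x token
# throughput vs Nemotron 2 Nano and up to 60% lower cost per reasoning token.
# Baseline values are hypothetical placeholders for illustration.
baseline_tps = 1_000             # tokens/sec per GPU on the prior model (assumed)
baseline_cost_per_mtok = 2.00    # USD per million reasoning tokens (assumed)
daily_tokens = 500_000_000       # workload: 500M reasoning tokens/day (assumed)

new_tps = baseline_tps * 4.0                              # up to 4x throughput
new_cost_per_mtok = baseline_cost_per_mtok * (1 - 0.60)   # up to 60% cheaper

gpu_hours_old = daily_tokens / baseline_tps / 3600
gpu_hours_new = daily_tokens / new_tps / 3600
print(f"throughput: {baseline_tps:,.0f} -> {new_tps:,.0f} tok/s")
print(f"cost: ${baseline_cost_per_mtok:.2f} -> ${new_cost_per_mtok:.2f} per 1M tokens")
print(f"GPU-hours/day: {gpu_hours_old:.1f} -> {gpu_hours_new:.1f}")
```

Swapping in measured numbers from your own benchmarks makes the same arithmetic a quick sanity check on projected savings.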

Training, Reinforcement Learning, and Alignment with NeMo

Nemotron 3 models are trained on roughly 3 trillion tokens with emphasis on reasoning, coding, and multi‑step workflows. Alignment uses multi‑environment reinforcement learning via the open‑source NeMo Gym and NeMo RL libraries, which expose the models to interactive tasks and enable domain‑specific post‑training for enterprise needs [1][2]. This approach is designed to produce more reliable agent behavior in complex operational contexts [1][2][3].
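The multi-environment idea can be shown with a deliberately tiny sketch: one policy parameter updated by REINFORCE while cycling through several "environments" (here, bandits with different reward probabilities). This is a conceptual illustration only, not the NeMo Gym or NeMo RL API:

```python
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
theta = 0.0              # single policy parameter standing in for a model
envs = [0.9, 0.8, 0.7]   # each environment rewards action 1 with this probability
lr = 0.5

for step in range(2000):
    q = envs[step % len(envs)]      # cycle through environments each step
    p = sigmoid(theta)              # probability of taking action 1
    action = 1 if random.random() < p else 0
    reward = 1.0 if (action == 1 and random.random() < q) else 0.0
    # REINFORCE: d log pi(action) / d theta = (action - p);
    # move theta toward actions that were rewarded.
    theta += lr * reward * (action - p)

print(f"P(action 1) after training: {sigmoid(theta):.2f}")  # close to 1
```

Real post-training replaces the scalar parameter with model weights and the bandits with interactive tasks, but the shape of the loop, acting in varied environments and reinforcing rewarded behavior, is the same.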

Safety and Evaluation: NeMo Evaluator & Agentic Safety Dataset

NVIDIA’s NeMo Evaluator framework supports systematic performance and safety assessment of agent behavior, including the Nemotron Agentic Safety Dataset to probe model responses under realistic conditions [1][2]. For enterprises prioritizing robust deployment, structured evaluation is critical to measuring progress and identifying gaps before scale‑out [1][2].

Deployment Options: Hugging Face, GitHub, and NIM Microservices

Nemotron 3 Nano is available now through GitHub, on Hugging Face, and as NVIDIA NIM microservices. This makes it straightforward to test locally, integrate via containerized endpoints, or plug into existing MLOps pipelines. The same channels are planned for the Super and Ultra variants as they roll out [2][4][5]. For official release updates, see NVIDIA’s press release [4].
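NIM microservices typically expose an OpenAI-compatible chat completions endpoint, so an integration test can start from a standard request payload. The endpoint URL and model id below are placeholders to be replaced with the values from your own NIM deployment:

```python
import json

# Sketch of a request to a NIM microservice's OpenAI-compatible endpoint.
# URL and model id are placeholders; substitute your deployment's values.
NIM_URL = "http://localhost:8000/v1/chat/completions"  # assumed local NIM

payload = {
    "model": "nvidia/nemotron-3-nano",  # hypothetical model id
    "messages": [
        {"role": "user", "content": "Summarize the attached incident report."}
    ],
    "max_tokens": 512,
    "temperature": 0.2,
}
print(json.dumps(payload, indent=2))
# With a running NIM endpoint, send it via e.g.:
#   requests.post(NIM_URL, json=payload, timeout=60)
```

Because the interface follows the OpenAI chat schema, existing client code and orchestration frameworks can usually be pointed at the NIM endpoint with only a base-URL change.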

Business Use Cases and Industry Integrations

Nemotron 3 targets agentic AI across sectors including manufacturing, cybersecurity, telecom, and software development—particularly where long‑context reasoning and multi‑document workflows are core to value delivery [1][2][3][5]. Teams can pair retrieval and orchestration layers with the million‑token context window to accelerate audits, incident investigations, knowledge synthesis, and codebase‑wide refactoring [1][2][3].

Considerations for Production: Cost, Safety, and Customization

  • Benchmark token throughput and inference cost across representative workloads.
  • Use NeMo Evaluator and the Nemotron Agentic Safety Dataset to validate behavior under realistic operational conditions.
  • Apply domain‑specific post‑training with NeMo Gym and NeMo RL for task alignment.
  • Confirm licensing terms under the NVIDIA Open Model License for commercial rollouts [1][2][4][5].
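The first checklist item, benchmarking token throughput on representative workloads, can be sketched as a small timing harness. The `fake_generate` stub stands in for a real model call (a local model or a NIM endpoint), and the harness itself is a generic pattern rather than an NVIDIA tool:

```python
import time

def benchmark_tps(generate, prompts, runs: int = 3) -> float:
    """Return the best observed tokens/sec for generate(prompt) -> list of tokens."""
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        total_tokens = sum(len(generate(p)) for p in prompts)
        elapsed = time.perf_counter() - start
        best = max(best, total_tokens / elapsed)
    return best

# Stub standing in for a real model call; replace with your inference client.
def fake_generate(prompt: str):
    return prompt.split() * 10  # pretend each prompt yields some output tokens

tps = benchmark_tps(fake_generate, ["audit the access logs", "summarize the report"])
print(f"{tps:,.0f} tokens/sec")
```

Running the same harness against both the incumbent model and Nemotron 3 on identical prompt sets gives the like-for-like throughput comparison the checklist calls for.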

For additional planning guidance, explore our AI tools and playbooks.

How to Get Started: Quick Steps for Trials and Proofs of Concept

  • Pull Nemotron 3 Nano from GitHub or Hugging Face and stand up a simple test service.
  • Spin up NIM microservices for a production‑like endpoint in staging.
  • Run NeMo Gym examples to evaluate multi‑step task performance and iterate on reward shaping.
  • Log throughput, accuracy, and safety outcomes to inform migration to Super as it becomes available [1][2][4][5].

Conclusion and Outlook: Super Today, Ultra Tomorrow

Nemotron 3 brings an open, efficiency‑first design for agentic reasoning, combining Mamba‑2, sparse MoE, and selective self‑attention to scale context and control cost. Nemotron 3 Nano is available now, with Super focused on higher‑end reasoning workloads and Ultra on the roadmap—positioning the family for evolving enterprise demands [1][2][4][5]. As deployments mature, the Nemotron 3 Super hybrid MoE model offers a clear path to long‑context, cost‑effective agents aligned through NeMo tooling [1][2].

Sources

[1] Inside NVIDIA Nemotron 3: Techniques, Tools, and Data That Make …
https://developer.nvidia.com/blog/inside-nvidia-nemotron-3-techniques-tools-and-data-that-make-it-efficient-and-accurate/

[2] NVIDIA Nemotron 3: Efficient and Open Intelligence
https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-White-Paper.pdf

[3] NVIDIA Nemotron 3: Hybrid MoE + Mamba‑Transformer for Agentic AI
https://tecknexus.com/nvidia-nemotron-3-hybrid-moe-mamba-transformer-for-agentic-ai/

[4] NVIDIA Debuts Nemotron 3 Family of Open Models
https://investor.nvidia.com/news/press-release-details/2025/NVIDIA-Debuts-Nemotron-3-Family-of-Open-Models/default.aspx

[5] NVIDIA Nemotron – Foundation Models for Agentic AI
https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/
