Llama Nemotron#
Llama Nemotron is NVIDIA’s family of large language models derived from Meta’s Llama architecture and post-trained for enhanced reasoning, alignment with human chat preferences, and agentic tasks such as retrieval-augmented generation (RAG) and tool calling. The models incorporate neural architecture search (NAS) optimizations that improve the accuracy/efficiency trade-off.
Llama Nemotron models are supported via the Bridge system with auto-detected configuration and weight mapping.
Available Models#
Megatron Bridge supports the following Llama Nemotron model variants:
Llama-3.3-Nemotron-Super-49B: 49B parameters (NAS-optimized from 70B)
Llama-3.1-Nemotron-Ultra-253B: 253B parameters (large-scale reasoning model)
Llama-3.1-Nemotron-70B: 70B parameters (standard size)
Llama-3.1-Nemotron-Nano-8B: 8B parameters (efficient variant)
Llama-3.1-Nemotron-Nano-4B: 4B parameters (ultra-compact variant)
All models are ready for commercial use and support context lengths up to 128K tokens.
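The advertised 128K context window can be confirmed from each model’s published Hugging Face configuration. The sketch below uses the transformers library on a homogeneous variant; the exact reported value is an assumption based on the upstream Llama 3.1 configuration:
from transformers import AutoConfig

# Query the published config of a homogeneous variant; heterogeneous
# (Super/Ultra) variants additionally need trust_remote_code=True.
cfg = AutoConfig.from_pretrained("nvidia/Llama-3.1-Nemotron-Nano-8B-v1")
print(cfg.max_position_embeddings)  # expected to report ~131072, i.e. 128K tokens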
Model Architecture Features#
Neural Architecture Search (NAS): A novel approach that reduces the memory footprint while maintaining accuracy
Heterogeneous Blocks: Non-standard, non-repetitive layer configurations for efficiency (see the sketch after this list)
Skip attention in some blocks
Variable FFN expansion/compression ratios between blocks
Multi-Phase Post-Training:
Supervised fine-tuning for Math, Code, Science, and Tool Calling
Reward-aware Preference Optimization (RPO) for chat
Reinforcement Learning with Verifiable Rewards (RLVR) for reasoning
Iterative Direct Preference Optimization (DPO) for tool calling
Extended Context: Native support for sequences up to 128K tokens
Commercial Ready: Fully licensed for commercial deployment
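To make the heterogeneous-block idea concrete, the following sketch shows how per-block settings might be described. The class and field names are purely illustrative; they are not the configuration schema used by Megatron Bridge or the DeciLM implementation:
from dataclasses import dataclass, field
from typing import List

@dataclass
class BlockSketch:
    """Illustrative per-block settings in a NAS-derived heterogeneous stack."""
    use_attention: bool = True   # some blocks skip attention entirely
    ffn_expansion: float = 4.0   # FFN expansion/compression ratio varies per block

@dataclass
class HeterogeneousStackSketch:
    blocks: List[BlockSketch] = field(default_factory=list)

# A toy 4-block stack: one attention-free block and one compressed FFN.
stack = HeterogeneousStackSketch(blocks=[
    BlockSketch(use_attention=True,  ffn_expansion=4.0),
    BlockSketch(use_attention=False, ffn_expansion=4.0),  # attention skipped
    BlockSketch(use_attention=True,  ffn_expansion=2.0),  # compressed FFN
    BlockSketch(use_attention=True,  ffn_expansion=4.0),
])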
Conversion with 🤗 Hugging Face#
Load HF → Megatron#
from megatron.bridge import AutoBridge
# Example: Llama-3.3-Nemotron-Super-49B
bridge = AutoBridge.from_hf_pretrained(
    "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
    trust_remote_code=True
)
provider = bridge.to_megatron_provider()
# Optionally configure parallelism before instantiating the model
provider.tensor_model_parallel_size = 2
provider.pipeline_model_parallel_size = 1
model = provider.provide_distributed_model(wrap_with_ddp=False)
Note: Heterogeneous Llama Nemotron models (Super/Ultra) require trust_remote_code=True because they use the custom DeciLMForCausalLM architecture. Homogeneous models (Nano/70B) use the standard Llama architecture and do not require this flag.
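A quick way to tell which case applies is to inspect the architecture name published in the model’s config. A minimal sketch with the transformers library (the class name shown in the comment is an expectation, not a guarantee for every variant):
from transformers import AutoConfig

# Heterogeneous variants report the custom DeciLMForCausalLM class and
# therefore need trust_remote_code=True to resolve it.
cfg = AutoConfig.from_pretrained(
    "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
    trust_remote_code=True,
)
print(cfg.architectures)  # e.g. ['DeciLMForCausalLM']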
Import Checkpoint from HF#
python examples/conversion/convert_checkpoints.py import \
--hf-model nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 \
--megatron-path /checkpoints/llama_nemotron_super_49b_megatron \
--trust-remote-code
Export Megatron → HF#
from megatron.bridge import AutoBridge
# Load the bridge from HF model ID
bridge = AutoBridge.from_hf_pretrained(
    "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
    trust_remote_code=True
)
# Export a trained Megatron checkpoint to HF format
bridge.export_ckpt(
    megatron_path="/results/llama_nemotron_super_49b/checkpoints/iter_0000500",
    hf_path="/exports/llama_nemotron_super_49b_hf",
)
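As a sanity check, the exported directory should load back with standard Hugging Face tooling. A minimal sketch that reads only the config and tokenizer (to avoid materializing the 49B weights), assuming the export writes tokenizer files alongside the weights:
from transformers import AutoConfig, AutoTokenizer

export_dir = "/exports/llama_nemotron_super_49b_hf"
cfg = AutoConfig.from_pretrained(export_dir, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(export_dir)
print(cfg.architectures, tokenizer.vocab_size)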
Run Inference on Converted Checkpoint#
python examples/conversion/hf_to_megatron_generate_text.py \
--hf_model_path nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 \
--megatron_model_path /checkpoints/llama_nemotron_super_49b_megatron \
--prompt "What is artificial intelligence?" \
--max_new_tokens 100 \
--tp 2 \
--trust-remote-code
For more details, see examples/conversion/hf_to_megatron_generate_text.py.
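With --tp 2 the script needs two GPUs and one process per GPU. If it follows the usual Megatron launch convention (an assumption, since the launcher is not shown above), the invocation would look like:
torchrun --nproc_per_node=2 examples/conversion/hf_to_megatron_generate_text.py \
    --hf_model_path nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 \
    --megatron_model_path /checkpoints/llama_nemotron_super_49b_megatron \
    --prompt "What is artificial intelligence?" \
    --max_new_tokens 100 \
    --tp 2 \
    --trust-remote-code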
Recipes#
Training recipes for Llama Nemotron models are not currently available.
Hugging Face Model Cards & References#
Hugging Face Model Cards#
Llama Nemotron Collection: https://huggingface.co/collections/nvidia/llama-nemotron
Llama-3.3-Nemotron-Super-49B-v1.5: https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5
Llama-3.1-Nemotron-Ultra-253B-v1: https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
Llama-3.1-Nemotron-Nano-8B-v1: https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1
Llama-3.1-Nemotron-Nano-4B-v1.1: https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1
Technical Papers#
Llama-Nemotron: Efficient Reasoning Models: arXiv:2505.00949
Puzzle: Distillation-Based NAS for Inference-Optimized LLMs: arXiv:2411.19146
Reward-aware Preference Optimization: arXiv:2502.00203
Additional Resources#
NVIDIA Build Platform: https://build.nvidia.com/
Llama Nemotron Post-Training Dataset: https://huggingface.co/nvidia/Llama-Nemotron-Post-Training-Dataset