Nemotron H and Nemotron Nano v2#

Nemotron H and Nemotron Nano v2 are families of hybrid SSM-Attention models from NVIDIA that combine Mamba (State Space Model) layers with traditional attention layers. These models achieve strong performance while maintaining computational efficiency through their hybrid architecture.

The Nemotron H family includes models from 4B to 56B parameters with 8K context length, while Nemotron Nano v2 models (9B and 12B) are optimized for edge deployment with extended 128K context support.

Model Families#

Nemotron H#

  • 4B: 52 layers, 3072 hidden size, 8K context

  • 8B: 52 layers, 4096 hidden size, 8K context

  • 47B: 98 layers, 8192 hidden size, 8K context

  • 56B: 118 layers, 8192 hidden size, 8K context

Nemotron Nano v2#

  • 9B: 56 layers, 4480 hidden size, 128K context

  • 12B: 62 layers, 5120 hidden size, 128K context

All models are supported via the Bridge system with specialized configurations for hybrid SSM-Attention architecture.
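
For scripting, the model lists above can be kept in a small lookup table. The sketch below maps each model to the Hugging Face repository ID used in the conversion examples later on this page, plus its context length (8K and 128K taken as 8,192 and 131,072 tokens); it is illustrative glue code, not part of the Bridge API.

# Illustrative lookup built from the model lists above; the repository IDs are the
# ones used in the conversion examples below. Not part of the megatron.bridge API.
NEMOTRON_MODELS = {
    "nemotron-h-4b": ("nvidia/Nemotron-H-4B-Base-8K", 8_192),
    "nemotron-h-8b": ("nvidia/Nemotron-H-8B-Base-8K", 8_192),
    "nemotron-h-47b": ("nvidia/Nemotron-H-47B-Base-8K", 8_192),
    "nemotron-h-56b": ("nvidia/Nemotron-H-56B-Base-8K", 8_192),
    "nemotron-nano-9b-v2": ("nvidia/NVIDIA-Nemotron-Nano-9B-v2-Base", 131_072),
    "nemotron-nano-12b-v2": ("nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base", 131_072),
}

repo_id, max_context = NEMOTRON_MODELS["nemotron-nano-9b-v2"]
print(f"{repo_id} supports up to {max_context:,} tokens")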

Model Architecture#

Common Features Across All Models#

  • Architecture: Hybrid SSM-Attention (Mamba + grouped-query attention)

  • SSM: Mamba-2 selective state space layers

  • Attention: Grouped-query attention with QK LayerNorm and RoPE (see the head-sharing sketch after this list)

  • Activation: Squared ReLU in the FFN layers

  • Normalization: RMSNorm

  • Position Embedding: RoPE (Rotary Position Embeddings)

  • Hybrid Pattern: Configurable layer-wise mixing of Mamba (“M”), attention (“*”), and MLP (“-”) layers
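
To make the attention bullet concrete, the sketch below works out how query heads share key-value heads under grouped-query attention, using the Nemotron H 8B numbers from the specifications that follow (32 query heads, 8 key-value groups, 128 channels per head). It is plain arithmetic, not Bridge API.

# Grouped-query attention head sharing with the Nemotron H 8B numbers below.
num_query_heads = 32
num_kv_groups = 8
kv_channels = 128

heads_per_group = num_query_heads // num_kv_groups  # 4 query heads share each KV head
q_proj_width = num_query_heads * kv_channels        # query projection width: 4096
kv_proj_width = 2 * num_kv_groups * kv_channels     # combined K and V width: 2048

print(f"{heads_per_group} query heads per KV group; "
      f"Q width {q_proj_width} vs combined KV width {kv_proj_width}")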

Nemotron H 4B Specifications#

  • Parameters: 4B

  • Layers: 52 (Hybrid pattern: M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M-)

  • Hidden size: 3072

  • FFN hidden size: 12288

  • Attention heads: 32 query heads, 8 key-value groups

  • KV channels: 128

  • Mamba heads: 112

  • Mamba head dim: 64

  • Mamba state dim: 128

  • Context Length: 8K tokens

Nemotron H 8B Specifications#

  • Parameters: 8B

  • Layers: 52 (Hybrid pattern: M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M-)

  • Hidden size: 4096

  • FFN hidden size: 21504

  • Attention heads: 32 query heads, 8 key-value groups

  • KV channels: 128

  • Mamba heads: 128

  • Mamba head dim: 64

  • Mamba state dim: 128

  • Context Length: 8K tokens

Nemotron H 47B Specifications#

  • Parameters: 47B

  • Layers: 98

  • Hidden size: 8192

  • FFN hidden size: 30720

  • Attention heads: 64 query heads, 8 key-value groups

  • KV channels: 128

  • Mamba heads: 256

  • Mamba head dim: 64

  • Mamba state dim: 256

  • Context Length: 8K tokens

Nemotron H 56B Specifications#

  • Parameters: 56B

  • Layers: 118

  • Hidden size: 8192

  • FFN hidden size: 32768

  • Attention heads: 64 query heads, 8 key-value groups

  • KV channels: 128

  • Mamba heads: 256

  • Mamba head dim: 64

  • Mamba state dim: 256

  • Context Length: 8K tokens

Nemotron Nano 9B v2 Specifications#

  • Parameters: 9B

  • Layers: 56 (Hybrid pattern: M-M-M-MM-M-M-M*-M-M-M*-M-M-M-M*-M-M-M-M*-M-MM-M-M-M-M-M-)

  • Hidden size: 4480

  • FFN hidden size: 15680

  • Attention heads: 40 query heads, 8 key-value groups

  • KV channels: 128

  • Mamba heads: 128

  • Mamba head dim: 80

  • Mamba state dim: 128

  • Context Length: 128K tokens

  • Vocab size: 131,072

Nemotron Nano 12B v2 Specifications#

  • Parameters: 12B

  • Layers: 62 (Hybrid pattern: M-M-M-M*-M-M-M-M*-M-M-M-M*-M-M-M-M*-M-M-M-M*-M-M-M-M*-M-M-M-M-)

  • Hidden size: 5120

  • FFN hidden size: 20480

  • Attention heads: 40 query heads, 8 key-value groups

  • KV channels: 128

  • Mamba heads: 128

  • Mamba head dim: 80

  • Mamba state dim: 128

  • Context Length: 128K tokens

  • Vocab size: 131,072

Key Features#

Hybrid SSM-Attention Architecture#

  • Mamba Layers (M): State space model layers for efficient long-range modeling

  • Attention Layers (*): Grouped-query attention for complex reasoning

  • Configurable Pattern: Each model has a predefined hybrid pattern balancing efficiency and performance

Advanced Optimizations#

  • Squared ReLU Activation: The FFN applies ReLU and then squares the result (see the sketch after this list)

  • QK LayerNorm: Applies LayerNorm to query and key projections for training stability

  • RoPE: Rotary Position Embeddings with base 10000

  • Grouped-Query Attention: Efficient attention in which groups of query heads share key-value heads

  • Selective State Space: Mamba-2 architecture with selective gating
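
The squared-ReLU FFN is easy to write down. The PyTorch snippet below is a minimal conceptual sketch (bias-free linear layers assumed, no tensor or sequence parallelism), not the Megatron implementation, using the Nemotron H 8B sizes listed above.

import torch
import torch.nn as nn

class SquaredReLUMLP(nn.Module):
    """Minimal sketch of an FFN with squared-ReLU activation: y = W2(relu(W1 x)^2)."""

    def __init__(self, hidden_size: int, ffn_hidden_size: int):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, ffn_hidden_size, bias=False)
        self.fc2 = nn.Linear(ffn_hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)).square())

# Nemotron H 8B sizes: hidden 4096, FFN hidden 21504.
mlp = SquaredReLUMLP(hidden_size=4096, ffn_hidden_size=21504)
y = mlp(torch.randn(2, 8, 4096))  # (batch, sequence, hidden)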

Extended Context (Nano v2)#

  • 128K Context Window: Nemotron Nano v2 models support sequences of up to 128K tokens (a rough memory estimate follows this list)

  • Efficient Long-Range Modeling: Hybrid architecture optimized for long sequences
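
A back-of-envelope estimate shows why the hybrid layout matters at 128K context. Assuming BF16 (2 bytes per value), a per-layer KV cache of 2 × kv_groups × kv_channels values per token, and the usual Mamba-2 state shape of heads × head_dim × state_dim, the sketch below compares per-layer, per-sequence memory for an attention layer versus a Mamba layer of Nemotron Nano 9B v2. These formulas are simplifications for illustration, not measurements.

# Rough per-layer, per-sequence memory at 128K context for Nemotron Nano 9B v2,
# assuming BF16 (2 bytes per value). Simplified estimate, not a measurement.
seq_len = 131_072      # 128K tokens
bytes_per_value = 2    # BF16

# Attention layer: the KV cache grows linearly with sequence length.
kv_groups, kv_channels = 8, 128
kv_cache_bytes = 2 * kv_groups * kv_channels * seq_len * bytes_per_value  # K and V

# Mamba layer: the recurrent state has a fixed size, independent of sequence length
# (assuming the Mamba-2 state shape heads x head_dim x state_dim).
mamba_heads, mamba_head_dim, mamba_state_dim = 128, 80, 128
mamba_state_bytes = mamba_heads * mamba_head_dim * mamba_state_dim * bytes_per_value

print(f"Attention layer KV cache: {kv_cache_bytes / 2**30:.2f} GiB per sequence")
print(f"Mamba layer state:        {mamba_state_bytes / 2**20:.2f} MiB per sequence")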

Conversion with 🤗 Hugging Face#

Load HF → Megatron#

Nemotron H Models#

from megatron.bridge import AutoBridge

# Example: Nemotron H 8B
bridge = AutoBridge.from_hf_pretrained("nvidia/Nemotron-H-8B-Base-8K", trust_remote_code=True)
provider = bridge.to_megatron_provider()

# Configure parallelism before instantiating the model
provider.tensor_model_parallel_size = 2
provider.pipeline_model_parallel_size = 1
provider.context_parallel_size = 1
provider.sequence_parallel = True

provider.finalize()
model = provider.provide_distributed_model(wrap_with_ddp=False)

# Other models:
# bridge = AutoBridge.from_hf_pretrained("nvidia/Nemotron-H-4B-Base-8K", trust_remote_code=True)
# bridge = AutoBridge.from_hf_pretrained("nvidia/Nemotron-H-47B-Base-8K", trust_remote_code=True)
# bridge = AutoBridge.from_hf_pretrained("nvidia/Nemotron-H-56B-Base-8K", trust_remote_code=True)

Nemotron Nano v2 Models#

from megatron.bridge import AutoBridge

# Example: Nemotron Nano 9B v2
bridge = AutoBridge.from_hf_pretrained("nvidia/NVIDIA-Nemotron-Nano-9B-v2-Base", trust_remote_code=True)
provider = bridge.to_megatron_provider()

# Configure parallelism
provider.tensor_model_parallel_size = 2
provider.pipeline_model_parallel_size = 1
provider.context_parallel_size = 1
provider.sequence_parallel = True

provider.finalize()
model = provider.provide_distributed_model(wrap_with_ddp=False)

# For instruct variant:
# bridge = AutoBridge.from_hf_pretrained("nvidia/NVIDIA-Nemotron-Nano-9B-v2", trust_remote_code=True)

# For 12B model:
# bridge = AutoBridge.from_hf_pretrained("nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base", trust_remote_code=True)

Export Megatron → HF#

# Convert from a Megatron checkpoint directory to HF format
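# `bridge` is the AutoBridge instance created in the loading examples above.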
bridge.export_ckpt(
    megatron_path="/results/nemotronh_8b/checkpoints/iter_0500000",
    hf_path="./nemotronh-8b-hf-export",
)

Examples#

Finetuning Recipes#

Nemotron H 4B Finetuning#

LoRA Finetuning#

from megatron.bridge.recipes.nemotronh import nemotronh_4b_finetune_config

cfg = nemotronh_4b_finetune_config(
    tokenizer_path="nvidia/Nemotron-H-4B-Base-8K",
    name="nemotronh_4b_lora",
    pretrained_checkpoint="path/to/nemotronh/4b/checkpoint",
    peft="lora",  # or "dora" for DoRA
    train_iters=1000,
    global_batch_size=128,
    finetune_lr=1e-4,
)

Full Supervised Finetuning (SFT)#

cfg = nemotronh_4b_finetune_config(
    tokenizer_path="nvidia/Nemotron-H-4B-Base-8K",
    name="nemotronh_4b_sft",
    pretrained_checkpoint="path/to/nemotronh/4b/checkpoint",
    peft=None,  # Full supervised finetuning
    train_iters=1000,
    global_batch_size=128,
    finetune_lr=5e-6,  # Lower LR for full SFT
)

Nemotron H 8B Finetuning#

from megatron.bridge.recipes.nemotronh import nemotronh_8b_finetune_config

# LoRA finetuning
cfg = nemotronh_8b_finetune_config(
    tokenizer_path="nvidia/Nemotron-H-8B-Base-8K",
    name="nemotronh_8b_lora",
    pretrained_checkpoint="path/to/nemotronh/8b/checkpoint",
    peft="lora",
    train_iters=1000,
    global_batch_size=128,
    finetune_lr=1e-4,
)

Nemotron H 47B Finetuning#

from megatron.bridge.recipes.nemotronh import nemotronh_47b_finetune_config

# LoRA finetuning (recommended for 47B)
cfg = nemotronh_47b_finetune_config(
    tokenizer_path="nvidia/Nemotron-H-47B-Base-8K",
    name="nemotronh_47b_lora",
    pretrained_checkpoint="path/to/nemotronh/47b/checkpoint",
    peft="lora",
    train_iters=1000,
    global_batch_size=128,
    finetune_lr=1e-4,
) 

Nemotron H 56B Finetuning#

from megatron.bridge.recipes.nemotronh import nemotronh_56b_finetune_config

# LoRA finetuning (recommended for 56B)
cfg = nemotronh_56b_finetune_config(
    tokenizer_path="nvidia/Nemotron-H-56B-Base-8K",
    name="nemotronh_56b_lora",
    pretrained_checkpoint="path/to/nemotronh/56b/checkpoint",
    peft="lora",
    train_iters=1000,
    global_batch_size=128,
    finetune_lr=1e-4,
)

Nemotron Nano 9B v2 Finetuning#

from megatron.bridge.recipes.nemotronh import nemotron_nano_9b_v2_finetune_config

# LoRA finetuning
cfg = nemotron_nano_9b_v2_finetune_config(
    tokenizer_path="nvidia/NVIDIA-Nemotron-Nano-9B-v2-Base",
    name="nano_9b_v2_lora",
    pretrained_checkpoint="path/to/nano/9b/v2/checkpoint",
    peft="lora",
    train_iters=1000,
    global_batch_size=128,
    seq_length=2048,  # Can use up to 128K
    finetune_lr=1e-4,
)

Nemotron Nano 12B v2 Finetuning#

from megatron.bridge.recipes.nemotronh import nemotron_nano_12b_v2_finetune_config

# LoRA finetuning
cfg = nemotron_nano_12b_v2_finetune_config(
    tokenizer_path="nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base",
    name="nano_12b_v2_lora",
    pretrained_checkpoint="path/to/nano/12b/v2/checkpoint",
    peft="lora",
    train_iters=1000,
    global_batch_size=128,
    seq_length=2048,  # Can use up to 128K
    finetune_lr=1e-4,
)

Default Configurations#

Nemotron H Models#

4B - LoRA (1 node, 8 GPUs)#

  • TP=1, PP=1, CP=1, LR=1e-4

  • Sequence Parallel: False

  • Precision: BF16 mixed

  • Each model replica fits on a single GPU (TP=1); the 8 GPUs provide data parallelism

4B - Full SFT (1 node, 8 GPUs)#

  • TP=1, PP=1, CP=1, LR=5e-6

  • Sequence Parallel: False

  • Precision: BF16 mixed

8B - LoRA (1 node, 8 GPUs)#

  • TP=1, PP=1, CP=1, LR=1e-4

  • Sequence Parallel: False

  • Precision: BF16 mixed

8B - Full SFT (1 node, 8 GPUs)#

  • TP=2, PP=1, CP=1, LR=5e-6

  • Sequence Parallel: True

  • Precision: BF16 mixed

47B - LoRA (2+ nodes)#

  • TP=4, PP=1, CP=1, LR=1e-4

  • Sequence Parallel: False

  • Precision: FP8 hybrid (recommended)

47B - Full SFT (4+ nodes)#

  • TP=8, PP=1, CP=1, LR=5e-6

  • Sequence Parallel: True

  • Precision: FP8 hybrid

56B - LoRA (2+ nodes)#

  • TP=4, PP=1, CP=1, LR=1e-4

  • Sequence Parallel: False

  • Precision: FP8 hybrid (recommended)

56B - Full SFT (4+ nodes)#

  • TP=8, PP=1, CP=1, LR=5e-6

  • Sequence Parallel: True

  • Precision: FP8 hybrid

Nemotron Nano v2 Models#

9B - LoRA (1 node, 8 GPUs)#

  • TP=2, PP=1, CP=1, LR=1e-4

  • Sequence Parallel: True

  • Precision: BF16 mixed

  • Context: Up to 128K tokens

9B - Full SFT (1 node, 8 GPUs)#

  • TP=2, PP=1, CP=1, LR=1e-4

  • Sequence Parallel: True

  • Precision: BF16 mixed

12B - LoRA (2 nodes, 16 GPUs)#

  • TP=4, PP=1, CP=1, LR=1e-4

  • Sequence Parallel: True

  • Precision: FP8 hybrid (recommended)

  • Context: Up to 128K tokens

12B - Full SFT (2 nodes, 16 GPUs)#

  • TP=4, PP=1, CP=1, LR=1e-4

  • Sequence Parallel: True

  • Precision: FP8 hybrid
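
The defaults above map directly onto the provider attributes used in the conversion examples earlier on this page. As a sketch, applying the Nemotron Nano 12B v2 LoRA parallelism defaults (TP=4, PP=1, CP=1, sequence parallel on) looks like the following; precision settings are configured separately and are not shown here.

# Apply the Nemotron Nano 12B v2 LoRA parallelism defaults to a provider obtained
# via AutoBridge, as in the conversion examples above.
provider.tensor_model_parallel_size = 4
provider.pipeline_model_parallel_size = 1
provider.context_parallel_size = 1
provider.sequence_parallel = True
provider.finalize()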

Performance Optimizations#

Memory Efficiency#

  • Selective Recomputation: Recomputes selected activations during the backward pass to reduce activation memory for larger models

  • Sequence Parallelism: Distributes the sequence dimension across tensor-parallel GPUs; enabled in several of the default configurations above (see the sketch after this list)

  • Context Parallelism: Support for ultra-long sequences (Nano v2)

  • Manual GC: Python garbage collection triggered manually at set intervals for stable memory usage

  • Precision-aware optimizer: BF16/FP8 gradients with FP32 master weights
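
As a hedged sketch, the memory options above are toggled on the provider before it is finalized. The sequence_parallel attribute appears in the loading examples earlier on this page; the recompute setting below follows Megatron-LM's TransformerConfig naming and is an assumption for this API.

# Memory-saving settings on a provider obtained via AutoBridge (sketch).
provider.sequence_parallel = True
provider.recompute_granularity = "selective"  # assumed Megatron-LM-style field name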

Compute Efficiency#

  • Mamba-2 Optimizations: Efficient selective state space computations

  • Hybrid Architecture: Balanced mix of Mamba and Attention layers

  • Squared ReLU: Efficient activation function with good gradient properties

  • RoPE Fusion: Optional optimization for position embeddings

  • Grouped-Query Attention: Reduced KV-cache memory and compute

Hybrid Pattern Optimization#

The hybrid override pattern determines which layers use Mamba (M), attention (*), or MLP (-) blocks; a short counting sketch follows this list:

  • Mamba layers: Fast, memory-efficient, good for long-range dependencies

  • Attention layers: Better for complex reasoning and multi-token relationships

  • Optimal patterns: Pre-configured per model size based on extensive experimentation
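
A few lines of Python make the pattern composition concrete. The sketch below counts layer types in the 52-layer Nemotron H 8B pattern listed above, assuming the usual Megatron convention that "-" marks an MLP layer; it is illustrative only.

from collections import Counter

# Hybrid override pattern for Nemotron H 8B (52 layers), copied from the
# specifications above: 'M' = Mamba, '*' = attention, '-' = MLP (assumed convention).
pattern = "M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M-"
assert len(pattern) == 52

counts = Counter(pattern)
print(f"Mamba: {counts['M']}, attention: {counts['*']}, MLP: {counts['-']}")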

Pipeline Parallelism Layouts#

Nemotron H models support several PP configurations with pre-defined layouts:

  • PP=1: No pipelining (default for most configurations)

  • PP=2: Supported with symmetric layer splits

  • PP=4: Supported for larger models (47B, 56B)

  • VP (Virtual Pipeline): Supported for reducing pipeline bubbles (see the sketch after this list)
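
Choosing one of these layouts is again a matter of provider attributes set before finalize(). The pipeline_model_parallel_size attribute is shown in the loading examples above; the virtual-pipeline attribute name below follows Megatron-LM conventions and is an assumption for this API.

# Example: PP=2 with a virtual pipeline, set on a provider obtained via AutoBridge.
provider.pipeline_model_parallel_size = 2
provider.virtual_pipeline_model_parallel_size = 2  # assumed Megatron-LM-style field name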

Hugging Face Model Cards#

Nemotron H Models#

  • nvidia/Nemotron-H-4B-Base-8K

  • nvidia/Nemotron-H-8B-Base-8K

  • nvidia/Nemotron-H-47B-Base-8K

  • nvidia/Nemotron-H-56B-Base-8K

Nemotron Nano v2 Models#

  • nvidia/NVIDIA-Nemotron-Nano-9B-v2-Base

  • nvidia/NVIDIA-Nemotron-Nano-9B-v2

  • nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base
