Mamba 2#
New in Mamba: Nemotron-H Models#
Introduction#
Nemotron-H is a family of hybrid Mamba-Transformer models designed to reduce inference cost while maintaining high accuracy. These models replace the majority of self-attention layers in the common Transformer architecture with Mamba layers that perform constant computation and require constant memory per generated token.
The Nemotron-H family includes models with 8B, 47B, and 56B parameters. According to the research, these models offer better or on-par accuracy compared to other similarly sized state-of-the-art open-source Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3× faster at inference. The 47B model, created using a new compression technique called MiniPuzzle, achieves accuracy similar to the 56B model while being 20% faster at inference.
A notable feature of Nemotron-H is its FP8-based training recipe, which achieves results on par with BF16-based training. This recipe was used to train the 56B model.
Note
To use Nemotron-H models, use the nemo:25.04.nemotron-h container. Support for Nemotron-H will be included in future official NeMo containers.
Nemotron-H Training Recipes#
We provide pre-defined recipes for both pre-training and fine-tuning Nemotron-H models in three sizes: 8B, 47B, and 56B. Here are examples for each:
Pre-Training Example:
from nemo.collections import llm

# For the 8B model
pretrain = llm.nemotronh_8b.pretrain_recipe(
    name="nemotronh_8b_pretraining",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
    tensor_parallelism=2,
    gbs=768,
    sequence_parallelism=True,
    vocab_file="/path/to/vocab.json",
)

# For the 47B model
# pretrain = llm.nemotronh_47b.pretrain_recipe(...)

# For the 56B model
# pretrain = llm.nemotronh_56b.pretrain_recipe(...)

# Replace the default MockDataModule with your own data module.
# gbs and mbs must match the global and micro batch sizes used by the recipe.
dataloader = a_function_that_configures_your_custom_dataset(
    gbs=gbs,
    mbs=mbs,
    seq_length=pretrain.model.config.seq_length,
)
pretrain.data = dataloader
Fine-Tuning Example:
from nemo.collections import llm

# For the 8B model
finetune = llm.nemotronh_8b.finetune_recipe(
    resume_path="/path/to/nemo/checkpoint",
    name="nemotronh_8b_finetuning",
    dir="/path/to/checkpoints",
    num_nodes=32,
    num_gpus_per_node=8,
    tensor_parallelism=2,
    gbs=768,
    sequence_parallelism=True,
    vocab_file="/path/to/vocab.json",
)

# For the 47B model
# finetune = llm.nemotronh_47b.finetune_recipe(...)

# For the 56B model
# finetune = llm.nemotronh_56b.finetune_recipe(...)

# Replace the default MockDataModule with your own data module.
# gbs and mbs must match the global and micro batch sizes used by the recipe.
dataloader = a_function_that_configures_your_custom_dataset(
    gbs=gbs,
    mbs=mbs,
    seq_length=finetune.model.config.seq_length,
)
finetune.data = dataloader
Note
For pre-training and fine-tuning, the recipes use the MockDataModule for the data argument. You are expected to replace the MockDataModule with your custom dataset.
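For illustration, the placeholder a_function_that_configures_your_custom_dataset used in the examples above could return a configured data module. The following is a minimal sketch, assuming a pre-tokenized Megatron-style dataset and the llm.PreTrainingDataModule class from NeMo 2.0; the dataset path is a placeholder:

import nemo_run as run
from nemo.collections import llm

def a_function_that_configures_your_custom_dataset(gbs, mbs, seq_length):
    # Configure a data module backed by pre-tokenized Megatron-style
    # .bin/.idx files. The path below is a placeholder.
    return run.Config(
        llm.PreTrainingDataModule,
        paths=["/path/to/my_dataset_text_document"],
        seq_length=seq_length,
        global_batch_size=gbs,
        micro_batch_size=mbs,
    )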
Note
The configuration in the recipes is done using the NeMo-Run run.Config and run.Partial configuration objects. Please review the NeMo-Run documentation to learn more about its configuration and execution system.
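Because the recipe is a plain configuration object, you can override individual fields before execution. The snippet below is a minimal sketch, assuming the standard NeMo 2.0 recipe layout; the exact attribute paths and values shown are illustrative and may differ between releases:

# Adjust selected recipe fields before launching (attribute paths are assumptions).
pretrain.trainer.max_steps = 1000           # total training iterations
pretrain.trainer.val_check_interval = 100   # run validation every 100 steps
pretrain.optim.config.lr = 1e-4             # peak learning rate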
Once you have your final configuration ready, you can execute it on any of the NeMo-Run supported executors. The simplest is the local executor, which runs the training locally in a separate process. You can use it as follows:
import nemo_run as run
# For pre-training
run.run(pretrain, executor=run.LocalExecutor())
# For fine-tuning
run.run(finetune, executor=run.LocalExecutor())
Alternatively, you can run it directly in the same Python process as follows:
# For pre-training
run.run(pretrain, direct=True)
# For fine-tuning
run.run(finetune, direct=True)
Nemotron-H Inference#
To run inference with Nemotron-H models, you can use the following command:
torchrun --nproc-per-node=8 /opt/NeMo/scripts/llm/generate.py \
    --model_path=<PATH_TO_NEMO2_MODEL> \
    --tp=8 \
    --devices=8 \
    --num_tokens_to_generate=40 \
    --temperature=0.001 \
    --top_p=0.0 \
    --top_k=1 \
    --fp8
Note
Inference can be performed in either FP8 or BF16 precision. To use FP8, include the --fp8 flag as shown above. To use BF16, simply remove the --fp8 flag from the command.
Mamba2 and Mamba2-Hybrid Models#
Introduction#
State Space Models (SSMs) have recently emerged as a promising alternative to transformers. SSMs offer advantages such as linear time complexity relative to sequence length and a constant cache size for inference. These features enable the processing of longer sequences and higher throughput. Despite these benefits, SSMs alone may fall short compared to transformers on tasks that demand strong copying or in-context learning capabilities.
To harness the strengths of both approaches, SSM-Hybrid models incorporate MLP, Transformer, and SSM blocks in their architecture. As highlighted in a study by NVIDIA, these hybrid models outperform traditional Transformers of the same size, achieving faster inference times due to the inclusion of SSM blocks. Based on experimental results, Mamba2-Hybrid models not only surpass Transformer baselines in accuracy but also benefit from increased computational efficiency.
Available Models#
The Mamba2 models discussed in the "Transformers are SSMs" paper are available in five different sizes: 130 million, 370 million, 780 million, 1.3 billion, and 2.7 billion parameters. The Mamba2-Hybrid models, along with their Mamba2 baseline released by NVIDIA, are provided at a size of 8 billion parameters.
A comprehensive list of pre-training and fine-tuning recipes that we currently support or plan to support soon is provided below for reference:
| Recipe | Status |
| --- | --- |
| Mamba2 130M | Yes |
| Mamba2 370M | Yes |
| Mamba2 780M | Yes |
| Mamba2 1.3B | Yes |
| Mamba2 2.7B | Yes |
| Mamba2 8B | Yes |
| Mamba2 Hybrid-8B | Yes |
Mamba2 Training Recipes#
We provide pre-defined recipes for pre-training and fine-tuning Mamba2 and Mamba2-Hybrid models in the following sizes: 130M, 370M, 780M, 1.3B, 2.7B, 8B, and Hybrid-8B, using NeMo 2.0 and NeMo-Run. These recipes configure a run.Partial for one of the nemo.collections.llm API functions introduced in NeMo 2.0. The recipes are hosted in the recipes folder (for example, mamba_130m.py).
Pre-Training Example:
from nemo.collections import llm

pretrain = llm.mamba2_130m.pretrain_recipe(
    tokenizer_model="/path/to/tokenizer/model",
    name="mamba2_130m_pretraining",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
)

# Replace the default MockDataModule with your own data module.
# gbs and mbs must match the global and micro batch sizes used by the recipe.
dataloader = a_function_that_configures_your_custom_dataset(
    gbs=gbs,
    mbs=mbs,
    seq_length=pretrain.model.config.seq_length,
)
pretrain.data = dataloader
Fine-Tuning Example:
from nemo.collections import llm

finetune = llm.mamba2_130m.finetune_recipe(
    resume_path="/path/to/nemo/checkpoint",
    tokenizer_model="/path/to/tokenizer/model",
    name="mamba2_130m_finetuning",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
)

# Replace the default SquadDataModule with your own data module.
# gbs and mbs must match the global and micro batch sizes used by the recipe.
dataloader = a_function_that_configures_your_custom_dataset(
    gbs=gbs,
    mbs=mbs,
    seq_length=finetune.model.config.seq_length,
)
finetune.data = dataloader
Note
For pre-training, the recipes use the MockDataModule for the data argument. You are expected to replace the MockDataModule with your custom dataset.
For fine-tuning, the recipes use the SquadDataModule (designed for the SQuAD dataset) for the data argument. You are expected to replace the SquadDataModule with your custom dataset.
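For fine-tuning, the replacement can follow the same pattern. Below is a minimal sketch, assuming instruction-style JSONL data and the llm.FineTuningDataModule class from NeMo 2.0; the dataset root is a placeholder:

import nemo_run as run
from nemo.collections import llm

def a_function_that_configures_your_custom_dataset(gbs, mbs, seq_length):
    # Configure a fine-tuning data module that reads JSONL files
    # (for example training.jsonl and validation.jsonl) from dataset_root.
    return run.Config(
        llm.FineTuningDataModule,
        dataset_root="/path/to/finetuning/data",
        seq_length=seq_length,
        global_batch_size=gbs,
        micro_batch_size=mbs,
    )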
Note
For Mamba2 and Mamba2-Hybrid models, you should provide a path to the tokenizer model (the argument defaults to None) if the tokenizer is not available on the Hugging Face model card. This is the case for the 8B and Hybrid 8B models (for the other variants, leave it set to None). The tokenizer model is located here.
For fine-tuning, you should provide your NeMo checkpoint via resume_path for all models.
Note
The configuration in the recipes is done using the NeMo-Run run.Config and run.Partial configuration objects. Please review the NeMo-Run documentation to learn more about its configuration and execution system.
Once you have your final configuration ready, you can execute it on any of the NeMo-Run supported executors. The simplest is the local executor, which runs the training locally in a separate process. You can use it as follows:
import nemo_run as run
# For pre-training
run.run(pretrain, executor=run.LocalExecutor())
# For fine-tuning
run.run(finetune, executor=run.LocalExecutor())
Alternatively, you can run it directly in the same Python process as follows:
# For pre-training
run.run(pretrain, direct=True)
# For fine-tuning
run.run(finetune, direct=True)