Mamba 2#
New in Mamba: Nemotron-H Models#
Introduction#
Nemotron-H is a family of hybrid Mamba-Transformer models designed to reduce inference cost while maintaining high accuracy. These models replace the majority of self-attention layers in the common Transformer architecture with Mamba layers that perform constant computation and require constant memory per generated token.
The Nemotron-H family includes models with 8B, 47B, and 56B parameters. According to the research, these models offer better or on-par accuracy compared to other similarly sized state-of-the-art open-source Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3× faster at inference. The 47B model, created using a new compression technique called MiniPuzzle, achieves accuracy similar to the 56B model while being 20% faster at inference.
A notable feature of Nemotron-H is its FP8-based training recipe, which achieves results on par with BF16-based training. This recipe was used to train the 56B model.
Note
To use Nemotron-H models, use the nemo:25.04.nemotron-h container. Support for Nemotron-H will be included in future official NeMo containers.
Nemotron-H Training Recipes#
We provide pre-defined recipes for both pre-training and fine-tuning Nemotron-H models in three sizes: 8B, 47B, and 56B. Here are examples for each:
Pre-Training Example:
from nemo.collections import llm

# For the 8B model
pretrain = llm.nemotronh_8b.pretrain_recipe(
    name="nemotronh_8b_pretraining",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
    tensor_parallelism=2,
    gbs=768,
    sequence_parallelism=True,
    vocab_file="/path/to/vocab.json",
)

# For the 47B model
# pretrain = llm.nemotronh_47b.pretrain_recipe(...)

# For the 56B model
# pretrain = llm.nemotronh_56b.pretrain_recipe(...)

# Replace the default MockDataModule with your own data module.
# gbs and mbs must match the global and micro batch sizes used by the recipe.
dataloader = a_function_that_configures_your_custom_dataset(
    gbs=gbs,
    mbs=mbs,
    seq_length=pretrain.model.config.seq_length,
)
pretrain.data = dataloader
Fine-Tuning Example:
from nemo.collections import llm

# For the 8B model
finetune = llm.nemotronh_8b.finetune_recipe(
    resume_path="/path/to/nemo/checkpoint",
    name="nemotronh_8b_finetuning",
    dir="/path/to/checkpoints",
    num_nodes=32,
    num_gpus_per_node=8,
    tensor_parallelism=2,
    gbs=768,
    sequence_parallelism=True,
    vocab_file="/path/to/vocab.json",
)

# For the 47B model
# finetune = llm.nemotronh_47b.finetune_recipe(...)

# For the 56B model
# finetune = llm.nemotronh_56b.finetune_recipe(...)

# Replace the default MockDataModule with your own data module.
# gbs and mbs must match the global and micro batch sizes used by the recipe.
dataloader = a_function_that_configures_your_custom_dataset(
    gbs=gbs,
    mbs=mbs,
    seq_length=finetune.model.config.seq_length,
)
finetune.data = dataloader
Note
For pre-training and fine-tuning, the recipes use the MockDataModule for the data argument. You are expected to replace the MockDataModule with your custom dataset.
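For illustration, the placeholder a_function_that_configures_your_custom_dataset used in the examples above could return a configured data module. The following is a minimal sketch, assuming a pre-tokenized Megatron-style dataset and the llm.PreTrainingDataModule class from NeMo 2.0; the dataset path is a placeholder:

import nemo_run as run
from nemo.collections import llm

def a_function_that_configures_your_custom_dataset(gbs, mbs, seq_length):
    # Configure a data module backed by pre-tokenized Megatron-style
    # .bin/.idx files. The path below is a placeholder.
    return run.Config(
        llm.PreTrainingDataModule,
        paths=["/path/to/my_dataset_text_document"],
        seq_length=seq_length,
        global_batch_size=gbs,
        micro_batch_size=mbs,
    )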
Note
The configuration in the recipes is done using the NeMo-Run run.Config and run.Partial configuration objects. Please review the NeMo-Run documentation to learn more about its configuration and execution system.
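Because the recipe is a plain configuration object, you can override individual fields before execution. The snippet below is a minimal sketch, assuming the standard NeMo 2.0 recipe layout; the exact attribute paths and values shown are illustrative and may differ between releases:

# Adjust selected recipe fields before launching (attribute paths are assumptions).
pretrain.trainer.max_steps = 1000           # total training iterations
pretrain.trainer.val_check_interval = 100   # run validation every 100 steps
pretrain.optim.config.lr = 1e-4             # peak learning rate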
Once you have your final configuration ready, you can execute it on any of the NeMo-Run supported executors. The simplest is the local executor, which runs the training locally in a separate process. You can use it as follows:
import nemo_run as run
# For pre-training
run.run(pretrain, executor=run.LocalExecutor())
# For fine-tuning
run.run(finetune, executor=run.LocalExecutor())
Alternatively, you can run it directly in the same Python process as follows:
# For pre-training
run.run(pretrain, direct=True)
# For fine-tuning
run.run(finetune, direct=True)
Nemotron-H Inference#
To run inference with Nemotron-H models, you can use the following command:
torchrun --nproc-per-node=8 /opt/NeMo/scripts/llm/generate.py \
    --model_path=<PATH_TO_NEMO2_MODEL> \
    --tp=8 \
    --devices=8 \
    --num_tokens_to_generate=40 \
    --temperature=0.001 \
    --top_p=0.0 \
    --top_k=1 \
    --fp8
Note
Inference can be performed in either FP8 or BF16 precision. To use FP8, include the --fp8 flag as shown above. To use BF16, simply remove the --fp8 flag from the command.
Mamba2 and Mamba2-Hybrid Models#
Introduction#
State Space Models (SSMs) have recently emerged as a promising alternative to transformers. SSMs offer advantages such as linear time complexity relative to sequence length and a constant cache size for inference. These features enable the processing of longer sequences and higher throughput. Despite these benefits, SSMs alone may fall short compared to transformers on tasks that demand strong copying or in-context learning capabilities.
To harness the strengths of both approaches, SSM-Hybrid models incorporate MLP, Transformer, and SSM blocks in their architecture. As highlighted in a study by NVIDIA, these hybrid models outperform traditional Transformers of the same size, achieving faster inference times due to the inclusion of SSM blocks. Based on experimental results, Mamba2-Hybrid models not only surpass Transformer baselines in accuracy but also benefit from increased computational efficiency.
Available Models#
The Mamba2 models discussed in the "Transformers are SSMs" paper are available in five different sizes: 130 million, 370 million, 780 million, 1.3 billion, and 2.7 billion parameters. The Mamba2-Hybrid models, along with their Mamba2 baseline released by NVIDIA, are provided at a size of 8 billion parameters.
A comprehensive list of pre-training and fine-tuning recipes that we currently support or plan to support soon is provided below for reference:
| Recipe | Status |
| --- | --- |
| Mamba2 130M | Yes |
| Mamba2 370M | Yes |
| Mamba2 780M | Yes |
| Mamba2 1.3B | Yes |
| Mamba2 2.7B | Yes |
| Mamba2 8B | Yes |
| Mamba2 Hybrid-8B | Yes |
Mamba2 Training Recipes#
We provide pre-defined recipes for pre-training and fine-tuning Mamba2 and Mamba2-Hybrid models in the following sizes: 130M, 370M, 780M, 1.3B, 2.7B, 8B, and Hybrid-8B, using NeMo 2.0 and NeMo-Run. These recipes configure a run.Partial for one of the nemo.collections.llm API functions introduced in NeMo 2.0. The recipes are hosted in the recipes folder (for example, mamba_130m.py).
Pre-Training Example:
from nemo.collections import llm

pretrain = llm.mamba2_130m.pretrain_recipe(
    tokenizer_model="/path/to/tokenizer/model",
    name="mamba2_130m_pretraining",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
)

# Replace the default MockDataModule with your own data module.
# gbs and mbs must match the global and micro batch sizes used by the recipe.
dataloader = a_function_that_configures_your_custom_dataset(
    gbs=gbs,
    mbs=mbs,
    seq_length=pretrain.model.config.seq_length,
)
pretrain.data = dataloader
Fine-Tuning Example:
from nemo.collections import llm

finetune = llm.mamba2_130m.finetune_recipe(
    resume_path="/path/to/nemo/checkpoint",
    tokenizer_model="/path/to/tokenizer/model",
    name="mamba2_130m_finetuning",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
)

# Replace the default SquadDataModule with your own data module.
# gbs and mbs must match the global and micro batch sizes used by the recipe.
dataloader = a_function_that_configures_your_custom_dataset(
    gbs=gbs,
    mbs=mbs,
    seq_length=finetune.model.config.seq_length,
)
finetune.data = dataloader
Note
For pre-training, the recipes use the MockDataModule for the data argument. You are expected to replace the MockDataModule with your custom dataset.
For fine-tuning, the recipes use the SquadDataModule (designed for the SQuAD dataset) for the data argument. You are expected to replace the SquadDataModule with your custom dataset.
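For fine-tuning, the replacement can follow the same pattern. Below is a minimal sketch, assuming instruction-style JSONL data and the llm.FineTuningDataModule class from NeMo 2.0; the dataset root is a placeholder:

import nemo_run as run
from nemo.collections import llm

def a_function_that_configures_your_custom_dataset(gbs, mbs, seq_length):
    # Configure a fine-tuning data module that reads JSONL files
    # (for example training.jsonl and validation.jsonl) from dataset_root.
    return run.Config(
        llm.FineTuningDataModule,
        dataset_root="/path/to/finetuning/data",
        seq_length=seq_length,
        global_batch_size=gbs,
        micro_batch_size=mbs,
    )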
Note
For Mamba2 and Mamba2-Hybrid models, you should provide a path to the tokenizer model (the argument defaults to None) if the tokenizer is not available on the Hugging Face model card. This is the case for the 8B and Hybrid 8B models (for the other variants, leave it set to None). The tokenizer model is located here.
For fine-tuning, you should provide your NeMo checkpoint via resume_path for all models.
Note
The configuration in the recipes is done using the NeMo-Run run.Config and run.Partial configuration objects. Please review the NeMo-Run documentation to learn more about its configuration and execution system.
Once you have your final configuration ready, you can execute it on any of the NeMo-Run supported executors. The simplest is the local executor, which runs the training locally in a separate process. You can use it as follows:
import nemo_run as run
# For pre-training
run.run(pretrain, executor=run.LocalExecutor())
# For fine-tuning
run.run(finetune, executor=run.LocalExecutor())
Alternatively, you can run it directly in the same Python process as follows:
# For pre-training
run.run(pretrain, direct=True)
# For fine-tuning
run.run(finetune, direct=True)