Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
Checkpoint Conversion
Obtain the Checkpoints from Hugging Face
To obtain the checkpoint you want from Hugging Face, go to:
- Repository for the Mamba2 and Mamba2-Hybrid models by NVIDIA. The checkpoint from this repository is located in the files tab under release/mp_rank_00/model_optim_rng.pt. The tokenizer is also in the files tab and is named mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model. You need both of these files for the conversion to a .nemo checkpoint.
- Repository for the Mamba2 models from the Transformers are SSMs paper.
For checkpoints from this repository, run the following Python script to convert the PyTorch checkpoint (pytorch_model.bin in the Hugging Face model card) to a format similar to the 8b models:

import torch
import os

ckpt_path = "/path/to/pytorch_model.bin"
pyt_checkpoint = torch.load(ckpt_path)

new_ckpt_path = os.path.join(os.path.dirname(ckpt_path), f"wrapped_{os.path.basename(ckpt_path)}")

# Save the new checkpoint which will be used as the input to the conversion script
torch.save({"model": pyt_checkpoint}, new_ckpt_path)

You will use this wrapped_pytorch_model.bin for the conversion to .nemo in the next step.
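For reference, here is a minimal shell sketch of this step, assuming the 2.7b checkpoint from the Transformers are SSMs paper and that the Python snippet above has been saved as wrap_checkpoint.py. The repository ID, local paths, and file names are illustrative only.

# Download the original PyTorch checkpoint from the Hugging Face model card
# (example repository ID; substitute the model size you need)
huggingface-cli download state-spaces/mamba2-2.7b pytorch_model.bin --local-dir ./mamba2-2.7b

# Wrap the state dict with the snippet above (edit ckpt_path in wrap_checkpoint.py first)
python wrap_checkpoint.py
# -> produces ./mamba2-2.7b/wrapped_pytorch_model.bin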
Convert the PyTorch Checkpoint to a NeMo Checkpoint
Authenticate with NVIDIA NGC: generate an API key from NGC and add it to your credentials by following the instructions in this guide. Then get into the NVIDIA NeMo dev container nvcr.io/nvidia/nemo:dev from NGC, or the 24.07 container (once released).
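A minimal sketch of pulling and entering the container, assuming Docker with GPU support and that your checkpoints live under /path/to/checkpoints (the mount path is illustrative):

# Log in to NGC with the API key generated above (the username is the literal string $oauthtoken)
docker login nvcr.io

# Start the dev container with GPU access and your checkpoint directory mounted
docker run --gpus all -it --rm \
-v /path/to/checkpoints:/workspace/checkpoints \
nvcr.io/nvidia/nemo:dev bash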
Run the conversion script located at /opt/NeMo/scripts/checkpoint_converters/convert_mamba2_pyt_to_nemo.py. For this script, provide the PyTorch state dictionary of the model as the input_name_or_path argument. Note that this argument accepts only a single state_dict.
CUDA_VISIBLE_DEVICES="0" python /opt/NeMo/scripts/checkpoint_converters/convert_mamba2_pyt_to_nemo.py \
--input_name_or_path <path to the source pytorch model> \
--output_path <path to target .nemo model> \
--mamba_ssm_ngroups 8 \
--precision bf16 \
--tokenizer_model_dir=<path to tokenizer.model> # Remove this line (or set it to None) for 130m, 370m, 780m, 1.3b, and 2.7b models.
Note
The mamba_ssm_ngroups
parameter should be set to 1 for the Mamba2 models from the Transformers are SSMs paper (130m, 370m, 780m, 1.3b, and 2.7b) and to 8 for the Mamba2 and Mamba2-Hybrid models by NVIDIA (both 8b).
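To make the note concrete, converting one of the smaller checkpoints (for example, a 2.7b model wrapped as described above) would look roughly like this; all paths are placeholders:

CUDA_VISIBLE_DEVICES="0" python /opt/NeMo/scripts/checkpoint_converters/convert_mamba2_pyt_to_nemo.py \
--input_name_or_path /path/to/wrapped_pytorch_model.bin \
--output_path /path/to/mamba2-2.7b.nemo \
--mamba_ssm_ngroups 1 \
--precision bf16
# --tokenizer_model_dir is omitted for the 130m, 370m, 780m, 1.3b, and 2.7b models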
Run Tensor Parallelism (TP) for 8b Models
Note
Distributed checkpointing for the Mamba2 and Mamba2-Hybrid models will be implemented soon. In the meantime, use the method below to convert to Tensor Parallel (TP) of different sizes.
The Hugging Face checkpoint for the 8b model is configured for a TP size of 1, as is the .nemo checkpoint obtained in the previous step. To shard the model weights for a larger TP size, use the following script from the NeMo repository.
python /opt/NeMo/examples/nlp/language_modeling/mamba_change_num_partition.py \
--model_file=<path to source .nemo model> \
--target_file=<path to target .nemo model> \
--tensor_model_parallel_size=1 \
--target_tensor_model_parallel_size=4 \
--precision=bf16 \
--tokenizer_path=<path to tokenizer.model>
After running this script, a .nemo model and the corresponding number of TP-size folders (4 in this example) will be generated in the target path. The folders for each rank will be displayed as mp_rank_00 to mp_rank_03 in this example.
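A quick sanity check of the output, assuming the example target path and the TP size of 4 used above:

# List the target directory; it should contain the repartitioned .nemo file
# together with one folder per tensor-parallel rank (mp_rank_00 ... mp_rank_03)
ls <path to target directory>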
Note
You can only use Tensor Parallelism for the 8b models by NVIDIA (Mamba2 8b and Mamba2-Hybrid 8b). This is because the mamba_ssm_ngroups
parameter in the model architecture should be divisible by the TP size. The mamba_ssm_ngroups
parameter is 8 for NVIDIA models and 1 for other models in the list.