Nemotron 3 Nano#

Nemotron 3 Nano is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. The model employs a hybrid Mixture-of-Experts (MoE) architecture, consisting of 23 Mamba-2 and MoE layers, along with 6 Attention layers. Each MoE layer includes 128 experts plus 1 shared expert, with 5 experts activated per token. The model has 3.5B active parameters and 30B parameters in total.

NeMo Megatron Bridge supports pretraining, full-parameter fine-tuning, and LoRA fine-tuning of this model. The fine-tuned model can be converted back to the 🤗 Hugging Face format for downstream evaluation.

Important

Run all commands from /opt/Megatron-Bridge (e.g., docker run -w /opt/Megatron-Bridge ...).

Tip

This page uses the following environment variables throughout:

HF_MODEL_ID=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
MEGATRON_MODEL_PATH=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 (feel free to set your own path)
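
As a minimal sketch, both variables can be exported once per shell session. The `${VAR:-default}` form keeps any value that is already set, so a custom path is not overwritten; the defaults are the ones listed above.

```shell
# Export the variables used on this page, keeping any value already set in the environment.
export HF_MODEL_ID="${HF_MODEL_ID:-nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16}"
export MEGATRON_MODEL_PATH="${MEGATRON_MODEL_PATH:-/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16}"

echo "HF_MODEL_ID=${HF_MODEL_ID}"
echo "MEGATRON_MODEL_PATH=${MEGATRON_MODEL_PATH}"
```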

Conversion with 🤗 Hugging Face#

Import HF → Megatron#

To import the HF model into Megatron format, run the following command, replacing /path/to/output/megatron/ckpt with your desired output path (for example, $MEGATRON_MODEL_PATH).

uv run python examples/conversion/convert_checkpoints.py import  \
--hf-model $HF_MODEL_ID  \
--megatron-path /path/to/output/megatron/ckpt \
--trust-remote-code
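
Before pointing a training job at the imported checkpoint, it can help to confirm that the import actually wrote something. The helper below is a hypothetical sketch, not part of Megatron Bridge: it only checks that the output directory exists and is non-empty, and makes no assumption about the files inside it.

```shell
# Return 0 if the given path is a directory containing at least one entry.
# This checks only that something was written, not that the checkpoint is valid.
ckpt_dir_is_populated() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Hypothetical usage after the import command has finished:
# ckpt_dir_is_populated /path/to/output/megatron/ckpt || echo "import produced no files" >&2
```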

Export Megatron → HF#

uv run python examples/conversion/convert_checkpoints.py export  \
--hf-model $HF_MODEL_ID  \
--megatron-path /path/to/trained/megatron/ckpt \
--hf-path /path/to/output/hf/ckpt

Pretraining Examples#

BLEND_PATH=/path/to/dataset/blend
TOKENIZER_MODEL=/path/to/tiktoken/tokenizer/model

uv run python -m torch.distributed.run --nproc-per-node=8 examples/models/nemotron/nemotron_3/nano/pretrain_nemotron_3_nano.py \
--per-split-data-args-path=${BLEND_PATH} \
--tokenizer-model=${TOKENIZER_MODEL} \
train.global_batch_size=3072 \
train.train_iters=39500 \
scheduler.lr_warmup_iters=350

Notes:

The default parallelism settings are TP=4, EP=8, PP=1, CP=1. It is recommended to run this pretraining on 4 H100 nodes (32 GPUs).
To enable wandb logging, append the arguments logger.wandb_project=PROJECT_NAME, logger.wandb_entity=ENTITY_NAME, and logger.wandb_exp_name=EXP_NAME.
If BLEND_PATH and TOKENIZER_MODEL are not specified, a mock dataset is used.
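
The parallelism note can be checked with a little arithmetic. The sketch below assumes 8 GPUs per node (matching --nproc-per-node=8) and the usual Megatron convention that the data-parallel size is the world size divided by TP × PP × CP. Expert parallelism (EP=8) is left out of this calculation.

```shell
# Values taken from the note above; GPUS_PER_NODE=8 matches --nproc-per-node=8.
NODES=4
GPUS_PER_NODE=8
TP=4; PP=1; CP=1
GLOBAL_BATCH_SIZE=3072

WORLD_SIZE=$((NODES * GPUS_PER_NODE))
MODEL_PARALLEL=$((TP * PP * CP))

# The world size must be a multiple of the model-parallel size.
if [ $((WORLD_SIZE % MODEL_PARALLEL)) -ne 0 ]; then
    echo "world size ${WORLD_SIZE} is not divisible by TP*PP*CP=${MODEL_PARALLEL}" >&2
    exit 1
fi
DP=$((WORLD_SIZE / MODEL_PARALLEL))
SAMPLES_PER_DP_RANK=$((GLOBAL_BATCH_SIZE / DP))

echo "world_size=${WORLD_SIZE} data_parallel=${DP} samples_per_dp_rank_per_iter=${SAMPLES_PER_DP_RANK}"
# prints: world_size=32 data_parallel=8 samples_per_dp_rank_per_iter=384
```

Under these assumptions, each data-parallel rank processes 384 samples per iteration, split into micro-batches according to the recipe's micro-batch size.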

Finetuning Recipes#

Full Parameter Fine-Tuning#

uv run python -m torch.distributed.run --nproc-per-node=8 examples/models/nemotron/nemotron_3/nano/finetune_nemotron_3_nano.py \
train.global_batch_size=128 \
train.train_iters=100 \
scheduler.lr_warmup_iters=10 \
checkpoint.pretrained_checkpoint=/path/to/output/megatron/ckpt

Notes:

The default parallelism settings are TP=1, EP=8, PP=1, CP=1. Running this recipe requires at least 2 H100 nodes (16 GPUs).
By default, the SQuAD dataset is used. To use a custom dataset, see this tutorial.
Fine-tuning requires a pretrained Megatron checkpoint, which can be obtained by following the “Import HF → Megatron” section above.
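
The command above launches 8 processes on a single node, while the note calls for at least 2 nodes. One possible way to span nodes is to use the standard torch.distributed.run multi-node flags (--nnodes, --node-rank, --master-addr, --master-port). The sketch below only prints the command for each node rather than running it; the host name and port are placeholders, and whether the recipe needs further multi-node settings (or a scheduler such as Slurm) is not covered here.

```shell
# Print (rather than run) the launch command for one node of a multi-node job.
# MASTER_ADDR and MASTER_PORT are placeholders; every node must use the same values.
print_finetune_launch() {
    node_rank="$1"
    nnodes="${2:-2}"
    echo "uv run python -m torch.distributed.run" \
        "--nnodes=${nnodes} --node-rank=${node_rank}" \
        "--nproc-per-node=8" \
        "--master-addr=${MASTER_ADDR:-node0.example.com} --master-port=${MASTER_PORT:-29500}" \
        "examples/models/nemotron/nemotron_3/nano/finetune_nemotron_3_nano.py" \
        "train.global_batch_size=128 train.train_iters=100 scheduler.lr_warmup_iters=10" \
        "checkpoint.pretrained_checkpoint=/path/to/output/megatron/ckpt"
}

print_finetune_launch 0   # command for the first node
print_finetune_launch 1   # command for the second node
```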

LoRA Fine-Tuning#

To enable LoRA fine-tuning, pass --peft lora to the script.

uv run python -m torch.distributed.run --nproc-per-node=8 examples/models/nemotron/nemotron_3/nano/finetune_nemotron_3_nano.py \
--peft lora \
train.global_batch_size=128 \
train.train_iters=100 \
scheduler.lr_warmup_iters=10 \
checkpoint.pretrained_checkpoint=/path/to/output/megatron/ckpt

Notes:

By default, the target modules are the linear layers ["linear_qkv", "linear_proj", "linear_fc1", "linear_fc2", "in_proj", "out_proj"] in the model.
The remaining settings are the same as for full-parameter fine-tuning above.
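
To see why a LoRA checkpoint is small, recall that an adapter on a linear layer adds two low-rank matrices, so it contributes rank × (in_features + out_features) trainable parameters. The sketch below computes that count; the layer sizes and rank are illustrative only and are not taken from the recipe or the model config.

```shell
# Trainable parameters added by one LoRA adapter on a linear layer:
# A is (rank x in_features), B is (out_features x rank).
lora_params() {
    in_features="$1"; out_features="$2"; rank="$3"
    echo $((rank * (in_features + out_features)))
}

# Illustrative sizes only (not taken from the model config).
lora_params 4096 4096 16    # prints 131072
```

For comparison, the full 4096 × 4096 weight in this illustration has 16,777,216 parameters, so the adapter is under 1% of its size.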

A LoRA checkpoint contains only the learnable adapter weights. To convert a LoRA checkpoint to the Hugging Face format for downstream evaluation, the LoRA adapters must first be merged back into the base model.

uv run python examples/peft/merge_lora.py \
--hf-model-path $HF_MODEL_ID \
--lora-checkpoint /path/to/lora/ckpt/iter_xxxxxxx \
--output /path/to/merged/hf/ckpt
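
When several iterations have been saved, the most recent one is usually the one to merge. The hypothetical helper below picks the highest-numbered iter_* directory under a checkpoint root. It assumes zero-padded directory names, as the iter_xxxxxxx placeholder suggests, so that lexical order matches numeric order.

```shell
# Print the highest-numbered iter_* directory under a checkpoint root.
# Assumes zero-padded names (as in iter_xxxxxxx), so lexical order equals numeric order.
latest_iter_dir() {
    root="$1"
    latest=""
    for d in "$root"/iter_*/; do
        [ -d "$d" ] || continue   # skip the unexpanded pattern when nothing matches
        latest="${d%/}"           # glob results are sorted, so the last one wins
    done
    [ -n "$latest" ] && echo "$latest"
}

# Hypothetical usage:
# LORA_CKPT="$(latest_iter_dir /path/to/lora/ckpt)"
```

The function returns a non-zero status and prints nothing when no iter_* directory is found, so a calling script can stop before invoking the merge.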