Nemotron 3 Nano#
Nemotron 3 Nano is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. The model employs a hybrid Mixture-of-Experts (MoE) architecture, consisting of 23 Mamba-2 and MoE layers, along with 6 Attention layers. Each MoE layer includes 128 experts plus 1 shared expert, with 5 experts activated per token. The model has 3.5B active parameters and 30B parameters in total.
NeMo Megatron Bridge supports pretraining, full-parameter fine-tuning, and LoRA fine-tuning of this model. The fine-tuned model can be converted back to the 🤗 Hugging Face format for downstream evaluation.
Important
Please use the custom container nvcr.io/nvidia/nemo:25.11.nemotron_3_nano when working with this model.
Run all commands from /opt/Megatron-Bridge (e.g. docker run -w /opt/Megatron-Bridge ...)
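For reference, a typical interactive launch of this container might look like the following. This is a hedged sketch: the --gpus/--ipc flags and the /models mount are illustrative and should be adapted to your environment.
docker run --rm -it --gpus all --ipc=host \
-v /models:/models \
-w /opt/Megatron-Bridge \
nvcr.io/nvidia/nemo:25.11.nemotron_3_nano bash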
Tip
We use the following environment variables throughout this page:
HF_MODEL_ID=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
MEGATRON_MODEL_PATH=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 (feel free to set your own path)
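For example, export them in your shell before running the commands below (the local path is only an illustration):
export HF_MODEL_ID=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
export MEGATRON_MODEL_PATH=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16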
Conversion with 🤗 Hugging Face#
Import HF → Megatron#
To import the HF model to your desired $MEGATRON_MODEL_PATH, run the following command.
python examples/conversion/convert_checkpoints.py import \
--hf-model $HF_MODEL_ID \
--megatron-path $MEGATRON_MODEL_PATH \
--trust-remote-code
Export Megatron → HF#
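To export a trained Megatron checkpoint back to the 🤗 Hugging Face format (e.g. after fine-tuning), run the following command.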
python examples/conversion/convert_checkpoints.py export \
--hf-model $HF_MODEL_ID \
--megatron-path /path/to/trained/megatron/ckpt \
--hf-path /path/to/output/hf/ckpt
Pretraining Examples#
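Set the dataset blend and tokenizer model paths, then launch pretraining with torchrun.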
BLEND_PATH=/path/to/dataset/blend
TOKENIZER_MODEL=/path/to/tiktoken/tokenizer/model
torchrun --nproc-per-node=8 examples/recipes/nemotron_3/pretrain_nemotron_3_nano.py \
--per-split-data-args-path=${BLEND_PATH} \
--tokenizer-model=${TOKENIZER_MODEL} \
train.global_batch_size=3072 \
train.train_iters=39500 \
scheduler.lr_warmup_iters=350
Notes:
The default parallelism settings are TP=4, EP=8, PP=1, CP=1. It is recommended to run this pretraining on 4 H100 nodes (32 GPUs).
To enable wandb logging, you can append the logger.wandb_project=PROJECT_NAME, logger.wandb_entity=ENTITY_NAME, and logger.wandb_exp_name=EXP_NAME arguments (see the example after these notes).
If BLEND_PATH and TOKENIZER_MODEL are not specified, a mock dataset will be used.
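For example, the same pretraining launch with wandb logging enabled might look like the following. This is a sketch that assumes the wandb fields live under the logger config section; replace PROJECT_NAME, ENTITY_NAME, and EXP_NAME with your own values.
torchrun --nproc-per-node=8 examples/recipes/nemotron_3/pretrain_nemotron_3_nano.py \
--per-split-data-args-path=${BLEND_PATH} \
--tokenizer-model=${TOKENIZER_MODEL} \
train.global_batch_size=3072 \
train.train_iters=39500 \
scheduler.lr_warmup_iters=350 \
logger.wandb_project=PROJECT_NAME \
logger.wandb_entity=ENTITY_NAME \
logger.wandb_exp_name=EXP_NAME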
Finetuning Recipes#
Full Parameter Fine-Tuning#
torchrun --nproc-per-node=8 examples/recipes/nemotron_3/finetune_nemotron_3_nano.py \
train.global_batch_size=128 \
train.train_iters=100 \
scheduler.lr_warmup_iters=10 \
checkpoint.pretrained_checkpoint=$MEGATRON_MODEL_PATH
Notes:
The default parallelism settings are TP=1, EP=8, PP=1, CP=1. It is recommended to run this recipe on at least 2 H100 nodes (16 GPUs); see the multi-node launch sketch after these notes.
By default, the SQuAD dataset is used. To use a customized dataset, see this tutorial.
Fine-tuning requires a pretrained Megatron checkpoint, which can be obtained by following the “Import HF → Megatron” section above.
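For a 2-node run, a launch using torchrun's rendezvous flags might look like the following. This is a hedged sketch: MASTER_ADDR and the rendezvous port are cluster-specific, the same command must be run on both nodes, and many clusters wrap this in SLURM/srun or a similar launcher instead.
torchrun --nnodes=2 --nproc-per-node=8 \
--rdzv-backend=c10d --rdzv-endpoint=${MASTER_ADDR}:29500 \
examples/recipes/nemotron_3/finetune_nemotron_3_nano.py \
train.global_batch_size=128 \
train.train_iters=100 \
scheduler.lr_warmup_iters=10 \
checkpoint.pretrained_checkpoint=$MEGATRON_MODEL_PATH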
LoRA Fine-Tuning#
To enable LoRA fine-tuning, pass --peft lora to the script:
torchrun --nproc-per-node=8 examples/recipes/nemotron_3/finetune_nemotron_3_nano.py \
--peft lora \
train.global_batch_size=128 \
train.train_iters=100 \
scheduler.lr_warmup_iters=10 \
checkpoint.pretrained_checkpoint=$MEGATRON_MODEL_PATH
Notes:
By default, the LoRA target modules are the linear layers ["linear_qkv", "linear_proj", "linear_fc1", "linear_fc2", "in_proj", "out_proj"] in the model.
The rest of the settings are the same as for full-parameter fine-tuning above.
A LoRA checkpoint contains only the learnable adapter weights. To convert the LoRA checkpoint to the Hugging Face format for downstream evaluation, the LoRA adapters must first be merged back into the base model:
python examples/peft/merge_lora.py \
--hf-model-path $HF_MODEL_ID \
--lora-checkpoint /path/to/lora/ckpt/iter_xxxxxxx \
--output /path/to/merged/ckpt
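The merged checkpoint written to /path/to/merged/ckpt can then be used for downstream evaluation in the same way as any other checkpoint in the 🤗 Hugging Face format.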