Nemotron Nano V2 VL#
NVIDIA Nemotron Nano v2 VL is an open 12B multimodal reasoning model for document intelligence and video understanding. It enables AI assistants to extract, interpret, and act on information across text, images, tables, and videos. This makes the model valuable for agents focused on data analysis, document processing, and visual understanding in applications such as report generation, video curation, dense captioning for media asset management, and retrieval-augmented search.
NeMo Megatron Bridge supports finetuning this model (including LoRA finetuning) on single-image, multi-image, and video datasets. The finetuned model can be converted back to the 🤗 Hugging Face format for downstream evaluation.
Important
Please use the custom container nvcr.io/nvidia/nemo:25.09.nemotron_nano_v2_vl when working with this model.
Run all commands from /opt/Megatron-Bridge (e.g. docker run -w /opt/Megatron-Bridge ...)
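For reference, a container launch might look like the following; the GPU flag, volume mount, and host paths are illustrative and should be adapted to your setup.

docker run --gpus all -it --rm \
-v /models:/models \
-w /opt/Megatron-Bridge \
nvcr.io/nvidia/nemo:25.09.nemotron_nano_v2_vl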
Tip
We use the following environment variables throughout this page:

- HF_MODEL_PATH=nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16
- MEGATRON_MODEL_PATH=/models/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 (feel free to set your own path)
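For example, you can set these in your shell before running the commands below (the Megatron path is only a suggestion):

export HF_MODEL_PATH=nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16
export MEGATRON_MODEL_PATH=/models/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16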
Unless explicitly stated, any Megatron model path in the commands below should NOT contain the iteration number iter_xxxxxx. For more details, please see the checkpointing documentation.
Conversion with 🤗 Hugging Face#
Import HF → Megatron#
To import the HF model to your desired $MEGATRON_MODEL_PATH, run the following command.
python examples/conversion/convert_checkpoints.py import \
--hf-model $HF_MODEL_PATH \
--megatron-path $MEGATRON_MODEL_PATH \
--trust-remote-code
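As a quick sanity check, you can list the imported checkpoint directory; it should contain iteration subdirectories such as iter_0000000 (exact contents may vary), while the commands below always take the top-level path.

ls $MEGATRON_MODEL_PATH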
Export Megatron → HF#
You can export a trained model with the following command.
python examples/conversion/convert_checkpoints.py export \
--hf-model $HF_MODEL_PATH \
--megatron-path <trained megatron model path> \
--hf-path <output hf model path> \
--not-strict
Note: it is normal to see a warning that vision_model.radio_model.input_conditioner.norm_mean and vision_model.radio_model.input_conditioner.norm_std from source are not in the exported checkpoint. These two weights are not needed in the checkpoint.
Run In-Framework Inference on Converted Checkpoint#
You can run a quick sanity check on the converted checkpoint with the following command.
python examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path $HF_MODEL_PATH \
--megatron_model_path $MEGATRON_MODEL_PATH \
--image_path <example image path> \
--prompt "Describe this image." \
--max_new_tokens 100 \
--use_llava_model
Note:
- --megatron_model_path is optional. If not specified, the script converts the model and then runs forward; if specified, the script just loads the Megatron model.
- --max_new_tokens controls the number of tokens to generate.
- For inference with multiple images, pass in a comma-separated list, e.g. --image_path="/path/to/example1.jpeg,/path/to/example2.jpeg", and use a suitable prompt, e.g. --prompt="Describe the two images in detail.".
- For inference with video, pass in a video path instead, e.g. --video_path="/path/to/demo.mp4", and use a suitable prompt, e.g. --prompt="Describe what you see." (a full example command follows this note).
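For example, a video inference call might look like this (the video path is a placeholder):

python examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path $HF_MODEL_PATH \
--megatron_model_path $MEGATRON_MODEL_PATH \
--video_path "/path/to/demo.mp4" \
--prompt "Describe what you see." \
--max_new_tokens 100 \
--use_llava_model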
Finetuning Recipes#
Before training, ensure the following environment variables are set.
- SAVE_DIR: a checkpoint and log saving directory, used in the commands below.
- HF_TOKEN: to download models from the HF Hub.
- HF_HOME: (optional) to avoid re-downloading models and datasets every time.
- WANDB_API_KEY: (optional) to enable WandB logging.
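For example (the values below are placeholders):

export SAVE_DIR=/results/nemotron_nano_v2_vl
export HF_TOKEN=<your HF Hub token>
export HF_HOME=/cache/huggingface
export WANDB_API_KEY=<your WandB API key>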
Full Finetuning#
Example usage for full parameter finetuning using the Raven dataset:
torchrun --nproc-per-node=8 examples/recipes/nemotron_vl/finetune_nemotron_nano_v2_vl.py \
--hf-model-path $HF_MODEL_PATH \
--pretrained-checkpoint <megatron model path> \
dataset.maker_name make_raven_dataset \
logger.wandb_project <optional wandb project name> \
logger.wandb_save_dir $SAVE_DIR \
checkpoint.save $SAVE_DIR/<experiment name>
Note:
- The config file examples/recipes/nemotron_vl/conf/nemotron_nano_v2_vl_override_example.yaml contains the list of arguments that can be overridden in the command. For example, you can set train.global_batch_size=<batch size> in the command (see the example after this note).
- To change the dataset, you only need to change dataset.maker_name. See the dataset section below for details.
- After training, you can run inference with hf_to_megatron_generate_vlm.py by supplying the trained Megatron checkpoint. You can also export the trained checkpoint to Hugging Face format.
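For instance, a run that overrides the global batch size (the value 32 is hypothetical) while keeping the RAVEN dataset could look like the following, assuming the same space-separated override style as the commands above:

torchrun --nproc-per-node=8 examples/recipes/nemotron_vl/finetune_nemotron_nano_v2_vl.py \
--hf-model-path $HF_MODEL_PATH \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
dataset.maker_name make_raven_dataset \
train.global_batch_size 32 \
checkpoint.save $SAVE_DIR/<experiment name>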
Parameter-Efficient Finetuning (PEFT)#
Parameter-efficient finetuning (PEFT) using LoRA is supported. LoRA can be independently applied to the vision model, vision projection, and language model. We support two commonly used settings out of the box in the example script:
Apply LoRA to the language model and fully finetune the vision model and projection (used when the visual data distribution differs substantially from pretraining).
torchrun --nproc-per-node=8 examples/recipes/nemotron_vl/finetune_nemotron_nano_v2_vl.py \
--hf-model-path $HF_MODEL_PATH \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--lora-on-language-model \
dataset.maker_name make_raven_dataset \
logger.wandb_project <optional wandb project name> \
logger.wandb_save_dir $SAVE_DIR \
checkpoint.save $SAVE_DIR/<experiment name> \
model.freeze_language_model True \
model.freeze_vision_model False \
model.freeze_vision_projection False
Apply LoRA to all linear layers in attention and MLP modules of the vision model, vision projection, and the language model.
torchrun --nproc-per-node=8 examples/recipes/nemotron_vl/finetune_nemotron_nano_v2_vl.py \
--hf-model-path $HF_MODEL_PATH \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--lora-on-language-model \
--lora-on-vision-model \
dataset.maker_name make_raven_dataset \
logger.wandb_project <optional wandb project name> \
logger.wandb_save_dir $SAVE_DIR \
checkpoint.save $SAVE_DIR/<experiment name> \
model.freeze_language_model True \
model.freeze_vision_model True \
model.freeze_vision_projection True
A LoRA checkpoint contains only the learnable adapter weights. To convert the LoRA checkpoint to Hugging Face format for downstream evaluation, you must first merge the LoRA adapters back into the base model.
python examples/peft/merge_lora.py \
--hf-model-path $HF_MODEL_PATH \
--lora-checkpoint <trained LoRA checkpoint>/iter_N \
--output <LoRA checkpoint merged>
You can now run in-framework inference with hf_to_megatron_generate_vlm.py by supplying the merged LoRA checkpoint.
You can also export the merged LoRA checkpoint to Hugging Face format, as shown below.
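For example, the export can reuse the conversion script shown above (the output path is illustrative):

python examples/conversion/convert_checkpoints.py export \
--hf-model $HF_MODEL_PATH \
--megatron-path <LoRA checkpoint merged> \
--hf-path /results/nemotron-nano-v2-vl-lora-merged-hf \
--not-strict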
Example Datasets#
Megatron Bridge supports various vision-language dataset examples which can be used to finetune Nemotron Nano V2 VL:
| Dataset | Maker Name | Description |
|---|---|---|
|  |  | OCR receipts: single-image-text dataset for receipt understanding; outputs XML-like annotated text. |
|  |  | Medical VQA: single-image question-answer dataset covering clinical medical images with free-form answers. |
| RAVEN | make_raven_dataset | Visual reasoning: multi-image visual reasoning dataset for analogical reasoning across different visual layouts. |
| LLaVA Video |  | Video understanding: video question-answer dataset covering everyday scenarios. |
Note on video training example:
- We provide a video config yaml file that overrides a few settings of the default config yaml file. Please pass in --config-file "examples/recipes/nemotron_vl/conf/nemotron_nano_v2_vl_video.yaml".
- The LLaVA video dataset requires manual download beforehand. Please place the downloaded and extracted video files in a folder VIDEO_ROOT and pass it to the maker with dataset.maker_kwargs={"video_root_path":$VIDEO_ROOT}. For the nextqa subset example, VIDEO_ROOT should look like:

$VIDEO_ROOT/
├── NextQA/
│   └── NExTVideo/
│       ├── 0000/
│       │   └── 2440175990.mp4
│       ├── 0001/
│       │   └── ...
│       └── ...
└── ...
Full video training example command:
torchrun --nproc-per-node=8 examples/recipes/nemotron_vl/finetune_nemotron_nano_v2_vl.py \
--hf-model-path $HF_MODEL_PATH \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--config-file "examples/recipes/nemotron_vl/conf/nemotron_nano_v2_vl_video.yaml" \
logger.wandb_project <optional wandb project name> \
logger.wandb_save_dir $SAVE_DIR \
checkpoint.save $SAVE_DIR/<experiment name> \
dataset.maker_kwargs {"video_root_path":$VIDEO_ROOT}