Fine-Tune a Model#

You can easily deploy custom, fine-tuned models on NVIDIA NIM for VLMs. NIM automatically builds an optimized TensorRT-LLM engine locally from weights provided in HuggingFace format.

Usage#

Launch the NIM container with your custom VLM model. With the command below, the checkpoint will be prepared with the necessary vision processor file during startup.

export CUSTOM_WEIGHTS=/path/to/customized/llama
docker run -it --rm --name=llama-3.1-nemotron-nano-vl-8b-v1 \
    --gpus all \
    --shm-size=16GB \
    -e NIM_FT_MODEL=$CUSTOM_WEIGHTS \
    -e NIM_SERVED_MODEL_NAME="llama-3.1-nemotron-nano-vl-8b-v1" \
    -e NIM_CUSTOM_MODEL_NAME=custom_1 \
    -v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \
    -u $(id -u) \
    -p 8000:8000 \
    $NIM_IMAGE \
    bash -c "mkdir -p $CUSTOM_WEIGHTS/visual_engine && cp /opt/nim/llm/vision_processor.py $CUSTOM_WEIGHTS/visual_engine/vision_processor.py && bash /opt/nim/start_server.sh"

The NIM_FT_MODEL environment variable must be set to the path of a directory containing either a HuggingFace checkpoint or a quantized checkpoint. The checkpoints are served using the TRTLLM backend. Setting NIM_CUSTOM_MODEL_NAME caches the locally built model so that subsequent runs start faster.

The HuggingFace checkpoint should have the following directory structure:

├── config.json
├── generation_config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── ...
├── model.safetensors.index.json
├── runtime_params.json
├── special_tokens_map.json
├── tokenizer.json
└── tokenizer_config.json

A TRTLLM checkpoint should have the following directory structure:

/path/to/customized/llama/
    ├── config.json
    ├── generation_config.json
    ├── model.safetensors.index.json
    ├── special_tokens_map.json
    ├── tokenizer.json
    ├── tokenizer_config.json
    ├── ...
    └── trtllm_ckpt
        ├── config.json
        ├── rank0.safetensors
        └── ...

Note

The TRTLLM checkpoint root directory should contain the HF tokenizer and configuration files. The trtllm_ckpt sub-folder in this example should contain the TRTLLM checkpoint configuration file and the weight tensors.
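
Before launching the container, you can optionally confirm that the layout matches the trees above. A minimal sketch, assuming CUSTOM_WEIGHTS points at the checkpoint root; rank file names vary with the tensor-parallel size:

# Verify the HF tokenizer/config files at the checkpoint root
ls $CUSTOM_WEIGHTS/config.json $CUSTOM_WEIGHTS/tokenizer.json $CUSTOM_WEIGHTS/tokenizer_config.json

# Verify the TRTLLM checkpoint config and weight tensors in the sub-folder
ls $CUSTOM_WEIGHTS/trtllm_ckpt/config.json $CUSTOM_WEIGHTS/trtllm_ckpt/rank*.safetensors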

Preparing Checkpoints for FP8 Precision#

For improved performance, you can convert your checkpoints to FP8 precision. The NIM container includes the TensorRT Model Optimizer toolkit for this purpose. See the TensorRT Model Optimizer documentation for more details.

Note

Currently, only HuggingFace checkpoints are supported for FP8 conversion.
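
The commands below reuse the CUSTOM_WEIGHTS path from the Usage section, plus a calibration dataset path and the NIM container image. A minimal setup sketch; the values are placeholders to adapt to your environment:

export CUSTOM_WEIGHTS=/path/to/customized/llama      # HuggingFace checkpoint to quantize
export YOUR_CALIB_SET_PATH=/path/to/cnn_dailymail    # where the calibration dataset is downloaded
export NIM_IMAGE=<NIM for VLMs container image>      # same image used to launch the server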

First, set up the dataset for calibration. The following example uses the cnn_dailymail dataset, as recommended by the TensorRT Model Optimizer documentation. NVIDIA models calibrated with this dataset show good accuracy; verify that your model maintains comparable accuracy after quantization.

docker run -it --rm \
    -v $YOUR_CALIB_SET_PATH:$YOUR_CALIB_SET_PATH \
    -u $(id -u) \
    -e HF_HUB_OFFLINE=0 \
    $NIM_IMAGE \
    huggingface-cli download cnn_dailymail \
        --repo-type dataset \
        --local-dir $YOUR_CALIB_SET_PATH

Note

If you are using the cnn_dailymail dataset, make sure that $YOUR_CALIB_SET_PATH includes "cnn_dailymail" in the path (for example, /path/to/cnn_dailymail) so that the quantize.py script properly recognizes the dataset structure.

Once the dataset is ready, convert the checkpoint using the NIM container:

docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -v $YOUR_CALIB_SET_PATH:$YOUR_CALIB_SET_PATH \
    -v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \
    -u $(id -u) \
    -e HF_HUB_OFFLINE=0 \
    $NIM_IMAGE \
    python3 /app/tensorrt_llm/examples/quantization/quantize.py \
        --model_dir $CUSTOM_WEIGHTS \
        --dtype bfloat16 \
        --qformat fp8 \
        --kv_cache_dtype fp8 \
        --calib_size 512 \
        --output_dir $CUSTOM_WEIGHTS/trtllm_ckpt/ \
        --device cuda \
        --calib_dataset $YOUR_CALIB_SET_PATH

After conversion, the checkpoint directory contains a trtllm_ckpt sub-folder with the quantized checkpoint. You can then serve this converted checkpoint in the same way as the non-converted checkpoints described in the previous sections.
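
For example, because the quantized weights are written to $CUSTOM_WEIGHTS/trtllm_ckpt, you can reuse the launch command from the Usage section against the same checkpoint root. A minimal sketch; the served model name follows the earlier example, and options such as NIM_CUSTOM_MODEL_NAME can be added as before:

# Confirm the quantized checkpoint was written (expect config.json and rank*.safetensors)
ls $CUSTOM_WEIGHTS/trtllm_ckpt/

# Relaunch NIM against the same checkpoint root, as in the Usage section
docker run -it --rm --gpus all --shm-size=16GB \
    -e NIM_FT_MODEL=$CUSTOM_WEIGHTS \
    -e NIM_SERVED_MODEL_NAME="llama-3.1-nemotron-nano-vl-8b-v1" \
    -v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \
    -u $(id -u) \
    -p 8000:8000 \
    $NIM_IMAGE \
    bash -c "mkdir -p $CUSTOM_WEIGHTS/visual_engine && cp /opt/nim/llm/vision_processor.py $CUSTOM_WEIGHTS/visual_engine/vision_processor.py && bash /opt/nim/start_server.sh"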