Fine-Tune a Model#
You can easily deploy custom, fine-tuned models on NVIDIA NIM for VLMs. Given weights in HuggingFace format, NIM automatically builds an optimized TensorRT-LLM engine locally.
Usage#
Launch the NIM container with your custom VLM model. With the command below, the checkpoint is prepared with the necessary vision processor file during startup. Setting NIM_CUSTOM_MODEL_NAME caches the locally built engine so that subsequent runs start faster.
export CUSTOM_WEIGHTS=/path/to/customized/llama
docker run -it --rm --name=llama-3.1-nemotron-nano-vl-8b-v1 \
--gpus all \
--shm-size=16GB \
-e NIM_FT_MODEL=$CUSTOM_WEIGHTS \
-e NIM_SERVED_MODEL_NAME="llama-3.1-nemotron-nano-vl-8b-v1" \
-e NIM_CUSTOM_MODEL_NAME=custom_1 \
-v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \
-u $(id -u) \
-p 8000:8000 \
$NIM_IMAGE \
bash -c "mkdir -p $CUSTOM_WEIGHTS/visual_engine && cp /opt/nim/llm/vision_processor.py $CUSTOM_WEIGHTS/visual_engine/vision_processor.py && bash /opt/nim/start_server.sh"
The NIM_FT_MODEL environment variable must be set to the path of a directory containing a HuggingFace checkpoint or a quantized checkpoint. The checkpoints are served using the TRTLLM backend.
The HuggingFace checkpoint should have the following directory structure:
├── config.json
├── generation_config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── ...
├── model.safetensors.index.json
├── runtime_params.json
├── special_tokens_map.json
├── tokenizer.json
└── tokenizer_config.json
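Before launching the container, you can optionally sanity-check that the core files are present in the checkpoint directory. This is an illustrative shell sketch; the file list is taken from the structure above and may differ for your checkpoint.
# Report any expected HuggingFace checkpoint files that are missing
for f in config.json generation_config.json model.safetensors.index.json tokenizer.json tokenizer_config.json; do
  [ -f "$CUSTOM_WEIGHTS/$f" ] || echo "missing: $f"
done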
A TRTLLM checkpoint should have the following directory structure:
/path/to/customized/llama/
├── config.json
├── generation_config.json
├── model.safetensors.index.json
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer_config.json
├── ...
└── trtllm_ckpt
    ├── config.json
    ├── rank0.safetensors
    └── ...
Note
The TRTLLM checkpoint root directory should have HF tokenizer and configuration files. The sub-folder named trtllm_ckpt in this example should have a TRTLLM checkpoint configuration file and weight tensors.
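A quick way to confirm both requirements is to list the expected files. This is only an illustrative check, assuming the layout shown above; exact file names depend on your model and parallelism settings.
# HF tokenizer and configuration files in the checkpoint root
ls "$CUSTOM_WEIGHTS/config.json" "$CUSTOM_WEIGHTS"/tokenizer*.json
# TRTLLM checkpoint configuration and weight tensors in the sub-folder
ls "$CUSTOM_WEIGHTS/trtllm_ckpt/config.json" "$CUSTOM_WEIGHTS"/trtllm_ckpt/rank*.safetensors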
Preparing Checkpoints for FP8 Precision#
For improved performance, you can convert your checkpoints to FP8 precision. The NIM container includes the TensorRT Model Optimizer toolkit for this purpose; see the TensorRT Model Optimizer documentation for more details.
Note
Currently, only HuggingFace checkpoints are supported for FP8 conversion.
First, set up the dataset for calibration. The following example uses the cnn_dailymail dataset, as recommended by the TensorRT Model Optimizer documentation. NVIDIA models calibrated with this dataset show good accuracy; verify that your model maintains the same accuracy after quantization.
docker run -it --rm \
-v $YOUR_CALIB_SET_PATH:$YOUR_CALIB_SET_PATH \
-u $(id -u) \
-e HF_HUB_OFFLINE=0 \
$NIM_IMAGE \
huggingface-cli download cnn_dailymail \
--repo-type dataset \
--local-dir $YOUR_CALIB_SET_PATH
Note
If you are using the cnn_dailymail dataset, make sure that $YOUR_CALIB_SET_PATH includes "cnn_dailymail" in the path (for example, /path/to/cnn_dailymail) so that the quantize.py script properly recognizes the dataset structure.
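For example, a calibration path that satisfies this naming requirement could be set as follows; the parent directory is an arbitrary placeholder.
export YOUR_CALIB_SET_PATH=/path/to/cnn_dailymail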
Once the dataset is ready, convert the checkpoint using the NIM container:
docker run -it --rm \
--gpus all \
--shm-size=16GB \
-v $YOUR_CALIB_SET_PATH:$YOUR_CALIB_SET_PATH \
-v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \
-u $(id -u) \
-e HF_HUB_OFFLINE=0 \
$NIM_IMAGE \
python3 /app/tensorrt_llm/examples/quantization/quantize.py \
--model_dir $CUSTOM_WEIGHTS \
--dtype bfloat16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--calib_size 512 \
--output_dir $CUSTOM_WEIGHTS/trtllm_ckpt/ \
--device cuda \
--calib_dataset $YOUR_CALIB_SET_PATH
After conversion, the checkpoint directory is populated with a trtllm_ckpt sub-directory containing the quantized checkpoint. You can then use this converted checkpoint as described in the previous sections for non-converted checkpoints.
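After relaunching the container with the converted checkpoint (the same command as in the Usage section, with NIM_FT_MODEL still pointing at the checkpoint root), you can confirm the served model name. This sketch assumes the OpenAI-compatible /v1/models endpoint exposed by NIM and a server on localhost:8000.
curl -s http://localhost:8000/v1/models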