Fine-Tuned Model Support in NVIDIA NIM for LLMs#
You can deploy custom, fine-tuned models with NIM. Given weights in the Hugging Face format, NIM automatically builds an optimized TensorRT-LLM engine locally.
Note
This document explains how to deploy fine-tuned models in LLM-specific NIMs. For the multi-LLM compatible NIM container, you can deploy a fine-tuned model (in any supported model format) by setting NIM_MODEL_NAME to the local folder path, as shown in the sketch below. For detailed instructions, refer to Launch NVIDIA NIM for LLMs (Option 3).
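A minimal sketch of that multi-LLM workflow (the model path is illustrative; the linked instructions are authoritative):
export NIM_MODEL_NAME=/path/to/customized/model
docker run -it --rm --gpus all \
  -e NIM_MODEL_NAME=$NIM_MODEL_NAME \
  -v $NIM_MODEL_NAME:$NIM_MODEL_NAME \
  $NIM_IMAGE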
Usage#
You can deploy the non-optimized model as described in Serving models from local assets.
Launch the NIM container:
export CUSTOM_WEIGHTS=/path/to/customized/llama
# NIM_CUSTOM_MODEL_NAME caches the locally built engine for faster subsequent runs.
docker run -it --rm --name=llama3-8b-instruct \
  --gpus all \
  -e NIM_FT_MODEL=$CUSTOM_WEIGHTS \
  -e NIM_SERVED_MODEL_NAME="llama3.1-8b-my-domain" \
  -e NIM_CUSTOM_MODEL_NAME=custom_1 \
  -v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \
  -u $(id -u) \
  $NIM_IMAGE
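Once the server reports it is ready, you can verify the deployment through the OpenAI-compatible endpoint. A minimal sketch, assuming you also published the service port (for example, with -p 8000:8000):
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1-8b-my-domain",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'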
NIM_FT_MODEL must be a path to a directory containing a Hugging Face checkpoint or a TRTLLM checkpoint. TRTLLM checkpoints are served with the TRTLLM backend.
A Hugging Face checkpoint should have the following directory structure:
├── config.json
├── generation_config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── ...
├── model.safetensors.index.json
├── runtime_params.json
├── special_tokens_map.json
├── tokenizer.json
└── tokenizer_config.json
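Before launching, you can sanity-check that the checkpoint directory contains the expected files. A minimal sketch (the file names follow the listing above):
# Check for the configuration, tokenizer, and safetensors weights.
for f in config.json tokenizer_config.json; do
  test -f "$CUSTOM_WEIGHTS/$f" || echo "missing: $f"
done
ls "$CUSTOM_WEIGHTS"/*.safetensors > /dev/null 2>&1 || echo "missing: safetensors weights"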
TRTLLM checkpoints should have the following directory structure:
/path/to/customized/llama/
├── config.json
├── generation_config.json
├── model.safetensors.index.json
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer_config.json
├── ...
└── trtllm_ckpt
    ├── config.json
    ├── rank0.safetensors
    └── ...
Note: The TRTLLM checkpoint root directory should include the Hugging Face tokenizer and configuration files. The trtllm_ckpt subfolder should include the TRTLLM checkpoint configuration file and weight tensors.
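For example, you can assemble this layout from an existing Hugging Face checkpoint and a converted TRTLLM checkpoint. A minimal sketch (the source paths are hypothetical):
# Hypothetical source locations for the tokenizer files and the converted checkpoint.
HF_SRC=/path/to/hf/checkpoint
TRTLLM_SRC=/path/to/converted/trtllm/checkpoint
mkdir -p $CUSTOM_WEIGHTS/trtllm_ckpt
# Hugging Face tokenizer and configuration files go in the root directory.
cp $HF_SRC/config.json $HF_SRC/tokenizer*.json $HF_SRC/special_tokens_map.json $CUSTOM_WEIGHTS/
# TRTLLM checkpoint configuration and weight tensors go in the trtllm_ckpt subfolder.
cp $TRTLLM_SRC/config.json $TRTLLM_SRC/rank*.safetensors $CUSTOM_WEIGHTS/trtllm_ckpt/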
Optimal Build Configuration#
To build an optimal TRTLLM engine, you can provide a complete engine configuration from a previously built engine. NIM uses build options from this configuration file instead of the model and runtime defaults. The configuration file should include the following fields:
build_config
pretrained_config
version
You can structure the NIM_FT_MODEL path to specify an optimal engine configuration file. The order of precedence is:
trtllm_engine/config.json
trtllm_ckpt/config.json
Note: TRTLLM checkpoints constructed using TRTLLM conversion scripts typically result in a partial engine configuration that contains only the pretrained configuration options. You can provide either a partial or a complete engine configuration. If you provide a partial engine configuration, NIM uses model and runtime defaults for the build options.
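For example, to reuse the build options from a previously built engine, copy its complete configuration into the trtllm_engine subfolder of the NIM_FT_MODEL path. A minimal sketch (the source path is hypothetical):
# Reuse the complete engine configuration (build_config, pretrained_config, version)
# from a previously built engine.
mkdir -p $CUSTOM_WEIGHTS/trtllm_engine
cp /path/to/prebuilt/engine/config.json $CUSTOM_WEIGHTS/trtllm_engine/config.json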
Optimal Runtime Configuration#
You can optionally provide a runtime_params.json file with the following TRTLLM runtime configuration keys for a Hugging Face checkpoint or a TRTLLM checkpoint. You can structure the NIM_FT_MODEL path to specify an optimal runtime configuration file. The order of precedence is:
runtime_params.json
trtllm_ckpt/runtime_params.json
trtllm_engine/runtime_params.json
The following runtime overrides are supported. If not specified, NIM picks appropriate defaults.
medusa_choices: List[List[int]] = None
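For example, for a checkpoint with Medusa heads you could write a runtime_params.json like the following. A minimal sketch (the medusa_choices values are illustrative, not tuned):
# medusa_choices only applies to models with Medusa speculative-decoding heads.
cat > $CUSTOM_WEIGHTS/runtime_params.json <<'EOF'
{
  "medusa_choices": [[0], [0, 0], [1], [0, 1], [2]]
}
EOF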
You can also select an alternative profile using the output of the list-model-profiles command, which lists the profiles available within the container.
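For example, a minimal sketch that reuses the image and weights from the launch command above:
docker run --rm --gpus all \
  -e NIM_FT_MODEL=$CUSTOM_WEIGHTS \
  -v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \
  $NIM_IMAGE list-model-profiles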
This command produces output similar to the following.
SYSTEM INFO
- Free GPUs:
- [26b3:10de] (0) NVIDIA RTX 5880 Ada Generation (RTX A6000 Ada) [current utilization: 0%]
- [26b3:10de] (1) NVIDIA RTX 5880 Ada Generation (RTX A6000 Ada) [current utilization: 0%]
MODEL PROFILES
- Compatible with system and runnable:
- 771c17ba45c566b400c5823af6188d479e3703e5b25f56260713afcc377bcfa5 (custom_1)
- 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
- 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
- With LoRA support:
- c5ffce8f82de1ce607df62a4b983e29347908fb9274a0b7a24537d6ff8390eb9 (vllm-fp16-tp2-lora)
- 8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740 (vllm-fp16-tp1-lora)
- Compilable to TRT-LLM using just-in-time compilation of HF models to TRTLLM engines:
- 375dc0ff86133c2a423fbe9ef46d8fdf12d6403b3caa3b8e70d7851a89fc90dd (tensorrt_llm-trtllm_buildable-bf16-tp2)
- 54946b08b79ecf9e7f2d5c000234bf2cce19c8fee21b243c1a084b03897e8c95 (tensorrt_llm-trtllm_buildable-bf16-tp1)
- With LoRA support:
- 7b8458eb682edb0d2a48b4019b098ba0bfbc4377aadeeaa11b346c63c7adf724 (tensorrt_llm-trtllm_buildable-bf16-tp2-lora)
- 00172c81416075181f203532da34b88e371b8081d2ad801d9d30110ea88cbf95 (tensorrt_llm-trtllm_buildable-bf16-tp1-lora)
- Incompatible with system:
- dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency)
- f59d52b0715ee1ecf01e6759dea23655b93ed26b12e57126d9ec43b397ea2b87 (tensorrt_llm-l40s-fp8-tp2-latency)
- 30b562864b5b1e3b236f7b6d6a0998efbed491e4917323d04590f715aa9897dc (tensorrt_llm-h100-fp8-tp1-throughput)
- 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b (tensorrt_llm-l40s-fp8-tp1-throughput)
Select a compatible tensorrt_llm or tensorrt_llm-trtllm_buildable profile. Then run the previous command with the additional option -e NIM_MODEL_PROFILE=profile_name, where profile_name is the name of a profile.
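For example, to build with the tp1 buildable profile from the listing above (a minimal sketch, reusing the earlier launch command):
docker run -it --rm --name=llama3-8b-instruct \
  --gpus all \
  -e NIM_FT_MODEL=$CUSTOM_WEIGHTS \
  -e NIM_SERVED_MODEL_NAME="llama3.1-8b-my-domain" \
  -e NIM_MODEL_PROFILE=tensorrt_llm-trtllm_buildable-bf16-tp1 \
  -v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \
  -u $(id -u) \
  $NIM_IMAGE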
Setting NIM_CUSTOM_MODEL_NAME caches the locally built engine under that name before serving it. An engine cached through the NIM_CUSTOM_MODEL_NAME environment variable takes precedence over all other profiles (for example, custom_1 above). If you have multiple locally cached, locally built engines, NIM uses its internal selection logic to choose among them. To force the LLM NIM to serve a specific cached engine, set -e NIM_MODEL_PROFILE=custom_model_name, where custom_model_name is the name you assigned with NIM_CUSTOM_MODEL_NAME. NIM then serves that cached engine like any other profile.
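For example, to force serving the engine cached as custom_1 in the launch command above (a minimal sketch; it assumes the engine was cached to a persistent NIM cache between runs):
docker run -it --rm --name=llama3-8b-instruct \
  --gpus all \
  -e NIM_FT_MODEL=$CUSTOM_WEIGHTS \
  -e NIM_MODEL_PROFILE=custom_1 \
  -v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \
  -u $(id -u) \
  $NIM_IMAGE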