You can deploy custom, fine-tuned models on NIM. Given weights in the Hugging Face format, NIM automatically builds an optimized TensorRT-LLM (TRT-LLM) engine locally.

The default PyTorch TRT-LLM backend does not support fine-tuned models. If you need to deploy a fine-tuned model, enable the legacy TRT-LLM backend.

Note: These instructions apply to LLM-specific NIMs. For multi-LLM NIM deployments, you can deploy a fine-tuned model in any supported model format by setting NIM_MODEL_NAME to the local folder path. For detailed instructions, refer to Launch NVIDIA NIM for LLMs (Option 3).

You can deploy the non-optimized model as described in serving models from local assets.

Launch the NIM container:

    export CUSTOM_WEIGHTS=/path/to/customized/llama
    # NIM_CUSTOM_MODEL_NAME caches the model for faster subsequent runs.
    docker run -it --rm --name=llama3-8b-instruct \
      --gpus all \
      -e NIM_FT_MODEL=$CUSTOM_WEIGHTS \
      -e NIM_SERVED_MODEL_NAME="llama3.1-8b-my-domain" \
      -e NIM_CUSTOM_MODEL_NAME=custom_1 \
      -e NIM_USE_TRTLLM_LEGACY_BACKEND=1 \
      -e NIM_DISABLE_TRTLLM_PYTORCH_RT=1 \
      -v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \
      -u $(id -u) \
      $NIM_IMAGE

NIM_FT_MODEL must be a path to a directory containing a Hugging Face checkpoint or a TRT-LLM checkpoint. TRT-LLM checkpoints are served with the TRT-LLM backend.

Hugging Face checkpoints should have the following directory structure:

    ├── config.json
    ├── generation_config.json
    ├── model-00001-of-00004.safetensors
    ├── model-00002-of-00004.safetensors
    ├── model-00003-of-00004.safetensors
    ├── model-00004-of-00004.safetensors
    ├── ...
    ├── model.safetensors.index.json
    ├── runtime_params.json
    ├── special_tokens_map.json
    ├── tokenizer.json
    └── tokenizer_config.json
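
Before launching the container, it can help to confirm that the directory resembles the layout above. The following is a minimal sketch, not part of NIM: check_hf_checkpoint is a hypothetical helper, and the set of files it treats as required (config.json, tokenizer_config.json, and at least one .safetensors file) is an assumption chosen for illustration.

```shell
# Hypothetical pre-flight check; not part of NIM.
# Assumption: config.json, tokenizer_config.json, and at least one
# *.safetensors file are the minimum worth checking for.
check_hf_checkpoint() {
  local dir="$1" missing=0 found=0 f

  # Report each expected metadata file that is absent.
  for f in config.json tokenizer_config.json; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done

  # Look for at least one weight shard in the directory root.
  for f in "$dir"/*.safetensors; do
    if [ -f "$f" ]; then
      found=1
      break
    fi
  done
  if [ "$found" -eq 0 ]; then
    echo "missing: no *.safetensors files in $dir" >&2
    missing=1
  fi

  # Returns 0 when nothing was missing, 1 otherwise.
  return "$missing"
}
```

Calling check_hf_checkpoint "$CUSTOM_WEIGHTS" returns a non-zero status and lists what is missing on standard error, so it can guard the docker run command in a script.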

TRT-LLM checkpoints should have the following directory structure:

    /path/to/customized/llama/
    ├── config.json
    ├── generation_config.json
    ├── model.safetensors.index.json
    ├── special_tokens_map.json
    ├── tokenizer.json
    ├── tokenizer_config.json
    ├── ...
    └── trtllm_ckpt
        ├── config.json
        ├── rank0.safetensors
        └── ...

Note: The TRT-LLM checkpoint root directory should include the Hugging Face tokenizer and configuration files. The trtllm_ckpt subfolder should include the TRT-LLM checkpoint configuration file and weight tensors.
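
To tell the two layouts apart at a glance, a small script can look for the distinguishing files. This is a rough sketch under stated assumptions: guess_checkpoint_layout is a hypothetical helper, it is not how NIM itself decides which backend to use, and it only inspects file names rather than file contents.

```shell
# Hypothetical helper; not part of NIM, and not NIM's own detection logic.
# Prints "trtllm", "hf", or "unknown" depending on which documented layout
# the directory resembles.
guess_checkpoint_layout() {
  local root="$1" f

  # TRT-LLM layout: trtllm_ckpt/ holds config.json and rank*.safetensors.
  if [ -f "$root/trtllm_ckpt/config.json" ]; then
    for f in "$root"/trtllm_ckpt/rank*.safetensors; do
      if [ -f "$f" ]; then
        echo "trtllm"
        return 0
      fi
    done
  fi

  # Hugging Face layout: config.json and weight shards in the root.
  if [ -f "$root/config.json" ]; then
    for f in "$root"/*.safetensors; do
      if [ -f "$f" ]; then
        echo "hf"
        return 0
      fi
    done
  fi

  echo "unknown"
  return 1
}
```

For example, guess_checkpoint_layout "$CUSTOM_WEIGHTS" prints trtllm for a directory that follows the TRT-LLM structure shown above.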

Note: The backend_optimization/ namespace can be used to store optimized runtime configuration files (for example, runtime_params.json) for backends that do not rely on pre-built engines. This avoids name conflicts with engine-based layouts and enables optimized runtime behavior across backends. Environment variables take precedence over keys in runtime_params.json when both are set.

Optimal Build Configuration

To build an optimal TRT-LLM engine, you can provide a complete engine configuration from a previously built engine. NIM uses build options from this configuration file instead of the model and runtime defaults. The configuration file should include the following fields:

- build_config
- pretrained_config
- version

You can structure the NIM_FT_MODEL path to specify an optimal engine configuration file. The order of precedence is:

1. trtllm_engine/config.json
2. trtllm_ckpt/config.json

Note: Typically, TRT-LLM checkpoints constructed using TRT-LLM conversion scripts result in a partial engine configuration with only the pretrained configuration options. You can provide a partial engine configuration or a complete engine configuration. If you provide a partial engine configuration, NIM uses model and runtime defaults for the build options.
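
The precedence order and the complete-versus-partial distinction can be illustrated with a short script. This is only a sketch: find_engine_config and is_complete_engine_config are hypothetical helpers, not NIM commands, and the completeness check assumes that a plain-text search for the three quoted field names is sufficient for a quick sanity check (it does not parse JSON).

```shell
# Hypothetical helpers; not part of NIM.

# Prints the first engine configuration file found under the given root,
# following the documented precedence: trtllm_engine/config.json first,
# then trtllm_ckpt/config.json. Returns 1 if neither exists.
find_engine_config() {
  local root="$1" candidate
  for candidate in trtllm_engine/config.json trtllm_ckpt/config.json; do
    if [ -f "$root/$candidate" ]; then
      echo "$root/$candidate"
      return 0
    fi
  done
  return 1
}

# Rough completeness check. Assumption: searching for the quoted key names
# as plain text is good enough here; this does not validate the JSON.
is_complete_engine_config() {
  local file="$1" key
  for key in build_config pretrained_config version; do
    grep -q "\"$key\"" "$file" || return 1
  done
  return 0
}
```

Running find_engine_config "$CUSTOM_WEIGHTS" and passing the result to is_complete_engine_config gives a quick indication of whether NIM would fall back to model and runtime defaults for the build options, as described in the note above.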