Large Language Models (Latest)

Fine-tuned model support

You can easily deploy custom, fine-tuned models with NIM. Given model weights in HuggingFace or NeMo format, NIM automatically builds an optimized TensorRT-LLM engine locally.

Alternatively, you can deploy the non-optimized model as described in Serving models from local assets.

  1. Launch the NIM container


    export CUSTOM_WEIGHTS=/path/to/customized/llama
    # NIM_CUSTOM_MODEL_NAME caches the locally built engine for faster subsequent runs
    docker run -it --rm --name=llama3-8b-instruct \
        --gpus all \
        -e NIM_FT_MODEL=$CUSTOM_WEIGHTS \
        -e NIM_SERVED_MODEL_NAME="llama3.1-8b-my-domain" \
        -e NIM_CUSTOM_MODEL_NAME=custom_1 \
        -v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \
        -u $(id -u) \
        $NIM_IMAGE

You can also select an alternative profile by using the output of the list-model-profiles command, which lists the profiles available within the container.
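
The list-model-profiles utility runs inside the NIM container. A minimal sketch of the invocation, assuming the same $NIM_IMAGE and weights mount as in step 1, looks like this:

    docker run -it --rm --gpus all \
        -e NIM_FT_MODEL=$CUSTOM_WEIGHTS \
        -v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \
        -u $(id -u) \
        $NIM_IMAGE list-model-profiles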

This command should produce output similar to the following.


    SYSTEM INFO
    - Free GPUs:
      - [26b3:10de] (0) NVIDIA RTX 5880 Ada Generation (RTX A6000 Ada) [current utilization: 0%]
      - [26b3:10de] (1) NVIDIA RTX 5880 Ada Generation (RTX A6000 Ada) [current utilization: 0%]
    MODEL PROFILES
    - Compatible with system and runnable:
      - 771c17ba45c566b400c5823af6188d479e3703e5b25f56260713afcc377bcfa5 (custom_1)
      - 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
      - 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
      - With LoRA support:
        - c5ffce8f82de1ce607df62a4b983e29347908fb9274a0b7a24537d6ff8390eb9 (vllm-fp16-tp2-lora)
        - 8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740 (vllm-fp16-tp1-lora)
    - Compilable to TRT-LLM using just-in-time compilation of HF models to TRTLLM engines:
      - 375dc0ff86133c2a423fbe9ef46d8fdf12d6403b3caa3b8e70d7851a89fc90dd (tensorrt_llm-trtllm_buildable-bf16-tp2)
      - 54946b08b79ecf9e7f2d5c000234bf2cce19c8fee21b243c1a084b03897e8c95 (tensorrt_llm-trtllm_buildable-bf16-tp1)
      - With LoRA support:
        - 7b8458eb682edb0d2a48b4019b098ba0bfbc4377aadeeaa11b346c63c7adf724 (tensorrt_llm-trtllm_buildable-bf16-tp2-lora)
        - 00172c81416075181f203532da34b88e371b8081d2ad801d9d30110ea88cbf95 (tensorrt_llm-trtllm_buildable-bf16-tp1-lora)
    - Incompatible with system:
      - dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency)
      - f59d52b0715ee1ecf01e6759dea23655b93ed26b12e57126d9ec43b397ea2b87 (tensorrt_llm-l40s-fp8-tp2-latency)
      - 30b562864b5b1e3b236f7b6d6a0998efbed491e4917323d04590f715aa9897dc (tensorrt_llm-h100-fp8-tp1-throughput)
      - 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b (tensorrt_llm-l40s-fp8-tp1-throughput)

Select a compatible tensorrt_llm or tensorrt_llm-trtllm_buildable profile, then rerun the previous command with the additional option -e NIM_MODEL_PROFILE=<profile_name>.
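
For example, to request the TP1 just-in-time TensorRT-LLM build from the listing above, you could relaunch with that profile name. This is a sketch; take the profile name from your own list-model-profiles output, since it differs per system:

    docker run -it --rm --name=llama3-8b-instruct \
        --gpus all \
        -e NIM_FT_MODEL=$CUSTOM_WEIGHTS \
        -e NIM_SERVED_MODEL_NAME="llama3.1-8b-my-domain" \
        -e NIM_CUSTOM_MODEL_NAME=custom_1 \
        -e NIM_MODEL_PROFILE=tensorrt_llm-trtllm_buildable-bf16-tp1 \
        -v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \
        -u $(id -u) \
        $NIM_IMAGE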

Setting NIM_CUSTOM_MODEL_NAME caches the locally built engine under that name before serving it. A profile cached this way takes precedence over all other profiles, for example 771c17ba45c566b400c5823af6188d479e3703e5b25f56260713afcc377bcfa5 (custom_1) in the listing above. If you have multiple locally built engines cached, NIM uses its internal selection logic to choose the best profile. To force NIM to serve a specific cached engine, set -e NIM_MODEL_PROFILE=<custom_model_name>; that cached engine is then served like any other profile.
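
For instance, to force the engine cached as custom_1 in the example above to be served, pass its name as the profile. This is a sketch and assumes the engine cache written by the earlier run is still available to the container:

    docker run -it --rm --name=llama3-8b-instruct \
        --gpus all \
        -e NIM_FT_MODEL=$CUSTOM_WEIGHTS \
        -e NIM_SERVED_MODEL_NAME="llama3.1-8b-my-domain" \
        -e NIM_MODEL_PROFILE=custom_1 \
        -v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \
        -u $(id -u) \
        $NIM_IMAGE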
