Fine-tuned model support
The nim-optimize command enables you to deploy custom weights for a fine-tuned version of an LLM in an optimized configuration by using a pre-defined optimized profile. Note that there may be a small performance degradation compared to an optimized engine built specifically for those weights.
Usage
The first step is to fine-tune your model using the framework of your choice and then convert the tuned model to a Hugging Face model. The examples in this section use /llama3.1-8b-my-domain as the path of the fine-tuned model.
You can deploy the non-optimized model as described in Serving models from local assets.
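For example, a minimal launch of the non-optimized fine-tuned weights might look like the following sketch. The paths and image tag are assumptions for illustration, and the weights directory is expected to be a standard Hugging Face checkpoint (config.json, tokenizer files, and weights).
export CUSTOM_WEIGHTS=/llama3.1-8b-my-domain
export NIM_IMAGE=nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
export NGC_API_KEY=key
docker run -it --rm --name=llama3-8b-instruct \
-e NGC_API_KEY \
-e NIM_MODEL_NAME=/custom_weights \
-e NIM_SERVED_MODEL_NAME="llama3.1-8b-my-domain" \
-v $CUSTOM_WEIGHTS:/custom_weights \
-u $(id -u) \
$NIM_IMAGE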
To use nim-optimize to create an optimized engine for your fine-tuned model, perform the following steps.
Launch the NIM container:
export CUSTOM_WEIGHTS=/llama3.1-8b-my-domain
export OPTIMIZED_ENGINE=/llama3.1-8b-my-domain-optimized
export NIM_CACHE_PATH=/path/to/cache
export NIM_IMAGE=nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
export NGC_API_KEY=key
docker run -it --rm --name=llama3-8b-instruct \
-e LOG_LEVEL=$LOG_LEVEL \
-e NGC_API_KEY \
-v $NIM_CACHE_PATH:/opt/nim/.cache \
-v $CUSTOM_WEIGHTS:/custom_weights \
-v $OPTIMIZED_ENGINE:/optimized_engine \
-u $(id -u) \
$NIM_IMAGE \
bash -i
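Optionally, before running the optimizer, confirm that the custom weights are visible inside the container. This is only a sanity check; the exact file names depend on how you exported your model.
# From within the container; expect Hugging Face checkpoint files
# such as config.json, tokenizer files, and *.safetensors
ls /custom_weights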
From within the container, run nim-optimize. The command chooses the optimal profile depending on the GPUs provided.
nim-optimize \
--model_dir /custom_weights \
--output_engine /optimized_engine \
--builder_type llama
To choose a LoRA-enabled profile for fine-tuning, add the --lora argument, as shown in the following example.
nim-optimize \
--model_dir /custom_weights \
--lora \
--output_engine /optimized_engine \
--builder_type llama
You can also select the profile you want by using the list-model-profiles
command, which lists the profiles available within the container.
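For example, from within the container:
list-model-profiles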
This command should produce output similar to the following.
SYSTEM INFO
- Free GPUs:
- [26b3:10de] (0) NVIDIA RTX 5880 Ada Generation (RTX A6000 Ada) [current utilization: 0%]
- [26b3:10de] (1) NVIDIA RTX 5880 Ada Generation (RTX A6000 Ada) [current utilization: 0%]
MODEL PROFILES
- Compatible with system and runnable: <None>
- Incompatible with system:
- b80e254301eff63d87b9aa13953485090e3154ca03d75ec8eff19b224918c2b5 (tensorrt_llm-h100-fp8-tp8-latency)
- 8860fe7519bece6fdcb642b907e07954a0b896dbb1b77e1248a873d8a1287971 (tensorrt_llm-h100-fp8-tp8-throughput)
- f8bf5df73b131c5a64c65a0671dab6cf987836eb58eb69f2a877c4a459fd2e34 (tensorrt_llm-a100-fp16-tp8-latency)
- b02b0fe7ec18cb1af9a80b46650cf6e3195b2efa4c07a521e9a90053c4292407 (tensorrt_llm-h100-fp16-tp8-latency)
- ab02b0fe7ec18cb1af9a80b46650cf6e3195b2efa4c07a521e9a90053c4292407 (tensorrt_llm-h100-fp16-tp8-latency-stripped)
- 80a4300d085f99ec20c1e75fd475f202ed8dc639a37f6b214a43c7640474380c (vllm-fp16-tp16)
- f8d065d29a1c73791ea3f06197fbea054c61c19cd3266935dba102c1eb62909c (vllm-fp16-tp8)
- c675aba61e4ca2d3e709a20ea291d241292c49e83e9bf537a24f02dd1ad9ce9e (vllm-fp16-tp16-lora)
- 80d48a8a3fad3d9fa9fcb0d71a7ce040d34e0f03ce019b27f620d65ddb1f7e7f (vllm-fp16-tp8-lora)
Select a profile with a stripped suffix, which in this case is the fifth profile in the list. Then run the following command to manually select the stripped engine.
nim-optimize \
--model_dir /custom_weights \
--input_profile ab02b0fe7ec18cb1af9a80b46650cf6e3195b2efa4c07a521e9a90053c4292407 \
--output_engine /optimized_engine \
--builder_type llama
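When the build finishes, exit the build container. The engine artifacts remain on the host in the directory you exported as OPTIMIZED_ENGINE; the exact file names vary by profile.
exit
ls $OPTIMIZED_ENGINE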
Finally, run the optimized engine as before by setting the NIM_MODEL_NAME environment variable to /llama3.1-8b-my-domain-optimized, the value of the OPTIMIZED_ENGINE variable that you set when you launched the container, and passing that environment variable to the docker run command, as shown in the following example.
MODEL_REPO=/llama3.1-8b-my-domain-optimized
export NIM_IMAGE=nvcr.io/nim/meta/llama3-8b-instruct:latest
docker run -it --rm --name=llama3-8b-instruct \
-e LOG_LEVEL=$LOG_LEVEL \
-e NIM_MODEL_NAME=$MODEL_REPO \
-e NIM_SERVED_MODEL_NAME="llama3.1-8b-my-domain" \
-v $MODEL_REPO:$MODEL_REPO \
-u $(id -u) \
$NIM_IMAGE
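To verify the deployment, you can query the OpenAI-compatible chat completions endpoint with the served model name. This assumes the container's port 8000 is also published on the host, for example by adding -p 8000:8000 to the docker run command.
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "llama3.1-8b-my-domain",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 32
}'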