Fine-tuned model support#

The nim-optimize command lets you combine custom weights with a pre-defined optimized profile, so that fine-tuned versions of an LLM can be deployed in optimized configurations. Note that there may be a small performance degradation compared to an optimized engine built specifically for those weights.

Usage#

The first step is to tune your model using the framework of your choice and then convert the tuned model to a model in Hugging Face format. The examples in this section use /llama3.1-8b-my-domain as the path to the fine-tuned model.
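
After conversion, the model directory should contain a standard Hugging Face checkpoint. As a quick sanity check, list the directory contents (the exact file names vary by model and serialization format):

ls /llama3.1-8b-my-domain
# Expect a Hugging Face layout: config.json, tokenizer files,
# and weight shards such as model-*.safetensors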

You can deploy the non-optimized model as described in Serving models from local assets.
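
For reference, the following is a minimal sketch of such a non-optimized deployment, assuming the same container image and environment variables that are defined in step 1 below; see the linked section for the full procedure.

docker run -it --rm --gpus all --name=llama3-8b-instruct \
   -p 8000:8000 \
   -e NGC_API_KEY \
   -e NIM_MODEL_NAME=/custom_weights \
   -v $CUSTOM_WEIGHTS:/custom_weights \
   -u $(id -u) \
   $NIM_IMAGE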

To create an optimized engine for your fine-tuned model with nim-optimize, perform the following steps.

  1. Launch the NIM container.

    # Paths to the fine-tuned weights and the directory for the optimized engine
    export CUSTOM_WEIGHTS=/llama3.1-8b-my-domain
    export OPTIMIZED_ENGINE=/llama3.1-8b-my-domain-optimized
    export NIM_CACHE_PATH=/path/to/cache
    export NIM_IMAGE=nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
    export NGC_API_KEY=<your NGC API key>
    export LOG_LEVEL=INFO
    
    docker run -it --rm --name=llama3-8b-instruct \
       --gpus all \
       -e LOG_LEVEL=$LOG_LEVEL \
       -e NGC_API_KEY \
       -v $NIM_CACHE_PATH:/opt/nim/.cache \
       -v $CUSTOM_WEIGHTS:/custom_weights \
       -v $OPTIMIZED_ENGINE:/optimized_engine \
       -u $(id -u) \
       $NIM_IMAGE \
       bash -i
    
  2. From within the container, run nim-optimize. The command chooses the optimal profile based on the GPUs that are available.

    nim-optimize \
       --model_dir /custom_weights \
       --output_engine /optimized_engine \
       --builder_type llama
    

To choose a LoRA-enabled profile for your fine-tuned model, add the --lora argument, as shown in the following example.

nim-optimize \
  --model_dir /custom_weights \
  --lora \
  --output_engine /optimized_engine \
  --builder_type llama

You can also select the profile you want by using the list-model-profiles command, which lists the profiles available within the container.
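
Run the command from within the container:

list-model-profiles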

This command should produce output similar to the following.

SYSTEM INFO
- Free GPUs:
  -  [26b3:10de] (0) NVIDIA RTX 5880 Ada Generation (RTX A6000 Ada) [current utilization: 0%]
  -  [26b3:10de] (1) NVIDIA RTX 5880 Ada Generation (RTX A6000 Ada) [current utilization: 0%]
MODEL PROFILES
- Compatible with system and runnable: <None>
- Incompatible with system:
  - b80e254301eff63d87b9aa13953485090e3154ca03d75ec8eff19b224918c2b5 (tensorrt_llm-h100-fp8-tp8-latency)
  - 8860fe7519bece6fdcb642b907e07954a0b896dbb1b77e1248a873d8a1287971 (tensorrt_llm-h100-fp8-tp8-throughput)
  - f8bf5df73b131c5a64c65a0671dab6cf987836eb58eb69f2a877c4a459fd2e34 (tensorrt_llm-a100-fp16-tp8-latency)
  - b02b0fe7ec18cb1af9a80b46650cf6e3195b2efa4c07a521e9a90053c4292407 (tensorrt_llm-h100-fp16-tp8-latency)
  - ab02b0fe7ec18cb1af9a80b46650cf6e3195b2efa4c07a521e9a90053c4292407 (tensorrt_llm-h100-fp16-tp8-latency-stripped)
  - 80a4300d085f99ec20c1e75fd475f202ed8dc639a37f6b214a43c7640474380c (vllm-fp16-tp16)
  - f8d065d29a1c73791ea3f06197fbea054c61c19cd3266935dba102c1eb62909c (vllm-fp16-tp8)
  - c675aba61e4ca2d3e709a20ea291d241292c49e83e9bf537a24f02dd1ad9ce9e (vllm-fp16-tp16-lora)
  - 80d48a8a3fad3d9fa9fcb0d71a7ce040d34e0f03ce019b27f620d65ddb1f7e7f (vllm-fp16-tp8-lora)

Select a profile with the stripped suffix; in this case, it is the fifth profile in the list. Then run the following command to select the stripped engine manually.

nim-optimize \
  --model_dir /custom_weights \
  --input_profile ab02b0fe7ec18cb1af9a80b46650cf6e3195b2efa4c07a521e9a90053c4292407 \
  --output_engine /optimized_engine \
  --builder_type llama

Finally, run the optimized engine as before by setting the NIM_MODEL_NAME environment variable to /llama3.1-8b-my-domain-optimized (the value of the OPTIMIZED_ENGINE variable when you launched the container) and passing it to the docker run command, as shown in the following example.

export MODEL_REPO=/llama3.1-8b-my-domain-optimized
export NIM_IMAGE=nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
export LOG_LEVEL=INFO
docker run -it --rm --name=llama3-8b-instruct \
    --gpus all \
    -p 8000:8000 \
    -e LOG_LEVEL=$LOG_LEVEL \
    -e NIM_MODEL_NAME=$MODEL_REPO \
    -e NIM_SERVED_MODEL_NAME="llama3.1-8b-my-domain" \
    -v $MODEL_REPO:$MODEL_REPO \
    -u $(id -u) \
    $NIM_IMAGE
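
Once the server reports that it is ready, you can verify the deployment by sending a request to the OpenAI-compatible endpoint. The following is a minimal example, assuming the default port 8000 is published as shown above; the model value matches NIM_SERVED_MODEL_NAME.

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1-8b-my-domain",
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "max_tokens": 64
      }'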