Caching Locally Built LLM NIM Engines#

About Buildable Profiles#

NVIDIA NIM microservices typically use model engines that are tuned for specific NVIDIA GPU configurations. NVIDIA produces model engines for several popular combinations, and these are referred to as model profiles.

Buildable profiles enable compilation on the fly and support broader compatibility with different GPU and model combinations by allowing you to build the profile on your cluster. They offer greater flexibility for deploying NIM microservices on a wider range of NVIDIA GPUs, especially when pre-optimized profiles for your specific GPU configuration are not available.

To view all of the profiles for a given NIM, run the list-model-profiles utility. It lists all of the model profiles that are compatible with the system and are runnable.

For example, run the following commands after you replace <HF/NGC_or_local_path> with your model path. The commands assume that NGC_API_KEY, HF_TOKEN, and IMG_NAME (the NIM container image reference) are already set in your environment:

$ export NIM_MODEL_NAME=<HF/NGC_or_local_path>
$ docker run -it --rm --gpus=all -e NGC_API_KEY=$NGC_API_KEY -e HF_TOKEN=$HF_TOKEN -e NIM_MODEL_NAME=$NIM_MODEL_NAME $IMG_NAME list-model-profiles
Example output
# Example of list-model-profiles utility output
MODEL PROFILES
- Compatible with system and runnable:
  - e2f00b2cbfb168f907c8d6d4d40406f7261111fbab8b3417a485dcd19d10cc98 (vllm)
  - 668b575f1701fa70a97cfeeae998b5d70b048a9b917682291bb82b67f308f80c (tensorrt_llm)
  - 50e138f94d85b97117e484660d13b6b54234e60c20584b1de6ed55d109ca4f21 (sglang)
  - With LoRA support:
    - 93c5e281d6616f45e2ef801abf4ed82fc65e38ec5f46e0664f340bad4f92d551 (vllm-lora)
    - cdcd22d151713c8b91fcd279a4b5e021153e72ff5cf6ad5498aac96974f5b7d7 (tensorrt_llm-lora)
- Compilable to TRT-LLM using just-in-time compilation of HF models to TRTLLM engines: <None>

Note

Model profiles must be compatible, runnable, and buildable to be built and cached locally. To identify buildable profiles, refer to step 2, Identify Which Profiles Are Buildable Locally, in the example procedure below.

Benefits of Using NIM Build#

NIM Operator supports the NIM Build custom resource, which initiates engine build jobs for buildable profiles. It enables you to pre-build and cache optimized TensorRT-LLM model engines for the GPUs in your cluster before starting a NIM deployment. Pre-building improves startup times and reduces resource usage during NIM deployments and autoscaling, making deployments more predictable.

About the NIM Build Custom Resource#

A NIM Build is a Kubernetes custom resource, nimbuilds.apps.nvidia.com. You create and delete NIM Build resources to manage engine builds for buildable NIM profiles.

Note

There is a one-to-one mapping between a NIM Build custom resource and the profile that it builds an engine from.

Refer to the following table for information about the commonly modified fields:

| Field | Description | Default Value |
| --- | --- | --- |
| spec.annotations | Specifies user-supplied annotations to add to the engine build pod. | None |
| spec.env | Specifies environment variables for the engine build pod. | None |
| spec.image | Specifies the repository, tag, pull policy, and pull secrets for the container image. | None |
| spec.labels | Specifies user-supplied labels to add to the engine build pod. | None |
| spec.modelName | Specifies the name to give the locally built engine. | The name of the NIM Build resource |
| spec.nimCache.name (required) | Specifies the name of the NIM Cache resource. | None |
| spec.nimCache.profile | Specifies the name of the buildable profile to build the engine from. | The single buildable profile in the NIM Cache, when only one is cached |
| spec.nodeSelector | Specifies node selector labels for scheduling the engine build pod. | None |
| spec.resources | Specifies the resource requirements for the engine build pod. | None |
| spec.tolerations | Specifies the tolerations for the engine build pod. | None |
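
The following manifest is a sketch that combines several of these fields in one place; the resource names, node selector label, resource limit, and toleration values are illustrative assumptions rather than requirements:

# Illustrative NIM Build manifest; the values below are example assumptions
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMBuild
metadata:
  name: my-nim-build                    # hypothetical resource name
spec:
  nimCache:
    name: my-nim-cache                  # required: the NIM Cache that holds the buildable profile
    profile: <buildable-profile-name>   # optional when the cache has exactly one buildable profile
  image:
    repository: "nvcr.io/nim/meta/llama-3.2-3b-instruct"
    tag: "1.8.5"
    pullSecrets:
      - ngc-secret
  modelName: my-local-engine            # name given to the locally built engine
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB   # example label; match your GPU nodes
  resources:
    limits:
      nvidia.com/gpu: 1                 # example resource limit for the engine build pod
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule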

Example Procedure#

Summary#

To cache and serve a locally built LLM NIM engine, follow these steps:

  1. Create a NIM Cache that caches buildable model profiles.

  2. Identify which profiles are buildable from NIM Cache.

  3. Create a NIM Build resource to build an engine from your cached profile.

  4. Create a NIM Service that uses the built engine.

Note

You can distinguish built engines from cached profiles by looking for custom: "true" in the profile metadata.

1. Create a NIM Cache to Cache Model Profiles#

Create a NIM Cache to cache multiple buildable profiles.

Note

Refer to Prerequisites for more information on using NIM Cache.

  1. Create a file, such as nimcache.yaml, with contents like the following example:

    # NIM Cache Multiple Profiles
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: meta-llama3-2-3b-instruct
    spec:
      source:
        ngc:
          modelPuller: nvcr.io/nim/meta/llama-3.2-3b-instruct:1.8.5
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model:
            buildable: true # Set to true to filter for and cache all buildable profiles
      storage:
        pvc:
          create: true
          storageClass: "local-path"
          size: "50Gi"
          volumeAccessMode: ReadWriteOnce
    

    Note

    If you are caching only a single profile, you can set spec.source.ngc.model.profiles to the name of that buildable profile instead of using the spec.source.ngc.model.buildable: true filter.
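
    For example, assuming the buildable profile name identified in step 2 of this procedure, the model object in the manifest would look like the following sketch instead:

    # Fragment of spec.source.ngc: cache a single named profile
    model:
      profiles:
        - ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2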

  2. Create the NIM Cache:

    $ kubectl create -f nimcache.yaml -n nim-service
    
    Example output
    nimcache.apps.nvidia.com/meta-llama3-2-3b-instruct created
    
  3. Check the status of the NIM Cache:

    Note

    It can take several minutes to download the profiles into the cache, depending on their size.

    $ kubectl get nimcache -A -o yaml
    
    Example output
    # Example of NIM Cache status output (partial)
    ...
    status:
      conditions:
      ...
      - lastTransitionTime: "2025-07-07T00:00:00Z"
        message: The Job to cache NIM is in progress
        reason: JobRunning
        status: "False"
        type: NIM_CACHE_JOB_PENDING
      state: InProgress
      ...
    

    Note that the in-progress status type is NIM_CACHE_JOB_PENDING.

    You can verify that the pod is running:

    $ kubectl get pods -A | grep -i "instruct-job"
    
    Example output
    # Example of get pods output (partial)
    NAMESPACE     NAME                                  READY   STATUS   
    ...
    nim-service   meta-llama3-2-3b-instruct-job-wz7kw   1/1     Running
    

    You can also verify in the pod log that the profile is being downloaded. Replace meta-llama3-2-3b-instruct-job-wz7kw with the name of your instruct job pod.

    $ kubectl logs meta-llama3-2-3b-instruct-job-wz7kw -n nim-service
    
    Example output
    # Example of pod log output (partial)
    INFO] Fetching contents for profile ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2
    ...
    

    When the NIM Cache is ready, the completed status type is NIM_CACHE_JOB_COMPLETED, and the profile configuration details are listed in the cache.

    $ kubectl get nimcache -A -o yaml
    
    Example output
    # Example of NIM_CACHE_JOB_COMPLETED output (partial)
    ...
    status:
      conditions:
      ...
      - lastTransitionTime: "2025-07-07T00:01:00Z"
        message: The Job to cache NIM has successfully completed
        reason: JobCompleted
        status: "True"
        type: NIM_CACHE_JOB_COMPLETED
      profiles:
      - config:
          feat_lora: "false"
          llm_engine: tensorrt_llm
          pp: "1"
          precision: bf16
          tp: "1"
          trtllm_buildable: "true"  # This indicates the profile is buildable
        name: ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2
      state: Ready
      ...
    

    Note that a profile is buildable if trtllm_buildable: "true" appears in its configuration in the cache status.

2. Identify Which Profiles Are Buildable Locally#

To retrieve the cached profile names from the NIM Cache, view its .status.profiles field. You use one of these names in the NIM Build resource in the next step. Replace meta-llama3-2-3b-instruct with the name of your NIM Cache.

$ kubectl get nimcache meta-llama3-2-3b-instruct -n nim-service -o json | jq '.status.profiles'
Example output
# Example of buildable profiles output
[
  {
    "config": {
      "feat_lora": "true",
      "feat_lora_max_rank": "32",
      "llm_engine": "tensorrt_llm",
      "pp": "1",
      "precision": "bf16",
      "tp": "1",
      "trtllm_buildable": "true"
    },
    "name": "7b8458eb682edb0d2a48b4019b098ba0bfbc4377aadeeaa11b346c63c7adf724"
  },
  {
    "config": {
      "feat_lora": "false",
      "llm_engine": "tensorrt_llm",
      "pp": "1",
      "precision": "bf16",
      "tp": "1",
      "trtllm_buildable": "true"
    },
    "name": "ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2"
  }
]
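
To list only the buildable profile names, you can filter on trtllm_buildable with jq; for example:

$ kubectl get nimcache meta-llama3-2-3b-instruct -n nim-service -o json \
    | jq -r '.status.profiles[] | select(.config.trtllm_buildable == "true") | .name'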

3. Create a NIM Build Resource to Build an Engine From Your Cached Profile#

  1. Create a file, such as nimbuild.yaml, with contents like the following example:

    # NIMBuild Select Profile
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMBuild
    metadata:
      name: meta-llama3-2-3b-instruct
    spec:
      nimCache:
        name: meta-llama3-2-3b-instruct
        profile: ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2
      image:
        repository: "nvcr.io/nim/meta/llama-3.2-3b-instruct"
        tag: "1.8.5"
        pullSecrets:
          - ngc-secret-1
    

    Select one of the cached profile names from the previous step and add it to the NIMBuild spec.nimCache.profile field to select the engine to build.

    Note

    • If the NIM Cache has only one buildable profile, then you do not need to specify a profile name because NIM Build defaults to using that profile.

    • However, if the NIM Cache has multiple buildable profiles, you must explicitly specify a profile name in the NIM Build resource.

  2. Apply the manifest:

    $ kubectl create -f nimbuild.yaml -n nim-service
    
    Example output
    nimbuild.apps.nvidia.com/meta-llama3-2-3b-instruct created
    
  3. Check the status of the NIM Build:

    Note

    It can take several minutes to build the profiles, depending on their size.

    $ kubectl get nimbuild -A
    
    Example output
    NAMESPACE     NAME                        STATUS    AGE
    nim-service   meta-llama3-2-3b-instruct  Pending   7s
    

    You can verify that the pod is running:

    $ kubectl get pods -A | grep -i "engine-build"
    
    Example output
    # Example of get pods output (partial)
    NAMESPACE     NAME                                          READY   STATUS   
    ...
    nim-service   meta-llama3-2-3b-instruct-engine-build-pod   0/1     Running
    

    You can also verify in the pod log that the profile is being built. Replace meta-llama3-2-3b-instruct-engine-build-pod with the name of your engine build pod.

    $ kubectl logs meta-llama3-2-3b-instruct-engine-build-pod -n nim-service
    
    Example output
    # Example of pod log output (partial)
    ...
    INFO] Selected profile: ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2 (tensorrt_llm_buildable-bf16-tpl-ppl)
    INFO] Profile metadata: feat_lora: false
    INFO] Profile metadata: llm_engine: tensorrt_llm
    INFO] Profile metadata: pp: 1
    INFO] Profile metadata: precision: bf16
    INFO] Profile metadata: tp: 1
    INFO] Profile metadata: trtllm_buildable: true
    ...
    

    When the engine build completes, the engine build pod is terminated:

    $ kubectl get pods -A
    
    Example output
    # Example of get pods (partial)
    NAMESPACE     NAME                                          READY   STATUS   
    ...
    nim-service   meta-llama3-2-3b-instruct-engine-build-pod    1/1     Terminating
    

    You can verify the status of the build:

    $ kubectl get nimbuild -A -o yaml
    
    Example output
    # Example of completed build output
    ...
        - lastTransitionTime: "2025-07-23T05:58:40Z"
          message: The Pod to read local model manifest is completed
          reason: PodCompleted
          status: "True"
          type: NIM_BUILD_MODEL_MANIFEST_POD_COMPLETED
        inputProfile:
          config:
            feat_lora: "false"
            llm_engine: tensorrt_llm
            pp: "1"
            precision: bf16
            tp: "1"
            trtllm_buildable: "true"
          name: ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2
        outputProfile:
          config:
            custom: "true"
            feat_lora: "false"
            llm_engine: tensorrt_llm
            model_name: meta-llama3-2-3b-instruct
            pp: "1"
            precision: bf16
            tp: "1"
          name: 1595f282ae143f034a28997246a212c32e077e2c4ca144c27fc1ff834fd9b56c
        state: Ready
    

    Note that the NIM Build object status type is then NIM_BUILD_MODEL_MANIFEST_POD_COMPLETED.
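
    If you script this step, you can block until that condition is reported. The following sketch uses kubectl wait; the timeout value is an assumption:

    $ kubectl wait nimbuild/meta-llama3-2-3b-instruct -n nim-service \
        --for=condition=NIM_BUILD_MODEL_MANIFEST_POD_COMPLETED --timeout=60m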

    You can then get the details of the newly built profile in the NIM Cache:

    $ kubectl get nimcache meta-llama3-2-3b-instruct -n nim-service -o json | jq '.status.profiles'
    
    Example output
    # Example of newly built profile details output
        ...
        [
          {
            "config": {
              "feat_lora": "true",
              "feat_lora_max_rank": "32",
              "llm_engine": "tensorrt_llm",
              "pp": "1",
              "precision": "bf16",
              "tp": "1",
              "trtllm_buildable": "true"
            },
            "name": "7b8458eb682edb0d2a48b4019b098ba0bfbc4377aadeeaa11b346c63c7adf724"
          },
          {
            "config": {
              "feat_lora": "false",
              "llm_engine": "tensorrt_llm",
              "pp": "1",
              "precision": "bf16",
              "tp": "1",
              "trtllm_buildable": "true"
            },
            "name": "ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2" # input profile name
          },
          {
            "config": {
              "custom": "true",
              "feat_lora": "false",
              "input_profile": "ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2", # input profile name
              "llm_engine": "tensorrt_llm",
              "model_name": "meta-llama3-2-3b-instruct",
              "pp": "1",
              "precision": "bf16",
              "tp": "1"
            },
            "name": "1595f282ae143f034a28997246a212c32e077e2c4ca144c27fc1ff834fd9b56c"
          }
        ]
    
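    Because locally built engines are marked with custom: "true" in the profile metadata, you can extract the built engine's profile name directly; for example:

    $ kubectl get nimcache meta-llama3-2-3b-instruct -n nim-service -o json \
        | jq -r '.status.profiles[] | select(.config.custom == "true") | .name'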

4. Create a NIM Service That Uses the Built Engine in NIM Cache#

  1. Create a file, such as nimservice.yaml, with contents like the following example:

    # NIM Service using the locally built engine
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: meta-llama3-2-3b-instruct
    spec:
      source:
        ngc:
          modelPuller: nvcr.io/nim/meta/llama-3.2-3b-instruct:1.8.5
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model:   # Include the model object to describe the LLM-specific model to pull from NGC
            engine: tensorrt_llm
            tensorParallelism: "1"
      storage:
        nimCache:
          name: meta-llama3-2-3b-instruct
          profile: ''   # Leave empty, or set to the built profile name from the NIM Build status
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
    
  2. Apply the manifest:

    $ kubectl create -f nimservice.yaml -n nim-service
    
  3. Wait for the NIM Service to become ready, and then view information about the service:

    $ kubectl describe nimservices.apps.nvidia.com -n nim-service
    
    Example output
    # Example of NIM Service status output (partial)
    ...
    Conditions:
      Last Transition Time:  2024-08-12T19:09:43Z
      Message:               Deployment is ready
      Reason:                Ready
      Status:                True
      Type:                  Ready
    State:                  Ready
    
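    As a quick functional check, you can port-forward the service and send a request to the OpenAI-compatible chat completions endpoint. This sketch assumes that the Operator exposes a service named after the NIM Service resource and that the served model name matches the model_name value in the built profile metadata:

    $ kubectl port-forward -n nim-service service/meta-llama3-2-3b-instruct 8000:8000 &
    $ curl -s http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "meta-llama3-2-3b-instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'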

Troubleshooting#

This section describes common troubleshooting steps for identifying issues with caching locally built LLM NIM engines.

  • Ensure that the NIM Cache referenced by the NIM Build is in the Ready state.

  • Ensure that the NIM Cache referenced by the NIM Build has cached buildable profiles (see the example commands after this list).

  • Ensure that the model puller image is the same in the NIM Cache and the NIM Build.

  • If the cluster contains heterogeneous GPU types, the spec.nodeSelector field in a NIM Build must match that of the corresponding NIM Service. This ensures that both are scheduled onto nodes with the same GPU product type. For example:

    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
    
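For example, you can confirm the cache state and count its buildable profiles with commands like the following; replace the placeholder names with your values:

    $ kubectl get nimcache <NIM-Cache-Name> -n <Namespace> -o jsonpath='{.status.state}'
    $ kubectl get nimcache <NIM-Cache-Name> -n <Namespace> -o json \
        | jq '[.status.profiles[] | select(.config.trtllm_buildable == "true")] | length'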

Logs and Messages to Gather#

  • Logs for the engine build pod can be gathered by running the following command:

    $ kubectl logs <NIM-Build-Name>-engine-build-pod -n nim-service
    
  • Additional information can be found by describing the NIM Build using the following command:

    $ kubectl describe nimbuild <NIM-Build-Name> -n <Namespace>
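
  • Kubernetes events recorded for the NIM Build resource can provide additional context; for example:

    $ kubectl get events -n <Namespace> --field-selector involvedObject.name=<NIM-Build-Name>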