Caching Locally Built LLM NIM Engines#

About Buildable Profiles#

NVIDIA NIM microservices typically use model engines that are tuned for specific NVIDIA GPU configurations. NVIDIA produces model engines for several popular combinations, and these are referred to as model profiles.

Buildable profiles enable compilation on the fly and support broader compatibility with different GPU and model combinations by allowing you to build the profile on your cluster. They offer greater flexibility for deploying NIM microservices on a wider range of NVIDIA GPUs, especially when pre-optimized profiles for your specific GPU configuration are not available.

To view all of the profiles for a given NIM, run the list-model-profiles utility. It lists all of the model profiles that are compatible with the system and are runnable.

For example, run the following commands after you replace <HF/NGC_or_local_path> with your model path. The commands assume that NGC_API_KEY, HF_TOKEN, and IMG_NAME (the NIM container image reference) are already set in your environment:

$ export NIM_MODEL_NAME=<HF/NGC_or_local_path>
$ docker run -it --rm --gpus=all -e NGC_API_KEY=$NGC_API_KEY -e HF_TOKEN=$HF_TOKEN -e NIM_MODEL_NAME=$NIM_MODEL_NAME $IMG_NAME list-model-profiles
Example output
# Example of list-model-profiles utility output
MODEL PROFILES
- Compatible with system and runnable:
  - e2f00b2cbfb168f907c8d6d4d40406f7261111fbab8b3417a485dcd19d10cc98 (vllm)
  - 668b575f1701fa70a97cfeeae998b5d70b048a9b917682291bb82b67f308f80c (tensorrt_llm)
  - 50e138f94d85b97117e484660d13b6b54234e60c20584b1de6ed55d109ca4f21 (sglang)
  - With LoRA support:
    - 93c5e281d6616f45e2ef801abf4ed82fc65e38ec5f46e0664f340bad4f92d551 (vllm-lora)
    - cdcd22d151713c8b91fcd279a4b5e021153e72ff5cf6ad5498aac96974f5b7d7 (tensorrt_llm-lora)
- Compilable to TRT-LLM using just-in-time compilation of HF models to TRTLLM engines: <None>

Note

Model profiles must be compatible, runnable, and buildable to be built and cached locally. To identify buildable profiles, refer to step 2, Identify Which Profiles Are Buildable Locally, in the example procedure below.

Benefits of Using NIM Build#

NIM Operator supports the NIM Build custom resource, which initiates engine build jobs for buildable profiles. It enables you to pre-build and cache optimized TensorRT-LLM model engines for the GPUs in your cluster before starting a NIM deployment. Pre-building improves startup times and reduces resource usage during NIM deployments and autoscaling, making deployments more predictable.

About the NIM Build Custom Resource#

A NIM Build is a Kubernetes custom resource, nimbuilds.apps.nvidia.com. You create and delete NIM Build resources to manage engine builds for buildable NIM profiles.

Note

There is a one-to-one mapping between a NIM Build custom resource and the profile that it builds an engine from.

Refer to the following table for information about the commonly modified fields:

| Field | Description | Default Value |
| --- | --- | --- |
| spec.annotations | Specifies user-supplied annotations to add to the engine build pod. | None |
| spec.env | Specifies environment variables for the engine build pod. | None |
| spec.image | Specifies the repository, tag, pull policy, and pull secrets for the container image. | None |
| spec.labels | Specifies user-supplied labels to add to the engine build pod. | None |
| spec.modelName | Specifies the name to give the locally built engine. | The name of the NIM Build resource |
| spec.nimCache.name (required) | Specifies the name of the NIM Cache resource. | None |
| spec.nimCache.profile | Specifies the name of the buildable profile to build the engine from. | The single buildable profile in the NIM Cache, when only one is cached |
| spec.nodeSelector | Specifies node selector labels for scheduling the engine build pod. | None |
| spec.resources | Specifies the resource requirements for the engine build pod. | None |
| spec.tolerations | Specifies the tolerations for the engine build pod. | None |
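
The following manifest is a sketch that combines several of these fields in one place; the resource names, node selector label, resource limit, and toleration values are illustrative assumptions rather than requirements:

# Illustrative NIM Build manifest; the values below are example assumptions
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMBuild
metadata:
  name: my-nim-build                    # hypothetical resource name
spec:
  nimCache:
    name: my-nim-cache                  # required: the NIM Cache that holds the buildable profile
    profile: <buildable-profile-name>   # optional when the cache has exactly one buildable profile
  image:
    repository: "nvcr.io/nim/meta/llama-3.2-3b-instruct"
    tag: "1.8.5"
    pullSecrets:
      - ngc-secret
  modelName: my-local-engine            # name given to the locally built engine
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB   # example label; match your GPU nodes
  resources:
    limits:
      nvidia.com/gpu: 1                 # example resource limit for the engine build pod
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule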

Example Procedure#

Summary#

To cache and serve a locally built LLM NIM engine, follow these steps:

  1. Create a NIM Cache that caches buildable model profiles.

  2. Identify which profiles are buildable from NIM Cache.

  3. Create a NIM Build resource to build an engine from your cached profile.

  4. Create a NIM Service that uses the built engine.

Note

You can distinguish built engines from cached profiles by looking for custom: "true" in the profile metadata.

1. Create a NIM Cache to Cache Model Profiles#

Create a NIM Cache to cache multiple buildable profiles.

Note

Refer to Prerequisites for more information on using NIM Cache.

  1. Create a file, such as nimcache.yaml, with contents like the following example:

    # NIM Cache Multiple Profiles
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: meta-llama3-2-3b-instruct
    spec:
      source:
        ngc:
          modelPuller: nvcr.io/nim/meta/llama-3.2-3b-instruct:1.8.5
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model:
            buildable: true # Set to true to filter for and cache all buildable profiles
      storage:
        pvc:
          create: true
          storageClass: "local-path"
          size: "50Gi"
          volumeAccessMode: ReadWriteOnce
    

    Note

    If you are caching only a single profile, you can set spec.source.ngc.model.profiles to the name of that buildable profile instead of using the spec.source.ngc.model.buildable: true filter.
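
    For example, assuming the buildable profile name identified in step 2 of this procedure, the model object in the manifest would look like the following sketch instead:

    # Fragment of spec.source.ngc: cache a single named profile
    model:
      profiles:
        - ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2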

  2. Create the NIM Cache:

    $ kubectl create -f nimcache.yaml -n nim-service
    
    Example output
    nimcache.apps.nvidia.com/meta-llama3-2-3b-instruct created
    
  3. Check the status of the NIM Cache:

    Note

    It can take several minutes to download the profiles into the cache, depending on their size.

    $ kubectl get nimcache -A -o yaml
    
    Example output
    # Example of NIM Cache status output (partial)
    ...
    status:
      conditions:
      ...
      - lastTransitionTime: "2025-07-07T00:00:00Z"
        message: The Job to cache NIM is in progress
        reason: JobRunning
        status: "False"
        type: NIM_CACHE_JOB_PENDING
      state: InProgress
      ...
    

    Note that the in-progress status type is NIM_CACHE_JOB_PENDING.

    You can verify that the pod is running:

    $ kubectl get pods -A | grep -i "instruct-job"
    
    Example output
    # Example of get pods output (partial)
    NAMESPACE     NAME                                  READY   STATUS   
    ...
    nim-service   meta-llama3-2-3b-instruct-job-wz7kw   1/1     Running
    

    You can also verify in the pod log that the profile is being downloaded. Replace meta-llama3-2-3b-instruct-job-wz7kw with the name of your instruct job pod.

    $ kubectl logs meta-llama3-2-3b-instruct-job-wz7kw -n nim-service
    
    Example output
    # Example of pod log output (partial)
    INFO] Fetching contents for profile ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2
    ...
    

    When the NIM Cache is ready, the completed status type is NIM_CACHE_JOB_COMPLETED, and the profile configuration details are listed in the cache.

    $ kubectl get nimcache -A -o yaml
    
    Example output
    # Example of NIM_CACHE_JOB_COMPLETED output (partial)
    ...
    status:
      conditions:
      ...
      - lastTransitionTime: "2025-07-07T00:01:00Z"
        message: The Job to cache NIM has successfully completed
        reason: JobCompleted
        status: "True"
        type: NIM_CACHE_JOB_COMPLETED
      profiles:
      - config:
          feat_lora: "false"
          llm_engine: tensorrt_llm
          pp: "1"
          precision: bf16
          tp: "1"
          trtllm_buildable: "true"  # This indicates the profile is buildable
        name: ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2
      state: Ready
      ...
    

    Note that a profile is buildable if trtllm_buildable: "true" appears in its configuration in the cache status.

2. Identify Which Profiles Are Buildable Locally#

To retrieve the cached profile names from the NIM Cache, view its .status.profiles field. You use one of these names in the NIM Build resource in the next step. Replace meta-llama3-2-3b-instruct with the name of your NIM Cache.

$ kubectl get nimcache meta-llama3-2-3b-instruct -n nim-service -o json | jq '.status.profiles'
Example output
# Example of buildable profiles output
[
  {
    "config": {
      "feat_lora": "true",
      "feat_lora_max_rank": "32",
      "llm_engine": "tensorrt_llm",
      "pp": "1",
      "precision": "bf16",
      "tp": "1",
      "trtllm_buildable": "true"
    },
    "name": "7b8458eb682edb0d2a48b4019b098ba0bfbc4377aadeeaa11b346c63c7adf724"
  },
  {
    "config": {
      "feat_lora": "false",
      "llm_engine": "tensorrt_llm",
      "pp": "1",
      "precision": "bf16",
      "tp": "1",
      "trtllm_buildable": "true"
    },
    "name": "ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2"
  }
]
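
To list only the buildable profile names, you can filter on trtllm_buildable with jq; for example:

$ kubectl get nimcache meta-llama3-2-3b-instruct -n nim-service -o json \
    | jq -r '.status.profiles[] | select(.config.trtllm_buildable == "true") | .name'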

3. Create a NIM Build Resource to Build an Engine From Your Cached Profile#

  1. Create a file, such as nimbuild.yaml, with contents like the following example:

    # NIMBuild Select Profile
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMBuild
    metadata:
      name: meta-llama3-2-3b-instruct
    spec:
      nimCache:
        name: meta-llama3-2-3b-instruct
        profile: ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2
      image:
        repository: "nvcr.io/nim/meta/llama-3.2-3b-instruct"
        tag: "1.8.5"
        pullSecrets:
          - ngc-secret-1
    

    Select one of the cached profile names from the previous step and add it to the NIMBuild spec.nimCache.profile field to select the engine to build.

    Note

    • If the NIM Cache has only one buildable profile, then you do not need to specify a profile name because NIM Build defaults to using that profile.

    • However, if the NIM Cache has multiple buildable profiles, you must explicitly specify a profile name in the NIM Build resource.

  2. Apply the manifest:

    $ kubectl create -f nimbuild.yaml -n nim-service
    
    Example output
    nimbuild.apps.nvidia.com/meta-llama3-2-3b-instruct created
    
  3. Check the status of the NIM Build:

    Note

    It can take several minutes to build the profiles, depending on their size.

    $ kubectl get nimbuild -A
    
    Example output
    NAMESPACE     NAME                        STATUS    AGE
    nim-service   meta-llama3-2-3b-instruct  Pending   7s
    

    You can verify that the pod is running:

    $ kubectl get pods -A | grep -i "engine-build"
    
    Example output
    # Example of get pods output (partial)
    NAMESPACE     NAME                                          READY   STATUS   
    ...
    nim-service   meta-llama3-2-3b-instruct-engine-build-pod   0/1     Running
    

    You can also verify in the pod log that the profile is being built. Replace meta-llama3-2-3b-instruct-engine-build-pod with the name of your engine build pod.

    $ kubectl logs meta-llama3-2-3b-instruct-engine-build-pod -n nim-service
    
    Example output
    # Example of pod log output (partial)
    ...
    INFO] Selected profile: ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2 (tensorrt_llm_buildable-bf16-tpl-ppl)
    INFO] Profile metadata: feat_lora: false
    INFO] Profile metadata: llm_engine: tensorrt_llm
    INFO] Profile metadata: pp: 1
    INFO] Profile metadata: precision: bf16
    INFO] Profile metadata: tp: 1
    INFO] Profile metadata: trtllm_buildable: true
    ...
    

    When the engine build completes, the engine build pod is terminated:

    $ kubectl get pods -A
    
    Example output
    # Example of get pods (partial)
    NAMESPACE     NAME                                          READY   STATUS   
    ...
    nim-service   meta-llama3-2-3b-instruct-engine-build-pod    1/1     Terminating
    

    You can verify the status of the build:

    $ kubectl get nimbuild -A -o yaml
    
    Example output
    # Example of completed build output
    ...
        - lastTransitionTime: "2025-07-23T05:58:40Z"
          message: The Pod to read local model manifest is completed
          reason: PodCompleted
          status: "True"
          type: NIM_BUILD_MODEL_MANIFEST_POD_COMPLETED
        inputProfile:
          config:
            feat_lora: "false"
            llm_engine: tensorrt_llm
            pp: "1"
            precision: bf16
            tp: "1"
            trtllm_buildable: "true"
          name: ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2
        outputProfile:
          config:
            custom: "true"
            feat_lora: "false"
            llm_engine: tensorrt_llm
            model_name: meta-llama3-2-3b-instruct
            pp: "1"
            precision: bf16
            tp: "1"
          name: 1595f282ae143f034a28997246a212c32e077e2c4ca144c27fc1ff834fd9b56c
        state: Ready
    

    Note that the NIM Build object status type is then NIM_BUILD_MODEL_MANIFEST_POD_COMPLETED.
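
    If you script this step, you can block until that condition is reported. The following sketch uses kubectl wait; the timeout value is an assumption:

    $ kubectl wait nimbuild/meta-llama3-2-3b-instruct -n nim-service \
        --for=condition=NIM_BUILD_MODEL_MANIFEST_POD_COMPLETED --timeout=60m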

    You can then get the details of the newly built profile in the NIM Cache:

    $ kubectl get nimcache meta-llama3-2-3b-instruct -n nim-service -o json | jq '.status.profiles'
    
    Example output
    # Example of newly built profile details output
        ...
        [
          {
            "config": {
              "feat_lora": "true",
              "feat_lora_max_rank": "32",
              "llm_engine": "tensorrt_llm",
              "pp": "1",
              "precision": "bf16",
              "tp": "1",
              "trtllm_buildable": "true"
            },
            "name": "7b8458eb682edb0d2a48b4019b098ba0bfbc4377aadeeaa11b346c63c7adf724"
          },
          {
            "config": {
              "feat_lora": "false",
              "llm_engine": "tensorrt_llm",
              "pp": "1",
              "precision": "bf16",
              "tp": "1",
              "trtllm_buildable": "true"
            },
            "name": "ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2" # input profile name
          },
          {
            "config": {
              "custom": "true",
              "feat_lora": "false",
              "input_profile": "ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2", # input profile name
              "llm_engine": "tensorrt_llm",
              "model_name": "meta-llama3-2-3b-instruct",
              "pp": "1",
              "precision": "bf16",
              "tp": "1"
            },
            "name": "1595f282ae143f034a28997246a212c32e077e2c4ca144c27fc1ff834fd9b56c"
          }
        ]
    
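    Because locally built engines are marked with custom: "true" in the profile metadata, you can extract the built engine's profile name directly; for example:

    $ kubectl get nimcache meta-llama3-2-3b-instruct -n nim-service -o json \
        | jq -r '.status.profiles[] | select(.config.custom == "true") | .name'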

4. Create a NIM Service That Uses the Built Engine in NIM Cache#

  1. Create a file, such as nimservice.yaml, with contents like the following example:

    # NIM Service using the locally built engine
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: meta-llama3-2-3b-instruct
    spec:
      source:
        ngc:
          modelPuller: nvcr.io/nim/meta/llama-3.2-3b-instruct:1.8.5
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model:   # Include the model object to describe the LLM-specific model to pull from NGC
            engine: tensorrt_llm
            tensorParallelism: "1"
      storage:
        nimCache:
          name: meta-llama3-2-3b-instruct
          profile: ''   # Leave empty, or set to the built profile name from the NIM Build status
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
    
  2. Apply the manifest:

    $ kubectl create -f nimservice.yaml -n nim-service
    
  3. Wait for the NIM Service to become ready, and then view information about the service:

    $ kubectl describe nimservices.apps.nvidia.com -n nim-service
    
    Example output
    # Example of NIM Service status output (partial)
    ...
    Conditions:
      Last Transition Time:  2024-08-12T19:09:43Z
      Message:               Deployment is ready
      Reason:                Ready
      Status:                True
      Type:                  Ready
    State:                  Ready
    
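    As a quick functional check, you can port-forward the service and send a request to the OpenAI-compatible chat completions endpoint. This sketch assumes that the Operator exposes a service named after the NIM Service resource and that the served model name matches the model_name value in the built profile metadata:

    $ kubectl port-forward -n nim-service service/meta-llama3-2-3b-instruct 8000:8000 &
    $ curl -s http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "meta-llama3-2-3b-instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'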

Troubleshooting#

This section describes common troubleshooting steps for identifying issues with caching locally built LLM NIM engines.

  • Ensure that the NIM Cache referenced by the NIM Build is in the Ready state.

  • Ensure that the NIM Cache referenced by the NIM Build has cached buildable profiles (see the example commands after this list).

  • Ensure that the model puller image is the same in the NIM Cache and the NIM Build.

  • If the cluster contains heterogeneous GPU types, the spec.nodeSelector field in a NIM Build must match that of the corresponding NIM Service. This ensures that both are scheduled onto nodes with the same GPU product type. For example:

    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
    
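For example, you can confirm the cache state and count its buildable profiles with commands like the following; replace the placeholder names with your values:

    $ kubectl get nimcache <NIM-Cache-Name> -n <Namespace> -o jsonpath='{.status.state}'
    $ kubectl get nimcache <NIM-Cache-Name> -n <Namespace> -o json \
        | jq '[.status.profiles[] | select(.config.trtllm_buildable == "true")] | length'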

Logs and Messages to Gather#

  • Logs for the engine build pod can be gathered by running the following command:

    $ kubectl logs <NIM-Build-Name>-engine-build-pod -n nim-service
    
  • Additional information can be found by describing the NIM Build using the following command:

    $ kubectl describe nimbuild <NIM-Build-Name> -n <Namespace>
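
  • Kubernetes events recorded for the NIM Build resource can provide additional context; for example:

    $ kubectl get events -n <Namespace> --field-selector involvedObject.name=<NIM-Build-Name>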