Caching Locally Built LLM NIM Engines#
About Buildable Profiles#
NVIDIA NIM microservices typically use model engines that are tuned for specific NVIDIA GPU configurations. NVIDIA produces model engines for several popular combinations, and these are referred to as model profiles.
Buildable profiles enable compilation on the fly and support broader compatibility with different GPU and model combinations by allowing you to build the profile on your cluster. They offer greater flexibility for deploying NIM microservices on a wider range of NVIDIA GPUs, especially when pre-optimized profiles for your specific GPU configuration are not available.
To view all of the profiles for a given NIM, run the list-model-profiles utility. It lists all of the model profiles that are compatible with the system and are runnable.
For example, run the following commands after you replace <HF/NGC_or_local_path> with your model path:
$ export NIM_MODEL_NAME=<HF/NGC_or_local_path>
$ docker run -it --rm --gpus=all -e NGC_API_KEY=$NGC_API_KEY -e HF_TOKEN=$HF_TOKEN -e NIM_MODEL_NAME=$NIM_MODEL_NAME $IMG_NAME list-model-profiles
Example output
# Example of list-model-profiles utility output
MODEL PROFILES
- Compatible with system and runnable:
- e2f00b2cbfb168f907c8d6d4d40406f7261111fbab8b3417a485dcd19d10cc98 (vllm)
- 668b575f1701fa70a97cfeeae998b5d70b048a9b917682291bb82b67f308f80c (tensorrt_llm)
- 50e138f94d85b97117e484660d13b6b54234e60c20584b1de6ed55d109ca4f21 (sglang)
- With LoRA support:
- 93c5e281d6616f45e2ef801abf4ed82fc65e38ec5f46e0664f340bad4f92d551 (vllm-lora)
- cdcd22d151713c8b91fcd279a4b5e021153e72ff5cf6ad5498aac96974f5b7d7 (tensorrt_llm-lora)
- Compilable to TRT-LLM using just-in-time compilation of HF models to TRTLLM engines: <None>
Note
Model profiles must be compatible, runnable, and buildable to be built and cached locally. To identify buildable profiles, refer to step 2, Identify Which Profiles Are Buildable Locally, in the example procedure below.
Benefits of Using NIM Build#
NIM Operator supports the NIM Build custom resource, which initiates engine build jobs for buildable profiles. NIM Build enables you to pre-build and cache optimized TensorRT-LLM model engines for the GPUs in your cluster before starting a NIM deployment. Pre-building improves startup times and reduces resource usage during NIM deployments and autoscaling, making deployments more predictable.
About the NIM Build Custom Resource#
A NIM Build is a Kubernetes custom resource, nimbuilds.apps.nvidia.com.
You create and delete NIM Build resources to manage engine builds for buildable NIM profiles.
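For example, you can list and delete NIM Build resources with standard kubectl commands against the custom resource:

$ kubectl get nimbuilds.apps.nvidia.com -A
$ kubectl delete nimbuild <NIM-Build-Name> -n <Namespace>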
Note
There is a one-to-one mapping between a NIM Build custom resource and the profile it builds an engine from.
Refer to the following table for information about the commonly modified fields:
| Field | Description | Default Value |
|---|---|---|
| `spec.annotations` | Specifies user-supplied annotations to add to the Engine Build. | None |
| `spec.env` | Specifies environment variables for the Engine Build pod. | None |
| `spec.image` | Specifies the repository, tag, pull policy, and pull secret for the container image. | None |
| `spec.labels` | Specifies user-supplied labels to add to the Engine Build pod. | None |
| `spec.modelName` | Specifies the name given to the locally built engine. | If not specified, NIM Build uses the name of the NIM Build resource. |
| `spec.nimCache.name` | Specifies the name of the NIM Cache resource. | None |
| `spec.nimCache.profile` | Specifies the name of the buildable profile used to build the engine. | If the NIM Cache has only one buildable profile, NIM Build defaults to that profile name if none is specified. |
| `spec.nodeSelector` | Specifies node selector labels to schedule the Engine Build pod. | None |
| `spec.resources` | Specifies the resource requirements for the NIM Build pod. | None |
| `spec.tolerations` | Specifies the tolerations for the NIM Build pod. | None |
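The following minimal sketch shows how several of these fields can fit together in a single manifest. The values shown (names, profile placeholder, environment variable, node label, and GPU limit) are illustrative assumptions, not defaults:

# Hypothetical NIMBuild manifest combining commonly modified fields
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMBuild
metadata:
  name: example-nimbuild
spec:
  nimCache:
    name: example-nimcache            # NIM Cache that holds the buildable profile
    profile: <buildable-profile-name> # omit if the cache has only one buildable profile
  image:
    repository: "nvcr.io/nim/meta/llama-3.2-3b-instruct"
    tag: "1.8.5"
    pullSecrets:
    - ngc-secret
  env:                                # illustrative environment variable
  - name: LOG_LEVEL
    value: "INFO"
  nodeSelector:                       # schedule onto a specific GPU product type
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
  resources:                          # illustrative sizing for the build pod
    limits:
      nvidia.com/gpu: 1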
Example Procedure#
Summary#
To cache and serve a locally built LLM NIM engine, follow these steps:
1. Create a NIM Cache that caches buildable model profiles.
2. Identify which profiles are buildable from the NIM Cache.
3. Create a NIM Build resource to build an engine from your cached profile.
4. Create a NIM Service that uses the built engine.
Note
You can distinguish built engines from cached profiles by looking for custom: "true" in the profile metadata.
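For example, after a build completes, you can list only the locally built engines with a jq filter on that flag:

$ kubectl get nimcache <NIM-Cache-Name> -n nim-service -o json \
    | jq '.status.profiles[] | select(.config.custom == "true")'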
1. Create a NIM Cache to Cache Model Profiles#
Create a NIM Cache to cache multiple buildable profiles.
Note
Refer to Prerequisites for more information on using NIM Cache.
Create a file, such as nimcache.yaml, with contents like the following example:

# NIM Cache Multiple Profiles
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-2-3b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-3b-instruct:1.8.5
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        buildable: true   # Set to true to filter for and cache all buildable profiles
  storage:
    pvc:
      create: true
      storageClass: "local-path"
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
Note
If you are caching only a single profile, you can set spec.source.ngc.model.profiles to the name of the single buildable profile instead of using the spec.source.ngc.model.buildable: true filter.

Create the NIM Cache:
$ kubectl create -f nimcache.yaml -n nim-service
Example output
nimcache.apps.nvidia.com/meta-llama3-2-3b-instruct created
Check the status of the NIM Cache:
Note
It can take several minutes to download the profiles into the cache, depending on their size.
$ kubectl get nimcache -A -o yaml
Example output
# Example of NIM Cache status output (partial)
...
status:
  conditions:
  ...
  - lastTransitionTime: "2025-07-07T00:00:00Z"
    message: The Job to cache NIM is in progress
    reason: JobRunning
    status: "False"
    type: NIM_CACHE_JOB_PENDING
  state: InProgress
...
Note that the in-progress status type is NIM_CACHE_JOB_PENDING.

You can verify that the pod is running:
$ kubectl get pods -A | grep -i "instruct-job"
Example output
# Example of get pods output (partial)
NAMESPACE     NAME                                  READY   STATUS    ...
nim-service   meta-llama3-2-3b-instruct-job-wz7kw   1/1     Running
You can also verify in the pod log that the profile is being downloaded. Replace meta-llama3-2-3b-instruct-job-wz7kw with the name of your instruct job pod.
$ kubectl logs meta-llama3-2-3b-instruct-job-wz7kw -n nim-service
Example output
# Example of pod log output (partial)
INFO] Fetching contents for profile ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2
...
When the NIM Cache is ready, the completed status type is NIM_CACHE_JOB_COMPLETED, and the profile configuration details are listed in the cache.

$ kubectl get nimcache -A -o yaml
Example output
# Example of NIM_CACHE_JOB_COMPLETED output (partial)
...
status:
  conditions:
  ...
  - lastTransitionTime: "2025-07-07T00:01:00Z"
    message: The Job to cache NIM has successfully completed
    reason: JobCompleted
    status: "True"
    type: NIM_CACHE_JOB_COMPLETED
  profiles:
  - config:
      feat_lora: "false"
      llm_engine: tensorrt_llm
      pp: "1"
      precision: bf16
      tp: "1"
      trtllm_buildable: "true"   # This indicates the profile is buildable
    name: ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2
    state: Ready
...
Note that the profile is buildable if trtllm_buildable: "true" appears in its config in the cache status.
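Rather than polling with kubectl get, you can block until the cache job finishes. This is a sketch, assuming kubectl wait can match the NIM_CACHE_JOB_COMPLETED condition type shown above:

$ kubectl wait nimcache/meta-llama3-2-3b-instruct -n nim-service \
    --for=condition=NIM_CACHE_JOB_COMPLETED --timeout=60m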
2. Identify Which Profiles Are Buildable Locally#
To retrieve the cached profile names, view the .status.profiles field of the NIM Cache. You use one of these profile names in the NIM Build resource in the next step.
Replace meta-llama3-2-3b-instruct with the name of your NIM Cache.
$ kubectl get nimcache meta-llama3-2-3b-instruct -n nim-service -o json | jq '.status.profiles'
Example output
# Example of buildable profiles output
[
{
"config": {
"feat_lora": "true",
"feat_lora_max_rank": "32",
"llm_engine": "tensorrt_llm",
"pp": "1",
"precision": "bf16",
"tp": "1",
"trtllm_buildable": "true"
},
"name": "7b8458eb682edb0d2a48b4019b098ba0bfbc4377aadeeaa11b346c63c7adf724"
},
{
"config": {
"feat_lora": "false",
"llm_engine": "tensorrt_llm",
"pp": "1",
"precision": "bf16",
"tp": "1",
"trtllm_buildable": "true"
},
"name": "ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2"
}
]
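If the cache also contains non-buildable profiles, you can narrow the output to just the buildable profile names with a jq filter like the following:

$ kubectl get nimcache meta-llama3-2-3b-instruct -n nim-service -o json \
    | jq '[.status.profiles[] | select(.config.trtllm_buildable == "true") | .name]'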
3. Create a NIM Build Resource to Build an Engine From Your Cached Profile#
Create a file, such as nimbuild.yaml, with contents like the following example:

# NIMBuild Select Profile
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMBuild
metadata:
  name: meta-llama3-2-3b-instruct
spec:
  nimCache:
    name: meta-llama3-2-3b-instruct
    profile: ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2
  image:
    repository: "nvcr.io/nim/meta/llama-3.2-3b-instruct"
    tag: "1.8.5"
    pullSecrets:
    - ngc-secret-1
Select one of the cached profile names from the previous step and add it to the NIMBuild spec.nimCache.profile field to select the engine to build.

Note

If the NIM Cache has only one buildable profile, you do not need to specify a profile name because NIM Build defaults to that profile.
However, if the NIM Cache has multiple profiles, you must explicitly specify a buildable profile name in the NIM Build resource.
Apply the manifest:
$ kubectl create -f nimbuild.yaml -n nim-service
Example output
nimbuild.apps.nvidia.com/meta-llama3-2-3b-instruct created
Check the status of the NIM Build:
Note
It can take several minutes to build the profiles, depending on their size.
$ kubectl get nimbuild -A
Example output
NAMESPACE     NAME                        STATUS    AGE
nim-service   meta-llama3-2-3b-instruct   Pending   7s
You can verify that the pod is running:
$ kubectl get pods -A | grep -i "engine-build"
Example output
# Example of get pods output (partial)
NAMESPACE     NAME                                         READY   STATUS    ...
nim-service   meta-llama3-2-3b-instruct-engine-build-pod   0/1     Running
You can also verify in the pod log that the profile is being built. Replace meta-llama3-2-3b-instruct-engine-build-pod with the name of your engine build pod.
$ kubectl logs meta-llama3-2-3b-instruct-engine-build-pod -n nim-service
Example output
# Example of pod log output (partial)
...
INFO] Selected profile: ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2 (tensorrt_llm_buildable-bf16-tp1-pp1)
INFO] Profile metadata: feat_lora: false
INFO] Profile metadata: llm_engine: tensorrt_llm
INFO] Profile metadata: pp: 1
INFO] Profile metadata: precision: bf16
INFO] Profile metadata: tp: 1
INFO] Profile metadata: trtllm_buildable: true
...
When the engine finishes building, the engine build pod is terminated:
$ kubectl get pods -A
Example output
# Example of get pods (partial)
NAMESPACE     NAME                                         READY   STATUS        ...
nim-service   meta-llama3-2-3b-instruct-engine-build-pod   1/1     Terminating
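You can also block until the build completes by waiting on the NIM_BUILD_MODEL_MANIFEST_POD_COMPLETED condition type shown in the status output below. This is a sketch, assuming kubectl wait matches this custom condition type:

$ kubectl wait nimbuild/meta-llama3-2-3b-instruct -n nim-service \
    --for=condition=NIM_BUILD_MODEL_MANIFEST_POD_COMPLETED --timeout=120m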
You can verify the status of the build:
$ kubectl get nimbuild -A -o yaml
Example output
# Example of completed build output
...
  - lastTransitionTime: "2025-07-23T05:58:40Z"
    message: The Pod to read local model manifest is completed
    reason: PodCompleted
    status: "True"
    type: NIM_BUILD_MODEL_MANIFEST_POD_COMPLETED
  inputProfile:
    config:
      feat_lora: "false"
      llm_engine: tensorrt_llm
      pp: "1"
      precision: bf16
      tp: "1"
      trtllm_buildable: "true"
    name: ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2
  outputProfile:
    config:
      custom: "true"
      feat_lora: "false"
      llm_engine: tensorrt_llm
      model_name: meta-llama3-2-3b-instruct
      pp: "1"
      precision: bf16
      tp: "1"
    name: 1595f282ae143f034a28997246a212c32e077e2c4ca144c27fc1ff834fd9b56c
  state: Ready
Note that the NIM Build object status type is then NIM_BUILD_MODEL_MANIFEST_POD_COMPLETED.

You can then get the details of the newly built profiles in the NIM Cache:
$ kubectl get nimcache meta-llama3-2-3b-instruct -n nim-service -o json | jq '.status.profiles'
Example output
# Example of newly built profile details output
...
[
  {
    "config": {
      "feat_lora": "true",
      "feat_lora_max_rank": "32",
      "llm_engine": "tensorrt_llm",
      "pp": "1",
      "precision": "bf16",
      "tp": "1",
      "trtllm_buildable": "true"
    },
    "name": "7b8458eb682edb0d2a48b4019b098ba0bfbc4377aadeeaa11b346c63c7adf724"
  },
  {
    "config": {
      "feat_lora": "false",
      "llm_engine": "tensorrt_llm",
      "pp": "1",
      "precision": "bf16",
      "tp": "1",
      "trtllm_buildable": "true"
    },
    "name": "ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2"   # input profile name
  },
  {
    "config": {
      "custom": "true",
      "feat_lora": "false",
      "input_profile": "ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2",   # input profile name
      "llm_engine": "tensorrt_llm",
      "model_name": "meta-llama3-2-3b-instruct",
      "pp": "1",
      "precision": "bf16",
      "tp": "1"
    },
    "name": "1595f282ae143f034a28997246a212c32e077e2c4ca144c27fc1ff834fd9b56c"   # locally built engine profile
  }
]
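For the next step, you can capture the locally built engine's profile name (the entry with custom: "true") in a shell variable, assuming jq:

$ BUILT_PROFILE=$(kubectl get nimcache meta-llama3-2-3b-instruct -n nim-service -o json \
    | jq -r '.status.profiles[] | select(.config.custom == "true") | .name')
$ echo $BUILT_PROFILE
1595f282ae143f034a28997246a212c32e077e2c4ca144c27fc1ff834fd9b56c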
4. Create a NIM Service That Uses the Built Engine in NIM Cache#
Create a file, such as nimservice.yaml, with contents like the following example:

# NIMService for LLM-specific
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama-3.1-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:   # Include the model object to describe the LLM-specific model you want to pull from NGC
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    nimCache:
      name: llama-3.1-8b-instruct
      profile: ''   # Set to the name of the locally built engine profile
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000

To serve the engine built in the previous steps, set storage.nimCache.name to your NIM Cache name (for example, meta-llama3-2-3b-instruct) and storage.nimCache.profile to the built profile name (for example, 1595f282ae143f034a28997246a212c32e077e2c4ca144c27fc1ff834fd9b56c).
Apply the manifest:
$ kubectl create -f nimservice.yaml -n nim-service
Wait for the NIM Service to become ready, then view information about the service:
$ kubectl describe nimservices.apps.nvidia.com -n nim-service
Example output
# Example of NIM Service status output (partial)
...
Conditions:
  Last Transition Time:  2024-08-12T19:09:43Z
  Message:               Deployment is ready
  Reason:                Ready
  Status:                True
  Type:                  Ready
State:                   Ready
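As an optional smoke test, you can port-forward the service and call the OpenAI-compatible API that LLM NIM microservices expose on the port configured above. This sketch assumes the Service name matches the NIM Service name; use a model id returned by /v1/models in the completion request:

$ kubectl port-forward -n nim-service service/llama-3.1-8b-instruct 8000:8000 &
$ curl -s http://localhost:8000/v1/models
$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "<served-model-name>", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'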
Troubleshooting#
This section explains some common troubleshooting steps to identify issues with caching locally built LLM NIM engines.
Ensure that the NIMCache referenced by the NIMBuild is in the ready state.
Ensure that the NIMCache referenced by the NIMBuild has cached buildable profiles.
Ensure that the model puller image is the same in the NIMCache and the NIMBuild.
If the cluster contains heterogeneous GPU types, the spec.nodeSelector field in a NIMBuild must match that of the corresponding NIMService. This ensures that both are scheduled onto nodes with the same GPU product type. For example:

spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
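To find the GPU product label values available in your cluster (applied by GPU Feature Discovery), you can list the nodes with that label shown as a column:

$ kubectl get nodes -L nvidia.com/gpu.product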
Logs and Messages to Gather#
To gather logs for the NIMBuild pod, run the following command:
$ kubectl logs <NIM-Build-Name>-engine-build-pod -n nim-service
For additional information, describe the NIM Build with the following command:
$ kubectl describe nimbuild <NIM-Build-Name> -n <Namespace>