Caching NIM Models#
Benefits of Caching Models#
NVIDIA recommends caching models locally on your cluster. Caching a model improves microservice startup time, and when deployments scale to multiple NIM microservice pods, a single cached model can serve all of them. This is achieved through a persistent volume backed by network storage.
For single node clusters, the Local Path Provisioner from Rancher Labs is sufficient for research and development. For production, NVIDIA recommends installing a provisioner that provides a network storage class.
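Before creating any caches, you can confirm which storage classes and provisioners are available. This is standard kubectl and works on any cluster:

$ kubectl get storageclass

Look for a class backed by a network provisioner, such as NFS, and note its name for the `spec.storage.pvc.storageClass` field described later on this page.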
Prerequisites#
Installed the NVIDIA NIM Operator.
A persistent volume provisioner that uses network storage such as NFS, S3, or vSAN. The models are downloaded and stored in persistent storage.
You can create a PVC and specify the name when you create the NIM cache resource, or you can request that the Operator create the PVC.
The required image pull secrets with your NVIDIA NGC API Key. See the sketch after this list for one way to create them.
The model name of the NVIDIA NIM you want to cache. The sample manifests on this page show commonly used container images in the `spec.source.ngc.modelPuller` field, but you can update this field to any supported NIM. When selecting a model, check the resource requirements and supported GPUs for the model that you plan to use. This information is typically available on the model card (on build.nvidia.com) or model overview (on NVIDIA NGC) pages. Refer to the Platform Support page for details on supported architectures, or refer to the NVIDIA NIM documentation for more details on NIM. You can learn about available models from the following sources:
Browse models at NVIDIA AI Foundation Models. To view models that you can run anywhere, click NIM Type and then run anywhere. Use the search box to filter for a specific NIM.
Browse NVIDIA NGC Catalog containers. Use the search box to find the container you are looking for, or click the NVIDIA NIM checkbox to view all NVIDIA NIM.
Run `ngc registry image list "nim/*"` to display NIM images. Refer to ngc registry in the NVIDIA NGC CLI User Guide for information about the command.
To cache models or datasets from Hugging Face Hub, you must have a Hugging Face Hub account and a user access token, and your desired models or datasets must be available to that account. Create a Kubernetes secret with your Hugging Face Hub user access token to use as the `spec.source.hf.authSecret` (see the sketch after this list).
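The following sketch shows one way to create these secrets with kubectl. The secret names (`ngc-secret`, `ngc-api-secret`, `hf-auth`) match the sample manifests on this page; the key names `NGC_API_KEY` and `HF_TOKEN` and the `nim-service` namespace (created in the example procedure below) are illustrative assumptions, so adjust them to your environment:

$ kubectl create secret docker-registry ngc-secret -n nim-service \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="$NGC_API_KEY"
$ kubectl create secret generic ngc-api-secret -n nim-service \
    --from-literal=NGC_API_KEY="$NGC_API_KEY"
$ kubectl create secret generic hf-auth -n nim-service \
    --from-literal=HF_TOKEN="$HF_TOKEN"

The secrets must exist in the same namespace as the NIM cache resources that reference them.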
About the NIM Cache Custom Resource Definition#
A NIM cache is a Kubernetes custom resource, `nimcaches.apps.nvidia.com`. You create and delete NIM cache resources to manage model caching.
NIM Cache Configuration#
If you delete a NIM cache resource that was created with `spec.storage.pvc.create: true`, the NIM Operator deletes the persistent volume (PV) and persistent volume claim (PVC).
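If you need the cached model to survive deletion of the NIM cache resource, one option is to create the PVC yourself and reference it by name rather than letting the Operator create it. A minimal sketch, assuming a pre-created PVC named `my-model-store` (hypothetical):

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
  storage:
    pvc:
      name: my-model-store          # pre-created PVC; not deleted with the NIM cache
      volumeAccessMode: ReadWriteMany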
Refer to the following table for information about the commonly modified fields:
Field | Description | Default Value
---|---|---
`spec.certConfig` | Deprecated. Use `spec.proxy` instead. | None
`spec.env` | Specifies environment variable names and values for the caching job. | None
`spec.groupID` | Specifies the group for the pods. This value is used to set the security context of the pod in the `fsGroup` field. | `2000`
`spec.nodeSelector` | Specifies node selector labels to schedule the caching job. | None
`spec.proxy.certConfigMap` | Specifies the name of the ConfigMap with CA certs for your proxy server. | None
`spec.proxy.httpProxy` | Specifies the address of a proxy server that should be used for outbound HTTP requests. | None
`spec.proxy.httpsProxy` | Specifies the address of a proxy server that should be used for outbound HTTPS requests. | None
`spec.proxy.noProxy` | Specifies a comma-separated list of domain names, IP addresses, or IP ranges for which proxying should be bypassed. | None
`spec.resources` | Specifies the resource requirements for the pods. | None
`spec.runtimeClassName` | Specifies the underlying container runtime class name to be used for running NIM with NVIDIA GPUs allocated. If not set, the default runtime class of the cluster is used. | None
`spec.source.<hf|dataStore>.revision` | Specifies the revision of the object to be cached. This is either a commit hash, branch name, or tag. For example, you can start a training job, such as in the NeMo Data Flywheel Jupyter notebook, and then create a NIM cache with a revision, as shown in the example after this table. | None
`spec.storage.pvc.annotations` | Annotations to add to the NIM Operator created PVC. | None
`spec.storage.pvc.create` | When set to `true`, the Operator creates the PVC. | `false`
`spec.storage.pvc.name` | Specifies the PVC name. This field is required if you use an existing PVC instead of having the Operator create one. | The NIM cache resource name with a `-pvc` suffix.
`spec.storage.pvc.size` | Specifies the size, in Gi, for the PVC to create. This field is required if you specify `spec.storage.pvc.create: true`. | None
`spec.storage.pvc.storageClass` | Specifies the storage class for the PVC to create. Leave empty to use your cluster's default StorageClass. | None
`spec.storage.pvc.subPath` | Specifies to create a subpath on the PVC and cache the model profiles in the directory. | None
`spec.storage.pvc.volumeAccessMode` | Specifies the access mode for the PVC to create. | None
`spec.tolerations` | Specifies the tolerations for the caching job. | None
`spec.userID` | Specifies the user ID for the pod. This value is used to set the security context of the pod in the `runAsUser` field. | `1000`

The following example shows a NIM cache that pins a NeMo Data Store model to a revision:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-1b-instruct-datastore-e2e
spec:
  source:
    dataStore:
      endpoint: http://10.105.55.171:8000/v1/hf
      modelName: "llama-3.2-1b-xlam-run1" # default/llama-3-1b-instruct model must be present in NeMo DataStore
      namespace: xlam-tutorial-ns
      authSecret: hf-auth
      modelPuller: nvcr.io/nvidia/nemo-microservices/nds-v2-huggingface-cli:25.04
      pullSecret: ngc-secret
      revision: "cust-3VpkN1ve1GMkwsYqEptoij"
  storage:
    pvc:
      create: true
      storageClass: ""
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
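To make these fields concrete, the following is a minimal sketch that combines several commonly modified fields in one NIM cache. The proxy addresses, node label, and taint are placeholder values for illustration, not required settings:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
  proxy:
    httpProxy: http://proxy.example.com:3128      # placeholder
    httpsProxy: http://proxy.example.com:3128     # placeholder
    noProxy: localhost,127.0.0.1,.cluster.local   # placeholder
  nodeSelector:
    nvidia.com/gpu.present: "true"                # example label
  tolerations:
  - key: nvidia.com/gpu                           # example taint
    operator: Exists
    effect: NoSchedule
  storage:
    pvc:
      create: true
      size: "50Gi"
      volumeAccessMode: ReadWriteMany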
Caching Non-LLM, LLM-Specific, and Multi-LLM NIM#
NIM Cache supports different types of NIM, including non-LLM, LLM-Specific, and multi-LLM NIM. Each type has different caching source configuration options.
Select your NIM type for detailed caching instructions:
Non-LLM NIM: Cover a wide variety of domains, such as retrieval, vision, speech, biology, and safety and moderation.
LLM-Specific NIM: Focus on individual Large Language Models or model families, offering maximum performance.
Multi-LLM NIM: Enable the deployment of a broad range of Large Language Models, offering maximum flexibility.
Supported Sources and Protocols#
You can easily deploy custom, fine-tuned models on NIM. Given weights in the Hugging Face format, NIM automatically builds an optimized TensorRT-LLM engine locally.
You can pull models from a variety of sources using various protocols:
For all NIM microservices, NVIDIA NGC Catalog is supported.
For Multi-LLM NIM microservices, the following registry types are also supported:
Registries using the Hugging Face Protocol, such as
Hugging Face Hub Data Store
NVIDIA NeMo Data Store
Local File Data Store
Refer to Caching Multi-LLM Compatible NIM for examples.
For LLM-Specific NIM microservices, the following protocols are also supported:
NGC Mirrored Local Model Registries (S3, HTTPS, JFrog)
Refer to Caching LLM-Specific NIM for examples.
Note
Each cache can only be configured to pull from one source.
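For illustration, the following is a minimal sketch of a multi-LLM NIM cache that pulls from Hugging Face Hub over the Hugging Face protocol. The structure mirrors the NeMo Data Store example earlier on this page; the model name, puller image tag, and organization are assumptions, so refer to Caching Multi-LLM Compatible NIM for authoritative examples:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: llama-3.2-1b-instruct-hf
spec:
  source:
    hf:
      endpoint: https://huggingface.co
      namespace: meta-llama                        # Hugging Face organization (example)
      modelName: Llama-3.2-1B-Instruct             # example model
      authSecret: hf-auth
      modelPuller: nvcr.io/nim/nvidia/llm-nim:1.8  # example multi-LLM NIM image and tag
      pullSecret: ngc-secret
  storage:
    pvc:
      create: true
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce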
Example Procedure#
Summary#
To cache a NIM model, follow these steps:
Note
Ensure you have completed the prerequisites.
1. Create a Namespace#
$ kubectl create namespace nim-service
2. Create and Configure a NIM Cache Custom Resource#
Create a file, such as `cache-all.yaml`, with contents like the following example:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: "50Gi"
      volumeAccessMode: ReadWriteMany
  resources: {}
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: nv-embedqa-1b-v2
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.3.1
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: "50Gi"
      volumeAccessMode: ReadWriteMany
  resources: {}
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: nv-rerankqa-1b-v2
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:1.3.1
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: "50Gi"
      volumeAccessMode: ReadWriteMany
  resources: {}
Apply the manifest:
$ kubectl apply -n nim-service -f cache-all.yaml
3. Optional: View Information About the Caching Progress#
Confirm a persistent volume and claim are created:
$ kubectl get -n nim-service pvc,pv
Example output
NAME                                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/meta-llama3-8b-instruct-pvc   Bound    pvc-1d3f8d48-660f-4e44-9796-c80a4ed308ce   50Gi       RWO            nfs-client     <unset>                 10m

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                     STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
persistentvolume/pvc-1d3f8d48-660f-4e44-9796-c80a4ed308ce   50Gi       RWO            Delete           Bound    nim-service/meta-llama3-8b-instruct-pvc   nfs-client     <unset>                          10m
View the NIM cache resources to check the status:
$ kubectl get nimcaches.apps.nvidia.com -n nim-service
Example output
NAME                        STATUS   PVC                             AGE
meta-llama3-8b-instruct     Ready    meta-llama3-8b-instruct-pvc     2024-09-19T13:20:53Z
nv-embedqa-e5-v5            Ready    nv-embedqa-e5-v5-pvc            2024-09-18T21:11:37Z
nv-rerankqa-mistral-4b-v3   Ready    nv-rerankqa-mistral-4b-v3-pvc   2024-09-18T21:11:37Z
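Caching large models can take a while. If you script the workflow, you can block until a cache reaches the `Ready` state. This sketch assumes kubectl v1.23 or later, which supports JSONPath conditions in `kubectl wait`:

$ kubectl wait -n nim-service nimcache/meta-llama3-8b-instruct \
    --for=jsonpath='{.status.state}'=Ready --timeout=30m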
Support for Advanced Configurations#
LoRA Models and Adapters#
NVIDIA NIM for LLMs supports LoRA parameter-efficient fine-tuning (PEFT) adapters trained by the NeMo Framework and Hugging Face Transformers libraries.
Refer to LoRA Models and Adapters for detailed usage instructions.
Air-Gapped and Proxy Environments#
NVIDIA NIM for large language models (LLMs) supports serving models in an air-gapped system (also known as air wall, air-gapping, or disconnected network). In an air-gapped system, you can run a NIM with no internet connection and with no connection to the NGC registry or Hugging Face Hub. You have two options for air-gapped deployment: accessing NGC through a proxy or serving models from local assets.
Refer to Air-Gapped Environments for detailed usage instructions.
Caching Locally Built LLM NIM Engines#
NIM Operator supports the NIM Build custom resource, which allows end users to build and cache model engines before starting a NIM deployment. This helps improve startup times and reduces resource usage during NIM deployments and autoscaling, making deployments more predictable.
Refer to Caching Locally Built LLM NIM Engines for detailed usage instructions.
Displaying the NIM Cache Status#
Run the following command to display the NIM cache status:
$ kubectl get nimcaches.apps.nvidia.com -n nim-service
Example output
NAME STATUS PVC AGE
meta-llama3-8b-instruct Ready meta-llama3-8b-instruct 2024-08-09T20:54:28Z
nv-embedqa-e5-v5 Ready nv-embedqa-e5-v5-pvc 2024-08-09T20:54:28Z
nv-rerankqa-mistral-4b-v3 Ready nv-rerankqa-mistral-4b-v3-pvc 2024-08-09T20:54:28Z
The NIM cache object can report the following statuses:
Status | Description
---|---
Failed | The job failed to download and cache the model profile.
InProgress | The job is downloading the model profile from NGC.
NotReady | The job is not ready. This status can be reported shortly after creating the NIM cache resource, while the image for the pod is downloaded from NGC. For more information, run `kubectl describe pod` on the caching job's pod.
Pending | The job is created, but has not yet started and become active.
PVC-Created | The Operator creates a PVC for the model profile cache if you set `spec.storage.pvc.create: true`.
Ready | The job downloaded and cached the model profile.
Started | The Operator creates a job to download the model profile from NGC.
Displaying Cached Model Profiles#
To view the `.status.profiles` field of the custom resource, use the following command, updating `meta-llama3-8b-instruct` to your NIM cache name.
$ kubectl get nimcaches.apps.nvidia.com -n nim-service \
meta-llama3-8b-instruct -o=jsonpath="{.status.profiles}" | jq .
Example output
[
{
"config": {
"feat_lora": "false",
"llm_engine": "tensorrt_llm",
"precision": "bf16",
"tp": "1",
"trtllm_buildable": "true"
},
"name": "7cc8597690a35aba19a3636f35e7f1c7e7dbc005fe88ce9394cad4a4adeed414"
}
]
The output shows an array of cached profiles, including the profile name
for identification.
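If you only need the profile names, a JSONPath query works without `jq`; this sketch reuses the example NIM cache name:

$ kubectl get nimcaches.apps.nvidia.com -n nim-service \
    meta-llama3-8b-instruct -o=jsonpath='{.status.profiles[*].name}'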
Troubleshooting#
This section explains some common troubleshooting steps to identify issues with caching models.
NIM Pods Show ContainerCreating Status for a Long Time#
After starting a NIM cache, you can check the cache pod status by running the following command:
$ kubectl get pods -n nim-service
Example output
NAME                          READY   STATUS              RESTARTS   AGE
meta-llama3-8b-instruct-pod   0/1     ContainerCreating   0          2m33s
nv-embedqa-1b-v2-pod          0/1     ContainerCreating   0          2m33s
nv-rerankqa-1b-v2-pod         0/1     ContainerCreating   0          2m33s
You might notice the init NIM cache pod shows a status of `ContainerCreating` for several minutes. This can happen when the NIM container image takes a long time to download, which can cause your NIM cache resource to show no status or a status of `NotReady` while the NIM container is downloading.
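To confirm that the delay is an image pull rather than a scheduling problem, describe the pod and review its events; the pod name below is taken from the example output above:

$ kubectl describe pod -n nim-service meta-llama3-8b-instruct-pod

Events that show `Pulling image` indicate the download is still in progress.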
Refer to NIM Cache Reports No Status for more troubleshooting details.
NIM Cache Reports No Status#
If you run `kubectl get nimcache -n nim-service` and the output does not report a status, perform the following actions to get more information:
Determine the state of the caching jobs:
$ kubectl get jobs -n nim-service
Example output
NAME                          COMPLETIONS   DURATION   AGE
meta-llama3-8b-instruct-job   1/1           8s         2m57s
View the logs from the job with a command like `kubectl logs -n nim-service job/meta-llama3-8b-instruct-job`. If the caching job is no longer available, delete the NIM cache resource and reapply the manifest.
Describe the NIM cache resource and review the conditions:
$ kubectl describe nimcache -n nim-service <nim-cache-name>
Example output
Status:
  Conditions:
    Last Transition Time:  2025-04-17T15:47:29Z
    Message:               The PVC has been created for caching NIM model
    Reason:                PVCCreated
    Status:                True
    Type:                  NIM_CACHE_PVC_CREATED
    Last Transition Time:  2025-04-17T15:50:36Z
    Message:
    Reason:                Reconciled
    Status:                False
    Type:                  NIM_CACHE_RECONCILE_FAILED
    Last Transition Time:  2025-04-17T15:50:36Z
    Message:               The Job to cache NIM is in pending state
    Reason:                JobPending
    Status:                True
    Type:                  NIM_CACHE_JOB_PENDING
    Last Transition Time:  2025-04-17T15:50:36Z
    Message:               The Job to cache NIM has successfully completed
    Reason:                JobCompleted
    Status:                True
    Type:                  NIM_CACHE_JOB_COMPLETED
  ...
  State:  Ready
Events:
  Type     Reason           Age                    From                 Message
  ----     ------           ----                   ----                 -------
  Warning  ReconcileFailed  6m                     nimcache-controller  NIMCache meta-llama3-8b-instruct reconcile failed, msg: Pod "meta-llama3-8b-instruct-pod" not found
  Normal   Started          3m28s                  nimcache-controller  NIMCache meta-llama3-8b-instruct reconcile success, new state: Started
  Normal   InProgress       3m28s                  nimcache-controller  NIMCache meta-llama3-8b-instruct reconcile success, new state: InProgress
  Normal   Pending          2m53s (x2 over 3m28s)  nimcache-controller  NIMCache meta-llama3-8b-instruct reconcile success, new state: Pending
  Normal   Ready            2m53s                  nimcache-controller  NIMCache meta-llama3-8b-instruct reconcile success, new state: Ready
The preceding output shows a NIM cache resource that eventually succeeded in downloading and caching a model profile.
The `NIM_CACHE_RECONCILE_FAILED` condition and `ReconcileFailed` event reason were reported during the interval after the Operator created the caching job but before the pod was running, while the image was downloading from NGC. In the output, the status for that condition is set to `False` to indicate that the condition is no longer accurate.
View the Operator logs by running `kubectl logs -n nim-operator -l app.kubernetes.io/name=k8s-nim-operator`.
NIM Cache Event Failures Report ReconcileFailed But NIM Pod is ContainerCreating#
Some NIM container images can take a long time to download due to image size or network connectivity.
If this happens, the NIM cache can report no status, a status of `NotReady`, or event failures like `ReconcileFailed` while the pods for the caching job are still being created.
Run `kubectl get pods -n <nim-namespace>` to check the status of the NIM cache pod.
Once the container download completes, the pod starts normally, and the NIM cache status updates to a running status.
Refer to Displaying the NIM Cache Status for more details on statuses.
Deleting a NIM Cache#
To delete a NIM cache, perform the following steps.
View the NIM cache custom resources:
$ kubectl get nimcaches.apps.nvidia.com -A
Example output
NAMESPACE NAME STATUS PVC AGE
nim-service meta-llama3-8b-instruct Ready model-store 2024-08-08T13:14:30Z
Delete the custom resource:
$ kubectl delete nimcache -n nim-service meta-llama3-8b-instruct
If the Operator created the PVC when you created the NIM cache, the Operator deletes the PVC and the cached model profiles. You can determine if the Operator created the PVC by running a command like the following example:
$ kubectl get nimcaches.apps.nvidia.com -n nim-service \
-o=jsonpath='{range .items[*]}{.metadata.name}: {.spec.storage.pvc.create}{"\n"}{end}'
Example output
meta-llama3-8b-instruct: true
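If the output shows `false`, the PVC was not created by the Operator, so deleting the NIM cache leaves the PVC and cached profiles in place. Remove them manually when they are no longer needed; the PVC name below is a placeholder:

$ kubectl delete pvc -n nim-service <pvc-name>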
Next Steps#
Deploy NIM microservices either by adding NIM service custom resources or by managing several services in a single NIM pipeline custom resource.
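As a starting point, the following is a hedged sketch of a NIM service that consumes the cache created above. The field names follow the NIMService custom resource, but treat the tag, replica count, GPU limit, and port as placeholder values and refer to the NIM service documentation for the authoritative schema:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: "1.3.3"                    # placeholder tag
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama3-8b-instruct # the NIM cache created above
      profile: ""
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1             # example GPU limit
  expose:
    service:
      type: ClusterIP
      port: 8000                    # example port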