Caching NIM Models#

Benefits of Caching Models#

NVIDIA recommends caching models locally on your cluster. Caching a model improves microservice startup time, and when deployments scale to multiple NIM microservice pods, a single cached model can serve all of them. This is achieved through a persistent volume backed by network storage.

For single node clusters, the Local Path Provisioner from Rancher Labs is sufficient for research and development. For production, NVIDIA recommends installing a provisioner that provides a network storage class.

Prerequisites#

  • Installed the NVIDIA NIM Operator.

  • A persistent volume provisioner that uses network storage such as NFS, S3, or vSAN. The models are downloaded and stored in persistent storage.

    You can create a PVC and specify the name when you create the NIM cache resource, or you can request that the Operator create the PVC.

  • The required image pull secrets with your NVIDIA NGC API Key (see the example commands after this list).

  • The model name of the NVIDIA NIM that you want to cache. The sample manifests on this page show commonly used container images in the spec.source.ngc.modelPuller field, but you can update this field to any supported NIM. When selecting a model, check the resource requirements and supported GPUs for the model that you plan to use. This information is typically available from the model card (on build.nvidia.com) or model overview (on NVIDIA NGC) pages. Refer to the Platform Support page for details on supported architectures, or refer to the NVIDIA NIM documentation for more details on NIM.

    You can learn about available models from the following sources:

    • Browse models at NVIDIA AI Foundation Models. To view models that you can run anywhere, click NIM Type and then run anywhere. Use the search box to filter for specific NIM microservices.

    • Browse NVIDIA NGC Catalog containers. Use the search box to find the container you are looking for, or select the NVIDIA NIM checkbox to view all NIM microservices.

    • Run ngc registry image list "nim/*" to display NIM images. Refer to ngc registry in the NVIDIA NGC CLI User Guide for information about the command.

  • To cache models or datasets from Hugging Face Hub, you must have a Hugging Face Hub account and a user access token, and the models or datasets must be accessible to your account. Create a Kubernetes secret with your Hugging Face Hub user access token and reference it in the spec.source.hf.authSecret field (see the example commands after this list).
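
The following commands sketch one way to create these secrets in the nim-service namespace used later on this page (create the namespace first). The secret names (ngc-secret, ngc-api-secret, hf-auth) match the sample manifests on this page; the secret keys NGC_API_KEY and HF_TOKEN are assumptions to verify against your NIM Operator version:

# Image pull secret for nvcr.io; the NGC registry username is always $oauthtoken.
$ kubectl create secret -n nim-service docker-registry ngc-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="${NGC_API_KEY}"

# Generic secret that the caching job uses to authenticate to NGC (assumed key name).
$ kubectl create secret -n nim-service generic ngc-api-secret \
    --from-literal=NGC_API_KEY="${NGC_API_KEY}"

# Generic secret with your Hugging Face Hub user access token (assumed key name).
$ kubectl create secret -n nim-service generic hf-auth \
    --from-literal=HF_TOKEN="${HF_TOKEN}"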

About the NIM Cache Custom Resource Definition#

A NIM cache is a Kubernetes custom resource, nimcaches.apps.nvidia.com. You create and delete NIM cache resources to manage model caching.

NIM Cache Configuration#

If you delete a NIM cache resource that was created with spec.storage.pvc.create: true, the NIM Operator deletes the persistent volume (PV) and persistent volume claim (PVC).
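
If you want the cached models to outlive the NIM cache resource, one approach is to create the PVC yourself and reference it by name, so the Operator does not own or delete it. A minimal sketch, assuming a pre-created PVC named model-store:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
  storage:
    pvc:
      create: false       # the Operator does not create or delete this PVC
      name: model-store   # pre-created PVC; required when create is false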

Refer to the following list for information about the commonly modified fields:

spec.certConfig
  Deprecated. Use spec.proxy instead. Specifies custom CA certificates that might be required in environments with an HTTP proxy. Refer to Proxy Support for more information.
  Default: None

spec.env
  Specifies environment variable names and values for the caching job.
  Default: None

spec.groupID
  Specifies the group for the pods. This value is used to set the security context of the pod in the runAsGroup and fsGroup fields.
  Default: 2000

spec.nodeSelector
  Specifies node selector labels to schedule the caching job.
  Default: None

spec.proxy.certConfigMap
  Specifies the name of the ConfigMap with CA certificates for your proxy server.
  Default: None

spec.proxy.httpProxy
  Specifies the address of a proxy server to use for outbound HTTP requests.
  Default: None

spec.proxy.httpsProxy
  Specifies the address of a proxy server to use for outbound HTTPS requests.
  Default: None

spec.proxy.noProxy
  Specifies a comma-separated list of domain names, IP addresses, or IP ranges for which proxying is bypassed.
  Default: None

spec.resources
  Specifies the resource requirements for the pods.
  Default: None

spec.runtimeClassName
  Specifies the container runtime class to use for running NIM with allocated NVIDIA GPUs. If not set, the default nvidia runtime class, which is created by the NVIDIA GPU Operator, is assigned automatically.
  Default: None

spec.source.dataStore.revision
  Specifies the revision of the object to cache: a commit hash, branch name, or tag.

  For example, you can start a training job, such as in the NeMo Data Flywheel Jupyter notebook, and then create a NIM cache with a revision:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: meta-llama3-1b-instruct-datastore-e2e
    spec:
      source:
        dataStore:
          endpoint: http://10.105.55.171:8000/v1/hf
          modelName: "llama-3.2-1b-xlam-run1" # default/llama-3-1b-instruct model must be present in NeMo DataStore
          namespace: xlam-tutorial-ns
          authSecret: hf-auth
          modelPuller: nvcr.io/nvidia/nemo-microservices/nds-v2-huggingface-cli:25.04
          pullSecret: ngc-secret
          revision: "cust-3VpkN1ve1GMkwsYqEptoij"
      storage:
        pvc:
          create: true
          storageClass: ""
          size: "50Gi"
          volumeAccessMode: ReadWriteOnce

  Default: None

spec.storage.pvc.annotations
  Annotations to add to the PVC that the NIM Operator creates.
  Default: None

spec.storage.pvc.create
  When set to true, the Operator creates the PVC. If you delete a NIM cache resource and this field was set to true, the Operator deletes the PVC and the cached models.
  Default: false

spec.storage.pvc.name
  Specifies the PVC name. This field is required if you specify create: false.
  Default: The NIM cache resource name with a -pvc suffix.

spec.storage.pvc.size
  Specifies the size, in Gi, of the PVC to create. This field is required if you specify create: true.
  Default: None

spec.storage.pvc.storageClass
  Specifies the storage class of the PVC to create. Leave empty to use your cluster’s default StorageClass.
  Default: None

spec.storage.pvc.subPath
  Specifies a subpath to create on the PVC; model profiles are cached in that directory.
  Default: None

spec.storage.pvc.volumeAccessMode
  Specifies the access mode of the PVC to create.
  Default: None

spec.tolerations
  Specifies the tolerations for the caching job.
  Default: None

spec.userID
  Specifies the user ID for the pod. This value is used to set the security context of the pod in the runAsUser field.
  Default: 1000
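
As an illustration of the scheduling fields above, the following sketch pins the caching job to labeled nodes and tolerates a taint on them. The label and taint keys are hypothetical placeholders, not values that the Operator requires:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
  nodeSelector:
    storage-tier: fast      # hypothetical node label
  tolerations:
  - key: dedicated          # hypothetical taint on the selected nodes
    operator: Equal
    value: nim-cache
    effect: NoSchedule
  storage:
    pvc:
      create: true
      size: "50Gi"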

Caching Non-LLM, LLM-Specific, and Multi-LLM NIM#

NIM Cache supports different types of NIM, including non-LLM, LLM-Specific, and multi-LLM NIM. Each type has different caching source configuration options.

Select your NIM type for detailed caching instructions:

  • Non-LLM NIM: Cover a wide variety of domains, such as retrieval, vision, speech, biology, and safety and moderation. Refer to Caching Non-LLM NIM.

  • LLM-Specific NIM: Focus on individual Large Language Models or model families, offering maximum performance. Refer to Caching LLM-Specific NIM (cache-llm.html#caching-llm-specific-nim).

  • Multi-LLM Compatible NIM: Enable the deployment of a broad range of Large Language Models, offering maximum flexibility. Refer to Caching Multi-LLM Compatible NIM (cache-llm.html#caching-multi-llm-nim).

Supported Sources and Protocols#

You can easily deploy custom, fine-tuned models on NIM. NIM automatically builds an optimized TensorRT-LLM engine locally from weights in the Hugging Face format.

You can pull models from a variety of sources using various protocols:

  • For all NIM microservices, NVIDIA NGC Catalog is supported.

  • For Multi-LLM NIM microservices, the following registry types are also supported:

    • Registries using the Hugging Face Protocol, such as

      • Hugging Face Hub Data Store

      • NVIDIA NeMo Data Store

    • Local File Data Store

    Refer to Caching Multi-LLM Compatible NIM for examples; a sketch of a Hugging Face Hub source also follows the note below.

  • For LLM-Specific NIM microservices, the following protocols are also supported:

    • NGC Mirrored Local Model Registries (S3, HTTPS, JFrog)

    Refer to Caching LLM-Specific NIM for examples.

Note

Each cache can only be configured to pull from one source.
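
For orientation, the following sketch shows what a Hugging Face Hub source might look like, modeled on the dataStore example earlier on this page. Apart from spec.source.hf.authSecret, which the prerequisites mention, treat the field names and the puller image as assumptions to verify against the Caching Multi-LLM Compatible NIM examples:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: llama-3.2-1b-instruct-hf
spec:
  source:
    hf:                                  # assumed layout, mirroring the dataStore source
      endpoint: https://huggingface.co   # Hugging Face Hub endpoint
      namespace: meta-llama              # organization that hosts the model
      modelName: Llama-3.2-1B-Instruct   # model repository to cache
      authSecret: hf-auth                # secret holding your user access token
      modelPuller: nvcr.io/nvidia/nemo-microservices/nds-v2-huggingface-cli:25.04  # assumed puller image
      pullSecret: ngc-secret
  storage:
    pvc:
      create: true
      size: "50Gi"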

Example Procedure#

Summary#

To cache a NIM model, follow these steps:

Note

Ensure you have completed the prerequisites.

  1. Create a namespace.

  2. Create and configure a NIM Cache custom resource.

  3. Optional: View information about the caching progress.

1. Create a Namespace#

$ kubectl create namespace nim-service

2. Create and Configure a NIM Cache Custom Resource#

  1. Create a file, such as cache-all.yaml, with contents like the following example:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: meta-llama3-8b-instruct
    spec:
      source:
        ngc:
          modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model:
            engine: tensorrt_llm
            tensorParallelism: "1"
      storage:
        pvc:
          create: true
          storageClass: <storage-class-name>
          size: "50Gi"
          volumeAccessMode: ReadWriteMany
      resources: {}
    ---
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: nv-embedqa-1b-v2
    spec:
      source:
        ngc:
          modelPuller: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.3.1
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model:
            engine: tensorrt_llm
      storage:
        pvc:
          create: true
          storageClass: <storage-class-name>
          size: "50Gi"
          volumeAccessMode: ReadWriteMany
      resources: {}
    ---
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: nv-rerankqa-1b-v2
    spec:
      source:
        ngc:
          modelPuller: nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:1.3.1
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model:
            engine: tensorrt_llm
      storage:
        pvc:
          create: true
          storageClass: <storage-class-name>
          size: "50Gi"
          volumeAccessMode: ReadWriteMany
      resources: {}
    
  2. Apply the manifest:

    $ kubectl apply -n nim-service -f cache-all.yaml
    

3. Optional: View Information About the Caching Progress#

  • Confirm a persistent volume and claim are created:

    $ kubectl get -n nim-service pvc,pv
    
    Example output
    NAME                                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
    persistentvolumeclaim/meta-llama3-8b-instruct-pvc   Bound    pvc-1d3f8d48-660f-4e44-9796-c80a4ed308ce   50Gi       RWX            nfs-client     <unset>                 10m
    
    NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                     STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
    persistentvolume/pvc-1d3f8d48-660f-4e44-9796-c80a4ed308ce   50Gi       RWX            Delete           Bound    nim-service/meta-llama3-8b-instruct-pvc   nfs-client     <unset>                          10m
    
  • View the status of the NIM cache resources:

    $ kubectl get nimcaches.apps.nvidia.com -n nim-service
    
    Example output
    NAME                      STATUS   PVC                           AGE
    meta-llama3-8b-instruct   Ready    meta-llama3-8b-instruct-pvc   2024-09-19T13:20:53Z
    nv-embedqa-1b-v2          Ready    nv-embedqa-1b-v2-pvc          2024-09-18T21:11:37Z
    nv-rerankqa-1b-v2         Ready    nv-rerankqa-1b-v2-pvc         2024-09-18T21:11:37Z
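
  • Optionally, block until a cache reports Ready, which is useful in scripts. A minimal sketch, relying on the .status.state field that appears in the kubectl describe output later on this page:

    $ kubectl wait --for=jsonpath='{.status.state}'=Ready \
        nimcache/meta-llama3-8b-instruct -n nim-service --timeout=30m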
    

Support for Advanced Configurations#

LoRA Models and Adapters#

NVIDIA NIM for LLMs supports LoRA parameter-efficient fine-tuning (PEFT) adapters trained by the NeMo Framework and Hugging Face Transformers libraries.

Refer to LoRA Models and Adapters for detailed usage instructions.

Air-Gapped and Proxy Environments#

NVIDIA NIM for large language models (LLMs) supports serving models in an air-gapped system (also known as air wall, air-gapping, or disconnected network). In an air-gapped system, you can run a NIM with no internet connection and no connection to the NGC registry or Hugging Face Hub. You have two options for air-gapped deployment: accessing NGC through a proxy or serving models from local assets.

Refer to Air-Gapped Environments for detailed usage instructions.
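
If the cluster reaches NGC through a proxy, you can set the spec.proxy fields from the configuration list above on the NIM cache. A minimal sketch, where the proxy address and the ca-cert-configmap ConfigMap name are hypothetical placeholders:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
  proxy:
    httpProxy: http://proxy.example.com:3128     # hypothetical proxy address
    httpsProxy: http://proxy.example.com:3128    # hypothetical proxy address
    noProxy: localhost,127.0.0.1,.cluster.local  # bypass proxying for in-cluster traffic
    certConfigMap: ca-cert-configmap             # hypothetical ConfigMap with your CA certificates
  storage:
    pvc:
      create: true
      size: "50Gi"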

Caching Locally Built LLM NIM Engines#

The NIM Operator supports the NIM Build custom resource, which lets you build and cache model engines before starting a NIM deployment. This improves startup times and reduces resource usage during NIM deployments and autoscaling, making deployments more predictable.

Refer to Caching Locally Built LLM NIM Engines for detailed usage instructions.

Displaying the NIM Cache Status#

Run the following command to display the NIM cache status:

$ kubectl get nimcaches.apps.nvidia.com -n nim-service
Example output
NAME                        STATUS    PVC                             AGE
meta-llama3-8b-instruct     Ready     meta-llama3-8b-instruct         2024-08-09T20:54:28Z
nv-embedqa-e5-v5            Ready     nv-embedqa-e5-v5-pvc            2024-08-09T20:54:28Z
nv-rerankqa-mistral-4b-v3   Ready     nv-rerankqa-mistral-4b-v3-pvc   2024-08-09T20:54:28Z

The NIM cache object can report the following statuses:

Failed
  The job failed to download and cache the model profile.

InProgress
  The job is downloading the model profile from NGC.

NotReady
  The job is not ready. This status can be reported shortly after creating the NIM cache resource, while the image for the pod is downloading from NGC. For more information, run kubectl describe nimcache -n nim-service <nim-cache-name>.

Pending
  The job is created, but has not yet started and become active.

PVC-Created
  The Operator created a PVC for the model profile cache because you set spec.storage.pvc.create: true.

Ready
  The job downloaded and cached the model profile.

Started
  The Operator created a job to download the model profile from NGC.

Displaying Cached Model Profiles#

To view the .status.profiles field of the custom resource, use the following command, replacing meta-llama3-8b-instruct with your NIM cache name.

$ kubectl get nimcaches.apps.nvidia.com -n nim-service \
    meta-llama3-8b-instruct -o=jsonpath="{.status.profiles}" | jq .
Example output
[
  {
    "config": {
      "feat_lora": "false",
      "llm_engine": "tensorrt_llm",
      "precision": "bf16",
      "tp": "1",
      "trtllm_buildable": "true"
    },
    "name": "7cc8597690a35aba19a3636f35e7f1c7e7dbc005fe88ce9394cad4a4adeed414"
  }
]

The output shows an array of cached profiles, including the profile name for identification.
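
To list only the profile names, you can narrow the JSONPath, as in this sketch:

$ kubectl get nimcaches.apps.nvidia.com -n nim-service \
    meta-llama3-8b-instruct -o=jsonpath="{.status.profiles[*].name}"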

Troubleshooting#

This section explains some common troubleshooting steps to identify issues with caching models.

NIM Pods Show ContainerCreating Status for a Long Time#

After starting a NIM cache, you can check the cache pod status by running the following command:

$ kubectl get pods -n nim-service
Example output
meta-llama3-8b-instruct-pod         0/1     ContainerCreating   0          2m33s
nv-embedqa-1b-v2-pod                0/1     ContainerCreating   0          2m33s
nv-rerankqa-1b-v2-pod               0/1     ContainerCreating   0          2m33s

You might notice that a NIM cache pod shows a status of ContainerCreating for several minutes. This can happen when the NIM container image takes a long time to download, and it can cause your NIM cache resource to show no status or a status of NotReady while the container image downloads. Refer to NIM Cache Reports No Status for more troubleshooting details.
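
To confirm that the delay is the image download, describe the pod and review the events; the pod name below is from the preceding example output:

$ kubectl describe pod -n nim-service meta-llama3-8b-instruct-pod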

NIM Cache Reports No Status#

If you run kubectl get nimcache -n nim-service and the output does not report a status, perform the following actions to get more information:

  • Determine the state of the caching jobs:

    $ kubectl get jobs -n nim-service
    
    Example output
    NAME                            COMPLETIONS   DURATION   AGE
    meta-llama3-8b-instruct-job     1/1           8s         2m57s
    

    View the logs from the job with a command like kubectl logs -n nim-service job/meta-llama3-8b-instruct-job.

    If the caching job is no longer available, delete the NIM cache resource and reapply the manifest.

  • Describe the NIM cache resource and review the conditions:

    $ kubectl describe nimcache -n nim-service <nim-cache-name>
    
    Example output
    Status:
      Conditions:
        Last Transition Time:  2025-04-17T15:47:29Z
        Message:               The PVC has been created for caching NIM model
        Reason:                PVCCreated
        Status:                True
        Type:                  NIM_CACHE_PVC_CREATED
        Last Transition Time:  2025-04-17T15:50:36Z
        Message:
        Reason:                Reconciled
        Status:                False
        Type:                  NIM_CACHE_RECONCILE_FAILED
        Last Transition Time:  2025-04-17T15:50:36Z
        Message:               The Job to cache NIM is in pending state
        Reason:                JobPending
        Status:                True
        Type:                  NIM_CACHE_JOB_PENDING
        Last Transition Time:  2025-04-17T15:50:36Z
        Message:               The Job to cache NIM has successfully completed
        Reason:                JobCompleted
        Status:                True
        Type:                  NIM_CACHE_JOB_COMPLETED
        
    ...
     
      State:           Ready
    Events:
      Type     Reason           Age                    From                 Message
      ----     ------           ----                   ----                 -------
      Warning  ReconcileFailed  6m                     nimcache-controller  NIMCache meta-llama3-8b-instruct reconcile failed, msg: Pod "meta-llama3-8b-instruct-pod" not found
      Normal   Started          3m28s                  nimcache-controller  NIMCache meta-llama3-8b-instruct reconcile success, new state: Started
      Normal   InProgress       3m28s                  nimcache-controller  NIMCache meta-llama3-8b-instruct reconcile success, new state: InProgress
      Normal   Pending          2m53s (x2 over 3m28s)  nimcache-controller  NIMCache meta-llama3-8b-instruct reconcile success, new state: Pending
      Normal   Ready            2m53s                  nimcache-controller  NIMCache meta-llama3-8b-instruct reconcile success, new state: Ready
    

    The preceding output shows a NIM cache resource that eventually succeeded in downloading and caching a model profile.

    The NIM_CACHE_RECONCILE_FAILED condition and the ReconcileFailed event reason were reported during the interval after the Operator created the caching job but before the pod was running, while the image was still downloading from NGC. In the output, the status for that condition is set to False to indicate that the condition is no longer accurate.

  • View the Operator logs by running kubectl logs -n nim-operator -l app.kubernetes.io/name=k8s-nim-operator.

NIM Cache Event Failures Report ReconcileFailed But NIM Pod Is ContainerCreating#

Some NIM container images can take a long time to download due to image size or network connectivity. If this happens, the NIM cache can report no status, a status of NotReady, or event failures such as ReconcileFailed while the pods for the caching job are still being created.

Run kubectl get pods -n <nim-namespace> to check the status of the NIM cache pod. After the container image download completes, the pod starts normally and the NIM cache status updates. Refer to Displaying the NIM Cache Status for more details on statuses.

Deleting a NIM Cache#

To delete a NIM cache, perform the following steps.

  1. View the NIM cache custom resources:

$ kubectl get nimcaches.apps.nvidia.com -A
Example output
NAMESPACE     NAME                      STATUS   PVC           AGE
nim-service   meta-llama3-8b-instruct   Ready    model-store   2024-08-08T13:14:30Z

  2. Delete the custom resource:

$ kubectl delete nimcache -n nim-service meta-llama3-8b-instruct

If the Operator created the PVC when you created the NIM cache, the Operator deletes the PVC and the cached model profiles. You can determine if the Operator created the PVC by running a command like the following example:

$ kubectl get nimcaches.apps.nvidia.com -n nim-service \
    -o=jsonpath='{range .items[*]}{.metadata.name}: {.spec.storage.pvc.create}{"\n"}{end}'
Example output
meta-llama3-8b-instruct: true

Next Steps#

  • Deploy NIM microservices either by adding NIM service custom resources or managing several services in a single NIM pipeline custom resource.