Caching Models
About Model Profiles and Caching
The models for NVIDIA NIM microservices use model engines that are tuned for a specific NVIDIA GPU model, number of GPUs, precision, and so on. NVIDIA produces model engines for several popular combinations, and these are referred to as model profiles.
The NIM microservices support automatic profile selection by determining the GPU model and count on the node and attempting to match the optimal model profile. Alternatively, NIM microservices support running a specified model profile, but this requires that you review the profiles and know the profile ID. For more information, refer to Model Profiles in the NVIDIA NIM for LLMs documentation.
The Operator complements model selection by the NIM microservices in the following ways:
Support for caching and running a specified model profile ID, like the NIM microservices.
Support for caching all model profiles.
Support for influencing model selection by specifying the engine, precision, tensor parallelism, and so on to match. The NIM cache custom resource provides the way to specify the model selection criteria.
NVIDIA recommends caching models for NVIDIA NIM microservices. Caching a model improves the microservice startup time. For deployments that scale to more than one NIM microservice pod, a single model cached on a persistent volume with a network storage class can serve multiple pods.
For single-node clusters, the Local Path Provisioner from Rancher Labs is sufficient for research and development. For production, NVIDIA recommends installing a provisioner that provides a network storage class.
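Before you create a NIM cache resource, you can check which storage classes are available in your cluster; the names in the output depend on the provisioners that are installed:
$ kubectl get storageclass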
About the NIM Cache Custom Resource Definition
A NIM cache is a Kubernetes custom resource, nimcaches.apps.nvidia.com.
You create and delete NIM cache resources to manage model caching.
When you create a NIM cache resource, the NIM Operator starts a pod that lists the available model profiles. The Operator creates a config map of the model profiles.
If you specify one or more model profile IDs to cache, the Operator starts a job that caches the model profiles that you specified.
If you did not specify model profile IDs, but do specify engine: tensorrt_llm or engine: tensorrt, the Operator attempts to match the model profiles with the GPUs on the nodes in the cluster.
The Operator uses the value of the nvidia.com/gpu.product node label that is set by Node Feature Discovery, as shown in the example after this list.
You can let the Operator automatically detect the model profiles to cache, or you can constrain the model profiles by specifying values for spec.source.ngc.model, such as the engine, GPU model, and so on, that must match the model profile.
If you delete a NIM cache resource that was created with spec.storage.pvc.create: true, the NIM Operator deletes the persistent volume (PV) and persistent volume claim (PVC).
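For example, if Node Feature Discovery is running, a command like the following shows the GPU product label on each node:
$ kubectl get nodes -L nvidia.com/gpu.product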
Refer to the following sample manifest:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: "50Gi"
      volumeAccessMode: ReadWriteMany
  resources: {}
Refer to the following table for information about the commonly modified fields:
Field | Description | Default Value
---|---|---
spec.certConfig | Specifies custom CA certificates that might be required in environments with an HTTP proxy. Refer to Supporting Custom CA Certificates for more information. | None
spec.nodeSelector | Specifies node selector labels to schedule the caching job. | None
spec.source.ngc.model.engine | Specifies a model caching constraint based on the engine. Common values are tensorrt_llm, vllm, tensorrt, and onnx. Each NIM microservice determines the supported engines. Refer to the microservice documentation for the latest information. By default, the caching job matches model profiles for all engines. | None
spec.source.ngc.model.gpus | Specifies a list of model caching constraints to use the specified GPU model, by PCI ID or product name. By default, the caching job detects all GPU models in the cluster and matches model profiles for all GPUs. For example, specifying gpus with ids: ["26b5"] requests a model profile that is compatible with an NVIDIA L40S, and specifying gpus with product: "a100" requests a model profile that is compatible with an NVIDIA A100. If GPU Operator or Node Feature Discovery is running on your cluster, you can determine the PCI IDs and product names for the GPU models on your nodes by viewing the node labels that Node Feature Discovery sets. | None
spec.source.ngc.model.lora | When set to true, the caching job matches model profiles that support LoRA adapters. Refer to Using LoRA Models and Adapters for more information. | false
spec.source.ngc.model.precision | Specifies the model profile quantization to match. Common values are fp16, bf16, and fp8. | None
spec.source.ngc.model.profiles | Specifies an array of model profiles to cache. When you specify this field, automatic profile selection is disabled and all other model constraints, such as the engine and GPUs, are ignored. For example, specifying profiles with a value such as 8835c31... requests that specific model profile. You can determine the model profiles by running the list-model-profiles command. You can specify all to cache all model profiles. | None
spec.source.ngc.model.qosProfile | Specifies the model profile quality of service to match. Values are latency and throughput. | None
spec.source.ngc.model.tensorParallelism | Specifies the model profile tensor parallelism to match. The node that runs the NIM microservice and serves the model must have at least the specified number of GPUs. Common values are "1", "2", and "4". | None
spec.source.ngc.modelPuller | Specifies the container image that can cache model profiles. | None
spec.storage.pvc.create | When set to true, the Operator creates the PVC. If you delete the NIM cache resource, the Operator also deletes the PVC and the cached model profiles. | false
spec.storage.pvc.name | Specifies the PVC name. This field is required if you specify an existing PVC instead of having the Operator create one. | The NIM cache resource name with a -pvc suffix
spec.storage.pvc.size | Specifies the size, in Gi, for the PVC to create. This field is required if you specify create: true. | None
spec.storage.pvc.storageClass | Specifies the storage class for the PVC to create. | None
spec.storage.pvc.subPath | Specifies a subpath to create on the PVC and cache the model profiles in that directory. | None
spec.storage.pvc.volumeAccessMode | Specifies the access mode for the PVC to create. | None
spec.tolerations | Specifies the tolerations for the caching job. | None
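As a sketch of how several constraints combine, the following partial specification restricts matching to the TensorRT-LLM engine, fp16 precision, a tensor parallelism of two, and NVIDIA A100 GPUs. The values are examples only; adjust them for your microservice and cluster:
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        precision: fp16
        tensorParallelism: "2"
        gpus:
        - product: "a100"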
Prerequisites
Installed the NVIDIA NIM Operator.
A persistent volume provisioner that uses network storage such as NFS, S3, vSAN, and so on. The models are downloaded and stored in persistent storage.
You can create a PVC and specify the name when you create the NIM cache resource or you can request that the Operator creates the PVC.
The sample manifests show commonly used container images for the spec.source.ngc.modelPuller field. To cache alternative models, consider the following approaches to finding container image names:
Browse models at https://build.nvidia.com/explore/discover. For models that you can run anywhere, click the Docker tab to view the image name and tag.
Browse https://catalog.ngc.nvidia.com/containers. You can filter the images by enabling the NVIDIA NIM checkbox.
Run ngc registry image list "nim/*" to display NIM images. Refer to ngc registry in the NVIDIA NGC CLI User Guide for information about the command.
Refer to the NVIDIA NIM documentation page.
Procedure
Create the namespace:
$ kubectl create namespace nim-service
Add secrets that use your NGC API key.
Add a Docker registry secret for downloading the NIM container image from NVIDIA NGC:
$ kubectl create secret -n nim-service docker-registry ngc-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password=<ngc-api-key>
Add a generic secret that the model puller init container uses to download the model from NVIDIA NGC:
$ kubectl create secret -n nim-service generic ngc-api-secret \
    --from-literal=NGC_API_KEY=<ngc-api-key>
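Optionally, confirm that both secrets exist before you continue:
$ kubectl get secrets -n nim-service ngc-secret ngc-api-secret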
Create a file, such as cache-all.yaml, with contents like the following example:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: "50Gi"
      volumeAccessMode: ReadWriteMany
  resources: {}
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: nv-embedqa-e5-v5
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.0.4
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        profiles:
        - all
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: "50Gi"
      volumeAccessMode: ReadWriteMany
  resources: {}
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: nv-rerankqa-mistral-4b-v3
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3:1.0.4
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        profiles:
        - all
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: "50Gi"
      volumeAccessMode: ReadWriteMany
  resources: {}
Apply the manifest:
$ kubectl apply -n nim-service -f cache-all.yaml
Optional: View information about the caching progress.
Confirm a persistent volume and claim are created:
$ kubectl get -n nim-service pvc,pv
Example Output
NAME                                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/meta-llama3-8b-instruct-pvc   Bound   pvc-1d3f8d48-660f-4e44-9796-c80a4ed308ce   50Gi   RWO   nfs-client   <unset>   10m

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                     STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
persistentvolume/pvc-1d3f8d48-660f-4e44-9796-c80a4ed308ce   50Gi   RWO   Delete   Bound   nim-service/meta-llama3-8b-instruct-pvc   nfs-client   <unset>            10m
View the NIM cache resources to check the status:
$ kubectl get nimcaches.apps.nvidia.com -n nim-service
Example Output
NAME                        STATUS   PVC                             AGE
meta-llama3-8b-instruct     Ready    meta-llama3-8b-instruct-pvc     2024-09-19T13:20:53Z
nv-embedqa-e5-v5            Ready    nv-embedqa-e5-v5-pvc            2024-09-18T21:11:37Z
nv-rerankqa-mistral-4b-v3   Ready    nv-rerankqa-mistral-4b-v3-pvc   2024-09-18T21:11:37Z
Displaying the NIM Cache Status
Display the NIM cache status:
$ kubectl get nimcaches.apps.nvidia.com -n nim-service
Example Output
NAME                        STATUS   PVC                             AGE
meta-llama3-8b-instruct     Ready    meta-llama3-8b-instruct         2024-08-09T20:54:28Z
nv-embedqa-e5-v5            Ready    nv-embedqa-e5-v5-pvc            2024-08-09T20:54:28Z
nv-rerankqa-mistral-4b-v3   Ready    nv-rerankqa-mistral-4b-v3-pvc   2024-08-09T20:54:28Z
The NIM cache object can report the following statuses:
Status | Description
---|---
Failed | The job failed to download and cache the model profile.
InProgress | The job is downloading the model profile from NGC.
NotReady | The job is not ready. This status can be reported shortly after creating the NIM cache resource while the image for the pod is downloaded from NGC. For more information, run kubectl describe pod -n nim-service <caching-pod-name>.
Pending | The job is created, but has not yet started and become active.
PVC-Created | The Operator creates a PVC for the model profile cache if you set spec.storage.pvc.create to true.
Ready | The job downloaded and cached the model profile.
Started | The Operator creates a job to download the model profile from NGC.
Displaying Cached Model Profiles
View the .status.profiles field of the custom resource:
$ kubectl get nimcaches.apps.nvidia.com -n nim-service \
    meta-llama3-8b-instruct -o=jsonpath="{.status.profiles}" | jq .
Example Output
[
  {
    "config": {
      "feat_lora": "false",
      "llm_engine": "vllm",
      "precision": "fp16",
      "tp": "2"
    },
    "model": "meta/llama3-8b-instruct",
    "name": "19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f",
    "release": "1.0.3"
  }
]
Caching Models in Air-Gapped Environments
You can run a NIM microservice container on a host with network access, cache the model, create a container with the model cache, and then run a job to copy the model cache to a PVC.
For more information about the steps in the following procedure that run the NIM container and download model profiles, refer to Serving Models from Local Assets in the NVIDIA NIM for LLMs documentation.
The following sections show one way to cache models and add them to a PVC.
Supporting Custom CA Certificates
If your cluster has an HTTP proxy that requires custom certificates, you can add them in a config map and mount them into the NIM cache job. You can use self-signed or custom CA certificates.
Create a config map with the certificates:
$ kubectl create configmap -n nim-service ca-certs --from-file=<path-to-cert-file-1> --from-file=<path-to-cert-file-2>
When you create the NIM cache resource, specify the name of the config map and the path to mount the certificates in the container:
spec:
  certConfig:
    name: ca-certs
    mountPath: /usr/local/share/ca-certificates/
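You can confirm that the certificates are present in the config map; each key corresponds to one of the files that you supplied:
$ kubectl get configmap -n nim-service ca-certs -o yaml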
Downloading the Models
Export your NGC API key as an environment variable:
$ export NGC_API_KEY=M2...
Run the NIM container image locally, list the model profiles, and download the model profile.
Start the container:
$ mkdir cache
$ docker run --rm -it \
    -v ./cache:/opt/nim/.cache \
    -u $(id -u):$(id -g) \
    -e NGC_API_KEY \
    nvcr.io/nim/meta/llama3-8b-instruct:1.0.3 \
    bash
Replace the container image and tag with the NIM microservice that you want to cache models for.
List the model profiles:
$ list-model-profiles
Partial Output
...
MODEL PROFILES
- Compatible with system and runnable:
  - 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
...
Download and cache the model profiles:
$ download-to-cache --profile 1903...
Exit the container:
$ exit
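The downloaded profile remains in the cache directory on the host. Before you build the container image, you can confirm that files were written:
$ du -sh ./cache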
Copying the Models to a Container
Make a Dockerfile that copies the model profiles into the container:
FROM nvcr.io/nvidia/base/ubuntu:22.04_20240212
COPY cache /cache
Build and push the container with the model profiles to a container registry that the air-gapped cluster has access to:
$ docker build -t <private-registry-name>/<model-name>:<tag> .
$ docker push <private-registry-name>/<model-name>:<tag>
Copying the Models From the Container to the PVC
Optional: Create a PVC if you do not already have one:
Create a manifest file, such as model-cache-pvc.yaml, with contents like the following example:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
  namespace: nim-service
spec:
  # At least one access mode is required; adjust for your storage class.
  accessModes:
  - ReadWriteMany
  storageClassName: <storage-class>
  resources:
    requests:
      storage: 10Gi
Apply the manifest:
$ kubectl apply -f model-cache-pvc.yaml
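Confirm that the claim is created before you run the copy job:
$ kubectl get pvc -n nim-service model-cache-pvc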
Create a job that copies the model profiles to the PVC.
Create a manifest file, such as copy-cache-job.yaml, with contents like the following example:
apiVersion: batch/v1
kind: Job
metadata:
  name: copy-cache
  namespace: nim-service
spec:
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: copy-cache
        command:
        - "/bin/sh"
        args:
        - "-c"
        - "cd /cache && cp -R . /model-store && find /model-store"
        image: <private-registry-name>/<model-name>:<tag>
        imagePullPolicy: IfNotPresent
        securityContext:
          allowPrivilegeEscalation: false
          runAsGroup: 2000
          runAsNonRoot: true
          runAsUser: 1000
        volumeMounts:
        - mountPath: /model-store
          name: nim-cache-volume
          readOnly: false
      imagePullSecrets:
      - name: my-secret
      securityContext:
        fsGroup: 2000
        runAsNonRoot: true
        runAsUser: 1000
      volumes:
      - name: nim-cache-volume
        persistentVolumeClaim:
          claimName: model-cache-pvc
  backoffLimit: 4
Apply the manifest:
$ kubectl apply -f copy-cache-job.yaml
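Because the job runs find /model-store after the copy, you can verify the copied files by checking the job status and logs:
$ kubectl get job -n nim-service copy-cache
$ kubectl logs -n nim-service job/copy-cache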
Using LoRA Models and Adapters
NVIDIA NIM for LLMs supports LoRA parameter-efficient fine-tuning (PEFT) adapters trained by the NeMo Framework and Hugging Face Transformers libraries.
You must download the LoRA adapters manually and make them available to the NIM microservice. The following steps describe one way to meet the requirement.
Specify lora: true in the NIM cache manifest that you apply:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        tensorParallelism: "1"
        lora: true
        gpus:
        - product: "a100"
  storage:
    pvc:
      create: true
      size: "50Gi"
      storageClass: <storage-class>
      volumeAccessMode: ReadWriteMany
  resources: {}
Apply the manifest and wait until kubectl get nimcache -n nim-service meta-llama3-8b-instruct shows the NIM cache is Ready.
Use the NGC CLI or Hugging Face CLI to download the LoRA adapters.
Build a container that includes the CLIs, such as the following example:
FROM nvcr.io/nvidia/base/ubuntu:22.04_20240212

ARG NGC_CLI_VERSION=3.50.0

RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \
      wget \
      unzip \
      python3-pip

RUN useradd -m -s /bin/bash -u 1000 ubuntu
USER ubuntu

RUN wget --content-disposition --no-check-certificate \
      https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/${NGC_CLI_VERSION}/files/ngccli_linux.zip -O /tmp/ngccli_linux.zip && \
    unzip /tmp/ngccli_linux.zip -d ~ && \
    rm /tmp/ngccli_linux.zip

ENV PATH=/home/ubuntu/ngc-cli:$PATH

RUN pip install -U "huggingface_hub[cli]"
Push the container to a registry that the nodes in your cluster can access.
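For example, using the same placeholder registry, image name, and tag that the pod manifest below uses:
$ docker build -t <private-registry>/<image-name>:<image-tag> .
$ docker push <private-registry>/<image-name>:<image-tag>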
Apply a manifest like the following example that runs the container.
When you create your manifest, keep the following key considerations for the pod specification in mind:
Mount the same PVC that the NIM microservice accesses for the model.
Specify the same user ID and group ID that the NIM cache container used. The following manifest shows the default values.
Specify the NGC_CLI_API_KEY and NGC_CLI_ORG environment variables. The value for the organization might be different.
Start the pod in the nim-service namespace so that the pod can access the ngc-api-secret secret.
apiVersion: v1
kind: Pod
metadata:
  name: ngc-cli
spec:
  containers:
  - env:
    - name: NGC_CLI_API_KEY
      valueFrom:
        secretKeyRef:
          key: NGC_API_KEY
          name: ngc-api-secret
    - name: NGC_CLI_ORG
      value: "nemo-microservices/ea-participants"
    - name: NIM_PEFT_SOURCE
      value: "/model-store/loras"
    image: <private-registry>/<image-name>:<image-tag>
    command: ["sleep"]
    args: ["inf"]
    name: ngc-cli
    securityContext:
      capabilities:
        drop:
        - ALL
      runAsNonRoot: true
    volumeMounts:
    - mountPath: /model-store
      name: model-store
  restartPolicy: Never
  securityContext:
    fsGroup: 2000
    runAsGroup: 2000
    runAsUser: 1000
    seLinuxOptions:
      level: s0:c28,c2
  volumes:
  - name: model-store
    persistentVolumeClaim:
      claimName: meta-llama3-8b-instruct-pvc
Access the pod:
$ kubectl exec -it -n nim-service ngc-cli -- bash
The pod might report groups: cannot find name for group ID 2000. You can ignore the message.
From the terminal in the pod, download the LoRA adapters.
Make a directory for the LoRA adapters:
$ mkdir $NIM_PEFT_SOURCE
$ cd $NIM_PEFT_SOURCE
Download the adapters:
$ ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-math-v1"
$ ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-squad-v1"
Rename the directories to match the naming convention for the LoRA model directory structure:
$ mv llama3-8b-instruct-lora_vnemo-math-v1 llama3-8b-math
$ mv llama3-8b-instruct-lora_vnemo-squad-v1 llama3-8b-squad
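Listing the directory confirms the expected layout, assuming you downloaded only these two adapters:
$ ls $NIM_PEFT_SOURCE
llama3-8b-math  llama3-8b-squad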
Press Ctrl+D to exit the pod and then run kubectl delete pod -n nim-service ngc-cli.
When you create a NIM service instance, specify the NIM_PEFT_SOURCE environment variable:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  ...
  env:
  - name: NIM_PEFT_SOURCE
    value: "/model-store/loras"
After the NIM microservice is running, monitor the logs for records like the following:
{"level": "INFO", ..., "message": "LoRA models synchronizer successfully initialized!"}
{"level": "INFO", ..., "message": "Synchronizing LoRA models with local LoRA directory ..."}
{"level": "INFO", ..., "message": "Done synchronizing LoRA models with local LoRA directory"}
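One way to watch for these records is to follow the microservice logs; this example assumes the Operator creates a deployment named after the NIM service resource:
$ kubectl logs -n nim-service deploy/meta-llama3-8b-instruct -f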
The preceding steps are sample commands for downloading the NeMo format LoRA adapters. Refer to Parameter-Efficient Fine-Tuning in the NVIDIA NIM for LLMs documentation for information about using the Hugging Face Transformers format, the model directory structure, adapters for other models, and so on.
Troubleshooting
NIM Cache Reports No Status
If you run kubectl get nimcache -n nim-service and the output does not report a status, perform the following actions to get more information:
Determine the state of the caching jobs:
$ kubectl get jobs -n nim-service
Example Output
NAME                          COMPLETIONS   DURATION   AGE
meta-llama3-8b-instruct-job   1/1           8s         2m57s
View the logs from the job with a command like kubectl logs -n nim-service job/meta-llama3-8b-instruct-job.
If the caching job is no longer available, delete the NIM cache resource and reapply the manifest.
Describe the NIM cache resource and review the conditions:
$ kubectl describe nimcache -n nim-service <nim-cache-name>
Partial Output
Status:
  Conditions:
    Last Transition Time:  2024-09-19T13:20:53Z
    Message:               The PVC has been created for caching NIM model
    Reason:                PVCCreated
    Status:                True
    Type:                  NIM_CACHE_PVC_CREATED
    Last Transition Time:  2024-09-19T13:26:24Z
    Message:
    Reason:                Reconciled
    Status:                False
    Type:                  NIM_CACHE_RECONCILE_FAILED
    Last Transition Time:  2024-09-19T13:24:36Z
    Message:               The Job to cache NIM has been created
    Reason:                JobCreated
    Status:                True
    Type:                  NIM_CACHE_JOB_CREATED
    Last Transition Time:  2024-09-19T13:25:50Z
    Message:               The Job to cache NIM is in pending state
    Reason:                JobPending
    Status:                True
    Type:                  NIM_CACHE_JOB_PENDING
    Last Transition Time:  2024-09-19T13:25:50Z
    Message:               The Job to cache NIM has successfully completed
    Reason:                JobCompleted
    Status:                True
    Type:                  NIM_CACHE_JOB_COMPLETED
The preceding output shows a NIM cache resource that eventually succeeded in downloading and caching a model profile.
The NIM_CACHE_RECONCILE_FAILED condition was reported during the interval after the Operator created the caching job but before the pod was running, while the image was downloading from NGC. In the output, the status for that condition is set to False to indicate that the condition is no longer accurate.
View the Operator logs by running kubectl logs -n nim-operator -l app.kubernetes.io/name=k8s-nim-operator.
No Profiles Are Selected for Caching
If the NIM cache controller does not automatically select a model profile to cache, you can use two methods to view the model profiles that are available:
The cache controller copies the model profiles into a config map. You can review the config map with a command like the following example to identify a model profile that uses a community backend such as vLLM or ONNX.
$ kubectl get cm -n nim-service meta-llama3-8b-instruct-manifest -o yaml | less
Run the container on a host that is configured for Docker, NVIDIA Container Toolkit, and an NVIDIA GPU. Refer to the list-model-profiles command in the NVIDIA NIM for LLMs documentation.
After you determine a model profile that is compatible with your GPU, specify the model profile ID in the spec.source.ngc.model.profiles field and reapply the manifest.
Deleting a NIM Cache
To delete a NIM cache, perform the following steps.
View the NIM cache custom resources:
$ kubectl get nimcaches.apps.nvidia.com -A
Example Output
NAMESPACE     NAME                      STATUS   PVC           AGE
nim-service   meta-llama3-8b-instruct   ready    model-store   2024-08-08T13:14:30Z
Delete the custom resource:
$ kubectl delete nimcache -n nim-service meta-llama3-8b-instruct
If the Operator created the PVC when you created the NIM cache, the Operator deletes the PVC and the cached model profiles. You can determine if the Operator created the PVC by running a command like the following example:
$ kubectl get nimcaches.apps.nvidia.com -n nim-service \
    -o=jsonpath='{range .items[*]}{.metadata.name}: {.spec.storage.pvc.create}{"\n"}{end}'
Example Output
meta-llama3-8b-instruct: true
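If the Operator did not create the PVC, the cached model profiles remain on it. If you no longer need them, you can delete the claim yourself; substitute the name of the PVC that you created:
$ kubectl delete pvc -n nim-service <pvc-name>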
Next Steps
Deploy NIM microservices either by adding NIM service custom resources or by managing several services in a single NIM pipeline custom resource.