Caching Models
About Model Profiles and Caching
The models for NVIDIA NIM microservices use model engines that are tuned for a specific NVIDIA GPU model, number of GPUs, precision, and so on. NVIDIA produces model engines for several popular combinations, and these are referred to as model profiles.
The NIM microservices support automatic profile selection by determining the GPU model and count on the node and attempting to match the optimal model profile. Alternatively, NIM microservices support running a specified model profile, but this requires that you review the profiles and know the profile ID. For more information, refer to Model Profiles in the NVIDIA NIM for LLMs documentation.
The Operator complements model selection by the NIM microservices in the following ways:
Support for caching and running a specified model profile ID, like the NIM microservices.
Support for caching all model profiles.
Support for influencing model selection by specifying the engine, precision, tensor parallelism, and so on to match. The NIM cache custom resource provides the way to specify the model selection criteria.
NVIDIA recommends caching models for NVIDIA NIM microservices. Caching a model improves the microservice startup time. For deployments that scale to more than one NIM microservice pod, a single model cached on a persistent volume with a network storage class can serve multiple pods.
For single-node clusters, the Local Path Provisioner from Rancher Labs is sufficient for research and development. For production, NVIDIA recommends installing a provisioner that provides a network storage class.
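Before you create a NIM cache resource, you can check which storage classes are available in your cluster; the names in the output depend on the provisioners that are installed:
$ kubectl get storageclass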
About the NIM Cache Custom Resource Definition
A NIM cache is a Kubernetes custom resource, nimcaches.apps.nvidia.com.
You create and delete NIM cache resources to manage model caching.
When you create a NIM cache resource, the NIM Operator starts a pod that lists the available model profiles. The Operator creates a config map of the model profiles.
If you specify one or more model profile IDs to cache, the Operator starts a job that caches the model profiles that you specified.
If you did not specify model profile IDs, but do specify engine: tensorrt_llm or engine: tensorrt, the Operator attempts to match the model profiles with the GPUs on the nodes in the cluster.
The Operator uses the value of the nvidia.com/gpu.product node label that is set by Node Feature Discovery, as shown in the example after this list.
You can let the Operator automatically detect the model profiles to cache, or you can constrain the model profiles by specifying values for spec.source.ngc.model, such as the engine, GPU model, and so on, that must match the model profile.
If you delete a NIM cache resource that was created with spec.storage.pvc.create: true, the NIM Operator deletes the persistent volume (PV) and persistent volume claim (PVC).
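For example, if Node Feature Discovery is running, a command like the following shows the GPU product label on each node:
$ kubectl get nodes -L nvidia.com/gpu.product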
Refer to the following sample manifest:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: "50Gi"
      volumeAccessMode: ReadWriteMany
  resources: {}
Refer to the following table for information about the commonly modified fields:
Field | Description | Default Value
---|---|---
spec.certConfig | Specifies custom CA certificates that might be required in environments with an HTTP proxy. Refer to Supporting Custom CA Certificates for more information. | None
spec.nodeSelector | Specifies node selector labels to schedule the caching job. | None
spec.source.ngc.model.engine | Specifies a model caching constraint based on the engine. Common values are tensorrt_llm, vllm, tensorrt, and onnx. Each NIM microservice determines the supported engines. Refer to the microservice documentation for the latest information. By default, the caching job matches model profiles for all engines. | None
spec.source.ngc.model.gpus | Specifies a list of model caching constraints to use the specified GPU model, by PCI ID or product name. By default, the caching job detects all GPU models in the cluster and matches model profiles for all GPUs. For example, specifying gpus with ids: ["26b5"] requests a model profile that is compatible with an NVIDIA L40S, and specifying gpus with product: "a100" requests a model profile that is compatible with an NVIDIA A100. If GPU Operator or Node Feature Discovery is running on your cluster, you can determine the PCI IDs and product names for the GPU models on your nodes by viewing the node labels that Node Feature Discovery sets. | None
spec.source.ngc.model.lora | When set to true, the caching job matches model profiles that support LoRA adapters. Refer to Using LoRA Models and Adapters for more information. | false
spec.source.ngc.model.precision | Specifies the model profile quantization to match. Common values are fp16, bf16, and fp8. | None
spec.source.ngc.model.profiles | Specifies an array of model profiles to cache. When you specify this field, automatic profile selection is disabled and all other model constraints, such as the engine and GPUs, are ignored. For example, specifying profiles with a value such as 8835c31... requests that specific model profile. You can determine the model profiles by running the list-model-profiles command. You can specify all to cache all model profiles. | None
spec.source.ngc.model.qosProfile | Specifies the model profile quality of service to match. Values are latency and throughput. | None
spec.source.ngc.model.tensorParallelism | Specifies the model profile tensor parallelism to match. The node that runs the NIM microservice and serves the model must have at least the specified number of GPUs. Common values are "1", "2", and "4". | None
spec.source.ngc.modelPuller | Specifies the container image that can cache model profiles. | None
spec.storage.pvc.create | When set to true, the Operator creates the PVC. If you delete the NIM cache resource, the Operator also deletes the PVC and the cached model profiles. | false
spec.storage.pvc.name | Specifies the PVC name. This field is required if you specify an existing PVC instead of having the Operator create one. | The NIM cache resource name with a -pvc suffix
spec.storage.pvc.size | Specifies the size, in Gi, for the PVC to create. This field is required if you specify create: true. | None
spec.storage.pvc.storageClass | Specifies the storage class for the PVC to create. | None
spec.storage.pvc.subPath | Specifies a subpath to create on the PVC and cache the model profiles in that directory. | None
spec.storage.pvc.volumeAccessMode | Specifies the access mode for the PVC to create. | None
spec.tolerations | Specifies the tolerations for the caching job. | None
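As a sketch of how several constraints combine, the following partial specification restricts matching to the TensorRT-LLM engine, fp16 precision, a tensor parallelism of two, and NVIDIA A100 GPUs. The values are examples only; adjust them for your microservice and cluster:
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        precision: fp16
        tensorParallelism: "2"
        gpus:
        - product: "a100"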
Prerequisites
Installed the NVIDIA NIM Operator.
A persistent volume provisioner that uses network storage such as NFS, S3, vSAN, and so on. The models are downloaded and stored in persistent storage.
You can create a PVC and specify the name when you create the NIM cache resource or you can request that the Operator creates the PVC.
The sample manifests show commonly used container images for the spec.source.ngc.modelPuller field. To cache alternative models, consider the following approaches to finding container image names:
Browse models at https://build.nvidia.com/explore/discover. For models that you can run anywhere, click the Docker tab to view the image name and tag.
Browse https://catalog.ngc.nvidia.com/containers. You can filter the images by enabling the NVIDIA NIM checkbox.
Run ngc registry image list "nim/*" to display NIM images. Refer to ngc registry in the NVIDIA NGC CLI User Guide for information about the command.
Refer to the NVIDIA NIM documentation page.
Procedure
Create the namespace:
$ kubectl create namespace nim-service
Add secrets that use your NGC API key.
Add a Docker registry secret for downloading the NIM container image from NVIDIA NGC:
$ kubectl create secret -n nim-service docker-registry ngc-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password=<ngc-api-key>
Add a generic secret that the model puller init container uses to download the model from NVIDIA NGC:
$ kubectl create secret -n nim-service generic ngc-api-secret \
    --from-literal=NGC_API_KEY=<ngc-api-key>
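Optionally, confirm that both secrets exist before you continue:
$ kubectl get secrets -n nim-service ngc-secret ngc-api-secret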
Create a file, such as cache-all.yaml, with contents like the following example:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: "50Gi"
      volumeAccessMode: ReadWriteMany
  resources: {}
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: nv-embedqa-e5-v5
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.0.4
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        profiles:
        - all
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: "50Gi"
      volumeAccessMode: ReadWriteMany
  resources: {}
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: nv-rerankqa-mistral-4b-v3
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3:1.0.4
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        profiles:
        - all
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: "50Gi"
      volumeAccessMode: ReadWriteMany
  resources: {}
Apply the manifest:
$ kubectl apply -n nim-service -f cache-all.yaml
Optional: View information about the caching progress.
Confirm a persistent volume and claim are created:
$ kubectl get -n nim-service pvc,pv
Example Output
NAME                                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/meta-llama3-8b-instruct-pvc   Bound   pvc-1d3f8d48-660f-4e44-9796-c80a4ed308ce   50Gi   RWO   nfs-client   <unset>   10m

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                     STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
persistentvolume/pvc-1d3f8d48-660f-4e44-9796-c80a4ed308ce   50Gi   RWO   Delete   Bound   nim-service/meta-llama3-8b-instruct-pvc   nfs-client   <unset>            10m
View the NIM cache resources to check the status:
$ kubectl get nimcaches.apps.nvidia.com -n nim-service
Example Output
NAME                        STATUS   PVC                             AGE
meta-llama3-8b-instruct     Ready    meta-llama3-8b-instruct-pvc     2024-09-19T13:20:53Z
nv-embedqa-e5-v5            Ready    nv-embedqa-e5-v5-pvc            2024-09-18T21:11:37Z
nv-rerankqa-mistral-4b-v3   Ready    nv-rerankqa-mistral-4b-v3-pvc   2024-09-18T21:11:37Z
Displaying the NIM Cache Status
Display the NIM cache status:
$ kubectl get nimcaches.apps.nvidia.com -n nim-service
Example Output
NAME                        STATUS   PVC                             AGE
meta-llama3-8b-instruct     Ready    meta-llama3-8b-instruct         2024-08-09T20:54:28Z
nv-embedqa-e5-v5            Ready    nv-embedqa-e5-v5-pvc            2024-08-09T20:54:28Z
nv-rerankqa-mistral-4b-v3   Ready    nv-rerankqa-mistral-4b-v3-pvc   2024-08-09T20:54:28Z
The NIM cache object can report the following statuses:
Status | Description
---|---
Failed | The job failed to download and cache the model profile.
InProgress | The job is downloading the model profile from NGC.
NotReady | The job is not ready. This status can be reported shortly after creating the NIM cache resource while the image for the pod is downloaded from NGC. For more information, run kubectl describe pod -n nim-service <caching-pod-name>.
Pending | The job is created, but has not yet started and become active.
PVC-Created | The Operator creates a PVC for the model profile cache if you set spec.storage.pvc.create to true.
Ready | The job downloaded and cached the model profile.
Started | The Operator creates a job to download the model profile from NGC.
Displaying Cached Model Profiles
View the .status.profiles field of the custom resource:
$ kubectl get nimcaches.apps.nvidia.com -n nim-service \
    meta-llama3-8b-instruct -o=jsonpath="{.status.profiles}" | jq .
Example Output
[
  {
    "config": {
      "feat_lora": "false",
      "llm_engine": "vllm",
      "precision": "fp16",
      "tp": "2"
    },
    "model": "meta/llama3-8b-instruct",
    "name": "19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f",
    "release": "1.0.3"
  }
]
Caching Models in Air-Gapped Environments
You can run a NIM microservice container on a host with network access, cache the model, create a container with the model cache, and then run a job to copy the model cache to a PVC.
For more information about the steps in the following procedure that run the NIM container and download model profiles, refer to Serving Models from Local Assets in the NVIDIA NIM for LLMs documentation.
The following sections show one way to cache models and add them to a PVC.
Supporting Custom CA Certificates
If your cluster has an HTTP proxy that requires custom certificates, you can add them in a config map and mount them into the NIM cache job. You can use self-signed or custom CA certificates.
Create a config map with the certificates:
$ kubectl create configmap -n nim-service ca-certs --from-file=<path-to-cert-file-1> --from-file=<path-to-cert-file-2>
When you create the NIM cache resource, specify the name of the config map and the path to mount the certificates in the container:
spec:
  certConfig:
    name: ca-certs
    mountPath: /usr/local/share/ca-certificates/
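You can confirm that the certificates are present in the config map; each key corresponds to one of the files that you supplied:
$ kubectl get configmap -n nim-service ca-certs -o yaml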
Downloading the Models
Export your NGC API key as an environment variable:
$ export NGC_API_KEY=M2...
Run the NIM container image locally, list the model profiles, and download the model profile.
Start the container:
$ mkdir cache
$ docker run --rm -it \
    -v ./cache:/opt/nim/.cache \
    -u $(id -u):$(id -g) \
    -e NGC_API_KEY \
    nvcr.io/nim/meta/llama3-8b-instruct:1.0.3 \
    bash
Replace the container image and tag with the NIM microservice that you want to cache models for.
List the model profiles:
$ list-model-profiles
Partial Output
...
MODEL PROFILES
- Compatible with system and runnable:
  - 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
...
Download and cache the model profiles:
$ download-to-cache --profile 1903...
Exit the container:
$ exit
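The downloaded profile remains in the cache directory on the host. Before you build the container image, you can confirm that files were written:
$ du -sh ./cache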
Copying the Models to a Container
Make a Dockerfile that copies the model profiles into the container:
FROM nvcr.io/nvidia/base/ubuntu:22.04_20240212
COPY cache /cache
Build and push the container with the model profiles to a container registry that the air-gapped cluster has access to:
$ docker build -t <private-registry-name>/<model-name>:<tag> .
$ docker push <private-registry-name>/<model-name>:<tag>
Copying the Models From the Container to the PVC
Optional: Create a PVC if you do not already have one:
Create a manifest file, such as model-cache-pvc.yaml, with contents like the following example:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
  namespace: nim-service
spec:
  # At least one access mode is required; adjust for your storage class.
  accessModes:
  - ReadWriteMany
  storageClassName: <storage-class>
  resources:
    requests:
      storage: 10Gi
Apply the manifest:
$ kubectl apply -f model-cache-pvc.yaml
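Confirm that the claim is created before you run the copy job:
$ kubectl get pvc -n nim-service model-cache-pvc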
Create a job that copies the model profiles to the PVC.
Create a manifest file, such as copy-cache-job.yaml, with contents like the following example:
apiVersion: batch/v1
kind: Job
metadata:
  name: copy-cache
  namespace: nim-service
spec:
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: copy-cache
        command:
        - "/bin/sh"
        args:
        - "-c"
        - "cd /cache && cp -R . /model-store && find /model-store"
        image: <private-registry-name>/<model-name>:<tag>
        imagePullPolicy: IfNotPresent
        securityContext:
          allowPrivilegeEscalation: false
          runAsGroup: 2000
          runAsNonRoot: true
          runAsUser: 1000
        volumeMounts:
        - mountPath: /model-store
          name: nim-cache-volume
          readOnly: false
      imagePullSecrets:
      - name: my-secret
      securityContext:
        fsGroup: 2000
        runAsNonRoot: true
        runAsUser: 1000
      volumes:
      - name: nim-cache-volume
        persistentVolumeClaim:
          claimName: model-cache-pvc
  backoffLimit: 4
Apply the manifest:
$ kubectl apply -f copy-cache-job.yaml
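Because the job runs find /model-store after the copy, you can verify the copied files by checking the job status and logs:
$ kubectl get job -n nim-service copy-cache
$ kubectl logs -n nim-service job/copy-cache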
Using LoRA Models and Adapters
NVIDIA NIM for LLMs supports LoRA parameter-efficient fine-tuning (PEFT) adapters trained by the NeMo Framework and Hugging Face Transformers libraries.
You must download the LoRA adapters manually and make them available to the NIM microservice. The following steps describe one way to meet the requirement.
Specify lora: true in the NIM cache manifest that you apply:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        tensorParallelism: "1"
        lora: true
        gpus:
        - product: "a100"
  storage:
    pvc:
      create: true
      size: "50Gi"
      storageClass: <storage-class>
      volumeAccessMode: ReadWriteMany
  resources: {}
Apply the manifest and wait until kubectl get nimcache -n nim-service meta-llama3-8b-instruct shows the NIM cache is Ready.
Use the NGC CLI or Hugging Face CLI to download the LoRA adapters.
Build a container that includes the CLIs, such as the following example:
FROM nvcr.io/nvidia/base/ubuntu:22.04_20240212

ARG NGC_CLI_VERSION=3.50.0

RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \
      wget \
      unzip \
      python3-pip

RUN useradd -m -s /bin/bash -u 1000 ubuntu
USER ubuntu

RUN wget --content-disposition --no-check-certificate \
      https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/${NGC_CLI_VERSION}/files/ngccli_linux.zip -O /tmp/ngccli_linux.zip && \
    unzip /tmp/ngccli_linux.zip -d ~ && \
    rm /tmp/ngccli_linux.zip

ENV PATH=/home/ubuntu/ngc-cli:$PATH

RUN pip install -U "huggingface_hub[cli]"
Push the container to a registry that the nodes in your cluster can access.
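For example, using the same placeholder registry, image name, and tag that the pod manifest below uses:
$ docker build -t <private-registry>/<image-name>:<image-tag> .
$ docker push <private-registry>/<image-name>:<image-tag>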
Apply a manifest like the following example that runs the container.
When you create your manifest, keep the following key considerations for the pod specification in mind:
Mount the same PVC that the NIM microservice accesses for the model.
Specify the same user ID and group ID that the NIM cache container used. The following manifest shows the default values.
Specify the NGC_CLI_API_KEY and NGC_CLI_ORG environment variables. The value for the organization might be different.
Start the pod in the nim-service namespace so that the pod can access the ngc-api-secret secret.
apiVersion: v1
kind: Pod
metadata:
  name: ngc-cli
spec:
  containers:
  - env:
    - name: NGC_CLI_API_KEY
      valueFrom:
        secretKeyRef:
          key: NGC_API_KEY
          name: ngc-api-secret
    - name: NGC_CLI_ORG
      value: "nemo-microservices/ea-participants"
    - name: NIM_PEFT_SOURCE
      value: "/model-store/loras"
    image: <private-registry>/<image-name>:<image-tag>
    command: ["sleep"]
    args: ["inf"]
    name: ngc-cli
    securityContext:
      capabilities:
        drop:
        - ALL
      runAsNonRoot: true
    volumeMounts:
    - mountPath: /model-store
      name: model-store
  restartPolicy: Never
  securityContext:
    fsGroup: 2000
    runAsGroup: 2000
    runAsUser: 1000
    seLinuxOptions:
      level: s0:c28,c2
  volumes:
  - name: model-store
    persistentVolumeClaim:
      claimName: meta-llama3-8b-instruct-pvc
Access the pod:
$ kubectl exec -it -n nim-service ngc-cli -- bash
The pod might report groups: cannot find name for group ID 2000. You can ignore the message.
From the terminal in the pod, download the LoRA adapters.
Make a directory for the LoRA adapters:
$ mkdir $NIM_PEFT_SOURCE
$ cd $NIM_PEFT_SOURCE
Download the adapters:
$ ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-math-v1"
$ ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-squad-v1"
Rename the directories to match the naming convention for the LoRA model directory structure:
$ mv llama3-8b-instruct-lora_vnemo-math-v1 llama3-8b-math
$ mv llama3-8b-instruct-lora_vnemo-squad-v1 llama3-8b-squad
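Listing the directory confirms the expected layout, assuming you downloaded only these two adapters:
$ ls $NIM_PEFT_SOURCE
llama3-8b-math  llama3-8b-squad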
Press Ctrl+D to exit the pod and then run kubectl delete pod -n nim-service ngc-cli.
When you create a NIM service instance, specify the NIM_PEFT_SOURCE environment variable:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  ...
  env:
  - name: NIM_PEFT_SOURCE
    value: "/model-store/loras"
After the NIM microservice is running, monitor the logs for records like the following:
{"level": "INFO", ..., "message": "LoRA models synchronizer successfully initialized!"}
{"level": "INFO", ..., "message": "Synchronizing LoRA models with local LoRA directory ..."}
{"level": "INFO", ..., "message": "Done synchronizing LoRA models with local LoRA directory"}
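One way to watch for these records is to follow the microservice logs; this example assumes the Operator creates a deployment named after the NIM service resource:
$ kubectl logs -n nim-service deploy/meta-llama3-8b-instruct -f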
The preceding steps are sample commands for downloading the NeMo format LoRA adapters. Refer to Parameter-Efficient Fine-Tuning in the NVIDIA NIM for LLMs documentation for information about using the Hugging Face Transformers format, the model directory structure, adapters for other models, and so on.
Troubleshooting
NIM Cache Reports No Status
If you run kubectl get nimcache -n nim-service and the output does not report a status, perform the following actions to get more information:
Determine the state of the caching jobs:
$ kubectl get jobs -n nim-service
Example Output
NAME                          COMPLETIONS   DURATION   AGE
meta-llama3-8b-instruct-job   1/1           8s         2m57s
View the logs from the job with a command like kubectl logs -n nim-service job/meta-llama3-8b-instruct-job.
If the caching job is no longer available, delete the NIM cache resource and reapply the manifest.
Describe the NIM cache resource and review the conditions:
$ kubectl describe nimcache -n nim-service <nim-cache-name>
Partial Output
Status:
  Conditions:
    Last Transition Time:  2024-09-19T13:20:53Z
    Message:               The PVC has been created for caching NIM model
    Reason:                PVCCreated
    Status:                True
    Type:                  NIM_CACHE_PVC_CREATED
    Last Transition Time:  2024-09-19T13:26:24Z
    Message:
    Reason:                Reconciled
    Status:                False
    Type:                  NIM_CACHE_RECONCILE_FAILED
    Last Transition Time:  2024-09-19T13:24:36Z
    Message:               The Job to cache NIM has been created
    Reason:                JobCreated
    Status:                True
    Type:                  NIM_CACHE_JOB_CREATED
    Last Transition Time:  2024-09-19T13:25:50Z
    Message:               The Job to cache NIM is in pending state
    Reason:                JobPending
    Status:                True
    Type:                  NIM_CACHE_JOB_PENDING
    Last Transition Time:  2024-09-19T13:25:50Z
    Message:               The Job to cache NIM has successfully completed
    Reason:                JobCompleted
    Status:                True
    Type:                  NIM_CACHE_JOB_COMPLETED
The preceding output shows a NIM cache resource that eventually succeeded in downloading and caching a model profile.
The NIM_CACHE_RECONCILE_FAILED condition was reported during the interval after the Operator created the caching job but before the pod was running, while the image was downloading from NGC. In the output, the status for that condition is set to False to indicate that the condition is no longer accurate.
View the Operator logs by running kubectl logs -n nim-operator -l app.kubernetes.io/name=k8s-nim-operator.
No Profiles Are Selected for Caching
If the NIM cache controller does not automatically select a model profile to cache, you can use two methods to view the model profiles that are available:
The cache controller copies the model profiles into a config map. You can review the config map with a command like the following example to identify a model profile that uses a community backend such as vLLM or ONNX.
$ kubectl get cm -n nim-service meta-llama3-8b-instruct-manifest -o yaml | less
Run the container on a host that is configured for Docker, NVIDIA Container Toolkit, and an NVIDIA GPU. Refer to the list-model-profiles command in the NVIDIA NIM for LLMs documentation.
After you determine a model profile that is compatible with your GPU, specify the model profile ID in the spec.source.ngc.model.profiles field and reapply the manifest.
Deleting a NIM Cache
To delete a NIM cache, perform the following steps.
View the NIM cache custom resources:
$ kubectl get nimcaches.apps.nvidia.com -A
Example Output
NAMESPACE     NAME                      STATUS   PVC           AGE
nim-service   meta-llama3-8b-instruct   ready    model-store   2024-08-08T13:14:30Z
Delete the custom resource:
$ kubectl delete nimcache -n nim-service meta-llama3-8b-instruct
If the Operator created the PVC when you created the NIM cache, the Operator deletes the PVC and the cached model profiles. You can determine if the Operator created the PVC by running a command like the following example:
$ kubectl get nimcaches.apps.nvidia.com -n nim-service \
    -o=jsonpath='{range .items[*]}{.metadata.name}: {.spec.storage.pvc.create}{"\n"}{end}'
Example Output
meta-llama3-8b-instruct: true
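If the Operator did not create the PVC, the cached model profiles remain on it. If you no longer need them, you can delete the claim yourself; substitute the name of the PVC that you created:
$ kubectl delete pvc -n nim-service <pvc-name>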
Next Steps
Deploy NIM microservices either by adding NIM service custom resources or by managing several services in a single NIM pipeline custom resource.