Deploying on Kubernetes with Helm Chart

You can deploy CACHED NIM on Kubernetes with a Helm chart. The chart simplifies deployment and supports optional cluster, GPU, and storage configurations.

The Helm chart downloads the model and starts the service so that it can begin serving inference requests.

NIMs are designed to run on systems with NVIDIA GPUs; the type and number of GPUs depend on the model. To use the Helm chart, you must have a Kubernetes cluster with appropriate GPU nodes and the NVIDIA GPU Operator installed.

Benefits of Helm Chart Deployment

Using a Helm chart to deploy on Kubernetes has the following benefits compared to manual deployment:

  • Enables using Kubernetes Nodes and horizontally scaling the service

  • Encapsulates the complexity of running Docker commands directly

  • Enables monitoring metrics from the NIM

Setting Up the Environment

If you haven’t set up your NGC API key or do not know exactly which NIM you want to download and deploy, refer to the User Guide.

The Helm chart requires two secrets, both built from your NGC API key: an image pull secret for downloading private images, named nvcrimagepullsecret in the following sections, and a generic secret named ngc-api. The secrets hold the same key in different formats (kubernetes.io/dockerconfigjson versus Opaque). Refer to the Creating Secrets section below for details.

These instructions require that you have exported your NGC_API_KEY to the environment. Use the following command to export your key.

export NGC_API_KEY="<YOUR NGC API KEY>"

Fetching the Helm Chart

You can download the Helm chart from NGC by executing the following command:

helm fetch https://helm.ngc.nvidia.com/ohlfw0olaadg/ea-participants/charts/cached-nim-0.2.0.tgz --username='$oauthtoken' --password=$NGC_API_KEY

Namespace

You can choose to deploy to whichever namespace is appropriate, but this document uses the namespace cached-nim. Use the following command to create that namespace.

kubectl create namespace cached-nim

Creating Secrets

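Use the following script to create the required secrets for the Helm chart. Run only the base64 line that matches your operating system.

# Registry config for nvcr.io (assumed format; adjust it if your registry credentials differ)
DOCKER_CONFIG='{"auths":{"nvcr.io":{"username":"$oauthtoken","password":"'"${NGC_API_KEY}"'"}}}'

# [Linux only] Encode nvcr registry config as base64
NGC_REGISTRY_PASSWORD=$(echo -n "$DOCKER_CONFIG" | base64 -w0)

# [MacOS only] Encode nvcr registry config as base64
NGC_REGISTRY_PASSWORD=$(echo -n "$DOCKER_CONFIG" | base64 -b0)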

cat <<EOF > imagepull.yaml
apiVersion: v1
kind: Secret
metadata:
  name: nvcrimagepullsecret
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: ${NGC_REGISTRY_PASSWORD}
EOF

kubectl apply -n cached-nim -f imagepull.yaml
kubectl create -n cached-nim secret generic ngc-api \
  --from-literal=NGC_API_KEY=${NGC_API_KEY} \
  --from-literal=NGC_CLI_API_KEY=${NGC_API_KEY}
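
The value stored under .dockerconfigjson can also be produced without relying on platform-specific base64 flags. The following Python function is a minimal sketch of that encoding step. The function name build_dockerconfigjson is hypothetical, and the registry host nvcr.io and the username/password layout are assumptions based on the script above and on the '$oauthtoken' username used when fetching the chart.

```python
import base64
import json


def build_dockerconfigjson(api_key: str, registry: str = "nvcr.io") -> str:
    """Return a base64 string suitable for the .dockerconfigjson field.

    Assumption: the registry accepts the literal username '$oauthtoken'
    with the NGC API key as the password.
    """
    config = {
        "auths": {
            registry: {
                "username": "$oauthtoken",
                "password": api_key,
            }
        }
    }
    raw = json.dumps(config).encode("utf-8")
    # b64encode never inserts line breaks, which matches the intent of
    # `base64 -w0` on Linux and `base64 -b0` on macOS.
    return base64.b64encode(raw).decode("ascii")
```

Calling build_dockerconfigjson(os.environ["NGC_API_KEY"]) would give a string that can be placed in imagepull.yaml in place of ${NGC_REGISTRY_PASSWORD}.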

Configuration Considerations

By default, the following deployment commands create a single deployment with one replica using the cached model. Use the following options to change what is deployed and how it runs. Refer to Parameters for information about parameters.

  • image.repository – The container (CACHED NIM) to deploy

  • image.tag – The version of that container (CACHED NIM)

  • Storage options, based on the environment and cluster in use

  • resources – Use this option when a model requires more than the default of one GPU. Refer to the support matrix and resource requirements.

  • env – An array of environment variables presented to the container, if advanced configuration is needed

Storage

This NIM uses persistent storage for storing downloaded models, and sample commands in this guide require the local-nfs storage class. Use the following commands to install the local-nfs storage class and provisioner in your Kubernetes cluster.

helm repo add nfs-ganesha-server-and-external-provisioner https://kubernetes-sigs.github.io/nfs-ganesha-server-and-external-provisioner/
helm install nfs-server nfs-ganesha-server-and-external-provisioner/nfs-server-provisioner --set storageClass.name=local-nfs

Advanced Storage Configuration

Storage is a particular concern when setting up NIMs. Models can be quite large, and you can fill a disk downloading models to emptyDir volumes. We recommend that you mount persistent storage of some kind on your pod.

This chart supports two general categories:

  • Persistent Volume Claims (enabled with persistence.enabled)

  • hostPath (enabled with persistence.hostPath)

By default, the chart uses the standard storage class and creates a PersistentVolume and a PersistentVolumeClaim.

If you do not have a Storage Class Provisioner that creates PersistentVolumes automatically, set the value persistence.createPV=true. This is also necessary when you use persistence.hostPath on minikube.

If you have an existing PersistentVolumeClaim in which you’d like the models to be stored, pass its name in persistence.existingClaimName.

Refer to the Helm options in Parameters.
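
The choice between these storage scenarios can be sketched as a small function that maps a scenario to chart values. This is an illustration only: the function name persistence_values is hypothetical, the sketch assumes that persistence.enabled and persistence.createPV take boolean values and that the existing-claim value is spelled persistence.existingClaimName, and it leaves out hostPath because this guide does not state what value that option expects. Check Parameters before relying on any of these keys.

```python
from typing import Dict, Optional


def persistence_values(
    storage_class: Optional[str] = None,
    existing_claim: Optional[str] = None,
    create_pv: bool = False,
) -> Dict[str, str]:
    """Map a storage scenario to persistence.* chart values (assumed names)."""
    values = {"persistence.enabled": "true"}
    if existing_claim:
        # Reuse a PersistentVolumeClaim that already exists in the namespace.
        values["persistence.existingClaimName"] = existing_claim
        return values
    if storage_class:
        # For example, the local-nfs class installed in the Storage section.
        values["persistence.class"] = storage_class
    if create_pv:
        # No provisioner creates PersistentVolumes automatically.
        values["persistence.createPV"] = "true"
    return values
```

Each key/value pair in the returned dictionary corresponds to one --set key=value argument on the helm command line.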

Deploying

Basic deployment

helm upgrade --install \
  --namespace cached-nim \
  cached-nim \
  --set persistence.class="local-nfs" \
  cached-nim-0.2.0.tgz

You can also change the version of the container in use by adding the following line after the --namespace line:

--set image.tag=0.2.1 \
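
If the deployment is driven from a script, the same command and its optional --set flags can be assembled as an argument list. The following is a minimal sketch; the helper name helm_upgrade_command is hypothetical, and the release, chart, namespace, and value names are taken from the commands above.

```python
import shlex
from typing import Dict, List, Optional


def helm_upgrade_command(
    release: str,
    chart: str,
    namespace: str,
    values: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build the argument list for `helm upgrade --install`."""
    command = ["helm", "upgrade", "--install", "--namespace", namespace, release]
    for key, value in (values or {}).items():
        # Each chart value becomes its own --set key=value pair.
        command += ["--set", f"{key}={value}"]
    command.append(chart)
    return command


command = helm_upgrade_command(
    release="cached-nim",
    chart="cached-nim-0.2.0.tgz",
    namespace="cached-nim",
    values={"image.tag": "0.2.1", "persistence.class": "local-nfs"},
)
# Print the command as a single shell-safe line for review.
print(shlex.join(command))
```

Keeping the command as a list means it can be passed to subprocess.run directly, without a shell and without extra quoting.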

After deploying, use the following command to check whether the pod is running, as the initial image pull and model download can take upwards of 15 minutes.

kubectl get pods -n cached-nim

This command should eventually return something similar to the following when the pod is running.

NAME              READY   STATUS    RESTARTS   AGE
cached-nim-0      1/1     Running   0          8m44s
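
Because the initial image pull and model download can take several minutes, a script may need to decide from this output whether the deployment is ready. The following sketch parses the default table format shown above. The function name pods_ready is hypothetical, and the parsing assumes the default column order (NAME, READY, STATUS).

```python
def pods_ready(kubectl_output: str) -> bool:
    """Return True when every listed pod is Running with all containers ready."""
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if len(lines) < 2:
        # Only a header, or no output at all: nothing is running yet.
        return False
    for line in lines[1:]:
        _name, ready, status = line.split()[:3]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            return False
    return True


sample = """\
NAME              READY   STATUS    RESTARTS   AGE
cached-nim-0      1/1     Running   0          8m44s
"""
print(pods_ready(sample))
```

A polling loop could capture the output of kubectl get pods -n cached-nim and call pods_ready until it returns True or a deadline passes.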

You can use the following command to check events for failures:

kubectl get events -n cached-nim --sort-by='.lastTimestamp'

Running Inference

In the previous example, the API endpoint is exposed on port 8000 through a Kubernetes service of the default type with no ingress, because the NIM itself does not handle authentication. The following commands require that the nvidia/cached model has been deployed.

If required, change the “model” value in the request JSON body to use a different model.

Use the following command to port-forward the service to your local machine to test inference.

kubectl port-forward -n cached-nim service/cached-nim 8000:8000
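
The port-forward command keeps running in its own terminal, so a test script may start before the local port is open. The following sketch waits for a TCP port to accept connections. The function name wait_for_port is hypothetical. A successful connection only shows that something is listening locally; it does not prove that the NIM has finished loading the model.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # The connection is closed again as soon as it succeeds.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Nothing is listening yet; retry after a short pause.
            time.sleep(0.5)
    return False
```

For the command above, the call would be wait_for_port("localhost", 8000).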

Create a directory named data next to the script and copy some PNG images into it so that it looks like the following:

$ mkdir data

$ ls -1 data/
sample1.png
sample2.png
sample3.png
sample4.png

Send an inference request by running the following commands and Python 3.11 script.

# Create a virtual env (venv) for this test to isolate the dependencies
python3 -m venv cached_venv
source cached_venv/bin/activate

# Install pillow and requests libraries into your python 3 environment
pip3 install pillow requests

# cached_inference_test.py
import json
from base64 import b64encode
from io import BytesIO
from pathlib import Path

import requests
from PIL import Image


images = []
# Send all images from the '$script_path/data' directory for inference
for image_path in sorted((Path(__file__).parent / "data").glob("*.png")):
    buf = BytesIO()
    Image.open(image_path).save(buf, format="PNG")
    images.append(
        {
            "type": "image_url",
            "image_url": {
                "url": (
                    "data:image/png;base64,"
                    f'{b64encode(buf.getvalue()).decode("utf-8")}'
                ),
            },
        }
    )

resp = requests.post(
    url="http://localhost:8000/v1/infer",
    headers={"Content-Type": "application/json"},
    json={
        "model": "cached",
        "messages": [{"content": images}],
    },
)

if resp.status_code == 200:
    print(json.dumps(resp.json()))
else:
    print(f"Request failed with status code {resp.status_code}: {resp.text}")

# Run inference on the .png files in the data directory
python3 cached_inference_test.py
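
To inspect the request body without a running service, the payload construction can be separated from the HTTP call. The following sketch uses only the standard library. The function name build_payload is hypothetical, and unlike the script above it reads each file's bytes directly instead of re-encoding them with Pillow, so it assumes the files are already valid PNG images. The payload shape mirrors the script above and has not been checked against the service.

```python
import json
import tempfile
from base64 import b64encode
from pathlib import Path


def build_payload(image_dir: Path, model: str = "cached") -> dict:
    """Build the request body from the .png files in image_dir."""
    content = []
    for image_path in sorted(image_dir.glob("*.png")):
        encoded = b64encode(image_path.read_bytes()).decode("utf-8")
        content.append(
            {
                "type": "image_url",
                # The files are PNG, so the data URL declares image/png.
                "image_url": {"url": f"data:image/png;base64,{encoded}"},
            }
        )
    return {"model": model, "messages": [{"content": content}]}


# Demonstration with a placeholder file; real use would point at the data directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sample1.png").write_bytes(b"placeholder bytes, not a real image")
    payload = build_payload(Path(tmp))
    print(json.dumps(payload, indent=2))
```

The returned dictionary can be passed as the json argument of requests.post, as in the script above.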

Logging

Use the following command to view the container logs.

kubectl logs --selector=app.kubernetes.io/name=cached-nim -n cached-nim