Deploy NVIDIA NIM for Image OCR (NeMo Retriever OCR) on Kubernetes#

You can deploy NVIDIA NIM for Image OCR (NeMo Retriever OCR) on Kubernetes by using Helm.

The Helm chart simplifies NIM deployment on Kubernetes. The chart supports deployment with a variety of possible cluster, GPU, and storage configurations.

NIM microservices run on hosts with NVIDIA GPUs. The required GPU type and count depend on the model. To use Helm, you must have a Kubernetes cluster with GPU nodes.

To simplify GPU driver installation and lifecycle management on the cluster, consider installing the NVIDIA GPU Operator.

Benefits of Helm Chart Deployment#

The benefits of Helm Charts are as follows:

  • Help manage the deployment and lifecycle of the microservice.

  • Use a centralized distribution mechanism for charts. The charts are hosted on NVIDIA NGC.

  • Enable you to customize the deployment when you specify a custom values file.

Setting Up the Environment#

If you haven’t set up your NGC API key, or you are not sure which NIM you want to download and deploy, see the information in the User Guide.

This Helm chart requires two secrets containing your NGC API key: an image pull secret (type kubernetes.io/dockerconfigjson) for downloading private images from NVIDIA NGC, and a generic (Opaque) secret, named ngc-api below, that the container uses to download models. Both typically contain the same key, but in different formats. See Creating Secrets below.

These instructions assume that you have exported NGC_API_KEY in your environment.

export NGC_API_KEY="<YOUR NGC API KEY>"
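Because several later helm and kubectl commands silently embed this variable, a quick sanity check can catch an empty value early. This check is our addition, not part of the official steps:

```shell
export NGC_API_KEY="<YOUR NGC API KEY>"   # placeholder; use your real key

# Fail fast if the variable is empty before running dependent commands.
if [ -z "${NGC_API_KEY}" ]; then
  echo "NGC_API_KEY is not set" >&2
  exit 1
fi
echo "NGC_API_KEY is set"
```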

Fetching the Helm Chart#

You can fetch the helm chart from NGC by executing the following command:

helm fetch https://helm.ngc.nvidia.com/nim/nvidia/charts/nvidia-nim-nemotron-ocr-v1-1.3.0.tgz --username='$oauthtoken' --password=$NGC_API_KEY

Namespace#

You can deploy to whichever namespace is appropriate, but for documentation purposes these instructions use a namespace named nemotron-ocr-v1-nim.

kubectl create namespace nemotron-ocr-v1-nim

Creating Secrets#

Use the following procedure to create the secrets for this chart.

  1. Add an image pull secret for downloading the container image from NVIDIA NGC.

    kubectl create secret -n nemotron-ocr-v1-nim docker-registry ngc-secret \
     --docker-server=nvcr.io \
     --docker-username='$oauthtoken' \
     --docker-password="${NGC_API_KEY}"
    
  2. Add a generic secret that the container uses for downloading models from NVIDIA NGC.

    kubectl create -n nemotron-ocr-v1-nim secret generic ngc-api --from-literal=NGC_API_KEY=${NGC_API_KEY}
    
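The two commands above create secrets of different types, which is why both are needed even though they carry the same key. As a rough sketch, the resulting objects look like the following (the base64 payloads are placeholders, not real values):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ngc-secret
  namespace: nemotron-ocr-v1-nim
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded Docker config for nvcr.io>
---
apiVersion: v1
kind: Secret
metadata:
  name: ngc-api
  namespace: nemotron-ocr-v1-nim
type: Opaque
data:
  NGC_API_KEY: <base64-encoded NGC API key>
```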

Configuration Considerations#

By default, the following deployment commands create a single deployment with one replica that uses the nemotron-ocr-v1 model. You can use the following options to modify this behavior. See Parameters for a description of the Helm parameters.

  • image.repository – The container (nemotron-ocr-v1) to deploy.

  • image.tag – The version of that container (1.3.0).

  • Storage options, based on the environment and cluster in use.

  • resources – Use this option when a model requires more than the default of one GPU. See below for support matrix and resource requirements.

  • env – An array of environment variables passed to the container, for advanced configuration.
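Instead of passing each of these options as a --set flag, you can collect them in a custom values file. The following is an illustrative sketch; the repository path, GPU count, and environment variable are example values, not requirements of the chart:

```yaml
# custom-values.yaml (illustrative example)
image:
  repository: nvcr.io/nim/nvidia/nemotron-ocr-v1   # example path
  tag: "1.3.0"

persistence:
  enabled: true
  storageClass: local-nfs

resources:
  limits:
    nvidia.com/gpu: 1

env:
  - name: NIM_LOG_LEVEL   # example environment variable
    value: "INFO"
```

Pass the file to Helm with `-f custom-values.yaml` on the `helm upgrade --install` command.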

Storage#

This NIM downloads models at startup. By default, models are stored in an emptyDir volume, which means they are re-downloaded each time the pod restarts. For production use, we recommend that you use persistent storage to avoid repeated downloads.

The following instructions use a local-nfs storage class provisioner. If your cluster already has a storage class provisioner, you can skip this step and specify your storage class in the deploy command.

helm repo add nfs-ganesha-server-and-external-provisioner https://kubernetes-sigs.github.io/nfs-ganesha-server-and-external-provisioner/
helm install nfs-server nfs-ganesha-server-and-external-provisioner/nfs-server-provisioner --set storageClass.name=local-nfs

Advanced Storage Configuration#

Storage is a particular concern when setting up NIMs. Models can be quite large, and downloading them to emptyDir volumes or other ephemeral locations can fill a node's disk. We recommend that you mount persistent storage of some kind on your pod.

This chart supports two general categories:

  1. Persistent Volume Claims (enabled by persistence.enabled)

  2. hostPath (enabled by hostPath.enabled)

By default, persistence.enabled is set to false and the chart uses an emptyDir volume for model storage. To use persistent storage, set persistence.enabled=true in your Helm command.

When you set persistence.enabled=true, the chart creates a PersistentVolumeClaim that uses the cluster’s default storage class, or the class specified by persistence.storageClass.
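When persistence.enabled=true, the claim the chart creates is roughly of the following shape. The name and requested size here are illustrative, not the chart's exact template:

```yaml
# Sketch of a PVC like the one the chart creates (illustrative values)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-store
  namespace: nemotron-ocr-v1-nim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-nfs
  resources:
    requests:
      storage: 50Gi
```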

If you want to store the models on an existing PersistentVolumeClaim, specify that value in persistence.existingClaim:

--set persistence.enabled=true \
--set persistence.existingClaim=my-existing-pvc

To use a local directory on the node instead of a PVC:

--set hostPath.enabled=true \
--set hostPath.path=/path/to/local/model-store

For more information, refer to Parameters.

Deploying#

Use the following command to create a deployment with persistent storage.

helm upgrade --install \
  --namespace nemotron-ocr-v1-nim \
  image-ocr \
  --set persistence.enabled=true \
  --set persistence.storageClass="local-nfs" \
  nvidia-nim-nemotron-ocr-v1-1.3.0.tgz

Use the following command to deploy without persistent storage (models are stored in emptyDir and are re-downloaded on pod restart).

helm upgrade --install \
  --namespace nemotron-ocr-v1-nim \
  image-ocr \
  nvidia-nim-nemotron-ocr-v1-1.3.0.tgz

You can also change the container version in use by adding the following after the --namespace argument.

--set image.tag=1.3.0 \

After deploying, check the pods to confirm the deployment is running. The initial image pull and model download can take upwards of 15 minutes.

kubectl get pods -n nemotron-ocr-v1-nim

The pod should eventually reach the Running state.

NAME              READY   STATUS    RESTARTS   AGE
nvidia-nim-nemotron-ocr-v1-0   1/1     Running   0          8m44s

Check events for failures:

kubectl get events -n nemotron-ocr-v1-nim

Running Inference#

In the previous example, the API endpoint is exposed on port 8000 through a Kubernetes service of the default type (ClusterIP), with no ingress, because the NIM does not handle authentication itself. The following commands require that the nvidia/nemotron-ocr-v1 model has been deployed.

If required, change the “model” value in the request JSON body to use a different model.

Use the following command to port-forward the service to your local machine to test inference.

kubectl port-forward -n nemotron-ocr-v1-nim service/image-ocr-nvidia-nim-nemotron-ocr-v1 8000:8000

Create a directory data/structured-imgs and copy in some PNG images so that the directory looks like the following:

$ mkdir -p data/structured-imgs

$ ls -l data/structured-imgs
sample1.png
sample2.png
sample3.png
sample4.png

Send an inference request by running the following commands and Python 3.11 script.

# Create a virtual env (venv) for this test to isolate the dependencies
python3 -m venv nemotron_ocr_venv
source nemotron_ocr_venv/bin/activate

# Install pillow and requests libraries into your python 3 environment
pip3 install requests pillow

Save the following script as nemotron_ocr_inference_test.py:

# nemotron_ocr_inference_test.py
import base64
import json
import time
from io import BytesIO
from pathlib import Path

import requests
from PIL import Image

images = []
image_paths = list(Path("data/structured-imgs").glob("*"))
for image_path in image_paths:
    image = Image.open(image_path)
    buffered = BytesIO()
    image.save(buffered, format="PNG")
    base64_image = base64.b64encode(buffered.getvalue()).decode("utf-8")
    image_url = f"data:image/png;base64,{base64_image}"
    image = {"type": "image_url", "url": image_url}
    images.append(image)

payload = {"input": images}

start = time.time()
print(json.dumps(requests.post("http://localhost:8000/v1/infer", json=payload).json()))
print(f"{len(image_paths)} images completed in {time.time() - start} seconds")

# Run inference on the .png files in data/structured-imgs
python3 nemotron_ocr_inference_test.py
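The script above sends every image in a single request. For larger image sets, you may prefer to split the work into several smaller requests so each payload stays manageable. The following is a minimal sketch; the batched helper is our addition, not part of the NIM API:

```python
# Hypothetical batching helper (our addition, not part of the NIM API).
# Splits a list of inputs into smaller chunks so each request payload
# stays small; send one POST to /v1/infer per batch.
def batched(items, batch_size):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Example: five image paths in batches of two
paths = ["sample1.png", "sample2.png", "sample3.png", "sample4.png", "sample5.png"]
print([len(batch) for batch in batched(paths, 2)])  # → [2, 2, 1]
```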

Viewing Log Messages#

Use the following commands to view the container log messages.

# Step 1: Find the actual pod name
kubectl get pods -n nemotron-ocr-v1-nim

# Step 2: Use the actual pod name from step 1
kubectl logs -f -n nemotron-ocr-v1-nim <pod-name-from-step-1>