Helm and Kubernetes#

This page describes how to deploy NVIDIA NIM for LLMs on Kubernetes using the NIM Helm chart.

Prerequisites#

Before deploying NIM with Helm, make sure you have the following:

  • A running Kubernetes cluster with GPU-capable nodes

  • Configured kubectl access to the target cluster

  • Helm 3.0.0 or later

  • An NGC API key for accessing the NIM Helm chart, pulling NIM container images, and downloading model artifacts

  • A storage class that supports persistent volumes for model caching

Note

To provision Kubernetes with NVIDIA Cloud Native Stack (CNS), refer to Using the Ansible Playbooks.

Fetch and extract the Helm chart before deployment. Go to the NGC Catalog and select the nim-llm Helm chart to pick a version. In most cases, you should select the latest version.

export HELM_CHART_VERSION="<version_number>"
helm fetch "https://helm.ngc.nvidia.com/nim/charts/nim-llm-${HELM_CHART_VERSION}.tgz" \
  --username='$oauthtoken' \
  --password="${NGC_API_KEY}"
tar -xzf "nim-llm-${HELM_CHART_VERSION}.tgz"

Configure Helm#

The following Helm values are the most important settings for a NIM deployment:

  • image.repository: NIM container image to deploy.

  • image.tag: NIM container image tag.

  • model.ngcAPISecret and imagePullSecrets: credentials required to pull images and model artifacts.

  • persistence: storage settings for model cache.

  • resources: GPU limits based on model requirements.

  • env: optional advanced runtime configuration.

Use the following commands to inspect chart documentation and defaults:

helm show readme nim-llm/
helm show values nim-llm/

Cache and Temporary Directories#

Several environment variables are derived from model.nimCache (default: /model-store) at deploy time:

Environment Variable    Derived Value                 Purpose
NIM_CACHE_PATH          <nimCache>                    Primary model cache
HF_HOME                 <nimCache>/huggingface/hub    Hugging Face cache
OUTLINES_CACHE_DIR      <nimCache>/outlines           Outlines grammar cache for structured output

These variables are set automatically in both single-node and multi-node deployments. If you change model.nimCache, ensure that the underlying volume mount is writable and has sufficient space for all cache subdirectories.
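The derivation above can be sketched in shell. This is illustrative only, assuming the default model.nimCache value of /model-store noted above:

```shell
# Sketch of how the cache paths are derived from model.nimCache
# (default: /model-store).
NIM_CACHE="/model-store"

NIM_CACHE_PATH="${NIM_CACHE}"
HF_HOME="${NIM_CACHE}/huggingface/hub"
OUTLINES_CACHE_DIR="${NIM_CACHE}/outlines"

echo "${NIM_CACHE_PATH}"
echo "${HF_HOME}"
echo "${OUTLINES_CACHE_DIR}"
```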

To override OUTLINES_CACHE_DIR independently, add it to the env section in values.yaml:

env:
  - name: OUTLINES_CACHE_DIR
    value: /custom/path/outlines

Minimal Example#

Complete the following steps to deploy the minimal Helm example:

  1. Export your NGC API key so that the commands in the following steps can use it:

    export NGC_API_KEY=<your_ngc_api_key>
    
  2. Create the image pull secret:

    kubectl create secret docker-registry ngc-secret \
      --docker-server=nvcr.io \
      --docker-username='$oauthtoken' \
      --docker-password="${NGC_API_KEY}"
    
  3. Create the NGC API key secret:

    kubectl create secret generic nvidia-nim-secrets \
      --from-literal=NGC_API_KEY="${NGC_API_KEY}"
    
  4. Create values.yaml with a minimal configuration:

    cat <<'EOF' > values.yaml
    image:
      repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
      tag: "2.0.3"
    model:
      ngcAPISecret: "nvidia-nim-secrets"
    persistence:
      enabled: true
      storageClass: "nfs-client"
      accessMode: ReadWriteMany
      size: 50Gi
    resources:
      limits:
        nvidia.com/gpu: 1
    imagePullSecrets:
      - name: "ngc-secret"
    EOF
    

    Note

    Set persistence.storageClass to a StorageClass that is available in your Kubernetes cluster.

    Tip

    Adjust persistence.size based on your model size and expected cache usage.

  5. Install the release:

    helm install my-nim nim-llm/ -f values.yaml
    

These values are intentionally minimal and work as a starting point in most clusters.
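After installation, you can confirm that the secrets exist and that the release deployed. The secret and release names below match the steps above:

```shell
# Confirm the secrets created in steps 2 and 3 exist.
kubectl get secret ngc-secret nvidia-nim-secrets

# Check the status of the Helm release.
helm status my-nim
```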

Enable LoRA With Helm#

Optional: Complete the following steps to enable LoRA adapters with Helm.

  1. Create a dedicated PVC for the LoRA adapters:

    cat <<'EOF' > nvidia-nim-lora-pvc.yaml
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: nvidia-nim-lora-pvc
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: nfs-client
      resources:
        requests:
          storage: 10Gi
    EOF
    
    kubectl apply -f nvidia-nim-lora-pvc.yaml
    
  2. Add LoRA adapters to the PVC under /loras. For each adapter, create one directory that contains the adapter artifacts:

    /loras/
      adapter_name/
        adapter_config.json
        adapter_model.safetensors   # or adapter_model.bin
    

    Note

    NIM loads adapters from NIM_PEFT_SOURCE (/loras in this example). If the PVC is empty, no LoRA adapters are available at runtime.

  3. Update values.yaml:

    env:
      - name: NIM_PEFT_SOURCE
        value: /loras
    extraVolumes:
      lora-adapter:
        persistentVolumeClaim:
          claimName: nvidia-nim-lora-pvc
    extraVolumeMounts:
      lora-adapter:
        mountPath: /loras
    
  4. Apply the updated values:

    helm upgrade my-nim nim-llm/ -f values.yaml
    

For detailed LoRA configuration and runtime behavior, refer to Fine-Tuning with LoRA.
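The adapter layout from step 2 can be staged locally before copying it into the PVC. The adapter name below is a placeholder:

```shell
# Stage one adapter directory matching the layout NIM expects under
# NIM_PEFT_SOURCE. "example-adapter" is an illustrative name; the files
# here are empty stand-ins for real adapter artifacts.
mkdir -p loras/example-adapter
touch loras/example-adapter/adapter_config.json
touch loras/example-adapter/adapter_model.safetensors
ls loras/example-adapter
```

One way to move the staged files into the PVC is `kubectl cp` into a temporary pod that mounts the claim; the exact approach depends on your storage setup.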

Container Security Context#

The containerSecurityContext Helm value sets the Kubernetes container-level security context for the NIM container. It is applied in both single-node and multi-node (LeaderWorkerSet) deployments.

Unlike podSecurityContext, the container-level security context supports the capabilities field, which is required for certain multi-node configurations.

NVL72 Multi-Node Deployments#

GB200 and GB300 NVL72 systems use IMEX (Internode Memory Exchange) channels for cross-node NVLink communication. The containers must have the following Linux capabilities:

containerSecurityContext:
  capabilities:
    add:
      - SYS_PTRACE
      - IPC_LOCK

  • SYS_PTRACE: Required for CUDA IPC and IMEX channel operations across GPUs.

  • IPC_LOCK: Required for pinning memory used by GPUDirect RDMA.

To set these with Helm:

helm install my-nim nim-llm/ -f values.yaml \
  --set 'containerSecurityContext.capabilities.add[0]=SYS_PTRACE' \
  --set 'containerSecurityContext.capabilities.add[1]=IPC_LOCK'

Or add them to your values.yaml:

containerSecurityContext:
  capabilities:
    add:
      - SYS_PTRACE
      - IPC_LOCK
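To confirm that the capabilities were applied, you can inspect the rendered container security context. The pod label and container index below assume the release name from the minimal example:

```shell
# Print the added capabilities of the first container in the NIM pod.
kubectl get pods -l app.kubernetes.io/instance=my-nim \
  -o jsonpath='{.items[0].spec.containers[0].securityContext.capabilities.add}'
```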

Verify Deployment#

Verify that the pods and service are ready, and then test inference with port forwarding.

  1. Check that the pods are running:

    kubectl get pods -l app.kubernetes.io/instance=my-nim
    
  2. Check that the service is available:

    kubectl get svc -l app.kubernetes.io/instance=my-nim
    
  3. Port-forward the service for local testing:

    kubectl port-forward svc/my-nim-nim-llm 8000:8000
    
  4. Call the readiness endpoint to confirm that the service is ready:

    curl -sS http://127.0.0.1:8000/v1/health/ready
    

A healthy deployment returns an HTTP 200 response from the readiness endpoint.
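Once the readiness endpoint returns 200, you can send a test request through the same port-forward. The model id is elided below; list the deployed models first and substitute the returned id:

```shell
# List the models served by the deployment.
curl -sS http://127.0.0.1:8000/v1/models

# Send a minimal chat completion request. Replace <model_id> with an
# id returned by the /v1/models call above.
curl -sS http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model_id>",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32
  }'
```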