Helm and Kubernetes#

This page describes how to deploy NVIDIA NIM for LLMs on Kubernetes using the NIM Helm chart.

Prerequisites#

Before deploying NIM with Helm, make sure you have the following:

  • A running Kubernetes cluster with GPU-capable nodes

  • Configured kubectl access to the target cluster

  • Helm 3.0.0 or later

  • An NGC API key for accessing the NIM Helm chart, pulling NIM container images, and downloading model artifacts

  • A storage class that supports persistent volumes for model caching

Note

To provision Kubernetes with NVIDIA Cloud Native Stack (CNS), refer to Using the Ansible Playbooks.

Fetch and extract the Helm chart before deployment. Go to the NGC Catalog and select the nim-llm Helm chart to pick a version. In most cases, you should select the latest version.

export HELM_CHART_VERSION="<version_number>"
helm fetch "https://helm.ngc.nvidia.com/nim/charts/nim-llm-${HELM_CHART_VERSION}.tgz" \
  --username='$oauthtoken' \
  --password="${NGC_API_KEY}"
tar -xzf "nim-llm-${HELM_CHART_VERSION}.tgz"

Configure Helm#

The following Helm values are the most important settings for a NIM deployment:

  • image.repository: NIM container image to deploy.

  • image.tag: NIM container image tag.

  • model.ngcAPISecret and imagePullSecrets: credentials required to pull images and model artifacts.

  • persistence: storage settings for model cache.

  • resources: GPU limits based on model requirements.

  • env: optional advanced runtime configuration.

Use the following commands to inspect chart documentation and defaults:

helm show readme nim-llm/
helm show values nim-llm/

Cache and Temporary Directories#

Several environment variables are derived from model.nimCache (default: /model-store) at deploy time:

Environment Variable    Derived Value                 Purpose
NIM_CACHE_PATH          <nimCache>                    Primary model cache
HF_HOME                 <nimCache>/huggingface/hub    Hugging Face cache
OUTLINES_CACHE_DIR      <nimCache>/outlines           Outlines grammar cache for structured output

These variables are set automatically in both single-node and multi-node deployments. If you change model.nimCache, ensure that the underlying volume mount is writable and has sufficient space for all cache subdirectories.
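The derivation above can be sketched in shell. This is illustrative only, assuming the default model.nimCache value of /model-store noted above:

```shell
# Sketch of how the cache paths are derived from model.nimCache
# (default: /model-store).
NIM_CACHE="/model-store"

NIM_CACHE_PATH="${NIM_CACHE}"
HF_HOME="${NIM_CACHE}/huggingface/hub"
OUTLINES_CACHE_DIR="${NIM_CACHE}/outlines"

echo "${NIM_CACHE_PATH}"
echo "${HF_HOME}"
echo "${OUTLINES_CACHE_DIR}"
```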

To override OUTLINES_CACHE_DIR independently, add it to the env section in values.yaml:

env:
  - name: OUTLINES_CACHE_DIR
    value: /custom/path/outlines

Minimal Example#

Complete the following steps to deploy the minimal Helm example:

  1. Export your NGC API key so that the commands in the following steps can use it:

    export NGC_API_KEY=<your_ngc_api_key>
    
  2. Create the image pull secret:

    kubectl create secret docker-registry ngc-secret \
      --docker-server=nvcr.io \
      --docker-username='$oauthtoken' \
      --docker-password="${NGC_API_KEY}"
    
  3. Create the NGC API key secret:

    kubectl create secret generic nvidia-nim-secrets \
      --from-literal=NGC_API_KEY="${NGC_API_KEY}"
    
  4. Create values.yaml with a minimal configuration:

    cat <<'EOF' > values.yaml
    image:
      repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
      tag: "2.0.3"
    model:
      ngcAPISecret: "nvidia-nim-secrets"
    persistence:
      enabled: true
      storageClass: "nfs-client"
      accessMode: ReadWriteMany
      size: 50Gi
    resources:
      limits:
        nvidia.com/gpu: 1
    imagePullSecrets:
      - name: "ngc-secret"
    EOF
    

    Note

    Set persistence.storageClass to a StorageClass that is available in your Kubernetes cluster.

    Tip

    Adjust persistence.size based on your model size and expected cache usage.

  5. Install the release:

    helm install my-nim nim-llm/ -f values.yaml
    

These values are intentionally minimal and work as a starting point in most clusters.
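After installation, you can confirm that the secrets exist and that the release deployed. The secret and release names below match the steps above:

```shell
# Confirm the secrets created in steps 2 and 3 exist.
kubectl get secret ngc-secret nvidia-nim-secrets

# Check the status of the Helm release.
helm status my-nim
```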

Enable LoRA With Helm#

Optional: Complete the following steps to enable LoRA adapters with Helm.

  1. Create a dedicated PVC for the LoRA adapters:

    cat <<'EOF' > nvidia-nim-lora-pvc.yaml
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: nvidia-nim-lora-pvc
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: nfs-client
      resources:
        requests:
          storage: 10Gi
    EOF
    
    kubectl apply -f nvidia-nim-lora-pvc.yaml
    
  2. Add LoRA adapters to the PVC under /loras. For each adapter, create one directory that contains the adapter artifacts:

    /loras/
      adapter_name/
        adapter_config.json
        adapter_model.safetensors   # or adapter_model.bin
    

    Note

    NIM loads adapters from NIM_PEFT_SOURCE (/loras in this example). If the PVC is empty, no LoRA adapters are available at runtime.

  3. Update values.yaml:

    env:
      - name: NIM_PEFT_SOURCE
        value: /loras
    extraVolumes:
      lora-adapter:
        persistentVolumeClaim:
          claimName: nvidia-nim-lora-pvc
    extraVolumeMounts:
      lora-adapter:
        mountPath: /loras
    
  4. Apply the updated values:

    helm upgrade my-nim nim-llm/ -f values.yaml
    

For detailed LoRA configuration and runtime behavior, refer to Fine-Tuning with LoRA.
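The adapter layout from step 2 can be staged locally before copying it into the PVC. The adapter name below is a placeholder:

```shell
# Stage one adapter directory matching the layout NIM expects under
# NIM_PEFT_SOURCE. "example-adapter" is an illustrative name; the files
# here are empty stand-ins for real adapter artifacts.
mkdir -p loras/example-adapter
touch loras/example-adapter/adapter_config.json
touch loras/example-adapter/adapter_model.safetensors
ls loras/example-adapter
```

One way to move the staged files into the PVC is `kubectl cp` into a temporary pod that mounts the claim; the exact approach depends on your storage setup.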

Container Security Context#

The containerSecurityContext Helm value sets the Kubernetes container-level security context for the NIM container. It is applied in both single-node and multi-node (LeaderWorkerSet) deployments.

Unlike podSecurityContext, the container-level security context supports the capabilities field, which is required for certain multi-node configurations.

NVL72 Multi-Node Deployments#

GB200 and GB300 NVL72 systems use IMEX (Internode Memory Exchange) channels for cross-node NVLink communication. The containers must have the following Linux capabilities:

containerSecurityContext:
  capabilities:
    add:
      - SYS_PTRACE
      - IPC_LOCK

  • SYS_PTRACE: Required for CUDA IPC and IMEX channel operations across GPUs.

  • IPC_LOCK: Required for pinning memory used by GPUDirect RDMA.

To set these with Helm:

helm install my-nim nim-llm/ -f values.yaml \
  --set 'containerSecurityContext.capabilities.add[0]=SYS_PTRACE' \
  --set 'containerSecurityContext.capabilities.add[1]=IPC_LOCK'

Or add them to your values.yaml:

containerSecurityContext:
  capabilities:
    add:
      - SYS_PTRACE
      - IPC_LOCK
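To confirm that the capabilities were applied, you can inspect the rendered container security context. The pod label and container index below assume the release name from the minimal example:

```shell
# Print the added capabilities of the first container in the NIM pod.
kubectl get pods -l app.kubernetes.io/instance=my-nim \
  -o jsonpath='{.items[0].spec.containers[0].securityContext.capabilities.add}'
```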

Verify Deployment#

Verify that the pods and service are ready, and then test inference with port forwarding.

  1. Check that the pods are running:

    kubectl get pods -l app.kubernetes.io/instance=my-nim
    
  2. Check that the service is available:

    kubectl get svc -l app.kubernetes.io/instance=my-nim
    
  3. Port-forward the service for local testing:

    kubectl port-forward svc/my-nim-nim-llm 8000:8000
    
  4. Call the readiness endpoint to confirm that the service is ready:

    curl -sS http://127.0.0.1:8000/v1/health/ready
    

A healthy deployment returns an HTTP 200 response from the readiness endpoint.
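Once the readiness endpoint returns 200, you can send a test request through the same port-forward. The model id is elided below; list the deployed models first and substitute the returned id:

```shell
# List the models served by the deployment.
curl -sS http://127.0.0.1:8000/v1/models

# Send a minimal chat completion request. Replace <model_id> with an
# id returned by the /v1/models call above.
curl -sS http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model_id>",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32
  }'
```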