KServe#

This page describes how to deploy NVIDIA NIM for LLMs on KServe in a Kubernetes environment.

Prerequisites#

  • A running Kubernetes cluster with KServe enabled and GPU nodes available.

  • kubectl configured for the target cluster.

  • NGC credentials for pulling model images and downloading artifacts.

  • A persistent storage class for model cache.

Note

If your Kubernetes cluster is provisioned by NVIDIA Cloud Native Stack (CNS), refer to Enable KServe on CNS. Otherwise, refer to the KServe quickstart environment guide.
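Before creating resources, you can sanity-check that the cluster meets these prerequisites. The commands below are one way to do so; the GPU column expression assumes the NVIDIA device plugin exposes the nvidia.com/gpu resource on your nodes.

```shell
# List nodes with their allocatable GPU count (requires the NVIDIA device plugin).
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'

# Confirm the KServe CRDs are installed.
kubectl get crd inferenceservices.serving.kserve.io clusterservingruntimes.serving.kserve.io

# List the StorageClasses available for the model cache PVC.
kubectl get storageclass
```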

Create Common Resources#

Create the same credentials and cache PVC used by a Helm deployment:

# Set this variable in your shell before running these commands.
# Example: export NGC_API_KEY="nvapi-..."

# The username must be the literal string $oauthtoken; the single quotes
# prevent the shell from expanding it as a variable.
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="${NGC_API_KEY}"

kubectl create secret generic nvidia-nim-secrets \
  --from-literal=NGC_API_KEY="${NGC_API_KEY}"

cat <<'EOF' > nvidia-nim-cache-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nvidia-nim-cache-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 200Gi
EOF

kubectl apply -f nvidia-nim-cache-pvc.yaml

Note

Set storageClassName to a StorageClass that is available in your Kubernetes cluster. The example uses nfs-client because the PVC requests ReadWriteMany access; if your StorageClass supports only ReadWriteOnce, change accessModes accordingly.

Tip

Adjust PVC storage based on your model size and expected cache usage.
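After creating the secrets and the PVC, you can confirm they exist before deploying:

```shell
kubectl get secret ngc-secret nvidia-nim-secrets
kubectl get pvc nvidia-nim-cache-pvc
```

If your StorageClass uses WaitForFirstConsumer volume binding, the PVC remains Pending until a pod mounts it; that is expected.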

Deploy NIM on KServe#

Create kserve-nim.yaml with a ClusterServingRuntime and an InferenceService. Replace <NIM_LLM_MODEL_SPECIFIC_IMAGE> with the NIM container image for your model from nvcr.io:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-llama-3.1-8b-instruct
spec:
  supportedModelFormats:
    - name: nvidia-nim-llama-3.1-8b-instruct
      version: "2.0.0"
      autoSelect: true
      priority: 1
  protocolVersions:
    - v2
  imagePullSecrets:
    - name: ngc-secret
  containers:
    - name: kserve-container
      image: <NIM_LLM_MODEL_SPECIFIC_IMAGE>:2.0.0
      env:
        - name: NIM_CACHE_PATH
          value: /opt/nim/.cache
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: nvidia-nim-secrets
              key: NGC_API_KEY
      ports:
        - containerPort: 8000
      livenessProbe:
        httpGet:
          path: /v1/health/live
          port: 8000
      readinessProbe:
        httpGet:
          path: /v1/health/ready
          port: 8000
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
        - name: nim-cache
          mountPath: /opt/nim/.cache
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 16Gi
    - name: nim-cache
      persistentVolumeClaim:
        claimName: nvidia-nim-cache-pvc
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-1-8b-instruct-1xgpu
spec:
  predictor:
    minReplicas: 1
    model:
      runtime: nvidia-nim-llama-3.1-8b-instruct
      modelFormat:
        name: nvidia-nim-llama-3.1-8b-instruct
      resources:
        requests:
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"

Apply the manifest:

kubectl apply -f kserve-nim.yaml

This manifest is intentionally minimal; treat it as a starting point and tune the probes, resource requests, and shared-memory size (the dshm emptyDir) for your cluster and model.
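On first start, the NIM container downloads the model into the cache PVC, which can take a while. One way to block until the service is ready and follow progress in the meantime (the timeout value is a suggestion, not a requirement):

```shell
# Block until the InferenceService reports the Ready condition.
kubectl wait --for=condition=Ready \
  inferenceservice/llama-3-1-8b-instruct-1xgpu --timeout=30m

# In another shell, follow the predictor logs while the model downloads and loads.
kubectl logs -l serving.kserve.io/inferenceservice=llama-3-1-8b-instruct-1xgpu -f
```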

Optional: Enable LoRA on KServe#

Use the same LoRA setup from Optional: Enable LoRA With Helm to create the LoRA PVC and populate /loras with adapter files.

Merge the following into the ClusterServingRuntime in kserve-nim.yaml: add the environment variable and volume mount to the kserve-container entry, and the volume to spec.volumes:

env:
  - name: NIM_PEFT_SOURCE
    value: /loras
volumeMounts:
  - name: lora-adapter
    mountPath: /loras
volumes:
  - name: lora-adapter
    persistentVolumeClaim:
      claimName: nvidia-nim-lora-pvc

Reapply the manifest:

kubectl apply -f kserve-nim.yaml
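Once the predictor pod restarts, adapters placed under /loras should be listed alongside the base model. A sketch of one way to check, assuming KServe created a predictor Service named llama-3-1-8b-instruct-1xgpu-predictor that exposes port 80; verify the actual name and port with kubectl get service:

```shell
# Forward a local port to the predictor Service, then list served models.
kubectl port-forward service/llama-3-1-8b-instruct-1xgpu-predictor 8080:80 &
curl -s http://localhost:8080/v1/models | jq -r '.data[].id'
```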

Verify Deployment#

kubectl get inferenceservice llama-3-1-8b-instruct-1xgpu
kubectl get pods -l serving.kserve.io/inferenceservice=llama-3-1-8b-instruct-1xgpu
kubectl describe inferenceservice llama-3-1-8b-instruct-1xgpu

A successful rollout shows the InferenceService with READY reported as True and a populated URL column, and the predictor pod in the Running state.
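To send a test request, you can port-forward to the predictor and call the OpenAI-compatible chat completions endpoint. The Service name and model name below are assumptions, derived from the InferenceService name and the Llama 3.1 8B Instruct NIM; adjust both for your deployment:

```shell
# Forward a local port to the predictor Service.
kubectl port-forward service/llama-3-1-8b-instruct-1xgpu-predictor 8080:80 &

# Send a minimal chat completions request.
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```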