KServe#

This page describes how to deploy NVIDIA NIM for LLMs on KServe in a Kubernetes environment.

Prerequisites#

Before deploying NIM on KServe, make sure you have the following:

  • A Kubernetes cluster with KServe enabled and GPU-capable nodes

  • Configured kubectl access to the target cluster

  • An NGC API key for pulling NIM container images and downloading model artifacts

  • A storage class that supports persistent volumes for model caching

Note

If your Kubernetes cluster is provisioned with NVIDIA Cloud Native Stack (CNS), refer to Enable KServe on CNS. For other environments, refer to the KServe quickstart environment guide.

Create Common Resources#

Create the registry credential, the NGC API key secret, and the model cache PVC. These are the same resources that a Helm deployment uses.

  1. Set the NGC API key in your shell:

    export NGC_API_KEY="nvapi-..."
    
  2. Create the image pull secret:

    kubectl create secret docker-registry ngc-secret \
      --docker-server=nvcr.io \
      --docker-username='$oauthtoken' \
      --docker-password="${NGC_API_KEY}"
    
  3. Create the NGC API key secret:

    kubectl create secret generic nvidia-nim-secrets \
      --from-literal=NGC_API_KEY="${NGC_API_KEY}"
    
  4. Create the cache PVC manifest:

    cat <<'EOF' > nvidia-nim-cache-pvc.yaml
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: nvidia-nim-cache-pvc
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: nfs-client
      resources:
        requests:
          storage: 200Gi
    EOF
    
  5. Apply the cache PVC manifest:

    kubectl apply -f nvidia-nim-cache-pvc.yaml
    

Note

Set storageClassName to a StorageClass that is available in your Kubernetes cluster.

Tip

Adjust the PVC storage value based on your model size and expected cache usage.
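Before moving on, it can help to confirm that the common resources exist. The following commands are a sketch using the resource names from the steps above:

```shell
# Verify that both secrets were created
kubectl get secret ngc-secret nvidia-nim-secrets

# Verify the cache PVC; it may stay Pending until a pod mounts it
# if the StorageClass uses WaitForFirstConsumer volume binding
kubectl get pvc nvidia-nim-cache-pvc
```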

Deploy NIM on KServe#

Complete the following steps to deploy the minimal KServe example:

  1. Create kserve-nim.yaml with a ClusterServingRuntime and an InferenceService:

    apiVersion: serving.kserve.io/v1alpha1
    kind: ClusterServingRuntime
    metadata:
      name: nvidia-nim-llama-3.1-8b-instruct
    spec:
      supportedModelFormats:
        - name: nvidia-nim-llama-3.1-8b-instruct
          version: "2.0.1"
          autoSelect: true
          priority: 1
      protocolVersions:
        - v2
      imagePullSecrets:
        - name: ngc-secret
      containers:
        - name: kserve-container
          image: <NIM_LLM_MODEL_SPECIFIC_IMAGE>:2.0.1
          env:
            - name: NIM_CACHE_PATH
              value: /opt/nim/.cache
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: nvidia-nim-secrets
                  key: NGC_API_KEY
          ports:
            - containerPort: 8000
          livenessProbe:
            httpGet:
              path: /v1/health/live
              port: 8000
          readinessProbe:
            httpGet:
              path: /v1/health/ready
              port: 8000
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
            - name: nim-cache
              mountPath: /opt/nim/.cache
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
        - name: nim-cache
          persistentVolumeClaim:
            claimName: nvidia-nim-cache-pvc
    ---
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: llama-3-1-8b-instruct-1xgpu
    spec:
      predictor:
        minReplicas: 1
        model:
          runtime: nvidia-nim-llama-3.1-8b-instruct
          modelFormat:
            name: nvidia-nim-llama-3.1-8b-instruct
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
    
  2. Apply the manifest:

    kubectl apply -f kserve-nim.yaml
    

This manifest is intentionally minimal and works as a starting point in most clusters.
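Once the pod is running, you can send a test request through a port-forward to the predictor pod. This is a sketch: the `model` value assumes the Llama 3.1 8B Instruct NIM; confirm the exact name with `GET /v1/models`.

```shell
# In one terminal: forward local port 8000 to the predictor pod
kubectl port-forward \
  pod/$(kubectl get pods \
    -l serving.kserve.io/inferenceservice=llama-3-1-8b-instruct-1xgpu \
    -o jsonpath='{.items[0].metadata.name}') 8000:8000

# In another terminal: send a sample chat completion request;
# the model name below is an assumption -- check GET /v1/models
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```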

Optional: Enable LoRA on KServe#

Use the same LoRA setup from Optional: Enable LoRA With Helm to create the LoRA PVC and populate /loras with adapter files.

  1. Update the ClusterServingRuntime in kserve-nim.yaml: append the env and volumeMounts entries to the kserve-container, and the volume to the runtime's volumes list:

    env:
      - name: NIM_PEFT_SOURCE
        value: /loras
    volumeMounts:
      - name: lora-adapter
        mountPath: /loras
    volumes:
      - name: lora-adapter
        persistentVolumeClaim:
          claimName: nvidia-nim-lora-pvc
    
  2. Reapply the manifest:

    kubectl apply -f kserve-nim.yaml
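
After the predictor pod restarts, you can check that the adapters were picked up. A NIM that loads LoRA adapters lists them alongside the base model in `/v1/models`; with local port 8000 forwarded to the predictor pod, the check looks like this:

```shell
# List served models; loaded LoRA adapters appear as additional
# entries named after their directories under /loras
curl -s http://localhost:8000/v1/models
```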
    

Verify Deployment#

Verify that the InferenceService and pod are ready.

  1. Check the InferenceService status:

    kubectl get inferenceservice llama-3-1-8b-instruct-1xgpu
    
  2. Check that the pod is running:

    kubectl get pods -l serving.kserve.io/inferenceservice=llama-3-1-8b-instruct-1xgpu
    
  3. Review detailed InferenceService status and recent events:

    kubectl describe inferenceservice llama-3-1-8b-instruct-1xgpu
    

A successful rollout shows the InferenceService with READY reported as True and the predictor pod in the Running state. The first startup can take a while because the model is downloaded into the cache PVC.
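
Instead of polling the commands above, you can block until the rollout finishes by waiting on the Ready condition. The timeout below is an example value; size it to cover the initial model download.

```shell
# Wait up to 30 minutes for the InferenceService to become Ready
kubectl wait --for=condition=Ready \
  inferenceservice/llama-3-1-8b-instruct-1xgpu --timeout=30m
```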