KServe#

This page describes how to deploy NVIDIA NIM for LLMs on KServe in a Kubernetes environment.

Prerequisites#

Before deploying NIM on KServe, make sure you have the following:

  • A Kubernetes cluster with KServe enabled and GPU-capable nodes

  • Configured kubectl access to the target cluster

  • An NGC API key for pulling NIM container images and downloading model artifacts

  • A storage class that supports persistent volumes for model caching

Note

If your Kubernetes cluster is provisioned with NVIDIA Cloud Native Stack (CNS), refer to Enable KServe on CNS. For other environments, refer to the KServe quickstart environment guide.

Create Common Resources#

Create the registry credential, the NGC API key secret, and the model cache PVC. These are the same resources that a Helm deployment uses.

  1. Set the NGC API key in your shell:

    export NGC_API_KEY="nvapi-..."
    
  2. Create the image pull secret:

    kubectl create secret docker-registry ngc-secret \
      --docker-server=nvcr.io \
      --docker-username='$oauthtoken' \
      --docker-password="${NGC_API_KEY}"
    
  3. Create the NGC API key secret:

    kubectl create secret generic nvidia-nim-secrets \
      --from-literal=NGC_API_KEY="${NGC_API_KEY}"
    
  4. Create the cache PVC manifest:

    cat <<'EOF' > nvidia-nim-cache-pvc.yaml
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: nvidia-nim-cache-pvc
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: nfs-client
      resources:
        requests:
          storage: 200Gi
    EOF
    
  5. Apply the cache PVC manifest:

    kubectl apply -f nvidia-nim-cache-pvc.yaml
    

Note

Set storageClassName to a StorageClass that is available in your Kubernetes cluster.

Tip

Adjust the PVC storage value based on your model size and expected cache usage.
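Before moving on, it can help to confirm that the common resources exist. The following commands are a sketch using the resource names from the steps above:

```shell
# Verify that both secrets were created
kubectl get secret ngc-secret nvidia-nim-secrets

# Verify the cache PVC; it may stay Pending until a pod mounts it
# if the StorageClass uses WaitForFirstConsumer volume binding
kubectl get pvc nvidia-nim-cache-pvc
```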

Deploy NIM on KServe#

Complete the following steps to deploy the minimal KServe example:

  1. Create kserve-nim.yaml with a ClusterServingRuntime and an InferenceService:

    apiVersion: serving.kserve.io/v1alpha1
    kind: ClusterServingRuntime
    metadata:
      name: nvidia-nim-llama-3.1-8b-instruct
    spec:
      supportedModelFormats:
        - name: nvidia-nim-llama-3.1-8b-instruct
          version: "2.0.1"
          autoSelect: true
          priority: 1
      protocolVersions:
        - v2
      imagePullSecrets:
        - name: ngc-secret
      containers:
        - name: kserve-container
          image: <NIM_LLM_MODEL_SPECIFIC_IMAGE>:2.0.1
          env:
            - name: NIM_CACHE_PATH
              value: /opt/nim/.cache
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: nvidia-nim-secrets
                  key: NGC_API_KEY
          ports:
            - containerPort: 8000
          livenessProbe:
            httpGet:
              path: /v1/health/live
              port: 8000
          readinessProbe:
            httpGet:
              path: /v1/health/ready
              port: 8000
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
            - name: nim-cache
              mountPath: /opt/nim/.cache
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
        - name: nim-cache
          persistentVolumeClaim:
            claimName: nvidia-nim-cache-pvc
    ---
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: llama-3-1-8b-instruct-1xgpu
    spec:
      predictor:
        minReplicas: 1
        model:
          runtime: nvidia-nim-llama-3.1-8b-instruct
          modelFormat:
            name: nvidia-nim-llama-3.1-8b-instruct
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
    
  2. Apply the manifest:

    kubectl apply -f kserve-nim.yaml
    

This manifest is intentionally minimal and works as a starting point in most clusters.
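Once the pod is running, you can send a test request through a port-forward to the predictor pod. This is a sketch: the `model` value assumes the Llama 3.1 8B Instruct NIM; confirm the exact name with `GET /v1/models`.

```shell
# In one terminal: forward local port 8000 to the predictor pod
kubectl port-forward \
  pod/$(kubectl get pods \
    -l serving.kserve.io/inferenceservice=llama-3-1-8b-instruct-1xgpu \
    -o jsonpath='{.items[0].metadata.name}') 8000:8000

# In another terminal: send a sample chat completion request;
# the model name below is an assumption -- check GET /v1/models
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```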

Optional: Enable LoRA on KServe#

Use the same LoRA setup from Optional: Enable LoRA With Helm to create the LoRA PVC and populate /loras with adapter files.

  1. Update the ClusterServingRuntime in kserve-nim.yaml: append the env and volumeMounts entries to the kserve-container, and the volume to the runtime's volumes list:

    env:
      - name: NIM_PEFT_SOURCE
        value: /loras
    volumeMounts:
      - name: lora-adapter
        mountPath: /loras
    volumes:
      - name: lora-adapter
        persistentVolumeClaim:
          claimName: nvidia-nim-lora-pvc
    
  2. Reapply the manifest:

    kubectl apply -f kserve-nim.yaml
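
After the predictor pod restarts, you can check that the adapters were picked up. A NIM that loads LoRA adapters lists them alongside the base model in `/v1/models`; with local port 8000 forwarded to the predictor pod, the check looks like this:

```shell
# List served models; loaded LoRA adapters appear as additional
# entries named after their directories under /loras
curl -s http://localhost:8000/v1/models
```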
    

Verify Deployment#

Verify that the InferenceService and pod are ready.

  1. Check the InferenceService status:

    kubectl get inferenceservice llama-3-1-8b-instruct-1xgpu
    
  2. Check that the pod is running:

    kubectl get pods -l serving.kserve.io/inferenceservice=llama-3-1-8b-instruct-1xgpu
    
  3. Review detailed InferenceService status and recent events:

    kubectl describe inferenceservice llama-3-1-8b-instruct-1xgpu
    

A successful rollout shows the InferenceService with READY reported as True and the predictor pod in the Running state. The first startup can take a while because the model is downloaded into the cache PVC.
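
Instead of polling the commands above, you can block until the rollout finishes by waiting on the Ready condition. The timeout below is an example value; size it to cover the initial model download.

```shell
# Wait up to 30 minutes for the InferenceService to become Ready
kubectl wait --for=condition=Ready \
  inferenceservice/llama-3-1-8b-instruct-1xgpu --timeout=30m
```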