# KServe
This page describes how to deploy NVIDIA NIM for LLMs on KServe in a Kubernetes environment.
## Prerequisites
Before deploying NIM on KServe, make sure you have the following:
- A Kubernetes cluster with KServe enabled and GPU-capable nodes
- Configured `kubectl` access to the target cluster
- An NGC API key for pulling NIM container images and downloading model artifacts
- A storage class that supports persistent volumes for model caching
> **Note:** If your Kubernetes cluster is provisioned with NVIDIA Cloud Native Stack (CNS), refer to Enable KServe on CNS. For other environments, refer to the KServe quickstart environment guide.
## Create Common Resources
Create the same credentials and cache PVC that a Helm deployment uses.
Set the NGC API key in your shell:
```shell
export NGC_API_KEY="nvapi-..."
```
Create the image pull secret:
```shell
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="${NGC_API_KEY}"
```
Create the NGC API key secret:
```shell
kubectl create secret generic nvidia-nim-secrets \
  --from-literal=NGC_API_KEY="${NGC_API_KEY}"
```
Create the cache PVC manifest:
```shell
cat <<'EOF' > nvidia-nim-cache-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nvidia-nim-cache-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 200Gi
EOF
```
Apply the cache PVC manifest:
```shell
kubectl apply -f nvidia-nim-cache-pvc.yaml
```
> **Note:** Set `storageClassName` to a StorageClass that is available in your Kubernetes cluster.
> **Tip:** Adjust the PVC storage value based on your model size and expected cache usage.
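As a rough starting point for sizing the PVC, you can estimate from the model's parameter count: fp16/bf16 weights take about 2 bytes per parameter, and the cache also holds optimized engines and tokenizer assets. The sketch below is a back-of-the-envelope helper, not an official NIM sizing rule; the 2x headroom factor is an assumption for illustration.

```python
import math

def estimate_cache_gib(params_billion: float,
                       bytes_per_param: int = 2,   # fp16/bf16 weights
                       headroom: float = 2.0) -> int:
    """Rough cache sizing: weight size plus headroom for engines and assets.

    The 2x headroom factor is an illustrative assumption, not a NIM requirement.
    """
    weights_gib = params_billion * 1e9 * bytes_per_param / 2**30
    return math.ceil(weights_gib * headroom)

# An 8B-parameter model in fp16 is ~15 GiB of weights, ~30 GiB with headroom.
print(estimate_cache_gib(8))
```

The 200Gi default in the manifest above leaves room to cache several models or multiple engine variants on one volume.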
## Deploy NIM on KServe
Complete the following steps to deploy the minimal KServe example:
Create `kserve-nim.yaml` with a `ClusterServingRuntime` and an `InferenceService`:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-llama-3.1-8b-instruct
spec:
  supportedModelFormats:
    - name: nvidia-nim-llama-3.1-8b-instruct
      version: "2.0.1"
      autoSelect: true
      priority: 1
  protocolVersions:
    - v2
  imagePullSecrets:
    - name: ngc-secret
  containers:
    - name: kserve-container
      image: <NIM_LLM_MODEL_SPECIFIC_IMAGE>:2.0.1
      env:
        - name: NIM_CACHE_PATH
          value: /opt/nim/.cache
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: nvidia-nim-secrets
              key: NGC_API_KEY
      ports:
        - containerPort: 8000
      livenessProbe:
        httpGet:
          path: /v1/health/live
          port: 8000
      readinessProbe:
        httpGet:
          path: /v1/health/ready
          port: 8000
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
        - name: nim-cache
          mountPath: /opt/nim/.cache
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 16Gi
    - name: nim-cache
      persistentVolumeClaim:
        claimName: nvidia-nim-cache-pvc
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-1-8b-instruct-1xgpu
spec:
  predictor:
    minReplicas: 1
    model:
      runtime: nvidia-nim-llama-3.1-8b-instruct
      modelFormat:
        name: nvidia-nim-llama-3.1-8b-instruct
      resources:
        requests:
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"
```
Apply the manifest:
```shell
kubectl apply -f kserve-nim.yaml
```
This manifest is intentionally minimal and works as a starting point in most clusters.
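Because `kubectl` accepts JSON manifests as well as YAML, you can also generate resources programmatically. The sketch below builds the same `InferenceService` as a Python dictionary and writes it as JSON; the output filename is arbitrary, and the field values mirror the manifest in this guide.

```python
import json

# InferenceService mirroring the values used in this guide.
inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llama-3-1-8b-instruct-1xgpu"},
    "spec": {
        "predictor": {
            "minReplicas": 1,
            "model": {
                "runtime": "nvidia-nim-llama-3.1-8b-instruct",
                "modelFormat": {"name": "nvidia-nim-llama-3.1-8b-instruct"},
                "resources": {
                    "requests": {"nvidia.com/gpu": "1"},
                    "limits": {"nvidia.com/gpu": "1"},
                },
            },
        }
    },
}

# Apply with: kubectl apply -f inference-service.json
with open("inference-service.json", "w") as f:
    json.dump(inference_service, f, indent=2)
```

This pattern is useful when templating many near-identical services, for example one `InferenceService` per model variant.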
## Optional: Enable LoRA on KServe
Use the same LoRA setup from Optional: Enable LoRA With Helm to create the LoRA PVC and populate /loras with adapter files.
Update the runtime settings in `kserve-nim.yaml`:

```yaml
env:
  - name: NIM_PEFT_SOURCE
    value: /loras
volumeMounts:
  - name: lora-adapter
    mountPath: /loras
volumes:
  - name: lora-adapter
    persistentVolumeClaim:
      claimName: nvidia-nim-lora-pvc
```
Reapply the manifest:
```shell
kubectl apply -f kserve-nim.yaml
```
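With `NIM_PEFT_SOURCE` set, NIM scans the mounted directory for adapters, typically one subdirectory per adapter containing Hugging Face-format files. A sketch of the expected layout (the adapter name below is hypothetical):

```
/loras/
└── llama-3.1-8b-math-v1/        # hypothetical adapter name
    ├── adapter_config.json
    └── adapter_model.safetensors
```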
## Verify Deployment
Verify that the InferenceService and pod are ready.
Check the `InferenceService` status:

```shell
kubectl get inferenceservice llama-3-1-8b-instruct-1xgpu
```
Check that the pod is running:
```shell
kubectl get pods -l serving.kserve.io/inferenceservice=llama-3-1-8b-instruct-1xgpu
```
Review detailed `InferenceService` status and recent events:

```shell
kubectl describe inferenceservice llama-3-1-8b-instruct-1xgpu
```
A successful rollout shows the InferenceService with `READY` reported as `True` and the predictor pod in the `Running` state.
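Once the service is ready, you can exercise it with an OpenAI-compatible chat completion request. The sketch below only builds and prints the request body; the URL and model name are assumptions to replace with your service endpoint (for example, reached via `kubectl port-forward`) and the actual served model name.

```python
import json

# Assumed endpoint and model name; replace with your InferenceService URL
# and the model name the NIM container serves.
url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "meta/llama-3.1-8b-instruct",   # assumed served model name
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 64,
}

# Send with any HTTP client, for example:
#   curl -X POST "$URL" -H 'Content-Type: application/json' -d '<payload JSON>'
print(json.dumps(payload, indent=2))
```

A `200` response with a `choices` array confirms the model is serving traffic end to end.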