KServe#
This page describes how to deploy NVIDIA NIM for LLMs on KServe in a Kubernetes environment.
Prerequisites#
A running Kubernetes cluster with KServe enabled and GPU nodes available.
kubectl configured for the target cluster.
NGC credentials for pulling model images and downloading artifacts.
A persistent storage class for the model cache.
Note
If your Kubernetes cluster is provisioned by NVIDIA Cloud Native Stack (CNS), refer to Enable KServe on CNS. Otherwise, refer to the KServe quickstart environment guide.
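The prerequisites above can be sanity-checked from the command line before you create any resources. The checks below are read-only; the KServe CRD name and the `nvidia.com/gpu` resource name are standard, but output formatting varies by cluster:

```shell
# Pre-flight checks (read-only; safe to run against any cluster).
KSERVE_CRD="inferenceservices.serving.kserve.io"

kubectl cluster-info                # kubectl can reach the target cluster
kubectl get crd "${KSERVE_CRD}"    # present only when KServe is installed
kubectl get storageclass           # at least one class is needed for the cache PVC

# List allocatable GPUs per node; an empty GPUS column means no GPU capacity.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```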
Create Common Resources#
Create the same credentials and cache PVC used by a Helm deployment:
# Set this variable in your shell before running these commands.
# Example: export NGC_API_KEY="nvapi-..."
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="${NGC_API_KEY}"
kubectl create secret generic nvidia-nim-secrets \
  --from-literal=NGC_API_KEY="${NGC_API_KEY}"
cat <<'EOF' > nvidia-nim-cache-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nvidia-nim-cache-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 200Gi
EOF
kubectl apply -f nvidia-nim-cache-pvc.yaml
Note
Set storageClassName to a StorageClass that is available in your Kubernetes cluster.
Tip
Adjust PVC storage based on your model size and expected cache usage.
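Before moving on, you can confirm that both secrets exist and that the PVC has bound; the resource names below match the commands above:

```shell
PVC_NAME="nvidia-nim-cache-pvc"

kubectl get secret ngc-secret nvidia-nim-secrets   # both secrets should be listed
kubectl get pvc "${PVC_NAME}"                      # STATUS should be Bound

# With the WaitForFirstConsumer binding mode, a PVC stays Pending until the
# first pod that mounts it is scheduled; that is expected, not an error.
kubectl get storageclass -o custom-columns='NAME:.metadata.name,BINDING:.volumeBindingMode'
```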
Deploy NIM on KServe#
Create kserve-nim.yaml with a ClusterServingRuntime and an InferenceService:
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-llama-3.1-8b-instruct
spec:
  supportedModelFormats:
    - name: nvidia-nim-llama-3.1-8b-instruct
      version: "2.0.0"
      autoSelect: true
      priority: 1
  protocolVersions:
    - v2
  imagePullSecrets:
    - name: ngc-secret
  containers:
    - name: kserve-container
      image: <NIM_LLM_MODEL_SPECIFIC_IMAGE>:2.0.0
      env:
        - name: NIM_CACHE_PATH
          value: /opt/nim/.cache
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: nvidia-nim-secrets
              key: NGC_API_KEY
      ports:
        - containerPort: 8000
      livenessProbe:
        httpGet:
          path: /v1/health/live
          port: 8000
      readinessProbe:
        httpGet:
          path: /v1/health/ready
          port: 8000
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
        - name: nim-cache
          mountPath: /opt/nim/.cache
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 16Gi
    - name: nim-cache
      persistentVolumeClaim:
        claimName: nvidia-nim-cache-pvc
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-1-8b-instruct-1xgpu
spec:
  predictor:
    minReplicas: 1
    model:
      runtime: nvidia-nim-llama-3.1-8b-instruct
      modelFormat:
        name: nvidia-nim-llama-3.1-8b-instruct
      resources:
        requests:
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"
Apply the manifest:
kubectl apply -f kserve-nim.yaml
This manifest is intentionally minimal and works as a starting point in most clusters.
Optional: Enable LoRA on KServe#
Use the same LoRA setup from Optional: Enable LoRA With Helm to create the LoRA PVC and populate /loras with adapter files.
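If you need a quick way to copy adapter files into the LoRA PVC from your workstation, one option is a short-lived helper pod that mounts the PVC. This is a sketch, not the only approach: the pod name and the local ./my-adapter directory are placeholders, and the PVC name nvidia-nim-lora-pvc is assumed to match the Helm LoRA setup:

```shell
HELPER_POD="lora-copy-helper"

# Temporary pod that mounts the LoRA PVC so files can be copied into it.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: ${HELPER_POD}
spec:
  containers:
    - name: shell
      image: busybox:1.36
      command: ["sleep", "3600"]
      volumeMounts:
        - name: loras
          mountPath: /loras
  volumes:
    - name: loras
      persistentVolumeClaim:
        claimName: nvidia-nim-lora-pvc
EOF

kubectl wait --for=condition=Ready "pod/${HELPER_POD}" --timeout=120s
# Copy a local adapter directory (placeholder path) into the PVC, then clean up.
kubectl cp ./my-adapter "${HELPER_POD}:/loras/my-adapter"
kubectl delete pod "${HELPER_POD}"
```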
Update the ClusterServingRuntime in kserve-nim.yaml: add the env entry and the volume mount to the kserve-container container, and add the volume to spec.volumes:
env:
  - name: NIM_PEFT_SOURCE
    value: /loras
volumeMounts:
  - name: lora-adapter
    mountPath: /loras
volumes:
  - name: lora-adapter
    persistentVolumeClaim:
      claimName: nvidia-nim-lora-pvc
Reapply the manifest:
kubectl apply -f kserve-nim.yaml
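Once the pod restarts with NIM_PEFT_SOURCE set, loaded adapters should appear alongside the base model in the service's /v1/models listing. The service name below follows KServe's default "<InferenceService name>-predictor" convention, and the service port 80 assumes the default predictor Service; both may differ in your environment:

```shell
SERVICE="llama-3-1-8b-instruct-1xgpu-predictor"

# Forward the predictor service locally, then list the models NIM is serving.
kubectl port-forward "service/${SERVICE}" 8000:80 &
PF_PID=$!
sleep 2
curl -s http://localhost:8000/v1/models
kill "${PF_PID}"
```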
Verify Deployment#
kubectl get inferenceservice llama-3-1-8b-instruct-1xgpu
kubectl get pods -l serving.kserve.io/inferenceservice=llama-3-1-8b-instruct-1xgpu
kubectl describe inferenceservice llama-3-1-8b-instruct-1xgpu
A successful rollout shows the InferenceService with READY set to True and its predictor pod in the Running state.
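As a final check, you can send a sample OpenAI-compatible chat request through the predictor service. The served model name meta/llama-3.1-8b-instruct is an assumption that depends on which NIM image you deployed (list /v1/models to confirm), and the service name and port follow KServe's default predictor conventions:

```shell
SERVICE="llama-3-1-8b-instruct-1xgpu-predictor"

# Forward the predictor service locally, then send a sample chat request.
kubectl port-forward "service/${SERVICE}" 8000:80 &
PF_PID=$!
sleep 2
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
kill "${PF_PID}"
```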