Managing NIM Services
About NIM Services
A NIM service is a Kubernetes custom resource, nimservices.apps.nvidia.com.
You create and delete NIM service resources to manage NVIDIA NIM microservices.
Refer to the following sample manifest:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
Refer to the following table for information about the commonly modified fields:
Field | Description | Default Value
---|---|---
spec.annotations | Specifies user-supplied annotations to add to the pod. | None
spec.authSecret | Specifies the name of a generic secret that contains NGC_API_KEY. | None
spec.env | Specifies environment variables to set in the NIM microservice container. | None
spec.expose.ingress | When enabled is set to true and you have an ingress controller, configures an ingress for the NIM microservice. Refer to the sample ingress configuration after this table. | Disabled
spec.expose.service.port | Specifies the network port number for the NIM microservice. The frequently used value is 8000. | None
spec.expose.service.type | Specifies the Kubernetes service type to create for the NIM microservice. | ClusterIP
spec.groupID | Specifies the group for the pods. This value is used to set the security context of the pod in the runAsGroup field. | 2000
spec.image | Specifies the repository, tag, pull policy, and pull secrets for the container image. | None
spec.labels | Specifies user-supplied labels to add to the pod. | None
spec.metrics | When enabled is set to true, the Operator configures Prometheus metrics collection for the NIM microservice. | Disabled
spec.resources | Specifies the resource requirements for the pods. | None
spec.replicas | Specifies the desired number of pods in the replica set for the NIM microservice. | 1
spec.scale | When enabled is set to true, the Operator creates a horizontal pod autoscaler for the NIM microservice. The hpa field specifies the autoscaler configuration, such as the minimum and maximum replica counts and the metrics to monitor. | Disabled
spec.storage.nimCache | Specifies the name of the NIM cache that has the cached model profiles for the NIM microservice. Specify values for the name and profile subfields. | None
spec.storage.pvc | If you did not create a NIM cache resource to download and cache your model, you can specify this field to download model profiles. This field has subfields such as create, name, storageClass, size, and volumeAccessMode. To have the Operator create a PVC for the model profiles, specify create: true. | None
spec.storage.readOnly | When set to true, the storage volume is mounted read-only. | false
spec.tolerations | Specifies the tolerations for the pods. | None
spec.userID | Specifies the user ID for the pod. This value is used to set the security context of the pod in the runAsUser field. | 1000

If you have an ingress controller, values like the following sample configure an ingress for the NIM microservice:

ingress:
  enabled: true
  spec:
    ingressClassName: nginx
    rules:
    - host: demo.nvidia.example.com
      http:
        paths:
        - backend:
            service:
              name: meta-llama3-8b-instruct
              port:
                number: 8000
          path: /v1/chat/completions
          pathType: Prefix
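The following fragment is an illustrative sketch, not a complete manifest, that shows how several of these fields fit together in a NIMService spec. The label, annotation, environment variable, and toleration values are placeholders to adapt to your environment:

spec:
  # User-supplied labels and annotations are propagated to the pod.
  labels:
    app.kubernetes.io/part-of: rag-demo
  annotations:
    example.com/team: inference
  # Environment variables set in the NIM microservice container.
  env:
  - name: NIM_LOG_LEVEL
    value: INFO
  # Tolerations let the pod schedule onto tainted GPU nodes.
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule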
Prerequisites
Optional: Add a NIM cache resource for the NIM microservice. If you created a NIM cache resource, specify the name in the spec.storage.nimCache.name field.
If you prefer to have the service download a model to storage, refer to Example: Create a PVC Instead of Using a NIM Cache for a sample manifest.
Procedure
Create a file, such as service-all.yaml, with contents like the following example:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: nv-embedqa-e5-v5
spec:
  image:
    repository: nvcr.io/nim/nvidia/nv-embedqa-e5-v5
    tag: 1.0.4
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: nv-embedqa-e5-v5
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: nv-rerankqa-mistral-4b-v3
spec:
  image:
    repository: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3
    tag: 1.0.4
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: nv-rerankqa-mistral-4b-v3
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
Apply the manifest:
$ kubectl apply -n nim-service -f service-all.yaml
Optional: View information about the NIM services:
$ kubectl describe nimservices.apps.nvidia.com -n nim-service
Partial Output
...
Conditions:
  Last Transition Time:  2024-08-12T19:09:43Z
  Message:               Deployment is ready
  Reason:                Ready
  Status:                True
  Type:                  Ready
  Last Transition Time:  2024-08-12T19:09:43Z
  Message:
  Reason:                Ready
  Status:                False
  Type:                  Failed
State:                   Ready
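You can also list the pods and services that the Operator creates for each NIM service resource:

$ kubectl get pods,services -n nim-service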
Verification
Start a pod that has access to the curl command. Substitute any pod that has the command and meets your organization's security requirements:

$ kubectl run --rm -it -n default curl --image=curlimages/curl:latest -- ash
After the pod starts, you are connected to the ash shell in the pod. Connect to the chat completions endpoint on the NIM for LLMs container:
curl -X "POST" \
'http://meta-llama3-8b-instruct.nim-service:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama3-8b-instruct",
"messages": [
{
"content":"What should I do for a 4 day vacation at Cape Hatteras National Seashore?",
"role": "user"
}],
"top_p": 1,
"n": 1,
"max_tokens": 1024,
"stream": false,
"frequency_penalty": 0.0,
"stop": ["STOP"]
}'
The command connects to the service meta-llama3-8b-instruct.nim-service in the nim-service namespace and specifies the model to use, meta/llama3-8b-instruct. Replace these values if you use a different service name, namespace, or model.
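For a lighter-weight check that does not run inference, you can query the OpenAI-compatible model listing endpoint that NIM for LLMs exposes, assuming the same service name and namespace as above:

curl http://meta-llama3-8b-instruct.nim-service:8000/v1/models

The response should list the models that the microservice serves.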
Press Ctrl+D to exit and delete the pod.
Configuring Horizontal Pod Autoscaling
Prerequisites
Prometheus is installed. NVIDIA development and testing used the Prometheus Community Kubernetes Helm Charts with a command like the following example:

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update
$ helm install prometheus prometheus-community/prometheus --namespace prometheus --create-namespace
If you do not have a default storage class, add the following command-line arguments:
--set server.persistentVolume.storageClass=<storage-class> --set alertmanager.persistence.storageClass=<storage-class>
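To check whether your cluster has a default storage class, you can list the storage classes; the default, if any, is marked with (default) next to its name:

$ kubectl get storageclass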
Prometheus Adapter is installed. NVIDIA development and testing used the same Prometheus Community Kubernetes Helm Charts with a command like the following example:

$ helm install prometheus-adapter prometheus-community/prometheus-adapter \
    --namespace prometheus-adapter \
    --create-namespace \
    --set prometheus.url=http://prometheus-server.prometheus.svc.cluster.local \
    --set prometheus.port="80"
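To confirm that the adapter registered the custom metrics API with the Kubernetes API server, you can query the API group:

$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .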
Autoscaling NIM for LLMs
NVIDIA NIM for LLMs provides several service metrics. Refer to Observability in the NVIDIA NIM for LLMs documentation for information about the metrics.
Annotate the service resource related to NIM for LLMs:
$ kubectl annotate -n nim-service svc meta-llama3-8b-instruct prometheus.io/scrape=true
Prometheus might require several minutes to begin collecting metrics from the service.
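As a quick sanity check, you can confirm that the annotation is present on the service:

$ kubectl get svc -n nim-service meta-llama3-8b-instruct -o jsonpath='{.metadata.annotations}'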
Optional: Confirm that Prometheus collects the metrics.
If you have access to the Prometheus dashboard, search for a service metric such as gpu_cache_usage_perc. Alternatively, you can query Prometheus Adapter:
$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/nim-service/services/*/gpu_cache_usage_perc" | jq .
Example Output
{ "kind": "MetricValueList", "apiVersion": "custom.metrics.k8s.io/v1beta1", "metadata": {}, "items": [ { "describedObject": { "kind": "Service", "namespace": "nim-service", "name": "meta-llama3-8b-instruct", "apiVersion": "/v1" }, "metricName": "gpu_cache_usage_perc", "timestamp": "2024-09-12T15:14:20Z", "value": "0", "selector": null } ] }
Create a file, such as service-hpa.yaml, with contents like the following example:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
  scale:
    enabled: true
    hpa:
      maxReplicas: 2
      minReplicas: 1
      metrics:
      - type: Object
        object:
          metric:
            name: gpu_cache_usage_perc
          describedObject:
            apiVersion: v1
            kind: Service
            name: meta-llama3-8b-instruct
          target:
            type: Value
            value: "0.5"
Apply the manifest:
$ kubectl apply -n nim-service -f service-hpa.yaml
Optional: Confirm the horizontal pod autoscaler resource is created:
$ kubectl get hpa -n nim-service
Example Output
NAME                      REFERENCE                             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
meta-llama3-8b-instruct   Deployment/meta-llama3-8b-instruct    0/500m    1         2         1          40s
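To see how the autoscaler evaluates the metric over time, you can describe the resource; the Events section records scale-up and scale-down decisions:

$ kubectl describe hpa -n nim-service meta-llama3-8b-instruct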
Autoscaling Embedding and Reranking Services
The embedding and reranking services do not expose service metrics. To scale these services, you can monitor the pod-level metrics that NVIDIA DCGM Exporter provides. Refer to the metrics-config.yaml file in the GitHub repository for the default metric names.
Optional: Confirm that Prometheus collects the metrics.
If you have access to the Prometheus dashboard, search for a metric such as DCGM_FI_DEV_FB_USED. Alternatively, you can query Prometheus Adapter:
$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/nim-service/pods/*/DCGM_FI_DEV_FB_USED" | jq .
Partial Output
{ "describedObject": { "kind": "Pod", "namespace": "nim-service", "name": "nv-embedqa-e5-v5-78cb9874c4-ghpmc", "apiVersion": "/v1" }, "metricName": "DCGM_FI_DEV_GPU_UTIL", "timestamp": "2024-09-12T16:16:53Z", "value": "0", "selector": null }
Create a file, such as service-hpa-dcgm.yaml, with contents like the following example:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: nv-embedqa-e5-v5
spec:
  image:
    repository: nvcr.io/nim/nvidia/nv-embedqa-e5-v5
    tag: 1.0.4
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: nv-embedqa-e5-v5
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
  scale:
    enabled: true
    hpa:
      maxReplicas: 2
      minReplicas: 1
      metrics:
      - type: Pods
        pods:
          metric:
            name: DCGM_FI_DEV_FB_USED
          target:
            type: AverageValue
            averageValue: "1000"
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: nv-rerankqa-mistral-4b-v3
spec:
  image:
    repository: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3
    tag: 1.0.4
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: nv-rerankqa-mistral-4b-v3
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
  scale:
    enabled: true
    hpa:
      maxReplicas: 2
      minReplicas: 1
      metrics:
      - type: Pods
        pods:
          metric:
            name: DCGM_FI_DEV_FB_USED
          target:
            type: AverageValue
            averageValue: "1000"
Apply the manifest:
$ kubectl apply -n nim-service -f service-hpa-dcgm.yaml
Sample Manifests
Example: Create a PVC Instead of Using a NIM Cache
As an alternative to creating NIM cache resources to download and cache NIM model profiles, you can specify that the Operator create a PVC into which the NIM service downloads the NIM model profile that it runs.
Create and apply a manifest like the following example:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
  authSecret: ngc-api-secret
  replicas: 1
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: 10Gi
      volumeAccessMode: ReadWriteMany
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
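After you apply the manifest, you can confirm that the Operator created the PVC and that it is bound:

$ kubectl get pvc -n nim-service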
Example: Air-Gapped Environment
For air-gapped environments, you must download the model profiles for the NIM microservices from a host that has internet access. You must manually create a PVC and then transfer the model profile files into the PVC.
Typically, the Operator determines the PVC name by dereferencing it from the NIM cache resource. When there is no NIM cache resource, such as in an air-gapped environment, you must specify the PVC name.
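A manually created PVC might look like the following sketch. The name placeholder matches the manifest that follows; the access mode, storage class, and size are assumptions to adapt to your cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <existing-pvc-name>
  namespace: nim-service
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: <storage-class-name>
  resources:
    requests:
      storage: 10Gi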
Create and apply a manifest like the following example:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
  authSecret: ngc-api-secret
  replicas: 1
  storage:
    pvc:
      name: <existing-pvc-name>
      readOnly: true
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
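Before the NIM service can start, the model profiles must be present in the PVC. One way to transfer them is to mount the PVC in a temporary utility pod and copy the files with kubectl cp. The following is a sketch; the pod name, image, mount path, and local model-store directory are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: pvc-loader
  namespace: nim-service
spec:
  containers:
  - name: loader
    image: busybox:stable
    # Keep the pod alive long enough to copy files into the volume.
    command: ["sleep", "3600"]
    volumeMounts:
    - name: model
      mountPath: /model-store
  volumes:
  - name: model
    persistentVolumeClaim:
      claimName: <existing-pvc-name>

After the pod is running, copy the downloaded profiles into the volume and delete the pod:

$ kubectl cp ./model-store nim-service/pvc-loader:/model-store
$ kubectl delete pod -n nim-service pvc-loader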
Deleting a NIM Service
To delete a NIM service, perform the following steps.
View the NIM services custom resources:
$ kubectl get nimservices.apps.nvidia.com -A
Example Output
NAMESPACE     NAME                      STATUS   AGE
nim-service   meta-llama3-8b-instruct   Ready    2024-08-12T17:16:05Z
Delete the custom resource:
$ kubectl delete nimservice -n nim-service meta-llama3-8b-instruct
If the Operator created the PVC when you created the NIM cache, the Operator deletes the PVC and the cached model profiles. You can determine if the Operator created the PVC by running a command like the following example:
$ kubectl get nimcaches.apps.nvidia.com -n nim-service \
    -o=jsonpath='{range .items[*]}{.metadata.name}: {.spec.storage.pvc.create}{"\n"}{end}'
Example Output
meta-llama3-8b-instruct: true
Next Steps
Deploy applications to use the NIM services, such as the Sample RAG Application.