NIM Operator Support#
Note
Support for the NIM Operator is available in UCS Tools starting with beta release 2.11.0-rc1.
The NIM Operator deploys NIMs to Kubernetes by reconciling custom resources, defined by its Custom Resource Definitions (CRDs), into the Deployments and Services for the given NIMs. UCS Tools supports the NIM Operator in blueprint (application) definitions. Behind the scenes, UCS Tools generates a NIMPipeline custom resource that the operator understands and uses to deploy one or more NIMs.
Consider the VSS blueprint as an example (see VSS documentation for more details). The application YAML configuration is as follows:
specVersion: '2.5.0'
version: 2.3.0
doc: README.md
name: nvidia-blueprint-vss
description: Video Search and Summarization Agent Blueprint
dependencies:
- ucf.svc.vss:2.3.0
- ucf.svc.etcd:2.1.0
- ucf.svc.minio:2.1.0
- ucf.svc.milvus:2.1.0
- ucf.svc.neo4j:2.1.0
- ucf.svc.riva:2.3.0
components:
- name: vss
  type: ucf.svc.vss
  parameters:
    vlmModelType: vila-1.5
    vlmModelPath: ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8
    llmModel: meta/llama-3.3-70b-instruct
    llmModelChat: meta/llama-3.3-70b-instruct
    imagePullSecrets:
    - name: ngc-docker-reg-secret
    resources:
      limits:
        nvidia.com/gpu: 2
  secrets:
    # openai-api-key: openai-api-key
    # nvidia-api-key: nvidia-api-key
    ngc-api-key: ngc-api-key
    graph-db-username: graph-db-username
    graph-db-password: graph-db-password
- name: etcd
  type: ucf.svc.etcd
- name: minio
  type: ucf.svc.minio
- name: milvus
  type: ucf.svc.milvus
- name: neo4j
  type: ucf.svc.neo4j
  secrets:
    db-username: graph-db-username
    db-password: graph-db-password
- name: riva
  type: ucf.svc.riva
  parameters:
    enabled: false
    imagePullSecrets:
    - name: ngc-docker-reg-secret
- name: rag
  type: nim-operator
  parameters:
    services:
    - name: llm-nim
      enabled: true
      spec:
        metrics:
          enabled: true
          serviceMonitor:
            interval: 15s
            scrapeTimeout: 6s
        image:
          repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
          tag: 1.3.3
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-docker-reg-secret
        authSecret: ngc-api-key-secret
        storage:
          pvc:
            create: true
            storageClass: microk8s-hostpath
            name: meta-llama
            volumeAccessMode: ReadWriteMany
            size: 10Gi
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 2
        expose:
          service:
            type: ClusterIP
            port: 8000
        scale:
          enabled: true
          hpa:
            maxReplicas: 2
            minReplicas: 1
            metrics:
            - type: Object
              object:
                metric:
                  name: gpu_cache_usage_perc
                describedObject:
                  apiVersion: v1
                  kind: Service
                  name: llm-nim
                target:
                  type: Value
                  value: '0.3'
    - name: nemo-embedding
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2
          tag: 1.3.1
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-docker-reg-secret
        authSecret: ngc-api-key-secret
        storage:
          pvc:
            create: true
            storageClass: microk8s-hostpath
            name: nemo-embedding
            volumeAccessMode: ReadWriteMany
            size: 10Gi
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
    - name: nemo-rerank
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2
          tag: 1.3.1
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-docker-reg-secret
        authSecret: ngc-api-key-secret
        storage:
          pvc:
            create: true
            storageClass: microk8s-hostpath
            name: nemo-rerank
            volumeAccessMode: ReadWriteMany
            size: 10Gi
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
connections:
  milvus/etcd: etcd/http-api
  milvus/minio: minio/http-api
  vss/milvus: milvus/http-api1 # port 19530
  vss/neo4j-bolt: neo4j/bolt
  vss/llm-openai-api: rag/llm-nim
  vss/nemo-embed: rag/nemo-embedding
  vss/nemo-rerank: rag/nemo-rerank
  vss/riva-api: riva/http-api
secrets:
  # openai-api-key:
  #   k8sSecret:
  #     secretName: openai-api-key-secret
  #     key: OPENAI_API_KEY
  # nvidia-api-key:
  #   k8sSecret:
  #     secretName: nvidia-api-key-secret
  #     key: NVIDIA_API_KEY
  ngc-api-key:
    k8sSecret:
      secretName: ngc-api-key-secret
      key: NGC_API_KEY
  graph-db-username:
    k8sSecret:
      secretName: graph-db-creds-secret
      key: username
  graph-db-password:
    k8sSecret:
      secretName: graph-db-creds-secret
      key: password
The NIM Operator is represented by the built-in component named nim-operator (used as a component in the application above under the name “rag”). In this example, the NIM Operator is configured to deploy three NIMs, where each NIM in the NIMPipeline definition is represented by a NIMService custom resource (NIMService is also part of the NIM Operator’s set of CRDs):
NeMo Embedding
NeMo Reranking
LLM NIM
Each of these NIMService objects has an associated Kubernetes service defined in the spec.expose.service field. The service name is the same as the name of the associated NIMService.
At the bottom of the configuration, the VSS client application makes three connections to the NIMService objects defined by the NIM Operator:
vss/llm-openai-api: rag/llm-nim
vss/nemo-embed: rag/nemo-embedding
vss/nemo-rerank: rag/nemo-rerank
To make a connection to a NIM Operator service, use the name of the operator component in the application (“rag” in this case) and the Kubernetes service name.
Building this application with the command:
ucf_app_builder_cli app build via-nim-blueprint/via-blueprint/via-blueprint.yaml
generates the output blueprint Helm chart named nvidia-blueprint-vss-2.3.0. The NIM Operator’s NIMPipeline manifest is written to the “templates” subdirectory of the chart, at nvidia-blueprint-vss-2.3.0/templates/nvidia-nim-operator-pipeline.yaml.
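For reference, the generated manifest wraps the services configured above in a single NIMPipeline custom resource. The sketch below shows only its general shape; the apps.nvidia.com/v1alpha1 API version and the metadata name are assumptions, so consult the generated file for the exact contents.
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMPipeline
metadata:
  name: nvidia-nim-operator-pipeline   # illustrative name; see the generated file
spec:
  services:
  - name: llm-nim
    enabled: true
    spec:
      image:
        repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
        tag: 1.3.3
      # ... remaining NIMService fields (storage, expose, scale, ...) as configured above
  - name: nemo-embedding
    enabled: true
    spec:
      # ... NIMService fields as configured above
  - name: nemo-rerank
    enabled: true
    spec:
      # ... NIMService fields as configured above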
Prerequisites for Deploying Your Blueprint#
GPU Resources#
You’ll need at least 4 GPUs, although the current configuration is designed for 6. This configuration has been tested on A100 GPUs and will also work with H100 GPUs.
The llm-nim NIM is configured to use 2 GPUs, but you can reduce this to 1. Similarly, the VSS component is also configured for 2 GPUs, but 1 should be sufficient for non-intensive workloads. The NeMo Reranking and NeMo Embedding NIMs are each configured with 1 GPU.
For HPA scaling to work properly, you’ll need an extra GPU set aside (since the HPA is configured with minReplicas equal to 1 and maxReplicas equal to 2). Therefore, you’ll want to deploy on a system with at least 5 GPUs if llm-nim and VSS are both set to use 1 GPU, or at least 7 GPUs if using the default GPU counts.
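Before deploying, you can confirm how many GPUs your cluster actually advertises by inspecting the nodes’ capacity and allocatable resources:
kubectl describe nodes | grep nvidia.com/gpu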
NIM Operator#
Before using Helm to install the blueprint, ensure that the NIM Operator is running in your Kubernetes cluster. See the NIM Operator documentation for installation instructions.
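Once the operator is installed, a quick sanity check is to confirm that its CRDs are registered and its controller pod is running; the nim-operator namespace used below is a common choice and may differ in your installation:
kubectl get crds | grep -i nim
kubectl get pods -n nim-operator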
Add the Prometheus Community Helm Repository#
This repository will be used when installing Prometheus:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
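After adding the repository, refresh the local chart index so the latest chart versions are available:
helm repo update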
Prometheus Kubernetes Stack#
The kube-prometheus-stack Helm chart is available in the prometheus-community Helm repository added above. It includes Prometheus, the Prometheus Operator (which provides CRDs such as ServiceMonitor, PodMonitor, etc.), and Grafana.
If you are using a single-node Kubernetes setup with MicroK8s, you can run microk8s enable observability, which deploys this stack for you.
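On other Kubernetes distributions, a typical installation looks like the following; the release name and the monitoring namespace are just example choices:
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace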
Updating the Prometheus Configuration#
After installing the Prometheus Kubernetes stack, you should update the Prometheus custom resource, which contains the configuration for Prometheus. We recommend setting the serviceMonitorSelector field to {}. By default, at least for MicroK8s, it’s set to:
serviceMonitorSelector:
  matchLabels:
    release: kube-prometheus-stack
Changing it to:
serviceMonitorSelector: {}
means that it will select any ServiceMonitor resource, not just those with the label release: kube-prometheus-stack. You might also want to verify that Prometheus is set to select ServiceMonitors in all namespaces:
serviceMonitorNamespaceSelector: {}
You can update the podMonitorSelector and podMonitorNamespaceSelector fields in the same way.
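To apply these changes, locate the Prometheus resource and edit it; the resource name and namespace depend on how the stack was installed (the MicroK8s observability addon, for example, uses the observability namespace):
kubectl get prometheus -A
kubectl -n <prometheus-namespace> edit prometheus <prometheus-name>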
Prometheus Adapter#
The Prometheus Adapter is used to provide the custom metrics API. The HPA configured earlier for the llm-nim service in the NIM Operator manifest relies on the custom metrics API server to determine when to scale the llm-nim pods.
Install the prometheus-adapter Helm chart from the prometheus-community Helm repository:
helm install prometheus-adapter prometheus-community/prometheus-adapter --set-literal=prometheus.url=http://<prometheus-service-name>.<prometheus-namespace>.svc
Take special care to override the prometheus.url Helm chart value of the Prometheus Adapter, as the default is http://prometheus.default.svc, as indicated in the chart’s values.yaml file. If the prometheus.url field is not configured properly at deployment time, the HPA resource for the llm-nim in the example above will never be able to determine the current value of the metric it is configured to monitor, because it relies on the Prometheus Adapter to ingest Prometheus metrics and expose them through the custom metrics API.
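Once the adapter is running (and the blueprint has been deployed), you can confirm that the custom metrics API is being served and look for the gpu_cache_usage_perc metric used by the HPA; the jq pipe is optional:
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .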
Deploying the VSS Blueprint#
When the blueprint was built using UCS Tools earlier, it generated the blueprint Helm chart folder named nvidia-blueprint-vss-2.3.0. You can deploy this chart in your Kubernetes environment as follows:
helm install nvidia-blueprint-vss nvidia-blueprint-vss-2.3.0 --namespace nvidia-blueprint-vss
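After the install completes, you can watch the NIM Operator reconcile the pipeline and bring up the workloads; the nimpipelines and nimservices resource names below assume the NIM Operator CRDs described earlier:
kubectl get nimpipelines,nimservices -n nvidia-blueprint-vss
kubectl get pods -n nvidia-blueprint-vss
kubectl get hpa -n nvidia-blueprint-vss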
Interacting with the VSS Blueprint via the VIA Python Client CLI#
Use the VIA Python client CLI to upload images or videos and make requests to summarize them.
Prometheus Dashboards#
The kube-prometheus-stack installed earlier includes Grafana, so there are several dashboards you can explore.
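If Grafana is not already exposed in your cluster, you can reach it with a port-forward and then browse to http://localhost:3000; the service name and namespace vary by installation (the MicroK8s observability addon, for example, deploys Grafana into the observability namespace):
kubectl get svc -A | grep grafana
kubectl -n <grafana-namespace> port-forward svc/<grafana-service> 3000:80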
NIM Example Dashboard#
See this section of the NIM LLM documentation for instructions on accessing and installing the NIM Dashboard JSON file.
DCGM Dashboard#
The NVIDIA Data Center GPU Manager Exporter (DCGM-Exporter) is installed as part of the GPU Operator. There is a Kubernetes service for DCGM-Exporter that exposes GPU metrics on the /metrics endpoint. Below is an example of calling the /metrics endpoint manually in a MicroK8s environment:
kubectl port-forward service/nvidia-dcgm-exporter 9400:9400 -n gpu-operator-resources
curl localhost:9400/metrics
Based on the Prometheus configuration changes made earlier, the DCGM-Exporter will be scraped by Prometheus (there is a ServiceMonitor resource named nvidia-dcgm-exporter in the gpu-operator-resources namespace). You can then import the DCGM-Exporter dashboard into Grafana to visualize these metrics.
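To spot-check that the GPU metrics are reaching Prometheus before importing the dashboard, you can port-forward the prometheus-operated service created by the Prometheus Operator and query a DCGM metric such as DCGM_FI_DEV_GPU_UTIL at http://localhost:9090; the observability namespace below is the MicroK8s default and may differ in your cluster:
kubectl -n observability port-forward svc/prometheus-operated 9090:9090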