NIM Operator Support#
Note
Support for the NIM Operator is available in UCS Tools starting with beta release 2.11.0-rc1.
The NIM Operator deploys NIMs to Kubernetes by reconciling custom resources, defined by its Custom Resource Definitions (CRDs), into the Deployments and Services for the given NIMs. UCS Tools supports the NIM Operator in blueprint (application) definitions. Behind the scenes, UCS Tools generates a NIMPipeline custom resource that the operator understands and uses to deploy one or more NIMs.
Consider the VSS blueprint as an example (see VSS documentation for more details). The application YAML configuration is as follows:
specVersion: '2.5.0'
version: 2.3.0
doc: README.md
name: nvidia-blueprint-vss
description: Video Search and Summarization Agent Blueprint
dependencies:
- ucf.svc.vss:2.3.0
- ucf.svc.etcd:2.1.0
- ucf.svc.minio:2.1.0
- ucf.svc.milvus:2.1.0
- ucf.svc.neo4j:2.1.0
- ucf.svc.riva:2.3.0
components:
- name: vss
  type: ucf.svc.vss
  parameters:
    vlmModelType: vila-1.5
    vlmModelPath: ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8
    llmModel: meta/llama-3.3-70b-instruct
    llmModelChat: meta/llama-3.3-70b-instruct
    imagePullSecrets:
    - name: ngc-docker-reg-secret
    resources:
      limits:
        nvidia.com/gpu: 2
  secrets:
    # openai-api-key: openai-api-key
    # nvidia-api-key: nvidia-api-key
    ngc-api-key: ngc-api-key
    graph-db-username: graph-db-username
    graph-db-password: graph-db-password
- name: etcd
  type: ucf.svc.etcd
- name: minio
  type: ucf.svc.minio
- name: milvus
  type: ucf.svc.milvus
- name: neo4j
  type: ucf.svc.neo4j
  secrets:
    db-username: graph-db-username
    db-password: graph-db-password
- name: riva
  type: ucf.svc.riva
  parameters:
    enabled: false
    imagePullSecrets:
    - name: ngc-docker-reg-secret
- name: rag
  type: nim-operator
  parameters:
    services:
    - name: llm-nim
      enabled: true
      spec:
        metrics:
          enabled: true
          serviceMonitor:
            interval: 15s
            scrapeTimeout: 6s
        image:
          repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
          tag: 1.3.3
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-docker-reg-secret
        authSecret: ngc-api-key-secret
        storage:
          pvc:
            create: true
            storageClass: microk8s-hostpath
            name: meta-llama
            volumeAccessMode: ReadWriteMany
            size: 10Gi
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 2
        expose:
          service:
            type: ClusterIP
            port: 8000
        scale:
          enabled: true
          hpa:
            maxReplicas: 2
            minReplicas: 1
            metrics:
            - type: Object
              object:
                metric:
                  name: gpu_cache_usage_perc
                describedObject:
                  apiVersion: v1
                  kind: Service
                  name: llm-nim
                target:
                  type: Value
                  value: '0.3'
    - name: nemo-embedding
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2
          tag: 1.3.1
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-docker-reg-secret
        authSecret: ngc-api-key-secret
        storage:
          pvc:
            create: true
            storageClass: microk8s-hostpath
            name: nemo-embedding
            volumeAccessMode: ReadWriteMany
            size: 10Gi
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
    - name: nemo-rerank
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2
          tag: 1.3.1
          pullPolicy: IfNotPresent
          pullSecrets:
          - ngc-docker-reg-secret
        authSecret: ngc-api-key-secret
        storage:
          pvc:
            create: true
            storageClass: microk8s-hostpath
            name: nemo-rerank
            volumeAccessMode: ReadWriteMany
            size: 10Gi
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
connections:
  milvus/etcd: etcd/http-api
  milvus/minio: minio/http-api
  vss/milvus: milvus/http-api1 # port 19530
  vss/neo4j-bolt: neo4j/bolt
  vss/llm-openai-api: rag/llm-nim
  vss/nemo-embed: rag/nemo-embedding
  vss/nemo-rerank: rag/nemo-rerank
  vss/riva-api: riva/http-api
secrets:
  # openai-api-key:
  #   k8sSecret:
  #     secretName: openai-api-key-secret
  #     key: OPENAI_API_KEY
  # nvidia-api-key:
  #   k8sSecret:
  #     secretName: nvidia-api-key-secret
  #     key: NVIDIA_API_KEY
  ngc-api-key:
    k8sSecret:
      secretName: ngc-api-key-secret
      key: NGC_API_KEY
  graph-db-username:
    k8sSecret:
      secretName: graph-db-creds-secret
      key: username
  graph-db-password:
    k8sSecret:
      secretName: graph-db-creds-secret
      key: password
The NIM Operator is represented by the built-in component named nim-operator (used as a component in the application above under the name “rag”). In this example, the NIM Operator is configured to deploy three NIMs, where each NIM in the NIMPipeline definition is represented by a NIMService custom resource (NIMService is also part of the NIM Operator’s set of CRDs):
NeMo Embedding
NeMo Reranking
LLM NIM
Each of these NIMService objects has an associated Kubernetes service defined in the spec.expose.service field. The service name is the same as the name of the associated NIMService.
At the bottom of the configuration, the VSS client application makes three connections to the NIMService objects defined by the NIM Operator:
vss/llm-openai-api: rag/llm-nim
vss/nemo-embed: rag/nemo-embedding
vss/nemo-rerank: rag/nemo-rerank
To make a connection to a NIM Operator service, use the name of the operator component in the application (“rag” in this case) and the Kubernetes service name.
Building this application with the command:
ucf_app_builder_cli app build via-nim-blueprint/via-blueprint/via-blueprint.yaml
generates the output blueprint Helm chart named nvidia-blueprint-vss-2.3.0. The NIM Operator’s NIMPipeline manifest is written to the “templates” subdirectory of the chart, at nvidia-blueprint-vss-2.3.0/templates/nvidia-nim-operator-pipeline.yaml.
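For reference, the generated manifest wraps the services configured above in a single NIMPipeline custom resource. The sketch below shows only its general shape; the apps.nvidia.com/v1alpha1 API version and the metadata name are assumptions, so consult the generated file for the exact contents.
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMPipeline
metadata:
  name: nvidia-nim-operator-pipeline   # illustrative name; see the generated file
spec:
  services:
  - name: llm-nim
    enabled: true
    spec:
      image:
        repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
        tag: 1.3.3
      # ... remaining NIMService fields (storage, expose, scale, ...) as configured above
  - name: nemo-embedding
    enabled: true
    spec:
      # ... NIMService fields as configured above
  - name: nemo-rerank
    enabled: true
    spec:
      # ... NIMService fields as configured above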
Prerequisites for Deploying Your Blueprint#
GPU Resources#
You’ll need at least 4 GPUs, although the current configuration is designed for 6. This configuration has been tested on A100 GPUs and will also work with H100 GPUs.
The llm-nim NIM is configured to use 2 GPUs, but you can reduce this to 1. Similarly, the VSS component is also configured for 2 GPUs, but 1 should be sufficient for non-intensive workloads. The NeMo Reranking and NeMo Embedding NIMs are each configured with 1 GPU.
For HPA scaling to work properly, you’ll need an extra GPU set aside (since the HPA is configured with minReplicas equal to 1 and maxReplicas equal to 2). Therefore, you’ll want to deploy on a system with at least 5 GPUs if llm-nim and VSS are both set to use 1 GPU, or at least 7 GPUs if using the default GPU counts.
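Before deploying, you can confirm how many GPUs your cluster actually advertises by inspecting the nodes’ capacity and allocatable resources:
kubectl describe nodes | grep nvidia.com/gpu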
NIM Operator#
Before using Helm to install the blueprint, ensure that the NIM Operator is running in your Kubernetes cluster. See the NIM Operator documentation for installation instructions.
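Once the operator is installed, a quick sanity check is to confirm that its CRDs are registered and its controller pod is running; the nim-operator namespace used below is a common choice and may differ in your installation:
kubectl get crds | grep -i nim
kubectl get pods -n nim-operator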
Add the Prometheus Community Helm Repository#
This repository will be used when installing Prometheus:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
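After adding the repository, refresh the local chart index so the latest chart versions are available:
helm repo update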
Prometheus Kubernetes Stack#
The kube-prometheus-stack Helm chart is available in the prometheus-community Helm repository added above. It includes Prometheus, the Prometheus Operator (which provides CRDs such as ServiceMonitor, PodMonitor, etc.), and Grafana.
If you are using a single-node Kubernetes setup with MicroK8s, you can run microk8s enable observability, which deploys this stack for you.
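On other Kubernetes distributions, a typical installation looks like the following; the release name and the monitoring namespace are just example choices:
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace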
Updating the Prometheus Configuration#
After installing the Prometheus Kubernetes stack, you should update the Prometheus custom resource, which contains the configuration for Prometheus. We recommend setting the serviceMonitorSelector field to {}. By default, at least for MicroK8s, it’s set to:
serviceMonitorSelector:
  matchLabels:
    release: kube-prometheus-stack
Changing it to:
serviceMonitorSelector: {}
means that it will select any ServiceMonitor resource, not just those with the label release: kube-prometheus-stack. You might also want to verify that Prometheus is set to select ServiceMonitors in all namespaces:
serviceMonitorNamespaceSelector: {}
You can update the podMonitorSelector and podMonitorNamespaceSelector fields in the same way.
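To apply these changes, locate the Prometheus resource and edit it; the resource name and namespace depend on how the stack was installed (the MicroK8s observability addon, for example, uses the observability namespace):
kubectl get prometheus -A
kubectl -n <prometheus-namespace> edit prometheus <prometheus-name>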
Prometheus Adapter#
The Prometheus Adapter is used to provide the custom metrics API. The HPA configured earlier for the llm-nim service in the NIM Operator manifest relies on the custom metrics API server to determine when to scale the llm-nim pods.
Install the prometheus-adapter Helm chart from the prometheus-community Helm repository:
helm install prometheus-adapter prometheus-community/prometheus-adapter --set-literal=prometheus.url=http://<prometheus-service-name>.<prometheus-namespace>.svc
Take special care to override the prometheus.url Helm chart value of the Prometheus Adapter, as the default is http://prometheus.default.svc, as indicated in the chart’s values.yaml file. If the prometheus.url field is not configured properly at deployment time, the HPA resource for the llm-nim in the example above will never be able to determine the current value of the metric it is configured to monitor, because it relies on the Prometheus Adapter to ingest Prometheus metrics and expose them through the custom metrics API.
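Once the adapter is running (and the blueprint has been deployed), you can confirm that the custom metrics API is being served and look for the gpu_cache_usage_perc metric used by the HPA; the jq pipe is optional:
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .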
Deploying the VSS Blueprint#
When the blueprint was built using UCS Tools earlier, it generated the blueprint Helm chart folder named nvidia-blueprint-vss-2.3.0. You can deploy this chart in your Kubernetes environment as follows:
helm install nvidia-blueprint-vss nvidia-blueprint-vss-2.3.0 --namespace nvidia-blueprint-vss
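After the install completes, you can watch the NIM Operator reconcile the pipeline and bring up the workloads; the nimpipelines and nimservices resource names below assume the NIM Operator CRDs described earlier:
kubectl get nimpipelines,nimservices -n nvidia-blueprint-vss
kubectl get pods -n nvidia-blueprint-vss
kubectl get hpa -n nvidia-blueprint-vss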
Interacting with the VSS Blueprint via the VIA Python Client CLI#
Use the VIA Python client CLI to upload images or videos and make requests to summarize them.
Prometheus Dashboards#
The kube-prometheus-stack installed earlier includes Grafana, so there are several dashboards you can explore.
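If Grafana is not already exposed in your cluster, you can reach it with a port-forward and then browse to http://localhost:3000; the service name and namespace vary by installation (the MicroK8s observability addon, for example, deploys Grafana into the observability namespace):
kubectl get svc -A | grep grafana
kubectl -n <grafana-namespace> port-forward svc/<grafana-service> 3000:80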
NIM Example Dashboard#
See this section of the NIM LLM documentation for instructions on accessing and installing the NIM Dashboard JSON file.
DCGM Dashboard#
The NVIDIA Data Center GPU Manager Exporter (DCGM-Exporter) is installed as part of the GPU Operator. There is a Kubernetes service for DCGM-Exporter that exposes GPU metrics on the /metrics endpoint. Below is an example of calling the /metrics endpoint manually in a MicroK8s environment:
kubectl port-forward service/nvidia-dcgm-exporter 9400:9400 -n gpu-operator-resources
curl localhost:9400/metrics
Based on the Prometheus configuration changes made earlier, the DCGM-Exporter will be scraped by Prometheus (there is a ServiceMonitor resource named nvidia-dcgm-exporter in the gpu-operator-resources namespace). You can then import the DCGM-Exporter dashboard into Grafana to visualize these metrics.
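To spot-check that the GPU metrics are reaching Prometheus before importing the dashboard, you can port-forward the prometheus-operated service created by the Prometheus Operator and query a DCGM metric such as DCGM_FI_DEV_GPU_UTIL at http://localhost:9090; the observability namespace below is the MicroK8s default and may differ in your cluster:
kubectl -n observability port-forward svc/prometheus-operated 9090:9090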