NVIDIA NIM Operator Observability

The NIM Operator exposes Prometheus metrics that track the NIM cache, service, and pipeline resources deployed to your cluster, as well as several common Kubernetes operator metrics. To start using metrics with the NIM Operator, install Prometheus and add spec.metrics to your NIM Operator custom resources.

To collect metrics from the Triton server of a non-LLM NIM microservice, expose the metrics port by setting spec.expose.service.metricsPort in your NIM service. Then refer to the NVIDIA Triton Inference Server metrics documentation for details on getting these metrics directly from the Triton server.
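As a minimal sketch of the port setting described above, the following shell helpers build a JSON merge patch for spec.expose.service.metricsPort and apply it with kubectl patch. The function names, the resource name nimservice, and the example service name and namespace are assumptions for illustration. Port 8002 is Triton's default metrics port, but whether your NIM uses it is also an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: the NIMService name, namespace, and port are placeholders.

# Print a JSON merge patch that sets spec.expose.service.metricsPort.
metrics_port_patch() {
  local port="$1"
  printf '{"spec":{"expose":{"service":{"metricsPort":%d}}}}\n' "$port"
}

# Apply the patch to a NIMService.
# Assumption: the custom resource is addressable as "nimservice" in kubectl.
set_metrics_port() {
  local name="$1" namespace="$2" port="$3"
  kubectl patch nimservice "$name" -n "$namespace" --type merge \
    -p "$(metrics_port_patch "$port")"
}

# Example (not run automatically; names are hypothetical):
#   set_metrics_port my-nim nim-service 8002
```

Keeping the patch builder separate from the kubectl call makes it easy to inspect the JSON before it is applied to the cluster.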

Install Prometheus

Use Prometheus to scrape metric data from microservices.

  1. Install Prometheus with Helm.

    $ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    $ helm repo update
    
    $ helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack --namespace prometheus --create-namespace
    

    NVIDIA development and testing used the Prometheus Community Kubernetes Helm Charts.

  2. Optional: If you do not have a default storage class, add the following command-line arguments to the helm install command:

    --set server.persistentVolume.storageClass=<storage-class> --set alertmanager.persistence.storageClass=<storage-class>
    
  3. Install Prometheus Adapter.

    $ helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
        --namespace prometheus-adapter \
        --create-namespace \
        --set prometheus.url=http://kube-prometheus-stack-prometheus.prometheus.svc.cluster.local \
        --set prometheus.port="9090"
    

Configure Metrics

To enable metrics, add spec.metrics to your NIMService or NeMo microservice custom resource.

  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        release: kube-prometheus-stack

Annotate the NIM Operator metrics service so that Prometheus scrapes metrics:

$ kubectl annotate -n nim-operator svc k8s-nim-operator-metrics-service prometheus.io/scrape=true

Metrics

Metric Name                Type    Description

nimCache_status_total      gauge   Specifies the number of NIM cache resources in each of the following states: Failed, InProgress, NotReady, PVC-Created, Pending, Ready, Started, Unknown, all.

nimService_status_total    gauge   Specifies the number of NIM service resources in each of the following states: Failed, NotReady, Pending, Ready, Unknown, all.

nimPipeline_status_total   gauge   Specifies the number of NIM pipeline resources in each of the following states: Failed, NotReady, Ready, Unknown, all.
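To inspect the status gauges listed above without going through Prometheus, you can filter the operator's raw metrics output. The following is a sketch under assumptions: the helper name nim_status_metrics is hypothetical, the metrics port is left as a placeholder, and the endpoint is assumed to serve plain Prometheus text format over HTTP. The filter matches only on the metric names from this page, so it does not depend on the label names the operator uses.

```shell
#!/usr/bin/env bash
# Keep only the NIM Operator status gauges from Prometheus text-format input.
# Comment lines (# HELP / # TYPE) and unrelated metrics are dropped.
nim_status_metrics() {
  grep -E '^nim(Cache|Service|Pipeline)_status_total'
}

# Example (not run automatically); <metrics-port> is a placeholder:
#   kubectl port-forward -n nim-operator \
#     svc/k8s-nim-operator-metrics-service <metrics-port>:<metrics-port> &
#   curl -s http://localhost:<metrics-port>/metrics | nim_status_metrics
```

Each output line is one gauge sample, so a quick look shows how many resources are in each state.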

In addition to the metrics for the custom resources in the preceding table, the Operator produces the following common Kubernetes controller metrics:

  • controller_runtime_active_workers

  • controller_runtime_max_concurrent_reconciles

  • controller_runtime_reconcile_errors_total

  • controller_runtime_reconcile_panics_total

  • controller_runtime_reconcile_time_seconds

  • controller_runtime_reconcile_total

  • controller_runtime_terminal_reconcile_errors_total

  • controller_runtime_webhook_panics_total

  • workqueue_adds_total

  • workqueue_depth

  • workqueue_longest_running_processor_seconds

  • workqueue_queue_duration_seconds

  • workqueue_retries_total

  • workqueue_unfinished_work_seconds

  • workqueue_work_duration_seconds

Refer to https://book.kubebuilder.io/reference/metrics-reference for information about the common metrics.
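Once Prometheus scrapes the operator, the controller metrics above can be queried through the Prometheus HTTP API. The following is a sketch, assuming a local port-forward to the kube-prometheus-stack-prometheus service on port 9090. The helper name prom_query and the PROM_URL and CURL variables are hypothetical, and the PromQL expressions are examples rather than recommended alerts.

```shell
#!/usr/bin/env bash
# Base URL of the Prometheus HTTP API; assumes a local port-forward on 9090.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant PromQL query against /api/v1/query.
# CURL can be overridden (for example, CURL=echo) to print the arguments instead.
prom_query() {
  "${CURL:-curl}" -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Example (not run automatically):
#   kubectl port-forward -n prometheus svc/kube-prometheus-stack-prometheus 9090:9090 &
#   # Reconcile errors per second over the last 5 minutes, grouped by controller
#   prom_query 'sum by (controller) (rate(controller_runtime_reconcile_errors_total[5m]))'
#   # Current depth of each work queue
#   prom_query 'workqueue_depth'
```

Using curl -G with --data-urlencode lets you pass PromQL that contains spaces, brackets, and parentheses without encoding it by hand.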