NVIDIA NIM Operator Observability#

The NIM Operator exposes Prometheus metrics that track the NIM cache, service, and pipeline resources deployed to your cluster, as well as several common Kubernetes operator metrics. To start using metrics with the NIM Operator, you must install Prometheus and add spec.metrics to your NIM Operator custom resources.

To collect metrics from the Triton server of a non-LLM NIM microservice, expose the metrics port by setting spec.expose.service.metricsPort in your NIM service. Then refer to the NVIDIA Triton Inference Server Metrics documentation for details on getting these metrics directly from the Triton server.
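As a rough illustration, the field named above would sit in the NIM service spec as follows. This is a sketch of the fragment only, not a complete resource. The port value 8002 is an assumption based on Triton's default metrics port; use whatever port your microservice actually serves metrics on.

```yaml
spec:
  expose:
    service:
      # Assumed value: 8002 is the default Triton metrics port.
      metricsPort: 8002
```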

Install Prometheus#

Use Prometheus to scrape metric data from microservices.

Install Prometheus with Helm.

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update

$ helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack --namespace prometheus --create-namespace

NVIDIA development and testing used the Prometheus Community Kubernetes Helm Charts.

Optional: If your cluster does not have a default storage class, add the following command-line arguments to the helm install command:

--set server.persistentVolume.storageClass=<storage-class> --set alertmanager.persistence.storageClass=<storage-class>

Install Prometheus Adapter.

$ helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
    --namespace prometheus-adapter \
    --create-namespace \
    --set prometheus.url=http://kube-prometheus-stack-prometheus.prometheus.svc.cluster.local \
    --set prometheus.port="9090"

Configure Metrics#

To enable metrics, add spec.metrics to your NIMService or NeMo microservice custom resource.

  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        release: kube-prometheus-stack
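To show where this fragment lives, the following sketch places it inside a NIMService resource. The apiVersion is an assumption about the installed CRD version, and the resource name and namespace are hypothetical. Required fields such as the image and storage settings are intentionally left out. The release label matches the Helm release name used when installing kube-prometheus-stack, which is what that chart's Prometheus uses by default to select ServiceMonitor resources.

```yaml
apiVersion: apps.nvidia.com/v1alpha1   # assumption: check the version served by your cluster
kind: NIMService
metadata:
  name: example-nim                    # hypothetical name
  namespace: nim-service               # hypothetical namespace
spec:
  # Image, storage, and other required fields are omitted from this sketch.
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        # Must match the Helm release name of kube-prometheus-stack.
        release: kube-prometheus-stack
```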

Annotate the NIM Operator metrics service so that Prometheus scrapes metrics:

$ kubectl annotate -n nim-operator svc k8s-nim-operator-metrics-service prometheus.io/scrape=true
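One way to check that metrics are reaching Prometheus is to query the Prometheus HTTP API through a port-forward. The following is a sketch rather than an official procedure. It assumes the namespace, service name, and port from the install commands in this guide, and the function names prom_query_url and check_metric are hypothetical helpers.

```shell
# Values taken from the Helm install commands in this guide.
PROM_NAMESPACE="prometheus"
PROM_SERVICE="kube-prometheus-stack-prometheus"
PROM_PORT="9090"

# Build an instant-query URL for the Prometheus HTTP API.
prom_query_url() {
  printf 'http://localhost:%s/api/v1/query?query=%s\n' "$PROM_PORT" "$1"
}

# Forward the Prometheus service to localhost, query one metric, then clean up.
check_metric() {
  kubectl port-forward -n "$PROM_NAMESPACE" "svc/$PROM_SERVICE" \
    "$PROM_PORT:$PROM_PORT" >/dev/null &
  pf_pid=$!
  sleep 3   # give the port-forward a moment to become ready
  curl -s "$(prom_query_url "$1")"
  kill "$pf_pid"
}
```

For example, running check_metric nimService_status_total prints the JSON query result. An empty result list suggests that Prometheus is not yet scraping the metric.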

Metrics#

Metric Name	Type	Description
`nimCache_status_total`	gauge	Specifies the number of NIM cache resources in each of the following states: Failed, InProgress, NotReady, PVC-Created, Pending, Ready, Started, Unknown, all
`nimService_status_total`	gauge	Specifies the number of NIM service resources in each of the following states: Failed, NotReady, Pending, Ready, Unknown, all
`nimPipeline_status_total`	gauge	Specifies the number of NIM pipeline resources in each of the following states: Failed, NotReady, Ready, Unknown, all

In addition to the metrics for the custom resources in the preceding table, the Operator produces the following common Kubernetes controller metrics:

controller_runtime_active_workers
controller_runtime_max_concurrent_reconciles
controller_runtime_reconcile_errors_total
controller_runtime_reconcile_panics_total
controller_runtime_reconcile_time_seconds
controller_runtime_reconcile_total
controller_runtime_terminal_reconcile_errors_total
controller_runtime_webhook_panics_total
workqueue_adds_total
workqueue_depth
workqueue_longest_running_processor_seconds
workqueue_queue_duration_seconds
workqueue_retries_total
workqueue_unfinished_work_seconds
workqueue_work_duration_seconds

Refer to https://book.kubebuilder.io/reference/metrics-reference for information about the common metrics.
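As an illustration of how the common controller metrics might be used, the following sketch defines a PrometheusRule that fires when any controller reports reconcile errors. It assumes the Prometheus Operator CRDs installed by kube-prometheus-stack. The rule name, alert name, threshold, and duration are hypothetical choices, not recommendations from NVIDIA.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nim-operator-reconcile-alerts    # hypothetical name
  namespace: nim-operator
  labels:
    # Lets the Helm-installed Prometheus select this rule.
    release: kube-prometheus-stack
spec:
  groups:
    - name: nim-operator.rules
      rules:
        - alert: NIMOperatorReconcileErrors   # hypothetical alert name
          # Per-controller rate of reconcile errors over the last 5 minutes.
          expr: sum by (controller) (rate(controller_runtime_reconcile_errors_total[5m])) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Controller {{ $labels.controller }} is reporting reconcile errors"
```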