NVIDIA NIM Operator Observability#
The NIM Operator exposes Prometheus metrics that track the NIM cache, service, and pipeline resources deployed to your cluster, as well as several common Kubernetes operator metrics.
To start using metrics with the NIM Operator, you must install Prometheus and add spec.metrics
to your NIM Operator custom resources.
To collect metrics for a Triton server for a non-LLM NIM microservice, expose the metrics port by setting spec.expose.service.metricsPort
in your NIM service.
Then refer to the NVIDIA Triton Inference Server Metrics documentation for details on getting these metrics directly from the Triton server.
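As a rough illustration, the fragment below shows where the metrics port field sits in a NIMService manifest. The resource name is hypothetical, the apiVersion is an assumption about the installed CRD version, and 8002 is Triton's default metrics port; adjust it if your microservice is configured differently.

```yaml
apiVersion: apps.nvidia.com/v1alpha1   # assumption: check the CRD version installed in your cluster
kind: NIMService
metadata:
  name: my-triton-nim                  # hypothetical name
spec:
  # ...image, storage, and other required fields omitted...
  expose:
    service:
      metricsPort: 8002                # Triton's default metrics port
```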
Install Prometheus#
Use Prometheus to scrape metric data from microservices.
Install Prometheus with Helm.
$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update
$ helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack --namespace prometheus --create-namespace
NVIDIA development and testing used the Prometheus Community Kubernetes Helm Charts.
Optional: If you do not have a default storage class, add the following command-line arguments:
--set server.persistentVolume.storageClass=<storage-class> --set alertmanager.persistence.storageClass=<storage-class>
Install Prometheus Adapter.
helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter --namespace prometheus-adapter --create-namespace --set prometheus.url=http://kube-prometheus-stack-prometheus.prometheus.svc.cluster.local --set prometheus.port="9090"
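Before configuring metrics, it can help to confirm that both releases came up. The following is a minimal sketch, not part of the official procedure: the function name check_monitoring_stack is hypothetical, the namespaces match the --namespace flags in the Helm commands above, and the last call assumes the adapter's default rules register the custom metrics API.

```shell
# Hypothetical helper: quick health check for the monitoring stack.
check_monitoring_stack() {
  # Pods from the kube-prometheus-stack release.
  kubectl get pods --namespace prometheus &&
    # Pods from the prometheus-adapter release.
    kubectl get pods --namespace prometheus-adapter &&
    # Assumption: the adapter serves the custom metrics API at this path.
    kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1
}
```

Run check_monitoring_stack after the Helm releases finish installing; every pod should eventually report a Running or Completed status.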
Configure Metrics#
To enable metrics, add spec.metrics
to your NIMService or NeMo microservice custom resource.
metrics:
  enabled: true
  serviceMonitor:
    additionalLabels:
      release: kube-prometheus-stack
Annotate the NIM Operator metrics service so that Prometheus scrapes metrics:
$ kubectl annotate -n nim-operator svc k8s-nim-operator-metrics-service prometheus.io/scrape=true
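To check that the annotation was applied, you can read it back from the service. This is a small sketch under the same names used in the command above; the function name show_scrape_annotation is hypothetical.

```shell
# Hypothetical helper: print the prometheus.io/scrape annotation on the
# NIM Operator metrics service. Dots in the annotation key are escaped
# so that jsonpath treats the key as a single field name.
show_scrape_annotation() {
  kubectl get svc k8s-nim-operator-metrics-service --namespace nim-operator \
    -o jsonpath='{.metadata.annotations.prometheus\.io/scrape}'
}
```

If the annotate command succeeded, show_scrape_annotation should print the value that was set.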
Metrics#
| Metric Name | Type | Description |
|---|---|---|
|  | gauge | Specifies the number of NIM cache resources in each of the following states: |
|  | gauge | Specifies the number of NIM service resources in each of the following states: |
|  | gauge | Specifies the number of NIM pipeline resources in each of the following states: |
In addition to the metrics for the custom resources in the preceding table, the Operator produces the following common Kubernetes controller metrics:
controller_runtime_active_workers
controller_runtime_max_concurrent_reconciles
controller_runtime_reconcile_errors_total
controller_runtime_reconcile_panics_total
controller_runtime_reconcile_time_seconds
controller_runtime_reconcile_total
controller_runtime_terminal_reconcile_errors_total
controller_runtime_webhook_panics_total
workqueue_adds_total
workqueue_depth
workqueue_longest_running_processor_seconds
workqueue_queue_duration_seconds
workqueue_retries_total
workqueue_unfinished_work_seconds
workqueue_work_duration_seconds
Refer to https://book.kubebuilder.io/reference/metrics-reference for information about the common metrics.
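Once Prometheus is scraping the Operator, the controller metrics listed above can be queried through the Prometheus HTTP API. The sketch below is one way to do that from a workstation. The helper name prom_query and the PROM_URL variable are hypothetical, and the default URL assumes a local port-forward to the Prometheus service from the installation step.

```shell
# Assumption: Prometheus is reachable locally, for example through
#   kubectl port-forward -n prometheus svc/kube-prometheus-stack-prometheus 9090:9090
# Hypothetical helper: run one instant PromQL query and print the JSON response.
prom_query() {
  curl --silent --get "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}
```

For example, the following calls ask for the per-controller reconcile error rate over five minutes and for the current work queue depth:

$ prom_query 'sum by (controller) (rate(controller_runtime_reconcile_errors_total[5m]))'
$ prom_query 'workqueue_depth'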