NIM Operator Observability#
Install Prometheus#
Use Prometheus to scrape metric data from microservices.
Install Prometheus with Helm.
$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update
$ helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack --namespace prometheus --create-namespace
NVIDIA development and testing used the Prometheus Community Kubernetes Helm Charts.
Optional: If you do not have a default storage class, add the following command-line arguments:
--set server.persistentVolume.storageClass=<storage-class> --set alertmanager.persistence.storageClass=<storage-class>
Install Prometheus Adapter.
$ helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
    --namespace prometheus-adapter --create-namespace \
    --set prometheus.url=http://kube-prometheus-stack-prometheus.prometheus.svc.cluster.local \
    --set prometheus.port="9090"
Configure Metrics#
To enable metrics, add spec.metrics to your NIMService or NeMo microservice custom resource:
metrics:
  enabled: true
  serviceMonitor:
    additionalLabels:
      release: kube-prometheus-stack
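As a minimal sketch of where these fields sit, the settings above can be written to a merge-patch file nested under `spec` and applied to an existing resource. The file name `metrics-patch.yaml`, the resource name `my-nim`, and the namespace `nim-service` are hypothetical placeholders, and the `kubectl patch` line is left commented out because it assumes a NIMService with that name already exists in your cluster.

```shell
# Write the metrics settings as a merge patch; the fields are nested under `spec`.
cat > metrics-patch.yaml <<'EOF'
spec:
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        release: kube-prometheus-stack
EOF

# Hypothetical resource name and namespace; uncomment and adjust before running.
# kubectl patch nimservice my-nim -n nim-service --type merge --patch-file metrics-patch.yaml
```

The `release: kube-prometheus-stack` label matches the Helm release name used when installing Prometheus earlier, which is presumably how the Prometheus instance selects the generated ServiceMonitor.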
Annotate the NIM Operator metrics service so that Prometheus scrapes metrics:
$ kubectl annotate -n nim-operator svc k8s-nim-operator-metrics-service prometheus.io/scrape=true
Metrics#
| Metric Name | Type | Description |
|---|---|---|
| | gauge | Specifies the number of NIM cache resources in each state. |
| | gauge | Specifies the number of NIM service resources in each state. |
| | gauge | Specifies the number of NIM pipeline resources in each state. |
In addition to the metrics for the custom resources in the preceding table, the Operator produces the following common Kubernetes controller metrics:
controller_runtime_active_workers
controller_runtime_max_concurrent_reconciles
controller_runtime_reconcile_errors_total
controller_runtime_reconcile_panics_total
controller_runtime_reconcile_time_seconds
controller_runtime_reconcile_total
controller_runtime_terminal_reconcile_errors_total
controller_runtime_webhook_panics_total
workqueue_adds_total
workqueue_depth
workqueue_longest_running_processor_seconds
workqueue_queue_duration_seconds
workqueue_retries_total
workqueue_unfinished_work_seconds
workqueue_work_duration_seconds
Refer to https://book.kubebuilder.io/reference/metrics-reference for information about the common metrics.
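To illustrate how the controller metrics listed above might be inspected, the following sketch sends PromQL expressions to the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The helper name `prom_query`, the `CURL` override, and the two example expressions are illustrative assumptions, not part of the NIM Operator. The sketch also assumes that Prometheus is reachable on `localhost:9090`, for example through a port-forward to the `kube-prometheus-stack-prometheus` service installed earlier.

```shell
# Assumption: Prometheus is reachable on localhost, for example after running
#   kubectl port-forward -n prometheus svc/kube-prometheus-stack-prometheus 9090:9090
PROM_URL="${PROM_URL:-http://localhost:9090}"

# prom_query is a hypothetical helper: it sends one PromQL expression to the
# Prometheus HTTP API instant-query endpoint and prints the JSON response.
# CURL can be overridden (for example CURL=echo) to preview the arguments.
prom_query() {
  "${CURL:-curl}" -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Example expressions built from the metric names listed above.
RECONCILE_ERRORS='sum by (controller) (rate(controller_runtime_reconcile_errors_total[5m]))'
QUEUE_DEPTH='max by (name) (workqueue_depth)'

# Usage, once Prometheus is reachable:
#   prom_query "$RECONCILE_ERRORS"
#   prom_query "$QUEUE_DEPTH"
```

A reconcile error rate that stays above zero, or a work queue depth that keeps growing, usually suggests that a controller is failing or falling behind.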