NIM Operator Observability
Configuration
Annotate the NIM Operator metrics service so that Prometheus scrapes metrics:
$ kubectl annotate -n nim-operator svc k8s-nim-operator-metrics-service prometheus.io/scrape=true
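To confirm that the annotation was applied, you can read it back from the service. The following is a minimal sketch that reuses the namespace and service name from the command above; the function name `check_scrape_annotation` is hypothetical and not part of the Operator. Keep in mind that `prometheus.io/scrape` is a convention: it only has an effect if your Prometheus scrape configuration is set up to honor that annotation.

```shell
# Hypothetical helper (not part of the NIM Operator): prints the value of the
# prometheus.io/scrape annotation on the metrics service.
# The dot in the annotation key is escaped so JSONPath treats it as one field.
check_scrape_annotation() {
  kubectl get -n nim-operator svc k8s-nim-operator-metrics-service \
    -o jsonpath='{.metadata.annotations.prometheus\.io/scrape}'
}
```

Running `check_scrape_annotation` after the annotate command should print `true`; empty output suggests the annotation is missing.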
Alternatively, if your Prometheus installation discovers scrape targets through ServiceMonitor custom resources, apply a manifest like the following example:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nim-operator
  labels:
    app: nim-operator
    release: prometheus # This value can differ according to your Prometheus installation.
spec:
  endpoints:
    - interval: 30s
      port: https
      scheme: https
      scrapeTimeout: 10s
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        insecureSkipVerify: true
  namespaceSelector:
    matchNames:
      - nim-operator
  selector:
    matchLabels:
      app.kubernetes.io/name: k8s-nim-operator
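A ServiceMonitor only produces scrape targets when its selector matches at least one service. As a quick sanity check, the sketch below lists the services that carry the label used in the example manifest. It assumes the `nim-operator` namespace and the `app.kubernetes.io/name: k8s-nim-operator` label shown above; the function name `list_selected_services` is hypothetical.

```shell
# Hypothetical helper: lists services in the nim-operator namespace that carry
# the label the example ServiceMonitor selects on. No output means the
# ServiceMonitor would not match any service.
list_selected_services() {
  kubectl get svc -n nim-operator \
    -l app.kubernetes.io/name=k8s-nim-operator \
    -o name
}
```

If the command prints nothing, compare the labels on the metrics service with the `matchLabels` value in the manifest.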
Metrics
| Metric Name | Type | Description |
|---|---|---|
| | gauge | Specifies the number of NIM cache resources in each of the following states: |
| | gauge | Specifies the number of NIM service resources in each of the following states: |
| | gauge | Specifies the number of NIM pipeline resources in each of the following states: |
In addition to the metrics for the custom resources in the preceding table, the Operator produces the following common Kubernetes controller metrics:
controller_runtime_active_workers
controller_runtime_max_concurrent_reconciles
controller_runtime_reconcile_errors_total
controller_runtime_reconcile_panics_total
controller_runtime_reconcile_time_seconds
controller_runtime_reconcile_total
controller_runtime_terminal_reconcile_errors_total
controller_runtime_webhook_panics_total
workqueue_adds_total
workqueue_depth
workqueue_longest_running_processor_seconds
workqueue_queue_duration_seconds
workqueue_retries_total
workqueue_unfinished_work_seconds
workqueue_work_duration_seconds
Refer to https://book.kubebuilder.io/reference/metrics-reference for information about the common metrics.
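As a rough illustration of how the common controller metrics can be inspected, the following sketch filters a metrics dump in Prometheus text format and keeps only the `controller_runtime_reconcile_errors_total` samples with a non-zero value. The function name `nonzero_reconcile_errors` and the `example-a` and `example-b` controller names are hypothetical, and the sketch assumes that samples carry no trailing timestamp, so the value is the last field on each line.

```shell
# Hypothetical helper: reads Prometheus text-format metrics on stdin and prints
# every controller_runtime_reconcile_errors_total sample whose value is non-zero.
# HELP and TYPE comment lines start with "#", so the anchored pattern skips them.
nonzero_reconcile_errors() {
  awk '/^controller_runtime_reconcile_errors_total[{ ]/ && $NF + 0 > 0 { print }'
}

# Sample input with made-up controller names.
nonzero_reconcile_errors <<'EOF'
# TYPE controller_runtime_reconcile_errors_total counter
controller_runtime_reconcile_errors_total{controller="example-a"} 0
controller_runtime_reconcile_errors_total{controller="example-b"} 3
controller_runtime_reconcile_total{controller="example-b",result="error"} 3
EOF
# prints: controller_runtime_reconcile_errors_total{controller="example-b"} 3
```

The same function can be fed the output of whatever tool you use to fetch the Operator metrics endpoint.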