NIM Operator Observability

Configuration

Annotate the NIM Operator metrics service so that Prometheus scrapes metrics:

$ kubectl annotate -n nim-operator svc k8s-nim-operator-metrics-service prometheus.io/scrape=true
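
To confirm that the annotation was applied, you can read the annotations back from the service. The following is a minimal sketch that wraps the check in a shell function. The function name `nim_metrics_annotations` is hypothetical; the namespace and service name are the ones used in the preceding command.

```shell
# Print the annotations on the NIM Operator metrics service.
# nim_metrics_annotations is a hypothetical helper name, not part of the Operator.
nim_metrics_annotations() {
  kubectl get svc k8s-nim-operator-metrics-service \
    -n nim-operator \
    -o jsonpath='{.metadata.annotations}'
}
```

Run `nim_metrics_annotations` and look for the `prometheus.io/scrape` key in the output. If the key is missing, the annotation was not applied.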

Alternatively, if your Prometheus instance discovers scrape targets from ServiceMonitor custom resources, apply a manifest like the following example:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nim-operator
  labels:
    app: nim-operator
    release: prometheus   # Must match the label that your Prometheus installation uses to select service monitors; the value can differ.
spec:
  endpoints:
  - interval: 30s
    port: https
    scheme: https
    scrapeTimeout: 10s
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      insecureSkipVerify: true
  namespaceSelector:
    matchNames:
    - nim-operator
  selector:
    matchLabels:
      app.kubernetes.io/name: k8s-nim-operator
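
One way to apply the preceding manifest and confirm that the resource exists is sketched below. Both arguments are assumptions for illustration: the path where you saved the manifest and the namespace that your Prometheus instance watches for service monitors. The function name `apply_nim_servicemonitor` is hypothetical.

```shell
# Apply the ServiceMonitor manifest, then list the resource to confirm it exists.
# apply_nim_servicemonitor is a hypothetical helper name.
apply_nim_servicemonitor() {
  manifest="$1"     # path to the saved manifest (assumption: you saved it locally)
  namespace="$2"    # namespace that your Prometheus instance watches (assumption)
  kubectl apply -n "$namespace" -f "$manifest" &&
    kubectl get servicemonitor nim-operator -n "$namespace"
}
```

For example, `apply_nim_servicemonitor nim-operator-servicemonitor.yaml monitoring`, where the file name and the `monitoring` namespace are placeholders for your own values.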

Metrics

Metric Name	Type	Description
`nimCache_status_total`	gauge	Specifies the number of NIM cache resources in each of the following states: Failed, InProgress, NotReady, PVC-Created, Pending, Ready, Started, Unknown, all.
`nimService_status_total`	gauge	Specifies the number of NIM service resources in each of the following states: Failed, NotReady, Pending, Ready, Unknown, all.
`nimPipeline_status_total`	gauge	Specifies the number of NIM pipeline resources in each of the following states: Failed, NotReady, Ready, Unknown, all.
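
To inspect these gauges outside of Prometheus, you can filter them out of a text-format metrics dump. The following sketch assumes that you already saved the output of the Operator metrics endpoint to a file. The function name `nim_status_metrics` and the file name in the usage note are hypothetical.

```shell
# Print only the NIM custom resource gauges from Prometheus text-format input.
# nim_status_metrics is a hypothetical helper; it reads the dump from stdin.
# The ^ anchor skips the "# HELP" and "# TYPE" comment lines.
nim_status_metrics() {
  grep -E '^nim(Cache|Service|Pipeline)_status_total'
}
```

For example, `nim_status_metrics < metrics.txt` prints one line for each state that the Operator reports for each of the three metrics.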

In addition to the metrics for the custom resources in the preceding table, the Operator produces the following common Kubernetes controller metrics:

controller_runtime_active_workers
controller_runtime_max_concurrent_reconciles
controller_runtime_reconcile_errors_total
controller_runtime_reconcile_panics_total
controller_runtime_reconcile_time_seconds
controller_runtime_reconcile_total
controller_runtime_terminal_reconcile_errors_total
controller_runtime_webhook_panics_total
workqueue_adds_total
workqueue_depth
workqueue_longest_running_processor_seconds
workqueue_queue_duration_seconds
workqueue_retries_total
workqueue_unfinished_work_seconds
workqueue_work_duration_seconds

Refer to https://book.kubebuilder.io/reference/metrics-reference for information about the common metrics.