NIM Operator Observability

Configuration

Annotate the NIM Operator metrics service so that Prometheus scrapes metrics:

$ kubectl annotate -n nim-operator svc k8s-nim-operator-metrics-service prometheus.io/scrape=true

Alternatively, if your Prometheus instance discovers scrape targets through ServiceMonitor custom resources, apply a manifest like the following example:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nim-operator
  labels:
    app: nim-operator
    release: prometheus   # This value can differ according to your Prometheus installation.
spec:
  endpoints:
  - interval: 30s
    port: https
    scheme: https
    scrapeTimeout: 10s
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      insecureSkipVerify: true
  namespaceSelector:
    matchNames:
    - nim-operator
  selector:
    matchLabels:
      app.kubernetes.io/name: k8s-nim-operator

Metrics

Metric Name               Type    Description
------------------------  ------  ---------------------------------------------
nimCache_status_total     gauge   Specifies the number of NIM cache resources
                                  in each of the following states: Failed,
                                  InProgress, NotReady, PVC-Created, Pending,
                                  Ready, Started, Unknown, all.

nimService_status_total   gauge   Specifies the number of NIM service resources
                                  in each of the following states: Failed,
                                  NotReady, Pending, Ready, Unknown, all.

nimPipeline_status_total  gauge   Specifies the number of NIM pipeline
                                  resources in each of the following states:
                                  Failed, NotReady, Ready, Unknown, all.
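
As an illustration of how the status gauges above might be used, the following is a minimal sketch of a PrometheusRule manifest that raises an alert when any NIM service or NIM cache resource is reported as Failed. It assumes that the Prometheus Operator is installed and that the state is exposed through a label named status; that label name is an assumption, so confirm it against the metrics endpoint before relying on the rule. The rule name, alert names, durations, and severity are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nim-operator-status-alerts   # hypothetical name
  namespace: nim-operator
  labels:
    release: prometheus   # This value can differ according to your Prometheus installation.
spec:
  groups:
  - name: nim-operator.status
    rules:
    - alert: NIMServiceFailed
      # Assumption: the resource state is carried in a label named "status".
      expr: nimService_status_total{status="Failed"} > 0
      for: 5m   # illustrative duration
      labels:
        severity: warning
      annotations:
        summary: One or more NIM service resources are in the Failed state.
    - alert: NIMCacheFailed
      # Same assumption about the "status" label as above.
      expr: nimCache_status_total{status="Failed"} > 0
      for: 5m   # illustrative duration
      labels:
        severity: warning
      annotations:
        summary: One or more NIM cache resources are in the Failed state.
```

Because the metrics are gauges, the expressions compare the current value directly rather than applying rate().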

In addition to the metrics for the custom resources in the preceding table, the Operator produces the following common Kubernetes controller metrics:

  • controller_runtime_active_workers

  • controller_runtime_max_concurrent_reconciles

  • controller_runtime_reconcile_errors_total

  • controller_runtime_reconcile_panics_total

  • controller_runtime_reconcile_time_seconds

  • controller_runtime_reconcile_total

  • controller_runtime_terminal_reconcile_errors_total

  • controller_runtime_webhook_panics_total

  • workqueue_adds_total

  • workqueue_depth

  • workqueue_longest_running_processor_seconds

  • workqueue_queue_duration_seconds

  • workqueue_retries_total

  • workqueue_unfinished_work_seconds

  • workqueue_work_duration_seconds

Refer to https://book.kubebuilder.io/reference/metrics-reference for information about the common metrics.
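
The common controller metrics can be used in a similar way to watch the health of the Operator itself. The following sketch assumes that the Prometheus Operator is installed. The rule name, alert names, thresholds, and durations are hypothetical and need tuning for your cluster. The controller label on the reconcile metrics and the name label on the work queue metrics come from the controller-runtime and client-go libraries; verify that they are present in your scrape output.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nim-operator-controller-alerts   # hypothetical name
  namespace: nim-operator
  labels:
    release: prometheus   # This value can differ according to your Prometheus installation.
spec:
  groups:
  - name: nim-operator.controller
    rules:
    - alert: NIMOperatorReconcileErrors
      # Fires when a controller keeps returning reconcile errors.
      expr: sum by (controller) (rate(controller_runtime_reconcile_errors_total[5m])) > 0
      for: 10m   # illustrative duration
      labels:
        severity: warning
      annotations:
        summary: A NIM Operator controller is reporting sustained reconcile errors.
    - alert: NIMOperatorWorkqueueBacklog
      # Fires when items accumulate in a work queue; the threshold of 10 is arbitrary.
      expr: max by (name) (workqueue_depth) > 10
      for: 10m   # illustrative duration
      labels:
        severity: warning
      annotations:
        summary: A NIM Operator work queue has a sustained backlog.
```

These expressions are not restricted to a particular scrape job. If other controllers in the cluster export the same metric names, add a job or namespace matcher to limit the rules to the NIM Operator.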