NIM Operator Observability

Install Prometheus

Use Prometheus to scrape metrics from the microservices.

  1. Install Prometheus with Helm.

    $ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    $ helm repo update
    
    $ helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack --namespace prometheus --create-namespace
    

    NVIDIA development and testing used the Prometheus Community Kubernetes Helm Charts.

  2. Optional: If your cluster does not have a default storage class, add the following arguments to the helm install command in the previous step:

    --set server.persistentVolume.storageClass=<storage-class> --set alertmanager.persistence.storageClass=<storage-class>
    
  3. Install Prometheus Adapter.

    $ helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
        --namespace prometheus-adapter \
        --create-namespace \
        --set prometheus.url=http://kube-prometheus-stack-prometheus.prometheus.svc.cluster.local \
        --set prometheus.port="9090"
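
Steps 1 and 2 can be folded into one helper that appends the storage-class arguments only when a class is supplied. The following is a minimal sketch rather than a tested installer: the function name `install_prometheus` and its optional argument are hypothetical, and the `--set` keys are copied unchanged from step 2.

```shell
#!/bin/sh
# Sketch: steps 1 and 2 as one function. The optional first argument is a
# storage class name; when it is omitted, the extra --set flags are skipped.
install_prometheus() {
  storage_class="${1:-}"

  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts || return 1
  helm repo update || return 1

  # Base arguments from step 1.
  set -- install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
    --namespace prometheus --create-namespace

  # Step 2: storage-class keys copied unchanged from the text above.
  if [ -n "$storage_class" ]; then
    set -- "$@" \
      --set "server.persistentVolume.storageClass=$storage_class" \
      --set "alertmanager.persistence.storageClass=$storage_class"
  fi

  helm "$@"
}
```

Call `install_prometheus` with no argument to use the cluster default, or `install_prometheus <storage-class>` to pass the class explicitly.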
    

Configure Metrics

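To enable metrics, add a spec.metrics section to your NIMService or NeMo microservice custom resource. The release label matches the Helm release name from the installation step, which the kube-prometheus-stack chart uses by default to select ServiceMonitor resources.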

  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        release: kube-prometheus-stack
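
For a resource that already exists, the same fields could be applied as a merge patch instead of editing the manifest. This is a minimal sketch under assumptions: `enable_metrics` and `METRICS_PATCH` are hypothetical names, the resource name and namespace are supplied by you, and `nimservice` is assumed to resolve to the NIMService custom resource in your cluster.

```shell
#!/bin/sh
# JSON form of the spec.metrics fragment shown above.
METRICS_PATCH='{"spec":{"metrics":{"enabled":true,"serviceMonitor":{"additionalLabels":{"release":"kube-prometheus-stack"}}}}}'

# enable_metrics <name> <namespace>
# Merge the patch into an existing NIMService (assumed resource name).
enable_metrics() {
  kubectl patch nimservice "$1" --namespace "$2" --type merge --patch "$METRICS_PATCH"
}
```

For example, `enable_metrics <nimservice-name> <namespace>` patches a single resource and leaves the rest of its spec untouched.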

Annotate the NIM Operator metrics service so that Prometheus scrapes metrics:

$ kubectl annotate -n nim-operator svc k8s-nim-operator-metrics-service prometheus.io/scrape=true
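
To confirm that the annotation was applied, you can read it back from the service. A minimal sketch, assuming the service name and namespace from the command above; the function name `show_scrape_annotation` is hypothetical. It is expected to print the annotation value once the annotation is in place.

```shell
#!/bin/sh
# show_scrape_annotation: print the prometheus.io/scrape annotation on the
# Operator metrics service. The dot in the key is escaped for JSONPath.
show_scrape_annotation() {
  kubectl get svc k8s-nim-operator-metrics-service --namespace nim-operator \
    --output jsonpath='{.metadata.annotations.prometheus\.io/scrape}'
}
```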

Metrics

| Metric Name | Type | Description |
|---|---|---|
| nimCache_status_total | gauge | Specifies the number of NIM cache resources in each of the following states: Failed, InProgress, NotReady, PVC-Created, Pending, Ready, Started, Unknown, all |
| nimService_status_total | gauge | Specifies the number of NIM service resources in each of the following states: Failed, NotReady, Pending, Ready, Unknown, all |
| nimPipeline_status_total | gauge | Specifies the number of NIM pipeline resources in each of the following states: Failed, NotReady, Ready, Unknown, all |

In addition to the metrics for the custom resources in the preceding table, the Operator produces the following common Kubernetes controller metrics:

  • controller_runtime_active_workers

  • controller_runtime_max_concurrent_reconciles

  • controller_runtime_reconcile_errors_total

  • controller_runtime_reconcile_panics_total

  • controller_runtime_reconcile_time_seconds

  • controller_runtime_reconcile_total

  • controller_runtime_terminal_reconcile_errors_total

  • controller_runtime_webhook_panics_total

  • workqueue_adds_total

  • workqueue_depth

  • workqueue_longest_running_processor_seconds

  • workqueue_queue_duration_seconds

  • workqueue_retries_total

  • workqueue_unfinished_work_seconds

  • workqueue_work_duration_seconds

Refer to https://book.kubebuilder.io/reference/metrics-reference for information about the common metrics.
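
Once Prometheus is scraping the Operator, the metrics listed above can be read through the Prometheus HTTP API. The following is a minimal sketch, not an official workflow: `PROM_URL` and `prom_query` are hypothetical names, and the default URL assumes you have forwarded the Prometheus service from the installation step to your workstation.

```shell
#!/bin/sh
# PROM_URL assumes a local port-forward such as:
#   kubectl port-forward -n prometheus svc/kube-prometheus-stack-prometheus 9090:9090
PROM_URL="${PROM_URL:-http://localhost:9090}"

# prom_query <promql>
# Run an instant query and print the raw JSON response.
prom_query() {
  curl --silent --get "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}
```

For example, `prom_query 'nimService_status_total'` returns the current NIM service counts, and `prom_query 'rate(controller_runtime_reconcile_errors_total[5m])'` returns the recent reconcile error rate. The label that carries the state name is not documented here, so inspect the response before filtering on it.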