DOCA Platform Framework (DPF) Documentation v25.4

Observability

This guide covers the steps to enable observability components such as Prometheus, Grafana, Parca and kube-state-metrics for the operator. By default, these components are disabled in the Helm chart and need to be enabled explicitly.

Purpose of kube-state-metrics

kube-state-metrics is a service that listens to the Kubernetes API server and generates metrics about the state of the objects (such as deployments, nodes, pods, etc.) managed by Kubernetes. In the context of our operator, we deploy kube-state-metrics to expose detailed metrics for the custom resource definitions (CRDs) managed by the operator.

Enabling Observability Components

To enable Prometheus, Grafana, Parca, and kube-state-metrics, set the following values in the values.yaml file:


    prometheus:
      enabled: true
    grafana:
      enabled: true
    kube-state-metrics:
      enabled: true
    parca:
      enabled: true

Alternatively, if the operator is already deployed, you can enable these components with Helm command-line options. This is useful for testing the monitoring stack. Do not use this approach in production.


    helm -n dpf-operator-system \
      upgrade dpf-operator dpf-repository/dpf-operator \
      --version=v0.1.0-latest \
      --values <(helm -n dpf-operator-system get values dpf-operator) \
      --set grafana.enabled=true \
      --set prometheus.enabled=true \
      --set parca.enabled=true \
      --set kube-state-metrics.enabled=true
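
After enabling the components, it can help to confirm that their pods were actually created. The following is a minimal sketch, not part of the chart: the function name dpf_observability_pods is hypothetical, and it assumes the pod names contain the component name, which this guide does not confirm.

```shell
# Hypothetical helper: list pods in the operator namespace whose names mention
# one of the observability components. Assumes pod names contain the
# component name (an assumption, not confirmed by this guide).
dpf_observability_pods() {
  namespace="${1:-dpf-operator-system}"
  kubectl -n "$namespace" get pods --no-headers \
    | grep -E 'prometheus|grafana|parca|kube-state-metrics'
}
```

Calling dpf_observability_pods with no arguments checks the dpf-operator-system namespace. Empty output suggests either that the components are not scheduled yet or that the pods are named differently in your deployment.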

Included Grafana Dashboards

We have preconfigured three Grafana dashboards to provide monitoring and insights into the operator and its controllers:

1) DOCA Platform Framework State: This dashboard provides a high-level overview of the operator and its controllers, highlighting key metrics such as resource status, condition states, and time to readiness.
2) Controller Runtime Dashboard: This dashboard provides detailed metrics and visualizations for the controllers, including information on reconciliation times, queue depths, and error rates.
3) Kubernetes API Server Requests Dashboard: This dashboard monitors the requests made to the Kubernetes API server, helping you to identify any performance bottlenecks or excessive API usage.

These dashboards are automatically deployed when Grafana is enabled. Once enabled, you can access them through the Grafana web UI under the "Dashboards" section.
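
If Grafana is not exposed outside the cluster, one common way to reach the web UI is a port-forward. The sketch below rests on two assumptions that are not stated in this guide: that the Grafana Service is named dpf-operator-grafana (inferred from the name of the Grafana secret) and that it listens on port 80. The function name grafana_port_forward is hypothetical. Check the actual Service with kubectl -n dpf-operator-system get svc before relying on it.

```shell
# Hypothetical helper: forward a local port to the Grafana Service.
# Assumes the Service is named dpf-operator-grafana and serves on port 80;
# both are assumptions, so verify them against your cluster first.
grafana_port_forward() {
  namespace="${1:-dpf-operator-system}"
  local_port="${2:-3000}"
  kubectl -n "$namespace" port-forward svc/dpf-operator-grafana "${local_port}:80"
}
```

With the defaults, running grafana_port_forward makes the UI reachable at http://localhost:3000 for as long as the command stays in the foreground.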

Setting the Grafana Admin Password

You can set the Grafana admin password manually by configuring the grafana.adminPassword value in the values.yaml file:


    grafana:
      adminPassword: <your-password>

Alternatively, to have a password generated automatically, leave grafana.adminPassword unset. After deployment, you can retrieve the autogenerated password using the following command:


kubectl -n dpf-operator-system get secret dpf-operator-grafana -ojsonpath='{.data.admin-password}' | base64 -d

This command fetches the password from the Kubernetes secret created for Grafana.
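
The trailing base64 -d is needed because Kubernetes returns Secret data base64-encoded. The round trip can be seen locally without a cluster; the value example-password below is a placeholder, not a real credential.

```shell
# Encode a placeholder value the way Secret data is returned by the API,
# then decode it again, mirroring the `| base64 -d` step in the command above.
encoded=$(printf '%s' 'example-password' | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"
```

Omitting the decode step from the kubectl command would print the encoded form, which Grafana does not accept as the password.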

Note on Storage Solution

By default, Grafana and Prometheus use hostPath for storage. This is not recommended for production environments due to the potential for data loss and lack of scalability. You should configure a more reliable storage solution.

To change the storage solution, you need to modify the respective storage configurations in the values.yaml file:


    prometheus:
      server:
        persistentVolume:
          enabled: true
          storageClass: <your-storage-class>
    grafana:
      persistence:
        enabled: true
        storageClassName: <your-storage-class>

Make sure to replace <your-storage-class> with the appropriate storage class for your environment.
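
To find a value for <your-storage-class> and to check the result afterwards, the standard kubectl resource listings are usually enough. The function names below are hypothetical wrappers added for illustration; the kubectl commands inside them are the ordinary ones.

```shell
# Hypothetical helper: list the storage classes defined in the cluster,
# one per line, to pick a value for <your-storage-class>.
list_storage_classes() {
  kubectl get storageclass -o name
}

# Hypothetical helper: show the persistent volume claims in the operator
# namespace, to check that claims were created and bound after the upgrade.
list_operator_pvcs() {
  kubectl -n "${1:-dpf-operator-system}" get pvc
}
```

A claim that stays in the Pending state typically points to a storage class name that does not exist in the cluster or cannot provision volumes.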

Parca also uses local storage by default. For larger collections of profile data, Parca supports storing them in an S3 bucket.

© Copyright 2025, NVIDIA. Last updated on May 20, 2025.