Monitoring & Observability
The NVIDIA Cluster Agent and Operator provide built-in monitoring through Prometheus metrics, structured logging, and OpenTelemetry tracing.
Monitoring Data
Metrics
Prerequisites
To use the PodMonitor and ServiceMonitor examples below, you must first install the Prometheus Operator. Follow the Prometheus Operator installation guide to set this up in your cluster.
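As a quick sanity check before applying the examples, the sketch below reports whether the two custom resource definitions they rely on are present. It assumes `kubectl` is already configured for the target cluster and that the Prometheus Operator registers its CRDs under the standard names `podmonitors.monitoring.coreos.com` and `servicemonitors.monitoring.coreos.com`. The function name `check_monitor_crds` is made up for this illustration.

```shell
# Report whether the CRDs used by the PodMonitor and ServiceMonitor
# examples are installed. Returns non-zero if either one is missing.
check_monitor_crds() {
  missing=0
  for crd in podmonitors.monitoring.coreos.com servicemonitors.monitoring.coreos.com; do
    if kubectl get crd "$crd" >/dev/null 2>&1; then
      echo "found: $crd"
    else
      echo "missing: $crd"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `check_monitor_crds` prints one line per CRD. A `missing:` line suggests the Prometheus Operator installation is incomplete or has not been done yet.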
The cluster agent and operator emit Prometheus-style metrics. The following metric labels are available by default. The full list of available metrics is updated regularly and is therefore not listed here.
| Metric Label | Metric Label Description |
|---|---|
| nvca_event_name | The name of the event |
| nvca_nca_id | The NCA ID of this NVCA instance |
| nvca_cluster_name | The NVCA cluster name |
| nvca_cluster_group | The NVCA cluster group |
| nvca_version | The NVCA version |
Cluster maintainers can scrape the available metrics. See a full example of how to do this with an OpenTelemetry Collector in the cluster here.
Use the following examples of a PodMonitor for NVCA Operator and ServiceMonitor for NVCA for reference:
Sample NVCA Operator PodMonitor
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: prometheus-agent
    app.kubernetes.io/name: metrics-nvca-operator
    jobLabel: metrics-nvca-operator
    release: prometheus-agent
    prometheus.agent/podmonitor-discover: "true"
  name: metrics-nvca-operator
  namespace: monitoring
spec:
  podMetricsEndpoints:
  - port: http
    scheme: http
    path: /metrics
  jobLabel: jobLabel
  selector:
    matchLabels:
      app.kubernetes.io/name: nvca-operator
  namespaceSelector:
    matchNames:
    - nvca-operator
Sample NVCA ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: prometheus-agent
    app.kubernetes.io/name: metrics-nvca
    jobLabel: metrics-nvca
    release: prometheus-agent
    prometheus.agent/servicemonitor-discover: "true"
  name: prometheus-agent-nvca
  namespace: monitoring
spec:
  endpoints:
  - port: nvca
  jobLabel: jobLabel
  selector:
    matchLabels:
      app.kubernetes.io/name: nvca
  namespaceSelector:
    matchNames:
    - nvca-system
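A monitor only produces scrape targets when its selector matches something. The sketch below lists the pods and services that the two samples would select, using the same namespaces and `app.kubernetes.io/name` labels as the manifests. It assumes `kubectl` is configured for the cluster. The function name `list_scrape_targets` is made up for this illustration.

```shell
# List the objects the sample monitors select. An empty result for either
# command means the corresponding monitor has nothing to scrape.
list_scrape_targets() {
  # PodMonitor: pods labelled nvca-operator in the nvca-operator namespace.
  kubectl get pods -n nvca-operator -l app.kubernetes.io/name=nvca-operator
  # ServiceMonitor: services labelled nvca in the nvca-system namespace.
  kubectl get services -n nvca-system -l app.kubernetes.io/name=nvca
}
```

If either list comes back empty, compare the labels on the running workloads with the `matchLabels` in the samples, since labels may differ between releases.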
Logs
Both the Cluster Agent and Cluster Agent Operator emit logs locally by default.
Local logs for the NVIDIA Cluster Agent Operator can be obtained via kubectl:
kubectl logs -l app.kubernetes.io/instance=nvca-operator -n nvca-operator --tail 20
Similarly, NVIDIA Cluster Agent logs can be obtained with the following command via kubectl:
kubectl logs -l app.kubernetes.io/instance=nvca -n nvca-system --tail 20
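When investigating an incident, a time window is often more useful than a fixed line count. The sketch below wraps the two log commands in one helper that selects by time using standard `kubectl logs` flags (`--since`, `--tail`, `--prefix`, `--all-containers`). The helper name `nvca_logs`, its arguments, and the 10-minute default are made up for this illustration. The namespaces and labels come from the commands above.

```shell
# Print recent logs for one component: "operator" or "agent".
# Optional second argument: a duration accepted by --since (default 10m).
nvca_logs() {
  case "${1:-}" in
    operator) ns=nvca-operator; instance=nvca-operator ;;
    agent)    ns=nvca-system;   instance=nvca ;;
    *) echo "usage: nvca_logs operator|agent [since]" >&2; return 2 ;;
  esac
  # --tail=-1 lifts the 10-line default that applies when a selector is used;
  # --prefix tags each line with its source pod and container.
  kubectl logs -l "app.kubernetes.io/instance=$instance" -n "$ns" \
    --since="${2:-10m}" --tail=-1 --prefix --all-containers
}
```

For example, `nvca_logs agent 30m` prints the last 30 minutes of Cluster Agent logs, and `nvca_logs operator` prints the last 10 minutes of operator logs.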
Warning
Function-level inference container logs are currently not supported for functions deployed on non-NVIDIA-managed clusters. Customers are encouraged to emit logs directly from the inference containers running on their own clusters to any third-party tool. There are no public egress limitations for containers.
Tracing
The NVIDIA Cluster Agent provides OpenTelemetry integration for exporting traces and events to compatible collectors. As of agent version 2.0, the only supported collector receiver is Lightstep.
Enable Tracing with Lightstep
Get your Lightstep access token from the Lightstep UI and set it as the LS_ACCESS_TOKEN environment variable.
Get the NVCF cluster name:
nvcf_cluster_name="$(kubectl get nvcfbackends -n nvca-operator -o name | cut -d'/' -f2)"
Apply the tracing configuration:
kubectl patch nvcfbackends.nvcf.nvidia.io -n nvca-operator "$nvcf_cluster_name" --type=merge --patch="{\"spec\":{\"overrides\":{\"featureGate\":{\"otelConfig\":{\"exporter\":\"lightstep\",\"serviceName\":\"nvcf-nvca\",\"accessToken\":\"${LS_ACCESS_TOKEN}\"}}}}}"
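The escaped quotes in the patch command are easy to mistype, and an unset `LS_ACCESS_TOKEN` would silently write an empty token. The sketch below builds the same merge patch in a small helper that rejects an empty token. The helper name `build_otel_patch` is made up for this illustration, and it assumes the token contains no characters that need JSON escaping (such as `"` or `\`).

```shell
# Print the merge patch that enables the Lightstep exporter.
# Fails instead of emitting a patch when the token argument is empty.
build_otel_patch() {
  if [ -z "${1:-}" ]; then
    echo "build_otel_patch: access token is empty" >&2
    return 1
  fi
  printf '{"spec":{"overrides":{"featureGate":{"otelConfig":{"exporter":"lightstep","serviceName":"nvcf-nvca","accessToken":"%s"}}}}}\n' "$1"
}

# Possible usage, with nvcf_cluster_name set as in the earlier step:
#   kubectl patch nvcfbackends.nvcf.nvidia.io -n nvca-operator "$nvcf_cluster_name" \
#     --type=merge --patch="$(build_otel_patch "$LS_ACCESS_TOKEN")"
# Read back the exporter field (not the token) to confirm the patch landed:
#   kubectl get nvcfbackends.nvcf.nvidia.io -n nvca-operator "$nvcf_cluster_name" \
#     -o jsonpath='{.spec.overrides.featureGate.otelConfig.exporter}'
```

The read-back command uses the same field path as the patch, so it should print `lightstep` once the override is stored.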