
Copyright © 2026, NVIDIA Corporation.

Monitoring & Observability

The NVIDIA Cluster Agent and Operator provide built-in monitoring through Prometheus metrics, structured logging, and OpenTelemetry tracing.

Monitoring Data

Metrics

Prerequisites

To use the PodMonitor and ServiceMonitor examples below, you must first install the Prometheus Operator. Follow the Prometheus Operator installation guide to set this up in your cluster.

The Cluster Agent and Operator emit Prometheus-style metrics. The following metric labels are available by default. The full list of available metrics is updated regularly and is therefore not reproduced here.

Metric Label          Description
nvca_event_name       The name of the event
nvca_nca_id           The NCA ID of this NVCA instance
nvca_cluster_name     The NVCA cluster name
nvca_cluster_group    The NVCA cluster group
nvca_version          The NVCA version

Cluster maintainers can scrape the available metrics. See a full example of how to do this with an OpenTelemetry Collector in the cluster here.
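
As a rough illustration of how the default labels can be consumed once metrics have been scraped, the following Python sketch parses lines in the Prometheus text exposition format and keeps only the NVCA labels listed in the table above. It uses the standard library only. The metric name `nvca_example_total`, the `path` label, and the label values in the usage note are hypothetical, since the actual metric names are not listed on this page.

```python
import re

# One label pair such as nvca_cluster_name="demo"; escaped characters inside values are tolerated.
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')
# One sample line: metric name, optional {labels}, then the sample value.
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')

# Default labels documented for the Cluster Agent and Operator.
NVCA_LABELS = (
    "nvca_event_name",
    "nvca_nca_id",
    "nvca_cluster_name",
    "nvca_cluster_group",
    "nvca_version",
)


def parse_sample(line):
    """Return (metric_name, labels, value) for a sample line, or None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _SAMPLE_RE.match(line)
    if match is None:
        return None
    name, raw_labels, value = match.groups()
    # Label values are kept as written (escape sequences are not decoded in this sketch).
    labels = dict(_LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)


def nvca_labels(labels):
    """Keep only the default NVCA labels from a parsed label set."""
    return {key: value for key, value in labels.items() if key in NVCA_LABELS}
```

For example, passing the hypothetical line `nvca_example_total{nvca_cluster_name="demo-cluster",nvca_version="2.0.0",path="/x"} 3` to `parse_sample` yields the metric name, all three labels, and the value `3.0`; `nvca_labels` then drops the non-NVCA `path` label.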

The following PodMonitor (for the NVCA Operator) and ServiceMonitor (for NVCA) examples can be used as a reference:

Sample NVCA Operator PodMonitor

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: prometheus-agent
    app.kubernetes.io/name: metrics-nvca-operator
    jobLabel: metrics-nvca-operator
    release: prometheus-agent
    prometheus.agent/podmonitor-discover: "true"
  name: metrics-nvca-operator
  namespace: monitoring
spec:
  podMetricsEndpoints:
  - port: http
    scheme: http
    path: /metrics
  jobLabel: jobLabel
  selector:
    matchLabels:
      app.kubernetes.io/name: nvca-operator
  namespaceSelector:
    matchNames:
    - nvca-operator

Sample NVCA ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: prometheus-agent
    app.kubernetes.io/name: metrics-nvca
    jobLabel: metrics-nvca
    release: prometheus-agent
    prometheus.agent/servicemonitor-discover: "true"
  name: prometheus-agent-nvca
  namespace: monitoring
spec:
  endpoints:
  - port: nvca
  jobLabel: jobLabel
  selector:
    matchLabels:
      app.kubernetes.io/name: nvca
  namespaceSelector:
    matchNames:
    - nvca-system
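
Both sample monitors pick their scrape targets through `selector.matchLabels` and `namespaceSelector.matchNames`. The Python sketch below models that selection logic, assuming standard Kubernetes label-selector semantics (every `matchLabels` pair must be present on the target). It is only meant to help reason about why a target is or is not picked up; the function names and the example label sets are hypothetical and are not part of the Prometheus Operator.

```python
def selector_matches(match_labels, object_labels):
    """True when every matchLabels key/value pair is present on the object (logical AND)."""
    return all(object_labels.get(key) == value for key, value in match_labels.items())


def namespace_selected(match_names, namespace):
    """True when the object's namespace is listed in namespaceSelector.matchNames."""
    return namespace in match_names


def is_selected(monitor_spec, object_labels, namespace):
    """Combine the label selector and the namespace selector of a monitor spec."""
    match_labels = monitor_spec.get("selector", {}).get("matchLabels", {})
    match_names = monitor_spec.get("namespaceSelector", {}).get("matchNames", [])
    return selector_matches(match_labels, object_labels) and namespace_selected(match_names, namespace)


# Selector portion of the sample NVCA Operator PodMonitor.
OPERATOR_MONITOR_SPEC = {
    "selector": {"matchLabels": {"app.kubernetes.io/name": "nvca-operator"}},
    "namespaceSelector": {"matchNames": ["nvca-operator"]},
}
```

With this spec, a pod labeled `app.kubernetes.io/name: nvca-operator` in the `nvca-operator` namespace is selected, while the same pod in any other namespace, or a pod without that label, is not.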

Logs

Both the Cluster Agent and Cluster Agent Operator emit logs locally by default.

Local logs for the NVIDIA Cluster Agent Operator can be obtained via kubectl:

$ kubectl logs -l app.kubernetes.io/instance=nvca-operator -n nvca-operator --tail 20

Similarly, local logs for the NVIDIA Cluster Agent can be obtained with:

$ kubectl logs -l app.kubernetes.io/instance=nvca -n nvca-system --tail 20

Function-level inference container logs are currently not supported for functions deployed on non-NVIDIA-managed clusters. Customers are encouraged to emit logs directly from the inference containers running on their own clusters to any third-party tool; there are no public egress limitations for containers.

Tracing

The NVIDIA Cluster Agent provides OpenTelemetry integration for exporting traces and events to compatible collectors. As of agent version 2.0, the only supported collector receiver is Lightstep.

Enable Tracing with Lightstep

  1. Get your Lightstep access token from the Lightstep UI and set it as the LS_ACCESS_TOKEN environment variable.
  2. Get the NVCF cluster name:
$ nvcf_cluster_name="$(kubectl get nvcfbackends -n nvca-operator -o name | cut -d'/' -f2)"
  3. Apply the tracing configuration:
$ kubectl patch nvcfbackends.nvcf.nvidia.io -n nvca-operator "$nvcf_cluster_name" --type=merge --patch="{\"spec\":{\"overrides\":{\"featureGate\":{\"otelConfig\":{\"exporter\":\"lightstep\",\"serviceName\":\"nvcf-nvca\",\"accessToken\":\"${LS_ACCESS_TOKEN}\"}}}}}"
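
The patch above interpolates the access token directly into a hand-escaped JSON string, which can break if the token contains quotes or backslashes. As an alternative, the following Python sketch builds the same merge patch with `json.dumps` and assembles the equivalent `kubectl patch` invocation. The field names and values are taken from the command above; the helper function names are hypothetical, and `apply_patch` assumes `kubectl` is on the PATH and already configured for the target cluster.

```python
import json
import subprocess


def build_otel_patch(access_token, exporter="lightstep", service_name="nvcf-nvca"):
    """Return the merge patch as a JSON string, with the token safely escaped."""
    patch = {
        "spec": {
            "overrides": {
                "featureGate": {
                    "otelConfig": {
                        "exporter": exporter,
                        "serviceName": service_name,
                        "accessToken": access_token,
                    }
                }
            }
        }
    }
    return json.dumps(patch)


def patch_command(cluster_name, patch_json):
    """Return the kubectl argument list equivalent to the shell command above."""
    return [
        "kubectl", "patch", "nvcfbackends.nvcf.nvidia.io",
        "-n", "nvca-operator",
        cluster_name,
        "--type=merge",
        f"--patch={patch_json}",
    ]


def apply_patch(cluster_name, access_token):
    """Run the patch; raises CalledProcessError if kubectl exits with a non-zero status."""
    command = patch_command(cluster_name, build_otel_patch(access_token))
    subprocess.run(command, check=True)
```

A typical call would be `apply_patch(cluster_name, os.environ["LS_ACCESS_TOKEN"])` after `import os`, where `cluster_name` is the value obtained in step 2. Passing the arguments as a list avoids shell quoting entirely.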