This page provides guidance on configuring observability for the self-hosted NVCF control plane, including metrics, logging, and tracing.
Common operator questions and where to look on this page or in linked references.
Self-hosted NVCF control-plane observability enables users to monitor the health and performance of their NVCF deployment. The observability solution is designed to be:
The observability solution currently provides:
Looking for a quick start? To deploy example observability components and explore metrics, logs, and dashboards, see self-hosted-example-dashboards.
The example deployments are designed for development and testing only, and are not suitable for production use. For production deployments, follow the guidance on this page to integrate with your own observability infrastructure.
NVCF self-hosted observability is currently in Early Access (EA). During EA, NVCF provides interfaces and documentation for you to integrate with your own observability backend:
What’s Provided:
Your Responsibility:
The following control-plane services expose metrics and logs for monitoring:
Core NVCF Services:
Supporting Services:
Worker Pod Components:
All control-plane services expose Prometheus-compatible metrics endpoints. You can scrape these metrics using:
Metrics Documentation:
Detailed metrics documentation is available for each service, including metric names,
types, labels, and descriptions. See the per-service metrics reference under the
Metrics section.
Log Format:
Log Collection:
You can collect logs using any Kubernetes-compatible log aggregator:
System Logs:
System logs are available at standard UNIX locations and from the systemd journal.
Distributed tracing support via OpenTelemetry Protocol (OTLP) is planned for a future release:
global.observability.tracing
You configure observability by integrating with your own backend:
Use Prometheus Operator with the provided ServiceMonitor examples:
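To illustrate the general shape of such a resource, here is a minimal ServiceMonitor sketch. The resource name, namespaces, label selectors, and port name are hypothetical placeholders, not values shipped by NVCF; replace them with the labels and port names of your actual control-plane Services.

```yaml
# Minimal ServiceMonitor sketch (Prometheus Operator CRD).
# All names, namespaces, and labels below are illustrative assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvcf-control-plane-metrics   # hypothetical name
  namespace: monitoring              # hypothetical: namespace watched by your Prometheus
  labels:
    release: prometheus              # hypothetical: must match your Prometheus serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - nvcf                         # hypothetical: namespace of the control-plane Services
  selector:
    matchLabels:
      app.kubernetes.io/name: nvcf-api   # hypothetical Service label
  endpoints:
    - port: metrics                  # hypothetical: name of the Service port exposing metrics
      path: /metrics                 # assumed metrics path
      interval: 30s
```

The `selector` must match labels on the Service object, not on the pods, and the `port` field refers to the Service port by name.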
Or configure Prometheus scrape targets manually in your prometheus.yml.
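For the manual route, a scrape job based on Kubernetes service discovery might look like the sketch below. The job name, the `nvcf` namespace, and the `metrics` port name are assumptions for illustration; adjust them to match your deployment.

```yaml
# prometheus.yml fragment: a sketch, not an NVCF-provided configuration.
scrape_configs:
  - job_name: nvcf-control-plane     # hypothetical job name
    scheme: http                     # metrics are scraped over in-cluster HTTP
    metrics_path: /metrics           # assumed metrics path
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - nvcf                   # hypothetical control-plane namespace
    relabel_configs:
      # Keep only endpoints whose port is named "metrics" (assumed port name).
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: metrics
```

Prometheus needs RBAC permission to list and watch endpoints in the target namespace for this discovery to work.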
Deploy a log collector as a DaemonSet to ship logs to your backend:
Configure your log collector to:
Enable distributed tracing by setting Helm values under
global.observability.tracing. The control-plane exports traces via OTLP
to your own OTLP-compatible collector. Set collectorEndpoint,
collectorPort, and collectorProtocol to match your collector’s address.
collectorProtocol is the endpoint URI scheme expected by the stack, not the
OTLP transport.
Helm overrides example:
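A minimal sketch of such an override, using only the four fields described on this page. The collector service name and the port are placeholder assumptions for an in-cluster collector; whether collectorPort is expected as a number or a string is also an assumption here.

```yaml
# Helm values override sketch for trace export.
global:
  observability:
    tracing:
      enabled: true
      # Hypothetical in-cluster collector Service, following the
      # <service>.<namespace>.svc.cluster.local pattern.
      collectorEndpoint: otel-collector.observability.svc.cluster.local
      collectorPort: 4317            # assumed: collector listening for OTLP on 4317
      collectorProtocol: http        # URI scheme only; does not select the OTLP transport
```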
Configuration fields:
enabled: Set to true to enable OTLP trace export from control-plane services.
collectorEndpoint: DNS name or address of your OTLP collector (e.g., OpenTelemetry Collector, Jaeger collector). Use a Kubernetes service DNS name such as <service>.<namespace>.svc.cluster.local when the collector runs in-cluster.
collectorPort: Port on which the collector accepts OTLP traffic (e.g., 4317 for gRPC, 4318 for HTTP depending on your collector setup).
collectorProtocol: URI scheme used to build the collector endpoint (http or https). This value does not select the OTLP transport.
Ensure your collector is deployed and reachable from the NVCF control-plane namespace, and that it forwards traces to your backend (Jaeger, Tempo, Zipkin, or another OTLP-compatible system).
Reference Grafana dashboards showing critical metrics are provided for the following control-plane services:
ESS (Encrypted Secrets Service)
Cassandra
Vault
Invocation Service
NVCF API
SIS (SPOT Instance Service)
Worker Pods (Utils Container, Init Container, Inference Container)
State Metrics Service (Available in GA)
Dashboard Location:
Dashboards are provided in native Grafana JSON format for file-provisioning.
Place the dashboards in /etc/grafana/provisioning/dashboards/ so that Grafana loads them on startup.
Published dashboards will be available in the nv-cloud-function-helpers public GitHub repository.
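Grafana's file provisioning also needs a dashboard provider definition that tells it where to find the JSON files. The sketch below assumes the JSON dashboards sit in the directory mentioned above; the provider name and folder are hypothetical.

```yaml
# Grafana dashboard provider sketch, saved as a YAML file under
# /etc/grafana/provisioning/dashboards/ (the filename is up to you).
apiVersion: 1
providers:
  - name: nvcf-dashboards            # hypothetical provider name
    folder: NVCF                     # hypothetical Grafana folder for the dashboards
    type: file
    disableDeletion: false
    options:
      # Directory Grafana scans for dashboard JSON files.
      path: /etc/grafana/provisioning/dashboards
```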
For troubleshooting common observability issues:
Metrics not appearing:
Verify the service is exposing metrics:
Check ServiceMonitor or scrape configuration:
Verify network policies allow scraping:
Check service logs for errors:
Logs not being collected:
Verify log collector DaemonSet is running:
Check collector can access pod logs:
Verify log backend is reachable:
Check for log redaction or filtering rules:
Metrics Endpoints:
Metrics endpoints should be accessed over HTTP from within the cluster only.
All sensitive log data should be redacted by the log collector. Redaction is currently the responsibility of the log collector, not the service.
Your observability backend should be properly secured with RBAC, TLS/SSL, and other security best practices.
NVCF self-hosted control-plane observability is compatible with:
For the latest compatibility information, see the release notes.