Example Dashboards Deployment
These Helm charts are provided as examples only and are not intended for production use.
The nvcf-observability-reference-stack and nvcf-example-dashboards Helm charts are
designed to help users understand useful metrics, logs, and traces, and to serve as
inspiration for creating custom dashboards tailored to their observability needs.
Important limitations:
- No security hardening
- No SSL/TLS encryption
- No authentication or authorization
- No support for production workloads
- Not supported by NVIDIA for uses beyond example and reference purposes
For production deployments, users should integrate with their own observability infrastructure following the guidance in self-hosted-observability.
Overview
This guide provides step-by-step instructions for deploying the NVCF observability reference stack and example dashboards in a local or development environment. These example components demonstrate how to collect and visualize metrics, logs, and traces from a self-hosted NVCF deployment.
The example stack includes:
- nvcf-observability-reference-stack: A reference implementation of an observability backend (Prometheus, Grafana, Loki, Tempo, and OpenTelemetry Collector)
- nvcf-example-dashboards: Pre-configured Grafana dashboards showing key NVCF control-plane metrics
Prerequisites
Before deploying the example dashboards, you need:
- A Kubernetes cluster
- Self-hosted NVCF control-plane deployed and running
- helm CLI installed
- kubectl configured to access your cluster
To view traces in the example stack, the control-plane must be deployed
with tracing enabled and configured to send OTLP traces to the
observability collector. See Tracing Configuration
for the required Helm overrides under global.observability.tracing (e.g.,
enabled, collectorEndpoint, collectorPort,
collectorProtocol).
Enabling tracing after initial deployment
If observability was not configured during the initial self-hosted NVCF cluster
deployment (i.e., global.observability.tracing.enabled was set to false),
you can enable tracing later so that the control-plane sends OTLP traces to the
deployed observability collector. Use the following steps.
1. Edit the environment file (environments/<environment-name>.yaml) to enable tracing and set collectorEndpoint to the deployed collector's service address. For the example observability reference stack, the collector runs in the observability namespace. Example:

2. Apply the configuration changes from your control-plane deployment directory. Replace <environment-name> with your environment name (e.g., eks-example).
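As an illustrative sketch, the environment override for the steps above might look like the following. The collector service address, port, and protocol shown here are assumptions based on the example stack's defaults; verify the actual service with `kubectl get svc -n observability` before applying.

```yaml
# Hypothetical environment override (environments/<environment-name>.yaml).
# The collectorEndpoint below assumes the example stack's gateway collector
# service in the observability namespace; confirm the name in your cluster.
global:
  observability:
    tracing:
      enabled: true
      collectorEndpoint: observability-gateway-collector.observability.svc.cluster.local
      collectorPort: 4317
      collectorProtocol: grpc
```

If your control-plane deployment directory uses helmfile (as the environments/ layout suggests), re-applying might look like `helmfile --environment <environment-name> apply`; substitute your own deployment tooling if it differs.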
Deployment Steps
Step 1: Install the Observability Reference Stack
Once your self-hosted NVCF control-plane is up and running, install the observability reference stack:
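A minimal install sketch is shown below. The chart artifact name, version, and release name are placeholders, not confirmed values; fetch the chart and its exact version from the nvcf-observability-reference-stack page on NGC.

```shell
# Placeholder chart artifact and release name; substitute the values from NGC.
helm install observability-reference-stack \
  nvcf-observability-reference-stack-<version>.tgz \
  --namespace observability \
  --create-namespace
```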
This will deploy:
- Prometheus for metrics collection
- Grafana for visualization
- Loki for log aggregation
- Tempo for distributed tracing
- OpenTelemetry Collector for telemetry processing
- Fluent Bit for log collection
The OTel Collector and Fluent Bit cluster-scoped config are disabled in this initial install; they are enabled in Step 2.
The Grafana installation in the observability reference stack includes the
ae3e-plotly-panel plugin for use with the example dashboards.
Verify the deployment:
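Verification can be as simple as listing the pods in the observability namespace and confirming they reach the Running state:

```shell
# All stack components (Prometheus, Grafana, Loki, Tempo, Fluent Bit)
# should appear and eventually report Running.
kubectl get pods -n observability
```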
Step 2: Enable the OTel Collector and Fluent Bit Cluster Config
The initial install does not enable the OTel Collector or the Fluent Bit
cluster-scoped config by default. A second Helm upgrade is required to set
otel-collector.enabled=true and fluentBitClusterConfig.enabled=true so
that the observability-gateway-collector service (and related gateway
services) are deployed and Fluent Bit cluster resources (ClusterFilter,
ClusterOutput, ClusterFluentBitConfig) are managed by Helm. The installation is split into two steps because the Fluent Bit CRDs must exist in the cluster before the cluster-scoped custom resources that use them can be created.
These services are needed for the control-plane to send OTLP traces to Tempo.
Run the following upgrade (use the same chart version and namespace as in Step 1):
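A sketch of the upgrade, assuming the same placeholder chart artifact and release name as in Step 1 (substitute the values you actually used):

```shell
# Re-run with the collector and Fluent Bit cluster config enabled,
# keeping all other values from the initial install.
helm upgrade observability-reference-stack \
  nvcf-observability-reference-stack-<version>.tgz \
  --namespace observability \
  --reuse-values \
  --set otel-collector.enabled=true \
  --set fluentBitClusterConfig.enabled=true
```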
Verify the OTel Collector is running:
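For example, filter the pod and service listings for the gateway collector:

```shell
# The observability-gateway-collector pod and service should appear.
kubectl get pods -n observability | grep observability-gateway-collector
kubectl get svc -n observability | grep observability-gateway
```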
You should see the observability-gateway-collector pod and the
observability-gateway-collector service (and related gateway services).
Step 3: Install the Example Dashboards
Once the observability reference stack is deployed and the OTel Collector is enabled, install the example dashboards:
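As with Step 1, the chart artifact and version below are placeholders; fetch the exact values from the nvcf-example-dashboards page on NGC.

```shell
# Placeholder chart artifact; substitute the version from NGC.
helm install nvcf-example-dashboards \
  nvcf-example-dashboards-<version>.tgz \
  --namespace observability
```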
This will configure Grafana with pre-built dashboards for:
- NVCF API
- Invocation Service
- SPOT Instance Service (SIS)
- Encrypted Secrets Service (ESS)
- Cassandra
- Vault
- Worker Pods (Utils, Init, and Inference containers)
Access Grafana:
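One way to reach Grafana locally is a port-forward. The service name below is an assumption; list the services in the observability namespace to find the actual name in your cluster.

```shell
# Confirm the Grafana service name first:
#   kubectl get svc -n observability
# Then forward it to localhost:3000 (service name is illustrative).
kubectl port-forward -n observability svc/observability-reference-stack-grafana 3000:80
```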
Then open your browser to http://localhost:3000 and log in to view the dashboards.
Step 4: Generate Dashboard Data
To populate the dashboards with meaningful data, deploy and invoke functions using your own commands or tools. The example dashboards populate automatically as your NVCF control-plane handles function requests.
Cleanup and Uninstallation
When you’re finished testing or want to remove the example observability stack, follow these steps:
Step 1: Delete Custom Resources
First, delete any custom resources created by the observability stack:
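A cautious sketch: list the cluster-scoped Fluent Bit resources before deleting anything, and delete them manually only if they are not Helm-managed in your install (see the note below). The resource kinds shown match those named in this guide.

```shell
# Inspect what exists before deleting (cluster-scoped, no namespace).
kubectl get clusterfilters,clusteroutputs,clusterfluentbitconfigs

# Only if these resources are NOT managed by your Helm release,
# delete them by name (names are illustrative):
# kubectl delete clusterfilter <name>
# kubectl delete clusteroutput <name>
# kubectl delete clusterfluentbitconfig <name>
```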
If you enabled fluentBitClusterConfig.enabled=true in Step 2 of deployment,
the Fluent Bit cluster-scoped resources (ClusterFilter, ClusterOutput,
ClusterFluentBitConfig) are managed by Helm and will be removed when you
uninstall the Helm release in the next step.
Step 2: Uninstall Helm Releases
Uninstall both Helm releases:
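The release names below are assumptions matching the install sketches earlier in this guide; confirm yours with `helm list -n observability` first.

```shell
# Remove the example dashboards, then the reference stack.
helm uninstall nvcf-example-dashboards -n observability
helm uninstall observability-reference-stack -n observability
```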
Step 3: Delete the Namespace
Finally, delete the observability namespace:
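```shell
# Removes the observability namespace and everything remaining in it.
kubectl delete namespace observability
```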
If your NVCF control-plane runs in a different namespace, take care to delete only the observability namespace, not your NVCF control-plane namespace.
Troubleshooting
Pods not starting:
Check pod status and logs:
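```shell
# List pods, then inspect events and logs for any pod that is not Running.
kubectl get pods -n observability
kubectl describe pod <pod-name> -n observability
kubectl logs <pod-name> -n observability
```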
Dashboards not showing data:
- Verify Prometheus is scraping metrics:
- Verify NVCF services are exposing metrics:
- Check that ServiceMonitors are created:
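Sketches of the three checks above. The Prometheus service name, deployment names, and metrics port are assumptions; adjust them to match your cluster.

```shell
# 1. Port-forward Prometheus and review its scrape targets
#    (service name assumes the Prometheus Operator convention):
kubectl port-forward -n observability svc/prometheus-operated 9090:9090
# Then open http://localhost:9090/targets in a browser.

# 2. Check a service's metrics endpoint from inside its pod
#    (deployment name and port are illustrative):
kubectl exec -n observability deploy/<nvcf-service> -- \
  curl -s localhost:<metrics-port>/metrics | head

# 3. List ServiceMonitors across namespaces:
kubectl get servicemonitors --all-namespaces
```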
Grafana login issues:
The default credentials are typically admin/admin or may be configured via Helm values.
Check the Helm chart documentation or values for the correct credentials.
Next Steps
After exploring the example dashboards:
- Review the metrics, logs, and traces being collected
- Identify which metrics are most relevant to your use case
- Design and implement your own production-ready observability solution
- Integrate with your existing enterprise observability platforms
- Configure alerting based on your operational requirements
For production deployments, see self-hosted-observability for guidance on integrating with your own observability infrastructure.
Related Documentation
- self-hosted-observability: Production observability configuration
- nvcf-observability-reference-stack on NGC
- nvcf-example-dashboards on NGC