Example Dashboards Deployment#

Warning

These Helm charts are provided as examples only and are not intended for production use.

The nvcf-observability-reference-stack and nvcf-example-dashboards Helm charts are designed to help users understand useful metrics, logs, and traces, and to serve as inspiration for creating custom dashboards tailored to their observability needs.

Important limitations:

  • No security hardening

  • No SSL/TLS encryption

  • No authentication or authorization

  • No support for production workloads

  • Not supported by NVIDIA for uses beyond example and reference purposes

For production deployments, users should integrate with their own observability infrastructure following the guidance in Observability Configuration.

Overview#

This guide provides step-by-step instructions for deploying the NVCF observability reference stack and example dashboards in a local or development environment. These example components demonstrate how to collect and visualize metrics, logs, and traces from a self-hosted NVCF deployment.

The example stack includes:

  • nvcf-observability-reference-stack: A reference implementation of an observability backend (Prometheus, Grafana, Loki, Tempo, and OpenTelemetry Collector)

  • nvcf-example-dashboards: Pre-configured Grafana dashboards showing key NVCF control-plane metrics

Prerequisites#

Before deploying the example dashboards, you need:

  1. A Kubernetes cluster

  2. Self-hosted NVCF control-plane deployed and running

  3. helm CLI installed

  4. kubectl configured to access your cluster

Note

To view traces in the example stack, the control-plane must be deployed with tracing enabled and configured to send OTLP traces to the observability collector. See Tracing Configuration for the required Helm overrides under global.observability.tracing (e.g., enabled, collectorEndpoint, collectorPort, collectorProtocol).

Enabling tracing after initial deployment#

If observability was not configured during the initial self-hosted NVCF cluster deployment (that is, global.observability.tracing.enabled was false), you can enable tracing later so that the control-plane sends OTLP traces to the deployed observability collector:

  1. Edit the environment file (environments/<environment-name>.yaml) to enable tracing and set collectorEndpoint to the deployed collector’s service address. For the example observability reference stack, the collector runs in the observability namespace. Example:

    global:
      observability:
        tracing:
          enabled: true
          collectorEndpoint: "otel-collector-gateway-collector.observability.svc.cluster.local"
          collectorPort: 4317
          collectorProtocol: http
    
  2. Apply the configuration changes from your control-plane deployment directory:

    HELMFILE_ENV=<environment-name> helmfile sync
    

    Replace <environment-name> with your environment name (e.g., eks-example).
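Before running the sync, you can sanity-check that the tracing keys are present in the environment file. A minimal sketch — it writes an inline copy of the example values above to a temporary file so the check is self-contained; in practice, run the grep against your real environments/<environment-name>.yaml:

```shell
# Sketch: confirm the tracing overrides are present before syncing.
# The temp file stands in for environments/<environment-name>.yaml.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
global:
  observability:
    tracing:
      enabled: true
      collectorEndpoint: "otel-collector-gateway-collector.observability.svc.cluster.local"
      collectorPort: 4317
      collectorProtocol: http
EOF
# Print the tracing-related keys for a quick visual check
grep -E 'enabled:|collector(Endpoint|Port|Protocol):' "$cfg"
rm -f "$cfg"
```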

Deployment Steps#

Step 1: Install the Observability Reference Stack#

Once your self-hosted NVCF control-plane is up and running, install the observability reference stack:

# Check NGC for the latest version of this Helm chart:
# https://registry.ngc.nvidia.com/orgs/0833294136851237/teams/nvcf-ncp-staging/containers/nvcf-observability-reference-stack/tags

helm upgrade \
  --install observability \
  oci://nvcr.io/0833294136851237/nvcf-ncp-staging/nvcf-observability-reference-stack \
  --version 1.7.0 \
  --namespace observability \
  --create-namespace

This will deploy:

  • Prometheus for metrics collection

  • Grafana for visualization

  • Loki for log aggregation

  • Tempo for distributed tracing

  • OpenTelemetry Collector for telemetry processing

  • Fluent Bit for log collection

The OTel Collector and Fluent Bit cluster-scoped config are disabled in this initial install; they are enabled in Step 2.

Note

The Grafana installation in the observability reference stack includes the ae3e-plotly-panel plugin for use with the example dashboards.

Verify the deployment:

# Check that all pods are running
kubectl get pods -n observability

# Wait for all pods to be Ready
kubectl wait --for=condition=ready pod --all -n observability --timeout=300s

Step 2: Enable the OTel Collector and Fluent Bit Cluster Config#

The initial install does not enable the OTel Collector or the Fluent Bit cluster-scoped config. The installation is split into two steps because these components are defined as custom resources whose CRDs must already exist (installed by the first step) before the resources themselves can be created. A second Helm upgrade sets otel-collector.enabled=true and fluentBitClusterConfig.enabled=true so that the observability-gateway-collector service (and related gateway services) are deployed and the Fluent Bit cluster resources (ClusterFilter, ClusterOutput, ClusterFluentBitConfig) are managed by Helm. These services are required for the control-plane to send OTLP traces to Tempo.

Run the following upgrade (use the same chart version and namespace as in Step 1):

helm upgrade observability \
  oci://nvcr.io/0833294136851237/nvcf-ncp-staging/nvcf-observability-reference-stack \
  --version 1.7.0 \
  --namespace observability \
  --reuse-values \
  --set fluentBitClusterConfig.enabled=true \
  --set otel-collector.enabled=true \
  --wait \
  --timeout 5m

Verify the OTel Collector is running:

kubectl get pods -n observability | grep observability-gateway
kubectl get svc -n observability | grep observability-gateway

You should see the observability-gateway-collector pod and the observability-gateway-collector service (and related gateway services).

Step 3: Install the Example Dashboards#

Once the observability reference stack is deployed and the OTel Collector is enabled, install the example dashboards:

# Check NGC for the latest version of this Helm chart:
# https://registry.ngc.nvidia.com/orgs/0833294136851237/teams/nvcf-ncp-staging/containers/nvcf-example-dashboards/tags

helm upgrade \
  --install nvcf-example-dashboards \
  oci://nvcr.io/0833294136851237/nvcf-ncp-staging/nvcf-example-dashboards \
  --version 1.5.0 \
  --namespace observability \
  --create-namespace

This will configure Grafana with pre-built dashboards for:

  • NVCF API

  • Invocation Service

  • SPOT Instance Service (SIS)

  • Encrypted Secrets Service (ESS)

  • Cassandra

  • Vault

  • Worker Pods (Utils, Init, and Inference containers)

Access Grafana:

# Port-forward to access Grafana UI
kubectl port-forward -n observability \
    svc/$(kubectl get svc -n observability -l app.kubernetes.io/name=grafana -o jsonpath='{.items[0].metadata.name}') \
    3000:80

Then open your browser to http://localhost:3000 and log in to view the dashboards.

Step 4: Generate Dashboard Data#

To populate the dashboards with meaningful data, deploy and invoke functions using your own commands or tools. The dashboards populate automatically as your NVCF control-plane handles function requests.
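If you want to script a quick burst of traffic, the loop below is a minimal sketch. The invocation URL is a placeholder for your own function endpoint (not a real NVCF path), and the loop only prints the curl commands; remove the echo to actually send requests:

```shell
# Sketch: print (or, with "echo" removed, send) a burst of invocations
# so the dashboards have data. URL is a placeholder for your endpoint.
URL="https://<your-invocation-endpoint>/<function-id>"
for i in $(seq 1 5); do
  echo curl -s -X POST "$URL" \
    -H "Content-Type: application/json" \
    -d "{\"input\": \"test $i\"}"
done
```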

Cleanup and Uninstallation#

When you’re finished testing or want to remove the example observability stack, follow these steps:

Step 1: Delete Custom Resources#

First, delete any custom resources created by the observability stack:

# Delete FluentBit custom resources (namespace-scoped)
kubectl delete fluentbits.fluentbit.fluent.io --all -A

# Delete OpenTelemetry Collector custom resources
kubectl delete opentelemetrycollectors.opentelemetry.io --all -A

If you enabled fluentBitClusterConfig.enabled=true in Step 2 of deployment, the Fluent Bit cluster-scoped resources (ClusterFilter, ClusterOutput, ClusterFluentBitConfig) are managed by Helm and will be removed when you uninstall the Helm release in the next step.

Step 2: Uninstall Helm Releases#

Uninstall both Helm releases:

# Uninstall example dashboards
helm uninstall nvcf-example-dashboards -n observability

# Uninstall observability reference stack
helm uninstall observability -n observability

Step 3: Delete the Namespace#

Finally, delete the observability namespace:

# Delete the namespace (this will remove any remaining resources)
kubectl delete namespace observability

Note

Deleting the observability namespace removes only the example observability stack. Take care not to delete the namespace where your NVCF control-plane is deployed.

Troubleshooting#

Pods not starting:

Check pod status and logs:

kubectl get pods -n observability
kubectl describe pod <pod-name> -n observability
kubectl logs <pod-name> -n observability

Dashboards not showing data:

  1. Verify Prometheus is scraping metrics:

    # Port-forward to Prometheus
    kubectl port-forward -n observability svc/prometheus 9090:9090
    
    # Check targets at http://localhost:9090/targets
    
  2. Verify NVCF services are exposing metrics:

    # Port-forward to an NVCF service
    kubectl port-forward -n nvcf svc/nvcf-api 8080:8080
    
    # Curl the metrics endpoint
    curl http://localhost:8080/metrics
    
  3. Check that ServiceMonitors are created:

    kubectl get servicemonitor -n nvcf
    

Grafana login issues:

The default credentials are commonly admin/admin, but they may be overridden via Helm values or stored in a Kubernetes secret by the chart. Check the observability reference stack's Helm values for the configured credentials.
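If the chart stores generated credentials in a Kubernetes secret (common for Grafana charts), they can be decoded as sketched below. The secret and key names in the comments are assumptions based on typical Grafana chart conventions, so list the secrets first to find the real names; the runnable line only demonstrates the base64 decode step against a sample value:

```shell
# Sketch: locate and decode Grafana admin credentials from a secret.
# The secret and key names are assumptions; find the real ones with:
#   kubectl get secrets -n observability | grep -i grafana
#   kubectl get secret <grafana-secret-name> -n observability \
#     -o jsonpath='{.data.admin-password}' | base64 -d
# Secret values are base64-encoded; decoding works like this:
echo 'YWRtaW4=' | base64 -d && echo
```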

Next Steps#

After exploring the example dashboards:

  1. Review the metrics, logs, and traces being collected

  2. Identify which metrics are most relevant to your use case

  3. Design and implement your own production-ready observability solution

  4. Integrate with your existing enterprise observability platforms

  5. Configure alerting based on your operational requirements

For production deployments, see Observability Configuration for guidance on integrating with your own observability infrastructure.