Example Dashboards Deployment#

Warning

These Helm charts are provided as examples only and are not intended for production use.

The nvcf-observability-reference-stack and nvcf-example-dashboards Helm charts are designed to help users understand useful metrics, logs, and traces, and to serve as inspiration for creating custom dashboards tailored to their observability needs.

Important limitations:

  • No security hardening

  • No SSL/TLS encryption

  • No authentication or authorization

  • No support for production workloads

  • Not supported by NVIDIA for uses beyond example and reference purposes

For production deployments, users should integrate with their own observability infrastructure following the guidance in Observability Configuration.

Overview#

This guide provides step-by-step instructions for deploying the NVCF observability reference stack and example dashboards in a local or development environment. These example components demonstrate how to collect and visualize metrics, logs, and traces from a self-hosted NVCF deployment.

The example stack includes:

  • nvcf-observability-reference-stack: A reference implementation of an observability backend (Prometheus, Grafana, Loki, Tempo, and OpenTelemetry Collector)

  • nvcf-example-dashboards: Pre-configured Grafana dashboards showing key NVCF control-plane metrics

Prerequisites#

Before deploying the example dashboards, you need:

  1. A Kubernetes cluster

  2. Self-hosted NVCF control-plane deployed and running

  3. helm CLI installed

  4. kubectl configured to access your cluster

Note

To view traces in the example stack, the control-plane must be deployed with tracing enabled and configured to send OTLP traces to the observability collector. See Tracing Configuration for the required Helm overrides under global.observability.tracing (e.g., enabled, collectorEndpoint, collectorPort, collectorProtocol).

Enabling tracing after initial deployment#

If observability was not configured during the initial self-hosted NVCF cluster deployment (that is, global.observability.tracing.enabled was false), you can enable tracing later so that the control-plane sends OTLP traces to the deployed observability collector:

  1. Edit the environment file (environments/<environment-name>.yaml) to enable tracing and set collectorEndpoint to the deployed collector’s service address. For the example observability reference stack, the collector runs in the observability namespace. Example:

    global:
      observability:
        tracing:
          enabled: true
          collectorEndpoint: "otel-collector-gateway-collector.observability.svc.cluster.local"
          collectorPort: 4317
          collectorProtocol: http
    
  2. Apply the configuration changes from your control-plane deployment directory:

    HELMFILE_ENV=<environment-name> helmfile sync
    

    Replace <environment-name> with your environment name (e.g., eks-example).
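Before running the sync, you can sanity-check that the tracing keys are present in the environment file. A minimal sketch — it writes an inline copy of the example values above to a temporary file so the check is self-contained; in practice, run the grep against your real environments/<environment-name>.yaml:

```shell
# Sketch: confirm the tracing overrides are present before syncing.
# The temp file stands in for environments/<environment-name>.yaml.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
global:
  observability:
    tracing:
      enabled: true
      collectorEndpoint: "otel-collector-gateway-collector.observability.svc.cluster.local"
      collectorPort: 4317
      collectorProtocol: http
EOF
# Print the tracing-related keys for a quick visual check
grep -E 'enabled:|collector(Endpoint|Port|Protocol):' "$cfg"
rm -f "$cfg"
```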

Deployment Steps#

Step 1: Install the Observability Reference Stack#

Once your self-hosted NVCF control-plane is up and running, install the observability reference stack:

# Check NGC for the latest version of this Helm chart:
# https://registry.ngc.nvidia.com/orgs/0833294136851237/teams/nvcf-ncp-staging/containers/nvcf-observability-reference-stack/tags

helm upgrade \
  --install observability \
  oci://nvcr.io/0833294136851237/nvcf-ncp-staging/nvcf-observability-reference-stack \
  --version 1.7.0 \
  --namespace observability \
  --create-namespace

This will deploy:

  • Prometheus for metrics collection

  • Grafana for visualization

  • Loki for log aggregation

  • Tempo for distributed tracing

  • OpenTelemetry Collector for telemetry processing

  • Fluent Bit for log collection

The OTel Collector and Fluent Bit cluster-scoped config are disabled in this initial install; they are enabled in Step 2.

Note

The Grafana installation in the observability reference stack includes the ae3e-plotly-panel plugin for use with the example dashboards.

Verify the deployment:

# Check that all pods are running
kubectl get pods -n observability

# Wait for all pods to be Ready
kubectl wait --for=condition=ready pod --all -n observability --timeout=300s

Step 2: Enable the OTel Collector and Fluent Bit Cluster Config#

The initial install does not enable the OTel Collector or the Fluent Bit cluster-scoped config. The installation is split into two steps because these components are defined as custom resources whose CRDs must already exist (installed by the first step) before the resources themselves can be created. A second Helm upgrade sets otel-collector.enabled=true and fluentBitClusterConfig.enabled=true so that the observability-gateway-collector service (and related gateway services) are deployed and the Fluent Bit cluster resources (ClusterFilter, ClusterOutput, ClusterFluentBitConfig) are managed by Helm. These services are required for the control-plane to send OTLP traces to Tempo.

Run the following upgrade (use the same chart version and namespace as in Step 1):

helm upgrade observability \
  oci://nvcr.io/0833294136851237/nvcf-ncp-staging/nvcf-observability-reference-stack \
  --version 1.7.0 \
  --namespace observability \
  --reuse-values \
  --set fluentBitClusterConfig.enabled=true \
  --set otel-collector.enabled=true \
  --wait \
  --timeout 5m

Verify the OTel Collector is running:

kubectl get pods -n observability | grep observability-gateway
kubectl get svc -n observability | grep observability-gateway

You should see the observability-gateway-collector pod and the observability-gateway-collector service (and related gateway services).

Step 3: Install the Example Dashboards#

Once the observability reference stack is deployed and the OTel Collector is enabled, install the example dashboards:

# Check NGC for the latest version of this Helm chart:
# https://registry.ngc.nvidia.com/orgs/0833294136851237/teams/nvcf-ncp-staging/containers/nvcf-example-dashboards/tags

helm upgrade \
  --install nvcf-example-dashboards \
  oci://nvcr.io/0833294136851237/nvcf-ncp-staging/nvcf-example-dashboards \
  --version 1.5.0 \
  --namespace observability \
  --create-namespace

This will configure Grafana with pre-built dashboards for:

  • NVCF API

  • Invocation Service

  • SPOT Instance Service (SIS)

  • Encrypted Secrets Service (ESS)

  • Cassandra

  • Vault

  • Worker Pods (Utils, Init, and Inference containers)

Access Grafana:

# Port-forward to access Grafana UI
kubectl port-forward -n observability \
    svc/$(kubectl get svc -n observability -l app.kubernetes.io/name=grafana -o jsonpath='{.items[0].metadata.name}') \
    3000:80

Then open your browser to http://localhost:3000 and log in to view the dashboards.

Step 4: Generate Dashboard Data#

To populate the dashboards with meaningful data, deploy and invoke functions using your own commands or tools. The dashboards populate automatically as your NVCF control-plane handles function requests.
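If you want to script a quick burst of traffic, the loop below is a minimal sketch. The invocation URL is a placeholder for your own function endpoint (not a real NVCF path), and the loop only prints the curl commands; remove the echo to actually send requests:

```shell
# Sketch: print (or, with "echo" removed, send) a burst of invocations
# so the dashboards have data. URL is a placeholder for your endpoint.
URL="https://<your-invocation-endpoint>/<function-id>"
for i in $(seq 1 5); do
  echo curl -s -X POST "$URL" \
    -H "Content-Type: application/json" \
    -d "{\"input\": \"test $i\"}"
done
```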

Cleanup and Uninstallation#

When you’re finished testing or want to remove the example observability stack, follow these steps:

Step 1: Delete Custom Resources#

First, delete any custom resources created by the observability stack:

# Delete FluentBit custom resources (namespace-scoped)
kubectl delete fluentbits.fluentbit.fluent.io --all -A

# Delete OpenTelemetry Collector custom resources
kubectl delete opentelemetrycollectors.opentelemetry.io --all -A

If you enabled fluentBitClusterConfig.enabled=true in Step 2 of deployment, the Fluent Bit cluster-scoped resources (ClusterFilter, ClusterOutput, ClusterFluentBitConfig) are managed by Helm and will be removed when you uninstall the Helm release in the next step.

Step 2: Uninstall Helm Releases#

Uninstall both Helm releases:

# Uninstall example dashboards
helm uninstall nvcf-example-dashboards -n observability

# Uninstall observability reference stack
helm uninstall observability -n observability

Step 3: Delete the Namespace#

Finally, delete the observability namespace:

# Delete the namespace (this will remove any remaining resources)
kubectl delete namespace observability

Note

Deleting the observability namespace removes only the example observability stack. Take care not to delete the namespace where your NVCF control-plane is deployed.

Troubleshooting#

Pods not starting:

Check pod status and logs:

kubectl get pods -n observability
kubectl describe pod <pod-name> -n observability
kubectl logs <pod-name> -n observability

Dashboards not showing data:

  1. Verify Prometheus is scraping metrics:

    # Port-forward to Prometheus
    kubectl port-forward -n observability svc/prometheus 9090:9090
    
    # Check targets at http://localhost:9090/targets
    
  2. Verify NVCF services are exposing metrics:

    # Port-forward to an NVCF service
    kubectl port-forward -n nvcf svc/nvcf-api 8080:8080
    
    # Curl the metrics endpoint
    curl http://localhost:8080/metrics
    
  3. Check that ServiceMonitors are created:

    kubectl get servicemonitor -n nvcf
    

Grafana login issues:

The default credentials are commonly admin/admin, but they may be overridden via Helm values or stored in a Kubernetes secret by the chart. Check the observability reference stack's Helm values for the configured credentials.
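If the chart stores generated credentials in a Kubernetes secret (common for Grafana charts), they can be decoded as sketched below. The secret and key names in the comments are assumptions based on typical Grafana chart conventions, so list the secrets first to find the real names; the runnable line only demonstrates the base64 decode step against a sample value:

```shell
# Sketch: locate and decode Grafana admin credentials from a secret.
# The secret and key names are assumptions; find the real ones with:
#   kubectl get secrets -n observability | grep -i grafana
#   kubectl get secret <grafana-secret-name> -n observability \
#     -o jsonpath='{.data.admin-password}' | base64 -d
# Secret values are base64-encoded; decoding works like this:
echo 'YWRtaW4=' | base64 -d && echo
```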

Next Steps#

After exploring the example dashboards:

  1. Review the metrics, logs, and traces being collected

  2. Identify which metrics are most relevant to your use case

  3. Design and implement your own production-ready observability solution

  4. Integrate with your existing enterprise observability platforms

  5. Configure alerting based on your operational requirements

For production deployments, see Observability Configuration for guidance on integrating with your own observability infrastructure.