Observability Stack Configuration#
Configure storage and retention for Prometheus#
Create a new YAML file on the head node for this setup:
```shell
cat << EOF > prom_config.yaml
prometheus:
  prometheusSpec:
    retention: 90d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: local-path
          resources:
            requests:
              storage: 300Gi
EOF
```
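The 300Gi request should be sized against expected ingestion. As a rough back-of-the-envelope check (the 20,000 samples/s rate below is an assumed figure, not a measurement from this cluster; Prometheus reports the actual rate via the prometheus_tsdb_head_samples_appended_total counter), on-disk usage is approximately ingestion rate x retention x bytes per sample (roughly 2 bytes per sample after TSDB compression):

```shell
# Assumed sustained ingestion rate in samples/s -- replace with your cluster's value.
rate=20000
retention_days=90
bytes_per_sample=2   # typical compressed TSDB sample size

# bytes = rate * seconds * bytes/sample, converted to GiB
gib=$(( rate * retention_days * 86400 * bytes_per_sample / 1024 / 1024 / 1024 ))
echo "~${gib} GiB of TSDB storage needed"
```

At roughly this rate, a 90-day window (~289 GiB) fits within the 300Gi volume; a busier cluster needs proportionally more storage or a shorter retention.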
Install/upgrade CRDs:
```shell
helm pull oci://master.cm.cluster:5000/helm-charts/kube-prometheus-stack --untar \
  --ca-file /cm/local/apps/containerd/var/etc/certs.d/master.cm.cluster:5000/ca.crt
kubectl apply -f kube-prometheus-stack/charts/crds/crds/ --server-side
```
With the configuration file in place, apply these values to the K8s cluster:
```shell
helm upgrade kube-prometheus-stack ./kube-prometheus-stack \
  -n prometheus -f prom_config.yaml --reuse-values
```
Configure storage and retention for Loki#
This requires redeploying the Loki Helm chart, because the new storage size must be set at deployment time and cannot be resized live.
Create a new YAML file on the head node for this setup:
```shell
cat << EOF > loki_config.yaml
global:
  image:
    registry: master.cm.cluster:5000
memcached:
  image:
    repository: library/memcached
backend:
  replicas: 3
chunksCache:
  writebackSizeLimit: 500MB
commonConfig:
  replication_factor: 3
compactor:
  replicas: 0
deploymentMode: SimpleScalable
distributor:
  maxUnavailable: 0
  replicas: 0
indexGateway:
  maxUnavailable: 0
  replicas: 0
ingester:
  replicas: 0
loki:
  limits_config:
    ingestion_rate_mb: 75
    ingestion_burst_size_mb: 150
    per_stream_rate_limit: 16MB
    per_stream_rate_limit_burst: 64MB
    retention_period: 90d
  auth_enabled: false
  commonConfig:
    replication_factor: 3
  schemaConfig:
    configs:
      - from: '2025-09-04'
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v13
        store: tsdb
  storage:
    type: s3
    bucketNames:
      chunks: loki-chunks-bucket
      ruler: loki-ruler-bucket
      admin: loki-admin-bucket
minio:
  enabled: true
  persistence:
    size: 500Gi
  image:
    repository: master.cm.cluster:5000/minio/minio
  mcImage:
    repository: master.cm.cluster:5000/minio/mc
querier:
  maxUnavailable: 0
  replicas: 0
queryFrontend:
  maxUnavailable: 0
  replicas: 0
queryScheduler:
  replicas: 0
read:
  replicas: 3
singleBinary:
  replicas: 0
write:
  replicas: 3
EOF
```
Uninstall the existing Helm chart:
```shell
helm uninstall loki -n loki
```
Delete any lingering PVCs:
```shell
kubectl delete pvc --all -n loki
```
Install the Helm chart for Loki, using the new values:
```shell
helm install loki oci://master.cm.cluster:5000/helm-charts/loki --ca-file \
  /cm/local/apps/containerd/var/etc/certs.d/master.cm.cluster:5000/ca.crt \
  -n loki -f loki_config.yaml
```
Configure Observability#
Follow the instructions in the following sections on the nmc-observability-stack-configuration page:
Install NMC Grafana dashboards#
As part of NVIDIA Mission Control, a set of Grafana dashboards is provided that consolidates monitoring data from across the cluster. These dashboards help track ongoing operations and support troubleshooting, and they serve as a starting point for visualizing infrastructure telemetry. You are encouraged to modify or extend them based on the facility’s capabilities.
Prerequisites#
To take advantage of all dashboard features, install and configure the Infinity plugin and data sources to process data from the BCM REST API and UFM REST API.
Create a user to access the BCM REST API. Create user apiuser with password apiuserpassword with a read-only BCM profile:
```shell
cmsh -c "user; add apiuser; set password apiuserpassword; set profile readonly; commit"
```

Generate a Kubernetes secret for the registry CA certificate so Grafana can access the local registry to install plugins:
```shell
kubectl create secret generic registry-ca -n prometheus --from-file \
  ca.crt=/cm/local/apps/containerd/var/etc/certs.d/master.cm.cluster:5000/ca.crt
```
Save the Grafana plugin upload script to a file called push-plugins.sh in the current directory:
```shell
#!/usr/bin/env bash
set -euo pipefail

PROG="$(basename "$0")"
die() { echo "$PROG: error: $*" >&2; exit 1; }
log() { echo "$PROG: $*" >&2; }

REGISTRY="master.cm.cluster:5000"
SCHEME="https"
CACERT="/cm/local/apps/containerd/var/etc/certs.d/master.cm.cluster:5000/ca.crt"
PLUGIN_PATTERNS=("grafana-metricsdrilldown-app" "grafana-lokiexplore-app" "yesoreyeram-infinity-datasource")
VERBOSE="${VERBOSE:-0}"
BUNDLE=""

while [[ $# -gt 0 ]]; do
  case "$1" in
    --bundle) BUNDLE="$2"; shift 2 ;;
    -v) VERBOSE=1; shift ;;
    -h|--help) echo "Usage: $PROG --bundle <dir> <repository>" >&2; exit 0 ;;
    --) shift; break ;;
    -*) die "Unknown option: $1" ;;
    *) break ;;
  esac
done

[[ $# -eq 1 ]] || { echo "Usage: $PROG --bundle <dir> <repository>" >&2; exit 1; }
[[ -n "$BUNDLE" ]] || die "--bundle is required"
REPOSITORY="$1"
[[ -d "$BUNDLE" ]] || die "Bundle directory not found: $BUNDLE"
[[ -d "$BUNDLE/files" ]] || die "No files/ subdirectory in bundle: $BUNDLE"
[[ -f "$CACERT" ]] || die "CA certificate not found: $CACERT"

sha256_file() { command -v sha256sum &>/dev/null && sha256sum "$1" | awk '{print $1}' || shasum -a 256 "$1" | awk '{print $1}'; }
sha256_str() { command -v sha256sum &>/dev/null && printf '%s' "$1" | sha256sum | awk '{print $1}' || printf '%s' "$1" | shasum -a 256 | awk '{print $1}'; }
file_size() { stat -c%s "$1" 2>/dev/null || stat -f%z "$1"; }

TMPDIR_WORK="$(mktemp -d)"
trap 'rm -rf "$TMPDIR_WORK"' EXIT

_curl() { curl -s --cacert "$CACERT" -o "$TMPDIR_WORK/response_body" -D "$TMPDIR_WORK/response_headers" -w "%{http_code}" "$@" 2>/dev/null; }
_hdr() { grep -i "^${1}:" "$TMPDIR_WORK/response_headers" 2>/dev/null | head -1 | sed 's/^[^:]*:\s*//' | tr -d '\r'; }
_api() {
  local m="$1" u="$2"; shift 2
  local c; c=$(_curl -X "$m" "$@" "$u") || c="000"
  [[ "${VERBOSE:-0}" == "1" ]] && echo "$PROG: [debug] $m $u → $c" >&2
  echo "$c"
}

blob_exists() { [[ "$(_api HEAD "${SCHEME}://${REGISTRY}/v2/${REPOSITORY}/blobs/$1")" == "200" ]]; }

push_blob_file() {
  local file="$1" digest="$2" size="$3" label="${4:-blob}"
  if blob_exists "$digest"; then log "$label already present, skipping"; return 0; fi
  log "Uploading $label (${size} bytes)..."
  local code loc
  code=$(_api POST "${SCHEME}://${REGISTRY}/v2/${REPOSITORY}/blobs/uploads/")
  [[ "$code" == "202" ]] || die "Failed to initiate blob upload (HTTP $code)"
  loc=$(_hdr "Location")
  [[ -n "$loc" ]] || die "No Location header in upload response"
  [[ "$loc" == http* ]] || loc="${SCHEME}://${REGISTRY}${loc}"
  local sep="?"; [[ "$loc" == *"?"* ]] && sep="&"
  code=$(_curl -X PUT -H "Content-Type: application/octet-stream" \
    -H "Content-Length: ${size}" --data-binary "@${file}" \
    "${loc}${sep}digest=${digest}") || code="000"
  [[ "$code" == "201" ]] || die "Blob upload failed (HTTP $code): $(cat "$TMPDIR_WORK/response_body" 2>/dev/null)"
  log "$label uploaded"
}

push_blob_str() {
  blob_exists "$2" && return 0
  printf '%s' "$1" > "$TMPDIR_WORK/blob_str"
  push_blob_file "$TMPDIR_WORK/blob_str" "$2" "$3" "config blob"
}

push_manifest() {
  printf '%s' "$1" > "$TMPDIR_WORK/manifest.json"
  local code
  code=$(_curl -X PUT -H "Content-Type: application/vnd.oci.image.manifest.v1+json" \
    --data-binary "@$TMPDIR_WORK/manifest.json" \
    "${SCHEME}://${REGISTRY}/v2/${REPOSITORY}/manifests/$2") || code="000"
  [[ "$code" == "201" ]] || die "Manifest push failed (HTTP $code): $(cat "$TMPDIR_WORK/response_body" 2>/dev/null)"
}

push_plugin() {
  local zipfile="$1" plugin_name="$2"
  local filename; filename="$(basename "$zipfile")"
  local tag="${filename%.zip}"; tag="${tag//+/-}"
  log "=== ${filename} ==="
  local layer_digest="sha256:$(sha256_file "$zipfile")"
  local layer_size; layer_size=$(file_size "$zipfile")
  local cfg='{}' cfg_digest cfg_size
  cfg_digest="sha256:$(sha256_str "$cfg")"
  cfg_size=${#cfg}
  local manifest
  manifest="{\"schemaVersion\":2,\"mediaType\":\"application/vnd.oci.image.manifest.v1+json\",\"artifactType\":\"application/vnd.oci.artifact.v1\",\"config\":{\"mediaType\":\"application/vnd.oci.empty.v1+json\",\"digest\":\"${cfg_digest}\",\"size\":${cfg_size}},\"layers\":[{\"mediaType\":\"application/zip\",\"digest\":\"${layer_digest}\",\"size\":${layer_size},\"annotations\":{\"org.opencontainers.image.title\":\"${filename}\"}}]}"
  push_blob_str "$cfg" "$cfg_digest" "$cfg_size"
  push_blob_file "$zipfile" "$layer_digest" "$layer_size" "$filename"
  push_manifest "$manifest" "$tag"
  echo "${SCHEME}://${REGISTRY}/v2/${REPOSITORY}/blobs/${layer_digest};${plugin_name}"
}

command -v curl &>/dev/null || die "curl is required"
log "Scanning ${BUNDLE}/files/ for matching plugins..."
urls=()
while IFS= read -r -d '' zipfile; do
  name="$(basename "$zipfile")"
  matched=""
  for p in "${PLUGIN_PATTERNS[@]}"; do
    [[ "$name" == *"${p}"* ]] && matched="$p" && break
  done
  [[ -n "$matched" ]] && urls+=("$(push_plugin "$zipfile" "$matched")")
done < <(find "$BUNDLE/files" -maxdepth 1 -name "*.zip" -print0 | sort -z)

[[ ${#urls[@]} -gt 0 ]] || die "No matching plugin zips found in ${BUNDLE}/files/"
log "Done. Pushed ${#urls[@]} plugin(s)."

echo "grafana:" > grafana_plugins.yaml
echo "  plugins:" >> grafana_plugins.yaml
for url in "${urls[@]}"; do echo "    - ${url}" >> grafana_plugins.yaml; done
```
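For reference, grafana_plugins.yaml uses Grafana's url;name plugin-install convention, with one entry per pushed plugin. The snippet below reproduces just the file-emission step from the script with a made-up placeholder digest (sha256:0000), so you can see the file shape without pushing anything to the registry:

```shell
# Placeholder entry -- a real run emits one URL per pushed plugin,
# with the actual layer digest returned by the registry.
urls=("https://master.cm.cluster:5000/v2/grafana-plugins/blobs/sha256:0000;yesoreyeram-infinity-datasource")

out=$(mktemp)
echo "grafana:" > "$out"
echo "  plugins:" >> "$out"
for url in "${urls[@]}"; do echo "    - ${url}" >> "$out"; done
cat "$out"
```

Grafana downloads each URL at startup and installs the zip under the name after the semicolon.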
Run the plugin upload script
Note
These steps assume that the bundle is present in the current directory. If it is elsewhere, update the --bundle argument to the correct path.

```shell
chmod +x push-plugins.sh && ./push-plugins.sh --bundle ./bundle grafana-plugins
```
The script will generate a file called grafana_plugins.yaml, which will be used in a later step. If successful, the script’s output should look like the following:

```
push-plugins.sh: Scanning ./bundle/files/ for matching plugins...
push-plugins.sh: === grafana-lokiexplore-app-1.0.40.zip ===
push-plugins.sh: Uploading config blob (2 bytes)...
push-plugins.sh: config blob uploaded
push-plugins.sh: Uploading grafana-lokiexplore-app-1.0.40.zip (10480908 bytes)...
push-plugins.sh: grafana-lokiexplore-app-1.0.40.zip uploaded
push-plugins.sh: === grafana-metricsdrilldown-app-1.0.34.zip ===
push-plugins.sh: Uploading config blob (2 bytes)...
push-plugins.sh: config blob uploaded
push-plugins.sh: Uploading grafana-metricsdrilldown-app-1.0.34.zip (4146320 bytes)...
push-plugins.sh: grafana-metricsdrilldown-app-1.0.34.zip uploaded
push-plugins.sh: === yesoreyeram-infinity-datasource-3.7.4.zip ===
push-plugins.sh: Uploading config blob (2 bytes)...
push-plugins.sh: config blob uploaded
push-plugins.sh: Uploading yesoreyeram-infinity-datasource-3.7.4.zip (81446703 bytes)...
push-plugins.sh: yesoreyeram-infinity-datasource-3.7.4.zip uploaded
push-plugins.sh: Done. Pushed 3 plugin(s).
```
Create a new file with the following values. You will need credentials for accessing the UFM REST API:
```shell
cat << EOF > grafana_infinity.yaml
grafana:
  enabled: true
  extraSecretMounts:
    - name: registry-ca
      mountPath: /etc/ssl/certs/ca-cert-registry.crt
      secretName: registry-ca
      readOnly: true
      optional: false
      subPath: ca.crt
  additionalDataSources:
    - name: BCM API
      type: yesoreyeram-infinity-datasource
      access: proxy
      isDefault: false
      basicAuth: true
      basicAuthUser: apiuser
      jsonData:
        auth_method: 'basicAuth'
        allowedHosts:
          - https://master
        tlsSkipVerify: true
        timeoutInSeconds: 60
      secureJsonData:
        basicAuthPassword: apiuserpassword
    - name: UFM API
      type: yesoreyeram-infinity-datasource
      access: proxy
      isDefault: false
      basicAuth: true
      basicAuthUser: UFM_USER
      jsonData:
        auth_method: 'basicAuth'
        allowedHosts:
          - https://master
        tlsSkipVerify: true
        timeoutInSeconds: 60
      secureJsonData:
        basicAuthPassword: UFM_PASSWORD
EOF
```
Run the following command to apply the values:
```shell
helm upgrade kube-prometheus-stack ./kube-prometheus-stack -n prometheus \
  --reuse-values -f grafana_infinity.yaml -f grafana_plugins.yaml
```
To access the UFM REST API, a port-forwarding rule must also be installed within BCM. To configure it, gather the IP address of the UFM appliance and substitute it for UFM_IP_ADDRESS in the following command:
```shell
cmsh -c "device; portforward create UFM_IP_ADDRESS 443 11443"
```

This should be repeated on your secondary BCM head node, if you have one configured.
Note
The port-forwarding rules are not persisted on restarts of the BCM daemon process. If your head node reboots or the daemon restarts for any reason, you will need to rerun this port-forwarding sequence.
Installation#
Note
These steps assume that the bundle is present in the current directory.
Upload the .tgz file to the cluster head node.
Update values to enable folders within Grafana:
```shell
helm upgrade -n prometheus kube-prometheus-stack ./kube-prometheus-stack \
  --set grafana.sidecar.dashboards.provider.foldersFromFilesStructure=true \
  --set grafana.sidecar.dashboards.enabled=true --reuse-values
```
Run the following command, setting dgx_system to match your system type (b200, gb200, b300, or gb300) and updating the filename if needed:

```shell
helm upgrade --install -n prometheus nmc-dashboards \
  ./bundle/helm/nmc-grafana-dashboards-27.3.2.tgz --set dgx_system=b200
```
Accessing dashboards#
In a browser, open:
```
https://<headnode>/grafana
```

Log in using the default credentials. Get the Grafana ‘admin’ user password by running:
```shell
kubectl get secret -n prometheus kube-prometheus-stack-grafana \
  -o jsonpath="{.data.admin-password}" | base64 -d ; echo
```
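The jsonpath plus base64 -d pattern above works for reading any Kubernetes secret, since secret data is stored base64-encoded. As a purely local illustration of the decoding step (the encoded string here is a made-up value, not a real password):

```shell
# Kubernetes stores secret values base64-encoded; decoding is all the
# kubectl pipeline above adds on top of fetching the field.
printf 'c3VwZXJzZWNyZXQ=' | base64 -d ; echo
```

This prints supersecret; the trailing echo just adds a newline so the value doesn't run into your shell prompt.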
Navigate to Dashboards in the Grafana UI.
Deployed dashboards will be denoted with a number, for example:
01. SuperPOD Overall Dashboard
Customization and BMS integration#
The dashboards included in this Helm chart are provided as examples. They are fully functional out of the box but are intended to be customized to match the customer’s specific cluster environment.
One dashboard, BMS View of Cluster, is configured to use NVIDIA Cronus as the Building Management System (BMS) data source. It displays power, liquid cooling, environmental, and other facility-level metrics.
Most customers will need to:
Integrate their own BMS with NVIDIA BCM.
Update the dashboard’s data source and queries to reflect their own BMS setup.
These dashboards serve as a starting point for visualizing infrastructure telemetry. The customer is encouraged to modify or extend them based on the facility’s capabilities and the telemetry available. Some dashboards illustrate what filtering by rack can look like; you may need to customize the filters to fit your environment.
Troubleshooting#
Error: Could not load plugin#
On rare occasions, some dashboards may fail to load properly and display an error:
BCM|UFM API plugin failed
Error: Could not load plugin
The fix for this is to scale the Grafana deployment down to 0 replicas and back up again.
```shell
kubectl scale deployment kube-prometheus-stack-grafana -n prometheus --replicas=0
sleep 30
kubectl scale deployment kube-prometheus-stack-grafana -n prometheus --replicas=1
```
This will restart the Grafana deployment and the dashboards should load properly.