Observability Stack Configuration#

Configure storage and retention for Prometheus#

Create a new YAML file on the head node for this setup:

cat << EOF > prom_config.yaml
prometheus:
  prometheusSpec:
    retention: 90d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: local-path
          resources:
            requests:
              storage: 300Gi
EOF
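
As a rough sanity check on whether 300Gi fits 90 days of retention, the Prometheus storage documentation gives the approximation: needed disk space = retention time in seconds x ingested samples per second x bytes per sample, at roughly 1 to 2 bytes per sample. The sketch below applies that formula. The function name and the ingestion rate of 20000 samples per second are hypothetical values chosen for illustration, not measurements from this cluster.

```shell
# Rough Prometheus capacity estimate (a sketch, not a sizing tool).
# Formula from the Prometheus storage documentation:
#   disk ~= retention_seconds * samples_per_second * bytes_per_sample
# ASSUMPTION: the samples-per-second value used below is a made-up example.
estimate_prometheus_gib() {
    retention_days="$1"
    samples_per_second="$2"
    bytes_per_sample="${3:-2}"   # the docs quote roughly 1-2 bytes per sample
    retention_seconds=$((retention_days * 86400))
    total_bytes=$((retention_seconds * samples_per_second * bytes_per_sample))
    # Integer division truncates; that is precise enough for a rough estimate.
    echo $((total_bytes / 1024 / 1024 / 1024))
}

estimate_prometheus_gib 90 20000
```

With these example inputs the function prints 289 (GiB), which is just under the 300Gi request. To use a real ingestion rate, one option is to query Prometheus for something like rate(prometheus_tsdb_head_samples_appended_total[1h]) and pass the result as the second argument.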

Install/upgrade CRDs:

helm pull oci://master.cm.cluster:5000/helm-charts/kube-prometheus-stack --untar \
    --ca-file /cm/local/apps/containerd/var/etc/certs.d/master.cm.cluster:5000/ca.crt
kubectl apply -f kube-prometheus-stack/charts/crds/crds/ --server-side

With the configuration file in place, apply these values to the K8s cluster:

helm upgrade kube-prometheus-stack ./kube-prometheus-stack \
    -n prometheus -f prom_config.yaml --reuse-values
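
After the upgrade, it can help to confirm that the new retention and volume claim were actually picked up. The following is a minimal sketch; the function name is hypothetical, and it assumes the release manages exactly one Prometheus custom resource in the prometheus namespace.

```shell
# Sketch: show the retention and PVCs that are live after the upgrade.
# ASSUMPTION: there is exactly one Prometheus custom resource in the
# namespace, so .items[0] is the one that was just upgraded.
show_prometheus_storage() {
    namespace="${1:-prometheus}"
    echo "retention: $(kubectl get prometheus -n "$namespace" \
        -o jsonpath='{.items[0].spec.retention}')"
    # Lists each PVC with its status, capacity, and storage class.
    kubectl get pvc -n "$namespace"
}
```

Run show_prometheus_storage on the head node and check that the retention reads 90d and that a claim using the local-path storage class is Bound.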

Configure storage and retention for Loki#

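Changing the storage size requires redeploying the Loki Helm chart, because the existing volumes cannot be resized in place. The steps below uninstall the chart and delete all PVCs in the loki namespace, so previously stored logs may be lost.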

Create a new YAML file on the head node for this setup:

cat << EOF > loki_config.yaml
global:
  image:
    registry: master.cm.cluster:5000
memcached:
  image:
    repository: library/memcached
backend:
  replicas: 3
chunksCache:
  writebackSizeLimit: 500MB
commonConfig:
  replication_factor: 3
compactor:
  replicas: 0
deploymentMode: SimpleScalable
distributor:
  maxUnavailable: 0
  replicas: 0
indexGateway:
  maxUnavailable: 0
  replicas: 0
ingester:
  replicas: 0
loki:
  limits_config:
    ingestion_rate_mb: 75
    ingestion_burst_size_mb: 150
    per_stream_rate_limit: 16MB
    per_stream_rate_limit_burst: 64MB
    retention_period: 90d
  auth_enabled: false
  commonConfig:
    replication_factor: 3
  schemaConfig:
    configs:
    - from: '2025-09-04'
      index:
        period: 24h
        prefix: loki_index_
      object_store: s3
      schema: v13
      store: tsdb
  storage:
    type: s3
    bucketNames:
      chunks: loki-chunks-bucket
      ruler: loki-ruler-bucket
      admin: loki-admin-bucket
minio:
  enabled: true
  persistence:
    size: 500Gi
  image:
    repository: master.cm.cluster:5000/minio/minio
  mcImage:
    repository: master.cm.cluster:5000/minio/mc
querier:
  maxUnavailable: 0
  replicas: 0
queryFrontend:
  maxUnavailable: 0
  replicas: 0
queryScheduler:
  replicas: 0
read:
  replicas: 3
singleBinary:
  replicas: 0
write:
  replicas: 3
EOF

Uninstall the existing Helm chart:
```
helm uninstall loki -n loki
```
Delete any lingering PVCs:
```
kubectl delete pvc --all -n loki
```

Install the Helm chart for Loki, using the new values:

helm install loki oci://master.cm.cluster:5000/helm-charts/loki --ca-file \
    /cm/local/apps/containerd/var/etc/certs.d/master.cm.cluster:5000/ca.crt \
    -n loki -f loki_config.yaml
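
Because the chart is reinstalled from scratch, it may take a few minutes before Loki accepts logs again. The sketch below waits for the main workloads and then lists the new PVCs. The function name is hypothetical, and the workload names (loki-write, loki-backend, loki-read) are assumptions based on a release called loki in SimpleScalable mode; confirm them with kubectl get statefulset,deployment -n loki before relying on this.

```shell
# Sketch: wait for the Loki workloads to finish rolling out.
# ASSUMPTION: the workload names below match what the chart created
# for a release named "loki"; adjust them if your cluster differs.
wait_for_loki() {
    namespace="${1:-loki}"
    for workload in statefulset/loki-write statefulset/loki-backend deployment/loki-read; do
        kubectl rollout status "$workload" -n "$namespace" --timeout=600s || return 1
    done
    # Show the freshly created claims, including the MinIO volume.
    kubectl get pvc -n "$namespace"
}
```

Calling wait_for_loki returns a non-zero status if any workload fails to become ready within the timeout.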

Configure Observability#

Follow the instructions in the following sections on the nmc-observability-stack-configuration page:

Install NMC Grafana dashboards#

NVIDIA Mission Control provides a set of Grafana dashboards that consolidate monitoring data across the cluster. These dashboards help track ongoing operations and support troubleshooting, and they serve as a starting point for visualizing infrastructure telemetry. You are encouraged to modify or extend them based on the facility’s capabilities.

Prerequisites#

To take advantage of all dashboard features, install and configure the Infinity plugin and data sources to process data from the BCM REST API and UFM REST API.

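Create a user for accessing the BCM REST API. The following command creates the user apiuser with the password apiuserpassword and the read-only BCM profile. If you choose a different password, use the same value for basicAuthPassword in the BCM API data source configuration: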
```
cmsh -c "user; add apiuser; set password apiuserpassword; set profile readonly; commit"
```

Generate a Kubernetes secret for the registry CA certificate so Grafana can access the local registry to install plugins.

kubectl create secret generic registry-ca -n prometheus --from-file \
    ca.crt=/cm/local/apps/containerd/var/etc/certs.d/master.cm.cluster:5000/ca.crt

Save the Grafana plugin upload script to a file called push-plugins.sh in the current directory:

#!/usr/bin/env bash
set -euo pipefail
PROG="$(basename "$0")"
die() { echo "$PROG: error: $*" >&2; exit 1; }
log() { echo "$PROG: $*" >&2; }
REGISTRY="master.cm.cluster:5000"
SCHEME="https"
CACERT="/cm/local/apps/containerd/var/etc/certs.d/master.cm.cluster:5000/ca.crt"
PLUGIN_PATTERNS=("grafana-metricsdrilldown-app" "grafana-lokiexplore-app" "yesoreyeram-infinity-datasource")
VERBOSE="${VERBOSE:-0}"
BUNDLE=""
while [[ $# -gt 0 ]]; do
    case "$1" in
        --bundle) BUNDLE="$2"; shift 2 ;;
        -v) VERBOSE=1; shift ;;
        -h|--help) echo "Usage: $PROG --bundle <dir> <repository>" >&2; exit 0 ;;
        --) shift; break ;;
        -*) die "Unknown option: $1" ;;
        *) break ;;
    esac
done
[[ $# -eq 1 ]]    || { echo "Usage: $PROG --bundle <dir> <repository>" >&2; exit 1; }
[[ -n "$BUNDLE" ]] || die "--bundle is required"
REPOSITORY="$1"
[[ -d "$BUNDLE" ]]       || die "Bundle directory not found: $BUNDLE"
[[ -d "$BUNDLE/files" ]] || die "No files/ subdirectory in bundle: $BUNDLE"
[[ -f "$CACERT" ]]       || die "CA certificate not found: $CACERT"
sha256_file() { command -v sha256sum &>/dev/null && sha256sum "$1" | awk '{print $1}' || shasum -a 256 "$1" | awk '{print $1}'; }
sha256_str()  { command -v sha256sum &>/dev/null && printf '%s' "$1" | sha256sum | awk '{print $1}' || printf '%s' "$1" | shasum -a 256 | awk '{print $1}'; }
file_size()   { stat -c%s "$1" 2>/dev/null || stat -f%z "$1"; }
TMPDIR_WORK="$(mktemp -d)"
trap 'rm -rf "$TMPDIR_WORK"' EXIT
_curl() { curl -s --cacert "$CACERT" -o "$TMPDIR_WORK/response_body" -D "$TMPDIR_WORK/response_headers" -w "%{http_code}" "$@" 2>/dev/null; }
_hdr()  { grep -i "^${1}:" "$TMPDIR_WORK/response_headers" 2>/dev/null | head -1 | sed 's/^[^:]*:\s*//' | tr -d '\r'; }
_api()  { local m="$1" u="$2"; shift 2; local c; c=$(_curl -X "$m" "$@" "$u") || c="000"; [[ "${VERBOSE:-0}" == "1" ]] && echo "$PROG: [debug] $m $u → $c" >&2; echo "$c"; }
blob_exists() { [[ "$(_api HEAD "${SCHEME}://${REGISTRY}/v2/${REPOSITORY}/blobs/$1")" == "200" ]]; }
push_blob_file() {
    local file="$1" digest="$2" size="$3" label="${4:-blob}"
    if blob_exists "$digest"; then log "$label already present, skipping"; return 0; fi
    log "Uploading $label (${size} bytes)..."
    local code loc
    code=$(_api POST "${SCHEME}://${REGISTRY}/v2/${REPOSITORY}/blobs/uploads/")
    [[ "$code" == "202" ]] || die "Failed to initiate blob upload (HTTP $code)"
    loc=$(_hdr "Location")
    [[ -n "$loc" ]] || die "No Location header in upload response"
    [[ "$loc" == http* ]] || loc="${SCHEME}://${REGISTRY}${loc}"
    local sep="?"; [[ "$loc" == *"?"* ]] && sep="&"
    code=$(_curl -X PUT -H "Content-Type: application/octet-stream" -H "Content-Length: ${size}" --data-binary "@${file}" "${loc}${sep}digest=${digest}") || code="000"
    [[ "$code" == "201" ]] || die "Blob upload failed (HTTP $code): $(cat "$TMPDIR_WORK/response_body" 2>/dev/null)"
    log "$label uploaded"
}
push_blob_str() {
    blob_exists "$2" && return 0
    printf '%s' "$1" > "$TMPDIR_WORK/blob_str"
    push_blob_file "$TMPDIR_WORK/blob_str" "$2" "$3" "config blob"
}
push_manifest() {
    printf '%s' "$1" > "$TMPDIR_WORK/manifest.json"
    local code
    code=$(_curl -X PUT -H "Content-Type: application/vnd.oci.image.manifest.v1+json" --data-binary "@$TMPDIR_WORK/manifest.json" "${SCHEME}://${REGISTRY}/v2/${REPOSITORY}/manifests/$2") || code="000"
    [[ "$code" == "201" ]] || die "Manifest push failed (HTTP $code): $(cat "$TMPDIR_WORK/response_body" 2>/dev/null)"
}
push_plugin() {
    local zipfile="$1" plugin_name="$2"
    local filename; filename="$(basename "$zipfile")"
    local tag="${filename%.zip}"; tag="${tag//+/-}"
    log "=== ${filename} ==="
    local layer_digest="sha256:$(sha256_file "$zipfile")"
    local layer_size; layer_size=$(file_size "$zipfile")
    local cfg='{}' cfg_digest cfg_size
    cfg_digest="sha256:$(sha256_str "$cfg")"
    cfg_size=${#cfg}
    local manifest
    manifest="{\"schemaVersion\":2,\"mediaType\":\"application/vnd.oci.image.manifest.v1+json\",\"artifactType\":\"application/vnd.oci.artifact.v1\",\"config\":{\"mediaType\":\"application/vnd.oci.empty.v1+json\",\"digest\":\"${cfg_digest}\",\"size\":${cfg_size}},\"layers\":[{\"mediaType\":\"application/zip\",\"digest\":\"${layer_digest}\",\"size\":${layer_size},\"annotations\":{\"org.opencontainers.image.title\":\"${filename}\"}}]}"
    push_blob_str "$cfg" "$cfg_digest" "$cfg_size"
    push_blob_file "$zipfile" "$layer_digest" "$layer_size" "$filename"
    push_manifest "$manifest" "$tag"
    echo "${SCHEME}://${REGISTRY}/v2/${REPOSITORY}/blobs/${layer_digest};${plugin_name}"
}
command -v curl &>/dev/null || die "curl is required"
log "Scanning ${BUNDLE}/files/ for matching plugins..."
urls=()
while IFS= read -r -d '' zipfile; do
    name="$(basename "$zipfile")"
    matched=""
    for p in "${PLUGIN_PATTERNS[@]}"; do [[ "$name" == *"${p}"* ]] && matched="$p" && break; done
    [[ -n "$matched" ]] && urls+=("$(push_plugin "$zipfile" "$matched")")
done < <(find "$BUNDLE/files" -maxdepth 1 -name "*.zip" -print0 | sort -z)
[[ ${#urls[@]} -gt 0 ]] || die "No matching plugin zips found in ${BUNDLE}/files/"
log "Done. Pushed ${#urls[@]} plugin(s)."
echo "grafana:" > grafana_plugins.yaml
echo "  plugins:" >> grafana_plugins.yaml
for url in "${urls[@]}"; do echo "    - ${url}" >> grafana_plugins.yaml; done
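
To make the script's naming concrete: each plugin zip is pushed as a single-layer OCI artifact whose config is the empty JSON document {}, and every blob is addressed by its sha256 digest. The sketch below reproduces two values the script derives, without contacting the registry. The file name is taken from the sample output; the variable names are illustrative, and the tr call stands in for the script's bash-only substitution.

```shell
# Sketch of two values the upload script derives for each plugin zip.
# Nothing is uploaded here; this only shows the derivations.

# 1. Digest of the empty config blob "{}" (2 bytes, no trailing newline).
config_digest="sha256:$(printf '%s' '{}' | sha256sum | awk '{print $1}')"
echo "$config_digest"

# 2. Tag: the file name without ".zip", with any "+" replaced by "-"
#    ("+" is not a valid character in an OCI tag).
filename="grafana-lokiexplore-app-1.0.40.zip"
tag="$(printf '%s' "${filename%.zip}" | tr '+' '-')"
echo "$tag"
```

The first line printed is sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a, the digest of the two-byte document {}; the second is grafana-lokiexplore-app-1.0.40.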

Run the plugin upload script.

Note

These steps assume that the bundle is present in the current directory. If it is elsewhere, update the --bundle argument to the correct path.
```
chmod +x push-plugins.sh && ./push-plugins.sh --bundle ./bundle grafana-plugins
```

The script will generate a file called grafana_plugins.yaml, which will be used in a later step. If successful, the script’s output should look like the following:

push-plugins.sh: Scanning ./bundle/files/ for matching plugins...
push-plugins.sh: === grafana-lokiexplore-app-1.0.40.zip ===
push-plugins.sh: Uploading config blob (2 bytes)...
push-plugins.sh: config blob uploaded
push-plugins.sh: Uploading grafana-lokiexplore-app-1.0.40.zip (10480908 bytes)...
push-plugins.sh: grafana-lokiexplore-app-1.0.40.zip uploaded
push-plugins.sh: === grafana-metricsdrilldown-app-1.0.34.zip ===
push-plugins.sh: Uploading config blob (2 bytes)...
push-plugins.sh: config blob uploaded
push-plugins.sh: Uploading grafana-metricsdrilldown-app-1.0.34.zip (4146320 bytes)...
push-plugins.sh: grafana-metricsdrilldown-app-1.0.34.zip uploaded
push-plugins.sh: === yesoreyeram-infinity-datasource-3.7.4.zip ===
push-plugins.sh: Uploading config blob (2 bytes)...
push-plugins.sh: config blob uploaded
push-plugins.sh: Uploading yesoreyeram-infinity-datasource-3.7.4.zip (81446703 bytes)...
push-plugins.sh: yesoreyeram-infinity-datasource-3.7.4.zip uploaded
push-plugins.sh: Done. Pushed 3 plugin(s).
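
To double-check what ended up in the registry, the tags of the target repository can be listed through the registry HTTP API. This is a minimal sketch: the function name is hypothetical, the /v2/<name>/tags/list endpoint comes from the OCI distribution specification, and the repository name is assumed to be the one passed to push-plugins.sh.

```shell
# Sketch: list the tags stored under a repository in the local registry.
# ASSUMPTION: the repository name matches the argument given to
# push-plugins.sh (grafana-plugins in the example above).
list_plugin_tags() {
    repository="${1:-grafana-plugins}"
    registry="master.cm.cluster:5000"
    cacert="/cm/local/apps/containerd/var/etc/certs.d/${registry}/ca.crt"
    # --fail makes curl return non-zero on HTTP errors such as 404.
    curl --silent --fail --cacert "$cacert" \
        "https://${registry}/v2/${repository}/tags/list"
}
```

Running list_plugin_tags should return a JSON document whose tags array contains one entry per uploaded plugin zip.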

Create a new file with the following values. Replace UFM_USER and UFM_PASSWORD with the credentials for accessing the UFM REST API:

cat << EOF > grafana_infinity.yaml
grafana:
  enabled: true
  extraSecretMounts:
    - name: registry-ca
      mountPath: /etc/ssl/certs/ca-cert-registry.crt
      secretName: registry-ca
      readOnly: true
      optional: false
      subPath: ca.crt
  additionalDataSources:
    - name: BCM API
      type: yesoreyeram-infinity-datasource
      access: proxy
      isDefault: false
      basicAuth: true
      basicAuthUser: apiuser
      jsonData:
        auth_method: 'basicAuth'
        allowedHosts:
          - https://master
        tlsSkipVerify: true
        timeoutInSeconds: 60
      secureJsonData:
        basicAuthPassword: apiuserpassword
    - name: UFM API
      type: yesoreyeram-infinity-datasource
      access: proxy
      isDefault: false
      basicAuth: true
      basicAuthUser: UFM_USER
      jsonData:
        auth_method: 'basicAuth'
        allowedHosts:
          - https://master
        tlsSkipVerify: true
        timeoutInSeconds: 60
      secureJsonData:
        basicAuthPassword: UFM_PASSWORD
EOF
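
Since the file above contains literal placeholders, a small guard can catch the case where they were not replaced before the values are applied. The following is a sketch; the function name is hypothetical and the check is a plain text search, not a YAML parse.

```shell
# Sketch: refuse to continue while the UFM placeholders are still present.
check_ufm_placeholders() {
    values_file="${1:-grafana_infinity.yaml}"
    # grep prints any offending lines (with line numbers) to stderr.
    if grep -n -E 'UFM_USER|UFM_PASSWORD' "$values_file" >&2; then
        echo "error: replace the placeholders above in $values_file" >&2
        return 1
    fi
    echo "ok: no UFM placeholders left in $values_file"
}
```

Running check_ufm_placeholders before the helm upgrade returns a non-zero status if either placeholder is still in the file. Because the file holds passwords in plain text, restricting its permissions (for example with chmod 600) is a reasonable precaution.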

Run the following command to apply the values:

helm upgrade kube-prometheus-stack ./kube-prometheus-stack -n prometheus \
  --reuse-values -f grafana_infinity.yaml -f grafana_plugins.yaml

Accessing the UFM REST API also requires a port-forwarding rule within BCM. Replace UFM_IP_ADDRESS in the following command with the IP address of the UFM appliance:
```
cmsh -c "device; portforward create UFM_IP_ADDRESS 443 11443"
```
Repeat this on the secondary BCM head node, if one is configured.

Note

The port-forwarding rules do not persist across restarts of the BCM daemon process. If the head node reboots or the daemon restarts for any reason, rerun the port-forwarding command.
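
Because the rule can silently disappear, a quick reachability probe can tell you when it needs to be recreated. The sketch below is only a probe, not a health check of UFM itself. The function name is hypothetical, and it assumes the forward listens on port 11443 of the host named master, which is inferred from the command and data source settings above rather than stated by them.

```shell
# Sketch: test whether the forwarded UFM port answers on the head node.
# ASSUMPTIONS: the forward listens on port 11443 of the host "master".
# --insecure skips certificate verification, so this is a reachability
# probe only; curl returns non-zero if no connection can be made.
ufm_forward_is_up() {
    host="${1:-master}"
    port="${2:-11443}"
    curl --silent --insecure --max-time 5 --output /dev/null "https://${host}:${port}/"
}
```

A typical use would be to run ufm_forward_is_up after a head node reboot and rerun the cmsh portforward command if it returns a non-zero status.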

Installation#

Note

These steps assume that the bundle is present in the current directory.

Upload the .tgz file to the cluster head node.

Update values to enable folders within Grafana:

helm upgrade -n prometheus kube-prometheus-stack ./kube-prometheus-stack \
  --set grafana.sidecar.dashboards.provider.foldersFromFilesStructure=true \
  --set grafana.sidecar.dashboards.enabled=true --reuse-values

DGX B200

Run the following command (update the filename if needed):

helm upgrade --install -n prometheus nmc-dashboards \
    ./bundle/helm/nmc-grafana-dashboards-27.3.2.tgz --set dgx_system=b200

DGX GB200

Run the following command (update the filename if needed):

helm upgrade --install -n prometheus nmc-dashboards \
    ./bundle/helm/nmc-grafana-dashboards-27.3.2.tgz --set dgx_system=gb200

DGX B300

Run the following command (update the filename if needed):

helm upgrade --install -n prometheus nmc-dashboards \
    ./bundle/helm/nmc-grafana-dashboards-27.3.2.tgz --set dgx_system=b300

DGX GB300

Run the following command (update the filename if needed):

helm upgrade --install -n prometheus nmc-dashboards \
    ./bundle/helm/nmc-grafana-dashboards-27.3.2.tgz --set dgx_system=gb300
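
The four commands above differ only in the dgx_system value. As an illustration, the sketch below validates that value and prints the matching command so it can be reviewed before it is run. The function name is hypothetical, and the default chart path is the one used above.

```shell
# Sketch: build the dashboard install command for one DGX system type.
# It only prints the command; nothing is installed by this function.
nmc_dashboards_cmd() {
    dgx_system="$1"
    chart="${2:-./bundle/helm/nmc-grafana-dashboards-27.3.2.tgz}"
    case "$dgx_system" in
        b200|gb200|b300|gb300) ;;
        *) echo "error: dgx_system must be one of b200, gb200, b300, gb300" >&2
           return 1 ;;
    esac
    echo "helm upgrade --install -n prometheus nmc-dashboards $chart --set dgx_system=$dgx_system"
}

nmc_dashboards_cmd gb200
```

For gb200 this prints the same command shown in the DGX GB200 section; any other value outside the four listed system types is rejected with an error.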

Accessing dashboards#

In a browser, open: https://<headnode>/grafana

Log in using the default credentials. Retrieve the password for the Grafana admin user by running:

kubectl get secret -n prometheus kube-prometheus-stack-grafana \
    -o jsonpath="{.data.admin-password}" | base64 -d ; echo

Navigate to Dashboards in the Grafana UI.

Each deployed dashboard name is prefixed with a number, for example:
1. 01. SuperPOD Overall Dashboard

Customization and BMS integration#

The dashboards included in this Helm chart are provided as examples. They are fully functional out of the box but are intended to be customized to match the customer’s specific cluster environment.

One dashboard, BMS View of Cluster, is configured to use NVIDIA Cronus as the Building Management System (BMS) data source. It displays power, liquid cooling, environmental, and other facility-level metrics.

Most customers will need to:

Integrate their own BMS with NVIDIA BCM.
Update the dashboard’s data source and queries to reflect their own BMS setup.

These dashboards serve as a starting point for visualizing infrastructure telemetry. You are encouraged to modify or extend them based on the facility’s capabilities and the telemetry available. Some dashboards show what filtering by rack can look like, but you might need to customize the filters to fit your environment.

Troubleshooting#

Error: Could not load plugin#

On rare occasions, some dashboards may fail to load properly and display an error:

BCM|UFM API plugin failed
Error: Could not load plugin

The fix is to scale the Grafana deployment down to 0 replicas and back up again:

kubectl scale deployment kube-prometheus-stack-grafana -n prometheus --replicas=0
sleep 30
kubectl scale deployment kube-prometheus-stack-grafana -n prometheus --replicas=1

This will restart the Grafana deployment and the dashboards should load properly.
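
The same restart can be wrapped in a function that also waits for the new pod to become ready, so the dashboards are not reloaded too early. This is a sketch under the assumption that the deployment name and namespace are the ones used above; the function name and the 300-second timeout are illustrative.

```shell
# Sketch: restart Grafana by scaling to 0 and back, then wait for readiness.
restart_grafana() {
    deployment="deployment/kube-prometheus-stack-grafana"
    namespace="prometheus"
    kubectl scale "$deployment" -n "$namespace" --replicas=0 || return 1
    # Pause between scale-down and scale-up; defaults to 30 seconds.
    sleep "${1:-30}"
    kubectl scale "$deployment" -n "$namespace" --replicas=1 || return 1
    # Blocks until the new pod is rolled out or the timeout expires.
    kubectl rollout status "$deployment" -n "$namespace" --timeout=300s
}
```

Calling restart_grafana uses the 30-second pause from the steps above; an optional first argument overrides the pause in seconds.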