Observability Stack Configuration#

Configure storage and retention for Prometheus#

  1. Create a new YAML file on the head node for this setup:

    cat << EOF > prom_config.yaml
    prometheus:
      prometheusSpec:
        retention: 90d
        storageSpec:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: local-path
              resources:
                requests:
                  storage: 300Gi
    EOF
    
  2. Install/upgrade CRDs:

    helm pull oci://master.cm.cluster:5000/helm-charts/kube-prometheus-stack --untar \
        --ca-file /cm/local/apps/containerd/var/etc/certs.d/master.cm.cluster:5000/ca.crt
    kubectl apply -f kube-prometheus-stack/charts/crds/crds/ --server-side
    
  3. With the configuration file in place, apply these values to the K8s cluster:

    helm upgrade kube-prometheus-stack ./kube-prometheus-stack \
        -n prometheus -f prom_config.yaml --reuse-values
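
As a rough cross-check of the 90d retention against the 300Gi volume, the Prometheus storage documentation estimates disk usage as retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies that estimate; the function name `prom_disk_gib` and the ingestion rate of 20000 samples per second are illustrative assumptions, not values measured on this cluster.

```shell
#!/usr/bin/env bash
# Rough Prometheus disk estimate:
#   retention_seconds * samples_per_second * bytes_per_sample
# All inputs are illustrative assumptions, not measurements from this cluster.
prom_disk_gib() {
    local retention_days="$1"
    local samples_per_sec="$2"
    local bytes_per_sample="${3:-2}"   # assumed upper bound of ~2 bytes per sample
    local bytes=$(( retention_days * 86400 * samples_per_sec * bytes_per_sample ))
    local gib=$(( 1024 * 1024 * 1024 ))
    # Round up to the next whole GiB.
    echo $(( (bytes + gib - 1) / gib ))
}
```

Calling `prom_disk_gib 90 20000 2` prints 290, which fits within the 300Gi request. The actual ingestion rate can be read from a running Prometheus with the query `rate(prometheus_tsdb_head_samples_appended_total[1h])`.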
    

Configure storage and retention for Loki#

Changing the storage size requires redeploying the Loki Helm chart, because the existing volume cannot be resized live.

  1. Create a new YAML file on the head node for this setup:

    cat << EOF > loki_config.yaml
    global:
      image:
        registry: master.cm.cluster:5000
    memcached:
      image:
        repository: library/memcached
    backend:
      replicas: 3
    chunksCache:
      writebackSizeLimit: 500MB
    commonConfig:
      replication_factor: 3
    compactor:
      replicas: 0
    deploymentMode: SimpleScalable
    distributor:
      maxUnavailable: 0
      replicas: 0
    indexGateway:
      maxUnavailable: 0
      replicas: 0
    ingester:
      replicas: 0
    loki:
      limits_config:
        ingestion_rate_mb: 75
        ingestion_burst_size_mb: 150
        per_stream_rate_limit: 16MB
        per_stream_rate_limit_burst: 64MB
        retention_period: 90d
      auth_enabled: false
      commonConfig:
        replication_factor: 3
      schemaConfig:
        configs:
        - from: '2025-09-04'
          index:
            period: 24h
            prefix: loki_index_
          object_store: s3
          schema: v13
          store: tsdb
      storage:
        type: s3
        bucketNames:
          chunks: loki-chunks-bucket
          ruler: loki-ruler-bucket
          admin: loki-admin-bucket
    minio:
      enabled: true
      persistence:
        size: 500Gi
      image:
        repository: master.cm.cluster:5000/minio/minio
      mcImage:
        repository: master.cm.cluster:5000/minio/mc
    querier:
      maxUnavailable: 0
      replicas: 0
    queryFrontend:
      maxUnavailable: 0
      replicas: 0
    queryScheduler:
      replicas: 0
    read:
      replicas: 3
    singleBinary:
      replicas: 0
    write:
      replicas: 3
    EOF
    
  2. Uninstall the existing Helm chart:

    helm uninstall loki -n loki
    
  3. Delete any lingering PVCs:

    kubectl delete pvc --all -n loki
    
  4. Install the Helm chart for Loki, using the new values:

    helm install loki oci://master.cm.cluster:5000/helm-charts/loki --ca-file \
        /cm/local/apps/containerd/var/etc/certs.d/master.cm.cluster:5000/ca.crt \
        -n loki -f loki_config.yaml
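
After the install in step 4, it can help to confirm that the read, write, and backend pods have all started. The following is a minimal sketch, assuming the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, and so on); the helper name `count_unhealthy_pods` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints how many pods report a STATUS other than Running or Completed.
count_unhealthy_pods() {
    awk '$3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}
```

Running `kubectl get pods -n loki --no-headers | count_unhealthy_pods` prints 0 when no pod is pending or crashing. A result of 0 does not by itself confirm readiness, so the READY column is still worth a look.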
    

Configure Observability#

Follow the instructions in these sections of the nmc-observability-stack-configuration page:

  1. Enable Prometheus scrape endpoint in BCM

  2. Configure Prometheus to scrape metrics from BCM endpoint

  3. Configure Prometheus to scrape metrics from DCGM endpoints

  4. Grafana and querying BCM metrics

  5. Reduce BCM monitoring directory size

Install NMC Grafana dashboards#

NVIDIA Mission Control provides a set of Grafana dashboards that consolidate monitoring data across the cluster. These dashboards help track ongoing operations and support troubleshooting. They serve as a starting point for visualizing infrastructure telemetry, and you are encouraged to modify or extend them based on the facility’s capabilities.

Prerequisites#

To take advantage of all dashboard features, install and configure the Infinity plugin and data sources to process data from the BCM REST API and UFM REST API.

  1. Create a user for accessing the BCM REST API. The following command creates the user apiuser with the password apiuserpassword and a read-only BCM profile:

    cmsh -c "user; add apiuser; set password apiuserpassword; set profile readonly; commit"
    
  2. Generate a Kubernetes secret for the registry CA certificate so that Grafana can access the local registry to install plugins:

    kubectl create secret generic registry-ca -n prometheus --from-file \
        ca.crt=/cm/local/apps/containerd/var/etc/certs.d/master.cm.cluster:5000/ca.crt
    
  3. Save the Grafana plugin upload script to a file called push-plugins.sh in the current directory:

    #!/usr/bin/env bash
    set -euo pipefail
    PROG="$(basename "$0")"
    die() { echo "$PROG: error: $*" >&2; exit 1; }
    log() { echo "$PROG: $*" >&2; }
    REGISTRY="master.cm.cluster:5000"
    SCHEME="https"
    CACERT="/cm/local/apps/containerd/var/etc/certs.d/master.cm.cluster:5000/ca.crt"
    PLUGIN_PATTERNS=("grafana-metricsdrilldown-app" "grafana-lokiexplore-app" "yesoreyeram-infinity-datasource")
    VERBOSE="${VERBOSE:-0}"
    BUNDLE=""
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --bundle) BUNDLE="$2"; shift 2 ;;
            -v) VERBOSE=1; shift ;;
            -h|--help) echo "Usage: $PROG --bundle <dir> <repository>" >&2; exit 0 ;;
            --) shift; break ;;
            -*) die "Unknown option: $1" ;;
            *) break ;;
        esac
    done
    [[ $# -eq 1 ]]    || { echo "Usage: $PROG --bundle <dir> <repository>" >&2; exit 1; }
    [[ -n "$BUNDLE" ]] || die "--bundle is required"
    REPOSITORY="$1"
    [[ -d "$BUNDLE" ]]       || die "Bundle directory not found: $BUNDLE"
    [[ -d "$BUNDLE/files" ]] || die "No files/ subdirectory in bundle: $BUNDLE"
    [[ -f "$CACERT" ]]       || die "CA certificate not found: $CACERT"
    sha256_file() { command -v sha256sum &>/dev/null && sha256sum "$1" | awk '{print $1}' || shasum -a 256 "$1" | awk '{print $1}'; }
    sha256_str()  { command -v sha256sum &>/dev/null && printf '%s' "$1" | sha256sum | awk '{print $1}' || printf '%s' "$1" | shasum -a 256 | awk '{print $1}'; }
    file_size()   { stat -c%s "$1" 2>/dev/null || stat -f%z "$1"; }
    TMPDIR_WORK="$(mktemp -d)"
    trap 'rm -rf "$TMPDIR_WORK"' EXIT
    _curl() { curl -s --cacert "$CACERT" -o "$TMPDIR_WORK/response_body" -D "$TMPDIR_WORK/response_headers" -w "%{http_code}" "$@" 2>/dev/null; }
    _hdr()  { grep -i "^${1}:" "$TMPDIR_WORK/response_headers" 2>/dev/null | head -1 | sed 's/^[^:]*:\s*//' | tr -d '\r'; }
    _api()  { local m="$1" u="$2"; shift 2; local c; c=$(_curl -X "$m" "$@" "$u") || c="000"; [[ "${VERBOSE:-0}" == "1" ]] && echo "$PROG: [debug] $m $u -> $c" >&2; echo "$c"; }
    blob_exists() { [[ "$(_api HEAD "${SCHEME}://${REGISTRY}/v2/${REPOSITORY}/blobs/$1")" == "200" ]]; }
    push_blob_file() {
        local file="$1" digest="$2" size="$3" label="${4:-blob}"
        if blob_exists "$digest"; then log "$label already present, skipping"; return 0; fi
        log "Uploading $label (${size} bytes)..."
        local code loc
        code=$(_api POST "${SCHEME}://${REGISTRY}/v2/${REPOSITORY}/blobs/uploads/")
        [[ "$code" == "202" ]] || die "Failed to initiate blob upload (HTTP $code)"
        loc=$(_hdr "Location")
        [[ -n "$loc" ]] || die "No Location header in upload response"
        [[ "$loc" == http* ]] || loc="${SCHEME}://${REGISTRY}${loc}"
        local sep="?"; [[ "$loc" == *"?"* ]] && sep="&"
        code=$(_curl -X PUT -H "Content-Type: application/octet-stream" -H "Content-Length: ${size}" --data-binary "@${file}" "${loc}${sep}digest=${digest}") || code="000"
        [[ "$code" == "201" ]] || die "Blob upload failed (HTTP $code): $(cat "$TMPDIR_WORK/response_body" 2>/dev/null)"
        log "$label uploaded"
    }
    push_blob_str() {
        blob_exists "$2" && return 0
        printf '%s' "$1" > "$TMPDIR_WORK/blob_str"
        push_blob_file "$TMPDIR_WORK/blob_str" "$2" "$3" "config blob"
    }
    push_manifest() {
        printf '%s' "$1" > "$TMPDIR_WORK/manifest.json"
        local code
        code=$(_curl -X PUT -H "Content-Type: application/vnd.oci.image.manifest.v1+json" --data-binary "@$TMPDIR_WORK/manifest.json" "${SCHEME}://${REGISTRY}/v2/${REPOSITORY}/manifests/$2") || code="000"
        [[ "$code" == "201" ]] || die "Manifest push failed (HTTP $code): $(cat "$TMPDIR_WORK/response_body" 2>/dev/null)"
    }
    push_plugin() {
        local zipfile="$1" plugin_name="$2"
        local filename; filename="$(basename "$zipfile")"
        local tag="${filename%.zip}"; tag="${tag//+/-}"
        log "=== ${filename} ==="
        local layer_digest="sha256:$(sha256_file "$zipfile")"
        local layer_size; layer_size=$(file_size "$zipfile")
        local cfg='{}' cfg_digest cfg_size
        cfg_digest="sha256:$(sha256_str "$cfg")"
        cfg_size=${#cfg}
        local manifest
        manifest="{\"schemaVersion\":2,\"mediaType\":\"application/vnd.oci.image.manifest.v1+json\",\"artifactType\":\"application/vnd.oci.artifact.v1\",\"config\":{\"mediaType\":\"application/vnd.oci.empty.v1+json\",\"digest\":\"${cfg_digest}\",\"size\":${cfg_size}},\"layers\":[{\"mediaType\":\"application/zip\",\"digest\":\"${layer_digest}\",\"size\":${layer_size},\"annotations\":{\"org.opencontainers.image.title\":\"${filename}\"}}]}"
        push_blob_str "$cfg" "$cfg_digest" "$cfg_size"
        push_blob_file "$zipfile" "$layer_digest" "$layer_size" "$filename"
        push_manifest "$manifest" "$tag"
        echo "${SCHEME}://${REGISTRY}/v2/${REPOSITORY}/blobs/${layer_digest};${plugin_name}"
    }
    command -v curl &>/dev/null || die "curl is required"
    log "Scanning ${BUNDLE}/files/ for matching plugins..."
    urls=()
    while IFS= read -r -d '' zipfile; do
        name="$(basename "$zipfile")"
        matched=""
        for p in "${PLUGIN_PATTERNS[@]}"; do [[ "$name" == *"${p}"* ]] && matched="$p" && break; done
        [[ -n "$matched" ]] && urls+=("$(push_plugin "$zipfile" "$matched")")
    done < <(find "$BUNDLE/files" -maxdepth 1 -name "*.zip" -print0 | sort -z)
    [[ ${#urls[@]} -gt 0 ]] || die "No matching plugin zips found in ${BUNDLE}/files/"
    log "Done. Pushed ${#urls[@]} plugin(s)."
    echo "grafana:" > grafana_plugins.yaml
    echo "  plugins:" >> grafana_plugins.yaml
    for url in "${urls[@]}"; do echo "    - ${url}" >> grafana_plugins.yaml; done
    
  4. Run the plugin upload script:

    Note

    These steps assume that the bundle is present in the current directory. If it is elsewhere, update the --bundle argument to the correct path.

    chmod +x push-plugins.sh && ./push-plugins.sh --bundle ./bundle grafana-plugins
    
  5. The script will generate a file called grafana_plugins.yaml, which will be used in a later step. If successful, the script’s output should look like the following:

    push-plugins.sh: Scanning ./bundle/files/ for matching plugins...
    push-plugins.sh: === grafana-lokiexplore-app-1.0.40.zip ===
    push-plugins.sh: Uploading config blob (2 bytes)...
    push-plugins.sh: config blob uploaded
    push-plugins.sh: Uploading grafana-lokiexplore-app-1.0.40.zip (10480908 bytes)...
    push-plugins.sh: grafana-lokiexplore-app-1.0.40.zip uploaded
    push-plugins.sh: === grafana-metricsdrilldown-app-1.0.34.zip ===
    push-plugins.sh: Uploading config blob (2 bytes)...
    push-plugins.sh: config blob uploaded
    push-plugins.sh: Uploading grafana-metricsdrilldown-app-1.0.34.zip (4146320 bytes)...
    push-plugins.sh: grafana-metricsdrilldown-app-1.0.34.zip uploaded
    push-plugins.sh: === yesoreyeram-infinity-datasource-3.7.4.zip ===
    push-plugins.sh: Uploading config blob (2 bytes)...
    push-plugins.sh: config blob uploaded
    push-plugins.sh: Uploading yesoreyeram-infinity-datasource-3.7.4.zip (81446703 bytes)...
    push-plugins.sh: yesoreyeram-infinity-datasource-3.7.4.zip uploaded
    push-plugins.sh: Done. Pushed 3 plugin(s).
    
  6. Create a new file with the following values, replacing UFM_USER and UFM_PASSWORD with your credentials for the UFM REST API:

    cat << EOF > grafana_infinity.yaml
    grafana:
      enabled: true
      extraSecretMounts:
        - name: registry-ca
          mountPath: /etc/ssl/certs/ca-cert-registry.crt
          secretName: registry-ca
          readOnly: true
          optional: false
          subPath: ca.crt
      additionalDataSources:
        - name: BCM API
          type: yesoreyeram-infinity-datasource
          access: proxy
          isDefault: false
          basicAuth: true
          basicAuthUser: apiuser
          jsonData:
            auth_method: 'basicAuth'
            allowedHosts:
              - https://master
            tlsSkipVerify: true
            timeoutInSeconds: 60
          secureJsonData:
            basicAuthPassword: apiuserpassword
        - name: UFM API
          type: yesoreyeram-infinity-datasource
          access: proxy
          isDefault: false
          basicAuth: true
          basicAuthUser: UFM_USER
          jsonData:
            auth_method: 'basicAuth'
            allowedHosts:
              - https://master
            tlsSkipVerify: true
            timeoutInSeconds: 60
          secureJsonData:
            basicAuthPassword: UFM_PASSWORD
    EOF
    
  7. Run the following command to apply the values:

    helm upgrade kube-prometheus-stack ./kube-prometheus-stack -n prometheus \
      --reuse-values -f grafana_infinity.yaml -f grafana_plugins.yaml
    
  8. Accessing the UFM REST API also requires a port-forwarding rule within BCM. Find the IP address of the UFM appliance and substitute it for UFM_IP_ADDRESS in the following command:

    cmsh -c "device; portforward create UFM_IP_ADDRESS 443 11443"
    

    This should be repeated on your secondary BCM head node, if you have one configured.

Note

Port-forwarding rules do not persist across restarts of the BCM daemon process. If the head node reboots or the daemon restarts for any reason, rerun the port-forwarding command.
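
To confirm what the upload script pushed in step 4, the registry can be queried directly. The sketch below assumes the registry implements the standard OCI distribution `tags/list` endpoint and reuses the registry address, CA path, and `grafana-plugins` repository name from the steps above; the function names are hypothetical. `plugin_tag` reproduces the script's tag rule: the zip filename without `.zip`, with `+` replaced by `-`.

```shell
#!/usr/bin/env bash
REGISTRY="master.cm.cluster:5000"
CACERT="/cm/local/apps/containerd/var/etc/certs.d/master.cm.cluster:5000/ca.crt"

# Hypothetical helper: derive the tag the same way push-plugins.sh does.
plugin_tag() {
    basename "$1" .zip | tr '+' '-'
}

# Hypothetical helper: build the tags/list URL for a repository.
tags_url() {
    echo "https://${REGISTRY}/v2/${1}/tags/list"
}

# Hypothetical helper: fetch the tag list as JSON (needs access to the registry).
list_pushed_tags() {
    curl -s --cacert "$CACERT" "$(tags_url "$1")"
}
```

`list_pushed_tags grafana-plugins` should return a JSON document whose `tags` array holds one entry per uploaded zip, for example the value printed by `plugin_tag grafana-lokiexplore-app-1.0.40.zip`.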

Installation#

Note

These steps assume that the bundle is present in the current directory.

  1. Upload the .tgz file to the cluster head node.

  2. Update values to enable folders within Grafana:

    helm upgrade -n prometheus kube-prometheus-stack ./kube-prometheus-stack \
      --set grafana.sidecar.dashboards.provider.foldersFromFilesStructure=true \
      --set grafana.sidecar.dashboards.enabled=true --reuse-values
    
  3. Run the command that matches your DGX system (update the filename if needed).

    For B200 systems:

    helm upgrade --install -n prometheus nmc-dashboards \
        ./bundle/helm/nmc-grafana-dashboards-27.3.2.tgz --set dgx_system=b200

    For GB200 systems:

    helm upgrade --install -n prometheus nmc-dashboards \
        ./bundle/helm/nmc-grafana-dashboards-27.3.2.tgz --set dgx_system=gb200

    For B300 systems:

    helm upgrade --install -n prometheus nmc-dashboards \
        ./bundle/helm/nmc-grafana-dashboards-27.3.2.tgz --set dgx_system=b300

    For GB300 systems:

    helm upgrade --install -n prometheus nmc-dashboards \
        ./bundle/helm/nmc-grafana-dashboards-27.3.2.tgz --set dgx_system=gb300
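
Because the install commands above differ only in the `dgx_system` value, a small wrapper can guard against typos. This is a sketch, assuming that the four values shown in this section (`b200`, `gb200`, `b300`, `gb300`) are the only ones the chart accepts, which the source does not state; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: accept only the dgx_system values shown in this section.
valid_dgx_system() {
    case "$1" in
        b200|gb200|b300|gb300) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper around the documented helm command.
# $1: dgx_system value, $2: chart path (defaults to the filename used above).
install_dashboards() {
    local system="$1"
    local chart="${2:-./bundle/helm/nmc-grafana-dashboards-27.3.2.tgz}"
    if ! valid_dgx_system "$system"; then
        echo "unsupported dgx_system: $system" >&2
        return 1
    fi
    helm upgrade --install -n prometheus nmc-dashboards "$chart" \
        --set "dgx_system=${system}"
}
```

For example, `install_dashboards gb200` runs the GB200 variant, while `install_dashboards gb20` fails before helm is invoked.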
    

Accessing dashboards#

  1. In a browser, open: https://<headnode>/grafana

  2. Log in with the default credentials. To retrieve the password for the Grafana ‘admin’ user, run:

    kubectl get secret -n prometheus kube-prometheus-stack-grafana \
        -o jsonpath="{.data.admin-password}" | base64 -d ; echo
    
  3. Navigate to Dashboards in the Grafana UI.

    Deployed dashboards are prefixed with a number, for example:

    01. SuperPOD Overall Dashboard
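
Before logging in, a quick way to confirm that Grafana is reachable is its health endpoint. The sketch below assumes Grafana is served under the `/grafana` sub-path shown in step 1 and that the client may not trust the head node certificate (hence `-k`); `<headnode>` remains a placeholder and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the Grafana health URL for a given head node.
grafana_health_url() {
    echo "https://${1}/grafana/api/health"
}

# Hypothetical helper: query the health endpoint. -k skips TLS verification,
# an assumption for head nodes that present a self-signed certificate.
check_grafana() {
    curl -sk "$(grafana_health_url "$1")"
}
```

`check_grafana <headnode>` returns a small JSON document; a `database` value of `ok` indicates that Grafana is up.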

Customization and BMS integration#

The dashboards included in this Helm chart are provided as examples. They are fully functional out of the box but are intended to be customized to match the customer’s specific cluster environment.

One dashboard, BMS View of Cluster, is configured to use NVIDIA Cronus as the Building Management System (BMS) data source. It displays power, liquid cooling, environmental, and other facility-level metrics.

Most customers will need to:

  1. Integrate their own BMS with NVIDIA BCM.

  2. Update the dashboard’s data source and queries to reflect their own BMS setup.

These dashboards serve as a starting point for visualizing infrastructure telemetry. Customers are encouraged to modify or extend them based on the facility’s capabilities and the telemetry available. Some dashboards show what filtering by rack can look like, but the filters might need to be customized for a given environment.

Troubleshooting#

Error: Could not load plugin#

On rare occasions, some dashboards may fail to load properly and display an error:

BCM|UFM API plugin failed
Error: Could not load plugin

The fix is to scale the Grafana deployment down to 0 and back up again.

kubectl scale deployment kube-prometheus-stack-grafana -n prometheus --replicas=0
sleep 30
kubectl scale deployment kube-prometheus-stack-grafana -n prometheus --replicas=1

This will restart the Grafana deployment and the dashboards should load properly.
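
The restart sequence above can be wrapped in a function that also waits for the new pod to become ready. This is a minimal sketch: the function name `restart_grafana` and the 300s timeout are assumptions, while the deployment name and namespace are taken from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the documented scale-down/scale-up sequence.
# $1: pause in seconds between scale-down and scale-up (defaults to 30).
restart_grafana() {
    local pause="${1:-30}"
    local deploy="kube-prometheus-stack-grafana"
    local ns="prometheus"
    kubectl scale deployment "$deploy" -n "$ns" --replicas=0 || return 1
    sleep "$pause"
    kubectl scale deployment "$deploy" -n "$ns" --replicas=1 || return 1
    # Block until the new pod is rolled out, or fail after the assumed timeout.
    kubectl rollout status "deployment/${deploy}" -n "$ns" --timeout=300s
}
```

Run `restart_grafana` with no arguments to keep the 30-second pause from the original commands.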