Observability Stack Configuration#

The BCM cm-kubernetes-setup wizard deploys all of the foundational observability components and, by default, enables observability within the K8s environment. However, additional configuration is needed to enable the time-series and logging data sources from BCM.

Configure storage and retention for Prometheus#

  1. Create a new YAML file on the head node for this setup:

    cat << EOF > prom_config.yaml
    prometheus:
      prometheusSpec:
        retention: 90d
        storageSpec:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 300Gi
      persistence:
        enabled: true
        storageClassName: local-path
        type: pvc
    EOF
    
  2. With the configuration file in place, apply these values to the K8s cluster:

    helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack -n prometheus -f prom_config.yaml --reuse-values
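
    To confirm that the new settings took effect, you can check that a persistent volume claim of the requested size was created for Prometheus. This is a sketch; the PVC name varies with the release name and chart version:

    # expect a Bound claim of 300Gi for the Prometheus StatefulSet
    kubectl get pvc -n prometheus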
    

Configure storage and retention for Loki#

This requires redeploying the Loki Helm chart, because the new storage size cannot be applied to the existing deployment (the volume cannot be resized live).

  1. Create a new YAML file on the head node for this setup:

    cat << EOF > loki_config.yaml
    backend:
      replicas: 3
    chunksCache:
      writebackSizeLimit: 500MB
    commonConfig:
      replication_factor: 3
    compactor:
      replicas: 0
    deploymentMode: SimpleScalable
    distributor:
      maxUnavailable: 0
      replicas: 0
    indexGateway:
      maxUnavailable: 0
      replicas: 0
    ingester:
      replicas: 0
    loki:
      limits_config:
        ingestion_rate_mb: 75
        ingestion_burst_size_mb: 150
        per_stream_rate_limit: 16MB
        per_stream_rate_limit_burst: 64MB
        retention_period: 90d
      auth_enabled: false
      commonConfig:
        replication_factor: 3
      schemaConfig:
        configs:
        - from: '2025-09-04'
          index:
            period: 24h
            prefix: loki_index_
          object_store: s3
          schema: v13
          store: tsdb
      storage:
        type: s3
    minio:
      enabled: true
      persistence:
        size: 500Gi
    querier:
      maxUnavailable: 0
      replicas: 0
    queryFrontend:
      maxUnavailable: 0
      replicas: 0
    queryScheduler:
      replicas: 0
    read:
      replicas: 3
    singleBinary:
      replicas: 0
    write:
      replicas: 3
    EOF
    
  2. Uninstall the existing Helm chart:

    helm uninstall loki -n loki
    
  3. Delete any lingering PVCs:

    kubectl delete pvc --all -n loki

  4. Install the Helm chart for Loki, using the new values:

    helm install loki grafana/loki -n loki -f loki_config.yaml
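
    It can take a few minutes for all components to start. A quick check that the expected pods and volumes exist (a sketch, assuming the loki namespace used above):

    # expect 3 write, 3 read, and 3 backend pods, plus the MinIO pod, in Running state
    kubectl get pods -n loki
    kubectl get pvc -n loki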
    

Enable Prometheus scrape endpoint in BCM#

  1. On the head node (both head nodes, if configured for HA), use sed to edit the configuration in place:

    sed -i 's/# EnablePrometheusExporterService = true/EnablePrometheusExporterService = true/g' /cm/local/apps/cmd/etc/cmd.conf
    
  2. Disable certificate-based authentication for the scrape endpoint, and enable extra information to be added to the metrics:

    cm-manipulate-advanced-config.py PrometheusExporterExtraInfo=1 PrometheusExporterInfo=1 PrometheusExporterEnum=0 PrometheusExporterRequireCertificate=0 PrometheusLabelIncludeSerialNumber=1 PrometheusLabelIncludePartNumber=1 PrometheusLabelIncludeSystemName=1 MaxMeasurablesPerProducer=1600
    

    If you have integrated your BMS with BCM, you will additionally need a configuration like the following:

    cm-manipulate-advanced-config.py PushMonitoringDeviceStatusMetrics=CDUStatus,CDULiquidSystemPressure,CDULiquidReturnTemperature
    
  3. Restart the cmd service:

    systemctl restart cmd
    
  4. After cmd restarts, run the following curl command to validate that metrics are being exported successfully:

    curl -sk https://localhost:8081/exporter | tail -n 2
    

    The exporter can take some time to start after cmd is restarted. Example output:

    writetime_total{base_type="Device",category="default",hostname="node001",type="PhysicalNode",parameter="vda"} 0
    writetime_total{base_type="Device",category="default",hostname="node001",type="PhysicalNode",parameter="vdb"} 3882985
    

Configure Prometheus to scrape metrics from BCM endpoint#

Because Prometheus was deployed earlier through the BCM K8s wizard, configuration changes to enable this data source must be made through the Prometheus operator.

Prepare the configuration files#

Configuration changes in K8s are typically made with YAML files containing custom values. Create the following three YAML files on the head node for this setup.

  1. bcm_prom_endpoint.yaml:

    cat << EOF > /tmp/bcm_prom_endpoint.yaml
    apiVersion: v1
    kind: Endpoints
    metadata:
      name: external-bcmexporter
      namespace: prometheus
      labels:
        app: external-bcmexporter
    subsets:
      - addresses:
    $(cmsh -c 'device; list -f ip:0 -t headnode' | xargs -I{} echo "    - ip: {}")
        ports:
        - name: metrics
          port: 8081
    EOF
    
  2. bcm_prom_svc.yaml:

    cat << EOF > /tmp/bcm_prom_svc.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: external-bcmexporter
      namespace: prometheus
      labels:
        app: external-bcmexporter
    spec:
      clusterIP: None
      ports:
      - name: metrics
        port: 8081
        targetPort: 8081
    EOF
    
  3. bcm_prom_svcmon.yaml:

    cat << EOF > /tmp/bcm_prom_svcmon.yaml
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: external-bcmexporter
      namespace: prometheus
      labels:
        release: kube-prometheus-stack
    spec:
      endpoints:
      - port: metrics
        interval: 30s
        path: /exporter
        scheme: https
        tlsConfig:
          insecureSkipVerify: true
      selector:
        matchLabels:
          app: external-bcmexporter
      namespaceSelector:
        matchNames:
          - prometheus
    EOF
    
  4. With these configuration files in place, apply these values to the K8s cluster:

    kubectl apply -f /tmp/bcm_prom_endpoint.yaml
    kubectl apply -f /tmp/bcm_prom_svc.yaml
    kubectl apply -f /tmp/bcm_prom_svcmon.yaml
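
    To verify that Prometheus picked up the new target, confirm the objects exist and then inspect the active targets. This is a sketch; the service name assumes the default naming used by the kube-prometheus-stack chart, so adjust it if your release differs:

    kubectl get endpoints,service,servicemonitor -n prometheus | grep external-bcmexporter
    # port-forward the Prometheus service locally and check that the new scrape target appears
    kubectl -n prometheus port-forward svc/kube-prometheus-stack-prometheus 9090:9090 &
    curl -s http://localhost:9090/api/v1/targets | grep -o external-bcmexporter | sort -u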
    

Configure Prometheus to scrape metrics from DCGM endpoints#

As with the scraping of metrics from the BCM Prometheus endpoint, we also need to configure the Prometheus instance in K8s to scrape the dcgm-exporter service from each GB200 compute tray.

Prepare the configuration files#

Enable the XID errors count metric on the software image used for DGX nodes:

echo "DCGM_EXP_XID_ERRORS_COUNT,         gauge,   Count of XID Errors within user-specified time window (see xid-count-window-size param)." >> /cm/images/<image name>/etc/dcgm-exporter/default-counters.csv

Enable the dcgm-exporter service on the software image:

cm-chroot-sw-img /cm/images/<image name> << EOF
  systemctl enable nvidia-dcgm-exporter.service
  exit
EOF
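
You can confirm the unit is enabled inside the image before updating nodes. This sketch only inspects the unit's symlinks in the image root, so it works offline (it uses the same <image name> placeholder as above):

chroot /cm/images/<image name> systemctl is-enabled nvidia-dcgm-exporter.service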

Run an image update, writing the changes to active nodes:

cmsh -c "device; foreach -c dgx-gb200 (imageupdate -w)"

Using pdsh, restart the service:

pdsh -g category=dgx-gb200 'systemctl restart nvidia-dcgm-exporter.service'
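
After the restart, each compute tray should be serving metrics on port 9400. A quick spot check (a sketch, assuming the dgx-gb200 category used above):

# each node should report a non-zero count of DCGM_* metric lines
pdsh -g category=dgx-gb200 'curl -s http://localhost:9400/metrics | grep -c "^DCGM_"'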

Create three YAML files on the head node that define the external-dcgmexporter endpoints, service, and ServiceMonitor for scraping. These files assume that your GB200 compute trays are in a category named dgx-gb200.

  1. dcgm_prom_endpoint.yaml:

    cat << EOF > /tmp/dcgm_prom_endpoint.yaml
    apiVersion: v1
    kind: Endpoints
    metadata:
      name: external-dcgmexporter
      namespace: prometheus
      labels:
        app: external-dcgmexporter
    subsets:
      - addresses:
    $(cmsh -c 'device; list -c dgx-gb200 -t physicalnode -f ip:0' | xargs -I{} echo "    - ip: {}")
        ports:
        - name: metrics
          port: 9400
    EOF
    
  2. dcgm_prom_svc.yaml:

    cat << EOF > /tmp/dcgm_prom_svc.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: external-dcgmexporter
      namespace: prometheus
      labels:
        app: external-dcgmexporter
    spec:
      clusterIP: None
      ports:
      - name: metrics
        port: 9400
        targetPort: 9400
    EOF
    
  3. dcgm_prom_svcmon.yaml:

    cat << EOF > /tmp/dcgm_prom_svcmon.yaml
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: external-dcgmexporter
      namespace: prometheus
      labels:
        release: kube-prometheus-stack
    spec:
      endpoints:
      - port: metrics
        interval: 30s
        path: /metrics
        scheme: http
      selector:
        matchLabels:
          app: external-dcgmexporter
      namespaceSelector:
        matchNames:
          - prometheus
    EOF
    
  4. With these configuration files in place, apply these values to the K8s cluster:

    kubectl apply -f /tmp/dcgm_prom_endpoint.yaml
    kubectl apply -f /tmp/dcgm_prom_svc.yaml
    kubectl apply -f /tmp/dcgm_prom_svcmon.yaml
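
    As with the BCM exporter, you can verify that the objects exist and that the DCGM targets appear in Prometheus (the same port-forward check from the previous section applies):

    kubectl get endpoints,service,servicemonitor -n prometheus | grep external-dcgmexporter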
    

Grafana and querying BCM metrics#

Grafana should be available on your head node, listening on HTTPS at the /grafana sub-path.

  1. Using a web browser, navigate to https://<headnode>/grafana.

    By default, the administrative account is admin with password prom-operator.

    Grafana login page
  2. Once authenticated, click on Explore on the left.

    Grafana Dashboard

    Once in the Explore interface, set the query editor to Code and query the metric memorytotal (see the example query after this list).

  3. BCM provides many metrics out of the box. A list can be found in Grafana by using the Metric browser and selecting all metrics that include the label job with the value external-bcmexporter.

    Grafana Metric browser
  4. Available metrics from BCM can also be found by inspecting the output of the BCM Prometheus exporter endpoint.

    The following curl command lists the unique metric names exposed by the endpoint:

    curl -sk https://localhost:8081/exporter | grep -Ev '# HELP|# TYPE' | cut -d '{' -f1 | sort -u
    
    alertlevel
    ...
    writetime_total
    

Configure Promtail to scrape logs from head nodes#

Note

Promtail configuration is accomplished as part of the wizard install of Autonomous Job Recovery.

Install NMC Grafana dashboards#

As part of NVIDIA Mission Control, a set of Grafana dashboards is provided that consolidates monitoring data across the cluster. These dashboards help track ongoing operations and support troubleshooting. They serve as a starting point for visualizing infrastructure telemetry, and you are encouraged to modify or extend them based on the facility's capabilities.

Prerequisites#

In addition to the steps above, the Infinity data source will need to be installed and configured to process data from the BCM REST API.

  1. Create a user to access the BCM REST API. The example below creates user apiuser with password apiuserpassword and a read-only BCM profile:

    cmsh -c "user; add apiuser; set password apiuserpassword; set profile readonly; commit"
    
  2. Create a new file with the following values:

    cat << EOF > grafana_infinity.yaml
    grafana:
      enabled: true
      plugins:
        - yesoreyeram-infinity-datasource
      additionalDataSources:
        - name: Infinity
          type: yesoreyeram-infinity-datasource
          access: proxy
          isDefault: false
          basicAuth: true
          basicAuthUser: apiuser
          jsonData:
            auth_method: 'basicAuth'
            allowedHosts:
              - https://master
            tlsSkipVerify: true
            timeoutInSeconds: 60
          secureJsonData:
            basicAuthPassword: apiuserpassword
    EOF
    
  3. Run the following command to apply the values:

    helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack -n prometheus -f grafana_infinity.yaml --reuse-values
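
    Once the upgrade finishes, you can confirm that the Grafana pod restarted with the plugin installed. This is a sketch; the deployment name assumes the default kube-prometheus-stack release name and the standard Grafana plugin directory:

    kubectl -n prometheus get pods -l app.kubernetes.io/name=grafana
    kubectl -n prometheus exec deploy/kube-prometheus-stack-grafana -c grafana -- ls /var/lib/grafana/plugins

    The Infinity data source should also appear in the Grafana data sources list.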
    

Note

Some dashboards currently assume that DGX compute tray hostnames follow a convention where the hostname begins with the rack location (e.g., b08-dgx01). If the customer environment uses a different naming scheme, the dashboards will need to be modified accordingly.

Download#

The dashboards are packaged as a Helm chart (.tgz file). Download the package appropriate to the customer account type:

  1. Internal (NVIS / NVONLINE users)

    Helm chart: https://apps.nvidia.com/pid/contentlibraries/detail?id=1137886

  2. OEM Partners (via NVIDIA Partners Portal): content ID 1137886

Note

OEMs must have an NVOnline account with Site Access. Contact the NVIDIA account team to set up an account if needed.

Installation#

  1. Upload the .tgz file to the cluster head node.

  2. Update values to enable folders within Grafana:

    helm upgrade -n prometheus kube-prometheus-stack prometheus-community/kube-prometheus-stack \
      --reuse-values \
      --set grafana.sidecar.dashboards.enabled=true \
      --set grafana.sidecar.dashboards.provider.foldersFromFilesStructure=true
    
  3. Run the following command (update the filename if needed):

    helm upgrade --install -n prometheus nmc-dashboards nmc-grafana-dashboards-25.07.01.tgz
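
    To confirm the dashboards were loaded, check that the release is present and that dashboard ConfigMaps were created for the Grafana sidecar to pick up (a sketch; ConfigMap names vary with the chart version):

    helm list -n prometheus
    kubectl get configmaps -n prometheus | grep -i dashboard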
    

Accessing dashboards#

  1. In a browser, open: https://<headnode>/grafana

  2. Login using default credentials:

    1. Username: admin

    2. Password: prom-operator

  3. Navigate to Dashboards in the Grafana UI.

    A folder named !! NMC Dashboards, which contains dashboards covering various operational aspects of the cluster, will be present.

Customization and BMS integration#

The dashboards included in this Helm chart are provided as examples. They are fully functional out of the box but are intended to be customized to match the customer’s specific cluster environment.

One dashboard, BMS View of Cluster, is configured to use NVIDIA Cronus as the Building Management System (BMS) data source. It displays power, liquid cooling, environmental, and other facility-level metrics.

Most customers will need to:

  1. Integrate their own BMS with NVIDIA BCM.

  2. Update the dashboard’s data source and queries to reflect their own BMS setup.

These dashboards serve as a starting point for visualizing infrastructure telemetry. The customer is encouraged to modify or extend them based on the facility's capabilities and the telemetry available. Some dashboards show an example of filtering by rack; these filters may need to be customized to fit your situation.