Observability Stack Configuration#
The BCM cm-kubernetes-setup wizard deploys all the foundational components for observability and, by default, enables observability for the K8s environment itself. However, additional configuration is needed to enable time-series and logging data sources from BCM.
Configure storage and retention for Prometheus#
Create a new YAML file on the head node for this setup:
cat << EOF > prom_config.yaml
prometheus:
  prometheusSpec:
    retention: 90d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 300Gi
grafana:
  persistence:
    enabled: true
    storageClassName: local-path
    type: pvc
  enabled: true
EOF
With the configuration file in place, apply these values to the K8s cluster:
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack -n prometheus -f prom_config.yaml --reuse-values
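A quick way to confirm that the new retention and storage settings took effect is to check the Prometheus custom resource and its persistent volume claim. This is a minimal sketch, assuming the prometheus namespace and the default resource names created by the chart:
kubectl -n prometheus get pvc
kubectl -n prometheus get prometheus -o jsonpath='{.items[0].spec.retention}'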
Configure storage and retention for Loki#
This requires redeploying the Loki Helm chart, because a new storage size must be set and the existing volume cannot be resized live.
Create a new YAML file on the head node for this setup:
cat << EOF > loki_config.yaml
backend:
  replicas: 3
chunksCache:
  writebackSizeLimit: 500MB
commonConfig:
  replication_factor: 3
compactor:
  replicas: 0
deploymentMode: SimpleScalable
distributor:
  maxUnavailable: 0
  replicas: 0
indexGateway:
  maxUnavailable: 0
  replicas: 0
ingester:
  replicas: 0
loki:
  limits_config:
    ingestion_rate_mb: 75
    ingestion_burst_size_mb: 150
    per_stream_rate_limit: 16MB
    per_stream_rate_limit_burst: 64MB
    retention_period: 90d
  auth_enabled: false
  commonConfig:
    replication_factor: 3
  schemaConfig:
    configs:
      - from: '2025-09-04'
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v13
        store: tsdb
  storage:
    type: s3
minio:
  enabled: true
  persistence:
    size: 500Gi
querier:
  maxUnavailable: 0
  replicas: 0
queryFrontend:
  maxUnavailable: 0
  replicas: 0
queryScheduler:
  replicas: 0
read:
  replicas: 3
singleBinary:
  replicas: 0
write:
  replicas: 3
EOF
Uninstall the existing Helm chart:
helm uninstall loki -n loki
Delete any lingering PVCs:
kubectl delete pvc --all -n loki
Install the Helm chart for Loki, using the new values:
helm install loki grafana/loki -n loki -f loki_config.yaml
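Once the chart is installed, it is worth verifying that the Loki components start and that the new persistent volume claims are bound. A minimal check, using the loki namespace from the commands above:
kubectl -n loki get pods
kubectl -n loki get pvc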
Enable Prometheus scrape endpoint in BCM#
On the head node (both head nodes, if configured for HA), use sed to edit the configuration in place:
sed -i 's/# EnablePrometheusExporterService = true/EnablePrometheusExporterService = true/g' /cm/local/apps/cmd/etc/cmd.conf
Disable certificate-based authentication for the scrape endpoint, and enable extra information to be added to the metrics:
cm-manipulate-advanced-config.py PrometheusExporterExtraInfo=1 PrometheusExporterInfo=1 PrometheusExporterEnum=0 PrometheusExporterRequireCertificate=0 PrometheusLabelIncludeSerialNumber=1 PrometheusLabelIncludePartNumber=1 PrometheusLabelIncludeSystemName=1 MaxMeasurablesPerProducer=1600
If you have integrated your BMS with BCM, then you will additionally need a configuration like the following:
cm-manipulate-advanced-config.py PushMonitoringDeviceStatusMetrics=CDUStatus,CDULiquidSystemPressure,CDULiquidReturnTemperature
Restart the cmd service:
systemctl restart cmd
After cmd restarts, run the following curl command to validate that metrics are being exported successfully:
curl -sk https://localhost:8081/exporter | tail -n 2
The exporter may take some time to start after cmd restarts. Example output:
writetime_total{base_type="Device",category="default",hostname="node001",type="PhysicalNode",parameter="vda"} 0
writetime_total{base_type="Device",category="default",hostname="node001",type="PhysicalNode",parameter="vdb"} 3882985
Configure Prometheus to scrape metrics from BCM endpoint#
Because Prometheus was deployed earlier through the BCM K8s wizard, configuration changes to enable this data source must be made through the Prometheus operator.
Prepare the configuration files#
K8s typically uses YAML files with custom values to make configuration changes. Create three of these YAML files on the head node for this setup.
bcm_prom_endpoint.yaml:
cat << EOF > /tmp/bcm_prom_endpoint.yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: external-bcmexporter
  namespace: prometheus
  labels:
    app: external-bcmexporter
subsets:
  - addresses:
$(cmsh -c 'device; list -f ip:0 -t headnode' | xargs -I{} echo "      - ip: {}")
    ports:
      - name: metrics
        port: 8081
EOF
bcm_prom_svc.yaml:
cat << EOF > /tmp/bcm_prom_svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: external-bcmexporter
  namespace: prometheus
  labels:
    app: external-bcmexporter
spec:
  clusterIP: None
  ports:
    - name: metrics
      port: 8081
      targetPort: 8081
EOF
bcm_prom_svcmon.yaml:
cat << EOF > /tmp/bcm_prom_svcmon.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: external-bcmexporter
  namespace: prometheus
  labels:
    release: kube-prometheus-stack
spec:
  endpoints:
    - port: metrics
      interval: 30s
      path: /exporter
      scheme: https
      tlsConfig:
        insecureSkipVerify: true
  selector:
    matchLabels:
      app: external-bcmexporter
  namespaceSelector:
    matchNames:
      - prometheus
EOF
With these configuration files in place, apply these values to the K8s cluster:
kubectl apply -f /tmp/bcm_prom_endpoint.yaml
kubectl apply -f /tmp/bcm_prom_svc.yaml
kubectl apply -f /tmp/bcm_prom_svcmon.yaml
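Optionally, confirm that Prometheus has picked up the new scrape target. The sketch below assumes the chart created a Prometheus service named kube-prometheus-stack-prometheus (verify the name with kubectl -n prometheus get svc) and backgrounds the port-forward; run it in a separate terminal if preferred:
kubectl -n prometheus port-forward svc/kube-prometheus-stack-prometheus 9090:9090 &
curl -s 'http://localhost:9090/api/v1/targets?state=active' | grep -o external-bcmexporter | sort -u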
Configure Prometheus to scrape metrics from DCGM endpoints#
As with the scraping of metrics from the BCM Prometheus endpoint, we also need to configure the Prometheus instance in K8s to scrape the dcgm-exporter service from each GB200 compute tray.
Prepare the configuration files#
Enable the XID errors count metric on the software image used for DGX nodes:
echo "DCGM_EXP_XID_ERRORS_COUNT, gauge, Count of XID Errors within user-specified time window (see xid-count-window-size param)." >> /cm/images/<image name>/etc/dcgm-exporter/default-counters.csv
Enable the dcgm-exporter service on the software image:
cm-chroot-sw-img /cm/images/<image name> << EOF
systemctl enable nvidia-dcgm-exporter.service
exit
EOF
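Optionally, confirm that the service is now enabled in the image (the command should print enabled):
cm-chroot-sw-img /cm/images/<image name> << EOF
systemctl is-enabled nvidia-dcgm-exporter.service
exit
EOF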
Run an image update, writing the changes to active nodes:
cmsh -c "device; foreach -c dgx-gb200 (imageupdate -w)"
Using pdsh, restart the service:
pdsh -g category=dgx-gb200 'systemctl restart nvidia-dcgm-exporter.service'
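To confirm that the exporter is running on every tray after the restart, a simple check is:
pdsh -g category=dgx-gb200 'systemctl is-active nvidia-dcgm-exporter.service' | dshbak -c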
Create three YAML files on the head node that define the external-dcgmexporter endpoints, service, and ServiceMonitor used for scraping. These files assume that your GB200 compute trays are in a category named dgx-gb200.
dcgm_prom_endpoint.yaml:
cat << EOF > /tmp/dcgm_prom_endpoint.yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: external-dcgmexporter
  namespace: prometheus
  labels:
    app: external-dcgmexporter
subsets:
  - addresses:
$(cmsh -c 'device; list -c dgx-gb200 -t physicalnode -f ip:0' | xargs -I{} echo "      - ip: {}")
    ports:
      - name: metrics
        port: 9400
EOF
dcgm_prom_svc.yaml:
cat << EOF > /tmp/dcgm_prom_svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: external-dcgmexporter
  namespace: prometheus
  labels:
    app: external-dcgmexporter
spec:
  clusterIP: None
  ports:
    - name: metrics
      port: 9400
      targetPort: 9400
EOF
dcgm_prom_svcmon.yaml:
cat << EOF > /tmp/dcgm_prom_svcmon.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: external-dcgmexporter
  namespace: prometheus
  labels:
    release: kube-prometheus-stack
spec:
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
  selector:
    matchLabels:
      app: external-dcgmexporter
  namespaceSelector:
    matchNames:
      - prometheus
EOF
With these configuration files in place, we can now apply these values to the K8s cluster:
kubectl apply -f /tmp/dcgm_prom_endpoint.yaml
kubectl apply -f /tmp/dcgm_prom_svc.yaml
kubectl apply -f /tmp/dcgm_prom_svcmon.yaml
We can validate that the DCGM exporter is working by issuing a curl request to the endpoint listening on 9400/tcp:
pdsh -g category=dgx-gb200 'curl -s http://localhost:9400/metrics | grep DCGM | tail -n 1' | dshbak -c
----------------
b06-p1-dgx-06-c06
----------------
DCGM_EXP_XID_ERRORS_COUNT{gpu="3",UUID="GPU-843eece3-8d4c-82e2-0f6d-eddd48380b14",pci_bus_id="00000019:01:00.0",device="nvidia3",modelName="NVIDIA GB200",Hostname="b06-p1-dgx-06-c06",DCGM_FI_DRIVER_VERSION="570.172.08",window_size_in_ms="300000"} 0
----------------
b06-p1-dgx-06-c07
----------------
DCGM_EXP_XID_ERRORS_COUNT{gpu="3",UUID="GPU-2635e627-944b-5810-e599-78747aef0107",pci_bus_id="00000019:01:00.0",device="nvidia3",modelName="NVIDIA GB200",Hostname="b06-p1-dgx-06-c07",DCGM_FI_DRIVER_VERSION="570.172.08",window_size_in_ms="300000"} 0
----------------
b06-p1-dgx-06-c08
----------------
DCGM_EXP_XID_ERRORS_COUNT{gpu="3",UUID="GPU-059a99b0-8f4a-43ee-9e22-e843be6fbf78",pci_bus_id="00000019:01:00.0",device="nvidia3",modelName="NVIDIA GB200",Hostname="b06-p1-dgx-06-c08",DCGM_FI_DRIVER_VERSION="570.172.08",window_size_in_ms="300000"} 0
----------------
b06-p1-dgx-06-c09
----------------
DCGM_EXP_XID_ERRORS_COUNT{gpu="3",UUID="GPU-26d2e46e-ed78-f470-c5a1-09d9f76bb400",pci_bus_id="00000019:01:00.0",device="nvidia3",modelName="NVIDIA GB200",Hostname="b06-p1-dgx-06-c09",DCGM_FI_DRIVER_VERSION="570.172.08",window_size_in_ms="300000"} 0
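To confirm that Prometheus is also ingesting these metrics, an instant query against the Prometheus HTTP API should return series labelled with the external-dcgmexporter job. This sketch reuses the port-forward from the BCM endpoint validation above (re-create it if it is no longer running):
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_EXP_XID_ERRORS_COUNT' | grep -c external-dcgmexporter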
Grafana and querying BCM metrics#
Grafana should be available on your head node, listening on HTTPS at the /grafana path.
Using a web browser, navigate to https://<headnode>/grafana. By default, the administrative account is admin with the password prom-operator.
Once authenticated, click on Explore on the left.
Once in the Explore interface, set the query editor to code and query the metric memorytotal.
BCM provides many metrics out of the box. A list of them can be found in Grafana by using the Metric browser and selecting all metrics that include the label job with the value external-bcmexporter.
Available metrics from BCM can also be found by inspecting the output of the BCM Prometheus exporter endpoint.
The following curl command inspects the output of the BCM Prometheus exporter endpoint:
curl -sk https://localhost:8081/exporter | grep -Ev '# HELP|# TYPE' | cut -d '{' -f1 | sort -u
alertlevel
...
writetime_total
Configure Promtail to scrape logs from head nodes#
Note
Promtail configuration is accomplished as part of the wizard install of Autonomous Job Recovery.
Install NMC Grafana dashboards#
As part of NVIDIA Mission Control, a set of Grafana dashboards is provided that consolidates monitoring data across the cluster. These dashboards help track ongoing operations and support troubleshooting. They serve as a starting point for visualizing infrastructure telemetry and can be modified or extended based on the facility's capabilities.
Prerequisites#
In addition to the steps above, the Infinity data source will need to be installed and configured to process data from the BCM REST API and UFM REST API.
Create a user apiuser with password apiuserpassword and a read-only BCM profile to access the BCM REST API:
cmsh -c "user; add apiuser; set password apiuserpassword; set profile readonly; commit"Create a new file with the following values. You will need credentials for accessing the UFM REST API:
cat << EOF > grafana_infinity.yaml
grafana:
  enabled: true
  plugins:
    - yesoreyeram-infinity-datasource
  additionalDataSources:
    - name: BCM API
      type: yesoreyeram-infinity-datasource
      access: proxy
      isDefault: false
      basicAuth: true
      basicAuthUser: apiuser
      jsonData:
        auth_method: 'basicAuth'
        allowedHosts:
          - https://master
        tlsSkipVerify: true
        timeoutInSeconds: 60
      secureJsonData:
        basicAuthPassword: apiuserpassword
    - name: UFM API
      type: yesoreyeram-infinity-datasource
      access: proxy
      isDefault: false
      basicAuth: true
      basicAuthUser: UFM_USER
      jsonData:
        auth_method: 'basicAuth'
        allowedHosts:
          - https://master
        tlsSkipVerify: true
        timeoutInSeconds: 60
      secureJsonData:
        basicAuthPassword: UFM_PASSWORD
EOF
Run the following command to apply the values:
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack -n prometheus -f grafana_infinity.yaml --reuse-values
To access the UFM REST API, we also need to install a port-forwarding rule within BCM. To configure this, gather the IP address of the UFM appliance and replace it in the following command:
cmsh -c "device; portforward create UFM_IP_ADDRESS 443 11443"This should be repeated on your secondary BCM head node, if you have one configured.
Note
The port-forwarding rules are not persisted on restarts of the BCM daemon process. If your head node reboots or the daemon restarts for any reason, you will need to rerun this port-forwarding sequence.
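To check whether the forward is currently active on a head node, a minimal sketch assuming the local port 11443 configured above; if the rule is in place, the head node should be listening on that port and a request to it should reach the UFM appliance:
ss -tln | grep 11443
curl -skI https://localhost:11443/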
Download#
The dashboards are packaged as a Helm chart (.tgz file). Download the package appropriate to the customer account type:
Internal (NVIS / NVONLINE users)
Helm chart: https://apps.nvidia.com/PID/ContentLibraries/Detail?id=1142822
OEM Partners (via NVIDIA Partners Portal): ID 1142822
Note
OEMs must have an NVOnline account with Site Access. Contact the NVIDIA account team to set up an account if needed.
Installation#
Upload the .tgz file to the cluster head node.
Update values to enable folders within Grafana:
helm upgrade -n prometheus kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --reuse-values \
  --set grafana.sidecar.dashboards.enabled=true \
  --set grafana.sidecar.dashboards.provider.foldersFromFilesStructure=true
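To verify that the sidecar values were applied to the release, the user-supplied values can be inspected:
helm get values -n prometheus kube-prometheus-stack | grep -A 5 sidecar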
For clusters with B200 systems, run the following command (update the filename if needed):
helm upgrade --install -n prometheus nmc-dashboards nmc-grafana-dashboards-27.0.1.tgz --set gpu_type=b200
For clusters with GB200 systems, run the following command (update the filename if needed):
helm upgrade --install -n prometheus nmc-dashboards nmc-grafana-dashboards-27.0.1.tgz --set gpu_type=gb200
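Verify that the dashboards chart is installed alongside the monitoring stack (chart and app versions will vary):
helm list -n prometheus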
Accessing dashboards#
In a browser, open:
https://<headnode>/grafana
Log in using the default credentials:
Username: admin
Password: prom-operator
Navigate to Dashboards in the Grafana UI.
A folder named !! NMC Dashboards, which contains dashboards covering various operational aspects of the cluster, will be present.
Customization and BMS integration#
The dashboards included in this Helm chart are provided as examples. They are fully functional out of the box but are intended to be customized to match the customer’s specific cluster environment.
One dashboard, BMS View of Cluster, is configured to use NVIDIA Cronus as the Building Management System (BMS) data source. It displays power, liquid cooling, environmental, and other facility-level metrics.
Most customers will need to:
Integrate their own BMS with NVIDIA BCM.
Update the dashboard’s data source and queries to reflect their own BMS setup.
These dashboards serve as a starting point for visualizing infrastructure telemetry. The customer is encouraged to modify or extend them based on the facility’s capabilities and telemetry available. In some dashboards we show what filtering by rack can look like, but you might need to customize the filters to fit your situation.