Observability Stack Configuration#
The BCM cm-kubernetes-setup wizard deploys all of the foundational observability components and, by default, enables observability within the K8s space. However, additional configuration is needed to enable the time-series and logging data sources from BCM.
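Before making changes, it can be useful to confirm what the wizard deployed. The commands below are a minimal sanity check; the prometheus and loki namespace names are assumptions based on the commands used later in this section.
# List the observability pods deployed by the wizard (namespace names assumed)
kubectl get pods -n prometheus
kubectl get pods -n loki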
Configure storage and retention for Prometheus#
Create a new YAML file on the head node for this setup:
cat << EOF > prom_config.yaml
prometheus:
  prometheusSpec:
    retention: 90d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 300Gi
  persistence:
    enabled: true
    storageClassName: local-path
    type: pvc
  enabled: true
EOF
With the configuration file in place, apply these values to the K8s cluster:
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack -n prometheus -f prom_config.yaml --reuse-values
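Optionally, confirm that the new retention and storage settings were picked up. This is a hedged check, assuming the Prometheus custom resource and its PVC live in the prometheus namespace used above:
# Inspect the rendered Prometheus custom resource and the resulting PVC
kubectl -n prometheus get prometheus -o yaml | grep -E 'retention:|storage:'
kubectl -n prometheus get pvc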
Configure storage and retention for Loki#
This requires a redeployment of the Loki Helm chart, because we need to set a new storage value and cannot resize it live.
Create a new YAML file on the head node for this setup:
cat << EOF > loki_config.yaml
backend:
  replicas: 3
chunksCache:
  writebackSizeLimit: 500MB
commonConfig:
  replication_factor: 3
compactor:
  replicas: 0
deploymentMode: SimpleScalable
distributor:
  maxUnavailable: 0
  replicas: 0
indexGateway:
  maxUnavailable: 0
  replicas: 0
ingester:
  replicas: 0
loki:
  limits_config:
    ingestion_rate_mb: 75
    ingestion_burst_size_mb: 150
    per_stream_rate_limit: 16MB
    per_stream_rate_limit_burst: 64MB
    retention_period: 90d
  auth_enabled: false
  commonConfig:
    replication_factor: 3
  schemaConfig:
    configs:
      - from: '2025-09-04'
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v13
        store: tsdb
  storage:
    type: s3
minio:
  enabled: true
  persistence:
    size: 500Gi
querier:
  maxUnavailable: 0
  replicas: 0
queryFrontend:
  maxUnavailable: 0
  replicas: 0
queryScheduler:
  replicas: 0
read:
  replicas: 3
singleBinary:
  replicas: 0
write:
  replicas: 3
EOF
Uninstall the existing Helm chart:
helm uninstall loki -n loki
Delete any lingering PVCs:
kubectl delete pvc --all -n loki
Install the Helm chart for Loki, using the new values:
helm install loki grafana/loki -n loki -f loki_config.yaml
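Optionally, verify that the Loki pods start and that the PVCs were created with the new size. A minimal check, assuming the loki namespace used above:
kubectl -n loki get pods
kubectl -n loki get pvc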
Enable Prometheus scrape endpoint in BCM#
On the head node (both head nodes, if configured for HA), use sed to edit the configuration in place:
sed -i 's/# EnablePrometheusExporterService = true/EnablePrometheusExporterService = true/g' /cm/local/apps/cmd/etc/cmd.conf
Disable certificate-based authentication for the scrape endpoint, and enable extra information to be added to the metrics:
cm-manipulate-advanced-config.py PrometheusExporterExtraInfo=1 PrometheusExporterInfo=1 PrometheusExporterEnum=0 PrometheusExporterRequireCertificate=0 PrometheusLabelIncludeSerialNumber=1 PrometheusLabelIncludePartNumber=1 PrometheusLabelIncludeSystemName=1 MaxMeasurablesPerProducer=1600
If you have integrated your Building Management System (BMS) with BCM, you will additionally need a configuration such as:
cm-manipulate-advanced-config.py PushMonitoringDeviceStatusMetrics=CDUStatus,CDULiquidSystemPressure,CDULiquidReturnTemperature
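To confirm that the settings above were recorded, you can inspect cmd.conf. The exact formatting of the entry may vary between BCM versions, so treat this as a hedged spot check rather than a required step:
# Advanced configuration settings are typically written to the AdvancedConfig entry in cmd.conf
grep -A 2 'AdvancedConfig' /cm/local/apps/cmd/etc/cmd.conf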
Restart the cmd service:
systemctl restart cmd
After cmd restarts, run the following to validate that metrics are being successfully exported, using curl:
curl -sk https://localhost:8081/exporter | tail -n 2
The exporter can take some time to start after cmd is restarted. Example output:
writetime_total{base_type="Device",category="default",hostname="node001",type="PhysicalNode",parameter="vda"} 0
writetime_total{base_type="Device",category="default",hostname="node001",type="PhysicalNode",parameter="vdb"} 3882985
Configure Prometheus to scrape metrics from BCM endpoint#
Because Prometheus was deployed earlier through the BCM K8s wizard, configuration changes to enable this data source are made through the Prometheus operator.
Prepare the configuration files#
K8s typically uses YAML files with custom values to make configuration changes. Create three of these YAML files on the head node for this setup.
bcm_prom_endpoint.yaml:
cat << EOF > /tmp/bcm_prom_endpoint.yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: external-bcmexporter
  namespace: prometheus
  labels:
    app: external-bcmexporter
subsets:
  - addresses:
$(cmsh -c 'device; list -f ip:0 -t headnode' | xargs -I{} echo "      - ip: {}")
    ports:
      - name: metrics
        port: 8081
EOF
bcm_prom_svc.yaml:
cat << EOF > /tmp/bcm_prom_svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: external-bcmexporter
  namespace: prometheus
  labels:
    app: external-bcmexporter
spec:
  clusterIP: None
  ports:
    - name: metrics
      port: 8081
      targetPort: 8081
EOF
bcm_prom_svcmon.yaml:
cat << EOF > /tmp/bcm_prom_svcmon.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: external-bcmexporter
  namespace: prometheus
  labels:
    release: kube-prometheus-stack
spec:
  endpoints:
    - port: metrics
      interval: 30s
      path: /exporter
      scheme: https
      tlsConfig:
        insecureSkipVerify: true
  selector:
    matchLabels:
      app: external-bcmexporter
  namespaceSelector:
    matchNames:
      - prometheus
EOF
With these configuration files in place, apply these values to the K8s cluster:
kubectl apply -f /tmp/bcm_prom_endpoint.yaml
kubectl apply -f /tmp/bcm_prom_svc.yaml
kubectl apply -f /tmp/bcm_prom_svcmon.yaml
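Optionally, verify that the objects were created and that the Endpoints object lists the head node IP(s). A minimal check, using the names and namespace defined above:
kubectl -n prometheus get endpoints external-bcmexporter -o wide
kubectl -n prometheus get servicemonitor external-bcmexporter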
Configure Prometheus to scrape metrics from DCGM endpoints#
As with the scraping of metrics from the BCM Prometheus endpoint, we also need to configure the Prometheus instance in K8s to scrape the dcgm-exporter service from each GB200 compute tray.
Prepare the configuration files#
Enable the XID errors count metric on the software image used for DGX nodes:
echo "DCGM_EXP_XID_ERRORS_COUNT, gauge, Count of XID Errors within user-specified time window (see xid-count-window-size param)." >> /cm/images/<image name>/etc/dcgm-exporter/default-counters.csv
Enable the dcgm-exporter service on the software image:
cm-chroot-sw-img /cm/images/<image name> << EOF
systemctl enable nvidia-dcgm-exporter.service
exit
EOF
Run an image update, writing the changes to active nodes:
cmsh -c "device; foreach -c dgx-gb200 (imageupdate -w)"
Using PDSH, restart the service:
pdsh -g category=dgx-gb200 'systemctl restart nvidia-dcgm-exporter.service'
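Optionally, verify that dcgm-exporter is serving metrics on each compute tray. This is a hedged spot check that assumes the exporter's default listen port of 9400:
# Count the DCGM metric lines served locally on each GB200 compute tray
pdsh -g category=dgx-gb200 'curl -s http://localhost:9400/metrics | grep -c "^DCGM_"'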
Create three YAML files on the head node for this setup to define the external-dcgmexporter service monitor scraping. These files assume that your GB200 compute trays are in a category named dgx-gb200.
dcgm_prom_endpoint.yaml:
cat << EOF > /tmp/dcgm_prom_endpoint.yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: external-dcgmexporter
  namespace: prometheus
  labels:
    app: external-dcgmexporter
subsets:
  - addresses:
$(cmsh -c 'device; list -c dgx-gb200 -t physicalnode -f ip:0' | xargs -I{} echo "      - ip: {}")
    ports:
      - name: metrics
        port: 9400
EOF
dcgm_prom_svc.yaml:
cat << EOF > /tmp/dcgm_prom_svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: external-dcgmexporter
  namespace: prometheus
  labels:
    app: external-dcgmexporter
spec:
  clusterIP: None
  ports:
    - name: metrics
      port: 9400
      targetPort: 9400
EOF
dcgm_prom_svcmon.yaml:
cat << EOF > /tmp/dcgm_prom_svcmon.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: external-dcgmexporter
  namespace: prometheus
  labels:
    release: kube-prometheus-stack
spec:
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
  selector:
    matchLabels:
      app: external-dcgmexporter
  namespaceSelector:
    matchNames:
      - prometheus
EOF
With these configuration files in place, we can now apply these values to the K8s cluster:
kubectl apply -f /tmp/dcgm_prom_endpoint.yaml
kubectl apply -f /tmp/dcgm_prom_svc.yaml
kubectl apply -f /tmp/dcgm_prom_svcmon.yaml
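Once Prometheus reloads its configuration, the new targets should appear as active. One hedged way to check this from the head node is to port-forward the operator-managed Prometheus service and query its targets API; the prometheus-operated service name is an assumption based on the default behavior of the Prometheus operator:
# Temporarily port-forward Prometheus and look for the new scrape targets
kubectl -n prometheus port-forward svc/prometheus-operated 9090:9090 &
PF_PID=$!
sleep 2
curl -s 'http://localhost:9090/api/v1/targets?state=active' | grep -o 'external-dcgmexporter' | sort -u
kill $PF_PID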
Grafana and querying BCM metrics#
Grafana should be available on your head node, listening on HTTPS at the /grafana path.
Using a web browser, navigate to https://<headnode>/grafana. By default, the administrative account is admin with the password prom-operator.
Once authenticated, click on Explore on the left.
Once in the Explore interface, set the query editor to code and query the metric memorytotal.
BCM provides many metrics out of the box. A list of these can be found in Grafana by using the Metric browser and selecting all metrics that include the label job with the value external-bcmexporter.
Available metrics from BCM can also be found by inspecting the output of the BCM Prometheus exporter endpoint.
The following curl command inspects the output of the BCM Prometheus exporter endpoint:
curl -sk https://localhost:8081/exporter | grep -Ev '# HELP|# TYPE' | cut -d '{' -f1 | sort -u
alertlevel
...
writetime_total
Configure Promtail to scrape logs from head nodes#
Note
Promtail configuration is accomplished as part of the wizard install of Autonomous Job Recovery.
Install NMC Grafana dashboards#
As part of NVIDIA Mission Control, a set of Grafana dashboards is provided that consolidates monitoring data across the cluster. These dashboards help track ongoing operations and support troubleshooting. They serve as a starting point for visualizing infrastructure telemetry, and you are encouraged to modify or extend them based on the facility's capabilities.
Prerequisites#
In addition to the steps above, the Infinity data source will need to be installed and configured to process data from the BCM REST API.
Create a user to access the BCM REST API. The following creates the user apiuser with the password apiuserpassword and a read-only BCM profile:
cmsh -c "user; add apiuser; set password apiuserpassword; set profile readonly; commit"
Create a new file with the following values:
cat << EOF > grafana_infinity.yaml
grafana:
  enabled: true
  plugins:
    - yesoreyeram-infinity-datasource
  additionalDataSources:
    - name: Infinity
      type: yesoreyeram-infinity-datasource
      access: proxy
      isDefault: false
      basicAuth: true
      basicAuthUser: apiuser
      jsonData:
        auth_method: 'basicAuth'
        allowedHosts:
          - https://master
        tlsSkipVerify: true
        timeoutInSeconds: 60
      secureJsonData:
        basicAuthPassword: apiuserpassword
EOF
Run the following command to apply the values:
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack -n prometheus -f grafana_infinity.yaml --reuse-values
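After the upgrade, the Grafana pod is recreated so that the Infinity plugin can be installed. A hedged way to confirm it comes back up, assuming the standard chart labels:
kubectl -n prometheus get pods -l app.kubernetes.io/name=grafana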
Note
Some dashboards currently assume that DGX compute tray hostnames follow a convention where the hostname begins with the rack location (e.g., b08-dgx01). If the customer environment uses a different naming scheme, the dashboards will need to be modified accordingly.
Download#
The dashboards are packaged as a Helm chart (.tgz file). Download the package appropriate to the customer account type:
Internal (NVIS / NVONLINE users)
Helm chart: https://apps.nvidia.com/pid/contentlibraries/detail?id=1137886
OEM Partners (via NVIDIA Partners Portal): 1137886
Note
OEMs must have an NVOnline account with Site Access. Contact the NVIDIA account team to set up an account if needed.
Installation#
Upload the .tgz file to the cluster head node.
Update values to enable folders within Grafana:
helm upgrade -n prometheus kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --reuse-values \
  --set grafana.sidecar.dashboards.enabled=true \
  --set grafana.sidecar.dashboards.provider.foldersFromFilesStructure=true
Run the following command (update the filename if needed):
helm upgrade --install -n prometheus nmc-dashboards nmc-grafana-dashboards-25.07.01.tgz
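Optionally, confirm that the dashboard ConfigMaps from the chart were created and are labeled for the Grafana sidecar. The grafana_dashboard label name is an assumption based on the sidecar's default configuration:
kubectl -n prometheus get configmaps -l grafana_dashboard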
Accessing dashboards#
In a browser, open:
https://<headnode>/grafana
Log in using the default credentials:
Username: admin
Password: prom-operator
Navigate to Dashboards in the Grafana UI.
A folder named !! NMC Dashboards, which contains dashboards covering various operational aspects of the cluster, will be present.
Customization and BMS integration#
The dashboards included in this Helm chart are provided as examples. They are fully functional out of the box but are intended to be customized to match the customer’s specific cluster environment.
One dashboard, BMS View of Cluster, is configured to use NVIDIA Cronus as the Building Management System (BMS) data source. It displays power, liquid cooling, environmental, and other facility-level metrics.
Most customers will need to:
Integrate their own BMS with NVIDIA BCM.
Update the dashboard’s data source and queries to reflect their own BMS setup.
These dashboards serve as a starting point for visualizing infrastructure telemetry. The customer is encouraged to modify or extend them based on the facility's capabilities and the telemetry available. Some dashboards show what filtering by rack can look like, but you may need to customize the filters to fit your environment.