Base Command Manager Integration with Building Management System#
Introduction#
The NVIDIA DGX NVL72 system defines three levels of liquid leak detection:
Node/tray-level liquid detection, with (1) a cold plate leak sensor and (2) an inner manifold leak sensor.
Rack-level liquid detection, with leak sensing ropes and leak spot sensors located along the piping and in the DGX GB200 compute racks.
Datacenter-level liquid detection, with leak spot sensors and sensing ropes located in the cooling distribution units (CDUs) and along the datacenter piping.
For DGX GB200, tray-level liquid detection is handled by the system BMC on the compute tray or switch tray. Rack-level liquid detection is handled by a customer-provided building management system (BMS) operating on the operational technology (OT) side of the datacenter.
NVIDIA Base Command Manager (BCM) 11 provides native support for managing leak events over the Redfish interface of the DGX GB200 BMC. To reach common intelligence and allow centralized leak detection, power control, and leak event reaction, we recommend integrating the customer-provided BMS with BCM.
Integration of BMS with BCM#
To integrate the customer-provided BMS with BCM, we recommend that all DGX GB200 customers align with the specification provided in the following sections of this document.
Leak Detection Process#

In a DGX GB200 based system, BCM expects an MQTT-based BMS that follows the data catalog published by NVIDIA. MQTT is a publish-subscribe communication protocol for IoT devices that provides fast message broadcasting and low end-to-end latency.
BCM expects TCP/IP connectivity to and from the MQTT server that the BMS provides. Note that the MQTT server itself is not part of BCM and must be provided by the customer or their BMS system integrator.
Moreover, NVIDIA recommends that the MQTT server is protected by a firewall and has TLS or SSL enabled. This way, mixing of OT-side and IT-side traffic can be avoided.
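To illustrate what a TLS-protected MQTT connection implies on the client side, here is a minimal sketch using only the Python standard library. It is not BCM's implementation. The function name `build_mqtt_tls_context` and its parameters are hypothetical, and the choice of TLS 1.2 as the minimum version is an assumption rather than a documented requirement.

```python
import ssl


def build_mqtt_tls_context(ca_file=None, cert_file=None, key_file=None):
    """Build a client-side TLS context that verifies the MQTT server.

    Hypothetical helper for illustration; the file arguments are optional
    paths supplied by the caller.
    """
    # create_default_context() turns on certificate verification and
    # hostname checking for server authentication.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    # Assumption: refuse anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # Present a client certificate only when both files are given.
    if cert_file and key_file:
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


context = build_mqtt_tls_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

A context like this could be handed to whichever MQTT client library is in use. Verifying the server certificate and hostname is what keeps a client on the IT side from talking to an unintended broker.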
Setting up the BMS in BCM#
These are all the settings involved in setting BCM up as an MQTT client for a BMS:
[a03-p1-head-01->partition[base]]% get bms
NVIDIA conforming BMS
[a03-p1-head-01->partition[base]]% configurationoverlay
[a03-p1-head-01->configurationoverlay]% use mqtt
[a03-p1-head-01->configurationoverlay[mqtt]]% roles
[a03-p1-head-01->configurationoverlay[mqtt]->roles]% show mqtt
Parameter Value
------------------------ -----------------------------------------------------------------------
Name mqtt
Revision
Type MQTTRole
Add services yes
Servers <1 in submode>
CA certificate path /cm/local/apps/cmd/pythoncm/lib/python3.12/site-packages/pythoncm/etc/cacert.pem
Private key path /cm/local/apps/cmd/cm-mqtt/etc/mqtt.key
Certificate path /cm/local/apps/cmd/cm-mqtt/etc/mqtt.pem
Write named pipe path /var/spool/cmd/mqtt.pipe
[a03-p1-head-01->configurationoverlay[mqtt]->roles]% servers mqtt
[a03-p1-head-01->configurationoverlay[mqtt]->roles[mqtt]->servers]% list
Server (key) Port Disabled
-------------- ------ ----------
7.241.8.177 1883 no
[a03-p1-head-01->configurationoverlay[mqtt]->roles[mqtt]->servers]% show 7.241.8.177
Parameter Value
------------------------ ------------------------------------------------
Revision
Server 7.241.8.177
Port 1883
Topic BCM/#
Disabled no
Username Bcm
Password *******
Transport tcp
Protocol v3.1.1
Certificate required yes
Check hostname yes
CA certificate
Certificate
Private key
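The `Topic` value `BCM/#` above is an MQTT topic filter that uses the multi-level wildcard `#`. As a rough illustration of which topics such a filter selects, the following sketch implements the standard MQTT wildcard matching rules in plain Python. The function name `topic_matches` and the example topic names are hypothetical; the real topic names are defined by the data catalog.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic name matches a subscription filter.

    Simplified sketch: it does not handle the special rules for topics
    that start with '$'.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for index, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: only valid as the last level; it matches
            # the parent level and everything below it.
            return index == len(filter_levels) - 1
        if index >= len(topic_levels):
            # The filter has more levels than the topic.
            return False
        if level != "+" and level != topic_levels[index]:
            # '+' matches exactly one level; anything else must be equal.
            return False
    # Without '#', filter and topic must have the same number of levels.
    return len(filter_levels) == len(topic_levels)


# Hypothetical topic names, for illustration only.
print(topic_matches("BCM/#", "BCM/rack1/leak"))
print(topic_matches("BCM/#", "OTHER/rack1/leak"))
```

With a filter of `BCM/#`, every topic under the `BCM` root is delivered to the subscriber, so the BMS can add new topics under that root without any change to the subscription.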
We recommend also defining in BCM all the racks that the BMS knows about, even if they do not yet contain any nodes.
Also define all the power circuits that the BMS reports data for. These are linked to the power circuit data that comes from MQTT.
[a03-p1-head-01->powercircuit]% list
Name (key) Building Location
----------- -------- --------
RPP-B12-3
RPP-B14-3
RPP-B21-5
Defining the CDUs as devices allows them to be shown as UP or DOWN, determined in one of two ways:
Via IP ping, if an IP address is set
Via the timestamp of the latest data point reported over MQTT
[a03-p1-head-01->device]% list -t coolingdistributionunit
Type Hostname (key) IP Status
--------------------- -------------- -------- ----------------
CoolingDistributionUnit CDU01 0.0.0.0 [ UP ]
CoolingDistributionUnit CDU02 0.0.0.0 [ UP ]
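The timestamp-based check can be pictured as a simple freshness test: a device counts as UP while its most recent data point is recent enough. The sketch below shows that idea only. The five-minute threshold, the function name `status_from_last_data_point`, and the rule that a device with no data is DOWN are all assumptions for illustration, not documented BCM behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: a hypothetical freshness threshold, not a documented BCM value.
MAX_AGE = timedelta(minutes=5)


def status_from_last_data_point(
    last_seen: Optional[datetime],
    now: datetime,
    max_age: timedelta = MAX_AGE,
) -> str:
    """Return "UP" if the latest data point is recent enough, else "DOWN"."""
    if last_seen is None:
        # Assumption: no data received yet is treated as DOWN.
        return "DOWN"
    return "UP" if now - last_seen <= max_age else "DOWN"


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(status_from_last_data_point(now - timedelta(minutes=1), now))
print(status_from_last_data_point(now - timedelta(minutes=30), now))
```

A check of this kind needs no network path to the CDU itself, which fits devices listed with IP `0.0.0.0`.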
In some instances, you might want to make additional BMS metrics available to Prometheus as part of the observability stack. For example, you might configure it this way:
cm-manipulate-advanced-config.py PushMonitoringDeviceStatusMetrics=CDUStatus,CDULiquidSystemPressure,CDULiquidReturnTemperature
After that, restart the BCM cmdaemon:
systemctl restart cmd
Data Catalog#
The latest version of the data catalog is available via PID (https://apps.nvidia.com/PID/ContentLibraries/Detail/1132978).
Fault Type and Handling Recommendations#
Fault Type | BMC | BCM | BMS |
---|---|---|---|
Tray level leak detection | | | NA |
Rack level leak detection (Detected by BCM) | Same as above | | |
Rack level leak detection (Detected by BMS) | Same as above | NA | |
Row level leak detection | Same as above | NA | |
Sensor Fault (includes false alarms due to sensor misreadings) | NA | Call for onsite inspection (e.g., power drain procedure) | Call for onsite inspection (e.g., power drain procedure) |