Base Command Manager Integration with Building Management System#

Introduction#

The NVIDIA DGX NVL72 system defines three levels of liquid leak detection:

Node/tray-level liquid detection, with (1) a cold plate leak sensor and (2) an inner manifold leak sensor.
Rack-level liquid detection, with leak-sensing ropes and leak spot sensors located along the piping and in the DGX GB200 compute racks.
Datacenter-level liquid detection, with leak spot sensors and sensing ropes located in the cooling distribution units (CDUs) and along the datacenter piping.

For DGX GB200, tray-level liquid detection is handled by the system BMC on each compute tray and switch tray. Rack-level liquid detection is handled by a customer-provided building management system (BMS) operating on the operational technology (OT) side of the datacenter.

NVIDIA Base Command Manager 11 natively supports handling leak events reported over the Redfish interface by the BMC of the DGX GB200. To share information between these systems and to centralize leak detection, power control, and leak event reaction, we recommend integrating the customer-provided BMS with BCM.

Integration of BMS with BCM#

To integrate the customer-provided BMS with BCM, we recommend that all DGX GB200 customers follow the specification described in the rest of this document.

Leak Detection Process#

Figure: MQTT-based communication flow between the Building Management System (BMS) and Base Command Manager (BCM), showing the data exchange and the leak detection process.

In a DGX GB200-based system, BCM expects an MQTT-based BMS that follows the data catalog published by NVIDIA. MQTT is a publish-subscribe messaging protocol for IoT devices that provides fast message broadcasting and low end-to-end latency.

NVIDIA BCM expects TCP/IP connectivity to and from the MQTT server that the BMS provides. Note that the MQTT server itself is not part of BCM and must be provided by the customer or their BMS system integrator.

Moreover, NVIDIA recommends that the MQTT server be protected by a firewall and have TLS or SSL enabled. This keeps OT-side and IT-side traffic from mixing.
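
As an illustration of the TLS recommendation, the following is a minimal sketch of how an MQTT client written in Python might prepare a TLS context that verifies the server certificate and hostname. It uses only the standard `ssl` module. The function name `build_tls_context` and its parameters are hypothetical and are not part of BCM; BCM's own MQTT client is configured through the role settings, not through this code.

```python
import ssl


def build_tls_context(ca_path=None, cert_path=None, key_path=None):
    """Return a client-side TLS context that verifies the MQTT server.

    All three paths are optional and hypothetical; pass the CA bundle and
    client certificate/key that your BMS integrator provides.
    """
    # The default client context requires a valid server certificate
    # and checks that it matches the server hostname.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    # Refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    if ca_path is not None:
        # Trust the CA that signed the MQTT server certificate.
        context.load_verify_locations(cafile=ca_path)
    if cert_path is not None:
        # Present a client certificate if the server requires one.
        context.load_cert_chain(certfile=cert_path, keyfile=key_path)
    return context
```

A context built this way can be handed to whichever MQTT client library is in use, provided that library accepts a standard `ssl.SSLContext`.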

Setting up the BMS in BCM#

These are the settings involved in setting BCM up as an MQTT client for a BMS:

[a03-p1-head-01->partition[base]]% get bms

NVIDIA conforming BMS

[a03-p1-head-01->partition[base]]% configurationoverlay

[a03-p1-head-01->configurationoverlay]% use mqtt

[a03-p1-head-01->configurationoverlay[mqtt]]% roles

[a03-p1-head-01->configurationoverlay[mqtt]->roles]% show mqtt

Parameter                Value
------------------------ -----------------------------------------------------------------------
Name                     mqtt
Revision
Type                     MQTTRole
Add services             yes
Servers                  <1 in submode>
CA certificate path      /cm/local/apps/cmd/pythoncm/lib/python3.12/site-packages/pythoncm/etc/cacert.pem
Private key path         /cm/local/apps/cmd/cm-mqtt/etc/mqtt.key
Certificate path         /cm/local/apps/cmd/cm-mqtt/etc/mqtt.pem
Write named pipe path    /var/spool/cmd/mqtt.pipe

[a03-p1-head-01->configurationoverlay[mqtt]->roles]% servers mqtt

[a03-p1-head-01->configurationoverlay[mqtt]->roles[mqtt]->servers]% list

Server (key)    Port    Disabled
--------------  ------  ----------
7.241.8.177     1883    no

[a03-p1-head-01->configurationoverlay[mqtt]->roles[mqtt]->servers]% show 7.241.8.177

Parameter                Value
------------------------ ------------------------------------------------
Revision
Server                   7.241.8.177
Port                     1883
Topic                    BCM/#
Disabled                 no
Username                 Bcm
Password                 *******
Transport                tcp
Protocol                 v3.1.1
Certificate required     yes
Check hostname           yes
CA certificate
Certificate
Private key
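
The `Topic` value `BCM/#` is an MQTT topic filter in which `#` is the multi-level wildcard, so the subscription covers `BCM` and every topic below it. The sketch below shows how such a filter is matched against topic names under the MQTT v3.1.1 rules. It is an illustration only, not BCM code, and the topic names in the usage lines are hypothetical; the real topic layout is defined by the NVIDIA data catalog.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic name matches a topic filter."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")

    # Filters that start with a wildcard never match "$"-prefixed topics.
    if topic.startswith("$") and filter_levels[0] in ("#", "+"):
        return False

    for index, level in enumerate(filter_levels):
        if level == "#":
            # "#" is only valid as the last level; it matches the parent
            # level and any number of levels below it.
            return index == len(filter_levels) - 1
        if index >= len(topic_levels):
            return False
        # "+" matches exactly one level; anything else must be equal.
        if level != "+" and level != topic_levels[index]:
            return False

    # Without a trailing "#", both must have the same number of levels.
    return len(filter_levels) == len(topic_levels)


# Hypothetical topic names, for illustration only.
print(topic_matches("BCM/#", "BCM/rack/leak"))
print(topic_matches("BCM/#", "BMS/rack/leak"))
```

With the filter shown in the server settings, any message published outside the `BCM/` hierarchy is not delivered to this subscription.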

It is recommended to also define in BCM all the racks the BMS knows about, even if they do not yet contain any nodes.

Also define all the power circuits for which the BMS reports data. These will be linked to the power circuit data that comes from MQTT.

[a03-p1-head-01->powercircuit]% list

Name (key)    Building    Location
-----------   --------    --------
RPP-B12-3
RPP-B14-3
RPP-B21-5

Defining the CDUs as devices allows them to be shown as UP or DOWN. The status is determined as follows:

Via IP ping, if an IP address is set
Via the timestamp of the latest data point reported by MQTT
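
The timestamp-based check can be pictured as a freshness test on the most recent sample. The following is a minimal sketch under stated assumptions: the function name, the five-minute threshold, and the rule that a device with no data counts as DOWN are all hypothetical and do not describe how BCM actually computes the status.

```python
from datetime import datetime, timedelta, timezone

# Assumed staleness threshold; BCM's real value is not documented here.
STALE_AFTER = timedelta(minutes=5)


def status_from_last_sample(last_sample, now=None, stale_after=STALE_AFTER):
    """Return "UP" if the latest data point is recent enough, else "DOWN"."""
    if now is None:
        now = datetime.now(timezone.utc)
    if last_sample is None:
        # No data point has ever been received for this device.
        return "DOWN"
    return "UP" if now - last_sample <= stale_after else "DOWN"


# Example with a fixed reference time so the result is reproducible.
reference = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(status_from_last_sample(reference - timedelta(minutes=1), now=reference))
```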

[a03-p1-head-01->device]% list -t coolingdistributionunit

Type                    Hostname (key)    IP         Status
---------------------   --------------    --------   ----------------
CoolingDistributionUnit CDU01             0.0.0.0    [ UP ]
CoolingDistributionUnit CDU02             0.0.0.0    [ UP ]

In some cases, you might want to make additional BMS metrics available over Prometheus as part of the observability stack. For example, you might configure it this way:

cm-manipulate-advanced-config.py PushMonitoringDeviceStatusMetrics=CDUStatus,CDULiquidSystemPressure,CDULiquidReturnTemperature

Then restart the BCM cmdaemon:

systemctl restart cmd

Data Catalog#

The latest version of the data catalog is available via PID (https://apps.nvidia.com/PID/ContentLibraries/Detail/1132978).

Fault Type and Handling Recommendations#

Fault Type	BMC	BCM	BMS
Tray-level leak detection	Compute tray only: BMC activates shutdown timer; All trays: BMC notifies BCM of the leak event	Switch tray only: BCM powers off the leaking tray via OOB Redfish commands; BCM creates a service ticket for onsite inspection; BCM notifies BMS of the single-tray leak event	NA
Rack-level leak detection (detected by BCM)	Same as above	Shut off DC output of the power shelf immediately; Notify BMS of the rack leak fault event	Shut off all power circuit breakers to the rack on the BCM-notified event; Shut off liquid valves in both directions on the BCM-notified event
Rack-level leak detection (detected by BMS)	Same as above	NA	Shut off all power circuit breakers to the rack and notify BCM; Shut off liquid valves for rack supply and return and notify BCM
Row-level leak detection	Same as above	NA	Shut off all power circuit breakers to all racks in the row and notify BCM; Shut down the row CDU and notify BCM
Sensor fault (includes false alarms due to sensor misreadings)	NA	Call for onsite inspection (e.g., power drain procedure)	Call for onsite inspection (e.g., power drain procedure)
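
For teams that script their runbooks, the recommendations in this table can be captured as a simple lookup structure. The sketch below is one possible encoding. The dictionary keys and the function name are hypothetical, and the action strings are paraphrased from the table rather than taken from any BCM or BMS API.

```python
# BMC behavior is the same for every leak fault type in the table.
BMC_LEAK_ACTIONS = [
    "Compute tray only: activate shutdown timer",
    "All trays: notify BCM of the leak event",
]
INSPECTION = ["Call for onsite inspection (e.g., power drain procedure)"]

# Hypothetical keys; an empty list corresponds to "NA" in the table.
HANDLING = {
    "tray_leak": {
        "BMC": BMC_LEAK_ACTIONS,
        "BCM": [
            "Switch tray only: power off the leaking tray via OOB Redfish",
            "Create a service ticket for onsite inspection",
            "Notify BMS of the single-tray leak event",
        ],
        "BMS": [],
    },
    "rack_leak_detected_by_bcm": {
        "BMC": BMC_LEAK_ACTIONS,
        "BCM": [
            "Shut off DC output of the power shelf immediately",
            "Notify BMS of the rack leak fault event",
        ],
        "BMS": [
            "Shut off all power circuit breakers to the rack",
            "Shut off liquid valves in both directions",
        ],
    },
    "rack_leak_detected_by_bms": {
        "BMC": BMC_LEAK_ACTIONS,
        "BCM": [],
        "BMS": [
            "Shut off all power circuit breakers to the rack and notify BCM",
            "Shut off rack supply and return valves and notify BCM",
        ],
    },
    "row_leak": {
        "BMC": BMC_LEAK_ACTIONS,
        "BCM": [],
        "BMS": [
            "Shut off power circuit breakers to all racks in the row and notify BCM",
            "Shut down the row CDU and notify BCM",
        ],
    },
    "sensor_fault": {"BMC": [], "BCM": INSPECTION, "BMS": INSPECTION},
}


def actions_for(fault_type: str, component: str) -> list:
    """Return the recommended actions for a fault type and component."""
    return HANDLING[fault_type][component]


for action in actions_for("rack_leak_detected_by_bcm", "BMS"):
    print(action)
```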