Leak Detection

The GB200 system uses liquid cooling to achieve its scale-out density and efficiency. Liquid cooling carries a risk of leaks, and the GB200 system implements several fail-safes to limit their worst effects. For administrators, these leak events are collated within BCM.

9.1.1 Leak Example in cmsh

In this example, two nodes are in a leak detection state.

[a03-p1-head-01]% events off broadcast
Broadcast events: off
[a03-p1-head-01]% device
[a03-p1-head-01->device]% list -r a06 -t physicalnode --status DOWN -v
Type             Hostname (key)     MAC               Category        IP              Network         Status
---------------- ------------------ ------------------ ---------------- ---------------- ---------------- --------------------------------
PhysicalNode     a06-p1-dgx-02-c05  72:14:AF:B4:FC:7E  hwqa-gb200      7.241.18.33     dgxnet1         [  DOWN  ], leak:large (going
                                                                                                          down timeout reached: 10m),
                                                                                                          health check failed, health
                                                                                                          check unknown, restart
                                                                                                          required (interface:enP22p3s0f0np0)
PhysicalNode     a06-p1-dgx-02-c15  1A:A7:F7:32:98:AD  hwqa-gb200      7.241.18.43     dgxnet1         [  DOWN  ], leak:large (going
                                                                                                          down timeout reached: 10m),
                                                                                                          health check failed, health check unknown
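
For scripted checks, the same listing can be run non-interactively and filtered for leak states. The following is a minimal sketch, not a BCM-provided tool: `leak_nodes` is a hypothetical helper name, and it assumes that the `leak:<severity>` text appears on the same output line as the hostname and that the hostname is the second column, as in the listing above. It also assumes `cmsh -c` is available on the head node for running the command string.

```shell
# Hypothetical helper: print "<hostname> <severity>" for every node line
# that carries a leak:<severity> status. Reads `list -v` style output on stdin.
# Assumes the hostname is the second whitespace-separated column.
leak_nodes() {
  awk '$1 == "PhysicalNode" && match($0, /leak:[a-z]+/) {
    # "leak:" is 5 characters long; keep only the severity that follows it
    print $2, substr($0, RSTART + 5, RLENGTH - 5)
  }'
}

# Usage on a head node (rack a06 taken from the example above):
# cmsh -c "device; list -r a06 -t physicalnode --status DOWN -v" | leak_nodes
```

Wrapped continuation lines of the status column are ignored, because only lines that begin with the `PhysicalNode` type are inspected.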

BCM also shows this within the UI in several locations.

9.1.2 Leak Example in BCM UI

Events

BCM Events interface showing leak detection events

Monitoring -> Leak Detection and Response

BCM Monitoring interface for leak detection and response

BCM integrates with several data sources for leak detection:

  • Redfish API at the device level via BMC.

  • Building Management System (BMS) at the rack and data center level.

BCM leak detection dashboard with integrated data sources

These metrics are available in the UI through the built-in reporting, or as metrics from the Prometheus exporter endpoint.

Examples of these metrics through the exporter endpoint:

# Fetch the exporter output once, then show up to three lines for each leak-related metric name
metrics=$(curl -sk https://localhost:8081/exporter)
for i in $(echo "$metrics" | grep -i 'leak' | grep -Ev '# HELP|# TYPE' | cut -d '{' -f1 | sort -u); do
  echo "Example metric $i"
  echo "$metrics" | grep "$i" | head -n 3
done

Example metric deviceswithleaks

# HELP deviceswithleaks Number of devices that have detected a leak
# TYPE deviceswithleaks gauge
deviceswithleaks{base_type="Rack",name="B04"} 0

Example metric leakdetectrack

# HELP leakdetectrack undefined
# TYPE leakdetectrack gauge
leakdetectrack{base_type="Rack",name="A05"} 0

Example metric leakdetectracktray

# HELP leakdetectracktray undefined
# TYPE leakdetectracktray gauge
leakdetectracktray{base_type="Rack",name="A06"} 0

Example metric leakresponserackelectricalisolationstatus

# HELP leakresponserackelectricalisolationstatus undefined
# TYPE leakresponserackelectricalisolationstatus gauge
leakresponserackelectricalisolationstatus{base_type="Rack",name="A06"} 0

Example metric leakresponserackliquidisolationstatus

# HELP leakresponserackliquidisolationstatus undefined
# TYPE leakresponserackliquidisolationstatus gauge
leakresponserackliquidisolationstatus{base_type="Rack",name="A06"} 0

Example metric rf_chassis_0_leakdetector_0_coldplate

# HELP rf_chassis_0_leakdetector_0_coldplate Chassis 0 LeakDetector 0 ColdPlate
# TYPE rf_chassis_0_leakdetector_0_coldplate gauge
rf_chassis_0_leakdetector_0_coldplate{base_type="Device",category="perf-team",hostname="b05-p1-dgx-05-c12",type="PhysicalNode"} 1.7047

Example metric rf_chassis_0_leakdetector_0_manifold

# HELP rf_chassis_0_leakdetector_0_manifold Chassis 0 LeakDetector 0 Manifold
# TYPE rf_chassis_0_leakdetector_0_manifold gauge
rf_chassis_0_leakdetector_0_manifold{base_type="Device",category="perf-team",hostname="b05-p1-dgx-05-c12",type="PhysicalNode"} 1.7055

Example metric rf_chassis_0_leakdetector_1_coldplate

# HELP rf_chassis_0_leakdetector_1_coldplate Chassis 0 LeakDetector 1 ColdPlate
# TYPE rf_chassis_0_leakdetector_1_coldplate gauge
rf_chassis_0_leakdetector_1_coldplate{base_type="Device",category="perf-team",hostname="b05-p1-dgx-05-c12",type="PhysicalNode",parameter="state"} 0

Example metric rf_chassis_0_leakdetector_1_manifold

# HELP rf_chassis_0_leakdetector_1_manifold Chassis 0 LeakDetector 1 Manifold
# TYPE rf_chassis_0_leakdetector_1_manifold gauge
rf_chassis_0_leakdetector_1_manifold{base_type="Device",category="perf-team",hostname="b05-p1-dgx-05-c12",type="PhysicalNode",parameter="state"} 0

Example metric rf_ld_leakdetection

# HELP rf_ld_leakdetection Leak Detection Systems
# TYPE rf_ld_leakdetection gauge
rf_ld_leakdetection{base_type="Device",category="perf-team",hostname="b05-p1-dgx-05-c12",type="PhysicalNode"} 0

Example metric rf_leakdetector

# HELP rf_leakdetector Chassis 0 LeakDetector 1 ColdPlate
# TYPE rf_leakdetector gauge
rf_leakdetector{base_type="Device",category="perf-team",hostname="b05-p1-dgx-05-c12",type="PhysicalNode",parameter="Chassis_0_LeakDetector_1_ColdPlate"} 0
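
As an illustration of consuming these metrics from a script, the sketch below lists the racks whose `deviceswithleaks` gauge is above zero. It relies only on that metric's HELP text ("Number of devices that have detected a leak") and the `name` label shown above; `racks_with_leaks` is a hypothetical helper name, and the meaning of the other leak metrics' values is not assumed here.

```shell
# Hypothetical helper: print the rack name for each deviceswithleaks sample
# whose value is greater than zero. Reads exporter text output on stdin.
racks_with_leaks() {
  awk 'index($0, "deviceswithleaks{") == 1 && $NF > 0 {
    # name="..." : skip the 6-character prefix and drop the closing quote
    if (match($0, /name="[^"]*"/)) print substr($0, RSTART + 6, RLENGTH - 7)
  }'
}

# Usage (endpoint taken from the example above):
# curl -sk https://localhost:8081/exporter | racks_with_leaks
```

An empty result means that no rack currently reports a device with a detected leak through this metric.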