Leak Detection

The GB200/GB300 systems use liquid cooling to achieve their scale-out density and efficiency. Liquid cooling carries a risk of leaks, and the GB200/GB300 systems implement several fail-safes to limit the worst effects of a leak. From an administrative point of view, leak events are collated within BCM.

Leak Example in cmsh

In this example, two nodes are in a leak detection status: both are DOWN and report leak:large.

[a03-p1-head-01]% device
[a03-p1-head-01->device]% list -r a06 -t physicalnode --status DOWN -v
Type             Hostname (key)     MAC               Category        IP              Network         Status
---------------- ------------------ ------------------ ---------------- ---------------- ---------------- --------------------------------
PhysicalNode     a06-p1-dgx-02-c05  72:14:AF:B4:FC:7E  hwqa-gb200      7.241.18.33     dgxnet1         [  DOWN  ], leak:large (going
                                                                                                          down timeout reached: 10m),
                                                                                                          health check failed, health
                                                                                                          check unknown, restart
                                                                                                          required (interface:enP22p3s0f0np0)
PhysicalNode     a06-p1-dgx-02-c15  1A:A7:F7:32:98:AD  hwqa-gb200      7.241.18.43     dgxnet1         [  DOWN  ], leak:large (going
                                                                                                          down timeout reached: 10m),
                                                                                                          health check failed, health check unknown
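When scripting around this listing, it can help to pull the leak severity out of the Status text. The following is a minimal sketch, not part of BCM: it assumes the marker always appears as `leak:<severity>` (as in the sample above) and that any wrapped status lines have already been joined into one string. The function name `leak_severity` is hypothetical.

```python
import re

# Assumption: the leak marker appears in the Status text as "leak:<severity>",
# as in the sample listing above (for example "leak:large").
LEAK_MARKER = re.compile(r"\bleak:(\w+)")


def leak_severity(status):
    """Return the severity that follows 'leak:' in a status string, or None."""
    match = LEAK_MARKER.search(status)
    return match.group(1) if match else None
```

For the first node above, passing its joined status text would return `large`; a status without a leak marker returns `None`.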

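You can also use the leakinfo command to show current and historical leak events. In the example below, the first command returns no current leak events, and the second, run with --history, lists earlier ones.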

[a08-u04-p01-bcm-01->device]% leakinfo
Hostname         Timestamp    Severity     Info
---------------- ------------ ------------ ----------------
[a08-u04-p01-bcm-01->device]% leakinfo --history 10
Hostname            Timestamp            Severity     Info
------------------- -------------------- ------------ --------------------------------------------------------------------------------------------
a05-p01-dgx-03-c18  2026/01/30 19:19:28  large        The state of resource `Chassis_0_LeakDetector_1_ColdPlate` has changed to Degraded., resolu+
a05-p01-dgx-03-c18  2026/01/30 19:20:41  none
a06-p01-dgx-02-c04  2025/12/27 20:51:30  large        The state of resource `Chassis_0_LeakDetector_1_Manifold` has changed to Degraded., resolut+
a06-p01-dgx-02-c04  2025/12/27 20:52:07  none
a06-p01-dgx-02-c04  2026/01/07 23:23:28  large        The state of resource `Chassis_0_LeakDetector_1_ColdPlate` has changed to Degraded., resolu+
a06-p01-dgx-02-c04  2026/01/07 23:25:16  none
a06-p01-dgx-02-c04  2026/01/17 06:43:29  large        The state of resource `Chassis_0_LeakDetector_1_Manifold` has changed to Degraded., resolut+
a06-p01-dgx-02-c04  2026/01/17 06:45:10  none
[a08-u04-p01-bcm-01->device]%
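
The history output above can be post-processed to see how long each leak condition lasted. The sketch below is illustrative only and makes two assumptions that the source does not confirm: that each row follows the hostname, timestamp, severity, info layout shown in the sample, and that a row with severity `none` marks the end of the preceding leak on the same host. The function names `parse_leak_history` and `leak_durations` are hypothetical.

```python
import re
from datetime import datetime

# Assumed row layout, inferred from the sample output above:
# hostname, "YYYY/MM/DD HH:MM:SS" timestamp, severity, optional info text.
ROW = re.compile(
    r"^(\S+)\s+(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+(\S+)\s*(.*)$"
)


def parse_leak_history(text):
    """Parse captured leakinfo output into (host, time, severity, info) tuples."""
    events = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # prompts, the header row, and the separator row do not match
        host, stamp, severity, info = match.groups()
        when = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
        events.append((host, when, severity, info))
    return events


def leak_durations(events):
    """Pair each leak report with the next 'none' row for the same host.

    Returns (closed, still_open): closed is a list of
    (host, start_time, duration_in_seconds); still_open maps each host with
    an unresolved leak to the time the leak was first reported.
    """
    still_open = {}
    closed = []
    for host, when, severity, _info in sorted(events, key=lambda e: (e[0], e[1])):
        if severity != "none":
            # Keep the first report if several arrive before the leak clears.
            still_open.setdefault(host, when)
        elif host in still_open:
            start = still_open.pop(host)
            closed.append((host, start, (when - start).total_seconds()))
    return closed, still_open
```

Applied to the sample history, this pairs the `large` event on a05-p01-dgx-03-c18 at 19:19:28 with the `none` row at 19:20:41, a duration of 73 seconds. The input text would need to be captured first, for example by saving the output of a cmsh session to a file.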

For additional commands useful for advanced rack management, refer to Section 3.3.2 of the BCM NVIDIA Mission Control Manual.

Leak Example in BCM UI

BCM also shows leak events in several locations within the UI.

Events

BCM Events interface showing leak detection events

Monitoring -> Leak Detection and Response

BCM Monitoring interface for leak detection and response