Leak Detection#
The GB200/GB300 systems use liquid cooling to achieve their scale-out density and efficiency. Liquid cooling carries a risk of leaks, so the GB200/GB300 systems implement several fail-safes to limit the worst effects of a leak. From an administrative point of view, leak events are collated within BCM.
Leak Example in cmsh#
In this example, two nodes are in a leak detection status.
[a03-p1-head-01]% device
[a03-p1-head-01->device]% list -r a06 -t physicalnode --status DOWN -v
Type             Hostname (key)     MAC                Category         IP               Network          Status
---------------- ------------------ ------------------ ---------------- ---------------- ---------------- --------------------------------
PhysicalNode     a06-p1-dgx-02-c05  72:14:AF:B4:FC:7E  hwqa-gb200       7.241.18.33      dgxnet1          [ DOWN ], leak:large (going down timeout reached: 10m), health check failed, health check unknown, restart required (interface:enP22p3s0f0np0)
PhysicalNode     a06-p1-dgx-02-c15  1A:A7:F7:32:98:AD  hwqa-gb200       7.241.18.43      dgxnet1          [ DOWN ], leak:large (going down timeout reached: 10m), health check failed, health check unknown
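When many nodes are listed, it can help to split the Status column into its parts. The following is a minimal Python sketch, assuming the status text has the comma-separated form shown in the example above. The function name `parse_status` and the returned dictionary keys are hypothetical and are not part of BCM.

```python
import re

def parse_status(status: str) -> dict:
    """Split a status string from the device list into state, leak severity, and flags.

    Assumes the format seen in the example above:
    "[ STATE ], leak:<severity> (...), flag, flag, ..."
    """
    # The node state appears in square brackets, for example "[ DOWN ]".
    state_match = re.search(r"\[\s*(\w+)\s*\]", status)
    # The leak severity follows the "leak:" prefix, for example "leak:large".
    leak_match = re.search(r"\bleak:(\w+)", status)
    # Everything after the first comma-separated field is treated as a flag.
    parts = [part.strip() for part in status.split(",") if part.strip()]
    return {
        "state": state_match.group(1) if state_match else None,
        "leak": leak_match.group(1) if leak_match else None,
        "flags": parts[1:],
    }

if __name__ == "__main__":
    example = (
        "[ DOWN ], leak:large (going down timeout reached: 10m), "
        "health check failed, health check unknown, "
        "restart required (interface:enP22p3s0f0np0)"
    )
    print(parse_status(example))
```

A status without a `leak:` entry yields `None` for the `leak` key, so the result can be used to filter a node list down to the nodes that currently report a leak.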
You can also use the leakinfo command to show current and historical leak events.
[a08-u04-p01-bcm-01->device]% leakinfo
Hostname         Timestamp    Severity     Info
---------------- ------------ ------------ ----------------
[a08-u04-p01-bcm-01->device]% leakinfo --history 10
Hostname            Timestamp            Severity     Info
------------------- -------------------- ------------ --------------------------------------------------------------------------------------------
a05-p01-dgx-03-c18  2026/01/30 19:19:28  large        The state of resource `Chassis_0_LeakDetector_1_ColdPlate` has changed to Degraded., resolu+
a05-p01-dgx-03-c18  2026/01/30 19:20:41  none
a06-p01-dgx-02-c04  2025/12/27 20:51:30  large        The state of resource `Chassis_0_LeakDetector_1_Manifold` has changed to Degraded., resolut+
a06-p01-dgx-02-c04  2025/12/27 20:52:07  none
a06-p01-dgx-02-c04  2026/01/07 23:23:28  large        The state of resource `Chassis_0_LeakDetector_1_ColdPlate` has changed to Degraded., resolu+
a06-p01-dgx-02-c04  2026/01/07 23:25:16  none
a06-p01-dgx-02-c04  2026/01/17 06:43:29  large        The state of resource `Chassis_0_LeakDetector_1_Manifold` has changed to Degraded., resolut+
a06-p01-dgx-02-c04  2026/01/17 06:45:10  none
[a08-u04-p01-bcm-01->device]%
For additional commands useful for advanced rack management, refer to Section 3.3.2 of the BCM NVIDIA Mission Control Manual.
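The leak history shown earlier can be summarized per event. The following is a minimal Python sketch, assuming the `leakinfo --history` output has the whitespace-separated layout shown above and assuming that a `none` entry marks the point at which the preceding leak on the same host cleared. That reading of `none` is an interpretation of the example, not a documented guarantee. The names `parse_history` and `leak_durations` are hypothetical and are not part of BCM.

```python
from datetime import datetime

# Sample text copied from the leakinfo history example; truncated Info fields are kept as-is.
SAMPLE = """\
Hostname Timestamp Severity Info
------------------- -------------------- ------------ ----------------
a05-p01-dgx-03-c18 2026/01/30 19:19:28 large The state of resource `Chassis_0_LeakDetector_1_ColdPlate` has changed to Degraded., resolu+
a05-p01-dgx-03-c18 2026/01/30 19:20:41 none
a06-p01-dgx-02-c04 2025/12/27 20:51:30 large The state of resource `Chassis_0_LeakDetector_1_Manifold` has changed to Degraded., resolut+
a06-p01-dgx-02-c04 2025/12/27 20:52:07 none
"""

def parse_history(text):
    """Return a list of (host, timestamp, severity, info) tuples from leakinfo history text."""
    events = []
    for line in text.splitlines():
        parts = line.split(None, 4)
        if len(parts) < 4:
            continue
        host, date, time, severity = parts[:4]
        try:
            ts = datetime.strptime(f"{date} {time}", "%Y/%m/%d %H:%M:%S")
        except ValueError:
            continue  # Header, separator, or prompt line rather than an event.
        info = parts[4] if len(parts) > 4 else ""
        events.append((host, ts, severity, info))
    return events

def leak_durations(events):
    """Pair each leak with the next 'none' entry on the same host.

    Returns (host, start_time, seconds) tuples in order of the clearing entry.
    """
    open_leaks = {}
    durations = []
    for host, ts, severity, _info in sorted(events, key=lambda e: e[1]):
        if severity != "none":
            # Keep the earliest unresolved leak time for this host.
            open_leaks.setdefault(host, ts)
        elif host in open_leaks:
            start = open_leaks.pop(host)
            durations.append((host, start, (ts - start).total_seconds()))
    return durations

if __name__ == "__main__":
    for host, start, seconds in leak_durations(parse_history(SAMPLE)):
        print(f"{host} leak at {start} lasted {seconds:.0f}s")
```

Lines that do not carry a parsable timestamp, such as the header, the separator, and the cmsh prompt, are skipped, so pasted console output can be passed in without trimming.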
Leak Example in BCM UI#
BCM also shows leak events in several locations within the UI:
Events
Monitoring -> Leak Detection and Response