High-Speed Fabric Management#
NVLink and NVLink Switch#
NVIDIA NVLink is a high-speed interconnect that allows multiple GPUs to communicate directly. In the past, NVLink connections were limited to GPUs within a single node. With the addition of NVLink Switch technology, multi-node NVLink is now possible: multiple systems can interconnect over NVLink, through NVLink Switches, to form a large GPU and memory fabric within an NVLink Domain.
The NVIDIA GB200 NVL72 rack enables a 72-GPU NVLink Domain by default. This NVL72 NVLink Domain is set up as a single NVLink Partition; in other words, all GPUs in the NVL72 Domain can communicate with each other by default. In certain scenarios, a cluster administrator may need to create multiple NVLink Partitions within the NVL72 system. These are called User Partitions, and they make it possible to run several NVLink Partitions instead of one large Default Partition, providing a higher degree of multi-tenancy and isolation between workloads.
In most cases the Default Partition can be used, unless certain workloads have strong security or isolation requirements. IMEX and Slurm provide ways to dynamically split the NVLink Default Partition at job runtime; these are discussed in more detail in the Slurm <provide link> section of this document.
It is worth noting that connections between racks use InfiniBand, while connections within a rack use NVLink.
NVLink definitions
Term | Description
---|---
NVLink | High-speed link that enables GPU-to-GPU communication
NVLink Partition | A subset of GPUs within a hardware isolation boundary
NVLink Domain | A set of nodes that can communicate over NVLink
Default Partition | Allows all GPUs in an NVLink Domain to communicate
User Partition | An NVLink Partition that is created and managed by an admin
More details on NVLink Partitioning can be found in the GB200 NVL Partition Guide.
NVLink Management Software (NMX)#
NVIDIA NMX is an integrated platform for managing and monitoring NVLink Switches, Domains, and Partitions.
It includes three main components:
NMX-Telemetry (NMX-T)
NMX-Manager (NMX-M)
NMX-Controller (NMX-C)
To manage NVLink Partitions, gRPC API calls are sent to the NMX-Controller, or the NVOS CLI is used directly on the NVSwitch.
The base case is highlighted below, where we use the NVOS CLI on NVSWITCH-1. In this case, NVSWITCH-1 is the leader switch that runs NMX-C and NMX-T.
You can see the applications running on the leader switch by using the NVOS CLI and running nv show cluster app:
admin@a07-p1-nvsw-01-eth1.ipminet2.cluster:~$ nv show cluster app
Name ID Version Capabilities Components Version Status Reason Additional Information Summary
-------------- ------------- ---------------------- --------------------------------------------------- ---------------------------------------------------------------- ------ ------ ------------------------------ -------
nmx-controller nmx-c-nvos 1.0.0_2025-04-11_15-23 sm, gfm, fib, gw-api sm:2025.03.6, gfm:R570.133.15, fib-fe:1.0.1 ok CONTROL_PLANE_STATE_CONFIGURED
nmx-telemetry nmx-telemetry 1.0.4 nvl telemetry, gnmi aggregation, syslog aggregation nvl-telemetry:1.20.4, gnmi-aggregator:1.4.1, nmx-connector:1.4.1 ok
You can further check the NVLink Partition configuration:
admin@NVSWITCH-1:~$ nv show sdn partition
ID Name Num of GPUs Health Resiliency mode Multicast groups limit Partition type Summary
----- ----------------- ----------- ------- ------------------ ---------------------- -------------- -------
32766 Default Partition 72 healthy adaptive_bandwidth 1024 gpuuid_based
Additionally, here is a way to see available commands for creating or modifying partitions:
admin@NVSWITCH-1:~$ nv list-commands | grep partition
nv show sdn partition
nv show sdn partition <partition-id>
nv show sdn partition <partition-id> location
nv show sdn partition <partition-id> uuid
nv action boot-next system image (partition1|partition2)
nv action update sdn partition <partition-id> [reroute]
nv action update sdn partition <partition-id> location <location-id> [no-reroute]
nv action update sdn partition <partition-id> uuid <uuid> [no-reroute]
nv action create sdn partition <partition-id> [name <value>] [resiliency-mode (full_bandwidth|adaptive_bandwidth|user_action)] [mcast-limit (0-1024)] [location <location-id>] [uuid (0-18446744073709551615)]
nv action delete sdn partition <partition-id>
nv action restore sdn partition <partition-id> location <location-id> [no-reroute]
nv action restore sdn partition <partition-id> uuid <uuid> [no-reroute]
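For illustration, a User Partition could be created and inspected with the create action listed above. The commands below are a minimal sketch only: the partition ID (100), the name (tenant-a), and the option values are hypothetical placeholders, and the GPU and location assignments required for a usable partition should be taken from the GB200 NVL Partition Guide.
admin@NVSWITCH-1:~$ nv action create sdn partition 100 name tenant-a resiliency-mode adaptive_bandwidth mcast-limit 1024
admin@NVSWITCH-1:~$ nv show sdn partition 100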
More details on managing NVLink Switches can be found within the NVOS and NMX docs on the NVLink Networking docs page.
NMX-M provides centralized monitoring of NVSwitches, exposing a REST API endpoint and a Prometheus scrape endpoint per rack. BCM also integrates with NMX-M and collects a subset of metrics. The NMX-M connection settings are configured in cmsh under the partition's nmxmsettings submode:
[a03-p1-head-01]% partition
[a03-p1-head-01->partition[base]]% nmxmsettings
[a03-p1-head-01->partition[base]->nmxmsettings]% show
Parameter Value
-------------------------------- ------------------------------------------------
Revision
Server 7.241.0.145
User name rw-user
Password ********
Port 443
Verify SSL no
CA certificate
Certificate
Private key
Prometheus metric forwarders <0 in submode>
An example of updating the Server and Password fields in nmxmsettings:
[a03-p1-head-01->partition[base]->nmxmsettings]% set Server 7.241.0.144
[a03-p1-head-01->partition*[base*]->nmxmsettings*]% set Password ********
[a03-p1-head-01->partition*[base*]->nmxmsettings*]% commit
BCM exposes the counts from NMX-M’s KPI endpoint as metrics:
curl -sk https://localhost:8081/exporter | grep -Ev '# HELP|# TYPE' | grep -E '^nmxm' | cut -d '{' -f1 | sort -u
nmxm_chassis_count
nmxm_compute_allocation_count
nmxm_compute_health_count
nmxm_compute_nodes_count
nmxm_domain_health_count
nmxm_gpu_health_count
nmxm_gpus_count
nmxm_ports_count
nmxm_switch_health_count
nmxm_switch_nodes_count
for i in $(curl -sk https://localhost:8081/exporter | grep -Ev '# HELP|# TYPE' | grep -E '^nmxm' | cut -d '{' -f1 | sort -u); do curl -sk https://localhost:8081/exporter | grep -Ev '# HELP|# TYPE' | grep $i; done
nmxm_chassis_count{base_type="Partition",name="base"} 7
nmxm_compute_allocation_count{base_type="Partition",name="base",parameter="full"} 126
nmxm_compute_allocation_count{base_type="Partition",name="base",parameter="all"} 126
nmxm_compute_allocation_count{base_type="Partition",name="base",parameter="partial"} 0
nmxm_compute_allocation_count{base_type="Partition",name="base",parameter="free"} 54
nmxm_compute_health_count{base_type="Partition",name="base",parameter="unhealthy"} 1
nmxm_compute_health_count{base_type="Partition",name="base",parameter="unknown"} 0
nmxm_compute_health_count{base_type="Partition",name="base",parameter="healthy"} 122
nmxm_compute_health_count{base_type="Partition",name="base",parameter="degraded"} 0
nmxm_compute_nodes_count{base_type="Partition",name="base"} 123
nmxm_domain_health_count{base_type="Partition",name="base",parameter="unknown"} 0
nmxm_domain_health_count{base_type="Partition",name="base",parameter="unhealthy"} 0
nmxm_domain_health_count{base_type="Partition",name="base",parameter="healthy"} 7
nmxm_domain_health_count{base_type="Partition",name="base",parameter="degraded"} 0
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="degraded"} 0
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="unknown"} 0
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="healthy"} 491
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="nonvlink"} 1
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="degraded_bw"} 0
nmxm_gpus_count{base_type="Partition",name="base"} 492
nmxm_ports_count{base_type="Partition",name="base"} 15552
nmxm_switch_health_count{base_type="Partition",name="base",parameter="missing_nvlink"} 54
nmxm_switch_health_count{base_type="Partition",name="base",parameter="unknown"} 0
nmxm_switch_health_count{base_type="Partition",name="base",parameter="healthy"} 72
nmxm_switch_health_count{base_type="Partition",name="base",parameter="unhealthy"} 0
nmxm_switch_nodes_count{base_type="Partition",name="base"} 63
These count metrics are categorized by the parameter label.
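As a side note, the loop above refetches the exporter once per metric name; a single fetch piped through sort gives an equivalent grouped listing:
curl -sk https://localhost:8081/exporter | grep -Ev '# HELP|# TYPE' | grep -E '^nmxm' | sort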
An example dashboard using Single Stat panels to consume these KPI metrics:

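Outside of a dashboard, the same KPI counts can be summarized directly from the exporter output. The pipeline below is a minimal sketch that totals the nmxm_gpu_health_count series shown above into a single healthy-versus-total figure; with the sample output above it would report 491 healthy GPUs out of 492.
curl -sk https://localhost:8081/exporter | grep -Ev '# HELP|# TYPE' | grep '^nmxm_gpu_health_count' \
  | awk '{total+=$NF} /parameter="healthy"/ {healthy+=$NF} END {printf "healthy GPUs: %d / %d\n", healthy, total}'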
InfiniBand Fabrics#
The high-speed InfiniBand fabrics are managed with NVIDIA Unified Fabric Manager (UFM). UFM is a powerful platform for managing scale-out computing environments. UFM enables data center operators to efficiently monitor and operate the entire fabric, boost application performance, and maximize fabric resource utilization.
While other tools are device-oriented and rely on manual processes, UFM’s automated and application-centric approach bridges the gap between servers, applications, and fabric elements, enabling administrators to manage and optimize everything from the smallest clusters to the largest, most performance-demanding ones.
UFM Dashboard#
The dashboard window summarizes the fabric’s status, including events, alarms, errors, traffic and statistics.

UFM InfiniBand fabric ports view#
The Ports view provides a list of all ports in the InfiniBand fabric along with their speed, width, and status.

Verifying that UFM is running#
Use the service ufmha status command to verify UFM is running:
ufm001# service ufmha status
ufmha status
========================================
Local Host
Server ufm001
Kernel 3.10.0-1127.19.1.el7.x86_64
IP Address 10.166.130.31
HA Interface bond0
DRBD Partition /dev/sda6
Heartbeat Master
Mysql Running
UFM Server Running
DRBD State Primary
DRBD Device State UpToDate
========================================
Remote Host
Server ufm002
Kernel 3.10.0-1127.19.1.el7.x86_64
IP Address 10.166.130.32
HA Interface bond0
DRBD Partition /dev/sda6
Heartbeat Slave
Mysql Stopped
UFM Server Stopped
DRBD State Secondary
DRBD Device State UpToDate
========================================
Virtual IP 10.166.130.58/24
Broadcast IP 10.166.130.255
========================================
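In addition to service ufmha status, a quick reachability check against the UFM virtual IP (taken from the output above) confirms that the active node is answering on HTTPS. This is a minimal sketch and assumes the UFM web interface is served on the default HTTPS port:
# 10.166.130.58 is the virtual IP reported by ufmha status above
curl -skI https://10.166.130.58/ | head -n 1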
Refer to http://nvidia.com/en-us/networking/infiniband/ufm/ for UFM documentation.