NVLink Management Software (NMX + NetQ)
NVIDIA NVLink Management Software (NetQ) is an integrated platform for managing and monitoring NVLink Switches, Domains, and Partitions.
It includes the following main components:
NetQ NVLink (previously NMX-M)
NMX-Telemetry (NMX-T)
NMX-Controller (NMX-C)
To manage NVLink partitions, send gRPC API calls to NMX-Controller (NMX-C) or use the NVOS CLI on the NVSwitch.
The following example uses the NVOS CLI on a leader switch that runs NMX-C and NMX-T.
To view the processes on the leader switch, run nv show cluster app:
admin@a07-p1-nvsw-01-eth1.ipminet2.cluster:~$ nv show cluster app
Name            ID             Version                 Capabilities                                         Components Version                                                Status  Reason  Additional Information          Summary
--------------  -------------  ----------------------  ---------------------------------------------------  ----------------------------------------------------------------  ------  ------  ------------------------------  -------
nmx-controller  nmx-c-nvos     1.0.0_2025-04-11_15-23  sm, gfm, fib, gw-api                                 sm:2025.03.6, gfm:R570.133.15, fib-fe:1.0.1                       ok              CONTROL_PLANE_STATE_CONFIGURED
nmx-telemetry   nmx-telemetry  1.0.4                   nvl telemetry, gnmi aggregation, syslog aggregation  nvl-telemetry:1.20.4, gnmi-aggregator:1.4.1, nmx-connector:1.4.1  ok
To check NVLink partition configuration:
admin@NVSWITCH-1:~$ nv show sdn partition
ID     Name               Num of GPUs  Health   Resiliency mode     Multicast groups limit  Partition type  Summary
-----  -----------------  -----------  -------  ------------------  ----------------------  --------------  -------
32766  Default Partition  72           healthy  adaptive_bandwidth  1024                    gpuuid_based
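When this output has to be consumed by a script, the dashed ruler line can be used to recover the column boundaries. The following is a minimal Python sketch, assuming the CLI output has been captured with its original column alignment. The names parse_table and SAMPLE are illustrative and are not part of NVOS.

```python
import re


def parse_table(text):
    """Parse a table whose second line is a ruler of dashes marking the columns."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header, ruler, rows = lines[0], lines[1], lines[2:]
    # Each run of dashes in the ruler gives one column's start and end offsets.
    spans = [m.span() for m in re.finditer(r"-+", ruler)]
    names = [header[a:b].strip() for a, b in spans]
    records = []
    for row in rows:
        cells = []
        for i, (a, b) in enumerate(spans):
            # Let the last column run to the end of the line in case it overflows.
            end = b if i < len(spans) - 1 else None
            cells.append(row[a:end].strip())
        records.append(dict(zip(names, cells)))
    return records


# Sample copied from the partition listing above (assumed to be column-aligned).
SAMPLE = """\
ID     Name               Num of GPUs  Health   Resiliency mode     Multicast groups limit  Partition type  Summary
-----  -----------------  -----------  -------  ------------------  ----------------------  --------------  -------
32766  Default Partition  72           healthy  adaptive_bandwidth  1024                    gpuuid_based
"""

records = parse_table(SAMPLE)
for record in records:
    print(record["ID"], record["Name"], record["Health"])
```

Slicing by ruler offsets, rather than splitting on whitespace, keeps values such as Default Partition intact and leaves empty cells (here, Summary) as empty strings.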
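To list the available partition commands (the grep pattern also matches the unrelated nv action boot-next system image command, which refers to system image partitions):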
admin@NVSWITCH-1:~$ nv list-commands | grep partition
nv show sdn partition
nv show sdn partition <partition-id>
nv show sdn partition <partition-id> location
nv show sdn partition <partition-id> uuid
nv action boot-next system image (partition1|partition2)
nv action update sdn partition <partition-id> [reroute]
nv action update sdn partition <partition-id> location <location-id> [no-reroute]
nv action update sdn partition <partition-id> uuid <uuid> [no-reroute]
nv action create sdn partition <partition-id> [name <value>] [resiliency-mode (full_bandwidth|adaptive_bandwidth|user_action)] [mcast-limit (0-1024)] [location <location-id>] [uuid (0-18446744073709551615)]
nv action delete sdn partition <partition-id>
nv action restore sdn partition <partition-id> location <location-id> [no-reroute]
nv action restore sdn partition <partition-id> uuid <uuid> [no-reroute]
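The create syntax above can be checked before a command is sent to the switch. The following is a minimal Python sketch that only builds the command string from the options and ranges shown in the listing. It does not run anything, and the helper name build_create_partition, the example partition ID, and the example name are hypothetical. Quoting values with shlex is also an assumption about how the CLI accepts arguments.

```python
import shlex

# Values taken from the "nv action create sdn partition" syntax listed above.
RESILIENCY_MODES = ("full_bandwidth", "adaptive_bandwidth", "user_action")
MCAST_LIMIT_MAX = 1024
UUID_MAX = 18446744073709551615  # 2**64 - 1


def build_create_partition(partition_id, name=None, resiliency_mode=None,
                           mcast_limit=None, location=None, uuid=None):
    """Return the command string for creating a partition (hypothetical helper)."""
    parts = ["nv", "action", "create", "sdn", "partition", str(partition_id)]
    if name is not None:
        parts += ["name", name]
    if resiliency_mode is not None:
        if resiliency_mode not in RESILIENCY_MODES:
            raise ValueError(f"unknown resiliency mode: {resiliency_mode}")
        parts += ["resiliency-mode", resiliency_mode]
    if mcast_limit is not None:
        if not 0 <= mcast_limit <= MCAST_LIMIT_MAX:
            raise ValueError("mcast-limit must be in 0-1024")
        parts += ["mcast-limit", str(mcast_limit)]
    if location is not None:
        # The location ID format is not described here, so it is passed through as-is.
        parts += ["location", str(location)]
    if uuid is not None:
        if not 0 <= uuid <= UUID_MAX:
            raise ValueError("uuid must be in 0-18446744073709551615")
        parts += ["uuid", str(uuid)]
    return shlex.join(parts)


# Example values are placeholders, not taken from a real system.
print(build_create_partition(5, name="training",
                             resiliency_mode="full_bandwidth", mcast_limit=512))
```

The options are emitted in the same order as in the syntax line, and out-of-range values raise ValueError instead of producing a command the switch would reject.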
For more details on managing NVLink switches, see the NVOS and NetQ documentation on the NVLink Networking documentation page.
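NetQ NVLink provides centralized monitoring of NVSwitches through a REST API endpoint and a Prometheus scrape endpoint per rack. BCM also integrates with NetQ and collects a subset of its metrics. To view the NetQ connection settings that BCM uses, open the nmxmsettings submode of partition mode in cmsh: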
[a03-p1-head-01]% partition
[a03-p1-head-01->partition[base]]% nmxmsettings
[a03-p1-head-01->partition[base]->nmxmsettings]% show
Parameter                        Value
-------------------------------- ------------------------------------------------
Revision
Server                           7.241.0.145
User name                        rw-user
Password                         ********
Port                             443
Verify SSL                       no
CA certificate
Certificate
Private key
Prometheus metric forwarders     <0 in submode>
Example: Update the Server and Password fields in nmxmsettings.
[a03-p1-head-01->partition[base]->nmxmsettings]% set Server 7.241.0.144
[a03-p1-head-01->partition*[base*]->nmxmsettings*]% set Password ********
[a03-p1-head-01->partition*[base*]->nmxmsettings*]% commit
BCM exposes the counts from NetQ’s KPI endpoint as metrics with the nmxm prefix. To list the metric names, query the exporter:
curl -sk https://localhost:8081/exporter | grep -Ev '# HELP|# TYPE' | grep -E '^nmxm' | cut -d '{' -f1 | sort -u
Example output:
nmxm_chassis_count
nmxm_compute_allocation_count
nmxm_compute_health_count
nmxm_compute_nodes_count
nmxm_domain_health_count
nmxm_gpu_health_count
nmxm_gpus_count
nmxm_ports_count
nmxm_switch_health_count
nmxm_switch_nodes_count
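The same listing can be produced from a script. The following is a minimal Python sketch, assuming the exporter URL from the curl command above and the same choice to skip TLS certificate verification that curl -k makes. The function names and the sample text, including its placeholder HELP and TYPE lines, are illustrative.

```python
import ssl
import urllib.request


def fetch_exporter(url="https://localhost:8081/exporter"):
    """Fetch the exporter text without verifying the certificate (like curl -k)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be disabled before setting CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, context=ctx, timeout=10) as resp:
        return resp.read().decode("utf-8")


def metric_names(text, prefix="nmxm"):
    """Return the sorted, de-duplicated metric names that start with prefix."""
    names = set()
    for line in text.splitlines():
        # Skip comment lines (# HELP, # TYPE) and metrics from other sources.
        if line.startswith("#") or not line.startswith(prefix):
            continue
        # The name ends at the first '{' (labels) or space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return sorted(names)


# Illustrative input; the comment lines are placeholders, not real exporter text.
SAMPLE_TEXT = """\
# HELP nmxm_gpus_count placeholder help text
# TYPE nmxm_gpus_count gauge
nmxm_gpus_count{base_type="Partition",name="base"} 492
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="healthy"} 491
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="unknown"} 0
other_metric{name="base"} 1
"""

for name in metric_names(SAMPLE_TEXT):
    print(name)
```

On a live system, metric_names(fetch_exporter()) would replace the sample text. Disabling certificate verification is only reasonable for a local endpoint such as the one queried here.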
To retrieve detailed metric values:
metrics=$(curl -sk https://localhost:8081/exporter | grep -E '^nmxm'); for i in $(echo "$metrics" | cut -d '{' -f1 | sort -u); do echo "$metrics" | grep "^$i{"; done
Example output:
nmxm_chassis_count{base_type="Partition",name="base"} 7
nmxm_compute_allocation_count{base_type="Partition",name="base",parameter="full"} 126
nmxm_compute_allocation_count{base_type="Partition",name="base",parameter="all"} 126
nmxm_compute_allocation_count{base_type="Partition",name="base",parameter="partial"} 0
nmxm_compute_allocation_count{base_type="Partition",name="base",parameter="free"} 54
nmxm_compute_health_count{base_type="Partition",name="base",parameter="unhealthy"} 1
nmxm_compute_health_count{base_type="Partition",name="base",parameter="unknown"} 0
nmxm_compute_health_count{base_type="Partition",name="base",parameter="healthy"} 122
nmxm_compute_health_count{base_type="Partition",name="base",parameter="degraded"} 0
nmxm_compute_nodes_count{base_type="Partition",name="base"} 123
nmxm_domain_health_count{base_type="Partition",name="base",parameter="unknown"} 0
nmxm_domain_health_count{base_type="Partition",name="base",parameter="unhealthy"} 0
nmxm_domain_health_count{base_type="Partition",name="base",parameter="healthy"} 7
nmxm_domain_health_count{base_type="Partition",name="base",parameter="degraded"} 0
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="degraded"} 0
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="unknown"} 0
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="healthy"} 491
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="nonvlink"} 1
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="degraded_bw"} 0
nmxm_gpus_count{base_type="Partition",name="base"} 492
nmxm_ports_count{base_type="Partition",name="base"} 15552
nmxm_switch_health_count{base_type="Partition",name="base",parameter="missing_nvlink"} 54
nmxm_switch_health_count{base_type="Partition",name="base",parameter="unknown"} 0
nmxm_switch_health_count{base_type="Partition",name="base",parameter="healthy"} 72
nmxm_switch_health_count{base_type="Partition",name="base",parameter="unhealthy"} 0
nmxm_switch_nodes_count{base_type="Partition",name="base"} 63
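The health and allocation metrics are broken down by the parameter label (for example, healthy, degraded, or unknown). The total-count metrics, such as nmxm_gpus_count, have no parameter label.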
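To work with these values per state, the samples can be parsed and grouped by the parameter label. The following is a minimal Python sketch under a few assumptions: label values contain no escaped quotes, and only one partition (name="base") is present, so grouping by metric name and parameter is unambiguous. The function names and the "total" key are illustrative.

```python
import re
from collections import defaultdict

# metric_name{label="value",...} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples for every sample line in text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def by_parameter(samples):
    """Group values as {metric name: {parameter: value}}."""
    grouped = defaultdict(dict)
    for name, labels, value in samples:
        # Metrics without a parameter label are filed under the key "total".
        grouped[name][labels.get("parameter", "total")] = value
    return dict(grouped)


# GPU lines copied from the example output above.
GPU_SAMPLE = """\
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="degraded"} 0
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="unknown"} 0
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="healthy"} 491
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="nonvlink"} 1
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="degraded_bw"} 0
nmxm_gpus_count{base_type="Partition",name="base"} 492
"""

grouped = by_parameter(parse_samples(GPU_SAMPLE))
gpu_health = grouped["nmxm_gpu_health_count"]
print("healthy GPUs:", gpu_health["healthy"])
print("sum of states:", sum(gpu_health.values()))
print("total GPUs:", grouped["nmxm_gpus_count"]["total"])
```

In this sample the per-state GPU counts add up to nmxm_gpus_count. That is an observation about this particular output rather than a documented guarantee, and it does not hold for every metric pair shown above (the switch health counts sum to 126 while nmxm_switch_nodes_count is 63).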