NVLink Management Software (NMX + NetQ)

NVIDIA NVLink Management Software (NetQ) is an integrated platform for managing and monitoring NVLink Switches, Domains, and Partitions.

It includes the following main components:

NetQ NVLink (previously NMX-M)
NMX-Telemetry (NMX-T)
NMX-Controller (NMX-C)

To manage NVLink partitions, send gRPC API calls to NMX-C or use the NVOS CLI on the NVLink switch.

The following example uses the NVOS CLI on a leader switch that runs NMX-C and NMX-T.

To view the processes on the leader switch, run nv show cluster app:

admin@a07-p1-nvsw-01-eth1.ipminet2.cluster:~$ nv show cluster app

Name            ID             Version                 Capabilities                                         Components Version                                                Status  Reason  Additional Information          Summary
--------------  -------------  ----------------------  ---------------------------------------------------  ----------------------------------------------------------------  ------  ------  ------------------------------  -------
nmx-controller  nmx-c-nvos     1.0.0_2025-04-11_15-23  sm, gfm, fib, gw-api                                 sm:2025.03.6, gfm:R570.133.15, fib-fe:1.0.1                       ok              CONTROL_PLANE_STATE_CONFIGURED
nmx-telemetry   nmx-telemetry  1.0.4                   nvl telemetry, gnmi aggregation, syslog aggregation  nvl-telemetry:1.20.4, gnmi-aggregator:1.4.1, nmx-connector:1.4.1  ok

To check the NVLink partition configuration, run nv show sdn partition:

admin@NVSWITCH-1:~$ nv show sdn partition

ID     Name                Num of GPUs   Health   Resiliency mode        Multicast groups limit   Partition type   Summary
-----  -----------------   -----------   -------  ------------------     ----------------------   --------------   -------
32766  Default Partition   72            healthy  adaptive_bandwidth     1024                     gpuuid_based

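To see the available commands for creating or modifying partitions, filter the command list. The boot-next entry in the output matches the filter only because it refers to system image partitions; it is unrelated to NVLink partitions.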

admin@NVSWITCH-1:~$ nv list-commands | grep partition

nv show sdn partition
nv show sdn partition <partition-id>
nv show sdn partition <partition-id> location
nv show sdn partition <partition-id> uuid
nv action boot-next system image (partition1|partition2)
nv action update sdn partition <partition-id> [reroute]
nv action update sdn partition <partition-id> location <location-id> [no-reroute]
nv action update sdn partition <partition-id> uuid <uuid> [no-reroute]
nv action create sdn partition <partition-id> [name <value>] [resiliency-mode (full_bandwidth|adaptive_bandwidth|user_action)] [mcast-limit (0-1024)] [location <location-id>] [uuid (0-18446744073709551615)]
nv action delete sdn partition <partition-id>
nv action restore sdn partition <partition-id> location <location-id> [no-reroute]
nv action restore sdn partition <partition-id> uuid <uuid> [no-reroute]
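
As an illustration of the syntax listed above, the following sketch walks through creating, inspecting, and deleting a partition. The partition ID 100 and the name demo are hypothetical values chosen for this example, and the option values are taken only from the ranges shown in the command list. Check the NVOS documentation for valid IDs and for how GPUs are assigned by location or uuid. The run helper is also an assumption of this sketch: it prints each command by default and executes it only when DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical values for illustration only; not taken from a real cluster.
PARTITION_ID=100
PARTITION_NAME=demo

# Print each command instead of running it unless DRY_RUN=0 is set,
# so the sequence can be reviewed before it touches a switch.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

partition_lifecycle() {
  # Create the partition with options from the listed syntax.
  run nv action create sdn partition "$PARTITION_ID" name "$PARTITION_NAME" resiliency-mode adaptive_bandwidth mcast-limit 1024
  # Inspect the new partition.
  run nv show sdn partition "$PARTITION_ID"
  # Remove it again.
  run nv action delete sdn partition "$PARTITION_ID"
}

partition_lifecycle
```

Run as-is, the script only prints the three nv commands. Setting DRY_RUN=0 on the leader switch would execute them.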

For more details on managing NVLink switches, see the NVOS and NetQ documentation on the NVLink Networking documentation page.

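NetQ NVLink provides centralized monitoring of NVLink switches, with a REST API endpoint and a Prometheus scrape endpoint per rack. BCM also integrates with NetQ and collects a subset of its metrics. The connection settings that BCM uses are in the nmxmsettings submode of the base partition in cmsh: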

[a03-p1-head-01]% partition

[a03-p1-head-01->partition[base]]% nmxmsettings

[a03-p1-head-01->partition[base]->nmxmsettings]% show

Parameter                       Value
-------------------------------- ------------------------------------------------
Revision
Server                          7.241.0.145
User name                       rw-user
Password                        ********
Port                            443
Verify SSL                      no
CA certificate
Certificate
Private key
Prometheus metric forwarders     <0 in submode>

Example: Update the Server and Password fields in nmxmsettings.

[a03-p1-head-01->partition[base]->nmxmsettings]% set Server 7.241.0.144

[a03-p1-head-01->partition*[base*]->nmxmsettings*]% set Password ********

[a03-p1-head-01->partition*[base*]->nmxmsettings*]% commit

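BCM exposes the counts from NetQ's KPI endpoint as Prometheus-format metrics on its exporter endpoint. To list the metric names, run the following on the head node: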

curl -sk https://localhost:8081/exporter | grep -Ev '# HELP|# TYPE' | grep -E '^nmxm' | cut -d '{' -f1 | sort -u

Example output:

nmxm_chassis_count
nmxm_compute_allocation_count
nmxm_compute_health_count
nmxm_compute_nodes_count
nmxm_domain_health_count
nmxm_gpu_health_count
nmxm_gpus_count
nmxm_ports_count
nmxm_switch_health_count
nmxm_switch_nodes_count

To retrieve detailed metric values:

for i in $(curl -sk https://localhost:8081/exporter | grep -Ev '# HELP|# TYPE' | grep -E '^nmxm' | cut -d '{' -f1 | sort -u); do curl -sk https://localhost:8081/exporter | grep -Ev '# HELP|# TYPE' | grep "$i"; done

Example output:

nmxm_chassis_count{base_type="Partition",name="base"} 7
nmxm_compute_allocation_count{base_type="Partition",name="base",parameter="full"} 126
nmxm_compute_allocation_count{base_type="Partition",name="base",parameter="all"} 126
nmxm_compute_allocation_count{base_type="Partition",name="base",parameter="partial"} 0
nmxm_compute_allocation_count{base_type="Partition",name="base",parameter="free"} 54
nmxm_compute_health_count{base_type="Partition",name="base",parameter="unhealthy"} 1
nmxm_compute_health_count{base_type="Partition",name="base",parameter="unknown"} 0
nmxm_compute_health_count{base_type="Partition",name="base",parameter="healthy"} 122
nmxm_compute_health_count{base_type="Partition",name="base",parameter="degraded"} 0
nmxm_compute_nodes_count{base_type="Partition",name="base"} 123
nmxm_domain_health_count{base_type="Partition",name="base",parameter="unknown"} 0
nmxm_domain_health_count{base_type="Partition",name="base",parameter="unhealthy"} 0
nmxm_domain_health_count{base_type="Partition",name="base",parameter="healthy"} 7
nmxm_domain_health_count{base_type="Partition",name="base",parameter="degraded"} 0
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="degraded"} 0
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="unknown"} 0
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="healthy"} 491
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="nonvlink"} 1
nmxm_gpu_health_count{base_type="Partition",name="base",parameter="degraded_bw"} 0
nmxm_gpus_count{base_type="Partition",name="base"} 492
nmxm_ports_count{base_type="Partition",name="base"} 15552
nmxm_switch_health_count{base_type="Partition",name="base",parameter="missing_nvlink"} 54
nmxm_switch_health_count{base_type="Partition",name="base",parameter="unknown"} 0
nmxm_switch_health_count{base_type="Partition",name="base",parameter="healthy"} 72
nmxm_switch_health_count{base_type="Partition",name="base",parameter="unhealthy"} 0
nmxm_switch_nodes_count{base_type="Partition",name="base"} 63

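The health and allocation metrics are broken down by the parameter label (for example healthy, degraded, unhealthy, or unknown), while the plain count metrics carry only the base_type and name labels.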
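
One way to use the parameter label is to list every health sample that falls outside the healthy category and has a non-zero value. The sketch below assumes the exporter output has the shape shown in the example above. The function name nmxm_nonhealthy_counts is hypothetical, and the exact meaning of each category (such as nonvlink or missing_nvlink) is not defined here.

```shell
#!/bin/sh
# Read Prometheus text-format lines on stdin and print every
# nmxm_*_health_count sample whose parameter label is not "healthy"
# and whose value is greater than zero.
nmxm_nonhealthy_counts() {
  awk '
    /^nmxm_[a-z_]*_health_count[{]/ {
      # The sample value is the last whitespace-separated field.
      value = $NF
      # Strip the label set to get the bare metric name.
      metric = $1
      sub(/[{].*/, "", metric)
      # Extract the value of the parameter="..." label.
      if (match($0, /parameter="[^"]*"/)) {
        param = substr($0, RSTART + 11, RLENGTH - 12)
        if (param != "healthy" && value + 0 > 0)
          print metric, param, value
      }
    }
  '
}

# Usage on the head node (endpoint taken from the commands above):
#   curl -sk https://localhost:8081/exporter | nmxm_nonhealthy_counts
```

Applied to the example output above, this would report the unhealthy compute node, the nonvlink GPU, and the 54 missing_nvlink switch entries.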