Inspect and Verify a ConnectX-7 Cluster Network Plan#
Overview#
After the Cluster Assistant completes successfully, the DGX Spark or GB10 devices have a configured ConnectX-7 network and node-to-node SSH connectivity for the selected cluster topology.
The Cluster Assistant configures the underlying network only. It does not configure distributed workloads, orchestration systems, or workload runtimes such as the NVIDIA Collective Communications Library (NCCL) or vLLM.
Before you configure those workloads, inspect the active ConnectX-7 network configuration, the configured interfaces and IP ranges, and the expected node-to-node connectivity between cluster members.
This section provides basic information to help you and your agent understand the network that the Cluster Assistant created. NVIDIA Sync provides the relevant information in the UI, but you can also inspect the network manually when needed. For setup steps, cabling models, and assistant troubleshooting, refer to Cluster Assistant for ConnectX-7 Multi-Node Clusters.
Key Concepts#
Some key items to keep in mind:
The ConnectX-7 network is separate from the normal management network. Internet access and NVIDIA Sync access can continue over Wi-Fi or 10 Gb Ethernet while ConnectX-7 is used for high-bandwidth node-to-node traffic.
The IP addresses assigned by the Cluster Assistant are private to the ConnectX-7 network. Do not confuse them with the IP addresses on the Wi-Fi or Ethernet management network.
A single physical QSFP link can involve more than one Linux Ethernet-style interface. These interfaces can represent different functions or subinterfaces exposed by the ConnectX-7 device driver and firmware.
Important Hardware, Software, and Files#
When you inspect the ConnectX-7 cluster network manually, you will use or refer to the following hardware, software, and files:
QSFP link — Quad Small Form-factor Pluggable (QSFP) link is a high-speed networking connection used on DGX Spark and GB10 devices. Refer to Terminology for the Quad Small Form-factor Pluggable (QSFP) entry.
Netplan — Linux network configuration layer that provides a declarative abstraction over low-level network configuration. Netplan reads YAML files and generates network configurations for network management layers such as NetworkManager.
NetworkManager — Runtime network management service that applies and manages the active network configuration for different interfaces, for example Wi-Fi, Ethernet, and the ConnectX-7 network.
Netplan configuration files — Files in /etc/netplan/ that define the high-level network configuration. Multiple files typically exist, and Netplan merges them into a single network configuration. Numeric prefixes in filenames set processing order and precedence.
ConnectX-7 network Netplan file — The Netplan file created by the Cluster Assistant for the managed cluster. The name is 99-nvidia-sync-cluster.yaml.
Inspect the Cluster Configuration on a Single Node#
The ConnectX-7 network is defined at a high level by the network interfaces the cables are connected to and the related static IP ranges assigned by the Cluster Assistant.
You can find them in the Netplan file 99-nvidia-sync-cluster.yaml.
On a single node, inspect the cluster Netplan file as follows:
Confirm that the file is in the /etc/netplan directory:
sudo ls /etc/netplan/99-nvidia-sync-cluster.yaml
Display the file contents in the terminal:
sudo cat /etc/netplan/99-nvidia-sync-cluster.yaml
Scan the ethernets entries for the device name, for example enp1s0f0np0, and the assigned address range.
Some items to keep in mind:
Your DGX Spark or GB10 device has two QSFP links that are both configured and appear in the cluster Netplan file.
The configured cluster does not always use both of those links. To verify which link is in use, inspect the network runtime information.
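As a rough aid for the scan step above, the following shell sketch prints each interface name next to its assigned address. The function name list_cluster_addresses is hypothetical and is not part of the Cluster Assistant. The parsing assumes a common Netplan layout (interface keys indented four spaces under ethernets, addresses written as unquoted IPv4 list items), which might not match the file on your node, so treat it as a convenience rather than a replacement for reading the file.

```shell
# Hypothetical helper (not part of the Cluster Assistant): print
# "<interface> <address>" pairs from a Netplan file.
# Assumes interface keys are indented four spaces under "ethernets:" and
# addresses are unquoted IPv4 list items such as "- <ip>/<prefix>".
list_cluster_addresses() {
  awk '
    # A key at exactly four spaces of indentation is treated as an interface name.
    /^    [^ #-][^:]*:[ \t]*$/ { gsub(/[ \t:]/, ""); iface = $0; next }
    # An IPv4 list item is reported with the most recent interface name.
    /^[ \t]*-[ \t]*[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\/[0-9]+/ {
      addr = $0
      gsub(/^[ \t]*-[ \t]*/, "", addr)
      print iface, addr
    }
  ' "$1"
}

# Usage (run as root if the file is not readable by your user):
#   list_cluster_addresses /etc/netplan/99-nvidia-sync-cluster.yaml
```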
Inspect the Network Runtime State on a Single Node#
The Netplan file 99-nvidia-sync-cluster.yaml describes the intended ConnectX-7 cluster. You can verify that it is active by comparing it to the running interfaces and IP addresses.
On a single node, inspect the runtime state as follows:
Show the detected network interfaces and status on the device:
ip -br link
Find the interface names that appear in the Netplan file. The interface should show LOWER_UP to indicate that the physical ConnectX-7 link is active.
Show the assigned IP addresses:
ip -br addr
Find the interface names from the Netplan file. Check that the IP address ranges match those in the Netplan file.
Some items to keep in mind:
You need to correlate the interface names from the Netplan file with the output from the ip commands to verify the assigned IP ranges.
Interfaces show LOWER_UP or NO-CARRIER. LOWER_UP indicates an active ConnectX-7 physical connection. NO-CARRIER indicates that no ConnectX-7 physical connection is detected.
Depending on the configured topology, some ConnectX-7 interfaces can be inactive.
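The correlation between Netplan interface names and link state can be sketched as a small filter over the ip -br link output. The function name link_state and its comma-separated argument are hypothetical, as are the words it prints (link-up, no-carrier, unknown). Pass the interface names you found in the cluster Netplan file.

```shell
# Hypothetical helper: read `ip -br link` output on stdin and report the
# link state of the interfaces named in the comma-separated first argument.
link_state() {
  awk -v want="$1" '
    BEGIN {
      n = split(want, names, ",")
      for (i = 1; i <= n; i++) wanted[names[i]] = 1
    }
    # The first column of `ip -br link` is the interface name.
    ($1 in wanted) {
      state = "unknown"
      if ($0 ~ /NO-CARRIER/) state = "no-carrier"
      else if ($0 ~ /LOWER_UP/) state = "link-up"
      print $1, state
    }
  '
}

# Usage, with interface names taken from the cluster Netplan file:
#   ip -br link | link_state enp1s0f0np0,<second-interface>
```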
Verify Node Connectivity for the Expected Topology#
After you verify the configured interfaces and IP ranges on each node, confirm that the nodes communicate over the ConnectX-7 network.
To verify connectivity on each node:
Compare the interface names and cluster IP ranges across nodes to identify the peer cluster IP addresses for your topology.
Ping each expected peer cluster IP address:
ping -c 3 <peer-cluster-ip>
Some items to keep in mind:
Each node shows only its local ConnectX-7 interfaces and assigned cluster IP ranges.
The peer cluster IP address is the IP address assigned to the corresponding interface on the connected node.
Multi-node configurations do not necessarily mean that every node or cluster IP is directly connected to every other node over the ConnectX-7 network.
Verify connectivity only between the peer cluster IPs expected for the configured topology.
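When a node has more than one expected peer, the ping step can be wrapped in a loop. The sketch below is one possible way to do that. The function name check_peers and the PING override variable are hypothetical, and the peer addresses are whatever you identified for your topology.

```shell
# Hypothetical helper: ping each expected peer cluster IP and summarize.
# Returns non-zero if any peer does not answer.
# PING can be set to another command for dry runs; it defaults to ping.
check_peers() {
  check_peers_failed=0
  for check_peers_ip in "$@"; do
    if "${PING:-ping}" -c 3 "$check_peers_ip" >/dev/null 2>&1; then
      echo "$check_peers_ip reachable"
    else
      echo "$check_peers_ip UNREACHABLE"
      check_peers_failed=1
    fi
  done
  return "$check_peers_failed"
}

# Usage, listing only the peers expected for the configured topology:
#   check_peers <peer-cluster-ip> <another-peer-cluster-ip>
```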
Remove the ConnectX-7 Netplan Configuration on an Individual Node#
If you want to remove the ConnectX-7 cluster network configuration created by the Cluster Assistant, remove the cluster Netplan file from /etc/netplan and then apply the updated network configuration.
Some items to keep in mind:
Removing the cluster Netplan file removes the ConnectX-7 network configuration, but it does not necessarily remove SSH aliases, SSH keys, or other node-to-node SSH settings created by the Cluster Assistant. To remove those settings as well, use the Delete Cluster action.
Be careful when you perform this procedure remotely. Make sure that your session runs over Wi-Fi, Ethernet, or Tailscale rather than the ConnectX-7 network, so that management access can continue after the configuration is removed.
To remove the ConnectX-7 Netplan configuration on one node, follow these steps:
Move the cluster Netplan file out of /etc/netplan:
sudo mkdir -p /root/netplan-disabled
sudo mv /etc/netplan/99-nvidia-sync-cluster.yaml /root/netplan-disabled/
Regenerate and apply the updated network configuration:
sudo netplan generate
sudo netplan try
Verify that the ConnectX-7 cluster IP ranges no longer appear:
ip -br addr
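The final check can also be scripted. The following sketch reads the ip -br addr output and succeeds only when no address begins with a given prefix. The function name cluster_range_absent and the prefix argument are hypothetical. Use the leading part of the cluster address range that you saw in the Netplan file before you moved it.

```shell
# Hypothetical helper: read `ip -br addr` output on stdin and succeed only
# if no listed address starts with the prefix given as the first argument.
cluster_range_absent() {
  awk -v prefix="$1" '
    # Columns 3 and later of `ip -br addr` hold the assigned addresses.
    { for (i = 3; i <= NF; i++) if (index($i, prefix) == 1) found = 1 }
    END { exit found ? 1 : 0 }
  '
}

# Usage, with the prefix of the former cluster address range:
#   ip -br addr | cluster_range_absent <cluster-ip-prefix> && echo "cluster range removed"
```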