Cluster Assistant for ConnectX-7 Multi-Node Clusters#
Overview#
The Cluster Assistant in NVIDIA Sync helps configure the high-bandwidth ConnectX-7 network on DGX Spark and GB10 devices. Use it when you want to connect multiple devices into a supported cluster topology and prepare them for node-to-node workloads.
The assistant configures the ConnectX-7 network and node-to-node SSH settings required for the cluster. It does not install, enforce, or configure a workload orchestration system. After the network is configured, you can choose the orchestration approach that best fits your workload.
Understand the ConnectX-7 Network#
DGX Spark and GB10 systems include a high-speed ConnectX-7 network interface that connects each Quad Small Form-factor Pluggable (QSFP) port to the Grace-Blackwell SoC over PCIe. Each QSFP port provides 200 Gbps, but to achieve this speed the systems use an atypical setup of two PCIe 5.0 x4 links instead of a single PCIe 5.0 x8 link.
In a typical server design, the ConnectX-7 bridges a QSFP port to a single PCIe 5.0 x8 link, which carries the full 200 Gbps from the QSFP port through to the rest of the hardware.
The DGX Spark and GB10 form factor instead replaces that single x8 link with two PCIe 5.0 x4 links, which together deliver the 200 Gbps connection to the Grace-Blackwell SoC.
This setup is not standard. Each QSFP port appears as a pair of Linux network interfaces, with corresponding Remote Direct Memory Access over Converged Ethernet (RoCE) devices. The layout can be confusing the first time you encounter it, and proper configuration can be error-prone.
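The lane arithmetic behind the two-link design can be checked with a short calculation. This is a rough sketch: the 32 GT/s per-lane rate and 128b/130b encoding are general PCIe 5.0 properties, the function name `pcie5_gbps` is made up for this example, and packet overhead is ignored, so real throughput is somewhat lower.

```shell
# Approximate usable line rate in Gbps for a PCIe 5.0 link with N lanes.
# 32 GT/s per lane, 128b/130b encoding; packet (TLP) overhead is ignored.
pcie5_gbps() {
    awk -v lanes="$1" 'BEGIN { printf "%.1f\n", 32 * lanes * 128 / 130 }'
}

pcie5_gbps 4    # one x4 link
pcie5_gbps 8    # a single x8 link, equal to two x4 links combined
```

A single x4 link works out to about 126 Gbps, which is below the 200 Gbps of a QSFP port. Two x4 links together come to about 252 Gbps, which is why each port is split across a pair of links and appears as a pair of interfaces.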
When you connect two devices through the QSFP ports, you must set up two separate subnets and match Linux interfaces across devices on those subnets, with exactly one interface per device on a given subnet:
Assign an IP address to each of the two Linux interfaces for the QSFP port on each device.
Create a subnet and place corresponding Linux interfaces across devices on it.
Refer to the following diagram:
DGX Spark A                                   DGX Spark B
QSFP Port 1       <====== cable ======>       QSFP Port 1
  iface A0        <------------------->         iface B0
    subnet 10.100.0.0/24                          subnet 10.100.0.0/24
  iface A1        <------------------->         iface B1
    subnet 10.100.1.0/24                          subnet 10.100.1.0/24
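As an illustration of the addressing in the diagram, the following sketch prints one address per subnet for a given node. The two subnets come from the diagram. The host numbering (node 1 gets `.1`, node 2 gets `.2`) and the function name `plan_addresses` are assumptions made for this example, and the Cluster Assistant may choose different addresses.

```shell
# Print the two addresses for one node's QSFP port, one per subnet.
# Subnets follow the diagram; the host numbering is an assumption.
plan_addresses() {
    node="$1"
    for link in 0 1; do
        echo "iface ${link}: 10.100.${link}.${node}/24"
    done
}

plan_addresses 1   # DGX Spark A
plan_addresses 2   # DGX Spark B
```

Each node ends up with exactly one address on `10.100.0.0/24` and one on `10.100.1.0/24`, so matching interfaces on the two devices share a subnet.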
The NVIDIA Sync Cluster Assistant handles this for the supported network topologies between DGX Spark and GB10 devices. This includes detection, inspection, and configuration for the fabric and the device-to-device SSH connections, as well as applying the Netplan configuration and verifying connectivity and performance.
Prerequisites#
Before you start Cluster Assistant:
Complete the initial setup on each DGX Spark, including network setup, installation of available system updates, and creation of at least one interactive user account.
Add every DGX Spark that will join the cluster to NVIDIA Sync on the Devices tab. Refer to Adding a Device for a Direct Connection.
Connect QSFP cabling for the layout you intend to use (refer to Supported ConnectX-7 Layouts). If cabling is missing or incorrect, the assistant prompts you when it validates topology.
Ensure the computer running NVIDIA Sync is on the same LAN as the DGX Spark systems (Ethernet or Wi-Fi) so Sync can SSH into each device while you run the workflow.
Readiness checks require every node to report a supported DGX Spark (GB10) system and an OTA version of April 2026 or later. If a node is behind or mismatched, update it and run the checks again.
The workflow also verifies that NVIDIA Sync can open SSH to each device and that the remote user can use sudo. If passwordless sudo is not configured, use the on-screen control to enter your account password. NVIDIA Sync validates it once and continues with steps that require elevated privileges.
Supported ConnectX-7 Layouts#
Cluster Assistant supports the following validated configurations for two to four nodes:
Two DGX Spark systems (direct connection)
One QSFP cable connects the two systems on one ConnectX-7 port pair.
Both DGX Spark systems remain on the same LAN as the NVIDIA Sync computer. Internet access continues over 10 Gb Ethernet or Wi-Fi on each DGX Spark.
A dedicated QSFP switch is not required.
Three DGX Spark systems in a ring (direct connection)
Each DGX Spark uses two QSFP ports so it can connect directly to the other two DGX Spark systems. The three QSFP cables form a closed loop (each machine has two ConnectX-7 neighbors).
All DGX Spark systems are on the same LAN as the NVIDIA Sync computer, with Internet access over 10 Gb Ethernet or Wi-Fi.
A dedicated QSFP switch is not required.
Two, three, or four DGX Spark systems on a QSFP switch
Each DGX Spark connects to a managed QSFP switch with enough 200 Gbps-class ports for every node.
Management traffic to NVIDIA Sync can use ConnectX-7, Ethernet, or Wi-Fi on each DGX Spark. If the DGX Spark systems rely on the switch for LAN or Internet access, the switch must provide appropriate addressing (for example, DHCP on the switch network if required by your design).
Any topology outside this supported set must be configured and verified manually. For example, an eight-device switch topology is not configured by the Cluster Assistant.
ConnectX-7 Cabling Guidelines#
Follow these guidelines when you connect QSFP cables for supported topologies:
Use only one QSFP cable from each device to another device or to the switch. Adding a second cable does not increase throughput.
For a two-device direct connection, connect the devices with one QSFP cable, using the same port on each device. This provides the maximum throughput supported by the GB10 chip.
For a three-device ring topology, use three QSFP cables total. Each device connects to the other two devices with one cable per link.
For switch-based topologies, connect each device to the switch with one QSFP cable.
You can use Wi-Fi or the 10 Gb Ethernet connection for Internet access and SSH access from your primary system. The ConnectX-7 network is configured as a separate high-bandwidth node-to-node network.
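The cable counts in these guidelines can be summarized in a small helper, which is handy as a checklist before running the assistant. This is only a sketch: the function name `expected_cables` and the layout labels `direct`, `ring`, and `switch` are invented for this example and are not NVIDIA Sync terms.

```shell
# Expected number of QSFP cables for a supported layout.
# Layout labels (direct, ring, switch) are names chosen for this sketch.
expected_cables() {
    layout="$1"; nodes="$2"
    case "$layout:$nodes" in
        direct:2)      echo 1 ;;          # one cable between the two devices
        ring:3)        echo 3 ;;          # one cable per link in the loop
        switch:[2-4])  echo "$nodes" ;;   # one cable from each device to the switch
        *) echo "unsupported layout: $layout with $nodes nodes" >&2; return 1 ;;
    esac
}

expected_cables direct 2
expected_cables ring 3
expected_cables switch 4
```

Any combination outside the supported set, such as `expected_cables switch 8`, returns a nonzero status, mirroring the rule that such topologies must be configured manually.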
After Cluster Setup#
After the Cluster Assistant completes successfully, the devices have a configured ConnectX-7 network and node-to-node SSH settings. You can then run the workload or orchestration system of your choice.
NVIDIA Sync does not provide a workload orchestration system and does not require a specific orchestration tool.
There is a current limitation when using the NVIDIA Collective Communications Library (NCCL) with three devices in a ring topology. You must manually build NCCL for this configuration.
For getting-started examples, refer to the DGX Spark developer site.
Inspect and Verify the ConnectX-7 Cluster Network Plan#
After Cluster Assistant completes, the ConnectX-7 network plan is stored in Netplan on each node.
To inspect what was configured, verify runtime state with ip commands, test connectivity, or safely remove the cluster network plan, refer to Inspect and Verify a ConnectX-7 Cluster Network Plan.
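For a quick look before turning to the referenced page, a sketch along the following lines shows the runtime state and tests one link. The function names are invented for this example, and the interface name and peer address are placeholders you supply. The referenced page remains the authoritative procedure.

```shell
# Show configured addresses and link state in brief form.
show_cluster_links() {
    ip -br addr show
    ip -br link show
}

# Send 3 pings to a peer through a specific local interface.
# $1 = local interface name, $2 = peer address (both supplied by you).
ping_peer() {
    ping -c 3 -I "$1" "$2"
}

# Example (interface name and peer address are placeholders):
# show_cluster_links
# ping_peer <local-iface> <peer-address>
# sudo netplan get    # print the merged Netplan configuration
```

Binding the ping to one interface with `-I` confirms that a specific link works, rather than whichever route the kernel would pick by default.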
Delete a Cluster#
Deleting a cluster removes the node-to-node SSH configuration and deletes the cluster relationship in NVIDIA Sync.
To change a cluster’s topology (for example, to add another node), delete the cluster and then create a new one.
Delete a cluster from the NVIDIA Sync interface:
Open Settings.
Select the Clusters tab.
Select the cluster you want to delete.
Open the overflow menu (⋯) for that cluster.
Select Delete.
Troubleshooting#
Use this section to resolve common Cluster Assistant checks and configuration issues.
Validate System Readiness#
The Cluster Assistant checks whether each device is reachable and ready before it configures the cluster.
SSH Check Fails#
If the SSH check fails:
Verify that all devices are powered on.
Verify that all devices are connected to your network.
Verify that the system running NVIDIA Sync is on the same network as the devices.
Verify that the system running NVIDIA Sync can SSH directly to each device.
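The last check can be scripted from the system running NVIDIA Sync. The sketch below assumes key-based SSH login. The function names and the example hostnames are hypothetical. If you log in with a password, remove `BatchMode=yes` so that `ssh` can prompt.

```shell
# Succeed if a non-interactive SSH login works within 5 seconds.
# BatchMode=yes makes ssh fail instead of prompting for a password.
can_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Report the result for each target (user@host strings you supply).
check_nodes() {
    for target in "$@"; do
        if can_ssh "$target"; then
            echo "$target: ok"
        else
            echo "$target: FAILED"
        fi
    done
}

# Example (hostnames are hypothetical):
# check_nodes user@spark-a.local user@spark-b.local
```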
GB10 Check Fails#
If the GB10 check fails, the selected device is not recognized as a DGX Spark or GB10 system. DGX Spark or GB10 hardware is required for the Cluster Assistant feature.
Remove any unsupported device before continuing with cluster setup.
Software Version Check Fails#
If the software version check fails, update the system software on each device before continuing.
All DGX Spark or GB10 devices in the cluster must run the April 2026 system software release or later.
The easiest way to update is through the DGX Dashboard. To update manually, follow the DGX Spark user guide.
Password Check Fails#
If the password check fails, verify that each device is configured with the permissions required by NVIDIA Sync.
NVIDIA Sync needs your password to configure network settings during setup. Select Fix Now and enter your password when prompted. NVIDIA Sync stores the password in memory only for temporary use during the setup process, then discards it.
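To see in advance whether a device will need the password prompt, you can test for passwordless sudo on that device. This is a minimal sketch with a made-up function name. Note that `sudo -n` also succeeds while a recently entered password is still cached, so run it in a fresh session.

```shell
# Succeed only if sudo can run a command without prompting.
# -n (non-interactive) makes sudo fail rather than ask for a password.
has_passwordless_sudo() {
    sudo -n true 2>/dev/null
}

# Example, run on the device itself:
# if has_passwordless_sudo; then
#     echo "passwordless sudo: yes"
# else
#     echo "passwordless sudo: no (NVIDIA Sync will ask for your password)"
# fi
```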
Verify User Details#
Consistent usernames, user IDs (UIDs), and group IDs (GIDs) are not required for a cluster, but they can make the cluster easier to use.
If usernames differ across nodes, home directory paths will also differ. This can make scripts, file paths, and manual SSH commands harder to manage. When you SSH between nodes, use the SSH aliases generated by NVIDIA Sync because the aliases already include the correct username for each node.
If UID or GID values differ across nodes, shared files can appear to have unexpected ownership depending on your workflow and storage configuration.
If you want usernames, UIDs, and GIDs to match, make this change before creating the cluster when possible. The simplest approach is to create matching user accounts on all nodes, then add the devices to NVIDIA Sync using those accounts.
Make User Details Consistent on Ubuntu 24.04#
Before creating new accounts, choose a username, UID, and primary GID that are not already in use on any node.
On each node, check whether the UID or GID is already assigned. If a command prints nothing, that ID is unused on the node:
getent passwd <uid>
getent group <gid>
Create the group:
sudo groupadd --gid <gid> <group-name>
Create the user with the selected UID and primary GID:
sudo adduser --uid <uid> --gid <gid> <username>
If the account needs administrator privileges for your setup, add it to the sudo group:
sudo usermod -aG sudo <username>
Then sign in with the new account, configure SSH access as needed, re-add the devices to NVIDIA Sync using the matching accounts, and run the Cluster Assistant again.
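The availability check at the start of this procedure can be wrapped in small helpers so that it is easy to repeat on every node. This is a sketch only: the function names are invented for this example, and the candidate ID `2000` is an arbitrary value, not a recommendation.

```shell
# Succeed if the given UID (or username) already exists on this node.
uid_in_use() {
    getent passwd "$1" >/dev/null
}

# Succeed if the given GID (or group name) already exists on this node.
gid_in_use() {
    getent group "$1" >/dev/null
}

# Example: 2000 is an arbitrary candidate ID chosen for this sketch.
candidate=2000
uid_in_use "$candidate" && echo "UID $candidate is taken" || echo "UID $candidate is free"
gid_in_use "$candidate" && echo "GID $candidate is taken" || echo "GID $candidate is free"
```

Run the same check on every node before creating accounts. An ID is only safe to use if it is free on all of them.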
Set Network Configuration#
If the detected ConnectX-7 network topology is incorrect or not what you expected:
Verify that all ConnectX-7 cables are fully seated.
For a two-device direct connection, verify that only one QSFP cable connects the devices.
For two-, three-, or four-device switch topologies, verify that each device has one QSFP cable connected to the switch.
For a three-device ring topology, verify that each device connects to the other two devices, using three QSFP cables total.
If the topology is still not detected correctly, reboot the devices with the ConnectX-7 cables connected, then try again.
For Netplan-level inspection, verification, and removal of the cluster network plan, refer to Inspect and Verify a ConnectX-7 Cluster Network Plan.
If port speed is not reported as the expected 200 Gbps:
Verify that you are using a supported QSFP112 DAC, 400 GbE, Ethernet-mode-only cable.
If you are using a switch, verify that the switch is negotiating the correct port speed.
If the switch is negotiating the wrong speed, update the port configuration in the switch administrator console.
Supported cables include:
Amphenol NJAAKK-N911
Luxshare LMTQF022-SD-R
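To read the negotiated speed directly on a device, the kernel exposes it per interface in sysfs, in Mb/s. The sketch below uses a hypothetical function name. The second argument exists only so the function can be tried without real hardware. Because each QSFP port appears as a pair of interfaces, check both. This document does not state what value each interface of the pair reports.

```shell
# Print the negotiated speed in Mb/s for an interface, read from sysfs.
# $1 = interface name; $2 = optional sysfs root (defaults to /sys/class/net).
link_speed_mbps() {
    cat "${2:-/sys/class/net}/$1/speed"
}

# Example (replace <iface> with a name from `ip -br link`):
# link_speed_mbps <iface>
# ethtool <iface> | grep Speed
```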
Check Network Performance#
A bandwidth or latency warning means that measured performance is below the optimal range for your topology. It does not necessarily mean the cluster is nonfunctional.
Performance problems can be temporary or caused by other equipment. Try the following to determine whether performance improves:
Reboot the devices with the ConnectX-7 cables connected, then run the speed test again.
Check that the cables are fully seated, and verify that you are using one of the supported QSFP112 DAC, 400 GbE, Ethernet-mode-only cables:
Amphenol NJAAKK-N911
Luxshare LMTQF022-SD-R
If you are using a switch, check the switch vendor documentation and configuration for possible issues.