Spark Clustering#
Connecting DGX Spark Systems into a Virtual Cluster#
Overview#
This guide explains how to connect two DGX Spark systems into a virtual compute cluster using simplified networking configuration and a QSFP/CX7 cable for high-performance interconnect.
The goal is to enable distributed workloads across Grace Blackwell GPUs using MPI (for inter-process CPU communication) and NCCL v2.28.3 (for GPU-accelerated collective operations).
System Requirements#
Before you begin, ensure the following:
Both DGX Spark systems have Grace Blackwell GPUs, are connected to each other using a QSFP/CX7 cable, and are running Ubuntu 24.04 (or later) with NVIDIA drivers installed
The systems have internet access for initial software setup
You have sudo/root access on both systems
Setup Networking Between Spark Systems#
Option 1: Automatic IP assignment (recommended)#
Follow these steps on both DGX Spark nodes to configure the network interfaces using netplan:
Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      link-local: [ ipv4 ]
    enp1s0f1np1:
      link-local: [ ipv4 ]
EOF
Set appropriate permissions on the configuration file
sudo chmod 600 /etc/netplan/40-cx7.yaml
Apply the netplan configuration
sudo netplan apply
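After applying the configuration, you can confirm that each CX7 port received an IPv4 link-local address (the interface names below match the example configuration above; adjust them if your ports are enumerated differently):
ip -4 addr show enp1s0f0np0
ip -4 addr show enp1s0f1np1
Each interface should report an address in the 169.254.0.0/16 link-local range.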
Option 2: Manual IP assignment (advanced)#
Follow these steps to manually assign IP addresses for dedicated cluster networking.
On Node 1:
Assign a static IP address and bring up the interface
sudo ip addr add 192.168.100.10/24 dev enP2p1s0f1np1
sudo ip link set enP2p1s0f1np1 up
On Node 2:
Assign a static IP address and bring up the interface
sudo ip addr add 192.168.100.11/24 dev enP2p1s0f1np1
sudo ip link set enP2p1s0f1np1 up
Verify connectivity:
From Node 1, test connection to Node 2:
ping -c 3 192.168.100.11
From Node 2, test connection to Node 1:
ping -c 3 192.168.100.10
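If the ping fails, confirm that the CX7 interface is up and carries the address you assigned (interface name as used in the commands above):
ip addr show enP2p1s0f1np1
ethtool enP2p1s0f1np1
The interface should be in state UP, and ethtool should report a detected link.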
Run the DGX Spark Discovery Script#
This step automatically identifies the interconnected DGX Spark systems and sets up passwordless SSH authentication between them.
On both nodes:
Run the discovery script
./discover-sparks
Example output:
Found: 192.168.100.10 (spark-1b3b.local)
Found: 192.168.100.11 (spark-1d84.local)
Copying your SSH public key to all discovered nodes using ssh-copy-id.
You may be prompted for your password on each node.
Copying SSH key to 192.168.100.10 ...
Copying SSH key to 192.168.100.11 ...
nvidia@192.168.100.11's password:
SSH key copy process complete.
These two sparks can now talk to each other.
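To confirm that passwordless SSH works, run a remote command from one node to the other (the nvidia user and IP address below come from the example output; substitute your own):
ssh nvidia@192.168.100.11 hostname
The remote hostname should print without a password prompt.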
Install Required Software#
To support distributed workloads, both systems must have MPI (for CPU process communication) and NCCL (for GPU collective communication) installed.
Install MPI (OpenMPI)
MPI allows distributed processes across systems to communicate.
sudo apt update
sudo apt install -y openmpi-bin libopenmpi-dev
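With OpenMPI installed on both systems, a quick sanity check is to launch hostname on each node over MPI (IP addresses from the manual-assignment example; adjust to match your setup):
mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 hostname
Each node should print its own hostname.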
Install NCCL v2.28.3 (or later)
NCCL provides fast GPU-to-GPU communication over QSFP/CX7 and supports multi-rail socket interfaces. The commands below illustrate the package-based install; the example package names reference an older x86_64 build, so download the packages that match your platform, CUDA version, and the NCCL release you need (v2.28.3 or later) from https://developer.nvidia.com/nccl/nccl-download.
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64/libnccl2_2.8.3-1+cuda11.2_amd64.deb
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64/libnccl-dev_2.8.3-1+cuda11.2_amd64.deb
sudo dpkg -i libnccl2_2.8.3*.deb libnccl-dev_2.8.3*.deb
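After installation, you can confirm that the NCCL runtime and development packages are registered with the package manager:
dpkg -l | grep nccl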
Build the NCCL test suite
These tools verify GPU-to-GPU communication and measure performance across nodes.
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests
make MPI=1
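Before the two-node run, you can verify the build locally with a single-GPU all-reduce on one node (same flags as the multi-node test below, but limited to 128M):
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1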
Run a Test Workload#
Now that networking and software are configured, run an all-reduce benchmark to test NCCL and MPI across nodes.
Run the test from either machine
This example sets the LD_LIBRARY_PATH environment variable to include the locations of the OpenMPI and CUDA libraries. Adjust this path as necessary to match your system configuration.
export LD_LIBRARY_PATH='/usr/local/openmpi/lib:/usr/local/cuda/lib64'
mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
  -bind-to none -map-by slot \
  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
  -mca oob_tcp_if_exclude docker0,lo,virbr0,bmc_redfish0,wlP9s9 \
  -mca btl_tcp_if_exclude docker0,lo,virbr0,bmc_redfish0,wlP9s9 \
  ./build/all_reduce_perf -b 8 -e 512M -f 2 -g 1
This runs an all-reduce collective using one GPU on each system and reports bandwidth and latency.
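If NCCL chooses the wrong network interface for inter-node traffic, you can pin it explicitly with the NCCL_SOCKET_IFNAME environment variable by adding it alongside the other -x flags in the mpirun command (the interface name below is the one from the manual-assignment example; substitute your own):
-x NCCL_SOCKET_IFNAME=enP2p1s0f1np1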
Troubleshooting#
Ensure the QSFP/CX7 interface is active and used for IP assignment
Verify connectivity between nodes via ping
Use nvidia-smi to confirm GPU status
Use NCCL_DEBUG=INFO to show detailed diagnostics during the test
Check your interface bindings with ip a and ethtool
If the discovery script fails, manually verify SSH connectivity between nodes
For additional troubleshooting guidance and support options, see Maintenance and Troubleshooting
Next Steps#
Once tested, this configuration can be scaled to support:
Job orchestration with Slurm or Kubernetes
Containerized execution with Singularity or Docker