Spark Clustering#

Connecting DGX Spark Systems into a Virtual Cluster#

Overview#

This guide explains how to connect two DGX Spark systems into a virtual compute cluster using a simplified networking configuration and a QSFP/CX7 cable for the high-performance interconnect.

The goal is to enable distributed workloads across Grace Blackwell GPUs using MPI (for inter-node process communication) and NCCL v2.28.3 or later (for GPU-accelerated collective operations).

System Requirements#

Before you begin, ensure the following:

  • Both DGX Spark systems have Grace Blackwell GPUs with NVIDIA drivers installed

  • Both systems are running Ubuntu 24.04 (or later)

  • The systems are connected to each other with a QSFP/CX7 cable

  • The systems have internet access for initial software setup

  • You have sudo/root access on both systems
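
A quick way to confirm these prerequisites on each node is shown below (a suggested check; the interface name enP2p1s0f1np1 is the QSFP/CX7 port used later in this guide and may differ on your systems):

    # Confirm the NVIDIA driver and GPU are visible
    nvidia-smi
    # Confirm the Ubuntu release
    lsb_release -a
    # Confirm the QSFP/CX7 interface is present
    ip link show enP2p1s0f1np1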

Setup Networking Between Spark Systems#

Option 2: Manual IP assignment (advanced)#

Follow these steps to manually assign IP addresses for dedicated cluster networking.

On Node 1:

  1. Assign a static IP address and bring up the interface

    sudo ip addr add 192.168.100.10/24 dev enP2p1s0f1np1
    sudo ip link set enP2p1s0f1np1 up
    

On Node 2:

  1. Assign a static IP address and bring up the interface

    sudo ip addr add 192.168.100.11/24 dev enP2p1s0f1np1
    sudo ip link set enP2p1s0f1np1 up
    

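The addresses added with ip addr above do not persist across reboots. One way to make them permanent on Ubuntu 24.04 is a small netplan drop-in, sketched here for Node 1 (the file name 60-cluster.yaml is arbitrary; use 192.168.100.11 on Node 2 and adjust the interface name if yours differs):

    # Write a minimal netplan file for the cluster interface (Node 1 shown)
    printf '%s\n' 'network:' '  version: 2' '  ethernets:' '    enP2p1s0f1np1:' \
      '      addresses: [192.168.100.10/24]' | sudo tee /etc/netplan/60-cluster.yaml
    sudo chmod 600 /etc/netplan/60-cluster.yaml
    # Apply the configuration
    sudo netplan apply

Depending on which network renderer the system uses (networkd or NetworkManager), an explicit renderer entry may also be needed in the netplan file.
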
Verify connectivity:

  • From Node 1, test connection to Node 2:

    ping -c 3 192.168.100.11
    
  • From Node 2, test connection to Node 1:

    ping -c 3 192.168.100.10
    
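To confirm that this traffic actually uses the QSFP/CX7 link rather than another interface, you can also ask the kernel which route it selects (an optional check, shown here from Node 1):

    ip route get 192.168.100.11

The dev field in the output should name the cluster interface configured above (enP2p1s0f1np1 in this example).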

Run the DGX Spark Discovery Script#

This step automatically identifies interconnected DGX Spark systems and sets up passwordless SSH authentication between them.

On both nodes:

  1. Run the discovery script

    ./discover-sparks
    

    Example output:

    Found: 192.168.100.10 (spark-1b3b.local)
    Found: 192.168.100.11 (spark-1d84.local)
    
    Copying your SSH public key to all discovered nodes using ssh-copy-id.
    You may be prompted for your password on each node.
    Copying SSH key to 192.168.100.10 ...
    Copying SSH key to 192.168.100.11 ...
    nvidia@192.168.100.11's password:
    
    SSH key copy process complete. These two sparks can now talk to each other.
    
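If the discovery script is unavailable or fails, the same passwordless SSH setup can be done by hand (a sketch; the nvidia user name and the addresses are taken from the example output above and may differ on your systems):

    # Create a key pair if one does not already exist
    [ -f ~/.ssh/id_ed25519 ] || ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519
    # Copy the public key to the peer node (run on each node, pointing at the other)
    ssh-copy-id nvidia@192.168.100.11
    # Confirm passwordless login works
    ssh nvidia@192.168.100.11 hostname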

Install Required Software#

To support distributed workloads, both systems must have MPI (for CPU process communication) and NCCL (for GPU collective communication) installed.

  1. Install MPI (OpenMPI)

    MPI allows distributed processes across systems to communicate.

    sudo apt update
    sudo apt install -y openmpi-bin libopenmpi-dev
    
  2. Install NCCL v2.28.3 (or later)

    NCCL provides fast GPU-to-GPU communication over QSFP/CX7 and supports multi-rail socket interfaces. Download the libnccl2 and libnccl-dev packages that match your system's Ubuntu release, CPU architecture, and CUDA version from https://developer.nvidia.com/nccl/nccl-download, then install them with dpkg. The filenames in the example below are from an older x86_64 release and are shown only to illustrate the pattern; substitute the packages for your platform (DGX Spark is an Arm-based system).

    wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64/libnccl2_2.8.3-1+cuda11.2_amd64.deb
    wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64/libnccl-dev_2.8.3-1+cuda11.2_amd64.deb
    
    sudo dpkg -i libnccl2_2.8.3*.deb libnccl-dev_2.8.3*.deb
    
  3. Build the NCCL test suite

    These tools verify GPU-to-GPU communication and measure performance across nodes.

    git clone https://github.com/NVIDIA/nccl-tests
    cd nccl-tests
    make MPI=1
    
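Before moving on, it can help to confirm that both packages installed cleanly and that MPI can launch processes on both nodes (a suggested check; adjust the addresses to match your cluster, and if the launch hangs because of extra network interfaces, add the same -mca exclusions used in the benchmark in the next section):

    # Confirm the OpenMPI and NCCL versions on each node
    mpirun --version
    dpkg -l | grep nccl
    # Launch a trivial two-node job; it should print both hostnames
    mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 hostname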

Run a Test Workload#

Now that networking and software are configured, run an all-reduce benchmark to test NCCL and MPI across nodes.

  1. Run the test from either machine

    This example sets the LD_LIBRARY_PATH environment variable to include the locations of the OpenMPI and CUDA libraries. Adjust the path as necessary to match your system configuration.

    export LD_LIBRARY_PATH='/usr/local/openmpi/lib:/usr/local/cuda/lib64'
    mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
      -bind-to none -map-by slot \
      -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
      -mca oob_tcp_if_exclude docker0,lo,virbr0,bmc_redfish0,wlP9s9 \
      -mca btl_tcp_if_exclude docker0,lo,virbr0,bmc_redfish0,wlP9s9 \
      ./build/all_reduce_perf -b 8 -e 512M -f 2 -g 1
    

    This runs an all-reduce collective operation using one GPU on each system and reports the measured bandwidth and latency.

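If the NCCL_DEBUG=INFO output shows NCCL selecting an unexpected network interface, it can be pinned to the QSFP/CX7 cluster interface explicitly with the NCCL_SOCKET_IFNAME environment variable (an optional variant of the command above; the interface name is the one configured earlier and may differ on your systems):

    mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
      -bind-to none -map-by slot \
      -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
      -x NCCL_SOCKET_IFNAME=enP2p1s0f1np1 \
      -mca oob_tcp_if_exclude docker0,lo,virbr0,bmc_redfish0,wlP9s9 \
      -mca btl_tcp_if_exclude docker0,lo,virbr0,bmc_redfish0,wlP9s9 \
      ./build/all_reduce_perf -b 8 -e 512M -f 2 -g 1
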
Troubleshooting#

  • Ensure the QSFP/CX7 interface is up and is the interface that carries the cluster IP addresses

  • Verify connectivity between nodes via ping

  • Use nvidia-smi to confirm GPU status

  • Use NCCL_DEBUG=INFO to show detailed diagnostics during the test

  • Check your interface bindings with ip a and ethtool

  • If the discovery script fails, manually verify SSH connectivity between nodes

  • For additional troubleshooting guidance and support options, see Maintenance and Troubleshooting
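
For the interface checks mentioned above, a minimal example looks like this (the interface name matches the one used throughout this guide and may differ on your systems):

    # Confirm the interface is up and has the expected cluster address
    ip a show enP2p1s0f1np1
    # Check the reported link state and speed of the QSFP/CX7 port
    sudo ethtool enP2p1s0f1np1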

Next Steps#

Once tested, this configuration can be scaled to support:

  • Job orchestration with Slurm or Kubernetes

  • Containerized execution with Singularity or Docker