Spark Clustering#

Connecting DGX Spark Systems into a Virtual Cluster#

Overview#

This guide explains how to connect two DGX Spark systems into a virtual compute cluster using a simplified networking configuration and a QSFP/CX7 cable for the high-performance interconnect.

The goal is to enable distributed workloads across Grace Blackwell GPUs using MPI (for inter-node process communication) and NCCL v2.28.3 or later (for GPU-accelerated collective operations).

System Requirements#

Before you begin, ensure the following:

  • Both DGX Spark systems have Grace Blackwell GPUs with NVIDIA drivers installed

  • Both systems are running Ubuntu 24.04 (or later)

  • The systems are connected to each other with a QSFP/CX7 cable

  • The systems have internet access for initial software setup

  • You have sudo/root access on both systems
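
A quick way to confirm these prerequisites on each node is shown below (a suggested check; the interface name enP2p1s0f1np1 is the QSFP/CX7 port used later in this guide and may differ on your systems):

    # Confirm the NVIDIA driver and GPU are visible
    nvidia-smi
    # Confirm the Ubuntu release
    lsb_release -a
    # Confirm the QSFP/CX7 interface is present
    ip link show enP2p1s0f1np1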

Setup Networking Between Spark Systems#

Option 2: Manual IP assignment (advanced)#

Follow these steps to manually assign IP addresses for dedicated cluster networking.

On Node 1:

  1. Assign a static IP address and bring up the interface

    sudo ip addr add 192.168.100.10/24 dev enP2p1s0f1np1
    sudo ip link set enP2p1s0f1np1 up
    

On Node 2:

  1. Assign a static IP address and bring up the interface

    sudo ip addr add 192.168.100.11/24 dev enP2p1s0f1np1
    sudo ip link set enP2p1s0f1np1 up
    

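The addresses added with ip addr above do not persist across reboots. One way to make them permanent on Ubuntu 24.04 is a small netplan drop-in, sketched here for Node 1 (the file name 60-cluster.yaml is arbitrary; use 192.168.100.11 on Node 2 and adjust the interface name if yours differs):

    # Write a minimal netplan file for the cluster interface (Node 1 shown)
    printf '%s\n' 'network:' '  version: 2' '  ethernets:' '    enP2p1s0f1np1:' \
      '      addresses: [192.168.100.10/24]' | sudo tee /etc/netplan/60-cluster.yaml
    sudo chmod 600 /etc/netplan/60-cluster.yaml
    # Apply the configuration
    sudo netplan apply

Depending on which network renderer the system uses (networkd or NetworkManager), an explicit renderer entry may also be needed in the netplan file.
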
Verify connectivity:

  • From Node 1, test connection to Node 2:

    ping -c 3 192.168.100.11
    
  • From Node 2, test connection to Node 1:

    ping -c 3 192.168.100.10
    
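To confirm that this traffic actually uses the QSFP/CX7 link rather than another interface, you can also ask the kernel which route it selects (an optional check, shown here from Node 1):

    ip route get 192.168.100.11

The dev field in the output should name the cluster interface configured above (enP2p1s0f1np1 in this example).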

Run the DGX Spark Discovery Script#

This step automatically identifies interconnected DGX Spark systems and sets up passwordless SSH authentication between them.

On both nodes:

  1. Run the discovery script

    ./discover-sparks
    

    Example output:

    Found: 192.168.100.10 (spark-1b3b.local)
    Found: 192.168.100.11 (spark-1d84.local)
    
    Copying your SSH public key to all discovered nodes using ssh-copy-id.
    You may be prompted for your password on each node.
    Copying SSH key to 192.168.100.10 ...
    Copying SSH key to 192.168.100.11 ...
    nvidia@192.168.100.11's password:
    
    SSH key copy process complete. These two sparks can now talk to each other.
    
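If the discovery script is unavailable or fails, the same passwordless SSH setup can be done by hand (a sketch; the nvidia user name and the addresses are taken from the example output above and may differ on your systems):

    # Create a key pair if one does not already exist
    [ -f ~/.ssh/id_ed25519 ] || ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519
    # Copy the public key to the peer node (run on each node, pointing at the other)
    ssh-copy-id nvidia@192.168.100.11
    # Confirm passwordless login works
    ssh nvidia@192.168.100.11 hostname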

Install Required Software#

To support distributed workloads, both systems must have MPI (for CPU process communication) and NCCL (for GPU collective communication) installed.

  1. Install MPI (OpenMPI)

    MPI allows distributed processes across systems to communicate.

    sudo apt update
    sudo apt install -y openmpi-bin libopenmpi-dev
    
  2. Install NCCL v2.28.3 (or later)

    NCCL provides fast GPU-to-GPU communication over QSFP/CX7 and supports multi-rail socket interfaces. Download the libnccl2 and libnccl-dev packages that match your system's Ubuntu release, CPU architecture, and CUDA version from https://developer.nvidia.com/nccl/nccl-download, then install them with dpkg. The filenames in the example below are from an older x86_64 release and are shown only to illustrate the pattern; substitute the packages for your platform (DGX Spark is an Arm-based system).

    wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64/libnccl2_2.8.3-1+cuda11.2_amd64.deb
    wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64/libnccl-dev_2.8.3-1+cuda11.2_amd64.deb
    
    sudo dpkg -i libnccl2_2.8.3*.deb libnccl-dev_2.8.3*.deb
    
  3. Build the NCCL test suite

    These tools verify GPU-to-GPU communication and measure performance across nodes.

    git clone https://github.com/NVIDIA/nccl-tests
    cd nccl-tests
    make MPI=1
    
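Before moving on, it can help to confirm that both packages installed cleanly and that MPI can launch processes on both nodes (a suggested check; adjust the addresses to match your cluster, and if the launch hangs because of extra network interfaces, add the same -mca exclusions used in the benchmark in the next section):

    # Confirm the OpenMPI and NCCL versions on each node
    mpirun --version
    dpkg -l | grep nccl
    # Launch a trivial two-node job; it should print both hostnames
    mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 hostname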

Run a Test Workload#

Now that networking and software are configured, run an all-reduce benchmark to test NCCL and MPI across nodes.

  1. Run the test from either machine

    This example sets the LD_LIBRARY_PATH environment variable to include the locations of the OpenMPI and CUDA libraries. Adjust the path as necessary to match your system configuration.

    export LD_LIBRARY_PATH='/usr/local/openmpi/lib:/usr/local/cuda/lib64'
    mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
      -bind-to none -map-by slot \
      -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
      -mca oob_tcp_if_exclude docker0,lo,virbr0,bmc_redfish0,wlP9s9 \
      -mca btl_tcp_if_exclude docker0,lo,virbr0,bmc_redfish0,wlP9s9 \
      ./build/all_reduce_perf -b 8 -e 512M -f 2 -g 1
    

    This runs an all-reduce collective operation using one GPU on each system and reports the measured bandwidth and latency.

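If the NCCL_DEBUG=INFO output shows NCCL selecting an unexpected network interface, it can be pinned to the QSFP/CX7 cluster interface explicitly with the NCCL_SOCKET_IFNAME environment variable (an optional variant of the command above; the interface name is the one configured earlier and may differ on your systems):

    mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
      -bind-to none -map-by slot \
      -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
      -x NCCL_SOCKET_IFNAME=enP2p1s0f1np1 \
      -mca oob_tcp_if_exclude docker0,lo,virbr0,bmc_redfish0,wlP9s9 \
      -mca btl_tcp_if_exclude docker0,lo,virbr0,bmc_redfish0,wlP9s9 \
      ./build/all_reduce_perf -b 8 -e 512M -f 2 -g 1
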
Troubleshooting#

  • Ensure the QSFP/CX7 interface is up and is the interface that carries the cluster IP addresses

  • Verify connectivity between nodes via ping

  • Use nvidia-smi to confirm GPU status

  • Use NCCL_DEBUG=INFO to show detailed diagnostics during the test

  • Check your interface bindings with ip a and ethtool

  • If the discovery script fails, manually verify SSH connectivity between nodes

  • For additional troubleshooting guidance and support options, see Maintenance and Troubleshooting
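
For the interface checks mentioned above, a minimal example looks like this (the interface name matches the one used throughout this guide and may differ on your systems):

    # Confirm the interface is up and has the expected cluster address
    ip a show enP2p1s0f1np1
    # Check the reported link state and speed of the QSFP/CX7 port
    sudo ethtool enP2p1s0f1np1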

Next Steps#

Once tested, this configuration can be scaled to support:

  • Job orchestration with Slurm or Kubernetes

  • Containerized execution with Singularity or Docker