Spark Stacking

Connecting DGX Spark Systems into a Virtual Cluster

Overview

This guide explains how to connect two DGX Spark systems into a virtual compute cluster using simplified networking configuration and a QSFP/CX7 cable for high-performance interconnect.

The goal is to enable distributed workloads across Grace Blackwell GPUs using MPI (for inter-process CPU communication) and NCCL v2.28.3 (for GPU-accelerated collective operations).

Additional information can be found in the Connect Two Sparks playbook.

System Requirements

Before you begin, ensure the following:

  • Both DGX Spark systems have Grace Blackwell GPUs, are connected to each other using a QSFP/CX7 cable, and are running Ubuntu 24.04 (or later) with NVIDIA drivers installed

    Note

    The DGX Spark CX-7 ports support Ethernet configuration only.

    Approved cables for the CX-7 ports are:

    • Amphenol: NJAAKK-N911 (QSFP to QSFP112, 32AWG, 400mm, LSZH); NJAAKK0006 is the 0.5m version of this cable

    • Luxshare: LMTQF022-SD-R (QSFP112 400G DAC Cable, 400mm, 30AWG)

  • The systems have internet access for initial software setup

  • You have sudo/root access on both systems

Setup Networking Between Spark Systems

Option 2: Manual IP assignment (advanced)

Follow these steps to manually assign IP addresses for dedicated cluster networking.

  1. On Node 1, assign a static IP address and bring up the interface

    sudo ip addr add 192.168.100.10/24 dev enP2p1s0f1np1
    sudo ip link set enP2p1s0f1np1 up
    
  2. On Node 2, assign a static IP address and bring up the interface

    sudo ip addr add 192.168.100.11/24 dev enP2p1s0f1np1
    sudo ip link set enP2p1s0f1np1 up
    
  3. From Node 1, verify connectivity by testing connection to Node 2

    ping -c 3 192.168.100.11
    
  4. From Node 2, verify connectivity by testing connection to Node 1

    ping -c 3 192.168.100.10
    
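Note that addresses added with `ip addr` do not persist across reboots. On Ubuntu 24.04 the standard way to make them permanent is a netplan configuration file; the following is a minimal sketch for Node 1 (the file name is hypothetical, and Node 2 would use 192.168.100.11 instead):

```yaml
# Hypothetical file: /etc/netplan/99-spark-cluster.yaml
# Apply with: sudo netplan apply
network:
  version: 2
  ethernets:
    enP2p1s0f1np1:
      addresses:
        - 192.168.100.10/24
```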

Run the DGX Spark Discovery Script

This step automatically identifies interconnected DGX Spark systems and sets up passwordless SSH authentication between them.

The following commands should be run in a terminal session (either local or remote) on both nodes.

On both nodes:

  1. Download the discovery script

    wget https://github.com/NVIDIA/dgx-spark-playbooks/raw/refs/heads/main/nvidia/connect-two-sparks/assets/discover-sparks
    
  2. Make the script executable

    chmod +x discover-sparks
    
  3. Run the discovery script

    ./discover-sparks
    

    Example output:

    Found: 192.168.100.10 (spark-1b3b.local)
    Found: 192.168.100.11 (spark-1d84.local)
    
    Copying your SSH public key to all discovered nodes using ssh-copy-id.
    You may be prompted for your password on each node.
    Copying SSH key to 192.168.100.10 ...
    Copying SSH key to 192.168.100.11 ...
    nvidia@192.168.100.11's password:
    
    SSH key copy process complete. These two sparks can now talk to each other.
    
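After the script completes, passwordless SSH can be double-checked from either node with a short loop. This is a sketch; the IPs below are the hypothetical addresses from the example output above, so substitute the ones reported on your systems.

```shell
# Hypothetical node IPs taken from the example discovery output;
# replace with the addresses reported on your systems.
NODES="192.168.100.10 192.168.100.11"

for node in $NODES; do
  # BatchMode=yes makes ssh fail immediately instead of prompting,
  # so a missing key shows up as FAIL rather than a password prompt.
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" hostname >/dev/null 2>&1; then
    echo "OK:   passwordless SSH to $node"
  else
    echo "FAIL: passwordless SSH to $node"
  fi
done
```

If any node reports FAIL, rerun the discovery script or copy your key manually with ssh-copy-id.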

Install Required Software and Verify the Configuration

With networking configured and the systems able to communicate, the next step is to install the software required for distributed workloads, then run test workloads to verify that GPU-to-GPU communication works correctly and to measure performance across the stacked systems.

For complete instructions on building NCCL, running the NCCL test suite, and interpreting the results, see the NCCL Stacked Sparks playbook.
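As a sketch of what such a verification run might look like, assuming OpenMPI is installed, the nccl-tests suite has been built, and the hypothetical manual IPs from this guide are in use (the playbook above has the authoritative steps):

```shell
# Hypothetical host list using the manual IPs assigned earlier in this guide.
HOSTS=192.168.100.10,192.168.100.11
BENCH=./build/all_reduce_perf   # produced by building nccl-tests

if command -v mpirun >/dev/null 2>&1 && [ -x "$BENCH" ]; then
  # One process per node; NCCL_SOCKET_IFNAME pins NCCL's bootstrap
  # traffic to the QSFP/CX7 interface configured above.
  mpirun -np 2 -H "$HOSTS" \
    -x NCCL_SOCKET_IFNAME=enP2p1s0f1np1 \
    "$BENCH" -b 8 -e 1G -f 2 -g 1
else
  echo "mpirun and/or $BENCH not found; see the playbook for build steps"
fi
```

The reported bus bandwidth should approach the line rate of the QSFP/CX7 link; values far below that usually mean NCCL is falling back to a slower interface.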

Troubleshooting

  • Ensure the QSFP/CX7 interface is active and used for IP assignment

  • Verify connectivity between nodes via ping

  • Check your interface bindings with ip a and ethtool

  • If the discovery script fails, manually verify SSH connectivity between nodes

  • For additional troubleshooting guidance and support options, see Maintenance and Troubleshooting
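The first three checks can be run together with a short snippet like the following (the interface name is the one used in this guide; substitute yours):

```shell
IFACE=enP2p1s0f1np1   # QSFP/CX7 interface name used in this guide

if ip link show "$IFACE" >/dev/null 2>&1; then
  ip addr show "$IFACE"          # assigned addresses and UP/DOWN state
  ethtool "$IFACE" || true       # negotiated speed and "Link detected"
else
  echo "Interface $IFACE not found; list candidates with: ip -br link"
fi
```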

Next Steps

Once tested, this configuration can be scaled to support:

  • Job orchestration with Slurm or Kubernetes

  • Containerized execution with Singularity or Docker