Run NCCL Test on Dev Pods

Learn how to run NCCL tests on DGX Cloud Lepton.

NCCL Tests (nccl-tests) is a tool for verifying communication bandwidth between GPU nodes connected over InfiniBand (IB) or RoCE.

This example shows how to run NCCL tests on a set of Dev Pods.

Step 1: Create Dev Pods

NCCL tests require two or more Dev Pods, each using a full node. In this example, we create two Dev Pods, each using 8x H100 GPUs.

Step 2: Set Up SSH Keys for Pod-to-Pod Communication

1. Generate an SSH key on Pod 1

On Pod 1, generate an SSH key pair and add the public key to the list of authorized keys:
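A minimal sketch of this step, assuming an Ed25519 key with no passphrase stored at the default path ~/.ssh/id_ed25519. The key type, the path, and the KEY_FILE variable are illustrative assumptions, not platform requirements:

```shell
# Assumed key path; set KEY_FILE beforehand if you keep keys elsewhere.
KEY_FILE="${KEY_FILE:-$HOME/.ssh/id_ed25519}"

# Create ~/.ssh with the restrictive permissions sshd expects.
mkdir -p "$HOME/.ssh"
chmod 700 "$HOME/.ssh"

# Generate a passphrase-less key pair unless one already exists.
[ -f "$KEY_FILE" ] || ssh-keygen -t ed25519 -N "" -f "$KEY_FILE" -q

# Authorize the public key on Pod 1 itself.
cat "$KEY_FILE.pub" >> "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"

# Print the public key so it can be copied to the other Pods.
cat "$KEY_FILE.pub"
```

The last command prints the public key, which is the value to copy to the remaining Pods.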

2. Apply the public key to Pods 2 through N

On each of Pods 2 through N, create the .ssh directory if it does not exist and append the public key generated on Pod 1 to authorized_keys:
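One way to do this is sketched below. The PUBLIC_KEY value is a placeholder, not a real key; replace it with the single line from the public key file generated on Pod 1:

```shell
# Placeholder only: paste the public key line from Pod 1 here.
PUBLIC_KEY="<contents of the public key file from Pod 1>"

# Prepare ~/.ssh with the permissions sshd expects.
mkdir -p "$HOME/.ssh"
chmod 700 "$HOME/.ssh"
touch "$HOME/.ssh/authorized_keys"

# Append the key only if it is not already present, so reruns are harmless.
grep -qxF "$PUBLIC_KEY" "$HOME/.ssh/authorized_keys" \
  || echo "$PUBLIC_KEY" >> "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"
```

The permission settings matter: sshd typically ignores an authorized_keys file that other users can write to.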

Step 3: Get the local IP addresses for Pod 1 to Pod N

You can find the local IP of each node in the dashboard: click the node name, then look for the address on the node detail page.

Step 4: Create hostfile on Pod 1

Create a file named /tmp/hostfile.txt on Pod 1. Each line should contain the private (local) IP address of one node on which a Pod is running.
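As a sketch for the two-Pod setup in this example, the file can be written with a here-document. The addresses 10.0.0.11 and 10.0.0.12 are placeholders; substitute the IPs collected in Step 3:

```shell
# Write one node IP per line (placeholder addresses, replace with your own).
cat > /tmp/hostfile.txt <<'EOF'
10.0.0.11
10.0.0.12
EOF

# Show the file to confirm its contents.
cat /tmp/hostfile.txt
```

With more than two Pods, add one line per additional node.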

Step 5: Download the nccl-tests script on every Pod

Step 6: Run the nccl-tests script on Pod 2 to Pod N

On Pod 2 to Pod N, run the following command:

You can find the SSH port on the Pod detail page. The SSH port is commonly 2222, because the Pod uses a full node.

You should see "Done" printed at the end of the output.

Step 7: Run the nccl-tests script on Pod 1

Then, on Pod 1, run the following command:

The output will be similar to the following:

This result was obtained by running the NCCL tests in the nvcr.io/nvidia/cuda-dl-base:24.10-cuda12.6-devel-ubuntu22.04 container image.

Copyright @ 2025, NVIDIA Corporation.