Run NCCL Test on Dev Pods
Learn how to run the NCCL tests on DGX Cloud Lepton.
The NCCL tests (nccl-tests) measure communication bandwidth among GPU nodes connected over InfiniBand (IB) or RoCE.
This example covers how to run the NCCL tests on a set of Dev Pods.
Step 1: Create Dev Pods
The NCCL tests require 2 or more Dev Pods, each utilizing a full node. In this example, we will create 2 Dev Pods, each using 8x H100 GPUs.
Step 2: Set Up SSH Keys for Pod-to-Pod Communication
1. Generate an SSH key on Pod 1
On Pod 1, generate an SSH key pair and add the public key to the list of authorized keys:
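The following is a minimal sketch of this step, not necessarily the exact command from the original guide. The key type (ed25519), the empty passphrase, and the default key path ~/.ssh/id_ed25519 are assumptions; adjust them to your own policy.

```shell
# Assumption: an ed25519 key with no passphrase at the default path.
KEY="$HOME/.ssh/id_ed25519"
AUTH="$HOME/.ssh/authorized_keys"

mkdir -p "$HOME/.ssh"
chmod 700 "$HOME/.ssh"

# Generate the key pair only if it does not already exist.
[ -f "$KEY" ] || ssh-keygen -q -t ed25519 -N "" -f "$KEY"

# Add the public key to authorized_keys, skipping it if already present.
grep -qxF "$(cat "$KEY.pub")" "$AUTH" 2>/dev/null || cat "$KEY.pub" >> "$AUTH"
chmod 600 "$AUTH"

# Print the public key so it can be copied to the other Pods.
cat "$KEY.pub"
```

The last line prints the public key, which is the value to copy to the other Pods.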
2. Apply the public key to Pods 2 through N
On each of Pods 2 through N, create the .ssh directory and append the public key generated on Pod 1 to authorized_keys:
Step 3: Get the local IP addresses of Pods 1 through N
You can find a node's local IP address in the dashboard: click the node name, then look on the node detail page.
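A sketch of this step, assuming the public key from Pod 1 is pasted in by hand. The PUBKEY value below is a placeholder, not a real key; replace it with the single line printed from Pod 1's public key file.

```shell
# Placeholder (assumption): replace with the public key line copied from Pod 1.
PUBKEY="<paste the public key line from Pod 1 here>"

# Create the .ssh directory with the permissions sshd expects.
mkdir -p "$HOME/.ssh"
chmod 700 "$HOME/.ssh"

# Append the key and restrict access to the file.
echo "$PUBKEY" >> "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"
```

The restrictive permissions matter: sshd typically ignores an authorized_keys file that other users can write to.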
Step 4: Create a hostfile on Pod 1
Create a file named /tmp/hostfile.txt on Pod 1. Each line should contain the private (local) IP address of one node on which a Pod is running.
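For a two-Pod setup, the file could be written as sketched below. The addresses 10.0.0.1 and 10.0.0.2 are made-up placeholders; substitute the IP addresses collected in Step 3, one per line.

```shell
# Placeholder IPs (assumption): replace with the addresses from Step 3.
cat > /tmp/hostfile.txt <<'EOF'
10.0.0.1
10.0.0.2
EOF

# Show the result for a quick visual check.
cat /tmp/hostfile.txt
```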
Step 5: Download the nccl-tests script on every Pod
Step 6: Run the nccl-tests script on Pods 2 through N
On each of Pods 2 through N, run the following command:
You can find the SSH port on the Pod detail page. The SSH port is commonly 2222 because the Pod utilizes a full node.
You should see "Done" printed at the end of the output.
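Before moving on, it can help to confirm from Pod 1 that every other Pod accepts the SSH key. The helper below is a hypothetical sketch, not part of the nccl-tests script: the function name check_pods, its arguments, and the assumption that the previous command leaves an SSH server listening on the given port are all ours.

```shell
# Hypothetical helper: check key-based SSH access to each host in a hostfile.
# Usage: check_pods HOSTFILE [PORT] [SKIP_IP]
check_pods() {
  hostfile="$1"
  port="${2:-2222}"   # assumed default; use the port from the Pod detail page
  skip="${3:-}"       # optional IP to skip, such as Pod 1's own address
  failed=0
  while IFS= read -r ip; do
    [ -z "$ip" ] && continue
    [ "$ip" = "$skip" ] && continue
    # -n keeps ssh from consuming the rest of the hostfile on stdin.
    if ssh -n -p "$port" -o BatchMode=yes -o ConnectTimeout=5 \
         -o StrictHostKeyChecking=accept-new "$ip" hostname; then
      echo "ok: $ip"
    else
      echo "FAILED: $ip"
      failed=1
    fi
  done < "$hostfile"
  return "$failed"
}
```

For example, running check_pods /tmp/hostfile.txt 2222 followed by Pod 1's own IP skips Pod 1's entry, which may not be listening on that port yet. A nonzero return status means at least one Pod was unreachable.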
Step 7: Run the nccl-tests script on Pod 1
Then, on Pod 1, run the following command:
The output will be similar to the following:
This result was obtained by running the NCCL tests in the nvcr.io/nvidia/cuda-dl-base:24.10-cuda12.6-devel-ubuntu22.04 container image.