Run NCCL Test on Dev Pods
The NCCL test (nccl-tests) measures communication bandwidth between GPU nodes connected over InfiniBand (IB) or RoCE.
This example shows how to run the NCCL test across a set of Dev Pods.
Step 1: Create Dev Pods
The NCCL test requires 2 or more Dev Pods, each utilizing a full node. In this example, we will create 2 Dev Pods, each using 8x H100 GPUs.
Step 2: Set Up SSH Keys for Pod Communication
1. Generate an SSH key on Pod 1
On Pod 1, generate an SSH key pair and add the public key to the list of authorized keys:
mkdir -p /root/.ssh
chmod 700 /root/.ssh
ssh-keygen -t ed25519 -N "" -f /root/.ssh/id_ed25519
cat /root/.ssh/id_ed25519.pub >> /root/.ssh/authorized_keys
2. Apply the public key to Pods 2 through N
On each of Pod 2 to Pod N, prepare the .ssh directory and add the generated public key from Pod 1 to authorized_keys:
mkdir -p /root/.ssh
chmod 700 /root/.ssh
touch /root/.ssh/authorized_keys
# Then, on Pod 2 ... Pod N, append Pod 1's public key to /root/.ssh/authorized_keys,
# using echo >> or any other command that writes to that file.
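One way to carry out that last step is sketched below. The helper name add_authorized_key is hypothetical, and the key string in the usage comment is a placeholder for the actual contents of /root/.ssh/id_ed25519.pub on Pod 1. The sketch appends the key only if it is not already present, so it can be run more than once without creating duplicates.

```shell
# Hypothetical helper: add a public key to authorized_keys if it is not there yet.
# $1 = the full public key line, $2 = the .ssh directory (defaults to /root/.ssh).
add_authorized_key() {
  local pubkey="$1"
  local ssh_dir="${2:-/root/.ssh}"
  mkdir -p "$ssh_dir"
  chmod 700 "$ssh_dir"
  touch "$ssh_dir/authorized_keys"
  chmod 600 "$ssh_dir/authorized_keys"
  # grep -qxF: quiet, whole-line, fixed-string match; append only when no match.
  grep -qxF "$pubkey" "$ssh_dir/authorized_keys" \
    || printf '%s\n' "$pubkey" >> "$ssh_dir/authorized_keys"
}

# Example, run on Pod 2 ... Pod N (placeholder key, paste the real one from Pod 1):
# add_authorized_key "ssh-ed25519 AAAA... root@pod-1"
```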
Step 3: Get the local IP addresses for Pod 1 to Pod N
You can find the local IP of each node in the dashboard: click the node name and look for the IP on the node detail page.
Step 4: Create hostfile on Pod 1
Create a file named /tmp/hostfile.txt on Pod 1. Each line should contain the private IP address of one node where these pods are running.
touch /tmp/hostfile.txt
vi /tmp/hostfile.txt
# IP_ADDRESS_FOR_NODE_1
# IP_ADDRESS_FOR_NODE_2
# ...
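If you prefer not to edit the file interactively, the hostfile can also be written from the command line. This is a minimal sketch: write_hostfile is a hypothetical helper, and the addresses in the usage comment are made-up placeholders for the node IPs collected in Step 3.

```shell
# Hypothetical helper: write one IP address per line to a hostfile.
# $1 = output path, remaining arguments = node IP addresses.
write_hostfile() {
  local out="$1"
  shift
  if [ "$#" -lt 1 ]; then
    echo "usage: write_hostfile <path> <ip> [<ip> ...]" >&2
    return 1
  fi
  # printf reuses the format for every remaining argument, giving one IP per line.
  printf '%s\n' "$@" > "$out"
}

# Example, run on Pod 1 (placeholder addresses):
# write_hostfile /tmp/hostfile.txt 10.0.0.11 10.0.0.12
# cat /tmp/hostfile.txt
```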
Step 5: Download the nccl-tests script on every Pod
wget https://pub-2f78d6ca875c410392d83a29768dd4ce.r2.dev/nccl_test_pod.bash -O ./nccl_test_pod.bash
chmod +x nccl_test_pod.bash
Step 6: Run the nccl-tests script on Pod 2 to Pod N
On Pod 2 to Pod N, run the following command:
./nccl_test_pod.bash --ssh-port <ssh-port>
You can find the SSH port on the Pod detail page. Commonly, the SSH port is 2222 since the Pod is utilizing a full node.
You should see "Done" printed at the end of the output.
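Before moving on to the launcher, it may help to confirm from Pod 1 that every host in the hostfile accepts the key over SSH. The following is a sketch under assumptions: check_ssh_hosts is a hypothetical helper, it assumes you log in as root (the keys above live under /root/.ssh), and it assumes an SSH server is listening on the given port on each host. It uses only standard OpenSSH client options.

```shell
# Hypothetical helper: try a non-interactive SSH login to every host in a hostfile.
# $1 = hostfile path, $2 = SSH port.
check_ssh_hosts() {
  local hostfile="$1"
  local port="$2"
  local host
  # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
  while IFS= read -r host || [ -n "$host" ]; do
    # Skip blank lines and comment lines.
    case "$host" in ''|'#'*) continue ;; esac
    # -n keeps ssh from consuming the rest of the hostfile on stdin.
    # BatchMode=yes fails instead of prompting for a password.
    # StrictHostKeyChecking=accept-new records unseen host keys without prompting.
    if ssh -n -p "$port" -o BatchMode=yes -o ConnectTimeout=5 \
         -o StrictHostKeyChecking=accept-new "root@$host" true; then
      echo "OK   $host"
    else
      echo "FAIL $host"
    fi
  done < "$hostfile"
}

# Example, run on Pod 1:
# check_ssh_hosts /tmp/hostfile.txt 2222
```

A FAIL line points to an IP address, port, or authorized_keys entry worth rechecking before the launcher is started.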
Step 7: Run the nccl-tests script on Pod 1
Then, on Pod 1, run the following command:
./nccl_test_pod.bash --launcher --host-file /tmp/hostfile.txt --num-workers <N> --ssh-port <ssh-port>
The output will be similar to the following:
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 42.85 0.00 0.00 N/A 36.65 0.00 0.00 N/A
16 4 float sum -1 36.15 0.00 0.00 N/A 36.02 0.00 0.00 N/A
32 8 float sum -1 36.19 0.00 0.00 N/A 35.95 0.00 0.00 N/A
64 16 float sum -1 36.22 0.00 0.00 N/A 36.30 0.00 0.00 N/A
128 32 float sum -1 36.38 0.00 0.01 N/A 36.54 0.00 0.01 N/A
256 64 float sum -1 62.06 0.00 0.01 N/A 36.49 0.01 0.01 N/A
512 128 float sum -1 55.09 0.01 0.02 N/A 36.07 0.01 0.03 N/A
1024 256 float sum -1 35.48 0.03 0.05 N/A 36.33 0.03 0.05 N/A
2048 512 float sum -1 36.13 0.06 0.11 N/A 36.43 0.06 0.11 N/A
4096 1024 float sum -1 36.23 0.11 0.21 N/A 36.23 0.11 0.21 N/A
8192 2048 float sum -1 40.11 0.20 0.38 N/A 39.67 0.21 0.39 N/A
16384 4096 float sum -1 47.88 0.34 0.64 N/A 47.94 0.34 0.64 N/A
32768 8192 float sum -1 57.35 0.57 1.07 N/A 57.48 0.57 1.07 N/A
65536 16384 float sum -1 63.68 1.03 1.93 N/A 63.04 1.04 1.95 N/A
131072 32768 float sum -1 69.54 1.88 3.53 N/A 71.59 1.83 3.43 N/A
262144 65536 float sum -1 70.81 3.70 6.94 N/A 67.62 3.88 7.27 N/A
524288 131072 float sum -1 69.38 7.56 14.17 N/A 67.24 7.80 14.62 N/A
1048576 262144 float sum -1 82.14 12.77 23.94 N/A 80.83 12.97 24.32 N/A
2097152 524288 float sum -1 98.29 21.34 40.01 N/A 97.38 21.53 40.38 N/A
4194304 1048576 float sum -1 111.6 37.57 70.44 N/A 110.5 37.94 71.14 N/A
8388608 2097152 float sum -1 147.5 56.88 106.64 N/A 145.2 57.75 108.29 N/A
16777216 4194304 float sum -1 192.9 86.98 163.09 N/A 193.0 86.93 162.99 N/A
33554432 8388608 float sum -1 259.4 129.34 242.51 N/A 258.1 129.99 243.73 N/A
67108864 16777216 float sum -1 465.2 144.27 270.51 N/A 461.3 145.47 272.76 N/A
134217728 33554432 float sum -1 736.0 182.37 341.95 N/A 748.3 179.35 336.29 N/A
268435456 67108864 float sum -1 1282.3 209.34 392.52 N/A 1284.0 209.06 391.98 N/A
536870912 134217728 float sum -1 2338.5 229.57 430.45 N/A 2347.8 228.67 428.76 N/A
1073741824 268435456 float sum -1 4483.0 239.51 449.09 N/A 4476.1 239.89 449.79 N/A
2147483648 536870912 float sum -1 8768.3 244.91 459.21 N/A 8772.0 244.81 459.02 N/A
4294967296 1073741824 float sum -1 17427 246.45 462.10 N/A 17365 247.33 463.75 N/A
8589934592 2147483648 float sum -1 34683 247.67 464.38 N/A 34668 247.78 464.58 N/A
17179869184 4294967296 float sum -1 69265 248.03 465.06 N/A 69282 247.97 464.95 N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth : 137.867
#
MPI job completed!
This result was obtained by running the NCCL tests on the nvcr.io/nvidia/cuda-dl-base:24.10-cuda12.6-devel-ubuntu22.04 image.
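To compare runs, the summary figure can be pulled out of saved output, and the relationship between the two bandwidth columns can be checked by hand. The sketch below assumes you saved the launcher output to a log file yourself (for example by piping it through tee); both function names are hypothetical. The bus bandwidth formula, algbw * 2 * (n - 1) / n for n ranks, is the one the nccl-tests project documents for all-reduce. That this script runs all-reduce is an assumption, though the sum reduction and the sample numbers (2 Pods with 8 GPUs each, so 16 ranks) are consistent with it.

```shell
# Hypothetical helper: print the "Avg bus bandwidth" value from a saved log.
# $1 = path to a file containing the test output.
avg_busbw() {
  # The value is the last whitespace-separated field on the summary line.
  awk '/Avg bus bandwidth/ { print $NF }' "$1"
}

# Hypothetical helper: all-reduce bus bandwidth from algorithm bandwidth.
# $1 = algbw in GB/s, $2 = number of ranks (GPUs).
allreduce_busbw() {
  awk -v algbw="$1" -v n="$2" 'BEGIN { printf "%.2f\n", algbw * 2 * (n - 1) / n }'
}

# Example, using the last row of the sample output (algbw 248.03 GB/s, 16 GPUs):
# allreduce_busbw 248.03 16   # prints 465.06, matching the busbw column
```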