Validate the cluster-level NCCL test with 4 nodes and 32 GPUs
Create a test file named 'nccl-test.yaml' with the contents shown below. In this example the NCCL test runs across 4 DGX nodes, over a total of 32 GPUs: "-np" is set to "4" in the mpirun command (one MPI process per worker pod), and each process drives 8 GPUs via "-g 8".
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccltest
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          imagePullSecrets:
            - name: ngc-registry-default
          containers:
            - image: docker.io/deepops/nccl-tests:2312
              name: nccltest
              imagePullPolicy: IfNotPresent
              command:
                - sh
                - "-c"
                - |
                  /bin/bash << 'EOF'
                  mpirun --allow-run-as-root \
                    -np 4 \
                    -bind-to none -map-by slot \
                    -mca pml ob1 \
                    -mca btl ^openib \
                    -mca btl_tcp_if_include 192.168.0.0/16 \
                    -mca oob_tcp_if_include 172.29.0.0/16 \
                    all_reduce_perf_mpi -b 8 -e 16G -f2 -g 8 \
                    && sleep infinity
                  EOF
    Worker:
      replicas: 4
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: ibp192s0,ibp206s0,ibp154s0,ibp220s0,ibp24s0,ibp64s0,ibp79s0,ibp94s0
        spec:
          imagePullSecrets:
            - name: ngc-registry-default
          containers:
            - image: docker.io/deepops/nccl-tests:2312
              name: nccltest
              imagePullPolicy: IfNotPresent
              securityContext:
                capabilities:
                  add: [ "IPC_LOCK" ]
              resources:
                requests:
                  nvidia.com/resibp192s0: "1"
                  nvidia.com/resibp206s0: "1"
                  nvidia.com/resibp154s0: "1"
                  nvidia.com/resibp220s0: "1"
                  nvidia.com/resibp24s0: "1"
                  nvidia.com/resibp64s0: "1"
                  nvidia.com/resibp79s0: "1"
                  nvidia.com/resibp94s0: "1"
                  nvidia.com/gpu: 8
                limits:
                  nvidia.com/resibp192s0: "1"
                  nvidia.com/resibp206s0: "1"
                  nvidia.com/resibp154s0: "1"
                  nvidia.com/resibp220s0: "1"
                  nvidia.com/resibp24s0: "1"
                  nvidia.com/resibp64s0: "1"
                  nvidia.com/resibp79s0: "1"
                  nvidia.com/resibp94s0: "1"
                  nvidia.com/gpu: 8
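As a sanity check on the sizing above: mpirun starts "-np 4" processes, one per worker pod (slotsPerWorker: 1, Worker replicas: 4), and each process drives 8 GPUs ("-g 8"), so the test spans 4 x 8 = 32 GPUs. A minimal sketch of that arithmetic (the variable names are illustrative, not part of the manifest):

```shell
# Totals implied by the manifest: one MPI rank per worker pod,
# each rank driving 8 GPUs (the -g flag of all_reduce_perf_mpi).
NP=4               # mpirun -np (must match Worker replicas * slotsPerWorker)
GPUS_PER_PROC=8    # -g 8
TOTAL_GPUS=$(( NP * GPUS_PER_PROC ))
echo "MPI processes: ${NP}, GPUs under test: ${TOTAL_GPUS}"
```

If you scale the Worker replicas up or down, keep "-np" equal to replicas x slotsPerWorker, or mpirun will fail to fill (or will oversubscribe) the available slots.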
Apply the 'nccl-test.yaml' file:
kubectl apply -f nccl-test.yaml
Monitor the progress of the job and wait until all pods reach the "Running" state (a few launcher restarts while the workers come up are expected):
kubectl get pods
For example:
root@bcm10-headnode1:~# kubectl get pods
NAME                      READY   STATUS    RESTARTS      AGE
nccltest-launcher-8znll   1/1     Running   3 (54s ago)   87s
nccltest-worker-0         1/1     Running   0             87s
nccltest-worker-1         1/1     Running   0             87s
nccltest-worker-2         1/1     Running   0             87s
nccltest-worker-3         1/1     Running   0             87s
Once all pods are in the Running state (this takes roughly 90 seconds), verify the results from the launcher pod's logs:
kubectl logs nccltest-launcher-NNNNN
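The launcher pod name carries a random suffix, so rather than copying it by hand you can select it by label. The label keys below are an assumption based on upstream kubeflow/mpi-operator conventions; confirm them on your cluster with `kubectl get pods --show-labels`:

```shell
# Build a label selector for the launcher pod of the "nccltest" MPIJob.
# Label keys (training.kubeflow.org/...) are assumed from the upstream
# mpi-operator; verify them on your cluster before relying on this.
JOB_NAME=nccltest
SELECTOR="training.kubeflow.org/job-name=${JOB_NAME},training.kubeflow.org/job-role=launcher"
# The command one would run against the cluster:
echo "kubectl logs -f \$(kubectl get pods -l ${SELECTOR} -o name)"
```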
For example:
root@bcm10-headnode1:~# kubectl logs -f nccltest-launcher-8znll
#
# Using devices
# Rank 0 Group 0 Pid 43 on nccltest-worker-0 device 0 [0x1b] NVIDIA H100 80GB HBM3
# Rank 1 Group 0 Pid 43 on nccltest-worker-0 device 1 [0x43] NVIDIA H100 80GB HBM3
# Rank 2 Group 0 Pid 43 on nccltest-worker-0 device 2 [0x52] NVIDIA H100 80GB HBM3
# Rank 3 Group 0 Pid 43 on nccltest-worker-0 device 3 [0x61] NVIDIA H100 80GB HBM3
# Rank 4 Group 0 Pid 43 on nccltest-worker-0 device 4 [0x9d] NVIDIA H100 80GB HBM3
# Rank 5 Group 0 Pid 43 on nccltest-worker-0 device 5 [0xc3] NVIDIA H100 80GB HBM3
# Rank 6 Group 0 Pid 43 on nccltest-worker-0 device 6 [0xd1] NVIDIA H100 80GB HBM3
# Rank 7 Group 0 Pid 43 on nccltest-worker-0 device 7 [0xdf] NVIDIA H100 80GB HBM3
# Rank 8 Group 0 Pid 43 on nccltest-worker-1 device 0 [0x1b] NVIDIA H100 80GB HBM3
# Rank 9 Group 0 Pid 43 on nccltest-worker-1 device 1 [0x43] NVIDIA H100 80GB HBM3
# Rank 10 Group 0 Pid 43 on nccltest-worker-1 device 2 [0x52] NVIDIA H100 80GB HBM3
# Rank 11 Group 0 Pid 43 on nccltest-worker-1 device 3 [0x61] NVIDIA H100 80GB HBM3
# Rank 12 Group 0 Pid 43 on nccltest-worker-1 device 4 [0x9d] NVIDIA H100 80GB HBM3
# Rank 13 Group 0 Pid 43 on nccltest-worker-1 device 5 [0xc3] NVIDIA H100 80GB HBM3
# Rank 14 Group 0 Pid 43 on nccltest-worker-1 device 6 [0xd1] NVIDIA H100 80GB HBM3
# Rank 15 Group 0 Pid 43 on nccltest-worker-1 device 7 [0xdf] NVIDIA H100 80GB HBM3
# Rank 16 Group 0 Pid 43 on nccltest-worker-2 device 0 [0x1b] NVIDIA H100 80GB HBM3
# Rank 17 Group 0 Pid 43 on nccltest-worker-2 device 1 [0x43] NVIDIA H100 80GB HBM3
# Rank 18 Group 0 Pid 43 on nccltest-worker-2 device 2 [0x52] NVIDIA H100 80GB HBM3
# Rank 19 Group 0 Pid 43 on nccltest-worker-2 device 3 [0x61] NVIDIA H100 80GB HBM3
# Rank 20 Group 0 Pid 43 on nccltest-worker-2 device 4 [0x9d] NVIDIA H100 80GB HBM3
# Rank 21 Group 0 Pid 43 on nccltest-worker-2 device 5 [0xc3] NVIDIA H100 80GB HBM3
# Rank 22 Group 0 Pid 43 on nccltest-worker-2 device 6 [0xd1] NVIDIA H100 80GB HBM3
# Rank 23 Group 0 Pid 43 on nccltest-worker-2 device 7 [0xdf] NVIDIA H100 80GB HBM3
# Rank 24 Group 0 Pid 43 on nccltest-worker-3 device 0 [0x1b] NVIDIA H100 80GB HBM3
# Rank 25 Group 0 Pid 43 on nccltest-worker-3 device 1 [0x43] NVIDIA H100 80GB HBM3
# Rank 26 Group 0 Pid 43 on nccltest-worker-3 device 2 [0x52] NVIDIA H100 80GB HBM3
# Rank 27 Group 0 Pid 43 on nccltest-worker-3 device 3 [0x61] NVIDIA H100 80GB HBM3
# Rank 28 Group 0 Pid 43 on nccltest-worker-3 device 4 [0x9d] NVIDIA H100 80GB HBM3
# Rank 29 Group 0 Pid 43 on nccltest-worker-3 device 5 [0xc3] NVIDIA H100 80GB HBM3
# Rank 30 Group 0 Pid 43 on nccltest-worker-3 device 6 [0xd1] NVIDIA H100 80GB HBM3
# Rank 31 Group 0 Pid 43 on nccltest-worker-3 device 7 [0xdf] NVIDIA H100 80GB HBM3
#
#                                                             out-of-place                          in-place
#           size         count    type   redop    root       time   algbw   busbw  #wrong       time   algbw   busbw  #wrong
#            (B)    (elements)                                (us)  (GB/s)  (GB/s)              (us)  (GB/s)  (GB/s)
               8             2   float     sum      -1      222.7    0.00    0.00       0      52.78    0.00    0.00       0
              16             4   float     sum      -1      45.53    0.00    0.00       0      45.37    0.00    0.00       0
              32             8   float     sum      -1      43.96    0.00    0.00       0      46.49    0.00    0.00       0
              64            16   float     sum      -1      44.13    0.00    0.00       0      45.86    0.00    0.00       0
             128            32   float     sum      -1      44.73    0.00    0.01       0      44.28    0.00    0.01       0
             256            64   float     sum      -1      66.13    0.00    0.01       0      43.62    0.01    0.01       0
             512           128   float     sum      -1      51.52    0.01    0.02       0      46.60    0.01    0.02       0
            1024           256   float     sum      -1      50.13    0.02    0.04       0      48.78    0.02    0.04       0
            2048           512   float     sum      -1      52.50    0.04    0.08       0      48.21    0.04    0.08       0
            4096          1024   float     sum      -1      50.20    0.08    0.16       0      50.25    0.08    0.16       0
            8192          2048   float     sum      -1      56.57    0.14    0.28       0      50.76    0.16    0.31       0
           16384          4096   float     sum      -1      56.42    0.29    0.56       0      53.13    0.31    0.60       0
           32768          8192   float     sum      -1      55.20    0.59    1.15       0      53.74    0.61    1.18       0
           65536         16384   float     sum      -1      69.90    0.94    1.82       0      68.49    0.96    1.85       0
          131072         32768   float     sum      -1      63.59    2.06    3.99       0      83.64    1.57    3.04       0
          262144         65536   float     sum      -1      74.29    3.53    6.84       0      77.09    3.40    6.59       0
          524288        131072   float     sum      -1      78.86    6.65   12.88       0      82.36    6.37   12.33       0
         1048576        262144   float     sum      -1      91.09   11.51   22.30       0      88.45   11.85   22.97       0
         2097152        524288   float     sum      -1      159.7   13.13   25.44       0      139.5   15.03   29.13       0
         4194304       1048576   float     sum      -1      151.1   27.75   53.77       0      159.6   26.28   50.91       0
         8388608       2097152   float     sum      -1      197.8   42.41   82.17       0      203.7   41.19   79.80       0
        16777216       4194304   float     sum      -1      254.9   65.81  127.50       0      254.6   65.89  127.66       0
        33554432       8388608   float     sum      -1      364.0   92.19  178.61       0      462.5   72.56  140.58       0
        67108864      16777216   float     sum      -1      562.0  119.41  231.35       0      550.7  121.87  236.12       0
       134217728      33554432   float     sum      -1      982.6  136.59  264.64       0      969.6  138.42  268.19       0
       268435456      67108864   float     sum      -1     1842.7  145.67  282.24       0     1874.8  143.18  277.41       0
       536870912     134217728   float     sum      -1     3578.6  150.02  290.67       0     3580.5  149.94  290.52       0
      1073741824     268435456   float     sum      -1     7421.0  144.69  280.34       0     7059.9  152.09  294.67       0
      2147483648     536870912   float     sum      -1      14051  152.84  296.12       0      14061  152.72  295.90       0
      4294967296    1073741824   float     sum      -1      28187  152.37  295.22       0      28145  152.60  295.67       0
      8589934592    2147483648   float     sum      -1      56366  152.39  295.26       0      56564  151.86  294.23       0
     17179869184    4294967296   float     sum      -1     112687  152.46  295.38       0     112808  152.29  295.07       0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 94.9049
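For an automated pass/fail check, the final summary line can be parsed out of the saved launcher log. A minimal sketch, using the sample output above embedded as test data; the 90 GB/s threshold is illustrative, not an official target, so pick a value appropriate for your node count and fabric:

```shell
# Extract the "Avg bus bandwidth" figure from an nccl-tests log and
# compare it to a site-specific threshold (90 GB/s here is illustrative).
LOG=$(cat <<'EOF'
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 94.9049
EOF
)
AVG=$(printf '%s\n' "$LOG" | awk -F: '/Avg bus bandwidth/ {gsub(/ /,"",$2); print $2}')
THRESHOLD=90
# awk handles the floating-point comparison; exit status 0 means pass.
awk -v a="$AVG" -v t="$THRESHOLD" 'BEGIN { exit !(a >= t) }' && echo "PASS: ${AVG} GB/s"
```

In practice you would pipe `kubectl logs nccltest-launcher-NNNNN` into the same awk filter instead of the embedded sample. When the test is finished, clean up with `kubectl delete -f nccl-test.yaml`.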