Validate the cluster-level NCCL test with 4 nodes and 32 GPUs

Create a test file named 'nccl-test.yaml' with the contents shown below. In this example the NCCL test runs across 4 DGX nodes and a total of 32 GPUs: "-np" is set to "4" in the mpirun command (one MPI rank per node), and "-g 8" gives each rank all eight GPUs on its node.

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccltest
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          imagePullSecrets:
          - name: ngc-registry-default
          containers:
          - image: docker.io/deepops/nccl-tests:2312
            name: nccltest
            imagePullPolicy: IfNotPresent
            command:
            - sh
            - "-c"
            - |
              /bin/bash << 'EOF'
              mpirun --allow-run-as-root \
                -np 4 \
                -bind-to none -map-by slot \
                -mca pml ob1 \
                -mca btl ^openib \
                -mca btl_tcp_if_include 192.168.0.0/16 \
                -mca oob_tcp_if_include 172.29.0.0/16 \
                all_reduce_perf_mpi -b 8 -e 16G -f2 -g 8 \
                && sleep infinity
              EOF
    Worker:
      replicas: 4
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: ibp192s0,ibp206s0,ibp154s0,ibp220s0,ibp24s0,ibp64s0,ibp79s0,ibp94s0
        spec:
          imagePullSecrets:
          - name: ngc-registry-default
          containers:
          - image: docker.io/deepops/nccl-tests:2312
            name: nccltest
            imagePullPolicy: IfNotPresent
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              requests:
                nvidia.com/resibp192s0: "1"
                nvidia.com/resibp206s0: "1"
                nvidia.com/resibp154s0: "1"
                nvidia.com/resibp220s0: "1"
                nvidia.com/resibp24s0: "1"
                nvidia.com/resibp64s0: "1"
                nvidia.com/resibp79s0: "1"
                nvidia.com/resibp94s0: "1"
                nvidia.com/gpu: 8
              limits:
                nvidia.com/resibp192s0: "1"
                nvidia.com/resibp206s0: "1"
                nvidia.com/resibp154s0: "1"
                nvidia.com/resibp220s0: "1"
                nvidia.com/resibp24s0: "1"
                nvidia.com/resibp64s0: "1"
                nvidia.com/resibp79s0: "1"
                nvidia.com/resibp94s0: "1"
                nvidia.com/gpu: 8
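
The rank/GPU arithmetic behind the manifest can be sanity-checked before submitting the job: the number of mpirun ranks (-np) times the GPUs per rank (-g) must equal the total GPU count. The following is an illustrative sketch of that check (variable names are ours, not part of the job spec):

```shell
# Sanity check (illustrative): one mpirun rank per node (-np), eight GPUs
# per rank (-g); the product must match the 32 GPUs the job is meant to use.
NODES=4
GPUS_PER_NODE=8
TOTAL_GPUS=$((NODES * GPUS_PER_NODE))
echo "mpirun -np ${NODES} ... -g ${GPUS_PER_NODE} => ${TOTAL_GPUS} GPUs total"
```

Adjust NODES (and the Worker replicas and -np values in the manifest) together when scaling the test to a different number of DGX nodes.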

Apply the 'nccl-test.yaml' file:

kubectl apply -f nccl-test.yaml
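
Instead of polling manually, the wait can be scripted. The sketch below uses kubectl wait; the pod label selector is an assumption based on the labels the Kubeflow MPI Operator typically applies, so verify it against your cluster (for example with kubectl get pods --show-labels) before relying on it:

```shell
# Optional: block until all pods of the nccltest MPIJob report Ready.
# The label below is an assumption about the MPI Operator's pod labels.
kubectl wait --for=condition=Ready pod \
  -l training.kubeflow.org/job-name=nccltest \
  --timeout=300s
```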

Monitor the progress of the job and wait until all pods reach the "Running" state:

kubectl get pods

root@bcm10-headnode1:~# kubectl get pods
NAME                      READY   STATUS    RESTARTS      AGE
nccltest-launcher-8znll   1/1     Running   3 (54s ago)   87s
nccltest-worker-0         1/1     Running   0             87s
nccltest-worker-1         1/1     Running   0             87s
nccltest-worker-2         1/1     Running   0             87s
nccltest-worker-3         1/1     Running   0             87s

Once all pods are in the Running state (this takes about 90 seconds), verify the results:

kubectl logs nccltest-launcher-NNNNN

For example:

root@bcm10-headnode1:~# kubectl logs -f nccltest-launcher-8znll
#
# Using devices
#  Rank  0 Group  0 Pid     43 on nccltest-worker-0 device  0 [0x1b] NVIDIA H100 80GB HBM3
#  Rank  1 Group  0 Pid     43 on nccltest-worker-0 device  1 [0x43] NVIDIA H100 80GB HBM3
#  Rank  2 Group  0 Pid     43 on nccltest-worker-0 device  2 [0x52] NVIDIA H100 80GB HBM3
#  Rank  3 Group  0 Pid     43 on nccltest-worker-0 device  3 [0x61] NVIDIA H100 80GB HBM3
#  Rank  4 Group  0 Pid     43 on nccltest-worker-0 device  4 [0x9d] NVIDIA H100 80GB HBM3
#  Rank  5 Group  0 Pid     43 on nccltest-worker-0 device  5 [0xc3] NVIDIA H100 80GB HBM3
#  Rank  6 Group  0 Pid     43 on nccltest-worker-0 device  6 [0xd1] NVIDIA H100 80GB HBM3
#  Rank  7 Group  0 Pid     43 on nccltest-worker-0 device  7 [0xdf] NVIDIA H100 80GB HBM3
#  Rank  8 Group  0 Pid     43 on nccltest-worker-1 device  0 [0x1b] NVIDIA H100 80GB HBM3
#  Rank  9 Group  0 Pid     43 on nccltest-worker-1 device  1 [0x43] NVIDIA H100 80GB HBM3
#  Rank 10 Group  0 Pid     43 on nccltest-worker-1 device  2 [0x52] NVIDIA H100 80GB HBM3
#  Rank 11 Group  0 Pid     43 on nccltest-worker-1 device  3 [0x61] NVIDIA H100 80GB HBM3
#  Rank 12 Group  0 Pid     43 on nccltest-worker-1 device  4 [0x9d] NVIDIA H100 80GB HBM3
#  Rank 13 Group  0 Pid     43 on nccltest-worker-1 device  5 [0xc3] NVIDIA H100 80GB HBM3
#  Rank 14 Group  0 Pid     43 on nccltest-worker-1 device  6 [0xd1] NVIDIA H100 80GB HBM3
#  Rank 15 Group  0 Pid     43 on nccltest-worker-1 device  7 [0xdf] NVIDIA H100 80GB HBM3
#  Rank 16 Group  0 Pid     43 on nccltest-worker-2 device  0 [0x1b] NVIDIA H100 80GB HBM3
#  Rank 17 Group  0 Pid     43 on nccltest-worker-2 device  1 [0x43] NVIDIA H100 80GB HBM3
#  Rank 18 Group  0 Pid     43 on nccltest-worker-2 device  2 [0x52] NVIDIA H100 80GB HBM3
#  Rank 19 Group  0 Pid     43 on nccltest-worker-2 device  3 [0x61] NVIDIA H100 80GB HBM3
#  Rank 20 Group  0 Pid     43 on nccltest-worker-2 device  4 [0x9d] NVIDIA H100 80GB HBM3
#  Rank 21 Group  0 Pid     43 on nccltest-worker-2 device  5 [0xc3] NVIDIA H100 80GB HBM3
#  Rank 22 Group  0 Pid     43 on nccltest-worker-2 device  6 [0xd1] NVIDIA H100 80GB HBM3
#  Rank 23 Group  0 Pid     43 on nccltest-worker-2 device  7 [0xdf] NVIDIA H100 80GB HBM3
#  Rank 24 Group  0 Pid     43 on nccltest-worker-3 device  0 [0x1b] NVIDIA H100 80GB HBM3
#  Rank 25 Group  0 Pid     43 on nccltest-worker-3 device  1 [0x43] NVIDIA H100 80GB HBM3
#  Rank 26 Group  0 Pid     43 on nccltest-worker-3 device  2 [0x52] NVIDIA H100 80GB HBM3
#  Rank 27 Group  0 Pid     43 on nccltest-worker-3 device  3 [0x61] NVIDIA H100 80GB HBM3
#  Rank 28 Group  0 Pid     43 on nccltest-worker-3 device  4 [0x9d] NVIDIA H100 80GB HBM3
#  Rank 29 Group  0 Pid     43 on nccltest-worker-3 device  5 [0xc3] NVIDIA H100 80GB HBM3
#  Rank 30 Group  0 Pid     43 on nccltest-worker-3 device  6 [0xd1] NVIDIA H100 80GB HBM3
#  Rank 31 Group  0 Pid     43 on nccltest-worker-3 device  7 [0xdf] NVIDIA H100 80GB HBM3
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    222.7    0.00    0.00      0    52.78    0.00    0.00      0
          16             4     float     sum      -1    45.53    0.00    0.00      0    45.37    0.00    0.00      0
          32             8     float     sum      -1    43.96    0.00    0.00      0    46.49    0.00    0.00      0
          64            16     float     sum      -1    44.13    0.00    0.00      0    45.86    0.00    0.00      0
         128            32     float     sum      -1    44.73    0.00    0.01      0    44.28    0.00    0.01      0
         256            64     float     sum      -1    66.13    0.00    0.01      0    43.62    0.01    0.01      0
         512           128     float     sum      -1    51.52    0.01    0.02      0    46.60    0.01    0.02      0
        1024           256     float     sum      -1    50.13    0.02    0.04      0    48.78    0.02    0.04      0
        2048           512     float     sum      -1    52.50    0.04    0.08      0    48.21    0.04    0.08      0
        4096          1024     float     sum      -1    50.20    0.08    0.16      0    50.25    0.08    0.16      0
        8192          2048     float     sum      -1    56.57    0.14    0.28      0    50.76    0.16    0.31      0
       16384          4096     float     sum      -1    56.42    0.29    0.56      0    53.13    0.31    0.60      0
       32768          8192     float     sum      -1    55.20    0.59    1.15      0    53.74    0.61    1.18      0
       65536         16384     float     sum      -1    69.90    0.94    1.82      0    68.49    0.96    1.85      0
      131072         32768     float     sum      -1    63.59    2.06    3.99      0    83.64    1.57    3.04      0
      262144         65536     float     sum      -1    74.29    3.53    6.84      0    77.09    3.40    6.59      0
      524288        131072     float     sum      -1    78.86    6.65   12.88      0    82.36    6.37   12.33      0
     1048576        262144     float     sum      -1    91.09   11.51   22.30      0    88.45   11.85   22.97      0
     2097152        524288     float     sum      -1    159.7   13.13   25.44      0    139.5   15.03   29.13      0
     4194304       1048576     float     sum      -1    151.1   27.75   53.77      0    159.6   26.28   50.91      0
     8388608       2097152     float     sum      -1    197.8   42.41   82.17      0    203.7   41.19   79.80      0
    16777216       4194304     float     sum      -1    254.9   65.81  127.50      0    254.6   65.89  127.66      0
    33554432       8388608     float     sum      -1    364.0   92.19  178.61      0    462.5   72.56  140.58      0
    67108864      16777216     float     sum      -1    562.0  119.41  231.35      0    550.7  121.87  236.12      0
   134217728      33554432     float     sum      -1    982.6  136.59  264.64      0    969.6  138.42  268.19      0
   268435456      67108864     float     sum      -1   1842.7  145.67  282.24      0   1874.8  143.18  277.41      0
   536870912     134217728     float     sum      -1   3578.6  150.02  290.67      0   3580.5  149.94  290.52      0
  1073741824     268435456     float     sum      -1   7421.0  144.69  280.34      0   7059.9  152.09  294.67      0
  2147483648     536870912     float     sum      -1    14051  152.84  296.12      0    14061  152.72  295.90      0
  4294967296    1073741824     float     sum      -1    28187  152.37  295.22      0    28145  152.60  295.67      0
  8589934592    2147483648     float     sum      -1    56366  152.39  295.26      0    56564  151.86  294.23      0
 17179869184    4294967296     float     sum      -1   112687  152.46  295.38      0   112808  152.29  295.07      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 94.9049
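
For automated validation, the summary figure can be parsed out of the launcher log. This is a sketch using awk on the sample summary line above; in practice, pipe the output of kubectl logs for your launcher pod into awk instead of the echo used here for illustration:

```shell
# Sketch: extract the average bus bandwidth value from NCCL test output.
# The echoed line mirrors the log above; replace it with
#   kubectl logs nccltest-launcher-<id>
# in a real run. The value is the 6th whitespace-separated field.
echo '# Avg bus bandwidth    : 94.9049' \
  | awk '/Avg bus bandwidth/ {print $6}'
```

A script can then compare the extracted value against an expected floor for the fabric; any such threshold is site-specific and not part of this guide.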