Network Operator

  1. Verify that the NVIDIA Mellanox OFED package version on the DGX systems matches the version listed in the DGX OS release notes.

    1  # cmsh
    2  % device
    3  % pexec -c dgx-a100 -j "ofed_info -s"
    4  [dgx01..dgx04]
    5  MLNX_OFED_LINUX-23.10-0.5.5.0:
    

    The InfiniBand interfaces used in the compute fabric must be identified and their operational status checked. As noted, mlx5_0, mlx5_2, mlx5_6, and mlx5_8 are used and should be verified to be in working order. On each node, every interface should report State: Active, Physical state: LinkUp, and Link layer: InfiniBand.

  2. Verify that the interfaces are working properly with the following command:

     1  [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do ibstat -d mlx5_${i} | grep -i \"mlx5_\\|state\\|infiniband\"; done"
     2  [dgx01..dgx04]
     3  CA 'mlx5_0'
     4                  State: Active
     5                  Physical state: LinkUp
     6                  Link layer: InfiniBand
     7  CA 'mlx5_2'
     8                  State: Active
     9                  Physical state: LinkUp
    10                  Link layer: InfiniBand
    11  CA 'mlx5_6'
    12                  State: Active
    13                  Physical state: LinkUp
    14                  Link layer: InfiniBand
    15  CA 'mlx5_8'
    16                  State: Active
    17                  Physical state: LinkUp
    18                  Link layer: InfiniBand
    
  3. Check the SRIOV interface status.

    1. NUM_OF_VFS should be set to 8.

    2. SRIOV_EN should be True(1).

    3. LINK_TYPE_P1 should be IB(1).

    In this example, only LINK_TYPE_P1 is already set correctly. The other two values are set in the next step.

     1  [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do mst start; mlxconfig -d /dev/mst/mt4123_pciconf${i} q; done | grep -e \"SRIOV_EN\\|LINK_TYPE\\|NUM_OF_VFS\""
     2  [dgx01..dgx04]
     3          NUM_OF_VFS                          0
     4          SRIOV_EN                            False(0)
     5          LINK_TYPE_P1                        IB(1)
     6          NUM_OF_VFS                          0
     7          SRIOV_EN                            False(0)
     8          LINK_TYPE_P1                        IB(1)
     9          NUM_OF_VFS                          0
    10          SRIOV_EN                            False(0)
    11          LINK_TYPE_P1                        IB(1)
    12          NUM_OF_VFS                          0
    13          SRIOV_EN                            False(0)
    14          LINK_TYPE_P1                        IB(1)
    
  4. Enable SRIOV and set NUM_OF_VFS to 8 for each interface.

    Since LINK_TYPE_P1 was already set correctly, only the other two values are set below.

     1  [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do mst start; mlxconfig -d /dev/mst/mt4123_pciconf${i} -y set SRIOV_EN=1 NUM_OF_VFS=8; done"
     2  [dgx01..dgx04]
     3  Starting MST (Mellanox Software Tools) driver set
     4  Loading MST PCI module - Success
     5  [warn] mst_pciconf is already loaded, skipping
     6  Create devices
     7  Unloading MST PCI module (unused) - Success
     8
     9  Device #1:
    10  ----------
    11
    12  Device type:    ConnectX6
    13  Name:           MCX653105A-HDA_Ax
    14  Description:    ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
    15  Device:         /dev/mst/mt4123_pciconf0
    16
    17  Configurations:                              Next Boot       New
    18          SRIOV_EN                            False(0)        True(1)
    19          NUM_OF_VFS                          0               8
    20
    21  Apply new Configuration? (y/n) [n] : y
    22  Applying... Done!
    23  -I- Please reboot machine to load new configurations.
    24  . . . some output omitted . . .
    
  5. Reboot the DGX nodes to load the configuration.

    % reboot -c dgx-a100
    
  6. Wait for the DGX nodes to be UP before continuing to the next step.

     1  % list -c dgx-a100 -f hostname:20,category:10,ip:20,status:10
     2  hostname (key)       category   ip                   status
     3  -------------------- ---------- -------------------- ----------
     4  dgx01                dgx-a100   10.184.71.11         [   UP   +
     5  dgx02                dgx-a100   10.184.71.12         [   UP   +
     6  dgx03                dgx-a100   10.184.71.13         [   UP   +
     7  dgx04                dgx-a100   10.184.71.14         [   UP   +
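
    After the nodes are back up, the new firmware settings can be confirmed by re-running the query from step 3; every interface should now report SRIOV_EN True(1) and NUM_OF_VFS 8, with LINK_TYPE_P1 still IB(1). The same command applies:

    [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do mst start; mlxconfig -d /dev/mst/mt4123_pciconf${i} q; done | grep -e \"SRIOV_EN\\|LINK_TYPE\\|NUM_OF_VFS\""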
    
  7. Configure eight SRIOV VFs on the InfiniBand ports.

    [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do echo 8 > /sys/class/infiniband/mlx5_${i}/device/sriov_numvfs; done"
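
    The value can be read back to confirm that eight VFs were created; each of the four interfaces should report 8 on every node. A minimal check using the same pexec pattern:

    [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do cat /sys/class/infiniband/mlx5_${i}/device/sriov_numvfs; done"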
    
  8. On the primary head node, load the Kubernetes environment module.

    # module load kubernetes/default/1.27.11-150500.1.1
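
    Loading the module places kubectl in the PATH and points it at the cluster. A quick sanity check, assuming the head node already holds a valid kubeconfig, is to list the cluster nodes:

    # kubectl get nodes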
    
  9. Add the Network Operator Helm repository and update the local chart cache.

     1  # helm repo add nvidia-networking https://mellanox.github.io/network-operator
     2  "nvidia-networking" has been added to your repositories
     3
     4  # helm repo update
     5  Hang tight while we grab the latest from your chart repositories...
     6  ...Successfully got an update from the "nvidia-networking" chart repository
     7  ...Successfully got an update from the "prometheus-community" chart repository
     8  ...Successfully got an update from the "nvidia" chart repository
     9  Update Complete. ⎈Happy Helming!⎈
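
    Optionally, confirm that the chart is visible in the refreshed cache (the version column will vary with the repository contents):

    # helm search repo nvidia-networking/network-operator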
    
  10. Create the directory ./network-operator.

    # mkdir ./network-operator
    
  11. Create the values.yaml file that Helm will use to install the Network Operator.

     1# vi ./network-operator/values.yaml
     2
     3nfd:
     4  enabled: true
     5sriovNetworkOperator:
     6  enabled: true
     7
     8# NicClusterPolicy CR values:
     9deployCR: true
    10ofedDriver:
    11  deploy: false
    12rdmaSharedDevicePlugin:
    13  deploy: false
    14sriovDevicePlugin:
    15  deploy: false
    16
    17secondaryNetwork:
    18  deploy: true
    19  multus:
    20    deploy: true
    21  cniPlugins:
    22    deploy: true
    23  ipamPlugin:
    24    deploy: true
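
    With the values file in place, the chart from the repository added in step 9 is installed using it. A representative invocation is sketched below; the release name and --create-namespace flag are assumptions to adapt to your environment, while the network-operator namespace matches the one referenced by the custom resources in the following steps. The operator pods should all reach Running before proceeding:

    # helm install network-operator nvidia-networking/network-operator -n network-operator --create-namespace -f ./network-operator/values.yaml
    # kubectl -n network-operator get pods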
    
  12. Create the sriov-ib-network-node-policy.yaml file.

     1# vi ./network-operator/sriov-ib-network-node-policy.yaml
     2
     3apiVersion: sriovnetwork.openshift.io/v1
     4kind: SriovNetworkNodePolicy
     5metadata:
     6  name: ibp12s0
     7  namespace: network-operator
     8spec:
     9  deviceType: netdevice
    10  nodeSelector:
    11    feature.node.kubernetes.io/network-sriov.capable: "true"
    12  nicSelector:
    13    vendor: "15b3"
    14    pfNames: ["ibp12s0"]
    15  linkType: ib
    16  isRdma: true
    17  numVfs: 8
    18  priority: 90
    19  resourceName: resibp12s0
    20
    21---
    22apiVersion: sriovnetwork.openshift.io/v1
    23kind: SriovNetworkNodePolicy
    24metadata:
    25  name: ibp75s0
    26  namespace: network-operator
    27spec:
    28  deviceType: netdevice
    29  nodeSelector:
    30    feature.node.kubernetes.io/network-sriov.capable: "true"
    31  nicSelector:
    32    vendor: "15b3"
    33    pfNames: ["ibp75s0"]
    34  linkType: ib
    35  isRdma: true
    36  numVfs: 8
    37  priority: 90
    38  resourceName: resibp75s0
    39
    40---
    41apiVersion: sriovnetwork.openshift.io/v1
    42kind: SriovNetworkNodePolicy
    43metadata:
    44  name: ibp141s0
    45  namespace: network-operator
    46spec:
    47  deviceType: netdevice
    48  nodeSelector:
    49    feature.node.kubernetes.io/network-sriov.capable: "true"
    50  nicSelector:
    51    vendor: "15b3"
    52    pfNames: ["ibp141s0"]
    53  linkType: ib
    54  isRdma: true
    55  numVfs: 8
    56  priority: 90
    57  resourceName: resibp141s0
    58
    59---
    60apiVersion: sriovnetwork.openshift.io/v1
    61kind: SriovNetworkNodePolicy
    62metadata:
    63  name: ibp186s0
    64  namespace: network-operator
    65spec:
    66  deviceType: netdevice
    67  nodeSelector:
    68    feature.node.kubernetes.io/network-sriov.capable: "true"
    69  nicSelector:
    70    vendor: "15b3"
    71    pfNames: ["ibp186s0"]
    72  linkType: ib
    73  isRdma: true
    74  numVfs: 8
    75  priority: 90
    76  resourceName: resibp186s0
    
  13. Create the sriovibnetwork.yaml file.

     1# vi ./network-operator/sriovibnetwork.yaml
     2
     3apiVersion: sriovnetwork.openshift.io/v1
     4kind: SriovIBNetwork
     5metadata:
     6  name: ibp12s0
     7  namespace: network-operator
     8spec:
     9  ipam: |
    10    {
    11      "type": "whereabouts",
    12      "datastore": "kubernetes",
    13      "kubernetes": {
    14        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
    15      },
    16      "range": "192.168.1.0/24",
    17      "log_file": "/var/log/whereabouts.log",
    18      "log_level": "info"
    19    }
    20  resourceName: resibp12s0
    21  linkState: enable
    22  networkNamespace: default
    23
    24---
    25apiVersion: sriovnetwork.openshift.io/v1
    26kind: SriovIBNetwork
    27metadata:
    28  name: ibp75s0
    29  namespace: network-operator
    30spec:
    31  ipam: |
    32    {
    33      "type": "whereabouts",
    34      "datastore": "kubernetes",
    35      "kubernetes": {
    36        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
    37      },
    38      "range": "192.168.2.0/24",
    39      "log_file": "/var/log/whereabouts.log",
    40      "log_level": "info"
    41    }
    42  resourceName: resibp75s0
    43  linkState: enable
    44  networkNamespace: default
    45
    46---
    47apiVersion: sriovnetwork.openshift.io/v1
    48kind: SriovIBNetwork
    49metadata:
    50  name: ibpi141s0
    51  namespace: network-operator
    52spec:
    53  ipam: |
    54    {
    55      "type": "whereabouts",
    56      "datastore": "kubernetes",
    57      "kubernetes": {
    58        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
    59      },
    60      "range": "192.168.3.0/24",
    61      "log_file": "/var/log/whereabouts.log",
    62      "log_level": "info"
    63    }
    64  resourceName: resibp141s0
    65  linkState: enable
    66  networkNamespace: default
    67
    68---
    69apiVersion: sriovnetwork.openshift.io/v1
    70kind: SriovIBNetwork
    71metadata:
    72  name: ibp186s0
    73  namespace: network-operator
    74spec:
    75  ipam: |
    76    {
    77      "type": "whereabouts",
    78      "datastore": "kubernetes",
    79      "kubernetes": {
    80        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
    81      },
    82      "range": "192.168.4.0/24",
    83      "log_file": "/var/log/whereabouts.log",
    84      "log_level": "info"
    85    }
    86  resourceName: resibp186s0
    87  linkState: enable
    88  networkNamespace: default
    
  14. Deploy the configuration files.

     1  # kubectl apply -f ./network-operator/sriov-ib-network-node-policy.yaml
     2  sriovnetworknodepolicy.sriovnetwork.openshift.io/ibp12s0 created
     3  sriovnetworknodepolicy.sriovnetwork.openshift.io/ibp75s0 created
     4  sriovnetworknodepolicy.sriovnetwork.openshift.io/ibp141s0 created
     5  sriovnetworknodepolicy.sriovnetwork.openshift.io/ibp186s0 created
     6
     7  # kubectl apply -f ./network-operator/sriovibnetwork.yaml
     8  sriovibnetwork.sriovnetwork.openshift.io/ibp12s0 created
     9  sriovibnetwork.sriovnetwork.openshift.io/ibp75s0 created
    10  sriovibnetwork.sriovnetwork.openshift.io/ibpi141s0 created
    11  sriovibnetwork.sriovnetwork.openshift.io/ibp186s0 created
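
    It can take a few minutes for the sriov-network-operator to reconfigure the nodes after the policies are applied. As a hedged check, the custom resources can be listed and the resulting extended resources (nvidia.com/resibp12s0 and so on) looked up on a DGX node:

    # kubectl -n network-operator get sriovnetworknodepolicies,sriovibnetworks
    # kubectl describe node dgx01 | grep -i nvidia.com/res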
    
  15. Deploy the mpi-operator.

     1  # kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml
     2  namespace/mpi-operator created
     3  customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org created
     4  serviceaccount/mpi-operator created
     5  clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-admin created
     6  clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-edit created
     7  clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-view created
     8  clusterrole.rbac.authorization.k8s.io/mpi-operator created
     9  clusterrolebinding.rbac.authorization.k8s.io/mpi-operator created
    10  deployment.apps/mpi-operator created
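
    Before continuing, verify that the operator pod reaches the Running state:

    # kubectl -n mpi-operator get pods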
    
  16. Copy the /opt/cni/bin directory populated by the Network Operator on a DGX node to /cm/shared, where it can be accessed by the head nodes.

    1  # ssh dgx01
    2  # cp -r /opt/cni/bin /cm/shared/dgx_opt_cni_bin
    3  # exit
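
    A quick listing confirms that the CNI plugin binaries are now available on the share:

    # ls /cm/shared/dgx_opt_cni_bin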
    
  17. Create the network-validation.yaml file and run a simple validation test.

     1  # vi network-operator/network-validation.yaml
     2
     3  apiVersion: v1
     4  kind: Pod
     5  metadata:
     6    name: network-validation-pod
     7  spec:
     8    containers:
     9      - name: network-validation-pod
    10        image: docker.io/deepops/nccl-tests:latest
    11        imagePullPolicy: IfNotPresent
    12        command:
    13          - sh
    14          - -c
    15          - sleep inf
    16        securityContext:
    17          capabilities:
    18            add: ["IPC_LOCK"]
    19        resources:
    20          requests:
    21            nvidia.com/resibp75s0: "1"
    22            nvidia.com/resibp186s0: "1"
    23            nvidia.com/resibp12s0: "1"
    24            nvidia.com/resibp141s0: "1"
    25          limits:
    26            nvidia.com/resibp75s0: "1"
    27            nvidia.com/resibp186s0: "1"
    28            nvidia.com/resibp12s0: "1"
    29            nvidia.com/resibp141s0: "1"
    
  18. Apply the network-validation.yaml file.

    1  # kubectl apply -f ./network-operator/network-validation.yaml
    2  pod/network-validation-pod created
    

    If the pod starts and runs without errors, the network validation test has passed.
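
    Two simple checks, sketched below: confirm that the pod reaches the Running state, and inspect its events and allocated SRIOV resources if it stays Pending:

    # kubectl get pod network-validation-pod
    # kubectl describe pod network-validation-pod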

  19. Run a multi-node NCCL test.

    The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking, and it is the foundation of many AI/ML training and deep learning applications. A successful run of a multi-node NCCL test is a good indicator that multi-node MPI and NCCL communication between GPUs is operating correctly. Create the nccl_test.yaml file within the ./network-operator directory.

     1  # vi ./network-operator/nccl_test.yaml
     2
     3  apiVersion: kubeflow.org/v2beta1
     4  kind: MPIJob
     5  metadata:
     6    name: nccltest
     7  spec:
     8    slotsPerWorker: 8
     9    runPolicy:
    10      cleanPodPolicy: Running
    11    mpiReplicaSpecs:
    12      Launcher:
    13        replicas: 1
    14        template:
    15          spec:
    16            containers:
    17              - image: docker.io/deepops/nccl-tests:latest
    18                name: nccltest
    19                imagePullPolicy: IfNotPresent
    20                command:
    21                  - sh
    22                  - "-c"
    23                  - |
    24                    /bin/bash << 'EOF'
    25
    26                    mpirun --allow-run-as-root -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=NET -x NCCL_ALGO=RING -x NCCL_IB_DISABLE=0 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl self,tcp -mca btl_tcp_if_include 192.168.0.0/16 -mca oob_tcp_if_include 172.29.0.0/16 /nccl_tests/build/all_reduce_perf -b 8 -e 4G -f2 -g 1
    27
    28                    EOF
    29      Worker:
    30        replicas: 4
    31        template:
    32          metadata:
    33          spec:
    34            containers:
    35              - image: docker.io/deepops/nccl-tests:latest
    36                name: nccltest
    37                imagePullPolicy: IfNotPresent
    38                securityContext:
    39                  capabilities:
    40                    add: ["IPC_LOCK"]
    41                resources:
    42                  limits:
    43                    nvidia.com/resibp12s0: "1"
    44                    nvidia.com/resibp75s0: "1"
    45                    nvidia.com/resibp141s0: "1"
    46                    nvidia.com/resibp186s0: "1"
    47                    nvidia.com/gpu: 8
    
  20. Apply the nccl_test.yaml file to run the test.

     1  # kubectl apply -f ./network-operator/nccl_test.yaml
     2  mpijob.kubeflow.org/nccltest created
     3  root@basepod-head1:~#
     4  # kubectl get pods
     5  NAME                      READY   STATUS    RESTARTS   AGE
     6  nccltest-launcher-9pp28   1/1     Running   0          3m6s
     7  nccltest-worker-0         1/1     Running   0          3m6s
     8  nccltest-worker-1         1/1     Running   0          3m6s
     9  nccltest-worker-2         1/1     Running   0          3m6s
    10  nccltest-worker-3         1/1     Running   0          3m6s
    

    To see the logs, run kubectl logs nccltest-launcher-<ID>. A sample launcher log is shown below.

    [Figure network-operator-1.png: sample nccltest launcher log output]
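
    When the launcher pod completes, the overall job status can be checked and the test resources removed; a brief sketch:

    # kubectl get mpijobs
    # kubectl delete -f ./network-operator/nccl_test.yaml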