Network Operator

  1. Verify that the NVIDIA Mellanox OFED package version on the DGX systems matches the version listed in the DGX OS release notes.

    1  # cmsh
    2  % device
    3  % pexec -c dgx-a100 -j "ofed_info -s"
    4  [dgx01..dgx04]
    5  MLNX_OFED_LINUX-23.10-0.5.5.0:
    

    The InfiniBand interfaces used in the compute fabric must be identified and their operational status checked. As noted, mlx5_0, mlx5_2, mlx5_6, and mlx5_8 are used and should be verified to be in working order. On each node, every interface should report State: Active, Physical state: LinkUp, and Link layer: InfiniBand.

  2. Verify that the interfaces are working properly with the following command:

     1  [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do ibstat -d mlx5_${i} | grep -i \"mlx5_\\|state\\|infiniband\"; done"
     2  [dgx01..dgx04]
     3  CA 'mlx5_0'
     4                  State: Active
     5                  Physical state: LinkUp
     6                  Link layer: InfiniBand
     7  CA 'mlx5_2'
     8                  State: Active
     9                  Physical state: LinkUp
    10                  Link layer: InfiniBand
    11  CA 'mlx5_6'
    12                  State: Active
    13                  Physical state: LinkUp
    14                  Link layer: InfiniBand
    15  CA 'mlx5_8'
    16                  State: Active
    17                  Physical state: LinkUp
    18                  Link layer: InfiniBand
    
  3. Check the SRIOV interface status.

    1. NUM_OF_VFS should be set to 8.

    2. SRIOV_EN should be True(1).

    3. LINK_TYPE_P1 should be IB(1).

    In this example, only LINK_TYPE_P1 is already set correctly. The other two values are set in the next step.

     1  [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do mst start; mlxconfig -d /dev/mst/mt4123_pciconf${i} q; done | grep -e \"SRIOV_EN\\|LINK_TYPE\\|NUM_OF_VFS\""
     2  [dgx01..dgx04]
     3          NUM_OF_VFS                          0
     4          SRIOV_EN                            False(0)
     5          LINK_TYPE_P1                        IB(1)
     6          NUM_OF_VFS                          0
     7          SRIOV_EN                            False(0)
     8          LINK_TYPE_P1                        IB(1)
     9          NUM_OF_VFS                          0
    10          SRIOV_EN                            False(0)
    11          LINK_TYPE_P1                        IB(1)
    12          NUM_OF_VFS                          0
    13          SRIOV_EN                            False(0)
    14          LINK_TYPE_P1                        IB(1)
    
  4. Enable SRIOV and set NUM_OF_VFS to 8 for each interface.

    Since LINK_TYPE_P1 was already set correctly, only the other two values are set below.

     1  [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do mst start; mlxconfig -d /dev/mst/mt4123_pciconf${i} -y set SRIOV_EN=1 NUM_OF_VFS=8; done"
     2  [dgx01..dgx04]
     3  Starting MST (Mellanox Software Tools) driver set
     4  Loading MST PCI module - Success
     5  [warn] mst_pciconf is already loaded, skipping
     6  Create devices
     7  Unloading MST PCI module (unused) - Success
     8
     9  Device #1:
    10  ----------
    11
    12  Device type:    ConnectX6
    13  Name:           MCX653105A-HDA_Ax
    14  Description:    ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
    15  Device:         /dev/mst/mt4123_pciconf0
    16
    17  Configurations:                              Next Boot       New
    18          SRIOV_EN                            False(0)        True(1)
    19          NUM_OF_VFS                          0               8
    20
    21  Apply new Configuration? (y/n) [n] : y
    22  Applying... Done!
    23  -I- Please reboot machine to load new configurations.
    24  . . . some output omitted . . .
    
  5. Reboot the DGX nodes to load the configuration.

    % reboot -c dgx-a100
    
  6. Wait for the DGX nodes to be UP before continuing to the next step.

     1  % list -c dgx-a100 -f hostname:20,category:10,ip:20,status:10
     2  hostname (key)       category   ip                   status
     3  -------------------- ---------- -------------------- ----------
     4  dgx01                dgx-a100   10.184.71.11         [   UP   +
     5  dgx02                dgx-a100   10.184.71.12         [   UP   +
     6  dgx03                dgx-a100   10.184.71.13         [   UP   +
     7  dgx04                dgx-a100   10.184.71.14         [   UP   +
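
    After the nodes are back up, the new firmware settings can be confirmed by re-running the query from step 3; every interface should now report SRIOV_EN True(1) and NUM_OF_VFS 8, with LINK_TYPE_P1 still IB(1). The same command applies:

    [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do mst start; mlxconfig -d /dev/mst/mt4123_pciconf${i} q; done | grep -e \"SRIOV_EN\\|LINK_TYPE\\|NUM_OF_VFS\""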
    
  7. Configure eight SRIOV VFs on the InfiniBand ports.

    [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do echo 8 > /sys/class/infiniband/mlx5_${i}/device/sriov_numvfs; done"
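
    The value can be read back to confirm that eight VFs were created; each of the four interfaces should report 8 on every node. A minimal check using the same pexec pattern:

    [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do cat /sys/class/infiniband/mlx5_${i}/device/sriov_numvfs; done"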
    
  8. On the primary head node, load the Kubernetes environment module.

    # module load kubernetes/default/1.27.11-150500.1.1
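
    Loading the module places kubectl in the PATH and points it at the cluster. A quick sanity check, assuming the head node already holds a valid kubeconfig, is to list the cluster nodes:

    # kubectl get nodes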
    
  9. Add the Network Operator Helm repository and update the local chart cache.

     1  # helm repo add nvidia-networking https://mellanox.github.io/network-operator
     2  "nvidia-networking" has been added to your repositories
     3
     4  # helm repo update
     5  Hang tight while we grab the latest from your chart repositories...
     6  ...Successfully got an update from the "nvidia-networking" chart repository
     7  ...Successfully got an update from the "prometheus-community" chart repository
     8  ...Successfully got an update from the "nvidia" chart repository
     9  Update Complete. ⎈Happy Helming!⎈
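
    Optionally, confirm that the chart is visible in the refreshed cache (the version column will vary with the repository contents):

    # helm search repo nvidia-networking/network-operator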
    
  10. Create the directory ./network-operator.

    # mkdir ./network-operator
    
  11. Create the values.yaml file that Helm will use to install the Network Operator.

     1# vi ./network-operator/values.yaml
     2
     3nfd:
     4  enabled: true
     5sriovNetworkOperator:
     6  enabled: true
     7
     8# NicClusterPolicy CR values:
     9deployCR: true
    10ofedDriver:
    11  deploy: false
    12rdmaSharedDevicePlugin:
    13  deploy: false
    14sriovDevicePlugin:
    15  deploy: false
    16
    17secondaryNetwork:
    18  deploy: true
    19  multus:
    20    deploy: true
    21  cniPlugins:
    22    deploy: true
    23  ipamPlugin:
    24    deploy: true
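
    With the values file in place, the chart from the repository added in step 9 is installed using it. A representative invocation is sketched below; the release name and --create-namespace flag are assumptions to adapt to your environment, while the network-operator namespace matches the one referenced by the custom resources in the following steps. The operator pods should all reach Running before proceeding:

    # helm install network-operator nvidia-networking/network-operator -n network-operator --create-namespace -f ./network-operator/values.yaml
    # kubectl -n network-operator get pods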
    
  12. Create the sriov-ib-network-node-policy.yaml file.

     1# vi ./network-operator/sriov-ib-network-node-policy.yaml
     2
     3apiVersion: sriovnetwork.openshift.io/v1
     4kind: SriovNetworkNodePolicy
     5metadata:
     6  name: ibp12s0
     7  namespace: network-operator
     8spec:
     9  deviceType: netdevice
    10  nodeSelector:
    11    feature.node.kubernetes.io/network-sriov.capable: "true"
    12  nicSelector:
    13    vendor: "15b3"
    14    pfNames: ["ibp12s0"]
    15  linkType: ib
    16  isRdma: true
    17  numVfs: 8
    18  priority: 90
    19  resourceName: resibp12s0
    20
    21---
    22apiVersion: sriovnetwork.openshift.io/v1
    23kind: SriovNetworkNodePolicy
    24metadata:
    25  name: ibp75s0
    26  namespace: network-operator
    27spec:
    28  deviceType: netdevice
    29  nodeSelector:
    30    feature.node.kubernetes.io/network-sriov.capable: "true"
    31  nicSelector:
    32    vendor: "15b3"
    33    pfNames: ["ibp75s0"]
    34  linkType: ib
    35  isRdma: true
    36  numVfs: 8
    37  priority: 90
    38  resourceName: resibp75s0
    39
    40---
    41apiVersion: sriovnetwork.openshift.io/v1
    42kind: SriovNetworkNodePolicy
    43metadata:
    44  name: ibp141s0
    45  namespace: network-operator
    46spec:
    47  deviceType: netdevice
    48  nodeSelector:
    49    feature.node.kubernetes.io/network-sriov.capable: "true"
    50  nicSelector:
    51    vendor: "15b3"
    52    pfNames: ["ibp141s0"]
    53  linkType: ib
    54  isRdma: true
    55  numVfs: 8
    56  priority: 90
    57  resourceName: resibp141s0
    58
    59---
    60apiVersion: sriovnetwork.openshift.io/v1
    61kind: SriovNetworkNodePolicy
    62metadata:
    63  name: ibp186s0
    64  namespace: network-operator
    65spec:
    66  deviceType: netdevice
    67  nodeSelector:
    68    feature.node.kubernetes.io/network-sriov.capable: "true"
    69  nicSelector:
    70    vendor: "15b3"
    71    pfNames: ["ibp186s0"]
    72  linkType: ib
    73  isRdma: true
    74  numVfs: 8
    75  priority: 90
    76  resourceName: resibp186s0
    
  13. Create the sriovibnetwork.yaml file.

     1# vi ./network-operator/sriovibnetwork.yaml
     2
     3apiVersion: sriovnetwork.openshift.io/v1
     4kind: SriovIBNetwork
     5metadata:
     6  name: ibp12s0
     7  namespace: network-operator
     8spec:
     9  ipam: |
    10    {
    11      "type": "whereabouts",
    12      "datastore": "kubernetes",
    13      "kubernetes": {
    14        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
    15      },
    16      "range": "192.168.1.0/24",
    17      "log_file": "/var/log/whereabouts.log",
    18      "log_level": "info"
    19    }
    20  resourceName: resibp12s0
    21  linkState: enable
    22  networkNamespace: default
    23
    24---
    25apiVersion: sriovnetwork.openshift.io/v1
    26kind: SriovIBNetwork
    27metadata:
    28  name: ibp75s0
    29  namespace: network-operator
    30spec:
    31  ipam: |
    32    {
    33      "type": "whereabouts",
    34      "datastore": "kubernetes",
    35      "kubernetes": {
    36        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
    37      },
    38      "range": "192.168.2.0/24",
    39      "log_file": "/var/log/whereabouts.log",
    40      "log_level": "info"
    41    }
    42  resourceName: resibp75s0
    43  linkState: enable
    44  networkNamespace: default
    45
    46---
    47apiVersion: sriovnetwork.openshift.io/v1
    48kind: SriovIBNetwork
    49metadata:
    50  name: ibpi141s0
    51  namespace: network-operator
    52spec:
    53  ipam: |
    54    {
    55      "type": "whereabouts",
    56      "datastore": "kubernetes",
    57      "kubernetes": {
    58        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
    59      },
    60      "range": "192.168.3.0/24",
    61      "log_file": "/var/log/whereabouts.log",
    62      "log_level": "info"
    63    }
    64  resourceName: resibp141s0
    65  linkState: enable
    66  networkNamespace: default
    67
    68---
    69apiVersion: sriovnetwork.openshift.io/v1
    70kind: SriovIBNetwork
    71metadata:
    72  name: ibp186s0
    73  namespace: network-operator
    74spec:
    75  ipam: |
    76    {
    77      "type": "whereabouts",
    78      "datastore": "kubernetes",
    79      "kubernetes": {
    80        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
    81      },
    82      "range": "192.168.4.0/24",
    83      "log_file": "/var/log/whereabouts.log",
    84      "log_level": "info"
    85    }
    86  resourceName: resibp186s0
    87  linkState: enable
    88  networkNamespace: default
    
  14. Deploy the configuration files.

     1  # kubectl apply -f ./network-operator/sriov-ib-network-node-policy.yaml
     2  sriovnetworknodepolicy.sriovnetwork.openshift.io/ibp12s0 created
     3  sriovnetworknodepolicy.sriovnetwork.openshift.io/ibp75s0 created
     4  sriovnetworknodepolicy.sriovnetwork.openshift.io/ibp141s0 created
     5  sriovnetworknodepolicy.sriovnetwork.openshift.io/ibp186s0 created
     6
     7  # kubectl apply -f ./network-operator/sriovibnetwork.yaml
     8  sriovibnetwork.sriovnetwork.openshift.io/ibp12s0 created
     9  sriovibnetwork.sriovnetwork.openshift.io/ibp75s0 created
    10  sriovibnetwork.sriovnetwork.openshift.io/ibpi141s0 created
    11  sriovibnetwork.sriovnetwork.openshift.io/ibp186s0 created
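
    It can take a few minutes for the sriov-network-operator to reconfigure the nodes after the policies are applied. As a hedged check, the custom resources can be listed and the resulting extended resources (nvidia.com/resibp12s0 and so on) looked up on a DGX node:

    # kubectl -n network-operator get sriovnetworknodepolicies,sriovibnetworks
    # kubectl describe node dgx01 | grep -i nvidia.com/res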
    
  15. Deploy the mpi-operator.

     1  # kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml
     2  namespace/mpi-operator created
     3  customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org created
     4  serviceaccount/mpi-operator created
     5  clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-admin created
     6  clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-edit created
     7  clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-view created
     8  clusterrole.rbac.authorization.k8s.io/mpi-operator created
     9  clusterrolebinding.rbac.authorization.k8s.io/mpi-operator created
    10  deployment.apps/mpi-operator created
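
    Before continuing, verify that the operator pod reaches the Running state:

    # kubectl -n mpi-operator get pods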
    
  16. Copy the /opt/cni/bin directory populated by the Network Operator on a DGX node to /cm/shared, where it can be accessed by the head nodes.

    1  # ssh dgx01
    2  # cp -r /opt/cni/bin /cm/shared/dgx_opt_cni_bin
    3  # exit
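
    A quick listing confirms that the CNI plugin binaries are now available on the share:

    # ls /cm/shared/dgx_opt_cni_bin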
    
  17. Create the network-validation.yaml file and run a simple validation test.

     1  # vi network-operator/network-validation.yaml
     2
     3  apiVersion: v1
     4  kind: Pod
     5  metadata:
     6    name: network-validation-pod
     7  spec:
     8    containers:
     9      - name: network-validation-pod
    10        image: docker.io/deepops/nccl-tests:latest
    11        imagePullPolicy: IfNotPresent
    12        command:
    13          - sh
    14          - -c
    15          - sleep inf
    16        securityContext:
    17          capabilities:
    18            add: ["IPC_LOCK"]
    19        resources:
    20          requests:
    21            nvidia.com/resibp75s0: "1"
    22            nvidia.com/resibp186s0: "1"
    23            nvidia.com/resibp12s0: "1"
    24            nvidia.com/resibp141s0: "1"
    25          limits:
    26            nvidia.com/resibp75s0: "1"
    27            nvidia.com/resibp186s0: "1"
    28            nvidia.com/resibp12s0: "1"
    29            nvidia.com/resibp141s0: "1"
    
  18. Apply the network-validation.yaml file.

    1  # kubectl apply -f ./network-operator/network-validation.yaml
    2  pod/network-validation-pod created
    

    If the pod starts and runs without errors, the network validation test has passed.
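
    Two simple checks, sketched below: confirm that the pod reaches the Running state, and inspect its events and allocated SRIOV resources if it stays Pending:

    # kubectl get pod network-validation-pod
    # kubectl describe pod network-validation-pod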

  19. Run a multi-node NCCL test.

    The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking, and it is the foundation of many AI/ML training and deep learning applications. A successful run of a multi-node NCCL test is a good indicator that multi-node MPI and NCCL communication between GPUs is operating correctly. Create the nccl_test.yaml file within the ./network-operator directory.

     1  # vi ./network-operator/nccl_test.yaml
     2
     3  apiVersion: kubeflow.org/v2beta1
     4  kind: MPIJob
     5  metadata:
     6    name: nccltest
     7  spec:
     8    slotsPerWorker: 8
     9    runPolicy:
    10      cleanPodPolicy: Running
    11    mpiReplicaSpecs:
    12      Launcher:
    13        replicas: 1
    14        template:
    15          spec:
    16            containers:
    17              - image: docker.io/deepops/nccl-tests:latest
    18                name: nccltest
    19                imagePullPolicy: IfNotPresent
    20                command:
    21                  - sh
    22                  - "-c"
    23                  - |
    24                    /bin/bash << 'EOF'
    25
    26                    mpirun --allow-run-as-root -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=NET -x NCCL_ALGO=RING -x NCCL_IB_DISABLE=0 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl self,tcp -mca btl_tcp_if_include 192.168.0.0/16 -mca oob_tcp_if_include 172.29.0.0/16 /nccl_tests/build/all_reduce_perf -b 8 -e 4G -f2 -g 1
    27
    28                    EOF
    29      Worker:
    30        replicas: 4
    31        template:
    32          metadata:
    33          spec:
    34            containers:
    35              - image: docker.io/deepops/nccl-tests:latest
    36                name: nccltest
    37                imagePullPolicy: IfNotPresent
    38                securityContext:
    39                  capabilities:
    40                    add: ["IPC_LOCK"]
    41                resources:
    42                  limits:
    43                    nvidia.com/resibp12s0: "1"
    44                    nvidia.com/resibp75s0: "1"
    45                    nvidia.com/resibp141s0: "1"
    46                    nvidia.com/resibp186s0: "1"
    47                    nvidia.com/gpu: 8
    
  20. Apply the nccl_test.yaml file to run the test.

     1  # kubectl apply -f ./network-operator/nccl_test.yaml
     2  mpijob.kubeflow.org/nccltest created
     3  root@basepod-head1:~#
     4  # kubectl get pods
     5  NAME                      READY   STATUS    RESTARTS   AGE
     6  nccltest-launcher-9pp28   1/1     Running   0          3m6s
     7  nccltest-worker-0         1/1     Running   0          3m6s
     8  nccltest-worker-1         1/1     Running   0          3m6s
     9  nccltest-worker-2         1/1     Running   0          3m6s
    10  nccltest-worker-3         1/1     Running   0          3m6s
    

    To see the logs, run kubectl logs nccltest-launcher-<ID>. A sample launcher log is shown below.

    [Figure network-operator-1.png: sample nccltest launcher log output]
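
    When the launcher pod completes, the overall job status can be checked and the test resources removed; a brief sketch:

    # kubectl get mpijobs
    # kubectl delete -f ./network-operator/nccl_test.yaml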