Aerial SDK 23-1

cuBB on NVIDIA Cloud Native Stack

NVIDIA Cloud Native Stack (formerly known as Cloud Native Core) is a collection of software to run cloud native workloads on NVIDIA GPUs. This section describes how to install and run the cuBB SDK software examples on NVIDIA Cloud Native Stack, along with the related components required to run the cuBB SDK.

The steps to install NVIDIA Cloud Native Stack follow the installation guide on GitHub; the details are not repeated in this document.

The contents of this section have been verified with NVIDIA Cloud Native Stack v8.0; however, the OS is kept to Ubuntu 20.04 LTS with the 5.4.0.65-lowlatency kernel.

This section describes how to enable SR-IOV for Mellanox NICs and converged cards.

Note

Some servers require BIOS setting changes to enable SR-IOV. On the Aerial DevKit, "SR-IOV Support" in the BIOS menu is enabled by default. On the Dell R750, "SR-IOV Global Enable" must be enabled.
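
As an additional sanity check from the OS (optional, and not a substitute for checking the BIOS menu), the SR-IOV capability advertised by the NIC can be inspected with lspci. The PCI address below is an example and should be replaced with your own.

# Example only: replace 0000:19:00.0 with your NIC's PCI address.
# Prints the SR-IOV capability block, including the Initial/Total VFs currently advertised.
sudo lspci -s 0000:19:00.0 -vvv | grep -A 5 -i "Single Root I/O Virtualization"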

GPU Operator with Host MOFED Driver and RDMA (without Network Operator)

In this subsection, we assume that the Network Operator is not installed and that MOFED is installed on the host. Without the Network Operator, some steps must be done manually: configuring SR-IOV for the Mellanox NIC and converged card, and installing the network plugins for the cloud native stack (see the table below for details). This subsection describes the configuration, the installations, and an example Kubernetes manifest for cuBB.

Kubernetes Network Plugin       Tested Version
Multus CNI                      3.7.1
SR-IOV Network Device Plugin    3.5.1
SR-IOV CNI                      2.7.0

Enabling SR-IOV

Configure the one-time FW settings to enable SR-IOV on the Mellanox NIC and BF2 cards.


# Define variables for the interface names and PCI addresses
export MLX0IFNAME=ens2f0np0   # CHANGE HERE
export MLX1IFNAME=ens2f1np1   # CHANGE HERE
export MLX0PCIEADDR=`ethtool -i ${MLX0IFNAME} | grep bus-info | awk '{print $2}'`
export MLX1PCIEADDR=`ethtool -i ${MLX1IFNAME} | grep bus-info | awk '{print $2}'`

# Enable SR-IOV at the FW level and set 8 VFs
sudo -E mlxconfig -d $MLX0PCIEADDR --yes set SRIOV_EN=1
sudo -E mlxconfig -d $MLX1PCIEADDR --yes set SRIOV_EN=1
sudo -E mlxconfig -d $MLX0PCIEADDR --yes set NUM_OF_VFS=8
sudo -E mlxconfig -d $MLX1PCIEADDR --yes set NUM_OF_VFS=8
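
Optionally, the FW settings can be queried to confirm they were written; this check is not part of the original procedure. Note that mlxconfig changes generally take effect only after a FW reset or server reboot.

# Optional: confirm the FW configuration for both PFs.
sudo -E mlxconfig -d $MLX0PCIEADDR query | grep -E "SRIOV_EN|NUM_OF_VFS"
sudo -E mlxconfig -d $MLX1PCIEADDR query | grep -E "SRIOV_EN|NUM_OF_VFS"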

Create a configuration file for the script that follows to configure SR-IOV; here, the number of virtual functions (VFs) for each physical port is 8.


cat << EOF | sudo tee /etc/sriov.conf
0000:19:00.0 8 # CHANGE HERE
0000:19:00.1 8 # CHANGE HERE
EOF
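
Optionally, if the MLX0PCIEADDR and MLX1PCIEADDR variables exported in the previous step are still set in the current shell, the same file can be generated from them, which avoids typos in the PCI addresses:

# Alternative (assumes the variables from the previous step are still set):
cat << EOF | sudo tee /etc/sriov.conf
${MLX0PCIEADDR} 8
${MLX1PCIEADDR} 8
EOF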

The SR-IOV configuration is reset when the machine reboots, so we create a systemd service to re-enable SR-IOV after every reboot.

Create the start-up configuration script that enables SR-IOV and brings up all virtual functions (VFs) so they are available (link-up).


cat << EOF | sudo tee /usr/local/bin/configure-sriov.sh
#!/bin/bash
set -eux
input="/etc/sriov.conf"
UDEV_RULE_FILE='/etc/udev/rules.d/10-persistent-net.rules'

append_to_file(){
    content="\$1"
    file_name="\$2"
    if ! test -f "\$file_name"
    then
        echo "\$content" > "\$file_name"
    else
        if ! grep -Fxq "\$content" "\$file_name"
        then
            echo "\$content" >> "\$file_name"
        fi
    fi
}

add_udev_rule_for_sriov_pf(){
    pf_pci=\$(grep PCI_SLOT_NAME /sys/class/net/\$1/device/uevent | cut -d'=' -f2)
    udev_data_line="SUBSYSTEM==\"net\", ACTION==\"add\", DRIVERS==\"?*\", KERNELS==\"\$pf_pci\", NAME=\"\$1\""
    append_to_file "\$udev_data_line" "\$UDEV_RULE_FILE"
}

names=()
# The trailing "_" discards any extra fields (e.g. "# CHANGE HERE" comments) in /etc/sriov.conf
while read pci_addr num_vfs _
do
    # Increase the PCIe Max Read Request Size to 4kB
    setpci -s \${pci_addr} 68.w=5000:f000
    setpci -s \${pci_addr} 68.w=5000:f000

    echo "Set \$num_vfs VFs on device \$pci_addr"
    name=\$(ls /sys/bus/pci/devices/\${pci_addr}/net/)
    names+=(\$name)

    # Create a udev rule to save the PF name
    add_udev_rule_for_sriov_pf \$name

    # Configure ALL VFs to be trusted by the FW. Needs MFT v4.19+ (https://docs.nvidia.com/doca/sdk/virtual-functions/index.html#prerequisites)
    mlxreg -d \${pci_addr} --reg_id 0xc007 --reg_len 0x40 --indexes "0x0.0:32=0x80000000" --yes --set "0x4.0:32=0x1"

    # Create the VFs
    echo \$num_vfs > /sys/bus/pci/devices/\${pci_addr}/sriov_numvfs
done <"\$input"

# Wait for the VFs to be ready
sleep 5

i=0
while read pci_addr num_vfs _
do
    # Unload the VF driver
    vf_dirs=\$(ls /sys/bus/pci/devices/\${pci_addr} | grep virtfn)
    for vf_dir in \$vf_dirs
    do
        vf_pci_addr=\$(basename "\$( readlink -f /sys/bus/pci/devices/\${pci_addr}/\$vf_dir )")
        echo \$vf_pci_addr > /sys/bus/pci/drivers/mlx5_core/unbind || true
    done

    ip link set \${names[i]} up
    i=\$(( i+1 ))

    # Load the VF driver
    for vf_dir in \$vf_dirs
    do
        vf_pci_addr=\$(basename "\$( readlink -f /sys/bus/pci/devices/\${pci_addr}/\$vf_dir )")
        echo \$vf_pci_addr > /sys/bus/pci/drivers_probe
        vf_if_name=\$(lshw -c network -businfo | grep \$vf_pci_addr | awk '{print \$2}')
        ip link set \$vf_if_name up
    done
done <"\$input"
EOF

Make the script executable.


sudo chmod +x /usr/local/bin/configure-sriov.sh

Create the systemd service to run the script.


cat << EOF | sudo tee /etc/systemd/system/sriov-configuration.service
[Unit]
Description=Configures SRIOV NIC
Wants=network-pre.target
Before=network-pre.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/configure-sriov.sh
StandardOutput=journal+console
StandardError=journal+console

[Install]
WantedBy=network-online.target
EOF

Enable autostart for the systemd service.


sudo systemctl daemon-reload
sudo systemctl enable sriov-configuration
sudo systemctl start sriov-configuration
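
Optionally, confirm that the service ran cleanly by checking its status and journal output:

systemctl status sriov-configuration --no-pager
sudo journalctl -u sriov-configuration --no-pager | tail -n 20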

Ensure SR-IOV is enabled.


lspci -tvvv
(snip)
 |  \-02.0-[17-1e]----00.0-[18-1e]--+-00.0-[19-1a]--+-00.0  Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
 |                                  |               +-00.1  Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
 |                                  |               +-00.2  Mellanox Technologies MT42822 BlueField-2 SoC Management Interface
 |                                  |               +-00.3  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                  |               +-00.4  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                  |               +-00.5  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                  |               +-00.6  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                  |               +-00.7  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                  |               +-01.0  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                  |               +-01.1  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                  |               +-01.2  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                  |               +-01.3  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                  |               +-01.4  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                  |               +-01.5  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                  |               +-01.6  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                  |               +-01.7  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                  |               +-02.0  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                  |               +-02.1  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                  |               \-02.2  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                  \-01.0-[1b-1e]----00.0-[1c-1e]----08.0-[1d-1e]----00.0  NVIDIA Corporation Device 20b8
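
The number of VFs configured on each PF can also be read from sysfs (this assumes the MLX0IFNAME and MLX1IFNAME variables are still set); each command should print 8 in this example.

cat /sys/class/net/${MLX0IFNAME}/device/sriov_numvfs
cat /sys/class/net/${MLX1IFNAME}/device/sriov_numvfs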


Installing Multus CNI

An SR-IOV VF interface is attached to a Pod as a secondary network interface. This is enabled by Multus CNI.

To install Multus CNI, create the Multus DaemonSet.


kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/v3.7.1/images/multus-daemonset.yml

Validate the status of Multus Pods.


kubectl get pods --all-namespaces
NAMESPACE     NAME                   READY   STATUS    RESTARTS   AGE
(snip)
kube-system   kube-multus-ds-77zjm   1/1     Running   0          2m52s
kube-system   kube-multus-ds-b69wn   1/1     Running   0          2m52s
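
Optionally, confirm on each worker node that Multus generated its CNI configuration; a 00-multus.conf entry (the file name may end in .conflist depending on the version) indicates the DaemonSet completed its setup.

ls /etc/cni/net.d/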


Installing the SR-IOV Network Device Plugin

The SR-IOV Network Device Plugin discovers and advertises networking resources of SR-IOV VFs and PFs available on a Kubernetes host.

To install the SR-IOV Network Device Plugin, first create a Kubernetes manifest for its ConfigMap resource.


MLX0IFNAME=ens2f0np0 # CHANGE HERE
cat << EOF | tee ./sriovdp-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |-
    {
      "resourceList": [
        {
          "resourcePrefix": "nvidia.com",
          "resourceName": "vfpool",
          "selectors": {
            "isRdma": true,
            "vendors": ["15b3"],
            "pfNames": ["${MLX0IFNAME}#0-7"]
          }
        }
      ]
    }
EOF

Create the ConfigMap resource.


kubectl apply -f ./sriovdp-configmap.yaml

Create the SR-IOV Network Device Plugin resource.


kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/sriov-network-device-plugin/v3.5.1/deployments/k8s-v1.16/sriovdp-daemonset.yaml
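
Optionally, confirm that the device plugin Pod is running on the worker node and that its logs mention the vfpool resource; replace <pod name> with the kube-sriov-device-plugin Pod name shown by the first command.

kubectl get pods -n kube-system | grep sriov-device-plugin
kubectl logs -n kube-system <pod name> | grep -i vfpool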

Check the number of available VFs (nvidia.com/vfpool) in the node.


kubectl describe nodes <host name> | grep "Capacity" -A 9

Output:
Capacity:
  cpu:                 48
  ephemeral-storage:   1844295220Ki
  hugepages-1Gi:       16Gi
  memory:              515509Mi
  nvidia.com/gpu:      2
  nvidia.com/vfpool:   8
  pods:                110
Allocatable:
  cpu:                 48
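
Alternatively, only the vfpool count can be queried with a jsonpath expression (dots in the resource name must be escaped):

kubectl get node <host name> -o jsonpath='{.status.allocatable.nvidia\.com/vfpool}'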


Installing the SR-IOV CNI

The SR-IOV CNI works with the SR-IOV Network Device Plugin for VF allocation in Kubernetes.

To install the SR-IOV CNI, first deploy the SR-IOV CNI DaemonSet.


kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/sriov-cni/v2.7.0/images/k8s-v1.16/sriov-cni-daemonset.yaml

Verify the status of the SR-IOV CNI Pod.


kubectl get po -n kube-system -o wide
NAME                            READY   STATUS    RESTARTS   AGE    IP               NODE          NOMINATED NODE   READINESS GATES
(snip)
kube-sriov-cni-ds-amd64-7rs7t   1/1     Running   0          4m9s   192.168.10.236   tme-r750-03   <none>           <none>            <--- This one

Create a Kubernetes manifest for a NetworkAttachmentDefinition custom resource for the secondary network using Multus CNI.


cat << EOF | tee ./sriov-nad.yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-vf
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/vfpool
spec:
  config: '{ "cniVersion": "0.3.1", "name": "sriov-vf", "type": "sriov" }'
EOF
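
Note that the definition above has no "ipam" section, so the VF is attached without an IPv4 address (only a link-local IPv6 address appears on net1 in the Pod later in this section), which is what the cuBB example here uses. If an IP address on the VF is needed for other workloads, a variant with static IPAM can be used; the resource name and address below are examples only and are not used elsewhere in this section.

cat << EOF | tee ./sriov-nad-static-ip.yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-vf-static
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/vfpool
spec:
  config: '{ "cniVersion": "0.3.1", "name": "sriov-vf-static", "type": "sriov", "ipam": { "type": "static", "addresses": [ { "address": "192.168.100.10/24" } ] } }'
EOF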

Create the custom resource.


kubectl apply -f sriov-nad.yaml

Check if the NetworkAttachmentDefinition resource was created.


kubectl get network-attachment-definition
NAME       AGE
sriov-vf   5d3h


An Example of a Kubernetes manifest for cuBB with SR-IOV VF


cat << EOF | tee ./cubb-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cubb-22-4
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-vf
spec:
  nodeName: <node name>
  imagePullSecrets:
  - name: ngc-secret # Need to create a Secret resource for NGC if pulling the container image from NGC: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
  containers:
  - name: cubb-ctr
    image: nvcr.io/ea-aerial-sdk/aerial:22-4-cubb
    imagePullPolicy: Always
    command: ["/bin/sh","-c","sleep infinity"]
    securityContext:
      privileged: true
    workingDir: /opt/nvidia/cuBB
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    - mountPath: /dev/shm
      name: dshm
    - mountPath: /usr/src
      name: nvidia-driver
    - mountPath: /lib/modules
      name: lib-modules
    resources:
      limits:
        hugepages-1Gi: 2Gi
        memory: 16Gi
        nvidia.com/gpu: 1
        nvidia.com/vfpool: 1
      requests:
        hugepages-1Gi: 2Gi
        memory: 16Gi
        nvidia.com/gpu: 1
        nvidia.com/vfpool: 1
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: dshm # Unable to configure shm size in Kubernetes, need to use this WAR: https://github.com/kubernetes/kubernetes/issues/28272#issuecomment-540943623
    emptyDir: { medium: 'Memory', sizeLimit: '4Gi' }
  - name: nvidia-driver
    hostPath:
      path: /run/nvidia/driver/usr/src
  - name: lib-modules
    hostPath:
      path: /lib/modules
  restartPolicy: Never
EOF
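
The imagePullSecrets entry above refers to a Secret named ngc-secret. If the container image is pulled from NGC, that Secret can be created as shown below, following the Kubernetes private-registry documentation linked in the manifest comment; replace <NGC API key> with your key (the NGC registry user name is the literal string $oauthtoken).

kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<NGC API key>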

Create a cuBB Pod.


kubectl apply -f cubb-pod.yaml

Check the status of all Pods.


kubectl get pods --all-namespaces -o wide
NAMESPACE      NAME                                                               READY   STATUS      RESTARTS        AGE     IP               NODE          NOMINATED NODE   READINESS GATES
default        cubb-22-4                                                          1/1     Running     0               2m47s   192.168.10.246   tme-r750-03   <none>           <none>
gpu-operator   gpu-feature-discovery-fpl85                                        1/1     Running     0               6m39s   192.168.10.233   tme-r750-03   <none>           <none>
gpu-operator   gpu-operator-1669877423-node-feature-discovery-master-57964grsj   1/1     Running     0               22h     192.168.144.7    tme-r630-02   <none>           <none>
gpu-operator   gpu-operator-1669877423-node-feature-discovery-worker-9kg8m       1/1     Running     1 (7m58s ago)   21m     192.168.10.228   tme-r750-03   <none>           <none>
gpu-operator   gpu-operator-5dc6b8989b-6lz89                                      1/1     Running     1 (7m58s ago)   13m     192.168.10.224   tme-r750-03   <none>           <none>
gpu-operator   nvidia-container-toolkit-daemonset-xdcwx                           1/1     Running     0               6m39s   192.168.10.230   tme-r750-03   <none>           <none>
gpu-operator   nvidia-cuda-validator-2k52t                                        0/1     Completed   0               4m13s   192.168.10.243   tme-r750-03   <none>           <none>
gpu-operator   nvidia-dcgm-exporter-tvx6j                                         1/1     Running     0               6m39s   192.168.10.225   tme-r750-03   <none>           <none>
gpu-operator   nvidia-device-plugin-daemonset-lkn84                               1/1     Running     0               6m39s   192.168.10.232   tme-r750-03   <none>           <none>
gpu-operator   nvidia-device-plugin-validator-99d7x                               0/1     Completed   0               4m2s    192.168.10.245   tme-r750-03   <none>           <none>
gpu-operator   nvidia-driver-daemonset-bj9rx                                      2/2     Running     3 (2m28s ago)   20m     192.168.10.231   tme-r750-03   <none>           <none>
gpu-operator   nvidia-mig-manager-ntnrn                                           1/1     Running     0               6m39s   192.168.10.226   tme-r750-03   <none>           <none>
gpu-operator   nvidia-operator-validator-2bn29                                    1/1     Running     0               6m39s   192.168.10.242   tme-r750-03   <none>           <none>
kube-system    calico-kube-controllers-58dbc876ff-5lhnp                           1/1     Running     0               22h     192.168.144.4    tme-r630-02   <none>           <none>
kube-system    calico-node-d7sf5                                                  1/1     Running     0               23h     10.136.139.228   tme-r630-02   <none>           <none>
kube-system    calico-node-zkjbv                                                  1/1     Running     3 (7m58s ago)   23h     10.136.139.154   tme-r750-03   <none>           <none>
kube-system    coredns-565d847f94-9h9pn                                           1/1     Running     0               23h     192.168.144.2    tme-r630-02   <none>           <none>
kube-system    coredns-565d847f94-nfwzf                                           1/1     Running     0               23h     192.168.144.1    tme-r630-02   <none>           <none>
kube-system    etcd-tme-r630-02                                                   1/1     Running     0               23h     10.136.139.228   tme-r630-02   <none>           <none>
kube-system    kube-apiserver-tme-r630-02                                         1/1     Running     0               23h     10.136.139.228   tme-r630-02   <none>           <none>
kube-system    kube-controller-manager-tme-r630-02                                1/1     Running     0               23h     10.136.139.228   tme-r630-02   <none>           <none>
kube-system    kube-multus-ds-amd64-25922                                         1/1     Running     1 (7m58s ago)   159m    10.136.139.154   tme-r750-03   <none>           <none>
kube-system    kube-multus-ds-amd64-pqfvk                                         1/1     Running     0               159m    10.136.139.228   tme-r630-02   <none>           <none>
kube-system    kube-proxy-2cfnc                                                   1/1     Running     0               23h     10.136.139.228   tme-r630-02   <none>           <none>
kube-system    kube-proxy-7jbgw                                                   1/1     Running     3 (7m58s ago)   23h     10.136.139.154   tme-r750-03   <none>           <none>
kube-system    kube-scheduler-tme-r630-02                                         1/1     Running     0               23h     10.136.139.228   tme-r630-02   <none>           <none>
kube-system    kube-sriov-cni-ds-amd64-ntvj7                                      1/1     Running     1 (7m58s ago)   4h52m   192.168.10.229   tme-r750-03   <none>           <none>
kube-system    kube-sriov-device-plugin-amd64-wpxx2                               1/1     Running     1 (7m58s ago)   4h52m   10.136.139.154   tme-r750-03   <none>           <none>

Get a shell to the cuBB Pod.


kubectl exec -it cubb-22-4 -- bash

Check the attached VF inside the Pod. net1 is attached as the secondary network interface from nvidia.com/vfpool.


(cubb-22-4 #) ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1480
        inet 192.168.10.246  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::2090:fdff:fe3c:e547  prefixlen 64  scopeid 0x20<link>
        ether 22:90:fd:3c:e5:47  txqueuelen 0  (Ethernet)
        RX packets 13  bytes 1912 (1.9 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14  bytes 1076 (1.0 KB)
        TX errors 0  dropped 1 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

net1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1514
        inet6 fe80::f823:e9ff:fe34:143f  prefixlen 64  scopeid 0x20<link>
        ether fa:23:e9:34:14:3f  txqueuelen 1000  (Ethernet)
        RX packets 2  bytes 324 (324.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 38  bytes 3834 (3.8 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

(cubb-22-4 #) ibdev2netdev -v
0000:19:00.3 mlx5_4 (MT4126 - NA)  fw 24.35.1012 port 1 (ACTIVE) ==> net1 (Up)
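
The attached VF can also be inspected from outside the Pod: Multus records the secondary network interfaces in the network-status Pod annotation. The exact fields depend on the Multus and SR-IOV CNI versions in use, but the interface name and MAC address are typically included.

kubectl get pod cubb-22-4 -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}'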

In summary, the following VF network interface is available in the Pod in this example:

  • MAC address of the assigned VF: fa:23:e9:34:14:3f

  • PCIe address of the assigned VF: 0000:19:00.3

Configurations of cuBB

The required changes for SR-IOV are the PCI address of the assigned VF in the cuphycontroller yaml file and the MAC address of the assigned VF in the RU-emulator config yaml file. Here is an example of the cuphycontroller yaml file.


cuphydriver_config:
  (snip)
  nics:
  - nic: 0000:19:00.3
  cells:
  - name: O-RU 0
    nic: 0000:19:00.3
  - name: O-RU 1

Here is an example of the ru-emulator yaml file.


ru_emulator:
  (snip)
  peers:
  - peerethaddr: fa:23:e9:34:14:3f

The remaining steps to run cuBB End-to-End are the same as the usual cuBB End-to-End sequence.
