cuBB on NVIDIA Cloud Native Stack

NVIDIA Cloud Native Stack (formerly known as Cloud Native Core) is a collection of software to run cloud native workloads on NVIDIA GPUs. This section describes how to install and run the cuBB SDK software examples on NVIDIA Cloud Native Stack, along with the related components needed to run the cuBB SDK.

The steps to install NVIDIA Cloud Native Stack follow the installation guide on GitHub and are not repeated in this document.

The contents of this section have been verified with NVIDIA Cloud Native Stack v8.0, with the OS kept to Ubuntu 20.04 LTS with the 5.4.0.65-lowlatency kernel.

This section describes how to enable SR-IOV for Mellanox NICs and converged cards.

Note

Some servers need to change BIOS settings to enable SR-IOV. For Aerial DevKit, “SR-IOV Support” in the BIOS menu is enabled by default. For Dell R750, “SR-IOV Global Enable” should be enabled.

GPU Operator with Host MOFED Driver and RDMA (without Network Operator)

In this subsection, we assume that Network Operator is not installed and MOFED is installed on the host. Without Network Operator, some steps must be done manually, namely configuring SR-IOV for the Mellanox NIC and converged card and installing network plugins for the cloud native stack (see the table below for details). This subsection describes the configuration, the installations, and an example Kubernetes manifest for cuBB.

Kubernetes Network Plugin        Tested Version
Multus CNI                       3.7.1
SR-IOV Network Device Plugin     3.5.1
SR-IOV CNI                       2.7.0

Enabling SR-IOV

Configure the one-time FW settings to enable SR-IOV on the Mellanox NIC and BlueField-2 cards.

# Define variables for the interface names and PCI addresses
export MLX0IFNAME=ens2f0np0 # CHANGE HERE
export MLX1IFNAME=ens2f1np1 # CHANGE HERE
export MLX0PCIEADDR=`ethtool -i ${MLX0IFNAME} | grep bus-info | awk '{print $2}'`
export MLX1PCIEADDR=`ethtool -i ${MLX1IFNAME} | grep bus-info | awk '{print $2}'`

# Enable SR-IOV at the FW level and set 8 VFs
sudo -E mlxconfig -d $MLX0PCIEADDR --yes set SRIOV_EN=1
sudo -E mlxconfig -d $MLX1PCIEADDR --yes set SRIOV_EN=1
sudo -E mlxconfig -d $MLX0PCIEADDR --yes set NUM_OF_VFS=8
sudo -E mlxconfig -d $MLX1PCIEADDR --yes set NUM_OF_VFS=8
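
As an optional check, you can query the firmware configuration back with mlxconfig to confirm the values were written; note that mlxconfig changes typically take effect only after the next reboot or firmware reset.

# Optional check: confirm the next-boot FW configuration
sudo -E mlxconfig -d $MLX0PCIEADDR query | grep -E 'SRIOV_EN|NUM_OF_VFS'
sudo -E mlxconfig -d $MLX1PCIEADDR query | grep -E 'SRIOV_EN|NUM_OF_VFS'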

Create a configuration file for the following script that configures SR-IOV; in this example, the number of virtual functions (VFs) for each physical port is 8.

cat << EOF | sudo tee /etc/sriov.conf
0000:19:00.0 8 # CHANGE HERE
0000:19:00.1 8 # CHANGE HERE
EOF
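
As an optional sanity check, the PCI addresses written to /etc/sriov.conf should match the bus-info values reported for the two physical ports:

cat /etc/sriov.conf
ethtool -i ${MLX0IFNAME} | grep bus-info
ethtool -i ${MLX1IFNAME} | grep bus-info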

The SR-IOV configuration is reset when the machine reboots, so create a systemd service that re-enables SR-IOV after every reboot.

Create the start-up script that enables SR-IOV and brings all virtual functions (VFs) up (link-up).

cat << EOF | sudo tee /usr/local/bin/configure-sriov.sh
#!/bin/bash
set -eux
input="/etc/sriov.conf"
UDEV_RULE_FILE='/etc/udev/rules.d/10-persistent-net.rules'

append_to_file(){
    content="\$1"
    file_name="\$2"
    if ! test -f "\$file_name"
    then
        echo "\$content" > "\$file_name"
    else
        if ! grep -Fxq "\$content" "\$file_name"
        then
            echo "\$content" >> "\$file_name"
        fi
    fi
}

add_udev_rule_for_sriov_pf(){
    pf_pci=\$(grep PCI_SLOT_NAME /sys/class/net/\$1/device/uevent | cut -d'=' -f2)
    udev_data_line="SUBSYSTEM==\"net\", ACTION==\"add\", DRIVERS==\"?*\", KERNELS==\"\$pf_pci\", NAME=\"\$1\""
    append_to_file "\$udev_data_line" "\$UDEV_RULE_FILE"
}

names=()
while read pci_addr num_vfs
do
    # Increase the PCIe Max Read Request Size to 4kB
    setpci -s \${pci_addr} 68.w=5000:f000
    setpci -s \${pci_addr} 68.w=5000:f000

    echo "Set \$num_vfs VFs on device \$pci_addr"

    name=\$(ls /sys/bus/pci/devices/\${pci_addr}/net/)
    names+=(\$name)

    # Create a udev rule to preserve the PF name
    add_udev_rule_for_sriov_pf \$name

    # Configure ALL VFs to be trusted by the FW. Requires MFT v4.19+ (https://docs.nvidia.com/doca/sdk/virtual-functions/index.html#prerequisites)
    mlxreg -d \${pci_addr} --reg_id 0xc007 --reg_len 0x40 --indexes "0x0.0:32=0x80000000" --yes --set "0x4.0:32=0x1"

    # Create the VFs
    echo \$num_vfs > /sys/bus/pci/devices/\${pci_addr}/sriov_numvfs
done <"\$input"

# Wait for the VFs to be ready
sleep 5

i=0
while read pci_addr num_vfs
do
    # Unbind the VF driver
    vf_dirs=\$(ls /sys/bus/pci/devices/\${pci_addr} | grep virtfn)
    for vf_dir in \$vf_dirs
    do
        vf_pci_addr=\$(basename "\$( readlink -f /sys/bus/pci/devices/\${pci_addr}/\$vf_dir )")
        echo \$vf_pci_addr > /sys/bus/pci/drivers/mlx5_core/unbind || true
    done

    ip link set \${names[i]} up
    i=\$(( i+1 ))

    # Rebind the VF driver and bring the VF interfaces up
    for vf_dir in \$vf_dirs
    do
        vf_pci_addr=\$(basename "\$( readlink -f /sys/bus/pci/devices/\${pci_addr}/\$vf_dir )")
        echo \$vf_pci_addr > /sys/bus/pci/drivers_probe
        vf_if_name=\$(lshw -c network -businfo | grep \$vf_pci_addr | awk '{print \$2}')
        ip link set \$vf_if_name up
    done
done <"\$input"
EOF

Make the script executable.

sudo chmod +x /usr/local/bin/configure-sriov.sh
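
Optionally, you can run the script once by hand before wiring it into systemd and confirm that the VFs were created, for example:

sudo /usr/local/bin/configure-sriov.sh
# The VF count should match the value in /etc/sriov.conf (8 in this example)
cat /sys/class/net/${MLX0IFNAME}/device/sriov_numvfs
ip link show ${MLX0IFNAME} | grep "vf "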

Create the systemd service to run the script.

cat << EOF | sudo tee /etc/systemd/system/sriov-configuration.service
[Unit]
Description=Configures SRIOV NIC
Wants=network-pre.target
Before=network-pre.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/configure-sriov.sh
StandardOutput=journal+console
StandardError=journal+console

[Install]
WantedBy=network-online.target
EOF

Enable autostart for the systemd service.

sudo systemctl daemon-reload
sudo systemctl enable sriov-configuration
sudo systemctl start sriov-configuration
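
You can check that the service ran successfully and review its log output, for example:

systemctl status sriov-configuration --no-pager
journalctl -u sriov-configuration --no-pager | tail -n 20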

Ensure SR-IOV is enabled.

lspci -tvvv
(snip)
 |           \-02.0-[17-1e]----00.0-[18-1e]--+-00.0-[19-1a]--+-00.0  Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
 |                                           |               +-00.1  Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
 |                                           |               +-00.2  Mellanox Technologies MT42822 BlueField-2 SoC Management Interface
 |                                           |               +-00.3  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                           |               +-00.4  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                           |               +-00.5  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                           |               +-00.6  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                           |               +-00.7  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                           |               +-01.0  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                           |               +-01.1  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                           |               +-01.2  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                           |               +-01.3  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                           |               +-01.4  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                           |               +-01.5  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                           |               +-01.6  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                           |               +-01.7  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                           |               +-02.0  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                           |               +-02.1  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                           |               \-02.2  Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
 |                                           \-01.0-[1b-1e]----00.0-[1c-1e]----08.0-[1d-1e]----00.0  NVIDIA Corporation Device 20b8


Installing Multus CNI

An SR-IOV VF interface is attached to a Pod as a secondary network interface. This is enabled by Multus CNI.

To install Multus CNI, create the Multus DaemonSet.

kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg\
/multus-cni/v3.7.1/images/multus-daemonset.yml

Validate the status of Multus Pods.

kubectl get pods --all-namespaces
NAMESPACE     NAME                   READY   STATUS    RESTARTS   AGE
(snip)
kube-system   kube-multus-ds-77zjm   1/1     Running   0          2m52s
kube-system   kube-multus-ds-b69wn   1/1     Running   0          2m52s
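
Multus also drops a CNI configuration file on each node. As an additional check you can confirm it is present (the exact file name may differ depending on the Multus version):

ls /etc/cni/net.d/
# Typically a file such as 00-multus.conf appears alongside the primary CNI configuration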


Installing the SR-IOV Network Device Plugin

The SR-IOV Network Device Plugin discovers and advertises networking resources of SR-IOV VFs and PFs available on a Kubernetes host.

To install the SR-IOV Network Device Plugin, first create a Kubernetes manifest for its ConfigMap resource.

MLX0IFNAME=ens2f0np0 # CHANGE HERE
cat << EOF | tee ./sriovdp-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |-
    {
      "resourceList": [
        {
          "resourcePrefix": "nvidia.com",
          "resourceName": "vfpool",
          "selectors": {
            "isRdma": true,
            "vendors": ["15b3"],
            "pfNames": ["${MLX0IFNAME}#0-7"]
          }
        }
      ]
    }
EOF

Create the ConfigMap resource.

kubectl apply -f ./sriovdp-configmap.yaml

Create the SR-IOV Network Device Plugin resource.

kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg\
/sriov-network-device-plugin/v3.5.1/deployments\
/k8s-v1.16/sriovdp-daemonset.yaml

Check the number of available VFs (nvidia.com/vfpool) in the node.

kubectl describe nodes <host name> | grep "Capacity" -A 9

Output:

Capacity:
  cpu:                48
  ephemeral-storage:  1844295220Ki
  hugepages-1Gi:      16Gi
  memory:             515509Mi
  nvidia.com/gpu:     2
  nvidia.com/vfpool:  8
  pods:               110
Allocatable:
  cpu:                48
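
The same information can be queried directly with a JSONPath expression, which is convenient for scripting (the dots in the resource name must be escaped):

kubectl get node <host name> -o jsonpath='{.status.allocatable.nvidia\.com/vfpool}'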


Installing the SR-IOV CNI

The SR-IOV CNI works with the SR-IOV Network Device Plugin for VF allocation in Kubernetes.

To install the SR-IOV CNI, first deploy the SR-IOV CNI DaemonSet.

kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg\
/sriov-cni/v2.7.0/images/k8s-v1.16/sriov-cni-daemonset.yaml

Verify the status of the SR-IOV CNI.

kubectl get po -n kube-system -o wide
NAME                            READY   STATUS    RESTARTS   AGE    IP               NODE          NOMINATED NODE   READINESS GATES
(snip)
kube-sriov-cni-ds-amd64-7rs7t   1/1     Running   0          4m9s   192.168.10.236   tme-r750-03   <none>           <none>   <--- This one

Create a Kubernetes manifest for a custom resource, a NetworkAttachmentDefinition, for the secondary networking using Multus CNI.

cat << EOF | tee ./sriov-nad.yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-vf
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/vfpool
spec:
  config: '{
  "cniVersion": "0.3.1",
  "name": "sriov-vf",
  "type": "sriov"
}'
EOF

Create the custom resource.

kubectl apply -f sriov-nad.yaml

Check if the NetworkAttachmentDefinition resource was created.

kubectl get network-attachment-definition
NAME       AGE
sriov-vf   5d3h


An Example of a Kubernetes Manifest for cuBB with SR-IOV VF

cat << EOF | tee ./cubb-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cubb-22-4
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-vf
spec:
  nodeName: <node name>
  imagePullSecrets:
  - name: ngc-secret # Need to create a Secret resource for NGC if pulling the container image from NGC: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
  containers:
  - name: cubb-ctr
    image: nvcr.io/ea-aerial-sdk/aerial:22-4-cubb
    imagePullPolicy: Always
    command: ["/bin/sh","-c","sleep infinity"]
    securityContext:
      privileged: true
    workingDir: /opt/nvidia/cuBB
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    - mountPath: /dev/shm
      name: dshm
    - mountPath: /usr/src
      name: nvidia-driver
    - mountPath: /lib/modules
      name: lib-modules
    resources:
      limits:
        hugepages-1Gi: 2Gi
        memory: 16Gi
        nvidia.com/gpu: 1
        nvidia.com/vfpool: 1
      requests:
        hugepages-1Gi: 2Gi
        memory: 16Gi
        nvidia.com/gpu: 1
        nvidia.com/vfpool: 1
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: dshm # Unable to configure shm size in Kubernetes, need to use this WAR: https://github.com/kubernetes/kubernetes/issues/28272#issuecomment-540943623
    emptyDir: { medium: 'Memory', sizeLimit: '4Gi' }
  - name: nvidia-driver
    hostPath:
      path: /run/nvidia/driver/usr/src
  - name: lib-modules
    hostPath:
      path: /lib/modules
  restartPolicy: Never
EOF
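
If the container image is pulled from NGC, the ngc-secret referenced in imagePullSecrets can be created with a standard docker-registry Secret for nvcr.io, for example as shown below; <NGC API key> is a placeholder for your own key.

kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<NGC API key>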

Create a cuBB Pod.

kubectl apply -f cubb-pod.yaml

Check the status of all Pods.

kubectl get pods --all-namespaces -o wide
NAMESPACE      NAME                                                               READY   STATUS      RESTARTS        AGE     IP               NODE          NOMINATED NODE   READINESS GATES
default        cubb-22-4                                                          1/1     Running     0               2m47s   192.168.10.246   tme-r750-03   <none>           <none>
gpu-operator   gpu-feature-discovery-fpl85                                        1/1     Running     0               6m39s   192.168.10.233   tme-r750-03   <none>           <none>
gpu-operator   gpu-operator-1669877423-node-feature-discovery-master-57964grsj   1/1     Running     0               22h     192.168.144.7    tme-r630-02   <none>           <none>
gpu-operator   gpu-operator-1669877423-node-feature-discovery-worker-9kg8m       1/1     Running     1 (7m58s ago)   21m     192.168.10.228   tme-r750-03   <none>           <none>
gpu-operator   gpu-operator-5dc6b8989b-6lz89                                      1/1     Running     1 (7m58s ago)   13m     192.168.10.224   tme-r750-03   <none>           <none>
gpu-operator   nvidia-container-toolkit-daemonset-xdcwx                           1/1     Running     0               6m39s   192.168.10.230   tme-r750-03   <none>           <none>
gpu-operator   nvidia-cuda-validator-2k52t                                        0/1     Completed   0               4m13s   192.168.10.243   tme-r750-03   <none>           <none>
gpu-operator   nvidia-dcgm-exporter-tvx6j                                         1/1     Running     0               6m39s   192.168.10.225   tme-r750-03   <none>           <none>
gpu-operator   nvidia-device-plugin-daemonset-lkn84                               1/1     Running     0               6m39s   192.168.10.232   tme-r750-03   <none>           <none>
gpu-operator   nvidia-device-plugin-validator-99d7x                               0/1     Completed   0               4m2s    192.168.10.245   tme-r750-03   <none>           <none>
gpu-operator   nvidia-driver-daemonset-bj9rx                                      2/2     Running     3 (2m28s ago)   20m     192.168.10.231   tme-r750-03   <none>           <none>
gpu-operator   nvidia-mig-manager-ntnrn                                           1/1     Running     0               6m39s   192.168.10.226   tme-r750-03   <none>           <none>
gpu-operator   nvidia-operator-validator-2bn29                                    1/1     Running     0               6m39s   192.168.10.242   tme-r750-03   <none>           <none>
kube-system    calico-kube-controllers-58dbc876ff-5lhnp                           1/1     Running     0               22h     192.168.144.4    tme-r630-02   <none>           <none>
kube-system    calico-node-d7sf5                                                  1/1     Running     0               23h     10.136.139.228   tme-r630-02   <none>           <none>
kube-system    calico-node-zkjbv                                                  1/1     Running     3 (7m58s ago)   23h     10.136.139.154   tme-r750-03   <none>           <none>
kube-system    coredns-565d847f94-9h9pn                                           1/1     Running     0               23h     192.168.144.2    tme-r630-02   <none>           <none>
kube-system    coredns-565d847f94-nfwzf                                           1/1     Running     0               23h     192.168.144.1    tme-r630-02   <none>           <none>
kube-system    etcd-tme-r630-02                                                   1/1     Running     0               23h     10.136.139.228   tme-r630-02   <none>           <none>
kube-system    kube-apiserver-tme-r630-02                                         1/1     Running     0               23h     10.136.139.228   tme-r630-02   <none>           <none>
kube-system    kube-controller-manager-tme-r630-02                                1/1     Running     0               23h     10.136.139.228   tme-r630-02   <none>           <none>
kube-system    kube-multus-ds-amd64-25922                                         1/1     Running     1 (7m58s ago)   159m    10.136.139.154   tme-r750-03   <none>           <none>
kube-system    kube-multus-ds-amd64-pqfvk                                         1/1     Running     0               159m    10.136.139.228   tme-r630-02   <none>           <none>
kube-system    kube-proxy-2cfnc                                                   1/1     Running     0               23h     10.136.139.228   tme-r630-02   <none>           <none>
kube-system    kube-proxy-7jbgw                                                   1/1     Running     3 (7m58s ago)   23h     10.136.139.154   tme-r750-03   <none>           <none>
kube-system    kube-scheduler-tme-r630-02                                         1/1     Running     0               23h     10.136.139.228   tme-r630-02   <none>           <none>
kube-system    kube-sriov-cni-ds-amd64-ntvj7                                      1/1     Running     1 (7m58s ago)   4h52m   192.168.10.229   tme-r750-03   <none>           <none>
kube-system    kube-sriov-device-plugin-amd64-wpxx2                               1/1     Running     1 (7m58s ago)   4h52m   10.136.139.154   tme-r750-03   <none>           <none>

Get a shell to the cuBB Pod.

kubectl exec -it cubb-22-4 -- bash

Check the attached VF in the Pod. net1 is attached as the secondary network interface, allocated from nvidia.com/vfpool.

(cubb-22-4 #) ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1480
        inet 192.168.10.246  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::2090:fdff:fe3c:e547  prefixlen 64  scopeid 0x20<link>
        ether 22:90:fd:3c:e5:47  txqueuelen 0  (Ethernet)
        RX packets 13  bytes 1912 (1.9 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14  bytes 1076 (1.0 KB)
        TX errors 0  dropped 1 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

net1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1514
        inet6 fe80::f823:e9ff:fe34:143f  prefixlen 64  scopeid 0x20<link>
        ether fa:23:e9:34:14:3f  txqueuelen 1000  (Ethernet)
        RX packets 2  bytes 324 (324.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 38  bytes 3834 (3.8 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

(cubb-22-4 #) ibdev2netdev -v
0000:19:00.3 mlx5_4 (MT4126 - NA)  fw 24.35.1012 port 1 (ACTIVE) ==> net1 (Up)

In summary, the following VF network interface is available in the Pod in this example (the same values can also be read from sysfs, as shown after the list):

  • MAC address of the assigned VF: fa:23:e9:34:14:3f

  • PCIe address of the assigned VF: 0000:19:00.3
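
These two values can also be read directly from sysfs inside the Pod, which is handy when filling in the configuration files in the next subsection; the commands below should print the MAC address and the PCIe address listed above.

(cubb-22-4 #) cat /sys/class/net/net1/address
(cubb-22-4 #) basename $(readlink /sys/class/net/net1/device)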

Configurations of cuBB

The required changes for SR-IOV are the PCI address of the assigned VF in the cuphycontroller yaml file and the MAC address of the assigned VF in the config yaml file for the RU emulator. Here is an example of the cuphycontroller yaml file.

cuphydriver_config:
  (snip)
  nics:
    - nic: 0000:19:00.3
  cells:
    - name: O-RU 0
      nic: 0000:19:00.3
    - name: O-RU 1

Here is an example of the ru-emulator yaml file.

ru_emulator:
  (snip)
  peers:
    - peerethaddr: fa:23:e9:34:14:3f

The remaining steps to run cuBB End-to-End are the same as the usual cuBB End-to-End sequence.
