cuBB on NVIDIA Cloud Native Stack
NVIDIA Cloud Native Stack (formerly known as Cloud Native Core) is a collection of software to run cloud native workloads on NVIDIA GPUs. This section describes how to install and run the cuBB SDK software examples on NVIDIA Cloud Native Stack, together with the related components required by the cuBB SDK.
The steps to install NVIDIA Cloud Native Stack follow the installation guide on GitHub and are not repeated in this document.
The contents of this section have been verified with NVIDIA Cloud Native Stack v8.0, but the OS version is kept to Ubuntu 20.04 LTS with the 5.4.0.65-lowlatency kernel.
This section describes how to enable SR-IOV for Mellanox NICs and converged cards.
Some servers need to change BIOS settings to enable SR-IOV. For Aerial DevKit, “SR-IOV Support” in the BIOS menu is enabled by default. For Dell R750, “SR-IOV Global Enable” should be enabled.
GPU Operator with Host MOFED Driver and RDMA (without Network Operator)
In this subsection, we assume that Network Operator is not installed and that MOFED is installed on the host. Instead of using Network Operator, some steps need to be done manually: configuring SR-IOV for the Mellanox NIC and converged card, and installing network plugins for the cloud native stack (see the table below for details). This subsection describes the configuration, the installations, and an example Kubernetes manifest for cuBB.
Kubernetes Network Plugin | Tested Version
---|---
Multus CNI | 3.7.1
SR-IOV Network Device Plugin | 3.5.1
SR-IOV CNI | 2.7.0
Enabling SR-IOV
Configure one-time FW settings to enable SR-IOV for Mellanox NIC and BF2 cards.
# Define variables of interface names and PCI address
export MLX0IFNAME=ens2f0np0 # CHANGE HERE
export MLX1IFNAME=ens2f1np1 # CHANGE HERE
export MLX0PCIEADDR=$(ethtool -i ${MLX0IFNAME} | grep bus-info | awk '{print $2}')
export MLX1PCIEADDR=$(ethtool -i ${MLX1IFNAME} | grep bus-info | awk '{print $2}')
# Enable SR-IOV at the FW level and set the number of VFs to 8
sudo -E mlxconfig -d $MLX0PCIEADDR --yes set SRIOV_EN=1
sudo -E mlxconfig -d $MLX1PCIEADDR --yes set SRIOV_EN=1
sudo -E mlxconfig -d $MLX0PCIEADDR --yes set NUM_OF_VFS=8
sudo -E mlxconfig -d $MLX1PCIEADDR --yes set NUM_OF_VFS=8
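As an optional check, the FW settings can be queried with mlxconfig; note that the new values take effect only after the next FW reset or server reboot.
# Verify the FW-level SR-IOV settings on both ports
sudo -E mlxconfig -d $MLX0PCIEADDR query | grep -E "SRIOV_EN|NUM_OF_VFS"
sudo -E mlxconfig -d $MLX1PCIEADDR query | grep -E "SRIOV_EN|NUM_OF_VFS"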
Create a configuration file for the following SR-IOV configuration script. Each line contains the PCI address of a physical port followed by the number of virtual functions (VFs) to create on it, which is 8 here. Replace the PCI addresses with the values of MLX0PCIEADDR and MLX1PCIEADDR obtained above.
cat << EOF | sudo tee /etc/sriov.conf
0000:19:00.0 8
0000:19:00.1 8
EOF
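For reference, the PCI addresses of the two physical ports defined earlier, and the resulting file, can be printed with:
echo "${MLX0PCIEADDR} ${MLX1PCIEADDR}"
cat /etc/sriov.conf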
The SR-IOV configuration is reset when the machine reboots, so create a systemd service that re-applies it after every reboot.
Create the start-up script that enables SR-IOV and brings up all virtual functions (VFs).
cat << EOF | sudo tee /usr/local/bin/configure-sriov.sh
#!/bin/bash
set -eux
input="/etc/sriov.conf"
UDEV_RULE_FILE='/etc/udev/rules.d/10-persistent-net.rules'

append_to_file(){
    content="\$1"
    file_name="\$2"
    if ! test -f "\$file_name"
    then
        echo "\$content" > "\$file_name"
    else
        if ! grep -Fxq "\$content" "\$file_name"
        then
            echo "\$content" >> "\$file_name"
        fi
    fi
}

add_udev_rule_for_sriov_pf(){
    pf_pci=\$(grep PCI_SLOT_NAME /sys/class/net/\$1/device/uevent | cut -d'=' -f2)
    udev_data_line="SUBSYSTEM==\"net\", ACTION==\"add\", DRIVERS==\"?*\", KERNELS==\"\$pf_pci\", NAME=\"\$1\""
    append_to_file "\$udev_data_line" "\$UDEV_RULE_FILE"
}

names=()
while read pci_addr num_vfs
do
    # Increase the PCIe Max Read Request Size to 4kB
    setpci -s \${pci_addr} 68.w=5000:f000
    setpci -s \${pci_addr} 68.w=5000:f000
    echo "Set \$num_vfs VFs on device \$pci_addr"
    name=\$(ls /sys/bus/pci/devices/\${pci_addr}/net/)
    names+=(\$name)
    # Create udev rule to save PF name
    add_udev_rule_for_sriov_pf \$name
    # Configure ALL VFs to be trusted by the FW. Requires MFT v4.19+ (https://docs.nvidia.com/doca/sdk/virtual-functions/index.html#prerequisites)
    mlxreg -d \${pci_addr} --reg_id 0xc007 --reg_len 0x40 --indexes "0x0.0:32=0x80000000" --yes --set "0x4.0:32=0x1"
    # Create the VFs
    echo \$num_vfs > /sys/bus/pci/devices/\${pci_addr}/sriov_numvfs
done <"\$input"

# Wait for the VFs to be ready
sleep 5

i=0
while read pci_addr num_vfs
do
    # Unbind the VF driver
    vf_dirs=\$(ls /sys/bus/pci/devices/\${pci_addr} | grep virtfn)
    for vf_dir in \$vf_dirs
    do
        vf_pci_addr=\$(basename "\$( readlink -f /sys/bus/pci/devices/\${pci_addr}/\$vf_dir )")
        echo \$vf_pci_addr > /sys/bus/pci/drivers/mlx5_core/unbind || true
    done
    ip link set \${names[i]} up
    i=\$(( i+1 ))
    # Rebind the VF driver and bring each VF interface up
    for vf_dir in \$vf_dirs
    do
        vf_pci_addr=\$(basename "\$( readlink -f /sys/bus/pci/devices/\${pci_addr}/\$vf_dir )")
        echo \$vf_pci_addr > /sys/bus/pci/drivers_probe
        vf_if_name=\$(lshw -c network -businfo | grep \$vf_pci_addr | awk '{print \$2}')
        ip link set \$vf_if_name up
    done
done <"\$input"
EOF
Make the script executable.
sudo chmod +x /usr/local/bin/configure-sriov.sh
Create the systemd service to run the script.
cat << EOF | sudo tee /etc/systemd/system/sriov-configuration.service
[Unit]
Description=Configures SRIOV NIC
Wants=network-pre.target
Before=network-pre.target
[Service]
Type=oneshot
ExecStart=/usr/local/bin/configure-sriov.sh
StandardOutput=journal+console
StandardError=journal+console
[Install]
WantedBy=network-online.target
EOF
Reload systemd, then enable and start the service so that it runs at every boot.
sudo systemctl daemon-reload
sudo systemctl enable sriov-configuration
sudo systemctl start sriov-configuration
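To confirm that the service ran successfully and that the VFs were created, you can check the service status and the VF count of the first port (using the interface variable defined earlier), for example:
systemctl status sriov-configuration --no-pager
cat /sys/class/net/${MLX0IFNAME}/device/sriov_numvfs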
Verify that SR-IOV is enabled and that the VFs appear as PCI devices.
lspci -tvvv
(snip)
| \-02.0-[17-1e]----00.0-[18-1e]--+-00.0-[19-1a]--+-00.0 Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
| | +-00.1 Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
| | +-00.2 Mellanox Technologies MT42822 BlueField-2 SoC Management Interface
| | +-00.3 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| | +-00.4 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| | +-00.5 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| | +-00.6 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| | +-00.7 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| | +-01.0 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| | +-01.1 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| | +-01.2 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| | +-01.3 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| | +-01.4 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| | +-01.5 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| | +-01.6 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| | +-01.7 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| | +-02.0 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| | +-02.1 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| | \-02.2 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| \-01.0-[1b-1e]----00.0-[1c-1e]----08.0-[1d-1e]----00.0 NVIDIA Corporation Device 20b8
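The VFs can also be listed per physical port with ip link; each VF should appear as a "vf" entry in the output of the command below (using the interface name set earlier).
ip link show ${MLX0IFNAME}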
Installing Multus CNI
An SR-IOV VF interface will be attached to a Pod as a secondary network interface. This is enabled by Multus CNI.
To install Multus CNI, create the Multus DaemonSet.
kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg\
/multus-cni/v3.7.1/images/multus-daemonset.yml
Validate the status of Multus Pods.
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
(snip)
kube-system kube-multus-ds-77zjm 1/1 Running 0 2m52s
kube-system kube-multus-ds-b69wn 1/1 Running 0 2m52s
Installing the SR-IOV Network Device Plugin
The SR-IOV Network Device Plugin discovers and advertises networking resources of SR-IOV VFs and PFs available on a Kubernetes host.
To install the SR-IOV Network Device Plugin, first create a Kubernetes manifest for its ConfigMap resource.
MLX0IFNAME=ens2f0np0 # CHANGE HERE
cat << EOF | tee ./sriovdp-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |-
    {
      "resourceList": [
        {
          "resourcePrefix": "nvidia.com",
          "resourceName": "vfpool",
          "selectors": {
            "isRdma": true,
            "vendors": ["15b3"],
            "pfNames": ["${MLX0IFNAME}#0-7"]
          }
        }
      ]
    }
EOF
Create the ConfigMap resource.
kubectl apply -f ./sriovdp-configmap.yaml
Create the SR-IOV Network Device Plugin resource.
kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg\
/sriov-network-device-plugin/v3.5.1/deployments\
/k8s-v1.16/sriovdp-daemonset.yaml
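If nvidia.com/vfpool does not appear in the next step, the device plugin logs are a good place to check; for example (the Pod name will differ in your cluster):
kubectl -n kube-system get pods | grep sriov-device-plugin
kubectl -n kube-system logs <sriov-device-plugin Pod name>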
Check the number of available VFs (nvidia.com/vfpool) in the node.
kubectl describe nodes <host name> | grep "Capacity" -A 9
Output:
Capacity:
  cpu:                48
  ephemeral-storage:  1844295220Ki
  hugepages-1Gi:      16Gi
  memory:             515509Mi
  nvidia.com/gpu:     2
  nvidia.com/vfpool:  8
  pods:               110
Allocatable:
  cpu:                48
Installing the SR-IOV CNI
The SR-IOV CNI works with the SR-IOV Network Device Plugin for VF allocation in Kubernetes.
To install the SR-IOV CNI, deploy its DaemonSet.
kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg\
/sriov-cni/v2.7.0/images/k8s-v1.16/sriov-cni-daemonset.yaml
Verify the status of the SR-IOV CNI Pod.
kubectl get po -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
(snip)
kube-sriov-cni-ds-amd64-7rs7t 1/1 Running 0 4m9s 192.168.10.236 tme-r750-03 <none> <none> <--- This one
Create a Kubernetes manifest of a custom resource, a NetworkAttachmentDefinition, for the secondary network using Multus CNI.
cat << EOF | tee ./sriov-nad.yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-vf
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/vfpool
spec:
  config: '{
      "cniVersion": "0.3.1",
      "name": "sriov-vf",
      "type": "sriov"
    }'
EOF
Create the custom resource.
kubectl apply -f sriov-nad.yaml
Check if the NetworkAttachmentDefinition resource was created.
kubectl get network-attachment-definition
NAME AGE
sriov-vf 5d3h
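The applied configuration, including the resourceName annotation consumed by the SR-IOV Network Device Plugin, can be inspected with:
kubectl get network-attachment-definitions sriov-vf -o yaml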
An Example of a Kubernetes Manifest for cuBB with an SR-IOV VF
cat << EOF | tee ./cubb-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cubb-22-4
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-vf
spec:
  nodeName: <node name>
  imagePullSecrets:
  - name: ngc-secret # Need to create a Secret resource for NGC if pulling the container image from NGC: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
  containers:
  - name: cubb-ctr
    image: nvcr.io/ea-aerial-sdk/aerial:22-4-cubb
    imagePullPolicy: Always
    command: ["/bin/sh","-c","sleep infinity"]
    securityContext:
      privileged: true
    workingDir: /opt/nvidia/cuBB
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    - mountPath: /dev/shm
      name: dshm
    - mountPath: /usr/src
      name: nvidia-driver
    - mountPath: /lib/modules
      name: lib-modules
    resources:
      limits:
        hugepages-1Gi: 2Gi
        memory: 16Gi
        nvidia.com/gpu: 1
        nvidia.com/vfpool: 1
      requests:
        hugepages-1Gi: 2Gi
        memory: 16Gi
        nvidia.com/gpu: 1
        nvidia.com/vfpool: 1
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: dshm
    # Unable to configure shm size in Kubernetes, need to use this WAR: https://github.com/kubernetes/kubernetes/issues/28272#issuecomment-540943623
    emptyDir: {
      medium: 'Memory',
      sizeLimit: '4Gi'
    }
  - name: nvidia-driver
    hostPath:
      path: /run/nvidia/driver/usr/src
  - name: lib-modules
    hostPath:
      path: /lib/modules
  restartPolicy: Never
EOF
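Before applying the manifest, replace <node name> with the name of the worker node that hosts the GPU and the SR-IOV VFs; the node names can be listed with:
kubectl get nodes -o wide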
Create a cuBB Pod.
kubectl apply -f cubb-pod.yaml
Check the status of all Pods.
kubectl get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default cubb-22-4 1/1 Running 0 2m47s 192.168.10.246 tme-r750-03 <none> <none>
gpu-operator gpu-feature-discovery-fpl85 1/1 Running 0 6m39s 192.168.10.233 tme-r750-03 <none> <none>
gpu-operator gpu-operator-1669877423-node-feature-discovery-master-57964grsj 1/1 Running 0 22h 192.168.144.7 tme-r630-02 <none> <none>
gpu-operator gpu-operator-1669877423-node-feature-discovery-worker-9kg8m 1/1 Running 1 (7m58s ago) 21m 192.168.10.228 tme-r750-03 <none> <none>
gpu-operator gpu-operator-5dc6b8989b-6lz89 1/1 Running 1 (7m58s ago) 13m 192.168.10.224 tme-r750-03 <none> <none>
gpu-operator nvidia-container-toolkit-daemonset-xdcwx 1/1 Running 0 6m39s 192.168.10.230 tme-r750-03 <none> <none>
gpu-operator nvidia-cuda-validator-2k52t 0/1 Completed 0 4m13s 192.168.10.243 tme-r750-03 <none> <none>
gpu-operator nvidia-dcgm-exporter-tvx6j 1/1 Running 0 6m39s 192.168.10.225 tme-r750-03 <none> <none>
gpu-operator nvidia-device-plugin-daemonset-lkn84 1/1 Running 0 6m39s 192.168.10.232 tme-r750-03 <none> <none>
gpu-operator nvidia-device-plugin-validator-99d7x 0/1 Completed 0 4m2s 192.168.10.245 tme-r750-03 <none> <none>
gpu-operator nvidia-driver-daemonset-bj9rx 2/2 Running 3 (2m28s ago) 20m 192.168.10.231 tme-r750-03 <none> <none>
gpu-operator nvidia-mig-manager-ntnrn 1/1 Running 0 6m39s 192.168.10.226 tme-r750-03 <none> <none>
gpu-operator nvidia-operator-validator-2bn29 1/1 Running 0 6m39s 192.168.10.242 tme-r750-03 <none> <none>
kube-system calico-kube-controllers-58dbc876ff-5lhnp 1/1 Running 0 22h 192.168.144.4 tme-r630-02 <none> <none>
kube-system calico-node-d7sf5 1/1 Running 0 23h 10.136.139.228 tme-r630-02 <none> <none>
kube-system calico-node-zkjbv 1/1 Running 3 (7m58s ago) 23h 10.136.139.154 tme-r750-03 <none> <none>
kube-system coredns-565d847f94-9h9pn 1/1 Running 0 23h 192.168.144.2 tme-r630-02 <none> <none>
kube-system coredns-565d847f94-nfwzf 1/1 Running 0 23h 192.168.144.1 tme-r630-02 <none> <none>
kube-system etcd-tme-r630-02 1/1 Running 0 23h 10.136.139.228 tme-r630-02 <none> <none>
kube-system kube-apiserver-tme-r630-02 1/1 Running 0 23h 10.136.139.228 tme-r630-02 <none> <none>
kube-system kube-controller-manager-tme-r630-02 1/1 Running 0 23h 10.136.139.228 tme-r630-02 <none> <none>
kube-system kube-multus-ds-amd64-25922 1/1 Running 1 (7m58s ago) 159m 10.136.139.154 tme-r750-03 <none> <none>
kube-system kube-multus-ds-amd64-pqfvk 1/1 Running 0 159m 10.136.139.228 tme-r630-02 <none> <none>
kube-system kube-proxy-2cfnc 1/1 Running 0 23h 10.136.139.228 tme-r630-02 <none> <none>
kube-system kube-proxy-7jbgw 1/1 Running 3 (7m58s ago) 23h 10.136.139.154 tme-r750-03 <none> <none>
kube-system kube-scheduler-tme-r630-02 1/1 Running 0 23h 10.136.139.228 tme-r630-02 <none> <none>
kube-system kube-sriov-cni-ds-amd64-ntvj7 1/1 Running 1 (7m58s ago) 4h52m 192.168.10.229 tme-r750-03 <none> <none>
kube-system kube-sriov-device-plugin-amd64-wpxx2 1/1 Running 1 (7m58s ago) 4h52m 10.136.139.154 tme-r750-03 <none> <none>
Get a shell to the cuBB Pod.
kubectl exec -it cubb-22-4 -- bash
Check the attached VF in the Pod. net1 will be attached as the second network interface from nvidia.com/vfpool.
(cubb-22-4 #) ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1480
        inet 192.168.10.246  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::2090:fdff:fe3c:e547  prefixlen 64  scopeid 0x20<link>
        ether 22:90:fd:3c:e5:47  txqueuelen 0  (Ethernet)
        RX packets 13  bytes 1912 (1.9 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14  bytes 1076 (1.0 KB)
        TX errors 0  dropped 1  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

net1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1514
        inet6 fe80::f823:e9ff:fe34:143f  prefixlen 64  scopeid 0x20<link>
        ether fa:23:e9:34:14:3f  txqueuelen 1000  (Ethernet)
        RX packets 2  bytes 324 (324.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 38  bytes 3834 (3.8 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
(cubb-22-4 #) ibdev2netdev -v
0000:19:00.3 mlx5_4 (MT4126 - NA) fw 24.35.1012 port 1 (ACTIVE) ==> net1 (Up)
In summary, the following VF network interface is available in the Pod in this example:
MAC address of the assigned VF: fa:23:e9:34:14:3f
PCIe address of the assigned VF: 0000:19:00.3
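These values can also be read back without entering the Pod; for example (assuming the VF is attached as net1 and that Multus adds its usual network-status annotation):
# MAC address of the VF attached as net1
kubectl exec cubb-22-4 -- cat /sys/class/net/net1/address
# Network status annotation added by Multus (may include the VF PCI address, depending on the CNI versions)
kubectl get pod cubb-22-4 -o jsonpath="{.metadata.annotations['k8s\.v1\.cni\.cncf\.io/network-status']}"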
cuBB Configuration
For SR-IOV, the required changes are the PCI address of the assigned VF in the cuphycontroller yaml file and the MAC address of the assigned VF in the RU-emulator config yaml file. Here is an example of the cuphycontroller yaml file.
cuphydriver_config:
  (snip)
  nics:
  - nic: 0000:19:00.3
  cells:
  - name: O-RU 0
    nic: 0000:19:00.3
  - name: O-RU 1
Here is an example of the ru-emulator yaml file.
ru_emulator:
  (snip)
  peers:
  - peerethaddr: fa:23:e9:34:14:3f
The remaining steps to run the cuBB end-to-end test are the same as the usual cuBB end-to-end sequence.