Installing and Upgrading cuBB SDK
If you are using the AX800 converged accelerator, follow Installing Tools on Dell R750, then go directly to Installing New cuBB Container.
This page describes how to install or upgrade the cuBB SDK and its dependencies on the host system for each release: the CUDA driver, MOFED, NIC firmware, and the nvidia-peermem driver. You must update each dependent software component to the specific version listed in the Release Manifest.
To prevent dependency errors, you should perform all steps in this section during the initial installation and each time you upgrade to a new Aerial release.
Aerial SDK requires the default nvidia-peermem kernel module included in the CUDA driver, and that module does not work with the MOFED driver container. If the host system has MOFED and nv-peer-mem driver containers from a previous installation running, stop and remove them first.
$ sudo docker stop OFED
$ sudo docker rm OFED
$ sudo docker stop nv_peer_mem
$ sudo docker rm nv_peer_mem
Check if there is an existing MOFED installed on the host system.
$ ofed_info -s
If the MOFED version is older than the version specified in the release manifest, use the ofed_uninstall.sh script to uninstall it.
$ sudo /usr/sbin/ofed_uninstall.sh
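The version check above can be scripted. The following is a minimal sketch rather than part of the official procedure: the helper names mofed_version and version_lt and the REQUIRED variable are hypothetical, and REQUIRED should be set to the MOFED version from the Release Manifest. It assumes GNU sort with the -V (version sort) option, as shipped with Ubuntu.

```shell
# Extract the version from `ofed_info -s` output, e.g. "MLNX_OFED_LINUX-23.07-0.5.0.0:"
mofed_version() {
    printf '%s\n' "$1" | sed -e 's/^MLNX_OFED_LINUX-//' -e 's/:$//'
}

# Return 0 when $1 sorts strictly before $2 under version ordering (GNU sort -V)
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

REQUIRED=23.07-0.5.0.0   # assumption: replace with the Release Manifest value

if command -v ofed_info >/dev/null 2>&1; then
    INSTALLED=$(mofed_version "$(ofed_info -s)")
    if version_lt "$INSTALLED" "$REQUIRED"; then
        echo "MOFED $INSTALLED is older than $REQUIRED; uninstall it before upgrading"
    else
        echo "MOFED $INSTALLED is not older than $REQUIRED"
    fi
else
    echo "No MOFED installation found on this host"
fi
```

If the script reports an older version, run the ofed_uninstall.sh command shown above before installing the new MOFED.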
Execute the following commands to install MOFED on the host.
# Install MOFED
$ export OFED_VERSION=23.07-0.5.0.0
$ export UBUNTU_VERSION=22.04
$ wget http://www.mellanox.com/downloads/ofed/MLNX_OFED-$OFED_VERSION/MLNX_OFED_LINUX-$OFED_VERSION-ubuntu$UBUNTU_VERSION-x86_64.tgz
$ tar xvf MLNX_OFED_LINUX-$OFED_VERSION-ubuntu$UBUNTU_VERSION-x86_64.tgz
$ cd MLNX_OFED_LINUX-$OFED_VERSION-ubuntu$UBUNTU_VERSION-x86_64
$ sudo ./mlnxofedinstall --dpdk --without-mft --with-rshim --add-kernel-support --force --without-ucx-cuda --without-fw-update
$ sudo rmmod nv_peer_mem nvidia_peermem
$ sudo /etc/init.d/openibd restart
# Verify the installed MOFED version
$ ofed_info -s
MLNX_OFED_LINUX-23.07-0.5.0.0:
# Install Mellanox Firmware Tools
$ export MFT_VERSION=4.25.0-62
$ wget https://www.mellanox.com/downloads/MFT/mft-$MFT_VERSION-x86_64-deb.tgz
$ tar xvf mft-$MFT_VERSION-x86_64-deb.tgz
$ cd mft-$MFT_VERSION-x86_64-deb
$ sudo ./install.sh
# Verify the installed Mellanox Firmware Tools version
$ sudo mst version
mst, mft 4.25.0-62, built on Aug 03 2023, 12:15:13. Git SHA Hash: c14a8d9
$ sudo mst start
# check NIC PCIe bus addresses and network interface names
$ sudo mst status -v
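Later commands on this page refer to the NIC through the MLX0PCIEADDR environment variable. The following sketch shows one possible way to set it. It assumes that the first Mellanox device reported by lspci is the NIC you want to configure, which you should confirm against the mst status -v output. The helper name first_pci_addr is hypothetical.

```shell
# Print the PCI address (first column) of the first line of lspci-style input
first_pci_addr() {
    awk 'NR == 1 { print $1 }'
}

if command -v lspci >/dev/null 2>&1; then
    # 15b3 is the Mellanox PCI vendor ID; -d filters devices by vendor
    MLX0PCIEADDR=$(lspci -d 15b3: | first_pci_addr)
    export MLX0PCIEADDR
    echo "MLX0PCIEADDR=$MLX0PCIEADDR"
fi
```

On a host with several Mellanox devices, set MLX0PCIEADDR by hand to the address of the intended port instead of relying on the first match.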
This section describes how to update the Mellanox NIC firmware.
To download the NIC firmware, refer to the Mellanox firmware download page. For example, to update the CX6-DX NIC firmware, download it from the ConnectX-6 Dx Ethernet Firmware Download Center page.
In the download menu, there are multiple versions of the firmware specific to the NIC hardware, as identified by its OPN and PSID. To look up the OPN and PSID, use this command:
$ sudo mlxfwmanager -d $MLX0PCIEADDR
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX6DX
Part Number: MCX623106AE-CDA_Ax
Description: ConnectX-6 Dx EN adapter card; 100GbE; Dual-port QSFP56; PCIe 4.0 x16; Crypto; No Secure Boot
PSID: MT_0000000528
PCI Device Name: b5:00.0
Base GUID: b8cef6030033fdee
Base MAC: b8cef633fdee
Versions: Current Available
FW 22.34.1002 N/A
PXE 3.6.0700 N/A
UEFI 14.27.0014 N/A
The OPN and PSID are the “Part Number” and the “PSID” shown in the output. For example, depending on the hardware, the CX6-DX OPN and PSID could be:
OPN = MCX623106AC-CDA_Ax, PSID = MT_0000000436
OPN = MCX623106AE-CDA_Ax, PSID = MT_0000000528
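To avoid reading the OPN and PSID off the screen by hand, the two fields can be extracted from the query output. This is a sketch only: the helper name fw_field is hypothetical, and it assumes the "Key: value" layout shown in the sample output above and that MLX0PCIEADDR is already set.

```shell
# Print the value of a "Key: value" field from mlxfwmanager-style text on stdin
fw_field() {
    awk -F': +' -v key="$1" '
        { sub(/^[ \t]+/, "", $1) }   # drop indentation before the field name
        $1 == key { print $2; exit }
    '
}

if command -v mlxfwmanager >/dev/null 2>&1 && [ -n "${MLX0PCIEADDR:-}" ]; then
    QUERY=$(sudo mlxfwmanager -d "$MLX0PCIEADDR")
    echo "OPN:  $(printf '%s\n' "$QUERY" | fw_field 'Part Number')"
    echo "PSID: $(printf '%s\n' "$QUERY" | fw_field 'PSID')"
fi
```

Use the two printed values to pick the matching entry in the firmware download menu.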
Download the firmware bin file that matches the hardware. If the file has a .zip extension, extract it with the unzip command to get the .bin file. Here is an example for the CX6-DX NIC with OPN MCX623106AE-CDA_Ax:
$ wget https://www.mellanox.com/downloads/firmware/fw-ConnectX6Dx-rel-22_38_1002-MCX623106AE-CDA_Ax-UEFI-14.31.20-FlexBoot-3.7.201.bin.zip
$ unzip fw-ConnectX6Dx-rel-22_38_1002-MCX623106AE-CDA_Ax-UEFI-14.31.20-FlexBoot-3.7.201.bin.zip
To flash the firmware image onto the CX6-DX NIC, enter the following commands:
$ sudo flint -d $MLX0PCIEADDR --no -i fw-ConnectX6Dx-rel-22_38_1002-MCX623106AE-CDA_Ax-UEFI-14.31.20-FlexBoot-3.7.201.bin b
Current FW version on flash: 22.34.1002
New FW version: 22.38.1002
FSMST_INITIALIZE - OK
Writing Boot image component - OK
# Reset the NIC
$ sudo mlxfwreset -d $MLX0PCIEADDR --yes --level 3 r
Perform the steps below to enable the NIC firmware features required for Aerial.
# eCPRI flow steering enable
$ sudo mlxconfig -d $MLX0PCIEADDR --yes set FLEX_PARSER_PROFILE_ENABLE=4
$ sudo mlxconfig -d $MLX0PCIEADDR --yes set PROG_PARSE_GRAPH=1
# Accurate TX scheduling enable
$ sudo mlxconfig -d $MLX0PCIEADDR --yes set REAL_TIME_CLOCK_ENABLE=1
$ sudo mlxconfig -d $MLX0PCIEADDR --yes set ACCURATE_TX_SCHEDULER=1
# Maximum level of CQE compression
$ sudo mlxconfig -d $MLX0PCIEADDR --yes set CQE_COMPRESSION=1
# Reset NIC
$ sudo mlxfwreset -d $MLX0PCIEADDR --yes --level 3 r
To verify the above NIC features are enabled:
$ sudo mlxconfig -d $MLX0PCIEADDR q | grep "CQE_COMPRESSION\|PROG_PARSE_GRAPH\|FLEX_PARSER_PROFILE_ENABLE\|REAL_TIME_CLOCK_ENABLE\|ACCURATE_TX_SCHEDULER"
FLEX_PARSER_PROFILE_ENABLE 4
PROG_PARSE_GRAPH True(1)
ACCURATE_TX_SCHEDULER True(1)
CQE_COMPRESSION AGGRESSIVE(1)
REAL_TIME_CLOCK_ENABLE True(1)
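The comparison against the expected values can also be automated. The sketch below checks each setting against the values listed in the sample output above and prints OK or FAIL per setting. The helper name setting_is is hypothetical, and the script assumes that MLX0PCIEADDR is set and that the query output keeps the name in the first column and the value in the last.

```shell
# Succeed if the mlxconfig query text on stdin reports setting $1 with value $2
setting_is() {
    awk -v name="$1" -v want="$2" '
        $1 == name && $NF == want { found = 1 }
        END { exit !found }
    '
}

if command -v mlxconfig >/dev/null 2>&1 && [ -n "${MLX0PCIEADDR:-}" ]; then
    QUERY=$(sudo mlxconfig -d "$MLX0PCIEADDR" q)
    for pair in "FLEX_PARSER_PROFILE_ENABLE=4" "PROG_PARSE_GRAPH=True(1)" \
                "ACCURATE_TX_SCHEDULER=True(1)" "CQE_COMPRESSION=AGGRESSIVE(1)" \
                "REAL_TIME_CLOCK_ENABLE=True(1)"; do
        name=${pair%%=*}   # text before the first "="
        want=${pair#*=}    # text after the first "="
        if printf '%s\n' "$QUERY" | setting_is "$name" "$want"; then
            echo "OK    $name"
        else
            echo "FAIL  $name (expected $want)"
        fi
    done
fi
```

A FAIL line usually means that the setting was not applied or that the NIC has not been reset since the change.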
Contact NVIDIA CPM to download the A100X SW from PID.
NOTE: The following instructions apply specifically to the A100X board.
# Enable MST
$ sudo mst start
$ sudo mst status
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
MST devices:
------------
/dev/mst/mt41686_pciconf0 - PCI configuration cycles access.
domain:bus:dev.fn=0000:b8:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
Chip revision is: 01
# Update NIC FW first
$ sudo flint -d /dev/mst/mt41686_pciconf0 -i fw-BlueField-2-rel-24_38_1002-699210040230_Ax-NVME-20.4.1-UEFI-21.4.10-UEFI-22.4.10-UEFI-14.31.20-FlexBoot-3.7.201.signed.bin -y b
Current FW version on flash: 24.33.1702
New FW version: 24.38.1002
FSMST_INITIALIZE - OK
Writing Boot image component - OK
Restoring signature - OK
# NOTE: A full power cycle of the host (cold boot) is required
# Update BFB image
$ sudo bfb-install -r rshim0 -b DOCA_2.2.0_BSP_4.2.0_Ubuntu_20.04-2.23-07.prod.bfb
Pushing bfb
920MiB 0:01:51 [8.22MiB/s] [ <=> ]
Collecting BlueField booting status. Press Ctrl+C to stop…
INFO[BL2]: start
INFO[BL2]: DDR POST passed
INFO[BL2]: UEFI loaded
INFO[BL31]: start
INFO[BL31]: lifecycle Secured (development)
INFO[BL31]: runtime
INFO[UEFI]: eMMC init
INFO[UEFI]: UPVS valid
INFO[UEFI]: eMMC probed
INFO[UEFI]: PMI: updates started
INFO[UEFI]: PMI: boot image update
INFO[UEFI]: PMI: updates completed, status 0
INFO[UEFI]: PCIe enum start
INFO[UEFI]: PCIe enum end
INFO[UEFI]: exit Boot Service
INFO[MISC]: Ubuntu installation started
INFO[MISC]: Installing OS image
INFO[MISC]: Installation finished
Run the following commands to switch the A100X to BF2-as-CX mode:
# Enable MST
$ sudo mst start
$ sudo mst status
# MST modules:
# ------------
# MST PCI module is not loaded
# MST PCI configuration module loaded
#
# MST devices:
# ------------
# /dev/mst/mt4125_pciconf0 - PCI configuration cycles access.
# domain:bus:dev.fn=0000:b5:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
# Chip revision is: 00
# /dev/mst/mt41686_pciconf0 - PCI configuration cycles access.
# domain:bus:dev.fn=0000:b8:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
# Chip revision is: 01
# Set the BF2 CX6 Dx ports to Ethernet mode (not InfiniBand)
$ sudo mlxconfig -d /dev/mst/mt41686_pciconf0 --yes set LINK_TYPE_P1=2
$ sudo mlxconfig -d /dev/mst/mt41686_pciconf0 --yes set LINK_TYPE_P2=2
# Setting BF2 Embedded CPU mode
$ sudo mlxconfig -d /dev/mst/mt41686_pciconf0 --yes set INTERNAL_CPU_MODEL=1
$ sudo mlxconfig -d /dev/mst/mt41686_pciconf0 --yes set INTERNAL_CPU_PAGE_SUPPLIER=EXT_HOST_PF
$ sudo mlxconfig -d /dev/mst/mt41686_pciconf0 --yes set INTERNAL_CPU_ESWITCH_MANAGER=EXT_HOST_PF
$ sudo mlxconfig -d /dev/mst/mt41686_pciconf0 --yes set INTERNAL_CPU_IB_VPORT0=EXT_HOST_PF
$ sudo mlxconfig -d /dev/mst/mt41686_pciconf0 --yes set INTERNAL_CPU_OFFLOAD_ENGINE=DISABLED
# Accurate scheduling related settings
$ sudo mlxconfig -d /dev/mst/mt41686_pciconf0 --yes set CQE_COMPRESSION=1
$ sudo mlxconfig -d /dev/mst/mt41686_pciconf0 --yes set PROG_PARSE_GRAPH=1
$ sudo mlxconfig -d /dev/mst/mt41686_pciconf0 --yes set ACCURATE_TX_SCHEDULER=1
$ sudo mlxconfig -d /dev/mst/mt41686_pciconf0 --yes set FLEX_PARSER_PROFILE_ENABLE=4
$ sudo mlxconfig -d /dev/mst/mt41686_pciconf0 --yes set REAL_TIME_CLOCK_ENABLE=1
# NOTE: Power cycle the host for these settings to take effect
# Verify that the NIC FW changes have been applied
$ sudo mlxconfig -d /dev/mst/mt41686_pciconf0 q | grep "CQE_COMPRESSION\|PROG_PARSE_GRAPH\|ACCURATE_TX_SCHEDULER\
\|FLEX_PARSER_PROFILE_ENABLE\|REAL_TIME_CLOCK_ENABLE\|INTERNAL_CPU_MODEL\|LINK_TYPE_P1\|LINK_TYPE_P2\
\|INTERNAL_CPU_PAGE_SUPPLIER\|INTERNAL_CPU_ESWITCH_MANAGER\|INTERNAL_CPU_IB_VPORT0\|INTERNAL_CPU_OFFLOAD_ENGINE"
INTERNAL_CPU_MODEL EMBEDDED_CPU(1)
INTERNAL_CPU_PAGE_SUPPLIER EXT_HOST_PF(1)
INTERNAL_CPU_ESWITCH_MANAGER EXT_HOST_PF(1)
INTERNAL_CPU_IB_VPORT0 EXT_HOST_PF(1)
INTERNAL_CPU_OFFLOAD_ENGINE DISABLED(1)
FLEX_PARSER_PROFILE_ENABLE 4
PROG_PARSE_GRAPH True(1)
ACCURATE_TX_SCHEDULER True(1)
CQE_COMPRESSION AGGRESSIVE(1)
REAL_TIME_CLOCK_ENABLE True(1)
LINK_TYPE_P1 ETH(2)
LINK_TYPE_P2 ETH(2)
Run the following commands after each reboot to disable Ethernet flow control (pause frames) on the NIC interfaces:
# Update interface names if needed
$ sudo ethtool -A ens6f0 rx off tx off
$ sudo ethtool -A ens6f1 rx off tx off
# Verify the change has been applied
$ sudo ethtool -a ens6f0
Pause parameters for ens6f0:
Autonegotiate: off
RX: off
TX: off
$ sudo ethtool -a ens6f1
Pause parameters for ens6f1:
Autonegotiate: off
RX: off
TX: off
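Because the pause settings must be reapplied after every reboot, a quick check that they are still in effect can be useful. The sketch below only verifies the state and does not change it. The helper name pause_is_off is hypothetical, and the interface names ens6f0 and ens6f1 are the example names used above.

```shell
# Succeed only if ethtool -a style text on stdin shows both RX and TX as "off"
pause_is_off() {
    awk '
        ($1 == "RX:" || $1 == "TX:") { seen++; if ($2 != "off") bad = 1 }
        END { exit (bad || seen < 2) }
    '
}

for ifname in ens6f0 ens6f1; do    # update interface names if needed
    if [ -d "/sys/class/net/$ifname" ] && command -v ethtool >/dev/null 2>&1; then
        if sudo ethtool -a "$ifname" | pause_is_off; then
            echo "$ifname: flow control is off"
        else
            echo "$ifname: flow control is still enabled; rerun ethtool -A"
        fi
    fi
done
```

Interfaces that do not exist on the host are skipped silently, so an empty result means the interface names need updating.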
Install the CUDA driver after MOFED.
If the installed CUDA driver is older than the version specified in the release manifest, follow the instructions below to remove the old CUDA driver and install the version specified in the release manifest.
Removing Old CUDA Driver
If this is the first time the CUDA driver is being installed on the system, go directly to the next section, “Installing CUDA Driver”. If this is an existing system and you don’t know how the CUDA driver was previously installed, see How to remove old CUDA toolkit and driver. Run the following commands to remove the old CUDA driver.
# Unload dependent kernel modules
$ sudo service nvidia-persistenced stop
$ sudo service nvidia-docker stop
$ sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nv_peer_mem nvidia_peermem gdrdrv nvidia
# Check the installed CUDA driver version
$ apt list --installed | grep cuda-drivers
# Remove the driver if you have the older version installed.
# For example, cuda-drivers-520 was installed on the system.
$ sudo apt purge cuda-drivers-520
$ sudo apt autoremove
# Remove the driver if it was installed by runfile installer before.
$ sudo /usr/bin/nvidia-uninstall
Installing CUDA Driver
Since the 23-2 release, Ubuntu 22.04 server is used as the host OS, and the NVIDIA runfile is used to install the CUDA driver instead of the apt method used previously. Remove the old CUDA driver first, then install the new driver.
# Install CUDA driver
$ wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda_12.2.0_535.54.03_linux.run
$ sudo sh cuda_12.2.0_535.54.03_linux.run --driver --silent
# check dkms status and nvidia-smi
$ dkms status
nvidia/535.54.03, 5.15.0-72-lowlatency, x86_64: installed
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-PCIE-40GB On | 00000000:B6:00.0 Off | 0 |
| N/A 37C P0 58W / 250W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
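The two checks above can be combined into a short script that compares the driver built by dkms and the driver reported by nvidia-smi with the expected version. This is a sketch under assumptions: the helper name dkms_nvidia_version and the REQUIRED_DRIVER variable are hypothetical, REQUIRED_DRIVER should come from the Release Manifest, and the dkms status line is assumed to have the format shown above.

```shell
# Extract the driver version from a dkms status line such as
# "nvidia/535.54.03, 5.15.0-72-lowlatency, x86_64: installed"
dkms_nvidia_version() {
    sed -n 's|^nvidia/\([^,]*\),.*: installed.*$|\1|p' | head -n1
}

REQUIRED_DRIVER=535.54.03   # assumption: replace with the Release Manifest value

if command -v dkms >/dev/null 2>&1; then
    BUILT=$(dkms status | dkms_nvidia_version)
    echo "dkms module version:  ${BUILT:-none}"
fi

if command -v nvidia-smi >/dev/null 2>&1; then
    LOADED=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
    if [ "$LOADED" = "$REQUIRED_DRIVER" ]; then
        echo "Driver $LOADED matches the required version"
    else
        echo "Driver ${LOADED:-none} does not match required $REQUIRED_DRIVER"
    fi
fi
```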
Loading nvidia-peermem
Currently, no service loads nvidia-peermem.ko automatically. See the system initialization script section for how to load it in aerial-init.sh. You can also load the module manually with the following command.
$ sudo modprobe nvidia-peermem
# Verify it is loaded
$ lsmod | grep peer
nvidia_peermem 16384 0
ib_core 344064 9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia 39055360 430 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset
If the system fails to load nvidia-peermem, follow the instructions in How to remove old CUDA toolkit and driver, then repeat the “Installing CUDA Driver” instructions above.
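Since the module is not loaded automatically, a small check after each boot can confirm whether it is present. The sketch below reads /proc/modules and reports when nvidia_peermem is missing. The helper name module_listed is hypothetical. Note that the kernel lists the module with an underscore even though modprobe also accepts the hyphenated name.

```shell
# Succeed if module $1 appears in the first column of lsmod or /proc/modules text
module_listed() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

if [ -r /proc/modules ]; then
    if module_listed nvidia_peermem < /proc/modules; then
        echo "nvidia_peermem is loaded"
    else
        echo "nvidia_peermem is not loaded; run: sudo modprobe nvidia-peermem"
    fi
fi
```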
This step is optional. To remove the old cuBB container, enter the following commands:
$ sudo docker stop <cuBB container name>
$ sudo docker rm <cuBB container name>
The cuBB container is available from the NVIDIA GPU Cloud (NGC).
After you log in to NGC, open the common NGC container page, which lists a variety of containers. Toward the top of that page, look for the docker pull command line for the cuBB SDK container.
Enter the command shown on the NGC page. Once the docker pull command completes, the cuBB docker image is loaded and ready to use.
Use the following command to run the cuBB container if it is not running. The --restart unless-stopped option restarts the container automatically after a reboot unless it was explicitly stopped.
$ sudo docker run --restart unless-stopped -dP --gpus all --network host --shm-size=4096m --privileged -it -v /lib/modules:/lib/modules -v /dev/hugepages:/dev/hugepages -v /usr/src:/usr/src -v ~/share:/opt/cuBB/share --userns=host --ipc=host -v /var/log/aerial:/var/log/aerial --name cuBB <cuBB container IMAGE ID>
$ sudo docker exec -it cuBB /bin/bash
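Before issuing docker run, it can help to check whether a container named cuBB already exists, because docker run fails when the name is taken. The following sketch defines two hypothetical helpers, name_in_list and ensure_cubb_running. It assumes the container name cuBB used in the command above and does not create a new container itself.

```shell
# Succeed if the exact name $1 appears as a whole line in the list on stdin
name_in_list() {
    grep -Fxq -- "$1"
}

# Report or start an existing container named cuBB; never creates a new one
ensure_cubb_running() {
    if sudo docker ps --format '{{.Names}}' | name_in_list cuBB; then
        echo "cuBB is already running"
    elif sudo docker ps -a --format '{{.Names}}' | name_in_list cuBB; then
        echo "cuBB exists but is stopped; starting it"
        sudo docker start cuBB
    else
        echo "No cuBB container found; create it with the docker run command"
    fi
}
```

Call ensure_cubb_running on the host after sourcing the functions, then attach with the docker exec command shown above.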
If you receive the cuBB container image through nvonline, run “docker load < cuBB container image file” to load the image.
Note that the following instructions apply only when running the Aerial 23-1 cuBB container with the AX800. Newer cuBB containers do not need them.
# Backup the existing doca and dpdk folders
$ cp -r /opt/mellanox/doca /tmp/doca
$ cp -r /opt/mellanox/dpdk /tmp/dpdk
$ export OFED_VERSION=23.04-0.5.3.3
$ export UBUNTU_VERSION=22.04
$ wget http://www.mellanox.com/downloads/ofed/MLNX_OFED-$OFED_VERSION/MLNX_OFED_LINUX-$OFED_VERSION-ubuntu$UBUNTU_VERSION-x86_64.tgz
$ tar xvf MLNX_OFED_LINUX-$OFED_VERSION-ubuntu$UBUNTU_VERSION-x86_64.tgz
$ cd MLNX_OFED_LINUX-$OFED_VERSION-ubuntu$UBUNTU_VERSION-x86_64
# Uninstall current MOFED
$ sudo ./uninstall.sh
# Install new MOFED
$ ./mlnxofedinstall --user-space-only --without-fw-update --force --without-ucx-cuda --add-kernel-support
# Restore the doca and dpdk folders
$ cp -r /tmp/doca /opt/mellanox/doca
$ cp -r /tmp/dpdk /opt/mellanox/dpdk