NVIDIA BlueField DPU BSP v4.5.1 LTS

Installation Troubleshooting and How-Tos

The following is a comprehensive list of the parameters supported for customizing bf.cfg during BFB installation:

###############################################################################
# Configuration which can also be set in
# UEFI->Device Manager->System Configuration
###############################################################################

# Enable SMMU in ACPI.
#SYS_ENABLE_SMMU = TRUE

# Enable I2C0 in ACPI.
#SYS_ENABLE_I2C0 = FALSE

# Disable SPMI in ACPI.
#SYS_DISABLE_SPMI = FALSE

# Enable the second eMMC card which is only available on the BlueField Reference Platform.
#SYS_ENABLE_2ND_EMMC = FALSE

# Enable eMMC boot partition protection.
#SYS_BOOT_PROTECT = FALSE

# Enable SPCR table in ACPI.
#SYS_ENABLE_SPCR = FALSE

# Disable PCIe in ACPI.
#SYS_DISABLE_PCIE = FALSE

# Enable OP-TEE in ACPI.
#SYS_ENABLE_OPTEE = FALSE

###############################################################################
# Boot Order configuration
# Each entry BOOT<N> could have the following format:
# PXE:
# BOOT<N> = NET-<NIC_P0 | NIC_P1 | OOB | RSHIM>-<IPV4 | IPV6>
# PXE over VLAN (vlan-id in decimal):
# BOOT<N> = NET-<NIC_P0 | NIC_P1 | OOB | RSHIM>[.<vlan-id>]-<IPV4 | IPV6>
# UEFI Shell:
# BOOT<N> = UEFI_SHELL
# DISK: boot entries created during OS installation.
# BOOT<N> = DISK
###############################################################################
# This example configures PXE boot over the 2nd ConnectX port.
# If it fails, it continues to boot from disk with boot entries created during
# OS installation.
#BOOT0 = NET-NIC_P1-IPV4
#BOOT1 = DISK

###############################################################################
# Other misc configuration
###############################################################################

# MAC address of the rshim network interface (tmfifo_net0).
#NET_RSHIM_MAC = 00:1a:ca:ff:ff:01

# DHCP class identifier for PXE (arbitrary string up to 32 characters)
#PXE_DHCP_CLASS_ID = NVIDIA/BF/PXE

# Create dual boot partition scheme (Ubuntu only)
# DUAL_BOOT=yes

# Upgrade NIC firmware
# WITH_NIC_FW_UPDATE=yes

# Target storage device for the DPU OS (default SSD: /dev/nvme0n1)
device=/dev/nvme0n1

# bfb_modify_os – shell function called after the file system is extracted on the
# target partitions. It can be used to modify files or create new files on the
# target file system mounted under /mnt. So the file path should look as follows:
# /mnt/<expected_path_on_target_OS>. This can be used to run a specific tool from
# the target OS (remember to add /mnt to the path for the tool).

# bfb_pre_install – shell function called before the eMMC partitions are formatted
# and the OS filesystem is extracted.

# bfb_post_install – shell function called as the last step before reboot.
# All eMMC partitions are unmounted at this stage.
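As an illustration, a minimal bf.cfg combining a few of the parameters above might look as follows. The chosen values are examples, not recommendations:

```shell
# Minimal illustrative bf.cfg; all parameter names are from the list above.
# Install the DPU OS on the eMMC instead of the default NVMe SSD.
device=/dev/mmcblk0

# Create the dual boot partition scheme (Ubuntu only).
DUAL_BOOT=yes

# Try PXE over the OOB interface first, then fall back to disk.
BOOT0=NET-OOB-IPV4
BOOT1=DISK
```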

If the .bfb file cannot recognize the BlueField board type, it reverts to low-core operation and the following message is printed on the console:

***System type can't be determined***
***Booting as a minimal system***

Please contact NVIDIA Support if this occurs.

The following errors appear on the console if images are corrupted or not signed properly:

Device          Error
BlueField       ERROR: Failed to load BL2 firmware
BlueField-2     ERROR: Failed to load BL2R firmware
BlueField-3     Failed to load PSC-BL1 or PSC VERIFY_BCT timeout

This is most likely configuration related.

  • If installing through the RShim interface, check whether /var/pxe/centos7 is mounted. If it is not, either mount it manually or re-run the setup.sh script.

  • Check the Linux boot messages to see whether the eMMC is found. If it is not, the BlueField driver patch is missing. For local installation via RShim, run the setup.sh script with the absolute path and check for errors. For a corporate PXE server, make sure the BlueField and ConnectX driver disks are patched into the initrd image.
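The first check can be scripted. The helper function below is illustrative; /var/pxe/centos7 is the mount point named above:

```shell
#!/bin/sh
# Illustrative helper: report whether a given path appears as a mount point
# in /proc/mounts (e.g. the /var/pxe/centos7 PXE root mentioned above).
check_pxe_mount() {
    mp="$1"
    if grep -qs " $mp " /proc/mounts; then
        echo "mounted"
    else
        echo "not mounted"
    fi
}
```

If the function reports "not mounted", mount the path manually or re-run setup.sh as described above.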

Run /opt/mellanox/scripts/bfvcheck:

root@bluefield:/usr/bin/bfvcheck# ./bfvcheck
Beginning version check...

-RECOMMENDED VERSIONS-
ATF: v1.5(release):BL2.0-1-gf9f7cdd
UEFI: 2.0-6004a6b
FW: 18.25.1010

-INSTALLED VERSIONS-
ATF: v1.5(release):BL2.0-1-gf9f7cdd
UEFI: 2.0-6004a6b
FW: 18.25.1010

Version checked
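A small wrapper can flag a mismatch between the recommended and installed sections automatically. This is a hypothetical helper, not part of bfvcheck; it assumes the output format shown above:

```shell
#!/bin/sh
# Illustrative helper: read bfvcheck-style output on stdin and print "OK" if
# the RECOMMENDED and INSTALLED sections match, or "MISMATCH <component>"
# for each component that differs.
check_versions() {
    awk '
        /-RECOMMENDED VERSIONS-/ { sect = "rec"; next }
        /-INSTALLED VERSIONS-/   { sect = "ins"; next }
        /^(ATF|UEFI|FW):/ {
            if (sect == "rec") rec[$1] = $2
            if (sect == "ins") ins[$1] = $2
        }
        END {
            bad = 0
            for (k in rec)
                if (rec[k] != ins[k]) { print "MISMATCH " k; bad = 1 }
            if (!bad) print "OK"
        }'
}
```

Usage: `bfvcheck | check_versions`.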

Also, the version information is printed to the console.

For ATF, a version string is printed as the system boots.


"NOTICE: BL2: v1.3(release):v1.3-554-ga622cde"

For UEFI, a version string is printed as the system boots.


"UEFI firmware (version 0.99-18d57e3 built at 00:55:30 on Apr 13 2018)"

For Yocto, run:


$ cat /etc/bluefield_version
2.0.0.10817

See the readme at <BF_INST_DIR>/src/drivers/rshim/README.

  1. Boot the target through the RShim interface from a host machine:


    $ cat <BF_INST_DIR>/sample/install.bfb > /dev/rshim<N>/boot

  2. Log into the BlueField target and run:


    $ /opt/mlnx/scripts/bfrec
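The push in step 1 can be wrapped in a small helper that validates both ends before streaming the image. The function name and argument handling are illustrative, not part of the BlueField software:

```shell
#!/bin/sh
# Hypothetical helper: stream a BFB image into an rshim boot endpoint,
# checking that both paths exist first.
push_bfb() {
    bfb="$1"    # path to the .bfb image
    dev="$2"    # e.g. /dev/rshim0/boot
    [ -r "$bfb" ] || { echo "BFB image not readable: $bfb" >&2; return 1; }
    [ -e "$dev" ] || { echo "boot endpoint not found: $dev" >&2; return 1; }
    # Writing the image into the boot endpoint resets and boots the target.
    cat "$bfb" > "$dev"
}
```

Usage: `push_bfb <BF_INST_DIR>/sample/install.bfb /dev/rshim0/boot`.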

The mst, mlxburn, and flint tools can be used to update firmware.

For Ubuntu, CentOS and Debian, run the following command from the Arm side:


sudo /opt/mellanox/mlnx-fw-updater/mlnx_fw_updater.pl

Configuring ConnectX firmware can be done using the mlxconfig tool.

It is possible to configure privileges of both the internal (Arm) and the external host (for DPUs) from a privileged host. According to the configured privilege, a host may or may not perform certain operations related to the NIC (e.g. determine if a certain host is allowed to read port counters).

For more information and examples, refer to the MFT User Manual.

Press the "Esc" key when prompted after booting (before the countdown timer expires) to enter the UEFI boot menu, then use the arrow keys to select a menu option.

It can take 1-2 minutes to enter the Boot Manager, depending on how many devices are installed and whether the EXPROM is programmed.

Once in the boot manager:

  • "EFI Network xxx" entries with device path "PciRoot..." are ConnectX interfaces

  • "EFI Network xxx" entries with device path "MAC(..." are for the RShim interface and the BlueField OOB Ethernet interface

Select an interface and press Enter to start PXE boot.

The following are several useful commands in the UEFI shell:

Shell> ls FS0:                                             # display files
Shell> ls FS0:\EFI                                         # display files
Shell> cls                                                 # clear screen
Shell> ifconfig -l                                         # show interfaces
Shell> ifconfig -s eth0 dhcp                               # request DHCP
Shell> ifconfig -l eth0                                    # show one interface
Shell> tftp 192.168.100.1 grub.cfg FS0:\grub.cfg           # tftp download a file
Shell> bcfg boot dump                                      # dump boot variables
Shell> bcfg boot add 0 FS0:\EFI\centos\shim.efi "CentOS"   # create an entry

The default Yocto kernel has CONFIG_KGDB and CONFIG_KGDB_SERIAL_CONSOLE enabled. This allows the Linux kernel on BlueField to be debugged over the serial port. A single serial port cannot be used both as a console and by KGDB at the same time. It is recommended to use the RShim for console access (/dev/rshim0/console) and the UART port (/dev/ttyAMA0 or /dev/ttyAMA1) for KGDB. Kernel GDB over console (KGDBOC) does not work over the RShim console. If the RShim console is not available, there are open-source packages such as KGDB demux and agent-proxy which allow a single serial port to be shared.

There are two ways to configure KGDBOC. If the OS is already booted, then write the name of the serial device to the KGDBOC module parameter. For example:


$ echo ttyAMA1 > /sys/module/kgdboc/parameters/kgdboc

To attach GDB to the kernel, it must be stopped first. One way to do that is to send a "g" to /proc/sysrq-trigger.


$ echo g > /proc/sysrq-trigger

To debug incidents that occur at boot time, kernel boot parameters must be configured. Add "kgdboc=ttyAMA1,115200 kgdbwait" to the boot arguments to use UART1 for debugging and force the kernel to wait for GDB to attach before booting.

Once the KGDBOC module is configured and the kernel stopped, run the Arm64 GDB on the host machine connected to the serial port, then set the remote target to the serial device on the host side.

<BF_INST_DIR>/sdk/sysroots/x86_64-pokysdk-linux/usr/bin/aarch64-poky-linux/aarch64-poky-linux-gdb <BF_INST_DIR>/sample/vmlinux

(gdb) target remote /dev/ttyUSB3
Remote debugging using /dev/ttyUSB3
arch_kgdb_breakpoint () at /labhome/dwoods/src/bf/linux/arch/arm64/include/asm/kgdb.h:32
32        asm ("brk %0" : : "I" (KGDB_COMPILED_DBG_BRK_IMM));
(gdb)

<BF_INST_DIR> is the directory where the BlueField software is installed. It is assumed that the SDK has been unpacked in the same directory.

SMMU can affect performance for certain applications. It is disabled by default, and its setting can be changed in several ways:

  • Enable/disable SMMU in the UEFI System Configuration

  • Set it in bf.cfg and push it together with the install.bfb (see section "Installing Popular Linux Distributions on BlueField")

  • In BlueField Linux, create a file with one line with SYS_ENABLE_SMMU=TRUE, then run bfcfg.

The configuration change will take effect after reboot. The configuration value is stored in a persistent UEFI variable. It is not modified by OS installation.

See section "UEFI System Configuration" for information on how to access the UEFI System Configuration menu.
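The third method can be sketched as follows. Writing the file to the working directory here is illustrative; place the file wherever your bfcfg invocation expects it, and note that the bfcfg and reboot steps are commented out because they only apply on a live BlueField:

```shell
# Create a one-line configuration enabling SMMU; ./bf.cfg is an illustrative
# stand-in for the configuration file consumed by bfcfg on the BlueField.
echo "SYS_ENABLE_SMMU=TRUE" > ./bf.cfg
# On the BlueField itself, apply the setting and reboot for it to take effect:
# bfcfg
# reboot
```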

On UART0:

$ echo "console=ttyAMA0 earlycon=pl011,0x01000000 initrd=initramfs" > bootarg
$ <BF_INST_DIR>/bin/mlx-mkbfb --boot-args bootarg \
      <BF_INST_DIR>/sample/install.bfb

On UART1:

$ echo "console=ttyAMA1 earlycon=pl011,0x01000000 initrd=initramfs" > bootarg
$ <BF_INST_DIR>/bin/mlx-mkbfb --boot-args bootarg \
      <BF_INST_DIR>/sample/install.bfb

On RShim:

$ echo "console=hvc0 initrd=initramfs" > bootarg
$ <BF_INST_DIR>/bin/mlx-mkbfb --boot-args bootarg \
      <BF_INST_DIR>/sample/install.bfb

On Ubuntu OS, the default network configuration for tmfifo_net0 and oob_net0 interfaces is set by the cloud-init service upon first boot after BFB installation.

The default content of /var/lib/cloud/seed/nocloud-net/network-config is as follows:

# cat /var/lib/cloud/seed/nocloud-net/network-config
version: 2
renderer: NetworkManager
ethernets:
  tmfifo_net0:
    dhcp4: false
    addresses:
      - 192.168.100.2/30
    nameservers:
      addresses: [ 192.168.100.1 ]
    routes:
    - to: 0.0.0.0/0
      via: 192.168.100.1
      metric: 1025
  oob_net0:
    dhcp4: true

This content can be modified during BFB installation using bf.cfg. For example:

# cat bf.cfg
bfb_modify_os()
{
        sed -i -e '/oob_net0/,+1d' /mnt/var/lib/cloud/seed/nocloud-net/network-config
        cat >> /mnt/var/lib/cloud/seed/nocloud-net/network-config << EOF
  oob_net0:
    dhcp4: false
    addresses:
      - 10.0.0.1/24
EOF
}

# bfb-install -c bf.cfg -r rshim0 -b <BFB>

Warning

Using the same technique, any configuration file on the BlueField DPU side can be updated during the BFB installation process.
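The sed/cat edit inside bfb_modify_os can be previewed on a local copy of the file before embedding it in bf.cfg. The file created below is a trimmed local stand-in for the target's network-config, not the real file under /mnt:

```shell
#!/bin/sh
# Recreate a trimmed copy of the cloud-init network-config locally.
cfg=./network-config
cat > "$cfg" << 'EOF'
version: 2
renderer: NetworkManager
ethernets:
  tmfifo_net0:
    dhcp4: false
  oob_net0:
    dhcp4: true
EOF
# Same edit as in bfb_modify_os: delete the oob_net0 line plus the line after it...
sed -i -e '/oob_net0/,+1d' "$cfg"
# ...then append the static replacement block.
cat >> "$cfg" << 'EOF'
  oob_net0:
    dhcp4: false
    addresses:
      - 10.0.0.1/24
EOF
cat "$cfg"
```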

During the BFB installation process, DPU storage can be securely sanitized using either the shred utility or the mmc and nvme utilities in the bf.cfg configuration file, as illustrated in the following subsections.

Warning

By default, only the installation target storage is formatted using the Linux mkfs utility.

Using shred Utility

# cat bf.cfg
SANITIZE_DONE=${SANITIZE_DONE:-0}
export SANITIZE_DONE
if [ $SANITIZE_DONE -eq 0 ]; then
        sleep 3m
        /sbin/modprobe nvme

        if [ -e /dev/mmcblk0 ]; then
                echo Sanitizing /dev/mmcblk0 | tee /dev/kmsg
                echo Sanitizing /dev/mmcblk0 > /tmp/sanitize.emmc.log
                shred -v -n 1 -z /dev/mmcblk0 >> /tmp/sanitize.emmc.log 2>&1
        fi
        if [ -e /dev/nvme0n1 ]; then
                echo Sanitizing /dev/nvme0n1 | tee /dev/kmsg
                echo Sanitizing /dev/nvme0n1 > /tmp/sanitize.ssd.log
                shred -v -n 1 -z /dev/nvme0n1 >> /tmp/sanitize.ssd.log 2>&1
        fi
        SANITIZE_DONE=1
        echo ===================== sanitize.log ===================== | tee /dev/kmsg
        cat /tmp/sanitize.*.log | tee /dev/kmsg
        sync
fi

bfb_modify_os()
{
        echo ===================== bfb_modify_os ===================== | tee /dev/kmsg
        if ( /bin/ls -1 /tmp/sanitize.*.log > /dev/null 2>&1 ); then
                cat /tmp/sanitize.*.log > /mnt/root/sanitize.log
        fi
}


Using mmc and nvme Utilities

# cat bf.cfg
SANITIZE_DONE=${SANITIZE_DONE:-0}
export SANITIZE_DONE
if [ $SANITIZE_DONE -eq 0 ]; then
        sleep 3m
        /sbin/modprobe nvme

        if [ -e /dev/mmcblk0 ]; then
                echo Sanitizing /dev/mmcblk0 | tee /dev/kmsg
                echo Sanitizing /dev/mmcblk0 > /tmp/sanitize.emmc.log
                mmc sanitize /dev/mmcblk0 >> /tmp/sanitize.emmc.log 2>&1
        fi
        if [ -e /dev/nvme0n1 ]; then
                echo Sanitizing /dev/nvme0n1 | tee /dev/kmsg
                echo Sanitizing /dev/nvme0n1 > /tmp/sanitize.ssd.log
                nvme sanitize /dev/nvme0n1 -a 2 >> /tmp/sanitize.ssd.log 2>&1
                nvme sanitize-log /dev/nvme0n1 >> /tmp/sanitize.ssd.log 2>&1
        fi
        SANITIZE_DONE=1
        echo ===================== sanitize.log ===================== | tee /dev/kmsg
        cat /tmp/sanitize.*.log | tee /dev/kmsg
        sync
fi

bfb_modify_os()
{
        echo ===================== bfb_modify_os ===================== | tee /dev/kmsg
        if ( /bin/ls -1 /tmp/sanitize.*.log > /dev/null 2>&1 ); then
                cat /tmp/sanitize.*.log > /mnt/root/sanitize.log
        fi
}

Before powering off or power cycling the DPU, it is strongly recommended to perform a graceful shutdown of the DPU Arm OS.

Graceful shutdown of the Arm OS ensures that data within the eMMC/NVMe cache is properly written to storage, and helps prevent filesystem inconsistencies and file corruption.

There are several ways to gracefully shutdown the DPU Arm OS:

  • Log into the DPU Arm OS and issue a shutdown command prior to power cycling the host server. For example:

    sudo shutdown -h now

  • Assuming the DPU BMC can issue NC-SI OEM commands to the DPU:

    1. Issue the Shutdown Smart NIC OS NC-SI OEM command.

    2. After DPU Arm OS shutdown, it is recommended to issue a DPU Arm OS state query, which indicates whether DPU Arm OS shutdown has completed (standby indication). This can be done by issuing the Get Smart NIC OS State NC-SI OEM command.

© Copyright 2023, NVIDIA. Last updated on Mar 3, 2024.