NVIDIA BlueField Platform Software Troubleshooting Guide

SoC Management Interface

RShim, the SoC management interface in the BlueField System-on-Chip (SoC), enables management, monitoring, and debugging of the device. It offers key functions like firmware updates, system status checks, Arm console access, and network communication through device files (e.g., boot, misc, console) and the RShim network interface (i.e., tmfifo_net0). This guide focuses on practical usage and troubleshooting from the user's side.

Command                                                          Description

rshim --version                                                  Check version

echo 'DISPLAY_LEVEL 2' > /dev/rshim0/misc                        Check RShim log
cat /dev/rshim0/misc

journalctl -u rshim > rshim_logs.txt                             Check RShim system log

journalctl > all_logs.txt                                        Check all system logs

minicom -D /dev/rshim0/console -C rshim_console.txt              Access the RShim console
screen /dev/rshim0/console 115200

cat new_firmware.bfb > /dev/rshim0/boot                          Update BlueField firmware (BFB) locally
dd if=new_firmware.bfb of=/dev/rshim0/boot bs=1M
bfb-install -b /tmp/new_firmware.bfb -r /dev/rshim0

scp new_firmware.bfb root@<bf-bmc-hostname>:/dev/rshim0/boot     Update BlueField firmware (BFB) remotely
bfb-install -b new_firmware.bfb -r 15.22.111.63:rshim0

ifconfig tmfifo_net0 192.168.100.2 netmask 255.255.255.252 up    Configure the RShim network interface

RShim logging uses an internal 1KB hardware buffer to track booting progress and record important messages. It is written by the NVIDIA BlueField Arm cores and is displayed by the RShim driver from the USB/PCIe host machine.

The RShim log messages can be displayed as follows:

  1. Check the DISPLAY_LEVEL value in the file /dev/rshim0/misc:

    # cat /dev/rshim0/misc
    DISPLAY_LEVEL 0 (0:basic, 1:advanced, 2:log)
    …

  2. Set DISPLAY_LEVEL to 2:

    # echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc

  3. Log messages are displayed in the misc file. The following is an example output from BlueField-2:

    # cat /dev/rshim0/misc
    ...
    ---------------------------------------
                 Log Messages
    ---------------------------------------
    INFO[BL2]: start
    INFO[BL2]: no DDR on MSS0
    INFO[BL2]: calc DDR freq (clk_ref 53836948)
    INFO[BL2]: DDR POST passed
    INFO[BL2]: UEFI loaded
    INFO[BL31]: start
    INFO[BL31]: runtime
    INFO[UEFI]: eMMC init
    INFO[UEFI]: eMMC probed
    INFO[UEFI]: PCIe enum start
    INFO[UEFI]: PCIe enum end
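
The log entries can also be filtered out of the misc output with a small shell helper. The following is a minimal sketch and not part of the RShim tooling: the function names rshim_log_lines and rshim_log_component are hypothetical, and the filter assumes that every log entry has the LEVEL[COMPONENT]: message form seen in the example output. Both functions read the misc text from standard input, so they can be tried on saved output as well as on the live file.

```shell
#!/bin/sh
# Hypothetical helper: print only the log entries (e.g. "INFO[BL2]: start")
# from /dev/rshim0/misc output supplied on standard input.
# Assumption: each entry looks like LEVEL[COMPONENT]: message.
rshim_log_lines() {
    grep -E '^[[:space:]]*[A-Z]+\[[A-Za-z0-9_]+\]: '
}

# Hypothetical helper: keep only the entries of one component, e.g. UEFI.
rshim_log_component() {
    rshim_log_lines | grep -F "[$1]:"
}
```

On a machine that owns the RShim device, a typical call would be to set DISPLAY_LEVEL to 2 as in step 2 and then run: cat /dev/rshim0/misc | rshim_log_component UEFI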

The BFB installation flow can be traced using the following interfaces:

  • From the host –

    • RShim console (/dev/rshim0/console)

    • RShim log buffer (/dev/rshim0/misc); also included in bfb-install's output

    • UART console (/dev/ttyUSB0)

  • From the BMC console –

    • SSH to the BMC and run obmc-console-client

      Additional information about BMC interfaces is available in BMC software documentation.

  • From the BlueField –

    • /root/<OS>.installation.log available on the DPU OS after installation

Non-secure BlueField devices support GDB debugging through OpenOCD. BlueField RShim support has been upstreamed to the OpenOCD project, which implements a GDB server for BlueField debugging. OpenOCD can use the RShim driver to access the Arm debug access port (DAP) on the BlueField SoC directly. For more information, refer to the documentation in /auto/sw_soc_dev/bluefield-rel-4.7.0/2024-04-26/build/install/Documentation/HOWTO-openocd, which also describes how to use GDB to debug the Linux kernel.

  1. To get started, boot the BlueField with the EFI stub debug image to reproduce the crash and halt the system when the Synchronous Exception occurs. It is also possible to add an infinite loop to the code where attaching the debugger is desired, and to then manually set the program counter to jump past the loop.

  2. Run GDB and OpenOCD on the host server machine connected to the BlueField. It is best practice to copy the OpenOCD binary and config files to a separate directory so the config can be edited as needed:

    # Create writeable OpenOCD copy. Edit target/bluefield.cfg to specify which rshim device to use.
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp openocd ~/james/openocd/
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp interface/rshim.cfg ~/james/openocd/interface/
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp target/bluefield.cfg ~/james/openocd/target/
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp board/bluefield.cfg ~/james/openocd/board/

    # Run OpenOCD (GDB server communicating with BF through rshim) in one window
    root@bu-lab102:~/james/openocd# ./openocd -f board/bluefield.cfg

    # In another window source toolchain and run GDB client
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/dist# ./poky-glibc-x86_64-core-image-initramfs-aarch64-bluefield-toolchain-BlueField-4.7.0.13127.2.7.4.sh
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/dist# . /opt/poky/2.7.4/environment-setup-aarch64-poky-linux
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/dist# aarch64-poky-linux-gdb

  3. In the GDB client window, run:

    # Connect to GDB server and set remote timeout (seconds)
    (gdb) target extended-remote :3333
    (gdb) set remotetimeout 60

    # Source helpful debug functions
    (gdb) source /auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd/scripts/bfdbg.py

    # Available commands
    (gdb) bf-help
    bf-edk2 symbol [all] -- Load symbols
    bf-info -- Display info
    bf-mmu <virt2phys | lookup> <vaddr> -- MMU operation
    bf-reg [<reg-name [value]> | all] -- Show/Set registers

    # Verify the BF is in EDK2 mode (may need to reboot and restart if not)
    (gdb) bf-info
    PC = 0x45fee7294, EL = 2 EDK2

    # Load all EDK2/UEFI symbols (this can take a while)
    (gdb) bf-edk2 symbol all

    # Now can look at the backtrace with symbol information
    (gdb) bt
    #0 0x000000045fee6680 in CpuDeadLoop () at /home/scratch/james/build/edk2/edk2/MdePkg/Library/BaseLib/CpuDeadLoop.c:31
    #1 0x000000045fee6a14 in DefaultExceptionHandler (ExceptionType=<optimized out>, SystemContext=...) at /home/scratch/james/build/edk2/edk2/MlxPlatformPkg/Library/DefaultExceptionHandlerLib/AArch64/DefaultExceptionHandler.c:336
    #2 0x000000045fee7340 in ExceptionHandlersEnd ()
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)

    The backtrace above does not provide much helpful information in this case (it shows the device is halted in the EDK2 exception handler), but may be useful depending on the issue.

  4. The RShim log can provide the PC address:

    Synchronous Exception at 0x459B89420

    ERR[UEFI]: PC=0x459B89420(B900003F D5033F9F 94000076 34000960)
    ERR[UEFI]: PC=0x459B88F48
    ERR[UEFI]: PC=0x459B84998
    ERR[UEFI]: PC=0x98D05A68 (0x13A68) [ 1] DxeCore.dll
    ERR[UEFI]: PC=0x45A7973A8 (0x103A8) [ 2] BdsDxe.dll
    ERR[UEFI]: X0=0x45FFE0018 X1=0x400000 X2=0x99FFF548 X3=0x99FFF568
    ERR[UEFI]: X4=0x99FFF570 X5=0x82000000 X6=0x45F2363C0 X7=0x11A18F858A986D85
    ERR[UEFI]: X8=0x4A3823DC9042A9DE X9=0x4D54D42AC44A6076 X10=0x1 X11=0x99FFF3F7
    ERR[UEFI]: X12=0x45F2AC018 X13=0x99FFF3F8 X14=0x1 X15=0x88000C40

  5. Dump 32 instructions starting at that address:

    (gdb) x /32i 0x459B89420
    0x459b89420: str wzr, [x1]
    0x459b89424: dsb sy
    0x459b89428: bl 0x459b89600
    0x459b8942c: cbz w0, 0x459b89558
    0x459b89430: bl 0x459b89610
    0x459b89434: cbz w0, 0x459b89544
    0x459b89438: and x1, x19, #0xffffffffffe00000
    0x459b8943c: adrp x26, 0x45a1e2000
    ...

    This shows that the failing instruction is str wzr, [x1], which writes zero to the memory address held in register x1 (the brackets denote a dereference). This suggests that x1 holds an address that cannot be written to. The RShim log prints this register, and its value is 0x400000 (secure RAM that the executing code cannot write to, which causes the synchronous exception):

    ERR[UEFI]: X0=0x45FFE0018 X1=0x400000 X2=0x99FFF548 X3=0x99FFF568

    Note that this x1 value is part of the EDK2 EFI_SYSTEM_CONTEXT_AARCH64 structure, and the assembly code can be read to determine which register it is stored in if further debugging is needed.

  6. Various system registers can also be inspected with GDB (refer to /auto/sw_soc_dev/bluefield-rel-4.7.0/2024-04-30/build/install/lib/openocd/scripts/aarch64.py or the Arm specification for the list of relevant register names):

    (gdb) bf-reg ttbr0_el2
    ttbr0_el2 = 0x99feb000

    (gdb) info reg
    ...

    Note

    There may be issues accessing some registers depending on the current exception level (reference the Arm specifications for more information).
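
Step 5 above reads the X1 value out of the RShim log by eye. The lookup can be scripted when the log has been saved to a file. The following is a minimal sketch: the function name rshim_reg is hypothetical, and it assumes that registers appear as NAME=0xVALUE tokens on ERR[UEFI] lines, as in the example log. It prints the first match only, so for PC it returns the faulting address from the first PC line.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one register (e.g. X1 or PC)
# from RShim log text supplied on standard input.
# Assumption: registers appear as NAME=0xVALUE tokens on ERR[UEFI] lines.
rshim_reg() {
    reg=$1
    grep -E '^[[:space:]]*ERR\[UEFI\]:' |
        tr ' ' '\n' |
        sed -n "s/^${reg}=\(0x[0-9A-Fa-f]*\).*/\1/p" |
        head -n 1
}
```

For example, running cat /dev/rshim0/misc | rshim_reg X1 against the log shown in step 4 would print the address that the failing store instruction tried to write to.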

Using Breakpoints

Make sure to use hardware breakpoints (hbreak) rather than software breakpoints with BlueField due to issues that can occur when software breakpoints are inserted. To demonstrate breakpoint usage the following example adds an infinite loop to the code before the crash occurs so that the debugger can be attached and breakpoints can be added. The following diff has been added to the test/crash image:

--- a/MdeModulePkg/Library/UefiBootManagerLib/BmBoot.c
+++ b/MdeModulePkg/Library/UefiBootManagerLib/BmBoot.c
@@ -1790,6 +1790,8 @@ EfiBootManagerBoot (
     return;
   }
+  __asm__ volatile("b .");
+
 ...

OpenOCD SMP support also has to be disabled for hardware breakpoints to avoid halting all cores. Make the following change to your target/bluefield.cfg:

 # Configure SMP
 if { $_cores > 1 } {
-    eval $_smp_command
+#   eval $_smp_command
 }

Load the new test image and follow the previous instructions for attaching OpenOCD and GDB and loading EDK2 symbols. Make sure to attach to the port for a specific core.

Tip

Users may have better luck installing a preboot-install.bfb with the infinite loop and booting from flash rather than RShim because the code would not continue executing after jumping past the loop if the RShim installation times out. To reproduce the issue above, this would also mean installing the Linux image to flash.

Verify the system has stopped at the expected location:

(gdb) where
#0 0x000000045a796d60 in EfiBootManagerBoot (BootOption=BootOption@entry=0x99fff968) at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Library/UefiBootManagerLib/BmBoot.c:1793
...
(gdb) stepi
0x000000045a796d60 1793 __asm__ volatile("b .");

At this point, hardware breakpoints can be added using symbol names:

# Adding breakpoint to a spot close to crash
(gdb) hbreak /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654
Hardware assisted breakpoint 1 at 0x98d05a58: file /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c, line 1654.

# Use 'delete <n>' to delete breakpoint number n
(gdb) info b
Num Type Disp Enb Address What
1 hw breakpoint keep y 0x0000000098d05a58 in CoreStartImage at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654

After breakpoints have been added, the following can be done to move the program counter past the infinite loop (a single 4-byte instruction) and continue execution:

(gdb) set $pc+=4
(gdb) c
Continuing.

Breakpoint 1, CoreStartImage (ImageHandle=0x45e205c98, ExitDataSize=0x45e205068, ExitData=0x45e205060) at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654
1654 Image->Status = Image->EntryPoint (ImageHandle, Image->Info.SystemTable);

(gdb) where
#0 CoreStartImage (ImageHandle=0x45e205c98, ExitDataSize=0x45e205068, ExitData=0x45e205060) at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654
...

Many of the normal GDB commands are supported.

Note

Sometimes adding breakpoints can cause boot issues, and if the breakpoints cannot be deleted with GDB a hard reboot may be needed to recover.

Note

OpenOCD logs how many hardware breakpoints are available:

Info : bluefield.cpu0: hardware has 6 breakpoints, 4 watchpoints
Info : bluefield.cpu1: hardware has 6 breakpoints, 4 watchpoints
...


Another Backend Already Attached

BlueField devices are equipped with a USB interface through which RShim can be routed, via a USB cable, to an external host running Linux and the RShim driver. In this case, typically following a system reboot, RShim over USB prevails and the BlueField host reports the RShim status as another backend already attached. This is correct behavior, as only one RShim backend can be active at any given time. However, it means that the BlueField host does not own RShim access. To debug an issue, the user may need to access RShim from the BlueField BMC or host while RShim is attached to the other side (host or BMC, respectively).

The user is able to reclaim RShim ownership safely without logging into the other side:

  1. Stop the RShim driver on the remote Linux machine. Run:

    systemctl stop rshim
    systemctl disable rshim

  2. Restart RShim on the BlueField host. Run:

    systemctl enable rshim
    systemctl start rshim

The another backend already attached error can also occur when the RShim backend is owned by the BMC in BlueField devices with an integrated BMC. This case is covered further down on this page.

RShim Driver Not Loading

Verify whether your BlueField features an integrated BMC or not. Run:

# sudo lspci -s $(sudo lspci -d 15b3: | head -1 | awk '{print $1}') -vvv | grep "Product Name"

Example output for a BlueField with an integrated BMC:

Product Name: BlueField-2 DPU 25GbE Dual-Port SFP56, integrated BMC, Crypto and Secure Boot Enabled, 16GB on-board DDR, 1GbE OOB management, Tall Bracket, FHHL

If your BlueField has an integrated BMC, refer to "RShim Driver Not Loading on DPU with Integrated BMC".

If your BlueField does not have an integrated BMC, refer to "RShim Driver Not Loading on Host on DPU Without Integrated BMC".
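
The decision above can be scripted. The following is a minimal sketch: the function name rshim_troubleshooting_section is hypothetical, and it assumes that the product name contains the phrase "integrated BMC" exactly as in the example output. It reads the Product Name line from standard input and prints the title of the section to follow.

```shell
#!/bin/sh
# Hypothetical helper: read the "Product Name" line from standard input and
# print the title of the troubleshooting section that applies.
# Assumption: devices with an integrated BMC say "integrated BMC" in the name.
rshim_troubleshooting_section() {
    if grep -qi 'integrated BMC'; then
        echo "RShim Driver Not Loading on DPU with Integrated BMC"
    else
        echo "RShim Driver Not Loading on Host on DPU Without Integrated BMC"
    fi
}
```

The output of the lspci command shown earlier in this section can be piped straight into the function.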

RShim Driver Not Loading on DPU with Integrated BMC

RShim Driver Not Loading on Host

  1. Access the BMC via the RJ45 management port of the BlueField.

  2. Disable RShim on the BMC:

    systemctl stop rshim
    systemctl disable rshim

  3. Enable RShim on the host:

    systemctl enable rshim
    systemctl start rshim

  4. Restart RShim service. Run:

    sudo systemctl restart rshim

    If RShim service does not launch automatically, run:

    sudo systemctl status rshim

    This command is expected to display active (running).

  5. Display the current setting. Run:

    # cat /dev/rshim<N>/misc | grep DEV_NAME
    DEV_NAME pcie-04:00.2 (ro)

    This output indicates that the RShim service is ready to use.

RShim Driver Not Loading on BMC

  1. Verify that the RShim service is not running on host. Run:

    systemctl status rshim

    If the output is active, then it may be presumed that the host has ownership of the RShim.

  2. Disable RShim on the host. Run:

    systemctl stop rshim
    systemctl disable rshim

  3. Enable RShim on the BMC. Run:

    systemctl enable rshim
    systemctl start rshim

  4. Display the current setting. Run:

    # cat /dev/rshim<N>/misc | grep DEV_NAME
    DEV_NAME usb-1.0

    This output indicates that the RShim service is ready to use.
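
The DEV_NAME check in the procedures above can be wrapped in a helper that reports which transport the local RShim driver attached through. The following is a minimal sketch: the function name rshim_transport is hypothetical, and it assumes that device names start with pcie- or usb-, as in the two example outputs. It reads the misc text from standard input.

```shell
#!/bin/sh
# Hypothetical helper: classify the RShim backend from the DEV_NAME line of
# /dev/rshim<N>/misc output supplied on standard input.
# Assumption: names start with "pcie-" or "usb-" as in the examples above.
rshim_transport() {
    name=$(awk '$1 == "DEV_NAME" { print $2; exit }')
    case $name in
        pcie-*) echo pcie ;;
        usb-*)  echo usb ;;
        "")     echo none ;;      # no DEV_NAME line found
        *)      echo unknown ;;   # a name this sketch does not recognize
    esac
}
```

A result of none means that no DEV_NAME line was present in the input, which is what one would expect when the driver has not attached to a device.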

RShim Driver Not Loading on Host on DPU Without Integrated BMC

  1. Download the suitable DEB/RPM package for the RShim driver (the management interface for the DPU from the host).

  2. Reinstall RShim package on the host.

    • For Ubuntu/Debian, run:

      sudo dpkg --force-all -i rshim-<version>.deb

    • For RHEL/CentOS, run:

      sudo rpm -Uhv rshim-<version>.rpm

  3. Restart RShim service. Run:

    sudo systemctl restart rshim

    If RShim service does not launch automatically, run:

    sudo systemctl status rshim

    This command is expected to display active (running).

  4. Display the current setting. Run:

    # cat /dev/rshim<N>/misc | grep DEV_NAME
    DEV_NAME pcie-04:00.2 (ro)

    This output indicates that the RShim service is ready to use.

Error Messages

The following is an informational message printed by the RShim driver when it tries to access the device via IOMMU:

rshim service: /sys/bus/pci/devices/0000:01:00.2/iommu_group: failed to read iommu link

The RShim driver probes the access mechanisms in the following order: IOMMU, UIO, direct map. It continues probing until one succeeds, so the failure of a single mechanism does not mean that the RShim driver has failed, unless that mechanism is strictly required (for example, IOMMU when Linux kernel lockdown is enabled).
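
To confirm that the message above only reflects a missing IOMMU group, the sysfs link it names can be checked directly. The following is a minimal sketch: the function name pci_has_iommu_group is hypothetical, and the check only tests whether the iommu_group symbolic link exists under the device directory. It does not reproduce the driver's own probe logic.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the given PCI device
# directory in sysfs contains an iommu_group symbolic link.
pci_has_iommu_group() {
    [ -L "$1/iommu_group" ]
}
```

For the device in the message above, the call would be pci_has_iommu_group /sys/bus/pci/devices/0000:01:00.2. A failing status means that no IOMMU group is assigned to the device, in which case the message is expected and the driver moves on to the next mechanism.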

Change Ownership of RShim from NIC BMC to Host

  1. Verify that your BlueField has an integrated BMC. Run the following on the host:

    # sudo lspci -s $(sudo lspci -d 15b3: | head -1 | awk '{print $1}') -vvv | grep "Product Name"
    Product Name: BlueField-2 DPU 25GbE Dual-Port SFP56, integrated BMC, Crypto and Secure Boot Enabled, 16GB on-board DDR, 1GbE OOB management, Tall Bracket, FHHL

    The product name should include integrated BMC.

  2. Access the BMC via the RJ45 management port of the BlueField.

  3. Disable RShim on the BMC:

    systemctl stop rshim
    systemctl disable rshim

  4. Enable RShim on the host:

    systemctl enable rshim
    systemctl start rshim

  5. Restart RShim service. Run:

    sudo systemctl restart rshim

    If RShim service does not launch automatically, run:

    sudo systemctl status rshim

    This command is expected to display active (running).

  6. Display the current setting. Run:

    # cat /dev/rshim<N>/misc | grep DEV_NAME
    DEV_NAME pcie-04:00.2 (ro)

    This output indicates that the RShim service is ready to use.

© Copyright 2024, NVIDIA. Last updated on Nov 12, 2024.