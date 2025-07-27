NVIDIA BlueField Platform Software Troubleshooting Guide
SoC Management Interface

Preface

RShim, the SoC management interface in the BlueField System-on-Chip (SoC), enables management, monitoring, and debugging of the device. It offers key functions like firmware updates, system status checks, Arm console access, and network communication through device files (e.g., boot, misc, console) and the RShim network interface (i.e., tmfifo_net0). This guide focuses on practical usage and troubleshooting from the user's side.

Command Cheat Sheet

  • Check version:

    rshim --version

  • Check RShim log:

    echo 'DISPLAY_LEVEL 2' > /dev/rshim0/misc
    cat /dev/rshim0/misc

  • Check RShim system log:

    journalctl -u rshim > rshim_logs.txt

  • Check all system logs:

    journalctl > all_logs.txt

  • Access RShim console (either command):

    minicom -D /dev/rshim0/console -C rshim_console.txt
    screen /dev/rshim0/console 115200

  • Update BlueField firmware (BFB) locally (any one of the following):

    cat new_firmware.bfb > /dev/rshim0/boot
    dd if=new_firmware.bfb of=/dev/rshim0/boot bs=1M
    bfb-install -b /tmp/new_firmware.bfb -r /dev/rshim0

  • Update BlueField firmware (BFB) remotely (either command):

    scp new_firmware.bfb root@<bf-bmc-hostname>:/dev/rshim0/boot
    bfb-install -b new_firmware.bfb -r 15.22.111.63:rshim0

  • Configure the RShim network interface:

    ifconfig tmfifo_net0 192.168.100.2 netmask 255.255.255.252 up

Logging and Counters

RShim logging uses an internal 1KB hardware buffer to track booting progress and record important messages. It is written by the NVIDIA BlueField Arm cores and is displayed by the RShim driver from the USB/PCIe host machine.

The RShim log messages can be displayed as follows:

  1. Check the current DISPLAY_LEVEL in the /dev/rshim0/misc file:

    # cat /dev/rshim0/misc
    DISPLAY_LEVEL   0 (0:basic, 1:advanced, 2:log)
    …

  2. Set DISPLAY_LEVEL to 2:

    # echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc

  3. Log messages are displayed in the misc file. The following is an example output from BlueField-2:

    # cat /dev/rshim0/misc
    ...
    ---------------------------------------
            Log Messages
    ---------------------------------------
     INFO[BL2]: start
     INFO[BL2]: no DDR on MSS0
     INFO[BL2]: calc DDR freq (clk_ref 53836948)
     INFO[BL2]: DDR POST passed
     INFO[BL2]: UEFI loaded
     INFO[BL31]: start
     INFO[BL31]: runtime
     INFO[UEFI]: eMMC init
     INFO[UEFI]: eMMC probed
     INFO[UEFI]: PCIe enum start
     INFO[UEFI]: PCIe enum end
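
Once DISPLAY_LEVEL is set to 2, the log section can be filtered for error entries. The following is a minimal sketch, not part of the RShim tooling: the function name rshim_log_errors is hypothetical, and the parsing assumes the misc-file layout shown in the example output above.

```shell
# Hypothetical helper: print only ERR[...] lines from the "Log Messages"
# section of misc-file text read on stdin. Assumes the layout shown above.
rshim_log_errors() {
    awk '
        /Log Messages/ { in_log = 1; next }   # start of the log section
        in_log && /^[ \t]*ERR\[/ {            # keep error entries only
            sub(/^[ \t]+/, "")                # trim leading whitespace
            print
        }
    '
}

# Example usage on a machine that owns the RShim device (requires root):
#   echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc
#   rshim_log_errors < /dev/rshim0/misc
```

An empty result means no ERR entries were found in the text that was read.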

The BFB installation flow can be traced using the following interfaces:

  • From the host –

    • RShim console (/dev/rshim0/console)

    • RShim log buffer (/dev/rshim0/misc); also included in bfb-install's output

    • UART console (/dev/ttyUSB0)

  • From the BMC console –

    • SSH to the BMC and run obmc-console-client

      Additional information about BMC interfaces is available in BMC software documentation.

  • From the BlueField –

    • /root/<OS>.installation.log available on the DPU OS after installation

Debug Info Package

Non-secure BlueField devices support GDB debugging through OpenOCD. BlueField RShim support has been upstreamed to the OpenOCD project, which implements a GDB server for BlueField debugging. OpenOCD can use the RShim driver to access the Arm debug access port (DAP) on the BlueField SoC directly. For more information, refer to the documentation in /auto/sw_soc_dev/bluefield-rel-4.7.0/2024-04-26/build/install/Documentation/HOWTO-openocd, which also describes how to use GDB to debug the Linux kernel.

  1. To get started, boot the BlueField with the EFI stub debug image to reproduce the crash and halt the system when the Synchronous Exception occurs. It is also possible to add an infinite loop to the code where attaching the debugger is desired, and to then manually set the program counter to jump past the loop.

  2. Run GDB and OpenOCD on the host server machine connected to the BlueField. It is best practice to copy the OpenOCD binary and config files to a separate directory so the config can be edited as needed:

    # Create writeable OpenOCD copy. Edit target/bluefield.cfg to specify which rshim device to use.
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp openocd ~/james/openocd/
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp interface/rshim.cfg ~/james/openocd/interface/
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp target/bluefield.cfg ~/james/openocd/target/
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp board/bluefield.cfg ~/james/openocd/board/

    # Run OpenOCD (GDB server communicating with BF through rshim) in one window
    root@bu-lab102:~/james/openocd# ./openocd -f board/bluefield.cfg

    # In another window source toolchain and run GDB client
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/dist# ./poky-glibc-x86_64-core-image-initramfs-aarch64-bluefield-toolchain-BlueField-4.7.0.13127.2.7.4.sh
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/dist# . /opt/poky/2.7.4/environment-setup-aarch64-poky-linux
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/dist# aarch64-poky-linux-gdb

  3. In the GDB client window, run the following:

    # Connect to GDB server and set remote timeout (seconds)
    (gdb) target extended-remote :3333
    (gdb) set remotetimeout 60

    # Source helpful debug functions
    (gdb) source /auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd/scripts/bfdbg.py

    # Available commands
    (gdb) bf-help
     bf-edk2 symbol [all]                 -- Load symbols
     bf-info                              -- Display info
     bf-mmu <virt2phys | lookup> <vaddr>  -- MMU operation
     bf-reg [<reg-name [value]> | all]    -- Show/Set registers

    # Verify the BF is in EDK2 mode (may need to reboot and restart if not)
    (gdb) bf-info
    PC = 0x45fee7294, EL = 2
    EDK2

    # Load all EDK2/UEFI symbols (this can take a while)
    (gdb) bf-edk2 symbol all

    # The backtrace can now be viewed with symbol information
    (gdb) bt
    #0  0x000000045fee6680 in CpuDeadLoop () at /home/scratch/james/build/edk2/edk2/MdePkg/Library/BaseLib/CpuDeadLoop.c:31
    #1  0x000000045fee6a14 in DefaultExceptionHandler (ExceptionType=<optimized out>, SystemContext=...) at /home/scratch/james/build/edk2/edk2/MlxPlatformPkg/Library/DefaultExceptionHandlerLib/AArch64/DefaultExceptionHandler.c:336
    #2  0x000000045fee7340 in ExceptionHandlersEnd ()
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)

    The backtrace above does not provide much helpful information in this case (it shows the device is halted in the EDK2 exception handler), but may be useful depending on the issue.

  4. The RShim log can provide the PC address:

    Synchronous Exception at 0x459B89420

     ERR[UEFI]: PC=0x459B89420(B900003F D5033F9F 94000076 34000960)
     ERR[UEFI]: PC=0x459B88F48
     ERR[UEFI]: PC=0x459B84998
     ERR[UEFI]: PC=0x98D05A68 (0x13A68) [ 1] DxeCore.dll
     ERR[UEFI]: PC=0x45A7973A8 (0x103A8) [ 2] BdsDxe.dll
     ERR[UEFI]: X0=0x45FFE0018 X1=0x400000 X2=0x99FFF548 X3=0x99FFF568
     ERR[UEFI]: X4=0x99FFF570 X5=0x82000000 X6=0x45F2363C0 X7=0x11A18F858A986D85
     ERR[UEFI]: X8=0x4A3823DC9042A9DE X9=0x4D54D42AC44A6076 X10=0x1 X11=0x99FFF3F7
     ERR[UEFI]: X12=0x45F2AC018 X13=0x99FFF3F8 X14=0x1 X15=0x88000C40

  5. Disassemble the instructions at that address (x /32i dumps 32 instructions):

    (gdb) x /32i 0x459B89420
       0x459b89420: str     wzr, [x1]
       0x459b89424: dsb     sy
       0x459b89428: bl      0x459b89600
       0x459b8942c: cbz     w0, 0x459b89558
       0x459b89430: bl      0x459b89610
       0x459b89434: cbz     w0, 0x459b89544
       0x459b89438: and     x1, x19, #0xffffffffffe00000
       0x459b8943c: adrp    x26, 0x45a1e2000
    ...

    The faulting instruction is str wzr, [x1], which writes zero (wzr is the zero register) to the memory address held in register x1 (the brackets denote dereferencing). This suggests that x1 holds an address that cannot be written to. The RShim log prints this register, and its value is 0x400000, which is secure RAM that the executing code cannot write to, hence the synchronous exception:

    ERR[UEFI]: X0=0x45FFE0018 X1=0x400000 X2=0x99FFF548 X3=0x99FFF568

    Note that this x1 variable is part of the EDK2 EFI_SYSTEM_CONTEXT_AARCH64 structure and the assembly code can be read to determine which register this is stored in for more debug if needed.

  6. Various system registers can also be inspected with GDB (refer to /auto/sw_soc_dev/bluefield-rel-4.7.0/2024-04-30/build/install/lib/openocd/scripts/aarch64.py or Arm spec for list of relevant register names to use):

    (gdb) bf-reg ttbr0_el2
    ttbr0_el2 = 0x99feb000

    (gdb) info reg
    ...

    Note

    There may be issues accessing some registers depending on the current exception level (reference the Arm specifications for more information).
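
The first ERR[UEFI] line in the RShim log also prints the raw instruction words at the faulting PC (B900003F D5033F9F ...). As a cross-check against the GDB disassembly, the sketch below extracts the register fields from the first word using shell arithmetic. It assumes the standard A64 encoding of STR (immediate, unsigned offset), in which bits 4:0 hold Rt, bits 9:5 hold Rn, and bits 21:10 hold the scaled offset. The function name a64_str_fields is hypothetical.

```shell
# Hypothetical helper: split an A64 "STR (immediate, unsigned offset)" word
# into its Rt, Rn, and imm12 fields. Register number 31 in the Rt field
# denotes the zero register (wzr/xzr) for this instruction.
a64_str_fields() {
    word=$(( $1 ))
    rt=$(( word & 0x1f ))              # bits 4:0   source register
    rn=$(( (word >> 5) & 0x1f ))       # bits 9:5   base address register
    imm12=$(( (word >> 10) & 0xfff ))  # bits 21:10 scaled unsigned offset
    echo "rt=$rt rn=$rn imm12=$imm12"
}

a64_str_fields 0xB900003F   # prints "rt=31 rn=1 imm12=0"
```

Rt = 31 with Rn = 1 and a zero offset corresponds to str wzr, [x1], which agrees with the first line of the disassembly.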

Using Breakpoints

Use hardware breakpoints (hbreak) rather than software breakpoints with BlueField, because inserting software breakpoints can cause issues. To demonstrate breakpoint usage, the following example adds an infinite loop to the code before the point where the crash occurs, so that the debugger can be attached and breakpoints can be added. The following diff has been applied to the test/crash image:

--- a/MdeModulePkg/Library/UefiBootManagerLib/BmBoot.c
+++ b/MdeModulePkg/Library/UefiBootManagerLib/BmBoot.c
@@ -1790,6 +1790,8 @@ EfiBootManagerBoot (
     return;
   }

+  __asm__ volatile("b .");
+
...

OpenOCD SMP support also has to be disabled for hardware breakpoints to avoid halting all cores. Make the following change to your target/bluefield.cfg:

 # Configure SMP
 if { $_cores > 1 } {
-    eval $_smp_command
+#    eval $_smp_command
 }

Load the new test image and follow the previous instructions for attaching OpenOCD and GDB and loading EDK2 symbols. Make sure to attach to the port for a specific core.

Tip

Users may have better luck installing a preboot-install.bfb with the infinite loop and booting from flash rather than RShim because the code would not continue executing after jumping past the loop if the RShim installation times out. To reproduce the issue above, this would also mean installing the Linux image to flash.

Verify the system has stopped at the expected location:

(gdb) where
#0  0x000000045a796d60 in EfiBootManagerBoot (BootOption=BootOption@entry=0x99fff968) at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Library/UefiBootManagerLib/BmBoot.c:1793
...
(gdb) stepi
0x000000045a796d60      1793      __asm__ volatile("b .");

At this point, hardware breakpoints can be added using symbol names:

# Adding breakpoint to a spot close to crash
(gdb) hbreak /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654
Hardware assisted breakpoint 1 at 0x98d05a58: file /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c, line 1654.
 
# Use 'delete <n>' to delete breakpoint number n
(gdb) info b
Num     Type           Disp Enb Address            What
1       hw breakpoint  keep y   0x0000000098d05a58 in CoreStartImage at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654

After breakpoints have been added, the following can be done to move the program counter past the infinite loop (a single 4-byte instruction) and continue execution:

(gdb) set $pc+=4
(gdb) c
Continuing.
 
Breakpoint 1, CoreStartImage (ImageHandle=0x45e205c98, ExitDataSize=0x45e205068, ExitData=0x45e205060) at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654
1654        Image->Status = Image->EntryPoint (ImageHandle, Image->Info.SystemTable);
 
(gdb) where
#0  CoreStartImage (ImageHandle=0x45e205c98, ExitDataSize=0x45e205068, ExitData=0x45e205060) at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654
...

Many of the normal GDB commands are supported.

Note

Sometimes adding breakpoints can cause boot issues, and if the breakpoints cannot be deleted with GDB a hard reboot may be needed to recover.

Note

OpenOCD logs how many hardware breakpoints are available:

Info : bluefield.cpu0: hardware has 6 breakpoints, 4 watchpoints
Info : bluefield.cpu1: hardware has 6 breakpoints, 4 watchpoints
...
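
Because the number of hardware breakpoints is limited, it can be handy to pull these figures out of a saved OpenOCD log. The following is a minimal sketch that assumes the Info line format shown above. The function name hw_breakpoints and the file name openocd.log are hypothetical.

```shell
# Hypothetical helper: read OpenOCD log text on stdin and print
# "<target> <breakpoints> <watchpoints>" for each matching Info line.
hw_breakpoints() {
    sed -n 's/^Info : \([^:]*\): hardware has \([0-9][0-9]*\) breakpoints, \([0-9][0-9]*\) watchpoints.*$/\1 \2 \3/p'
}

# Example usage, assuming the OpenOCD output was saved to openocd.log:
#   hw_breakpoints < openocd.log
```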


Scenarios

Another Backend Already Attached

BlueField devices are equipped with a USB interface through which RShim can be routed, via a USB cable, to an external host running Linux and the RShim driver. In this case, typically following a system reboot, RShim over USB takes precedence and the BlueField host reports the RShim status as another backend already attached. This is correct behavior, as only one RShim back end can be active at any given time, but it means that the BlueField host does not own RShim access. To debug an issue, the user may need to access RShim from the BlueField BMC or the host while RShim is attached to the other side (the host or the BMC, respectively).

The user is able to reclaim RShim ownership safely without logging into the other side:

  1. Stop the RShim driver on the remote Linux host. Run:

    systemctl stop rshim
    systemctl disable rshim

  2. Restart RShim on the BlueField host. Run:

    systemctl enable rshim
    systemctl start rshim

The another backend already attached error can also occur when the RShim back end is owned by the BMC on BlueField devices with an integrated BMC. This case is covered in the sections below.

RShim Driver Not Loading

Verify whether your BlueField features an integrated BMC or not. Run:

# sudo lspci -s $(sudo lspci -d 15b3: | head -1 | awk '{print $1}') -vvv | grep "Product Name"

Example output for a BlueField with an integrated BMC:

Product Name: BlueField-2 DPU 25GbE Dual-Port SFP56, integrated BMC, Crypto and Secure Boot Enabled, 16GB on-board DDR, 1GbE OOB management, Tall Bracket, FHHL

If your BlueField has an integrated BMC, refer to RShim Driver Not Loading on DPU with Integrated BMC.

If your BlueField does not have an integrated BMC, refer to RShim Driver Not Loading on Host on DPU Without Integrated BMC.

RShim Driver Not Loading on DPU with Integrated BMC

RShim Driver Not Loading on Host

  1. Access the BMC via the RJ45 management port of the BlueField.

  2. Disable RShim on the BMC:

    systemctl stop rshim
    systemctl disable rshim

  3. Enable RShim on the host:

    systemctl enable rshim
    systemctl start rshim

  4. Restart RShim service. Run:

    sudo systemctl restart rshim

    If the RShim service does not launch automatically, check its status:

    sudo systemctl status rshim

    This command is expected to display active (running).

  5. Display the current setting. Run:

    # cat /dev/rshim<N>/misc | grep DEV_NAME
    DEV_NAME        pcie-04:00.2 (ro)

    This output indicates that the RShim service is ready to use.

RShim Driver Not Loading on BMC

  1. Verify that the RShim service is not running on the host. Run:

    systemctl status rshim

    If the output is active, then it may be presumed that the host has ownership of the RShim.

  2. Disable RShim on the host. Run:

    systemctl stop rshim
    systemctl disable rshim

  3. Enable RShim on the BMC. Run:

    systemctl enable rshim
    systemctl start rshim

  4. Display the current setting. Run:

    # cat /dev/rshim<N>/misc | grep DEV_NAME
    DEV_NAME        usb-1.0

    This output indicates that the RShim service is ready to use.
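
In the example outputs above, the DEV_NAME value starts with pcie- when RShim is attached over PCIe on the host and with usb- when it is attached on the BMC. The following is a minimal sketch that extracts that prefix from misc-file text, which can help confirm which side currently owns RShim. The function name rshim_backend is hypothetical, and the parsing assumes the DEV_NAME line format shown above.

```shell
# Hypothetical helper: read misc-file text on stdin and print the backend
# type taken from the DEV_NAME prefix ("pcie" or "usb" in the outputs above).
rshim_backend() {
    awk '$1 == "DEV_NAME" { split($2, parts, "-"); print parts[1]; exit }'
}

# Example usage:
#   rshim_backend < /dev/rshim0/misc
```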

RShim Driver Not Loading on Host on DPU Without Integrated BMC

  1. Download the appropriate DEB or RPM package for the RShim driver (the host-side management interface for the DPU).

  2. Reinstall the RShim package on the host.

    • For Ubuntu/Debian, run:

      sudo dpkg --force-all -i rshim-<version>.deb

    • For RHEL/CentOS, run:

      sudo rpm -Uhv rshim-<version>.rpm

  3. Restart RShim service. Run:

    sudo systemctl restart rshim

    If the RShim service does not launch automatically, check its status:

    sudo systemctl status rshim

    This command is expected to display active (running).

  4. Display the current setting. Run:

    # cat /dev/rshim<N>/misc | grep DEV_NAME
    DEV_NAME        pcie-04:00.2 (ro)

    This output indicates that the RShim service is ready to use.

RShim Failed to Set Up CUSE RShim Error

Symptom

When starting the rshim service, the systemd journal may show an error similar to:

$ sudo systemctl status rshim
...
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com systemd[1]: Starting rshim driver for BlueField SoC...
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com systemd[1]: Started rshim driver for BlueField SoC.
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com rshim[13899]: Created PID file: /var/run/rshim.pid
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com rshim[13899]: Probing pcie-0000:b1:00.2(uio)
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com rshim[13899]: Create rshim pcie-0000:b1:00.2
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com rshim[13899]: pcie-0000:b1:00.2 enable
Apr 30 14:08:21 bu-lab105.wes-a.nbulabs.nvidia.com rshim[13899]: rshim1 failed to setup CUSE rshim
...


Cause

The rshim driver depends on the cuse.ko kernel module, which is typically provided by the kernel-modules-extra package. This package is usually installed as a dependency during the RShim RPM or DEB installation.

However, on some RHEL- or Rocky Linux-based systems, this dependency may not be enforced, resulting in a missing cuse.ko module and a failure during RShim initialization.

Solution

Note

Installing kernel-modules-extra may trigger a kernel upgrade if your current kernel version is not available in the configured repositories. For example, installing this package may update the kernel from 5.14.0-570.4.1 to 5.14.0-570.12.1. This may also pull in related packages such as kernel, kernel-core, and kernel-modules.

  1. Install kernel-modules-extra. For RHEL/Rocky Linux systems, install the package using:

    sudo dnf install kernel-modules-extra

  2. Load the cuse module. If the installed kernel-modules-extra matches the currently running kernel, you can load the cuse.ko module:

    sudo modprobe cuse

    If no errors are reported, the cuse module is now available for RShim.

  3. Restart the RShim service. Once cuse is loaded, restart the RShim service:

    sudo systemctl restart rshim

    You should no longer see the failed to setup CUSE rshim error.

Additional Notes

  • If modprobe cuse fails with a message about a missing module, it likely means the newly installed kernel-modules-extra version does not match the currently running kernel.

  • In this case, reboot the system to use the updated kernel:

    sudo reboot

  • After reboot, verify the running kernel version:

    uname -r

  • Ensure it matches the version of kernel-modules-extra that was installed.

  • In rare cases, you may need to adjust the GRUB configuration to ensure the system boots into the new kernel automatically:

    sudo grub2-set-default 0
    sudo grub2-mkconfig -o /boot/grub2/grub.cfg
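
The version check described in the notes above can be sketched as a small shell function. This is an illustration under stated assumptions, not an NVIDIA-provided tool: the function name kmod_extra_matches is hypothetical, and it assumes that rpm -q prints package names in its default name-version-release.arch form, so that stripping the kernel-modules-extra- prefix yields a string comparable to the output of uname -r.

```shell
# Hypothetical helper: read "rpm -q kernel-modules-extra" output on stdin
# (one package per line) and succeed if any entry matches kernel release $1.
kmod_extra_matches() {
    running=$1
    while IFS= read -r pkg; do
        # Strip the package-name prefix and compare with the kernel release.
        [ "${pkg#kernel-modules-extra-}" = "$running" ] && return 0
    done
    return 1
}

# Example usage on an RPM-based system:
#   if rpm -q kernel-modules-extra | kmod_extra_matches "$(uname -r)"; then
#       sudo modprobe cuse
#   else
#       echo "kernel-modules-extra does not match the running kernel"
#   fi
```

If the check fails, rebooting into the kernel that matches the installed package, as described above, is the likely next step.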

Failed to Read IOMMU Link Error

The following is an informational message printed by the RShim driver when it tries to access the device via IOMMU:

rshim service: /sys/bus/pci/devices/0000:01:00.2/iommu_group: failed to read iommu link

The RShim driver probes access mechanisms in the following order: IOMMU, UIO, direct map. It tries each in turn until one succeeds, so the failure of a single mechanism does not mean that the RShim driver has failed, unless that mechanism is required (for example, IOMMU when Linux kernel lockdown is enabled).
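
The try-in-order behavior can be pictured with the following sketch. It is purely illustrative and is not the RShim driver's code: probe_iommu, probe_uio, probe_direct_map, and probe_rshim are hypothetical stand-ins, with the IOMMU probe hard-coded to fail so that the fallback is visible.

```shell
# Illustrative only: try each access mechanism in order and stop at the
# first that succeeds. The probe_* functions are hypothetical stand-ins,
# not functions of the RShim driver.
probe_iommu()      { return 1; }   # simulate "failed to read iommu link"
probe_uio()        { return 0; }   # simulate a successful UIO probe
probe_direct_map() { return 0; }

probe_rshim() {
    for mech in iommu uio direct_map; do
        if "probe_$mech"; then
            echo "$mech"
            return 0
        fi
        echo "probe via $mech failed, trying next mechanism" >&2
    done
    return 1   # every mechanism failed
}

probe_rshim   # prints "uio"; the IOMMU failure is only informational
```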

Change Ownership of RShim from NIC BMC to Host

  1. Verify that your BlueField has an integrated BMC. Run the following on the host:

    # sudo lspci -s $(sudo lspci -d 15b3: | head -1 | awk '{print $1}') -vvv | grep "Product Name"
    Product Name: BlueField-2 DPU 25GbE Dual-Port SFP56, integrated BMC, Crypto and Secure Boot Enabled, 16GB on-board DDR, 1GbE OOB management, Tall Bracket, FHHL

    The product name is expected to include integrated BMC.

  2. Access the BMC via the RJ45 management port of the BlueField.

  3. Disable RShim on the BMC:

    systemctl stop rshim
    systemctl disable rshim

  4. Enable RShim on the host:

    systemctl enable rshim
    systemctl start rshim

  5. Restart RShim service. Run:

    sudo systemctl restart rshim

    If the RShim service does not launch automatically, check its status:

    sudo systemctl status rshim

    This command is expected to display active (running).

  6. Display the current setting. Run:

    # cat /dev/rshim<N>/misc | grep DEV_NAME
    DEV_NAME        pcie-04:00.2 (ro)

    This output indicates that the RShim service is ready to use.
