SoC Management Interface
RShim, the SoC management interface in the BlueField System-on-Chip (SoC), enables management, monitoring, and debugging of the device. It offers key functions like firmware updates, system status checks, Arm console access, and network communication through device files (e.g., boot, misc, console) and the RShim network interface (i.e., tmfifo_net0). This guide focuses on practical usage and troubleshooting from the user's side.
| Command | Description |
| | Check version |
| | Check RShim log |
| | Check RShim system log |
| | Check all system log |
| | Access RShim Console |
| | Update BlueField firmware (BFB) locally |
| | Update BlueField firmware (BFB) remotely |
| | Configure the RShim network interface |
RShim logging uses an internal 1KB hardware buffer to track booting progress and record important messages. It is written by the NVIDIA BlueField Arm cores and is displayed by the RShim driver from the USB/PCIe host machine.
The RShim log messages can be displayed as follows:
- Check the DISPLAY_LEVEL level in the file /dev/rshim0/misc:

      # cat /dev/rshim0/misc
      DISPLAY_LEVEL 0 (0:basic, 1:advanced, 2:log)
      …

- Set DISPLAY_LEVEL to 2:

      # echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc

- Log messages are displayed in the misc file. The following is an example output from BlueField-2:

      # cat /dev/rshim0/misc
      ...
      ---------------------------------------
                   Log Messages
      ---------------------------------------
      INFO[BL2]: start
      INFO[BL2]: no DDR on MSS0
      INFO[BL2]: calc DDR freq (clk_ref 53836948)
      INFO[BL2]: DDR POST passed
      INFO[BL2]: UEFI loaded
      INFO[BL31]: start
      INFO[BL31]: runtime
      INFO[UEFI]: eMMC init
      INFO[UEFI]: eMMC probed
      INFO[UEFI]: PCIe enum start
      INFO[UEFI]: PCIe enum end
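The steps above can be wrapped in a small filter that prints only the log portion of the misc output. This is a minimal sketch, not part of the RShim tooling: the function name rshim_log_messages is hypothetical, and it assumes the "Log Messages" banner is laid out as in the example output (a title line followed by a dashed rule).

```shell
#!/bin/sh
# Hypothetical helper: read misc-style output on stdin and print only the
# lines that follow the "Log Messages" banner.
rshim_log_messages() {
    awk '
        /Log Messages/ { in_log = 1; skip = 1; next }               # banner title line
        in_log && skip && /^[ \t]*-+[ \t]*$/ { skip = 0; next }     # dashed rule closing the banner
        in_log { print }                                            # everything after the banner
    '
}

# Usage on a host with the RShim driver loaded (paths taken from the steps above):
#   echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc
#   rshim_log_messages < /dev/rshim0/misc
```

Keeping the filter separate from the device access makes it usable on saved copies of the misc output as well as on the live device file.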
The BFB installation flow can be traced using the following interfaces:
- From the host:
  - RShim console (/dev/rshim0/console)
  - RShim log buffer (/dev/rshim0/misc); also included in bfb-install's output
  - UART console (/dev/ttyUSB0)
- From the BMC console:
  - SSH to the BMC and run obmc-console-client

  Additional information about BMC interfaces is available in the BMC software documentation.
- From the BlueField:
  - /root/<OS>.installation.log, available on the DPU OS after installation

Non-secure BlueField devices support GDB debugging through OpenOCD. BlueField RShim support is upstreamed to the OpenOCD project, which implements a GDB server for BlueField debugging. OpenOCD can use the RShim driver to access the Arm debug access port (DAP) on the BlueField SoC directly. For more information, refer to the documentation in /auto/sw_soc_dev/bluefield-rel-4.7.0/2024-04-26/build/install/Documentation/HOWTO-openocd, which also describes how to use GDB to debug the Linux kernel.
- To get started, boot the BlueField with the EFI stub debug image to reproduce the crash and halt the system when the Synchronous Exception occurs. It is also possible to add an infinite loop to the code where attaching the debugger is desired, and to then manually set the program counter to jump past the loop. 
- Run GDB and OpenOCD on the host server machine connected to the BlueField. It is best practice to copy the OpenOCD binary and config files to a separate directory so the config can be edited as needed:

      # Create writeable OpenOCD copy. Edit target/bluefield.cfg to specify which rshim device to use.
      root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp openocd ~/james/openocd/
      root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp interface/rshim.cfg ~/james/openocd/interface/
      root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp target/bluefield.cfg ~/james/openocd/target/
      root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp board/bluefield.cfg ~/james/openocd/board/

      # Run OpenOCD (GDB server communicating with BF through rshim) in one window
      root@bu-lab102:~/james/openocd# ./openocd -f board/bluefield.cfg

      # In another window source toolchain and run GDB client
      root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/dist# ./poky-glibc-x86_64-core-image-initramfs-aarch64-bluefield-toolchain-BlueField-4.7.0.13127.2.7.4.sh
      root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/dist# . /opt/poky/2.7.4/environment-setup-aarch64-poky-linux
      root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/dist# aarch64-poky-linux-gdb
- In the GDB client window, perform:

      # Connect to GDB server and set remote timeout (seconds)
      (gdb) target extended-remote :3333
      (gdb) set remotetimeout 60

      # Source helpful debug functions
      (gdb) source /auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd/scripts/bfdbg.py

      # Available commands
      (gdb) bf-help
      bf-edk2 symbol [all] -- Load symbols
      bf-info -- Display info
      bf-mmu <virt2phys | lookup> <vaddr> -- MMU operation
      bf-reg [<reg-name [value]> | all] -- Show/Set registers

      # Verify the BF is in EDK2 mode (may need to reboot and restart if not)
      (gdb) bf-info
      PC = 0x45fee7294, EL = 2
      EDK2

      # Load all EDK2/UEFI symbols (this can take a while)
      (gdb) bf-edk2 symbol all

      # Now can look at the backtrace with symbol information
      (gdb) bt
      #0  0x000000045fee6680 in CpuDeadLoop () at /home/scratch/james/build/edk2/edk2/MdePkg/Library/BaseLib/CpuDeadLoop.c:31
      #1  0x000000045fee6a14 in DefaultExceptionHandler (ExceptionType=<optimized out>, SystemContext=...) at /home/scratch/james/build/edk2/edk2/MlxPlatformPkg/Library/DefaultExceptionHandlerLib/AArch64/DefaultExceptionHandler.c:336
      #2  0x000000045fee7340 in ExceptionHandlersEnd ()
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)

  The backtrace above does not provide much helpful information in this case (it shows the device is halted in the EDK2 exception handler), but may be useful depending on the issue.
- The RShim log can provide the PC address:

      Synchronous Exception at 0x459B89420
      ERR[UEFI]: PC=0x459B89420 (B900003F D5033F9F 94000076 34000960)
      ERR[UEFI]: PC=0x459B88F48
      ERR[UEFI]: PC=0x459B84998
      ERR[UEFI]: PC=0x98D05A68 (0x13A68) [1] DxeCore.dll
      ERR[UEFI]: PC=0x45A7973A8 (0x103A8) [2] BdsDxe.dll
      ERR[UEFI]: X0=0x45FFE0018 X1=0x400000 X2=0x99FFF548 X3=0x99FFF568
      ERR[UEFI]: X4=0x99FFF570 X5=0x82000000 X6=0x45F2363C0 X7=0x11A18F858A986D85
      ERR[UEFI]: X8=0x4A3823DC9042A9DE X9=0x4D54D42AC44A6076 X10=0x1 X11=0x99FFF3F7
      ERR[UEFI]: X12=0x45F2AC018 X13=0x99FFF3F8 X14=0x1 X15=0x88000C40
- Dump the 32 instructions starting at that address:

      (gdb) x /32i 0x459B89420
      0x459b89420: str wzr, [x1]
      0x459b89424: dsb sy
      0x459b89428: bl 0x459b89600
      0x459b8942c: cbz w0, 0x459b89558
      0x459b89430: bl 0x459b89610
      0x459b89434: cbz w0, 0x459b89544
      0x459b89438: and x1, x19, #0xffffffffffe00000
      0x459b8943c: adrp x26, 0x45a1e2000
      ...

  This shows that the issue is related to str wzr, [x1], which writes zero to the memory address held in x1 (the brackets denote a dereference). This hints that x1 contains a memory address that cannot be written to. Looking at the RShim logs, this value is printed and can be seen to be 0x400000 (secure RAM that the executing code cannot write to, which causes the synchronous exception):

      ERR[UEFI]: X0=0x45FFE0018 X1=0x400000 X2=0x99FFF548 X3=0x99FFF568

  Note that this x1 value is part of the EDK2 EFI_SYSTEM_CONTEXT_AARCH64 structure, and the assembly code can be read to determine which register it is stored in for further debugging if needed.
- Various system registers can also be inspected with GDB (refer to /auto/sw_soc_dev/bluefield-rel-4.7.0/2024-04-30/build/install/lib/openocd/scripts/aarch64.py or the Arm specification for the list of relevant register names to use):

      (gdb) bf-reg ttbr0_el2
      ttbr0_el2 = 0x99feb000
      (gdb) info reg
      ...

  Note: There may be issues accessing some registers depending on the current exception level (refer to the Arm specifications for more information).
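When the crash address is only available in the RShim log, the PC values can be collected with a short filter before they are used with the GDB x command. The following is a minimal sketch that assumes the ERR[UEFI]: PC=0x... line format shown in the log excerpt above; the function name rshim_exception_pcs is hypothetical and not part of any NVIDIA tool.

```shell
#!/bin/sh
# Hypothetical helper: read RShim log text on stdin and print one PC address
# per line, in the order the log reports them.
rshim_exception_pcs() {
    # Only lines containing "PC=0x<hex>" produce output; register dump
    # lines (X0=..., X1=...) are ignored.
    sed -n 's/.*PC=\(0x[0-9A-Fa-f]*\).*/\1/p'
}

# Usage (path taken from the logging steps earlier in this guide):
#   rshim_exception_pcs < /dev/rshim0/misc
```

The first address printed corresponds to the faulting instruction in the example log, and the later ones to the return addresses the exception handler reported.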
Using Breakpoints
Make sure to use hardware breakpoints (hbreak) rather than software breakpoints with BlueField, because inserting software breakpoints can cause issues. To demonstrate breakpoint usage, the following example adds an infinite loop to the code before the crash occurs so that the debugger can be attached and breakpoints can be added. The following diff has been applied to the test/crash image:
            
--- a/MdeModulePkg/Library/UefiBootManagerLib/BmBoot.c
+++ b/MdeModulePkg/Library/UefiBootManagerLib/BmBoot.c
@@ -1790,6 +1790,8 @@ EfiBootManagerBoot (
     return;
   }
+  __asm__ volatile("b .");
+
...
    
OpenOCD SMP support also has to be disabled for hardware breakpoints to avoid halting all cores. Make the following change to your target/bluefield.cfg:
            
 # Configure SMP
 if { $_cores > 1 } {
-    eval $_smp_command
+#    eval $_smp_command
 }
    
Load the new test image and follow the previous instructions for attaching OpenOCD and GDB and loading EDK2 symbols. Make sure to attach to the port for a specific core.
Users may have better luck installing a preboot-install.bfb with the infinite loop and booting from flash rather than RShim because the code would not continue executing after jumping past the loop if the RShim installation times out. To reproduce the issue above, this would also mean installing the Linux image to flash.
Verify the system has stopped at the expected location:
            
(gdb) where
#0  0x000000045a796d60 in EfiBootManagerBoot (BootOption=BootOption@entry=0x99fff968) at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Library/UefiBootManagerLib/BmBoot.c:1793
...
(gdb) stepi
0x000000045a796d60      1793      __asm__ volatile("b .");
    
At this point, hardware breakpoints can be added using symbol names:
            
# Adding breakpoint to a spot close to crash
(gdb) hbreak /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654
Hardware assisted breakpoint 1 at 0x98d05a58: file /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c, line 1654.

# Use 'delete <n>' to delete breakpoint number n
(gdb) info b
Num     Type           Disp Enb Address            What
1       hw breakpoint  keep y   0x0000000098d05a58 in CoreStartImage at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654
    
After breakpoints have been added, the following can be done to move the program counter past the infinite loop (a single 4-byte instruction) and continue execution:
            
(gdb) set $pc+=4
(gdb) c
Continuing.
 
Breakpoint 1, CoreStartImage (ImageHandle=0x45e205c98, ExitDataSize=0x45e205068, ExitData=0x45e205060) at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654
1654        Image->Status = Image->EntryPoint (ImageHandle, Image->Info.SystemTable);
 
(gdb) where
#0  CoreStartImage (ImageHandle=0x45e205c98, ExitDataSize=0x45e205068, ExitData=0x45e205060) at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654
...
    
Many of the normal GDB commands are supported.
Sometimes adding breakpoints can cause boot issues, and if the breakpoints cannot be deleted with GDB a hard reboot may be needed to recover.
OpenOCD logs how many hardware breakpoints are available:
            
Info : bluefield.cpu0: hardware has 6 breakpoints, 4 watchpoints
Info : bluefield.cpu1: hardware has 6 breakpoints, 4 watchpoints
...
    
    
    
Another Backend Already Attached
BlueField devices are equipped with a USB interface through which RShim can be routed, via USB cable, to an external host running Linux and the RShim driver. In this case, typically following a system reboot, the RShim over USB prevails and the BlueField host reports the RShim status as another backend already attached. This is correct behavior, as only one RShim backend can be active at any given time. However, it means that the BlueField host does not own RShim access. To debug an issue, the user may need to access RShim from the BlueField BMC or host while RShim is attached to the other side (host or BMC, respectively).
The user is able to reclaim RShim ownership safely without logging into the other side:
- Stop the RShim driver on the remote Linux. Run:

      systemctl stop rshim
      systemctl disable rshim

- Restart RShim on the BlueField host. Run:

      systemctl enable rshim
      systemctl start rshim

The another backend already attached error can also occur when the RShim backend is owned by the BMC on BlueField devices with an integrated BMC. This is elaborated on further down on this page.
RShim Driver Not Loading
Verify whether your BlueField features an integrated BMC or not. Run:
            
# sudo lspci -s $(sudo lspci -d 15b3: | head -1 | awk '{print $1}') -vvv | grep "Product Name"
    
Example output for a BlueField with an integrated BMC:
            
Product Name: BlueField-2 DPU 25GbE Dual-Port SFP56, integrated BMC, Crypto and Secure Boot Enabled, 16GB on-board DDR, 1GbE OOB management, Tall Bracket, FHHL
    
If your BlueField has an integrated BMC, refer to "RShim Driver Not Loading on DPU with Integrated BMC".
If your BlueField does not have an integrated BMC, refer to "RShim Driver Not Loading on Host on DPU Without Integrated BMC".
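The choice between the two procedures can be scripted. The following is a minimal sketch, assuming the Product Name string contains the text "integrated BMC" as in the example output above; the function name has_integrated_bmc is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read lspci "Product Name" output on stdin and succeed
# (exit status 0) if it mentions an integrated BMC.
has_integrated_bmc() {
    grep -qi 'integrated BMC'
}

# Usage on the host (the lspci pipeline is the one shown above):
#   sudo lspci -s $(sudo lspci -d 15b3: | head -1 | awk '{print $1}') -vvv \
#       | grep "Product Name" | has_integrated_bmc \
#       && echo "integrated BMC" || echo "no integrated BMC"
```

The match is case-insensitive so that minor differences in how the product name is capitalized do not change the result.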
RShim Driver Not Loading on DPU with Integrated BMC
RShim Driver Not Loading on Host
- Access the BMC via the RJ45 management port of the BlueField.
- Stop and disable RShim on the BMC:

      systemctl stop rshim
      systemctl disable rshim

- Enable RShim on the host:

      systemctl enable rshim
      systemctl start rshim

- Restart the RShim service. Run:

      sudo systemctl restart rshim

  If the RShim service does not launch automatically, run:

      sudo systemctl status rshim

  This command is expected to display active (running).
- Display the current setting. Run:

      # cat /dev/rshim<N>/misc | grep DEV_NAME
      DEV_NAME pcie-04:00.2 (ro)

  This output indicates that the RShim service is ready to use.
RShim Driver Not Loading on BMC
- Verify that the RShim service is not running on the host. Run:

      systemctl status rshim

  If the output is active, then it may be presumed that the host has ownership of the RShim.
- Stop and disable RShim on the host. Run:

      systemctl stop rshim
      systemctl disable rshim

- Enable RShim on the BMC. Run:

      systemctl enable rshim
      systemctl start rshim

- Display the current setting. Run:

      # cat /dev/rshim<N>/misc | grep DEV_NAME
      DEV_NAME usb-1.0

  This output indicates that the RShim service is ready to use.
RShim Driver Not Loading on Host on DPU Without Integrated BMC
- Download the suitable deb/rpm package for the RShim driver (the management interface for the DPU from the host).
- Reinstall the RShim package on the host.
  - For Ubuntu/Debian, run:

        sudo dpkg --force-all -i rshim-<version>.deb

  - For RHEL/CentOS, run:

        sudo rpm -Uhv rshim-<version>.rpm

- Restart the RShim service. Run:

      sudo systemctl restart rshim

  If the RShim service does not launch automatically, run:

      sudo systemctl status rshim

  This command is expected to display active (running).
- Display the current setting. Run:

      # cat /dev/rshim<N>/misc | grep DEV_NAME
      DEV_NAME pcie-04:00.2 (ro)

  This output indicates that the RShim service is ready to use.
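The DEV_NAME value also shows which transport the local RShim driver is attached through: the host procedures above end with a pcie- prefix and the BMC procedure with a usb- prefix. The following is a minimal sketch of that check, assuming only those two prefixes; the function name rshim_backend_type is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read misc-style output on stdin and classify the
# DEV_NAME entry by its prefix.
rshim_backend_type() {
    # Second field of the DEV_NAME line, e.g. "pcie-04:00.2" or "usb-1.0".
    dev=$(awk '$1 == "DEV_NAME" { print $2; exit }')
    case "$dev" in
        pcie-*) echo pcie ;;
        usb-*)  echo usb ;;
        "")     echo none ;;      # no DEV_NAME line found
        *)      echo unknown ;;   # a prefix this sketch does not know about
    esac
}

# Usage (replace <N> with the RShim device index, as in the steps above):
#   rshim_backend_type < /dev/rshim<N>/misc
```

A result of none suggests the driver has not attached to the device, in which case the procedures in this section apply.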
RShim Failed to Set Up CUSE RShim Error
Symptom
When starting the rshim service, the systemd journal may show an error similar to:
            
$ sudo systemctl status rshim
...
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com systemd[1]: Starting rshim driver for BlueField SoC...
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com systemd[1]: Started rshim driver for BlueField SoC.
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com rshim[13899]: Created PID file: /var/run/rshim.pid
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com rshim[13899]: Probing pcie-0000:b1:00.2(uio)
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com rshim[13899]: Create rshim pcie-0000:b1:00.2
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com rshim[13899]: pcie-0000:b1:00.2 enable
Apr 30 14:08:21 bu-lab105.wes-a.nbulabs.nvidia.com rshim[13899]: rshim1 failed to setup CUSE rshim
...
    
    
    
        
Cause
The rshim driver depends on the cuse.ko kernel module, which is typically provided by the kernel-modules-extra package. This package is usually installed as a dependency during the RShim RPM or DEB installation.
However, on some RHEL- or Rocky Linux-based systems, this dependency may not be enforced, resulting in a missing cuse.ko module and a failure during RShim initialization.
Solution
Installing kernel-modules-extra may trigger a kernel upgrade if your current kernel version is not available in the configured repositories. For example, installing this package may update the kernel from 5.14.0-570.4.1 to 5.14.0-570.12.1. This may also pull in related packages such as kernel, kernel-core, and kernel-modules.
- Install kernel-modules-extra. For RHEL/Rocky Linux systems, install the package using:

      sudo dnf install kernel-modules-extra

- Load the cuse module. If the installed kernel-modules-extra matches the currently running kernel, you can load the cuse.ko module:

      sudo modprobe cuse

  If no errors are reported, the cuse module is now available for RShim.
- Restart the RShim service. Once cuse is loaded, restart the RShim service:

      sudo systemctl restart rshim

  You should no longer see the failed to setup CUSE rshim error.
Additional Notes
- If modprobe cuse fails with a message about a missing module, it likely means the newly installed kernel-modules-extra version does not match the currently running kernel.
- In this case, reboot the system to use the updated kernel:

      sudo reboot

- After reboot, verify the running kernel version:

      uname -r

- Ensure it matches the version of kernel-modules-extra that was installed.
- In rare cases, you may need to adjust the GRUB configuration to ensure the system boots into the new kernel automatically:

      sudo grub2-set-default 0
      sudo grub2-mkconfig -o /boot/grub2/grub.cfg
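The version mismatch described above can be checked before running modprobe. The following is a minimal sketch, assuming modules are installed under /lib/modules/<kernel release> and that the module file name starts with cuse.ko (possibly with a compression suffix); the function name cuse_module_present is hypothetical. Kernels that have CUSE built in would not have a module file at all, so a negative result is only a hint.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a cuse module file exists for the given
# kernel release under the given modules root.
#   $1: modules root directory (normally /lib/modules)
#   $2: kernel release string (normally the output of uname -r)
cuse_module_present() {
    modules_root=$1
    kernel_release=$2
    # Module files may be compressed (e.g. cuse.ko.xz), hence the glob.
    [ -n "$(find "$modules_root/$kernel_release" -name 'cuse.ko*' 2>/dev/null | head -n 1)" ]
}

# Usage:
#   cuse_module_present /lib/modules "$(uname -r)" \
#       && echo "cuse available for running kernel" \
#       || echo "no cuse module for running kernel; a reboot may be needed"
```

Passing the modules root and kernel release as arguments keeps the check usable for a kernel other than the one currently running, such as the one just installed.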
Failed to Read IOMMU Link Error
The following is an informational message printed by RShim driver when trying to access via IOMMU:
            
rshim service: /sys/bus/pci/devices/0000:01:00.2/iommu_group: failed to read iommu link
    
The RShim driver probes RShim in the following order: IOMMU, UIO, Direct Map. It tries each mechanism in turn until one succeeds, so the failure of a single mechanism does not mean that the RShim driver has failed, unless that mechanism is required (for example, IOMMU when Linux kernel lockdown is enabled).
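The fallback behavior described above can be pictured as a loop over the mechanisms. The following is an illustrative sketch only, not the driver's actual code: the function probe_in_order and the mechanism labels are hypothetical, and the kernel lockdown case is not modeled.

```shell
#!/bin/sh
# Illustration of "try each mechanism in order until one succeeds".
#   $1: name of a function that attempts one mechanism and returns 0 on success
probe_in_order() {
    try=$1
    for mech in iommu uio direct_map; do
        if "$try" "$mech"; then
            echo "$mech"          # first mechanism that worked
            return 0
        fi
        # A single failure is informational; the next mechanism is tried.
        echo "$mech: probe failed, trying next" >&2
    done
    return 1                      # every mechanism failed
}

# Usage with a stand-in probe function that only succeeds for UIO:
#   only_uio() { [ "$1" = uio ]; }
#   probe_in_order only_uio
```

In this picture, the failed to read iommu link message corresponds to the informational line printed for the first mechanism before the next one is attempted.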
Change Ownership of RShim from NIC BMC to Host
- Verify that your BlueField has an integrated BMC. Run the following on the host:

      # sudo lspci -s $(sudo lspci -d 15b3: | head -1 | awk '{print $1}') -vvv | grep "Product Name"
      Product Name: BlueField-2 DPU 25GbE Dual-Port SFP56, integrated BMC, Crypto and Secure Boot Enabled, 16GB on-board DDR, 1GbE OOB management, Tall Bracket, FHHL

  The product name is supposed to show integrated BMC.
- Access the BMC via the RJ45 management port of the BlueField.
- Stop and disable RShim on the BMC:

      systemctl stop rshim
      systemctl disable rshim

- Enable RShim on the host:

      systemctl enable rshim
      systemctl start rshim

- Restart the RShim service. Run:

      sudo systemctl restart rshim

  If the RShim service does not launch automatically, run:

      sudo systemctl status rshim

  This command is expected to display active (running).
- Display the current setting. Run:

      # cat /dev/rshim<N>/misc | grep DEV_NAME
      DEV_NAME pcie-04:00.2 (ro)

  This output indicates that the RShim service is ready to use.