SoC Management Interface
RShim, the SoC management interface in the BlueField System-on-Chip (SoC), enables management, monitoring, and debugging of the device. It offers key functions like firmware updates, system status checks, Arm console access, and network communication through device files (e.g., boot, misc, console) and the RShim network interface (i.e., tmfifo_net0). This guide focuses on practical usage and troubleshooting from the user's side.
Command | Description
| Check version
| Check RShim log
| Check RShim system log
| Check all system log
| Access RShim Console
| Update BlueField firmware (BFB) locally
| Update BlueField firmware (BFB) remotely
| Configure the RShim network interface
RShim logging uses an internal 1KB hardware buffer to track booting progress and record important messages. It is written by the NVIDIA BlueField Arm cores and is displayed by the RShim driver from the USB/PCIe host machine.
The RShim log messages can be displayed as follows:
Check the DISPLAY_LEVEL setting in the file /dev/rshim0/misc:
# cat /dev/rshim0/misc
DISPLAY_LEVEL 0 (0:basic, 1:advanced, 2:log)
…
Set DISPLAY_LEVEL to 2:
# echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc
Log messages are displayed in the misc file. The following is an example output from BlueField-2:
# cat /dev/rshim0/misc
...
---------------------------------------
Log Messages
---------------------------------------
INFO[BL2]: start
INFO[BL2]: no DDR on MSS0
INFO[BL2]: calc DDR freq (clk_ref 53836948)
INFO[BL2]: DDR POST passed
INFO[BL2]: UEFI loaded
INFO[BL31]: start
INFO[BL31]: runtime
INFO[UEFI]: eMMC init
INFO[UEFI]: eMMC probed
INFO[UEFI]: PCIe enum start
INFO[UEFI]: PCIe enum end
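Because each log line in the misc output follows a LEVEL[STAGE]: message pattern, the captured text is easy to filter. The following is a minimal sketch and not part of the RShim tooling: rshim_log_filter is a hypothetical helper name, and the line pattern is an assumption inferred only from the example output above.

```shell
# Hypothetical helper (not shipped with RShim): keep only the log lines from
# misc-style text read on stdin. The LEVEL[STAGE]: pattern is an assumption
# based on the example output in this guide.
# Optional $1 limits output to one stage tag such as BL2, BL31, or UEFI.
rshim_log_filter() {
    rlf_stage="${1:-[A-Za-z0-9]+}"
    # grep exits non-zero when nothing matches; treat that as "no log lines"
    grep -E "^[A-Z]+\[${rlf_stage}\]:" || true
}
```

For example, after setting DISPLAY_LEVEL to 2, piping cat /dev/rshim0/misc into rshim_log_filter UEFI would be expected to leave only the UEFI-stage messages.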
The BFB installation flow can be traced using the following interfaces:
From the host:
RShim console (/dev/rshim0/console)
RShim log buffer (/dev/rshim0/misc); also included in bfb-install's output
UART console (/dev/ttyUSB0)
From the BMC console:
SSH to the BMC and run obmc-console-client
Additional information about BMC interfaces is available in BMC software documentation.
From the BlueField:
/root/<OS>.installation.log, available on the DPU OS after installation
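Before tracing an installation from the host, it can help to confirm which of the host-side interfaces listed above actually exist on the machine. The following is a minimal sketch; check_trace_paths is a hypothetical helper name, and the device paths passed to it are whatever applies to your system (the index in rshim0 and ttyUSB0 may differ).

```shell
# Hypothetical helper: report whether each path given as an argument exists.
check_trace_paths() {
    for ctp_path in "$@"; do
        if [ -e "$ctp_path" ]; then
            echo "$ctp_path: present"
        else
            echo "$ctp_path: missing"
        fi
    done
}
```

A typical call on the host would be check_trace_paths /dev/rshim0/console /dev/rshim0/misc /dev/ttyUSB0.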
Non-secure BlueField devices support GDB using OpenOCD. BlueField RShim support is upstreamed to the OpenOCD project, which implements a GDB server for BlueField debugging. OpenOCD can use the RShim driver to access the Arm debug access port (DAP) directly on the BlueField SoC from the RShim. For more information, refer to the documentation in /auto/sw_soc_dev/bluefield-rel-4.7.0/2024-04-26/build/install/Documentation/HOWTO-openocd, which also describes how to use GDB to debug the Linux kernel.
To get started, boot the BlueField with the EFI stub debug image to reproduce the crash and halt the system when the Synchronous Exception occurs. It is also possible to add an infinite loop to the code where attaching the debugger is desired, and to then manually set the program counter to jump past the loop.
Run GDB and OpenOCD on the host server machine connected to the BlueField. It is best practice to copy the OpenOCD binary and config files to a separate directory so the config can be edited as needed:
# Create writeable OpenOCD copy. Edit target/bluefield.cfg to specify which rshim device to use.
root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp openocd ~/james/openocd/
root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp interface/rshim.cfg ~/james/openocd/interface/
root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp target/bluefield.cfg ~/james/openocd/target/
root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp board/bluefield.cfg ~/james/openocd/board/
# Run OpenOCD (GDB server communicating with BF through rshim) in one window
root@bu-lab102:~/james/openocd# ./openocd -f board/bluefield.cfg
# In another window source toolchain and run GDB client
root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/dist# ./poky-glibc-x86_64-core-image-initramfs-aarch64-bluefield-toolchain-BlueField-4.7.0.13127.2.7.4.sh
root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/dist# . /opt/poky/2.7.4/environment-setup-aarch64-poky-linux
root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/dist# aarch64-poky-linux-gdb
In the GDB client window, perform:
# Connect to GDB server and set remote timeout (seconds)
(gdb) target extended-remote :3333
(gdb) set remotetimeout 60
# Source helpful debug functions
(gdb) source /auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd/scripts/bfdbg.py
# Available commands
(gdb) bf-help
bf-edk2 symbol [all] -- Load symbols
bf-info -- Display info
bf-mmu <virt2phys | lookup> <vaddr> -- MMU operation
bf-reg [<reg-name [value]> | all] -- Show/Set registers
# Verify the BF is in EDK2 mode (may need to reboot and restart if not)
(gdb) bf-info
PC = 0x45fee7294, EL = 2 EDK2
# Load all EDK2/UEFI symbols (this can take a while)
(gdb) bf-edk2 symbol all
# Now can look at the backtrace with symbol information
(gdb) bt
#0 0x000000045fee6680 in CpuDeadLoop () at /home/scratch/james/build/edk2/edk2/MdePkg/Library/BaseLib/CpuDeadLoop.c:31
#1 0x000000045fee6a14 in DefaultExceptionHandler (ExceptionType=<optimized out>, SystemContext=...) at /home/scratch/james/build/edk2/edk2/MlxPlatformPkg/Library/DefaultExceptionHandlerLib/AArch64/DefaultExceptionHandler.c:336
#2 0x000000045fee7340 in ExceptionHandlersEnd ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
The backtrace above does not provide much helpful information in this case (it shows the device is halted in the EDK2 exception handler), but may be useful depending on the issue.
The RShim log can provide the PC address:
Synchronous Exception at 0x459B89420
ERR[UEFI]: PC=0x459B89420 (B900003F D5033F9F 94000076 34000960)
ERR[UEFI]: PC=0x459B88F48
ERR[UEFI]: PC=0x459B84998
ERR[UEFI]: PC=0x98D05A68 (0x13A68) [1] DxeCore.dll
ERR[UEFI]: PC=0x45A7973A8 (0x103A8) [2] BdsDxe.dll
ERR[UEFI]: X0=0x45FFE0018 X1=0x400000 X2=0x99FFF548 X3=0x99FFF568
ERR[UEFI]: X4=0x99FFF570 X5=0x82000000 X6=0x45F2363C0 X7=0x11A18F858A986D85
ERR[UEFI]: X8=0x4A3823DC9042A9DE X9=0x4D54D42AC44A6076 X10=0x1 X11=0x99FFF3F7
ERR[UEFI]: X12=0x45F2AC018 X13=0x99FFF3F8 X14=0x1 X15=0x88000C40
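When the log is long, the PC values can be pulled out of the ERR[UEFI] lines so they can be pasted into GDB one at a time. The following is a minimal sketch; rshim_pc_values is a hypothetical helper name, and the line format is assumed to match the example above.

```shell
# Hypothetical helper: print the PC value from every "ERR[UEFI]: PC=" line
# read on stdin, in the order the lines appear. The line format is an
# assumption taken from the example log in this guide.
rshim_pc_values() {
    grep -E '^ERR\[UEFI\]: PC=' |
        sed 's/^ERR\[UEFI\]: PC=\(0x[0-9A-Fa-f]*\).*/\1/'
}
```

With the log saved to a file, rshim_pc_values < saved.log prints one address per line (saved.log is a placeholder file name).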
Dump the instructions at that address (x /32i prints 32 instructions):
(gdb) x /32i 0x459B89420
0x459b89420: str wzr, [x1]
0x459b89424: dsb sy
0x459b89428: bl 0x459b89600
0x459b8942c: cbz w0, 0x459b89558
0x459b89430: bl 0x459b89610
0x459b89434: cbz w0, 0x459b89544
0x459b89438: and x1, x19, #0xffffffffffe00000
0x459b8943c: adrp x26, 0x45a1e2000
...
This shows that the issue is related to str wzr, [x1], which writes zero to the memory address held in x1 (the brackets denote dereferencing). This hints that x1 contains a memory address that cannot be written to. The RShim log prints this value, and it can be seen as 0x400000 (secure RAM that the executing code cannot write to, which causes the synchronous exception):
ERR[UEFI]: X0=0x45FFE0018 X1=0x400000 X2=0x99FFF548 X3=0x99FFF568
Note that this x1 value is part of the EDK2 EFI_SYSTEM_CONTEXT_AARCH64 structure, and the assembly code can be read to determine which register it is stored in for further debugging if needed.
Various system registers can also be inspected with GDB (refer to /auto/sw_soc_dev/bluefield-rel-4.7.0/2024-04-30/build/install/lib/openocd/scripts/aarch64.py or the Arm specification for the list of relevant register names to use):
(gdb) bf-reg ttbr0_el2
ttbr0_el2 = 0x99feb000
(gdb) info reg
...
Note
There may be issues accessing some registers depending on the current exception level (refer to the Arm specifications for more information).
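The four instruction words printed next to the first PC in the RShim log (B900003F D5033F9F 94000076 34000960) can be cross-checked against the disassembly without a debugger. As a worked example, the third word sits at the faulting PC plus 8 bytes and is an A64 BL instruction: its top six bits are 100101 and its low 26 bits are a signed offset counted in 4-byte words. The following is a minimal sketch of that arithmetic; a64_bl_target is a hypothetical helper name and handles only BL.

```shell
# Hypothetical helper: compute the branch target of an A64 BL instruction.
# $1 = address of the instruction, $2 = 32-bit instruction word
# (both written as hex with a 0x prefix).
a64_bl_target() {
    abt_pc=$(( $1 ))
    abt_word=$(( $2 ))
    # BL is identified by the top six bits 100101 (0x25)
    if [ $(( abt_word >> 26 )) -ne $(( 0x25 )) ]; then
        echo "not a BL instruction" >&2
        return 1
    fi
    # imm26 is a signed word offset: sign-extend it, then scale by 4 bytes
    abt_imm=$(( abt_word & 0x3ffffff ))
    if [ $(( abt_imm & 0x2000000 )) -ne 0 ]; then
        abt_imm=$(( abt_imm - 0x4000000 ))
    fi
    printf '0x%x\n' $(( abt_pc + abt_imm * 4 ))
}
```

Running a64_bl_target 0x459B89428 0x94000076 prints 0x459b89600, which agrees with the bl 0x459b89600 line in the GDB disassembly.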
Using Breakpoints
Make sure to use hardware breakpoints (hbreak) rather than software breakpoints with BlueField due to issues that can occur when software breakpoints are inserted. To demonstrate breakpoint usage, the following example adds an infinite loop to the code before the crash occurs so that the debugger can be attached and breakpoints can be added. The following diff has been added to the test/crash image:
--- a/MdeModulePkg/Library/UefiBootManagerLib/BmBoot.c
+++ b/MdeModulePkg/Library/UefiBootManagerLib/BmBoot.c
@@ -1790,6 +1790,8 @@ EfiBootManagerBoot (
return;
}
+ __asm__ volatile("b .");
+
...
OpenOCD SMP support also has to be disabled for hardware breakpoints to avoid halting all cores. Make the following change to your target/bluefield.cfg:
# Configure SMP
if { $_cores > 1 } {
- eval $_smp_command
+# eval $_smp_command
}
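The same change can be scripted against a writable copy of the config. The following is a minimal sketch: comment_out_smp is a hypothetical helper name, and it assumes the line to disable reads exactly eval $_smp_command (optionally indented), as in the diff above. It prints the patched config to standard output instead of editing the file in place.

```shell
# Hypothetical helper: print a copy of an OpenOCD target config with the
# "eval $_smp_command" line commented out.
# $1 = path to your writable copy of target/bluefield.cfg
comment_out_smp() {
    # Lines already starting with "#" do not match, so re-running is harmless
    sed 's/^\([[:space:]]*\)eval \$_smp_command/#\1eval $_smp_command/' "$1"
}
```

Review the output first, then redirect it to a new file and point OpenOCD at that file.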
Load the new test image and follow the previous instructions for attaching OpenOCD and GDB and loading EDK2 symbols. Make sure to attach to the port for a specific core.
Users may have better luck installing a preboot-install.bfb with the infinite loop and booting from flash rather than RShim because the code would not continue executing after jumping past the loop if the RShim installation times out. To reproduce the issue above, this would also mean installing the Linux image to flash.
Verify the system has stopped at the expected location:
(gdb) where
#0 0x000000045a796d60 in EfiBootManagerBoot (BootOption=BootOption@entry=0x99fff968) at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Library/UefiBootManagerLib/BmBoot.c:1793
...
(gdb) stepi
0x000000045a796d60 1793 __asm__ volatile("b .");
At this point, hardware breakpoints can be added using symbol names:
# Adding breakpoint to a spot close to crash
(gdb) hbreak /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654
Hardware assisted breakpoint 1 at 0x98d05a58: file /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c, line 1654.
# Use 'delete <n>' to delete breakpoint number n
(gdb) info b
Num Type Disp Enb Address What
1 hw breakpoint keep y 0x0000000098d05a58 in CoreStartImage at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654
After breakpoints have been added, the following can be done to move the program counter past the infinite loop (a single 4-byte instruction) and continue execution:
(gdb) set $pc+=4
(gdb) c
Continuing.
Breakpoint 1, CoreStartImage (ImageHandle=0x45e205c98, ExitDataSize=0x45e205068, ExitData=0x45e205060) at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654
1654 Image->Status = Image->EntryPoint (ImageHandle, Image->Info.SystemTable);
(gdb) where
#0 CoreStartImage (ImageHandle=0x45e205c98, ExitDataSize=0x45e205068, ExitData=0x45e205060) at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654
...
Many of the normal GDB commands are supported.
Sometimes adding breakpoints can cause boot issues, and if the breakpoints cannot be deleted with GDB a hard reboot may be needed to recover.
OpenOCD logs how many hardware breakpoints are available:
Info : bluefield.cpu0: hardware has 6 breakpoints, 4 watchpoints
Info : bluefield.cpu1: hardware has 6 breakpoints, 4 watchpoints
...
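These counts can be extracted from a saved OpenOCD log to check how many hardware breakpoints remain available per core. The following is a minimal sketch; openocd_hw_breakpoints is a hypothetical helper name, and the line format is assumed to match the two example lines above.

```shell
# Hypothetical helper: for each matching OpenOCD log line on stdin, print
# "<target> <breakpoints> <watchpoints>". The line format is an assumption
# taken from the example output in this guide.
openocd_hw_breakpoints() {
    sed -n 's/^Info : \([^:]*\): hardware has \([0-9][0-9]*\) breakpoints, \([0-9][0-9]*\) watchpoints.*/\1 \2 \3/p'
}
```

Lines that do not match the pattern are dropped, so the whole OpenOCD output can be piped through the function.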
Another Backend Already Attached
BlueField devices are equipped with a USB interface through which RShim can be routed, via USB cable, to an external host running Linux and the RShim driver. In this case, typically following a system reboot, RShim over USB prevails and the BlueField host reports the RShim status as another backend already attached. This is correct behavior, as only one RShim backend can be active at any given time. However, this means that the BlueField host does not own RShim access. To debug an issue, the user may need to access RShim from the BlueField BMC or host while RShim is attached to the other side (host or BMC, respectively).
The user is able to reclaim RShim ownership safely without logging into the other side:
Stop the RShim driver on the remote Linux. Run:
systemctl stop rshim
systemctl disable rshim
Restart RShim on the BlueField host. Run:
systemctl enable rshim
systemctl start rshim
This another backend already attached error can also be attributed to the RShim backend being owned by the BMC in BlueField devices with an integrated BMC. This is elaborated on further down on this page.
RShim Driver Not Loading
Verify whether your BlueField features an integrated BMC or not. Run:
# sudo lspci -s $(sudo lspci -d 15b3: | head -1 | awk '{print $1}') -vvv | grep "Product Name"
Example output for a BlueField with an integrated BMC:
Product Name: BlueField-2 DPU 25GbE Dual-Port SFP56, integrated BMC, Crypto and Secure Boot Enabled, 16GB on-board DDR, 1GbE OOB management, Tall Bracket, FHHL
If your BlueField has an integrated BMC, refer to RShim Driver Not Loading on DPU with Integrated BMC.
If your BlueField does not have an integrated BMC, refer to RShim Driver Not Loading on Host on DPU Without Integrated BMC.
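The decision between the two procedures can be scripted from the Product Name line. The following is a minimal sketch; bmc_kind is a hypothetical helper name, and it assumes that a product name without the phrase integrated BMC means the card has no integrated BMC.

```shell
# Hypothetical helper: read lspci -vvv style text on stdin and print
# "integrated-bmc", "no-integrated-bmc", or "unknown" (no Product Name line).
# Assumption: the phrase "integrated BMC" appears in the product name
# whenever the card has one, as in the example output in this guide.
bmc_kind() {
    bk_line="$(grep -m 1 'Product Name:')" || { echo "unknown"; return 0; }
    case "$bk_line" in
        *"integrated BMC"*) echo "integrated-bmc" ;;
        *) echo "no-integrated-bmc" ;;
    esac
}
```

The output of the lspci command shown above can be piped directly into bmc_kind.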
RShim Driver Not Loading on DPU with Integrated BMC
RShim Driver Not Loading on Host
Access the BMC via the RJ45 management port of the BlueField.
Disable RShim on the BMC:
systemctl stop rshim
systemctl disable rshim
Enable RShim on the host:
systemctl enable rshim
systemctl start rshim
Restart the RShim service. Run:
sudo systemctl restart rshim
If the RShim service does not launch automatically, check its state. Run:
sudo systemctl status rshim
This command is expected to display active (running).
Display the current setting. Run:
# cat /dev/rshim<N>/misc | grep DEV_NAME
DEV_NAME pcie-04:00.2 (ro)
This output indicates that the RShim service is ready to use.
RShim Driver Not Loading on BMC
Verify that the RShim service is not running on the host. Run:
systemctl status rshim
If the output is active, then it may be presumed that the host has ownership of the RShim.
Disable RShim on the host. Run:
systemctl stop rshim
systemctl disable rshim
Enable RShim on the BMC. Run:
systemctl enable rshim
systemctl start rshim
Display the current setting. Run:
# cat /dev/rshim<N>/misc | grep DEV_NAME
DEV_NAME usb-1.0
This output indicates that the RShim service is ready to use.
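The DEV_NAME value shows which transport the local RShim driver is attached through: pcie-... in the host example and usb-... in the BMC example. The following is a minimal sketch for extracting that prefix; rshim_transport is a hypothetical helper name, and it assumes DEV_NAME values always take the form <transport>-<address> as in the two examples.

```shell
# Hypothetical helper: read misc-style text on stdin and print the part of
# the DEV_NAME value before the first "-", for example "pcie" or "usb".
# Only the first DEV_NAME line is considered.
rshim_transport() {
    sed -n 's/^DEV_NAME[[:space:]][[:space:]]*\([A-Za-z0-9]*\)-.*/\1/p' | head -n 1
}
```

For example, cat /dev/rshim<N>/misc | rshim_transport would be expected to print pcie on a host that owns RShim over PCIe.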
RShim Driver Not Loading on Host on DPU Without Integrated BMC
Download the suitable deb/rpm package for the RShim driver (the management interface for the DPU from the host).
Reinstall the RShim package on the host.
For Ubuntu/Debian, run:
sudo dpkg --force-all -i rshim-<version>.deb
For RHEL/CentOS, run:
sudo rpm -Uhv rshim-<version>.rpm
Restart the RShim service. Run:
sudo systemctl restart rshim
If the RShim service does not launch automatically, check its state. Run:
sudo systemctl status rshim
This command is expected to display active (running).
Display the current setting. Run:
# cat /dev/rshim<N>/misc | grep DEV_NAME
DEV_NAME pcie-04:00.2 (ro)
This output indicates that the RShim service is ready to use.
Error Messages
The following is an informational message printed by the RShim driver when it tries to access the device via IOMMU:
rshim service: /sys/bus/pci/devices/0000:01:00.2/iommu_group: failed to read iommu link
The RShim driver probes RShim in the following order: IOMMU, UIO, Direct Map. It continues probing until one mechanism succeeds, so the failure of a single mechanism does not mean that the RShim driver has failed. The exception is when a specific mechanism is required, such as IOMMU when Linux kernel lockdown is enabled.
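The probe order described above is an ordered fallback chain: each mechanism is tried in turn, and a failure is only logged unless every mechanism fails. The following sketch illustrates that control flow only. It is not the driver's code; first_working_mechanism and the probe_<name> functions it calls are hypothetical stand-ins, and the kernel lockdown case is not modeled.

```shell
# Illustration only (not RShim driver code): try each named mechanism in
# order and print the first one whose probe succeeds.
# A probe is any shell function named probe_<mechanism> that returns 0 on
# success; these functions are hypothetical and supplied by the caller.
first_working_mechanism() {
    for fwm_name in "$@"; do
        if "probe_${fwm_name}"; then
            echo "$fwm_name"
            return 0
        fi
        # A single failure is informational; move on to the next mechanism
        echo "probe ${fwm_name} failed, trying next" >&2
    done
    return 1
}
```

With stub probes defined, first_working_mechanism iommu uio direct_map prints the name of the first mechanism that succeeds and returns non-zero only if all of them fail, which mirrors why the IOMMU message above is informational.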
Change Ownership of RShim from NIC BMC to Host
Verify that your BlueField has an integrated BMC. Run the following on the host:
# sudo lspci -s $(sudo lspci -d 15b3: | head -1 | awk '{print $1}') -vvv | grep "Product Name"
Product Name: BlueField-2 DPU 25GbE Dual-Port SFP56, integrated BMC, Crypto and Secure Boot Enabled, 16GB on-board DDR, 1GbE OOB management, Tall Bracket, FHHL
The product name is expected to show integrated BMC.
Access the BMC via the RJ45 management port of the BlueField.
Disable RShim on the BMC:
systemctl stop rshim
systemctl disable rshim
Enable RShim on the host:
systemctl enable rshim
systemctl start rshim
Restart the RShim service. Run:
sudo systemctl restart rshim
If the RShim service does not launch automatically, check its state. Run:
sudo systemctl status rshim
This command is expected to display active (running).
Display the current setting. Run:
# cat /dev/rshim<N>/misc | grep DEV_NAME
DEV_NAME pcie-04:00.2 (ro)
This output indicates that the RShim service is ready to use.