Software Installation and Upgrade
Information on how a user can troubleshoot issues installing software on BlueField.
Command | Description
cat <BFB file> > /dev/rshim0/boot | Load software via RShim
echo 'DISPLAY_LEVEL 2' > /dev/rshim0/misc; cat /dev/rshim0/misc | Dump RShim log
bfsbdump | Check lifecycle of the BFB
bfsbverify | Check the signature of the BFB file
mlx-mkbfb -d | Dump the BFB content. If the command returns errors or displays missing files, redownload the BFB file or request a new BFB file from NVIDIA.
echo 'SW_RESET 1' > /dev/rshim0/misc | Reset the BlueField
Errors During BlueField Software Install Using BFB
cat: write error: Connection Timed Out
When the BFB installation is interrupted or incomplete, this indicates an unexpected boot event that caused the BlueField to halt.
# cat bf-bundle-2.7.0-40_24.04_ubuntu-22.04_prod.bfb > /dev/rshim1/boot
cat: write error: Connection timed out
To identify what went wrong during the BFB boot, dump the RShim log and look for error messages under the Log Messages section.
# echo 'DISPLAY_LEVEL 2' > /dev/rshim0/misc
# cat /dev/rshim0/misc
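When the dump is long, it can help to isolate the Log Messages section and list only the entries that signal a problem. The following is a minimal sketch, assuming the dump has been saved to a file; the function names and the /tmp/rshim.log path are illustrative and not part of any NVIDIA tool.

```shell
# Print only the lines that follow the "Log Messages" banner of a saved RShim dump.
# $1: path to the saved dump (hypothetical file, e.g. /tmp/rshim.log)
rshim_log_messages() {
    awk '
        /^[ \t]*Log Messages[ \t]*$/ { seen_title = 1; next }
        seen_title && /^[ \t]*-+[ \t]*$/ { in_log = 1; seen_title = 0; next }
        in_log { print }
    ' "$1"
}

# Keep only the entries whose prefix is ERR, WARN, or PANIC.
rshim_log_problems() {
    rshim_log_messages "$1" | grep -E '^[[:space:]]*(ERR|WARN|PANIC)' || true
}

# Possible usage on the host (as root), after the two commands above:
#   cat /dev/rshim0/misc > /tmp/rshim.log
#   rshim_log_problems /tmp/rshim.log
```

An empty result only means that no ERR, WARN, or PANIC entry was logged; the boot can still have stalled for another reason.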
"ERR[BL1]: PSC error -60" in RShim log
The message ERR[BL1]: PSC error -60 indicates that the BlueField PSC ROM failed to boot the PSC firmware, and that the boot of both the BlueField Arm and the BlueField PSC is halted.
# cat /dev/rshim1/misc
DISPLAY_LEVEL 2 (0:basic, 1:advanced, 2:log)
BOOT_MODE 0 (0:rshim, 1:emmc, 2:emmc-boot-swap)
BOOT_TIMEOUT 150 (seconds)
DROP_MODE 0 (0:normal, 1:drop)
SW_RESET 0 (1: reset)
DEV_NAME pcie-0000:65:00.1
DEV_INFO BlueField-3(Rev 1)
OPN_STR N/A
---------------------------------------
Log Messages
---------------------------------------
ERR[BL1]: PSC error -60
Connect to the BlueField Arm console (refer to SoC Management Interface - Logging and Counters).
Return to the original terminal and re-execute the cat command (or bfb-install). Monitor the console output in parallel.
"PSC BR_EXIT timeout" Printed Out to the Console
The error message PSC BR_EXIT timeout, printed to the console, is likely the result of the PSC ROM failing to load and authenticate PSC BL1.
Nvidia BlueField-3 rev1 BL1 V1.0
PSC BR_EXIT timeout
Reset the chip and verify its lifecycle. Run echo 'SW_RESET 1' > /dev/rshim0/misc and dump the RShim log: echo 'DISPLAY_LEVEL 2' > /dev/rshim0/misc; cat /dev/rshim0/misc.
Identify the log INFO[BL31]: lifecycle GA Secured. Note that the log can display a lifecycle other than GA Secured: either GA Secured or Secured (development) may be printed.
If the log is not present, wait until BlueField boots up and is ready, then connect to the BlueField Arm console (refer to SoC Management Interface to learn how to connect to the BlueField Arm console).
From the BlueField Arm console, run:
# bfsbdump
BlueField3
----------------------
NV Production : 1
Arm Life Cycle : Secure
Secure Boot : Enabled
Secure Boot Key : Production
...
If the Arm Life Cycle is Secure, Secure Boot is enabled, and the Secure Boot Key is Production, then the chip lifecycle is equivalent to GA Secured: install a BFB file signed with a production key.
If the Arm Life Cycle is Secure, Secure Boot is enabled, and the Secure Boot Key is Development, then the chip lifecycle is equivalent to Secured (development): install a BFB file signed with a development key.
Check the signature of the BFB file using the command bfsbverify and make sure the Root-of-Trust Public Key matches your BlueField Secure Boot Key.
# bfsbverify --bfb default.bfb --version 2
Verify BFB for BlueField-3 platform
-----------------------------------
Verify Root-of-Trust Public Key:
NVIDIA official ROT key (production)
Verify Chain-of-Trust certificates:
BL2 Content Certificate...Verified OK
DDR Content Certificate...Verified OK
Trusted Key Certificate...Verified OK
BL31 Key Certificate...Verified OK
Bl31 Content Certificate...Verified OK
BL32 Key Certificate...Not Found
BL33 Key Certificate...Verified OK
Bl33 Content Certificate...Verified OK
Done.
If they do not match, request or download the correct BFB file to install on your BlueField.
If the BFB Root-of-Trust Public Key matches the BlueField Secure Boot Key, contact NVIDIA Enterprise Support.
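The lifecycle rules above can be sketched as a small helper that reads saved bfsbdump output and reports which kind of BFB signing key the chip expects. This is only an illustration: the function name is hypothetical, and the field labels (Arm Life Cycle, Secure Boot, Secure Boot Key) are assumed to appear exactly as in the sample output shown in this section.

```shell
# Print "production", "development", or "unknown" for a saved bfsbdump output.
# $1: path to a file holding the bfsbdump output (hypothetical file name)
required_bfb_key() {
    awk -F'[ \t]*:[ \t]*' '
        { sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "") }
        $1 == "Arm Life Cycle"  { life = $2 }
        $1 == "Secure Boot"     { boot = $2 }
        $1 == "Secure Boot Key" { key = $2 }
        END {
            if (life == "Secure" && boot == "Enabled" && key == "Production")
                print "production"
            else if (life == "Secure" && boot == "Enabled" && key == "Development")
                print "development"
            else
                print "unknown"
        }
    ' "$1"
}

# Possible usage from the BlueField Arm console:
#   bfsbdump > /tmp/bfsbdump.txt
#   required_bfb_key /tmp/bfsbdump.txt
```

The result can then be compared by eye with the Root-of-Trust Public Key line reported by bfsbverify for the BFB file.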
Other PSC Boot Errors Printed Out to Console
These errors are likely the result of a corrupted BFB file. Check the integrity of the BFB file by calculating its md5sum and comparing it with that of the BFB file received from NVIDIA.
It is also possible to dump the BFB content using the command mlx-mkbfb -d. If the command returns errors or displays missing files, redownload the BFB file or request a new BFB file from NVIDIA.
Nvidia BlueField-3 rev1 BL1 V1.0
PSC VERIFY_BCT timeout
Nvidia BlueField-3 rev1 BL1 V1.0
Failed to load PSC-BL1
Nvidia BlueField-3 rev1 BL1 V1.0
PSC-BL1 BOOT_MODE_COLD timeout
Nvidia BlueField-3 rev1 BL1 V1.0
Failed to load PSC-FW
Nvidia BlueField-3 rev1 BL1 V1.0
PSC-BL1 MB1_CB_EXIT timeout
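The md5sum integrity check can be scripted so that a mismatch is hard to overlook. The following is a minimal sketch; the function name is hypothetical, and the reference checksum is assumed to have been obtained separately for the BFB file received from NVIDIA.

```shell
# Succeed (exit status 0) only if the MD5 checksum of a file equals the expected value.
# $1: path to the BFB file
# $2: expected MD5 checksum in hexadecimal (assumed to be known for the original file)
bfb_md5_matches() {
    local actual expected
    actual=$(md5sum "$1" | awk '{ print $1 }')
    # Normalize case so that upper-case reference checksums also compare equal.
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ "$actual" = "$expected" ]
}

# Possible usage:
#   bfb_md5_matches <BFB file> <expected md5> && echo "checksum OK" || echo "checksum mismatch"
```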
Bad Magic Number Error Printed Out to Console
This error is likely the result of a corrupted BFB file.
Nvidia BlueField-3 rev1 BL1 V1.0
ERROR: BlueField boot: bad magic number 0x7475612f
Try one of the following solutions:
Check the integrity of the BFB file by calculating its md5sum and comparing it with that of the BFB file received from NVIDIA.
Dump the BFB content using the command mlx-mkbfb -d. If the command returns errors or displays missing files, then:
Redownload the BFB file; or
Request a new BFB file from NVIDIA
"PANIC(BL2): PC" Error in RShim Console
This error is likely caused by a failure in DDR training implemented by the Arm first stage bootloader.
# cat /dev/rshim0/misc
DISPLAY_LEVEL 2 (0:basic, 1:advanced, 2:log)
BOOT_MODE 1 (0:rshim, 1:emmc, 2:emmc-boot-swap)
BOOT_TIMEOUT 150 (seconds)
DROP_MODE 0 (0:normal, 1:drop)
SW_RESET 0 (1: reset)
DEV_NAME pcie-lf-0000:b3:00.0
DEV_INFO BlueField-3(Rev 1)
OPN_STR N/A
UP_TIME 350(s)
SECURE_NIC_MODE 0 (0:no, 1:yes)
---------------------------------------
Log Messages
---------------------------------------
INFO[PSC]: PSC BL1 START
INFO[BL2]: start
INFO[BL2]: boot mode (rshim)
INFO[BL2]: Configuring clocks for Livefish mode
INFO[BL2]: VDDQ: 1118 mV
PANIC(BL2): PC = 0x40c7cc
elr_el1 0x0
esr_el1 0x0
far_el1 0x0
The value PC = 0x40c7cc is only an example; the log may show any PC value.
To resolve the issue:
Verify whether the BlueField NIC is in LiveFish mode by checking the RShim log:
If the message INFO[BL2]: Configuring clocks for Livefish mode appears, then LiveFish mode is enabled.
Info: This message should follow INFO[BL2]: boot mode (rshim).
If the message is not present in the log, then the BlueField is in functional mode.
If the device is in LiveFish mode, then install the BlueField firmware prior to BFB installation.
If the device is not in LiveFish mode, then check that you have installed the correct BlueField firmware matching your configuration (please refer to Software Installation and Upgrade to learn how to install BlueField NIC firmware).
If the device is not in LiveFish mode and the BlueField firmware is matching the BlueField SKU, then contact NVIDIA Enterprise Support.
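The LiveFish check above can be sketched as a one-line search of a saved RShim log. The function name and the log path are illustrative; only the log message itself comes from the RShim output shown in this section.

```shell
# Print "livefish" if the saved RShim log contains the LiveFish clock message,
# otherwise print "functional".
# $1: path to a saved RShim dump (hypothetical file, e.g. /tmp/rshim.log)
rshim_device_mode() {
    if grep -q 'INFO\[BL2\]: Configuring clocks for Livefish mode' "$1"; then
        echo "livefish"
    else
        echo "functional"
    fi
}

# Possible usage on the host:
#   cat /dev/rshim0/misc > /tmp/rshim.log
#   rshim_device_mode /tmp/rshim.log
```

A result of "livefish" points to installing the BlueField firmware first; "functional" points to checking that the installed firmware matches the configuration.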
"INFO[UEFI]: Var reclaim" in RShim Console
If the variable reclaim operation is performed repeatedly, this could indicate that the UEFI Persistent Variable Store (UPVS) is running out of space.
# cat /dev/rshim0/misc
DISPLAY_LEVEL 2 (0:basic, 1:advanced, 2:log)
BOOT_MODE 1 (0:rshim, 1:emmc, 2:emmc-boot-swap)
BOOT_TIMEOUT 150 (seconds)
DROP_MODE 0 (0:normal, 1:drop)
SW_RESET 0 (1: reset)
DEV_NAME pcie-0000:65:00.1
DEV_INFO BlueField-3(Rev 1)
OPN_STR N/A
---------------------------------------
Log Messages
---------------------------------------
INFO[PSC]: PSC BL1 START
INFO[BL2]: start
INFO[BL2]: boot mode (rshim)
INFO[BL2]: VDDQ: 1118 mV
INFO[BL2]: DDR POST passed
INFO[BL2]: UEFI loaded
INFO[BL31]: start
INFO[BL31]: lifecycle Secured (development)
INFO[BL31]: VDD: 751 mV
INFO[BL31]: runtime
INFO[BL31]: MB ping success
INFO[UEFI]: eMMC init
INFO[UEFI]: eMMC probed
INFO[UEFI]: UPVS valid
WARN[UEFI]: UPVS full
INFO[UEFI]: Var reclaim
INFO[UEFI]: Var reclaim done
INFO[UEFI]: Var reclaim
INFO[UEFI]: Var reclaim done
INFO[UEFI]: Var reclaim
INFO[UEFI]: Var reclaim done
INFO[UEFI]: Var reclaim
INFO[UEFI]: Var reclaim done
Expect the DPU boot to be extremely slow in this scenario.
Reset the BlueField:
echo 'SW_RESET 1' > /dev/rshim0/misc
Log into the BlueField Arm console.
Wait until you reach the Linux prompt or the UEFI menu.
If you stop at the UEFI menu, you can clean up the EFI variable store from Device Manager > System Configuration.
If the system gets to the Linux prompt, clean up the EFI variables under /sys/firmware/efi/efivars. This can be done by running chattr -i /sys/firmware/efi/efivars/* before running rm -f against any file in /sys/firmware/efi/efivars.
Note: It is harmless to delete dump-* variables or any other user variables. However, if BootXXXX variables need to be deleted, this must be done using the efibootmgr command line.
Warning: Other variable deletion can be performed at your own risk.
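To judge whether the reclaim operation is really repeating, the relevant entries in a saved RShim log can be counted. The following is a minimal sketch with a hypothetical function name; how many repetitions count as excessive is left to the reader, since no threshold is stated here.

```shell
# Report how many times variable reclaim started and whether UPVS was reported full.
# $1: path to a saved RShim dump (hypothetical file, e.g. /tmp/rshim.log)
upvs_report() {
    local log="$1" reclaims full="no"
    # Count "Var reclaim" start entries only; "Var reclaim done" lines are excluded.
    reclaims=$(grep -cE '^[[:space:]]*INFO\[UEFI\]: Var reclaim[[:space:]]*$' "$log" || true)
    if grep -q 'WARN\[UEFI\]: UPVS full' "$log"; then
        full="yes"
    fi
    echo "reclaims=${reclaims} upvs_full=${full}"
}

# Possible usage on the host:
#   cat /dev/rshim0/misc > /tmp/rshim.log
#   upvs_report /tmp/rshim.log
```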
Boot Stops at UEFI Menu
The RShim log does not contain any specific error, but the UEFI menu screen is displayed on the BlueField Arm console.
---------------------------------------
Log Messages
---------------------------------------
INFO[PSC]: PSC BL1 START
INFO[BL2]: start
INFO[BL2]: boot mode (rshim)
INFO[BL2]: VDDQ: 1118 mV
INFO[BL2]: DDR POST passed
INFO[BL2]: UEFI loaded
INFO[BL31]: start
INFO[BL31]: lifecycle Secured (production)
INFO[BL31]: VDD: 751 mV
INFO[BL31]: runtime
INFO[BL31]: MB ping success
INFO[UEFI]: eMMC init
INFO[UEFI]: eMMC probed
INFO[UEFI]: UPVS valid
INFO[UEFI]: PCIe enum start
INFO[UEFI]: PCIe enum end
INFO[UEFI]: UEFI Secure Boot (enabled)
INFO[UEFI]: Redfish enabled

This indicates that the kernel image inside the BFB file failed to boot.
To troubleshoot this issue, check the status of UEFI secure boot:
If UEFI secure boot is enabled (i.e., the message INFO[UEFI]: UEFI Secure Boot (enabled) is present in the RShim log), then check the signature of the kernel image inside the BFB file:
$ mlx-mkbfb -x bf-bundle-2.7.0-40_24.04_ubuntu-22.04_prod.bfb
$ sbverify -l dump-image-v0
signature 1
image signature issuers:
 - /C=GB/ST=Isle of Man/L=Douglas/O=Canonical Ltd./CN=Canonical Ltd. Master Certificate Authority
image signature certificates:
 - subject: /C=GB/ST=Isle of Man/O=Canonical Ltd./OU=Secure Boot/CN=Canonical Ltd. Secure Boot Signing (Ubuntu Advantage 2021 v1)
   issuer: /C=GB/ST=Isle of Man/L=Douglas/O=Canonical Ltd./CN=Canonical Ltd. Master Certificate Authority
If the signature is present:
Reset the BlueField:
echo 'SW_RESET 1' > /dev/rshim0/misc
Check the list of certificates enrolled in the BlueField Arm UEFI db by running mokutil --db from the BlueField Arm console:
If the certificate is not displayed, then enroll the certificate before installing the BFB file. Refer to UEFI Secure Boot for details on how to enroll a db certificate using Redfish and/or the UEFI menu.
If the certificate is displayed, then contact NVIDIA Enterprise Support.
If the signature is not present, contact NVIDIA Enterprise Support.
Info: It is possible to disable UEFI secure boot and install the BFB file if you do not require UEFI secure boot.
If UEFI secure boot is disabled (i.e., the message INFO[UEFI]: UEFI Secure Boot (disabled) is present in the RShim log), then dump the content of the BFB file and check whether Boot image (version 0) is present:
$ mlx-mkbfb -d bf-bundle-2.7.0-40_24.04_ubuntu-22.04_prod.bfb
...
25377280 Boot image (version 0)
520665088 In-memory filesystem (version 0)
If Boot image (version 0) is not present, then you may be using a reduced BFB such as preboot-install.bfb. Download and install a fw-bundle BFB file.
If Boot image (version 0) is present, contact NVIDIA Enterprise Support.
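The Boot image check can be sketched as a search over a saved mlx-mkbfb -d listing. The function name and the listing path are hypothetical; the entry text is assumed to read exactly Boot image (version 0), as in the listing quoted in this section.

```shell
# Succeed (exit status 0) only if the saved "mlx-mkbfb -d" listing names a boot image.
# $1: path to a file holding the listing (hypothetical file, e.g. /tmp/bfb-listing.txt)
bfb_has_boot_image() {
    grep -q 'Boot image (version 0)' "$1"
}

# Possible usage:
#   mlx-mkbfb -d <BFB file> > /tmp/bfb-listing.txt
#   bfb_has_boot_image /tmp/bfb-listing.txt || echo "no kernel image: possibly a reduced BFB"
```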
UEFI Does Not Boot the BFB Kernel Image
The RShim log does not contain a specific error but the login prompt appears on the BlueField Arm console:
---------------------------------------
Log Messages
---------------------------------------
INFO[PSC]: PSC BL1 START
INFO[BL2]: start
INFO[BL2]: boot mode (rshim)
INFO[BL2]: VDDQ: 1118 mV
INFO[BL2]: DDR POST passed
INFO[BL2]: UEFI loaded
INFO[BL31]: start
INFO[BL31]: lifecycle Secured (production)
INFO[BL31]: VDD: 751 mV
INFO[BL31]: runtime
INFO[BL31]: MB ping success
INFO[UEFI]: eMMC init
INFO[UEFI]: eMMC probed
INFO[UEFI]: UPVS valid
INFO[UEFI]: PCIe enum start
INFO[UEFI]: PCIe enum end
INFO[UEFI]: UEFI Secure Boot (enabled)
INFO[UEFI]: Redfish enabled
INFO[UEFI]: DPU-BMC RF credentials found
INFO[UEFI]: exit Boot Service
INFO[MISC]: Linux up
INFO[MISC]: DPU is ready
This indicates that the kernel image inside the BFB file failed to boot, so the UEFI defaulted to the first valid boot option.
To troubleshoot this issue:
Check the content of the BFB: verify that Boot image (version 0) is present.
Check if UEFI secure boot is enabled and verify the certificates enrolled in UEFI db and the certificate used for the kernel image signature as explained earlier.
Network Boot (PXE, HTTP boot)
PXE/HTTP Boot Logging
When booting PXE or HTTP manually from the UEFI menu, helpful logging can get cut off due to the UEFI menu clearing the screen. To see the logs and ensure none are missed, dump the console logs into a file and read the log from there or get the BlueField console log dump from the BlueField BMC. For more information about retrieving BlueField console logs from the BMC, refer to the BMC and BlueField Logs page in the NVIDIA BlueField BMC Software User Manual. Alternatively, users may change the boot order so that PXE/HTTP boot is attempted before flash boot automatically and error logs are visible in real time on the console because the UEFI menu is skipped.
It is often helpful to troubleshoot and verify PXE boot before moving to HTTP boot, because the setup is a little easier and more UEFI logging is generally available for PXE boot issues than for HTTP boot issues.
The following subsections are a few examples of logs that may occur for several common scenarios.
DHCP Server is Not Running
[16:23:46]>>Start PXE over IPv4.
[16:24:45] PXE-E18: Server response timeout.
TFTP Server is Not Running
[16:35:36]>>Start PXE over IPv4.
[16:35:39] Station IP address is 192.168.100.2
[16:35:39]
[16:35:39] Server IP address is 192.168.100.1
[16:35:39] NBP filename is /shimaa64.efi
[16:35:39] NBP filesize is 0 Bytes
[16:35:39] PXE-E99: Unexpected network error.
PXE Boot File Does Not Exist
[16:28:32]>>Start PXE over IPv4.
[16:28:36] Station IP address is 192.168.100.2
[16:28:36]
[16:28:36] Server IP address is 192.168.100.1
[16:28:36] NBP filename is /PXE-TEST.efi
[16:28:36] NBP filesize is 0 Bytes
[16:28:36] PXE-E23: Client received TFTP error from server.
Shim Does Not Boot
[18:07:22]>>Start PXE over IPv4.
[18:07:26] Station IP address is 192.168.100.2
[18:07:26]
[18:07:26] Server IP address is 192.168.100.1
[18:07:26] NBP filename is /shimaa64.efi
[18:07:26] NBP filesize is 980057 Bytes
[18:07:26] Downloading NBP file...
[18:07:27]
[18:07:27] NBP file downloaded successfully.
This can often happen due to authentication issues with unsupported signatures or SBAT restrictions. It is important that UEFI supports the shim being booted and that the shim supports the version of grub being booted.
Grub Does Not Boot
[17:26:05]>>Start PXE over IPv4.
[17:26:09] Station IP address is 192.168.100.2
[17:26:09]
[17:26:09] Server IP address is 192.168.100.1
[17:26:09] NBP filename is /shimaa64.efi
[17:26:09] NBP filesize is 980056 Bytes
[17:26:09] Downloading NBP file...
[17:26:09]
[17:26:09] NBP file downloaded successfully.
[17:26:09]Fetching Netboot Image
[17:26:15]
Minimal BASH-like line editing is supported. For the first word, TAB
lists possible command completions. Anywhere else TAB lists possible
device or file completions.
Grub >
The grub being booted must support network boot. It is common for boot to stop at the grub command line when there are grub issues.
Successful PXE Boot
[16:37:10]>>Start PXE over IPv4.
[16:37:13] Station IP address is 192.168.100.2
[16:37:13]
[16:37:13] Server IP address is 192.168.100.1
[16:37:13] NBP filename is /shimaa64.efi
[16:37:13] NBP filesize is 980056 Bytes
[16:37:13] Downloading NBP file...
[16:37:14]
[16:37:14] NBP file downloaded successfully.
[16:37:14]Fetching Netboot Image
[16:37:22]
GNU GRUB version 2.06
...
At this point the GRUB menu should show some boot options which are available based on the GRUB config used for PXE boot.
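The log patterns from the scenarios above can be sketched as a small classifier over a saved console log. This is an illustration only: the function name and the log path are hypothetical, and the mapping simply mirrors the examples in this section, so other causes of the same PXE error codes are not ruled out.

```shell
# Print a likely diagnosis for a saved PXE boot console log.
# $1: path to the saved console log (hypothetical file, e.g. /tmp/console.log)
pxe_diagnose() {
    local log="$1"
    if grep -q 'PXE-E18' "$log"; then
        echo "DHCP server not responding (PXE-E18)"
    elif grep -q 'PXE-E99' "$log"; then
        echo "TFTP server possibly not running (PXE-E99)"
    elif grep -q 'PXE-E23' "$log"; then
        echo "boot file missing on TFTP server (PXE-E23)"
    elif grep -q 'GNU GRUB version' "$log"; then
        echo "GRUB menu reached"
    elif grep -qiE '^[[:space:]]*grub[[:space:]]*>' "$log"; then
        echo "stopped at GRUB command line"
    elif grep -q 'NBP file downloaded successfully' "$log" &&
         ! grep -q 'Fetching Netboot Image' "$log"; then
        echo "shim downloaded but did not boot"
    else
        echo "no known pattern found"
    fi
}

# Possible usage, once the console output has been captured to a file:
#   pxe_diagnose /tmp/console.log
```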
DHCP Packet Inspection
It can often be helpful to look at the DHCP packets being sent over the network when troubleshooting PXE and HTTP boot issues. The sections below provide some examples for packet inspection using the Linux command line, but Wireshark is also a great alternative if supported.
IPv4
For IPv4-based PXE and HTTP boot, the tool dhcpdump can be installed on the DHCP host server and used to quickly parse different DHCP packets and options. The following is an example log taken from a BlueField PXE booting using the tmfifo_net0 interface:
root@bu-lab102:~# dhcpdump -i tmfifo_net0
TIME: 2024-06-10 10:26:29.980
IP: 0.0.0.0 (0:1a:ca:ff:ff:1) > 255.255.255.255 (ff:ff:ff:ff:ff:ff)
OP: 1 (BOOTPREQUEST)
HTYPE: 1 (Ethernet)
HLEN: 6
HOPS: 0
XID: 22093441
SECS: 0
FLAGS: 7f80
CIADDR: 0.0.0.0
YIADDR: 0.0.0.0
SIADDR: 0.0.0.0
GIADDR: 0.0.0.0
CHADDR: 00:1a:ca:ff:ff:01:00:00:00:00:00:00:00:00:00:00
SNAME: .
FNAME: .
OPTION: 53 ( 1) DHCP message type 1 (DHCPDISCOVER)
OPTION: 57 ( 2) Maximum DHCP message size 1472
OPTION: 55 ( 35) Parameter Request List 1 (Subnet mask)
2 (Time offset)
3 (Routers)
4 (Time server)
5 (Name server)
6 (DNS server)
12 (Host name)
13 (Boot file size)
15 (Domainname)
17 (Root path)
18 (Extensions path)
22 (Maximum datagram reassembly size)
23 (Default IP TTL)
28 (Broadcast address)
40 (NIS domain)
41 (NIS servers)
42 (NTP servers)
43 (Vendor specific info)
50 (Request IP address)
51 (IP address leasetime)
54 (Server identifier)
58 (T1)
59 (T2)
60 (Vendor class identifier)
66 (TFTP server name)
67 (Bootfile name)
97 (UUID/GUID)
128 (???)
129 (???)
130 (???)
131 (???)
132 (???)
133 (???)
134 (???)
135 (???)
OPTION: 97 ( 17) UUID/GUID 009c2debc0368611 ..-..6..
ee8000a088c20ee8 ........
18 .
OPTION: 94 ( 3) Client NDI 010300 ...
OPTION: 93 ( 2) Client System 000b ..
OPTION: 60 ( 13) Vendor class identifier NVIDIA/BF/PXE
OPTION: 43 (131) Vendor specific info 8005424633000081 ..BF3...
30426c7565466965 0BlueFie
6c643a342e382e30 ld:4.8.0
2d322d6765373965 -2-ge79e
3037662d64697274 07f-dirt
7900000000000000 y.......
0000000000000000 ........
008248444f43415f ..HDOCA_
322e352e305f4253 2.5.0_BS
505f342e352e305f P_4.5.0_
5562756e74755f32 Ubuntu_2
322e30342d312e32 2.04-1.2
3032333131303800 0231108.
0000000000000000 ........
0000000000000000 ........
0000000000000000 ........
000000 ...
---------------------------------------------------------------------------
TIME: 2024-06-10 10:26:29.981
IP: 192.168.100.1 (0:1a:ca:ff:ff:2) > 255.255.255.255 (ff:ff:ff:ff:ff:ff)
OP: 2 (BOOTPREPLY)
HTYPE: 1 (Ethernet)
HLEN: 6
HOPS: 0
XID: 22093441
SECS: 0
FLAGS: 7f80
CIADDR: 0.0.0.0
YIADDR: 192.168.100.2
SIADDR: 192.168.100.1
GIADDR: 0.0.0.0
CHADDR: 00:1a:ca:ff:ff:01:00:00:00:00:00:00:00:00:00:00
SNAME: .
FNAME: /PXE-TEST.efi.
OPTION: 53 ( 1) DHCP message type 2 (DHCPOFFER)
OPTION: 54 ( 4) Server identifier 192.168.100.1
OPTION: 51 ( 4) IP address leasetime 43200 (12h)
OPTION: 1 ( 4) Subnet mask 255.255.255.0
The example shows the DHCP discover packet sent by the client (BlueField) and the offer packet sent by the server as part of the DHCP DORA process (including useful information like the vendor class identifier and vendor-specific information). In this case, the DHCP server has been configured to serve a test file, PXE-TEST.efi, over TFTP, and it can be useful to verify DHCP, TFTP, and HTTP server configuration by looking at the packet dump.
An alternative to dhcpdump is to use tcpdump to look at all raw data sent over the network. For DHCP, only ports 67 and 68 need to be monitored:
# Monitor raw DHCP data
tcpdump -i tmfifo_net0 -n -vvv -xx port 67 or 68
# Convert packets to ASCII
tcpdump -i tmfifo_net0 -n -vvv -A port 67 or 68
IPv6
The dhcpdump tool does not currently support IPv6, but tcpdump can be used for monitoring raw and ASCII data by filtering on ports 546 and 547:
# Monitor raw DHCP data
tcpdump -i tmfifo_net0 -n -vvv -xx port 546 or 547
# Convert packets to ASCII
tcpdump -i tmfifo_net0 -n -vvv -A port 546 or 547