NVIDIA BlueField Platform Software Troubleshooting Guide

Software Installation and Upgrade

Information on how a user can troubleshoot issues installing software on BlueField.

| Command | Description |
|---|---|
| cat <filename>.bfb > /dev/rshim1/boot | Load software via RShim |
| echo 'DISPLAY_LEVEL 2' > /dev/rshim0/misc; cat /dev/rshim0/misc | Dump RShim log |
| bfsbdump | Check the lifecycle of the BlueField |
| bfsbverify | Check the signature of the BFB file |
| mlx-mkbfb -d | Dump the BFB content. If the command returns errors or displays missing files, redownload the BFB file or request a new BFB file from NVIDIA. |
| echo 'SW_RESET 1' > /dev/rshim0/misc | Reset the BlueField |

Errors During BlueField Software Install Using BFB

cat: write error: Connection Timed Out

If this error interrupts the BFB installation or leaves it incomplete, an unexpected boot event has caused the BlueField to halt.

# cat bf-bundle-2.7.0-40_24.04_ubuntu-22.04_prod.bfb > /dev/rshim1/boot
cat: write error: Connection timed out

To identify what went wrong during the BFB boot, dump the RShim log and look for error messages under the Log Messages section.

# echo 'DISPLAY_LEVEL 2' > /dev/rshim0/misc
# cat /dev/rshim0/misc
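Picking the error entries out of a long RShim log can be scripted. The following is a minimal sketch, not an NVIDIA tool: the function name rshim_log_errors is hypothetical, and it assumes the dump has one message per line after a "Log Messages" header, with error-level entries prefixed ERR[...], WARN[...], or PANIC(...) as in the samples on this page.

```shell
# Hypothetical helper: print only error-level entries from an RShim log
# dump read on stdin. Assumes one message per line and a "Log Messages"
# header line that precedes the messages.
rshim_log_errors() {
    awk '
        /Log Messages/ { in_log = 1; next }    # header marks the start of the log
        in_log && /^[ \t]*(ERR|PANIC|WARN)(\[|\()/ { print }
    '
}
```

With DISPLAY_LEVEL set to 2 as shown above, it could be used as cat /dev/rshim0/misc | rshim_log_errors.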

"ERR[BL1]: PSC error -60" in RShim log

The message ERR[BL1]: PSC error -60 indicates that the BlueField PSC ROM failed to boot the PSC firmware, and that booting of both the BlueField Arm and the BlueField PSC is halted.

# cat /dev/rshim1/misc
DISPLAY_LEVEL 2 (0:basic, 1:advanced, 2:log)
BOOT_MODE 0 (0:rshim, 1:emmc, 2:emmc-boot-swap)
BOOT_TIMEOUT 150 (seconds)
DROP_MODE 0 (0:normal, 1:drop)
SW_RESET 0 (1: reset)
DEV_NAME pcie-0000:65:00.1
DEV_INFO BlueField-3(Rev 1)
OPN_STR N/A
---------------------------------------
Log Messages
---------------------------------------
ERR[BL1]: PSC error -60

  1. Connect to the BlueField Arm console (refer to SoC Management Interface - Logging and Counters).

  2. Return to the original terminal and rerun the cat command (or bfb-install) while monitoring the console output in parallel.

"PSC BR_EXIT timeout" Printed Out to the Console

The error message PSC BR_EXIT timeout, printed out to the console, is likely the result of PSC ROM failing to load and authenticate PSC BL1.

Nvidia BlueField-3 rev1 BL1 V1.0
PSC BR_EXIT timeout

  1. Reset the chip and verify its lifecycle. Run echo 'SW_RESET 1' > /dev/rshim0/misc, then dump the RShim log with echo 'DISPLAY_LEVEL 2' > /dev/rshim0/misc; cat /dev/rshim0/misc.

  2. Look for the log entry INFO[BL31]: lifecycle GA Secured. The lifecycle shown may differ: either GA Secured or Secured (development) may be printed.

  3. If the log entry is not present, wait until BlueField boots up and is ready, then connect to the BlueField Arm console (refer to SoC Management Interface to learn how to connect to the BlueField Arm console).

  4. From BlueField Arm console, run:

    # bfsbdump

    BlueField3
    ----------------------
    NV Production : 1
    Arm Life Cycle : Secure
    Secure Boot : Enabled
    Secure Boot Key : Production
    ...

  5. If Arm Life Cycle is Secure, Secure Boot is Enabled, and Secure Boot Key is Production, then the chip lifecycle is equivalent to GA Secured. Install a BFB file signed with a production key.

  6. If Arm Life Cycle is Secure, Secure Boot is Enabled, and Secure Boot Key is Development, then the chip lifecycle is equivalent to Secured (development). Install a BFB file signed with a development key.

  7. Check the signature of the BFB file using the command bfsbverify and make sure the Root-of-Trust Public Key matches your BlueField Secure Boot Key.

    # bfsbverify --bfb default.bfb --version 2

    Verify BFB for BlueField-3 platform
    -----------------------------------

    Verify Root-of-Trust Public Key:  NVIDIA official ROT key (production)

    Verify Chain-of-Trust certificates:
    BL2 Content Certificate...Verified OK
    DDR Content Certificate...Verified OK
    Trusted Key Certificate...Verified OK
    BL31 Key Certificate...Verified OK
    Bl31 Content Certificate...Verified OK
    BL32 Key Certificate...Not Found
    BL33 Key Certificate...Verified OK
    Bl33 Content Certificate...Verified OK

    Done.

    1. If they do not match, download or request the correct BFB file to install on your BlueField.

    2. If the BFB Root-of-Trust Public Key matches the BlueField Secure Boot Key, contact NVIDIA Enterprise Support.
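The decision in steps 5 and 6 can be expressed as a small filter over the bfsbdump output. This is a sketch under stated assumptions: the function name bfb_key_for_lifecycle is hypothetical, and the field names and the "Name : Value" layout are taken from the sample output above, which may differ between software releases.

```shell
# Hypothetical helper: read bfsbdump output on stdin and print which BFB
# signing key the device expects: "production", "development", or "unknown".
# Assumes one "Name : Value" pair per line, as in the sample output.
bfb_key_for_lifecycle() {
    awk -F '[ \t]*:[ \t]*' '
        { gsub(/^[ \t]+|[ \t]+$/, "") }           # trim the line, fields are re-split
        $1 == "Arm Life Cycle"  { life = $2 }
        $1 == "Secure Boot"     { boot = $2 }
        $1 == "Secure Boot Key" { key  = $2 }
        END {
            if (life == "Secure" && boot == "Enabled" && key == "Production")
                print "production"
            else if (life == "Secure" && boot == "Enabled" && key == "Development")
                print "development"
            else
                print "unknown"
        }
    '
}
```

From the BlueField Arm console, bfsbdump | bfb_key_for_lifecycle would then print the key type that the BFB file should be signed with. Any other combination of fields is reported as unknown rather than guessed.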

Other PSC Boot Errors Printed Out to Console

These errors are likely the result of a corrupted BFB file. Check the integrity of the BFB file by calculating its md5sum and comparing it with that of the BFB file received from NVIDIA.

It is also possible to dump the BFB content using the command mlx-mkbfb -d. If the command returns with errors or displays missing files, make sure to redownload the BFB file or request a new BFB file from NVIDIA.

Nvidia BlueField-3 rev1 BL1 V1.0
PSC VERIFY_BCT timeout

Nvidia BlueField-3 rev1 BL1 V1.0
Failed to load PSC-BL1

Nvidia BlueField-3 rev1 BL1 V1.0
PSC-BL1 BOOT_MODE_COLD timeout

Nvidia BlueField-3 rev1 BL1 V1.0
Failed to load PSC-FW

Nvidia BlueField-3 rev1 BL1 V1.0
PSC-BL1 MB1_CB_EXIT timeout
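The md5sum integrity check recommended for these errors can be wrapped in a small function. This is only a sketch: the name bfb_md5_matches is hypothetical, and how you obtain the reference digest for the original BFB file is not covered here and is left as an assumption.

```shell
# Hypothetical helper: succeed when the MD5 digest of FILE ($1) equals the
# reference digest EXPECTED ($2, hex, upper or lower case).
# The body runs in a subshell so its variables do not leak to the caller.
bfb_md5_matches() (
    actual=$(md5sum "$1" 2>/dev/null | awk '{ print $1 }')
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    # An unreadable file yields an empty digest and is treated as a mismatch.
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
)
```

For example, bfb_md5_matches default.bfb <reference-digest> || echo "BFB file differs from the original" flags a file that should be downloaded again.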

Bad Magic Number Error Printed Out to Console

This error is likely the result of a corrupted BFB file.

Nvidia BlueField-3 rev1 BL1 V1.0
ERROR: BlueField boot: bad magic number 0x7475612f

Try one of the following solutions:

  • Check the integrity of the BFB file by calculating its md5sum and comparing it with that of the BFB file received from NVIDIA.

  • Dump the BFB content using the command mlx-mkbfb -d. If the command returns errors or displays missing files, then:

    • Redownload the BFB file; or

    • Request a new BFB file from NVIDIA

"PANIC(BL2): PC" Error in RShim Console

This error is likely caused by a failure in DDR training implemented by the Arm first stage bootloader.

# cat /dev/rshim0/misc
DISPLAY_LEVEL 2 (0:basic, 1:advanced, 2:log)
BOOT_MODE 1 (0:rshim, 1:emmc, 2:emmc-boot-swap)
BOOT_TIMEOUT 150 (seconds)
DROP_MODE 0 (0:normal, 1:drop)
SW_RESET 0 (1: reset)
DEV_NAME pcie-lf-0000:b3:00.0
DEV_INFO BlueField-3(Rev 1)
OPN_STR N/A
UP_TIME 350(s)
SECURE_NIC_MODE 0 (0:no, 1:yes)
---------------------------------------
Log Messages
---------------------------------------
INFO[PSC]: PSC BL1 START
INFO[BL2]: start
INFO[BL2]: boot mode (rshim)
INFO[BL2]: Configuring clocks for Livefish mode
INFO[BL2]: VDDQ: 1118 mV
PANIC(BL2): PC = 0x40c7cc elr_el1 0x0 esr_el1 0x0 far_el1 0x0

Note

PC = 0x40c7cc is only an example; the log may show any value.

To resolve the issue:

  1. Verify whether the BlueField NIC is in LiveFish mode by checking the RShim log:

    1. If the message INFO[BL2]: Configuring clocks for Livefish mode appears, then LiveFish mode is enabled.

      Info

      This message should follow INFO[BL2]: boot mode (rshim).

    2. If the message is not present in the log, then the BlueField is in functional mode.

  2. If the device is in LiveFish mode, then install the BlueField firmware prior to BFB installation.

  3. If the device is not in LiveFish mode, then check that you have installed the correct BlueField firmware matching your configuration (please refer to Software Installation and Upgrade to learn how to install BlueField NIC firmware).

  4. If the device is not in LiveFish mode and the BlueField firmware matches the BlueField SKU, then contact NVIDIA Enterprise Support.
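The LiveFish check in step 1 comes down to searching the RShim log for a single message. A minimal sketch, assuming the message text is exactly as shown in the sample log above (the function name rshim_device_mode is hypothetical):

```shell
# Hypothetical helper: read an RShim log on stdin and print "livefish" when
# the LiveFish clock-configuration message is present, else "functional".
rshim_device_mode() {
    if grep -q 'INFO\[BL2\]: Configuring clocks for Livefish mode'; then
        echo livefish
    else
        echo functional
    fi
}
```

Used as cat /dev/rshim0/misc | rshim_device_mode, the result indicates which of steps 2 to 4 applies.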

"INFO[UEFI]: Var reclaim" in RShim Console

If the variable reclaim operation is performed repeatedly, this could indicate that the UEFI Persistent Variable Store (UPVS) is running out of space.

# cat /dev/rshim0/misc
DISPLAY_LEVEL 2 (0:basic, 1:advanced, 2:log)
BOOT_MODE 1 (0:rshim, 1:emmc, 2:emmc-boot-swap)
BOOT_TIMEOUT 150 (seconds)
DROP_MODE 0 (0:normal, 1:drop)
SW_RESET 0 (1: reset)
DEV_NAME pcie-0000:65:00.1
DEV_INFO BlueField-3(Rev 1)
OPN_STR N/A
---------------------------------------
Log Messages
---------------------------------------
INFO[PSC]: PSC BL1 START
INFO[BL2]: start
INFO[BL2]: boot mode (rshim)
INFO[BL2]: VDDQ: 1118 mV
INFO[BL2]: DDR POST passed
INFO[BL2]: UEFI loaded
INFO[BL31]: start
INFO[BL31]: lifecycle Secured (development)
INFO[BL31]: VDD: 751 mV
INFO[BL31]: runtime
INFO[BL31]: MB ping success
INFO[UEFI]: eMMC init
INFO[UEFI]: eMMC probed
INFO[UEFI]: UPVS valid
WARN[UEFI]: UPVS full
INFO[UEFI]: Var reclaim
INFO[UEFI]: Var reclaim done
INFO[UEFI]: Var reclaim
INFO[UEFI]: Var reclaim done
INFO[UEFI]: Var reclaim
INFO[UEFI]: Var reclaim done
INFO[UEFI]: Var reclaim
INFO[UEFI]: Var reclaim done

Info

Expect the DPU to boot extremely slowly in this scenario.

  1. Reset the BlueField:

    echo 'SW_RESET 1' > /dev/rshim0/misc

  2. Log into the BlueField Arm console.

  3. Wait until you reach the Linux prompt or the UEFI menu.

    • If you stop at the UEFI menu, clean up the EFI variable store from Device Manager > System Configuration.

    • If the system gets to the Linux prompt, clean up the EFI variables under /sys/firmware/efi/efivars. This can be done by running chattr -i /sys/firmware/efi/efivars/* before running rm -f against any file in /sys/firmware/efi/efivars.

      Note

      It is harmless to delete dump-* variables or any other user variables. However, if BootXXXX variables need to be deleted, use the efibootmgr command-line tool.

      Warning

      Delete any other variables at your own risk.
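To judge whether the variable reclaim operation is repeating, the reclaim entries in the RShim log can be counted. This is a sketch, with a hypothetical function name upvs_reclaim_count, that assumes one log message per line. This guide does not give a threshold for "repeatedly", so interpreting the count is left to the reader.

```shell
# Hypothetical helper: count "Var reclaim" start entries in an RShim log
# read on stdin. The matching "Var reclaim done" lines are not counted.
upvs_reclaim_count() {
    awk '/INFO\[UEFI\]: Var reclaim[ \t]*$/ { n++ } END { print n + 0 }'
}
```

For the sample log in this section, cat /dev/rshim0/misc | upvs_reclaim_count would print 4.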

Boot Stops at UEFI Menu

The RShim log does not contain any specific error, but the UEFI menu screen is displayed on the BlueField Arm console.

---------------------------------------
Log Messages
---------------------------------------
INFO[PSC]: PSC BL1 START
INFO[BL2]: start
INFO[BL2]: boot mode (rshim)
INFO[BL2]: VDDQ: 1118 mV
INFO[BL2]: DDR POST passed
INFO[BL2]: UEFI loaded
INFO[BL31]: start
INFO[BL31]: lifecycle Secured (production)
INFO[BL31]: VDD: 751 mV
INFO[BL31]: runtime
INFO[BL31]: MB ping success
INFO[UEFI]: eMMC init
INFO[UEFI]: eMMC probed
INFO[UEFI]: UPVS valid
INFO[UEFI]: PCIe enum start
INFO[UEFI]: PCIe enum end
INFO[UEFI]: UEFI Secure Boot (enabled)
INFO[UEFI]: Redfish enabled

This indicates that the kernel image inside the BFB file failed to boot.

To troubleshoot this issue, check the status of UEFI secure boot:

  • If UEFI secure boot is enabled (i.e., the message INFO[UEFI]: UEFI Secure Boot (enabled) is present in the RShim log), then check the signature of the kernel image inside the BFB file:

    $ mlx-mkbfb -x bf-bundle-2.7.0-40_24.04_ubuntu-22.04_prod.bfb
    $ sbverify -l dump-image-v0
    signature 1
    image signature issuers:
     - /C=GB/ST=Isle of Man/L=Douglas/O=Canonical Ltd./CN=Canonical Ltd. Master Certificate Authority
    image signature certificates:
     - subject: /C=GB/ST=Isle of Man/O=Canonical Ltd./OU=Secure Boot/CN=Canonical Ltd. Secure Boot Signing (Ubuntu Advantage 2021 v1)
       issuer: /C=GB/ST=Isle of Man/L=Douglas/O=Canonical Ltd./CN=Canonical Ltd. Master Certificate Authority

    • If the signature is present:

      1. Reset the BlueField:

        echo 'SW_RESET 1' > /dev/rshim0/misc

      2. Check the list of the certificates enrolled in the BlueField Arm UEFI db by running mokutil --db from the BlueField Arm console:

        • If the certificate is not displayed, then enroll the certificate before installing the BFB file. Refer to UEFI Secure Boot for details on how to enroll a db certificate using Redfish and/or the UEFI menu.

        • If the certificate is displayed, then contact NVIDIA Enterprise Support.

    • If the signature is not present, contact NVIDIA Enterprise Support.

      Info

      It is possible to disable UEFI secure boot and install the BFB file if you do not require UEFI secure boot.

  • If UEFI secure boot is disabled (i.e., the message INFO[UEFI]: UEFI Secure Boot (disabled) is present in the RShim log), then dump the content of the BFB file and check whether Boot image (version 0) is present:

    • If Boot image (version 0) is not present, then you may be using a reduced BFB such as preboot-install.bfb. Download and install a fw-bundle BFB file.

    • If Boot image (version 0) is present, contact NVIDIA Enterprise Support.

      $ mlx-mkbfb -d bf-bundle-2.7.0-40_24.04_ubuntu-22.04_prod.bfb
      ...
      25377280 Boot image (version 0)
      520665088 In-memory filesystem (version 0)
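The Boot image check can be scripted by searching the dump for the entry name. A minimal sketch, assuming that mlx-mkbfb -d writes its listing to standard output and labels the entry exactly as in the sample above (the function name bfb_has_boot_image is hypothetical):

```shell
# Hypothetical helper: succeed when the mlx-mkbfb -d listing read on stdin
# contains a "Boot image (version 0)" entry.
bfb_has_boot_image() {
    grep -q 'Boot image (version 0)'
}
```

For example, mlx-mkbfb -d <file>.bfb | bfb_has_boot_image || echo "reduced BFB: no boot image" reports a BFB file that lacks the kernel image.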

UEFI Does Not Boot the BFB Kernel Image

The RShim log does not contain a specific error but the login prompt appears on the BlueField Arm console:

---------------------------------------
Log Messages
---------------------------------------
INFO[PSC]: PSC BL1 START
INFO[BL2]: start
INFO[BL2]: boot mode (rshim)
INFO[BL2]: VDDQ: 1118 mV
INFO[BL2]: DDR POST passed
INFO[BL2]: UEFI loaded
INFO[BL31]: start
INFO[BL31]: lifecycle Secured (production)
INFO[BL31]: VDD: 751 mV
INFO[BL31]: runtime
INFO[BL31]: MB ping success
INFO[UEFI]: eMMC init
INFO[UEFI]: eMMC probed
INFO[UEFI]: UPVS valid
INFO[UEFI]: PCIe enum start
INFO[UEFI]: PCIe enum end
INFO[UEFI]: UEFI Secure Boot (enabled)
INFO[UEFI]: Redfish enabled
INFO[UEFI]: DPU-BMC RF credentials found
INFO[UEFI]: exit Boot Service
INFO[MISC]: Linux up
INFO[MISC]: DPU is ready

This indicates that the kernel image inside the BFB file failed to boot, so the UEFI defaulted to the first valid boot option.

To troubleshoot this issue:

  1. Check the content of the BFB and verify that Boot image (version 0) is present.

  2. Check whether UEFI secure boot is enabled, then compare the certificates enrolled in the UEFI db with the certificate used to sign the kernel image, as explained earlier.
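The secure boot state in step 2 can be read directly from the RShim log. The sketch below assumes the two message texts quoted on this page and one message per line. The function name uefi_secure_boot_state is hypothetical.

```shell
# Hypothetical helper: read an RShim log on stdin and print the UEFI secure
# boot state it reports: "enabled", "disabled", or "unknown" when neither
# message appears. If both appear, the last one seen wins.
uefi_secure_boot_state() {
    awk '
        /INFO\[UEFI\]: UEFI Secure Boot \(enabled\)/  { state = "enabled" }
        /INFO\[UEFI\]: UEFI Secure Boot \(disabled\)/ { state = "disabled" }
        END { print (state == "" ? "unknown" : state) }
    '
}
```

Running cat /dev/rshim0/misc | uefi_secure_boot_state shows whether the certificate comparison is needed at all.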

Network Boot (PXE, HTTP boot)

PXE/HTTP Boot Logging

When booting PXE or HTTP manually from the UEFI menu, helpful logging can get cut off due to the UEFI menu clearing the screen. To see the logs and ensure none are missed, dump the console logs into a file and read the log from there or get the BlueField console log dump from the BlueField BMC. For more information about retrieving BlueField console logs from the BMC, refer to the BMC and BlueField Logs page in the NVIDIA BlueField BMC Software User Manual. Alternatively, users may change the boot order so that PXE/HTTP boot is attempted before flash boot automatically and error logs are visible in real time on the console because the UEFI menu is skipped.

Tip

It is often helpful to troubleshoot and verify PXE boot before moving to HTTP boot, because setup is a little easier and UEFI generally logs more for PXE boot issues than for HTTP boot issues.

The following subsections show example logs for several common scenarios.

DHCP Server is Not Running

[16:23:46]>>Start PXE over IPv4.
[16:24:45] PXE-E18: Server response timeout.


TFTP Server is Not Running

[16:35:36]>>Start PXE over IPv4.
[16:35:39] Station IP address is 192.168.100.2
[16:35:39]
[16:35:39] Server IP address is 192.168.100.1
[16:35:39] NBP filename is /shimaa64.efi
[16:35:39] NBP filesize is 0 Bytes
[16:35:39] PXE-E99: Unexpected network error.


PXE Boot File Does Not Exist

[16:28:32]>>Start PXE over IPv4.
[16:28:36] Station IP address is 192.168.100.2
[16:28:36]
[16:28:36] Server IP address is 192.168.100.1
[16:28:36] NBP filename is /PXE-TEST.efi
[16:28:36] NBP filesize is 0 Bytes
[16:28:36] PXE-E23: Client received TFTP error from server.


Shim Does Not Boot

[18:07:22]>>Start PXE over IPv4.
[18:07:26] Station IP address is 192.168.100.2
[18:07:26]
[18:07:26] Server IP address is 192.168.100.1
[18:07:26] NBP filename is /shimaa64.efi
[18:07:26] NBP filesize is 980057 Bytes
[18:07:26] Downloading NBP file...
[18:07:27]
[18:07:27] NBP file downloaded successfully.

This is often caused by authentication issues, such as unsupported signatures or SBAT restrictions. UEFI must support the shim being booted, and the shim must support the version of GRUB being booted.

GRUB Does Not Boot

[17:26:05]>>Start PXE over IPv4.
[17:26:09] Station IP address is 192.168.100.2
[17:26:09]
[17:26:09] Server IP address is 192.168.100.1
[17:26:09] NBP filename is /shimaa64.efi
[17:26:09] NBP filesize is 980056 Bytes
[17:26:09] Downloading NBP file...
[17:26:09]
[17:26:09] NBP file downloaded successfully.
[17:26:09]Fetching Netboot Image
[17:26:15] Minimal BASH-like line editing is supported. For the first word, TAB lists possible command completions. Anywhere else TAB lists possible device or file completions.
Grub >

The GRUB being booted must support network boot. Boot commonly stops at the GRUB command line when there are GRUB issues.

Successful PXE Boot

[16:37:10]>>Start PXE over IPv4.
[16:37:13] Station IP address is 192.168.100.2
[16:37:13]
[16:37:13] Server IP address is 192.168.100.1
[16:37:13] NBP filename is /shimaa64.efi
[16:37:13] NBP filesize is 980056 Bytes
[16:37:13] Downloading NBP file...
[16:37:14]
[16:37:14] NBP file downloaded successfully.
[16:37:14]Fetching Netboot Image
[16:37:22] GNU GRUB version 2.06
...

At this point the GRUB menu should show some boot options which are available based on the GRUB config used for PXE boot.
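When the console log has been saved to a file, the PXE error codes from the failure scenarios above can be matched mechanically. This is a sketch only: the function name pxe_diagnose is hypothetical, and the mapping covers just the three scenarios documented on this page. The same codes can have other causes in other environments.

```shell
# Hypothetical helper: scan a saved console log on stdin for the PXE error
# codes shown in the scenarios above and print the check to try for each.
pxe_diagnose() {
    awk '
        /PXE-E18/ { print "PXE-E18: check that the DHCP server is running"; found = 1 }
        /PXE-E99/ { print "PXE-E99: check that the TFTP server is running"; found = 1 }
        /PXE-E23/ { print "PXE-E23: check that the boot file exists on the TFTP server"; found = 1 }
        END { if (!found) print "no known PXE error code found" }
    '
}
```

For example, pxe_diagnose < console.log, where console.log is whatever file the console output was captured to.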

DHCP Packet Inspection

It can often be helpful to look at the DHCP packets being sent over the network when troubleshooting PXE and HTTP boot issues. The sections below provide some examples for packet inspection using the Linux command line, but Wireshark is also a great alternative if supported.

IPv4

For IPv4-based PXE and HTTP boot, the tool dhcpdump can be installed on the DHCP host server and used to quickly parse different DHCP packets and options. The following is an example log taken from a BlueField PXE booting using the tmfifo_net0 interface:

root@bu-lab102:~# dhcpdump -i tmfifo_net0
TIME: 2024-06-10 10:26:29.980
IP: 0.0.0.0 (0:1a:ca:ff:ff:1) > 255.255.255.255 (ff:ff:ff:ff:ff:ff)
OP: 1 (BOOTPREQUEST)
HTYPE: 1 (Ethernet)
HLEN: 6
HOPS: 0
XID: 22093441
SECS: 0
FLAGS: 7f80
CIADDR: 0.0.0.0
YIADDR: 0.0.0.0
SIADDR: 0.0.0.0
GIADDR: 0.0.0.0
CHADDR: 00:1a:ca:ff:ff:01:00:00:00:00:00:00:00:00:00:00
SNAME: .
FNAME: .
OPTION: 53 ( 1) DHCP message type 1 (DHCPDISCOVER)
OPTION: 57 ( 2) Maximum DHCP message size 1472
OPTION: 55 ( 35) Parameter Request List
    1 (Subnet mask)
    2 (Time offset)
    3 (Routers)
    4 (Time server)
    5 (Name server)
    6 (DNS server)
    12 (Host name)
    13 (Boot file size)
    15 (Domainname)
    17 (Root path)
    18 (Extensions path)
    22 (Maximum datagram reassembly size)
    23 (Default IP TTL)
    28 (Broadcast address)
    40 (NIS domain)
    41 (NIS servers)
    42 (NTP servers)
    43 (Vendor specific info)
    50 (Request IP address)
    51 (IP address leasetime)
    54 (Server identifier)
    58 (T1)
    59 (T2)
    60 (Vendor class identifier)
    66 (TFTP server name)
    67 (Bootfile name)
    97 (UUID/GUID)
    128 (???)
    129 (???)
    130 (???)
    131 (???)
    132 (???)
    133 (???)
    134 (???)
    135 (???)
OPTION: 97 ( 17) UUID/GUID
    009c2debc0368611 ..-..6..
    ee8000a088c20ee8 ........
    18 .
OPTION: 94 ( 3) Client NDI 010300 ...
OPTION: 93 ( 2) Client System 000b ..
OPTION: 60 ( 13) Vendor class identifier NVIDIA/BF/PXE
OPTION: 43 (131) Vendor specific info
    8005424633000081 ..BF3...
    30426c7565466965 0BlueFie
    6c643a342e382e30 ld:4.8.0
    2d322d6765373965 -2-ge79e
    3037662d64697274 07f-dirt
    7900000000000000 y.......
    0000000000000000 ........
    008248444f43415f ..HDOCA_
    322e352e305f4253 2.5.0_BS
    505f342e352e305f P_4.5.0_
    5562756e74755f32 Ubuntu_2
    322e30342d312e32 2.04-1.2
    3032333131303800 0231108.
    0000000000000000 ........
    0000000000000000 ........
    0000000000000000 ........
    000000 ...
---------------------------------------------------------------------------
TIME: 2024-06-10 10:26:29.981
IP: 192.168.100.1 (0:1a:ca:ff:ff:2) > 255.255.255.255 (ff:ff:ff:ff:ff:ff)
OP: 2 (BOOTPREPLY)
HTYPE: 1 (Ethernet)
HLEN: 6
HOPS: 0
XID: 22093441
SECS: 0
FLAGS: 7f80
CIADDR: 0.0.0.0
YIADDR: 192.168.100.2
SIADDR: 192.168.100.1
GIADDR: 0.0.0.0
CHADDR: 00:1a:ca:ff:ff:01:00:00:00:00:00:00:00:00:00:00
SNAME: .
FNAME: /PXE-TEST.efi.
OPTION: 53 ( 1) DHCP message type 2 (DHCPOFFER)
OPTION: 54 ( 4) Server identifier 192.168.100.1
OPTION: 51 ( 4) IP address leasetime 43200 (12h)
OPTION: 1 ( 4) Subnet mask 255.255.255.0

The example shows the DHCP discover packet sent by the client (BlueField) and the offer packet sent by the server as part of the DHCP DORA process, including useful information such as the vendor class identifier and vendor-specific information. In this case, the DHCP server has been configured to serve a test file, PXE-TEST.efi, over TFTP. Looking at the packet dump is a useful way to verify DHCP, TFTP, and HTTP server configuration.

An alternative to dhcpdump is to use tcpdump to look at all raw data sent over the network. For DHCP, only ports 67 and 68 need to be monitored:

# Monitor raw DHCP data
tcpdump -i tmfifo_net0 -n -vvv -xx port 67 or 68
# Convert packets to ASCII
tcpdump -i tmfifo_net0 -n -vvv -A port 67 or 68


IPv6

The dhcpdump tool does not currently support IPv6, but tcpdump can be used for monitoring raw and ASCII data by filtering on ports 546 and 547:

# Monitor raw DHCP data
tcpdump -i tmfifo_net0 -n -vvv -xx port 546 or 547
# Convert packets to ASCII
tcpdump -i tmfifo_net0 -n -vvv -A port 546 or 547

© Copyright 2024, NVIDIA. Last updated on Nov 12, 2024.