NVIDIA BlueField Platform Software Troubleshooting Guide

NIC Mode

In NVIDIA® BlueField®-3 NIC mode, the Arm cores are put to sleep in UEFI and do not boot the OS except during BFB installation. Most of the DRAM (normally allocated to the OS in DPU mode) is allocated to the NIC firmware. This memory region is called static ICM.

Note

All information that follows applies to BlueField-3 NIC mode only.

Command

Description

obmc-console-client

A DPU BMC program to access the BlueField console

echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc

Set the RShim log debug level to 2

cat /dev/rshim0/misc

Dump the RShim log

for i in {1..3}; do mstdump -full /dev/mst/mt41692_pciconf0 > mstdump_"$i".log; done

Get the NIC firmware mstdumps

echo "SW_RESET 1" > /dev/rshim0/misc

Reset the Arm

There are no counters involved for debugging NIC mode on the Arm side.

Please dump the RShim log for troubleshooting, as described throughout this page.
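The two RShim commands from the table above can be wrapped in a small helper. This is a minimal sketch: the function name dump_rshim_log and the output file argument are illustrative assumptions, and only the DISPLAY_LEVEL write and the read of the misc file come from this page.

```shell
# Hypothetical helper: raise the RShim display level and save the log.
#   $1: RShim misc file, for example /dev/rshim0/misc
#   $2: file that receives the log
dump_rshim_log() {
    local misc="$1" out="$2"
    # DISPLAY_LEVEL 2 makes the misc file include the log messages.
    echo "DISPLAY_LEVEL 2" > "$misc" || return 1
    # Reading the misc file back returns the status fields and the log.
    cat "$misc" > "$out"
}
```

A typical call on the host would be dump_rshim_log /dev/rshim0/misc rshim.log, run with sufficient privileges to write to the RShim device.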

mlx5_core Assert Messages

The following mlx5_core asserts from the x86 host usually indicate that the NIC firmware did not get the small ICMC memory range from Arm:

[   69.847178] mlx5_core 0000:05:00.0: poll_health:840:(pid 0): device's health compromised - reached miss count
[   69.857135] mlx5_core 0000:05:00.0: print_health_info:429:(pid 0): Health issue observed, firmware internal error, severity(2) CRITICAL:
[   69.869403] mlx5_core 0000:05:00.0: print_health_info:433:(pid 0): assert_var[0] 0x0000001e
[   69.877764] mlx5_core 0000:05:00.0: print_health_info:433:(pid 0): assert_var[1] 0x00000000
[   69.886125] mlx5_core 0000:05:00.0: print_health_info:433:(pid 0): assert_var[2] 0x00000000
[   69.894485] mlx5_core 0000:05:00.0: print_health_info:433:(pid 0): assert_var[3] 0x00000000
[   69.902845] mlx5_core 0000:05:00.0: print_health_info:433:(pid 0): assert_var[4] 0x00000000
[   69.911204] mlx5_core 0000:05:00.0: print_health_info:433:(pid 0): assert_var[5] 0x00000000
[   69.919564] mlx5_core 0000:05:00.0: print_health_info:436:(pid 0): assert_exit_ptr 0x20adec54
[   69.928096] mlx5_core 0000:05:00.0: print_health_info:437:(pid 0): assert_callra 0x20adeeb0
[   69.936462] mlx5_core 0000:05:00.0: print_health_info:438:(pid 0): fw_ver 32.42.147
[   69.944129] mlx5_core 0000:05:00.0: print_health_info:440:(pid 0): time 0
[   69.950923] mlx5_core 0000:05:00.0: print_health_info:441:(pid 0): hw_id 0x0001021c
[   69.962499] mlx5_core 0000:05:00.0: print_health_info:442:(pid 0): rfr 0
[   69.962506] mlx5_core 0000:05:00.0: print_health_info:443:(pid 0): severity 2 (CRITICAL)
[   69.973134] mlx5_core 0000:05:00.0: print_health_info:444:(pid 0): irisc_index 7
[   69.973151] mlx5_core 0000:05:00.0: print_health_info:445:(pid 0): synd 0x1: firmware internal error
[   69.985156] mlx5_core 0000:05:00.0: print_health_info:447:(pid 0): ext_synd 0x874f
[   70.013209] mlx5_core 0000:05:00.0: print_health_info:448:(pid 0): raw fw_ver 0x202a0093
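When reporting such an assert, it helps to quote only the identifying fields. The sketch below pulls them out of kernel log text and decodes the raw firmware version word. Both function names are hypothetical, and the bit layout (major in bits 31-24, minor in bits 23-16, subminor in bits 15-0) is an assumption inferred from the sample above, where raw fw_ver 0x202a0093 appears next to fw_ver 32.42.147.

```shell
# Hypothetical helper: split a raw fw_ver word into major.minor.subminor.
# Assumed layout (inferred from the sample log, not from a specification):
# bits 31-24 major, bits 23-16 minor, bits 15-0 subminor.
decode_raw_fw_ver() {
    local raw=$(( $1 ))
    printf '%d.%d.%d\n' $(( raw >> 24 )) $(( (raw >> 16) & 0xff )) $(( raw & 0xffff ))
}

# Hypothetical helper: read kernel log text on stdin and print the fw_ver,
# ext_synd, irisc_index and assert_var[0] fields without the log prefix.
mlx5_assert_summary() {
    grep -E 'print_health_info.*(: fw_ver|ext_synd|irisc_index|assert_var\[0\])' |
        sed -E 's/.*\(pid [0-9]+\): //'
}
```

For example, dmesg | mlx5_assert_summary prints one field per line, and decode_raw_fw_ver 0x202a0093 prints 32.42.147.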

How to Troubleshoot

  1. Get the NIC firmware version and the Arm software version.

  2. Collect mstdumps. The dumps would confirm that NIC firmware did not get the expected signal from the Arm.

  3. Collect RShim log:

    echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc
    cat /dev/rshim0/misc

    If the issue is the Arm failing to send memory information to the NIC, consulting the RShim log may offer greater insight into the cause.
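Step 2 can be scripted so that the three dumps always land in one directory. This is a sketch under assumptions: the function name collect_mstdumps, the output directory argument and the MSTDUMP override variable are hypothetical, while the mstdump -full invocation and the three-dump loop follow the command table at the top of this page.

```shell
# Hypothetical helper: take three consecutive NIC firmware dumps.
#   $1: MST device, for example /dev/mst/mt41692_pciconf0
#   $2: directory that receives mstdump_1.log .. mstdump_3.log
# MSTDUMP may be set to use a different mstdump binary (assumption).
collect_mstdumps() {
    local dev="$1" outdir="$2" i
    mkdir -p "$outdir" || return 1
    for i in 1 2 3; do
        "${MSTDUMP:-mstdump}" -full "$dev" > "$outdir/mstdump_$i.log" || return 1
    done
}
```

Example call: collect_mstdumps /dev/mst/mt41692_pciconf0 ./dumps.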

Possible Causes

  • An ATF BL2 exception caused the NIC firmware to time out 30 seconds after reset while waiting for the small ICMC information. The RShim log should show the exception in this case.

  • A DDR training issue slowed down the boot and caused the NIC firmware to time out after 30 seconds while waiting for the small ICMC information. The RShim log tracks the time since the reset. In the example below, UP_TIME has reached 41 seconds and DDR training has still not completed. DDR training is done when the message INFO[BL2]: DDR POST passed is displayed.

    # cat /dev/rshim0/misc
    DISPLAY_LEVEL   2 (0:basic, 1:advanced, 2:log)
    BOOT_MODE       1 (0:rshim, 1:emmc, 2:emmc-boot-swap)
    BOOT_TIMEOUT    150 (seconds)
    DROP_MODE       0 (0:normal, 1:drop)
    SW_RESET        0 (1: reset)
    DEV_NAME        pcie-0000:05:00.1
    DEV_INFO        BlueField-3(Rev 1)
    OPN_STR         BF3COMDPU
    UP_TIME         31(s)
    SECURE_NIC_MODE 0 (0:no, 1:yes)
    ---------------------------------------
                 Log Messages
    ---------------------------------------
    INFO[PSC]: PSC BL1 START
    INFO[BL2]: start
    INFO[BL2]: boot mode (emmc)

    root@r-bobcat-01:~# cat /dev/rshim3/misc
    DISPLAY_LEVEL   2 (0:basic, 1:advanced, 2:log)
    BOOT_MODE       1 (0:rshim, 1:emmc, 2:emmc-boot-swap)
    BOOT_TIMEOUT    150 (seconds)
    DROP_MODE       0 (0:normal, 1:drop)
    SW_RESET        0 (1: reset)
    DEV_NAME        pcie-0000:05:00.1
    DEV_INFO        BlueField-3(Rev 1)
    OPN_STR         BF3COMDPU
    UP_TIME         41(s)
    SECURE_NIC_MODE 0 (0:no, 1:yes)
    ---------------------------------------
                 Log Messages
    ---------------------------------------
    INFO[PSC]: PSC BL1 START
    INFO[BL2]: start
    INFO[BL2]: boot mode (emmc)
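The comparison between UP_TIME and the DDR POST message can be automated. The following sketch reads RShim misc output on standard input; the function name check_ddr_training and its optional limit argument are hypothetical, and the 30-second default mirrors the timeout described above.

```shell
# Hypothetical helper: report whether DDR training finished in time.
#   stdin: output of "cat /dev/rshim0/misc" at DISPLAY_LEVEL 2
#   $1:    timeout in seconds to compare against (default 30)
check_ddr_training() {
    local limit="${1:-30}"
    awk -v limit="$limit" '
        # "UP_TIME  41(s)" -> 41; the last UP_TIME line on stdin wins
        $1 == "UP_TIME" { up = $2; sub(/\(s\).*/, "", up); up += 0 }
        /INFO\[BL2\]: DDR POST passed/ { passed = 1 }
        END {
            if (passed)          print "DDR training completed"
            else if (up > limit) print "DDR training not completed after " up "s (limit " limit "s)"
            else                 print "DDR training still in progress at " up "s"
        }'
}
```

Example call: cat /dev/rshim0/misc | check_ddr_training. The helper only restates what the log shows; it does not identify why training is slow.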

How to Proceed

Update to the latest bf-bundle.bfb, which updates the NIC firmware and keeps it in sync with the Arm software.

mlx5_core Timeout Messages

The following mlx5_core timeout messages from the PCIe host usually indicate that the NIC firmware did not get the static ICM memory range from Arm:

2024-03-01T06:59:35,626713+02:00 mlx5_core 0000:83:00.0: wait_fw_init:378:(pid 372247): Waiting for FW initialization, timeout abort in 7159s (0x87020000)
2024-03-01T06:59:35,758718+02:00 mlx5_core 0000:83:00.1: wait_fw_init:378:(pid 306775): Waiting for FW initialization, timeout abort in 7159s (0x87020000)
2024-03-01T06:59:55,628014+02:00 mlx5_core 0000:83:00.0: wait_fw_init:378:(pid 372247): Waiting for FW initialization, timeout abort in 7139s (0x87020000)
2024-03-01T06:59:55,760028+02:00 mlx5_core 0000:83:00.1: wait_fw_init:378:(pid 306775): Waiting for FW initialization, timeout abort in 7139s (0x87020000)
2024-03-01T07:00:15,632300+02:00 mlx5_core 0000:83:00.0: wait_fw_init:378:(pid 372247): Waiting for FW initialization, timeout abort in 7119s (0x87020000)
2024-03-01T07:00:15,761298+02:00 mlx5_core 0000:83:00.1: wait_fw_init:378:(pid 306775): Waiting for FW initialization, timeout abort in 7119s (0x87020000)
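To see at a glance which PCI functions are still waiting, the messages above can be reduced to one line per function. This is a sketch only: the function name fw_init_wait_summary is hypothetical, and the pattern is written against the message format shown in this sample, which may differ between driver versions.

```shell
# Hypothetical helper: for each PCI function reporting wait_fw_init on stdin,
# print the most recent "timeout abort in Ns" value.
fw_init_wait_summary() {
    sed -n -E 's/.*mlx5_core ([0-9a-fA-F:.]+): wait_fw_init.*timeout abort in ([0-9]+)s.*/\1 \2/p' |
        awk '{ left[$1] = $2 } END { for (d in left) print d, left[d] "s remaining" }' |
        sort
}
```

Example call: dmesg | fw_init_wait_summary. An empty result means no wait_fw_init messages were found in the input.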

How to Troubleshoot

  1. Get the NIC firmware version and the Arm software version.

  2. Collect mstdumps. The dumps would confirm that NIC firmware did not get the expected signal from the Arm.

  3. Collect RShim log:

    echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc
    cat /dev/rshim0/misc

    If the problem is Arm not sending memory information to the NIC, the RShim log could provide more clarity on what caused that.

  4. Collect the BlueField console log.

Possible Causes

  • An ATF or UEFI exception happened before UEFI could send the static ICM information to the NIC. NIC firmware has a 120-second timeout after reset for retrieving the static ICM information. The RShim log would show the exception in this case.

  • ATF/UEFI boot was too slow and violated the NIC firmware 120-second timeout. Possible causes for the slow boot:

    • Redfish issue

    • EEPROM initialization issue

How to Proceed

  • Issue an Arm reset to work around the issue.

  • Alternatively, burn the latest bf-bundle.bfb to update both the NIC firmware and the Arm software.
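If the Arm reset workaround is used, it is worth saving the RShim log first so that the failed boot can still be analyzed afterwards. The sketch below combines the two steps; the function name reset_arm_with_log and the log file argument are hypothetical, and the DISPLAY_LEVEL and SW_RESET writes are the ones listed in the command table at the top of this page.

```shell
# Hypothetical helper: save the RShim log, then reset the Arm.
#   $1: RShim misc file, for example /dev/rshim0/misc
#   $2: file that receives the log captured before the reset
reset_arm_with_log() {
    local misc="$1" out="$2"
    # Capture the log of the failed boot before touching the Arm.
    echo "DISPLAY_LEVEL 2" > "$misc" || return 1
    cat "$misc" > "$out" || return 1
    # Reset the Arm.
    echo "SW_RESET 1" > "$misc"
}
```

Example call: reset_arm_with_log /dev/rshim0/misc rshim_before_reset.log.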

mlx5_core Driver Prints Fatal Error During BFB Installation

During BFB image installation, the mlx5_core driver prints fatal error messages on the x86 host, as shown below. This behavior started in 4.7.0/2.7.0, where the BFB image also updates the NIC firmware and the BMC software.

[Tue Apr  9 06:59:36 2024] mlx5_core 0000:03:00.0: poll_health:1037:(pid 0): Fatal error 3 detected
[Tue Apr  9 06:59:36 2024] mlx5_core 0000:03:00.1: poll_health:1037:(pid 0): Fatal error 3 detected
[Tue Apr  9 06:59:36 2024] mlx5_core 0000:03:00.1: mlx5_health_try_recover:339:(pid 2778): handling bad device here
[Tue Apr  9 06:59:36 2024] mlx5_core 0000:03:00.1: mlx5_handle_bad_state:290:(pid 2778): starting teardown
[Tue Apr  9 06:59:36 2024] mlx5_core 0000:03:00.1: mlx5_error_sw_reset:241:(pid 2778): start
[Tue Apr  9 06:59:36 2024] mlx5_core 0000:03:00.1: mlx5_error_sw_reset:274:(pid 2778): end
[Tue Apr  9 06:59:39 2024] mlx5_core 0000:03:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
[Tue Apr  9 06:59:39 2024] mlx5_core 0000:03:00.1: mlx5_wait_for_pages:916:(pid 2778): Skipping wait for vf pages stage
[Tue Apr  9 06:59:39 2024] mlx5_core 0000:03:00.1: mlx5_wait_for_pages:916:(pid 2778): Skipping wait for vf pages stage
[Tue Apr  9 06:59:47 2024] mlx5_core 0000:03:00.0: mlx5_health_try_recover:339:(pid 5577): handling bad device here
[Tue Apr  9 06:59:47 2024] mlx5_core 0000:03:00.0: mlx5_handle_bad_state:290:(pid 5577): starting teardown
[Tue Apr  9 06:59:47 2024] mlx5_core 0000:03:00.0: mlx5_error_sw_reset:241:(pid 5577): start
[Tue Apr  9 06:59:47 2024] mlx5_core 0000:03:00.0: mlx5_error_sw_reset:274:(pid 5577): end
[Tue Apr  9 06:59:47 2024] mlx5_core 0000:03:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
[Tue Apr  9 06:59:47 2024] mlx5_core 0000:03:00.0: mlx5_wait_for_pages:916:(pid 5577): Skipping wait for vf pages stage
[Tue Apr  9 06:59:47 2024] mlx5_core 0000:03:00.0: mlx5_wait_for_pages:916:(pid 5577): Skipping wait for vf pages stage
[Tue Apr  9 06:59:49 2024] mlx5_core 0000:03:00.1: mlx5_health_try_recover:345:(pid 2778): starting health recovery flow
[Tue Apr  9 06:59:49 2024] mlx5_core 0000:03:00.1: mlx5_pci_slot_reset Device state = 2 pci_status: 0. Enter
[Tue Apr  9 06:59:49 2024] mlx5_core 0000:03:00.1: wait vital counter value 0x21100 after 1 iterations
[Tue Apr  9 06:59:49 2024] mlx5_core 0000:03:00.1: mlx5_pci_slot_reset Device state = 2 pci_status: 1. Exit, err = 0, result = 5, recovered

How to Proceed

This behavior is expected. The installation should be followed by a power cycle to recover and activate the NIC firmware and the BMC software.

Failed to Switch from DPU to NIC Mode

There are 3 ways to switch from DPU mode to NIC mode:

  • Via the DPU BMC Redfish

  • Via the BlueField UEFI menu

  • Via mlxconfig on the x86 host

Please refer to NVIDIA BlueField Modes of Operation for detailed instructions on switching between modes of operation.

How to Troubleshoot

Check the BlueField mode status using the mlxconfig tool on the x86 host:

mlxconfig -d /dev/mst/mt41692_pciconf0 q | grep -ni offload
23:         INTERNAL_CPU_OFFLOAD_ENGINE                 DISABLED(1)

  • If it is set to DISABLED(1), it means that the configuration was applied successfully and BlueField firmware is operating in NIC mode. The issue is then that the Arm side cannot detect NIC mode. In this case, collect the RShim log and the BlueField console log.

  • If it is ENABLED(0), it means that the configuration failed and NIC firmware is still operating in DPU mode.

    • If NIC mode was set using mlxconfig, then the problem is likely with the tool

    • If NIC mode was set using the BlueField BMC's Redfish or the UEFI menu, then the problem may be on the Arm side. In this case, collect the RShim log and the BlueField console log. Also try setting NIC mode via mlxconfig to see if that resolves the issue.
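The mode check can be scripted by looking only at the numeric value in parentheses, which avoids depending on the text label. This is a minimal sketch: the function name bf_mode_from_mlxconfig is hypothetical, and the mapping (1 for NIC mode, 0 for DPU mode) follows the description above.

```shell
# Hypothetical helper: read "mlxconfig -d <device> q" output on stdin and
# report the mode implied by INTERNAL_CPU_OFFLOAD_ENGINE.
# Only the number in parentheses is used: 1 = NIC mode, 0 = DPU mode.
bf_mode_from_mlxconfig() {
    awk '
        /INTERNAL_CPU_OFFLOAD_ENGINE/ {
            found = 1
            if ($NF ~ /\(1\)$/)      print "NIC mode"
            else if ($NF ~ /\(0\)$/) print "DPU mode"
            else                     print "unknown value: " $NF
        }
        END { if (!found) { print "INTERNAL_CPU_OFFLOAD_ENGINE not found"; exit 1 } }'
}
```

Example call: mlxconfig -d /dev/mst/mt41692_pciconf0 q | bf_mode_from_mlxconfig. The helper reports the value that mlxconfig returns for the queried device and does not tell whether the Arm side has detected the mode.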

© Copyright 2024, NVIDIA. Last updated on Nov 12, 2024.