NIC Mode
In NVIDIA® BlueField®-3 NIC mode, Arm cores are put to sleep in UEFI. Arm does not boot the OS except during the BFB installation. Most of the DRAM memory (normally allocated to the OS in DPU mode) is allocated to NIC firmware. This memory region is called static ICM.
All information that follows applies to BlueField-3 NIC mode only.
Command | Description |
| A DPU BMC program to access the BlueField console |
| Set the RShim log debug level to 2 |
| Dump the RShim log |
| Get the NIC firmware mstdumps |
| Reset the Arm |
There are no counters involved for debugging NIC mode on the Arm side.
Please dump the RShim log for troubleshooting, as described throughout this page.
mlx5_core Assert Messages
The following mlx5_core asserts from the x86 host usually indicate that the NIC firmware did not get the small ICMC memory range from Arm:
[ 69.847178] mlx5_core 0000:05:00.0: poll_health:840:(pid 0): device's health compromised - reached miss count
[ 69.857135] mlx5_core 0000:05:00.0: print_health_info:429:(pid 0): Health issue observed, firmware internal error, severity(2) CRITICAL:
[ 69.869403] mlx5_core 0000:05:00.0: print_health_info:433:(pid 0): assert_var[0] 0x0000001e
[ 69.877764] mlx5_core 0000:05:00.0: print_health_info:433:(pid 0): assert_var[1] 0x00000000
[ 69.886125] mlx5_core 0000:05:00.0: print_health_info:433:(pid 0): assert_var[2] 0x00000000
[ 69.894485] mlx5_core 0000:05:00.0: print_health_info:433:(pid 0): assert_var[3] 0x00000000
[ 69.902845] mlx5_core 0000:05:00.0: print_health_info:433:(pid 0): assert_var[4] 0x00000000
[ 69.911204] mlx5_core 0000:05:00.0: print_health_info:433:(pid 0): assert_var[5] 0x00000000
[ 69.919564] mlx5_core 0000:05:00.0: print_health_info:436:(pid 0): assert_exit_ptr 0x20adec54
[ 69.928096] mlx5_core 0000:05:00.0: print_health_info:437:(pid 0): assert_callra 0x20adeeb0
[ 69.936462] mlx5_core 0000:05:00.0: print_health_info:438:(pid 0): fw_ver 32.42.147
[ 69.944129] mlx5_core 0000:05:00.0: print_health_info:440:(pid 0): time 0
[ 69.950923] mlx5_core 0000:05:00.0: print_health_info:441:(pid 0): hw_id 0x0001021c
[ 69.962499] mlx5_core 0000:05:00.0: print_health_info:442:(pid 0): rfr 0
[ 69.962506] mlx5_core 0000:05:00.0: print_health_info:443:(pid 0): severity 2 (CRITICAL)
[ 69.973134] mlx5_core 0000:05:00.0: print_health_info:444:(pid 0): irisc_index 7
[ 69.973151] mlx5_core 0000:05:00.0: print_health_info:445:(pid 0): synd 0x1: firmware internal error
[ 69.985156] mlx5_core 0000:05:00.0: print_health_info:447:(pid 0): ext_synd 0x874f
[ 70.013209] mlx5_core 0000:05:00.0: print_health_info:448:(pid 0): raw fw_ver 0x202a0093
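When attaching an assert report to a support case, it can help to pull the key fields out of the kernel log rather than pasting the whole dump. The following is a minimal Python sketch, not an NVIDIA tool: the function name parse_health_info and both regular expressions are assumptions inferred from the sample lines above, and they may need adjusting for other driver versions.

```python
import re

# Assumed line layout, inferred from the sample log:
#   [time] mlx5_core <pci address>: print_health_info:<n>:(pid <n>): <field> <value>
LINE_RE = re.compile(
    r"mlx5_core (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"print_health_info:\d+:\(pid \d+\): (?P<body>.+)$"
)

# Only the fields that appear in the sample log are recognized.
FIELD_RE = re.compile(
    r"^(?P<key>assert_var\[\d+\]|assert_exit_ptr|assert_callra|fw_ver|hw_id|"
    r"irisc_index|synd|ext_synd)\s+(?P<val>[^\s:]+)"
)


def parse_health_info(log_text):
    """Return {pci_address: {field: value}} for print_health_info lines."""
    devices = {}
    for line in log_text.splitlines():
        line_match = LINE_RE.search(line)
        if line_match is None:
            continue  # not a print_health_info line
        field_match = FIELD_RE.match(line_match.group("body"))
        if field_match is None:
            continue  # free-text line such as "Health issue observed, ..."
        fields = devices.setdefault(line_match.group("bdf"), {})
        fields[field_match.group("key")] = field_match.group("val")
    return devices
```

Calling parse_health_info on the output of dmesg returns one dictionary per PCIe function, so values such as synd, ext_synd, and fw_ver can be quoted directly in a report.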
How to Troubleshoot
Get the NIC firmware version and the Arm software version.
Collect mstdumps. The dumps would confirm that NIC firmware did not get the expected signal from the Arm.
Collect the RShim log:
echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc
cat /dev/rshim0/misc
If the issue is the Arm failing to send memory information to the NIC, the RShim log may offer greater insight into the cause.
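The RShim log collection step above can also be scripted. The sketch below mirrors the echo and cat commands in Python; the function name collect_rshim_log is hypothetical, the default path assumes the device is exposed as rshim0 (it may be a different index on hosts with several BlueField devices), and writing to the device typically requires root privileges.

```python
from pathlib import Path


def collect_rshim_log(misc_path="/dev/rshim0/misc"):
    """Set the RShim display level to 2 (log), then return the misc dump."""
    misc = Path(misc_path)
    # Same effect as: echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc
    misc.write_text("DISPLAY_LEVEL 2\n")
    # Same effect as: cat /dev/rshim0/misc
    return misc.read_text()
```

The returned text can be saved next to the mstdumps so that both are captured from the same boot attempt.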
Possible Causes
An ATF BL2 exception caused the NIC firmware to time out 30 seconds after reset while waiting for the small ICMC information. The RShim log should show the exception in this case.
A DDR training issue slowed down the boot and caused the NIC firmware to time out after 30 seconds while waiting for the small ICMC information. The RShim log tracks the time since the reset. In the example below, UP_TIME is 41 seconds while DDR training has not completed. DDR training is done if the message INFO[BL2]: DDR POST passed is displayed.
# cat /dev/rshim0/misc
DISPLAY_LEVEL   2 (0:basic, 1:advanced, 2:log)
BOOT_MODE       1 (0:rshim, 1:emmc, 2:emmc-boot-swap)
BOOT_TIMEOUT    150 (seconds)
DROP_MODE       0 (0:normal, 1:drop)
SW_RESET        0 (1: reset)
DEV_NAME        pcie-0000:05:00.1
DEV_INFO        BlueField-3(Rev 1)
OPN_STR         BF3COMDPU
UP_TIME         31(s)
SECURE_NIC_MODE 0 (0:no, 1:yes)
---------------------------------------
             Log Messages
---------------------------------------
INFO[PSC]: PSC BL1 START
INFO[BL2]: start
INFO[BL2]: boot mode (emmc)
root@r-bobcat-01:~# cat /dev/rshim3/misc
DISPLAY_LEVEL   2 (0:basic, 1:advanced, 2:log)
BOOT_MODE       1 (0:rshim, 1:emmc, 2:emmc-boot-swap)
BOOT_TIMEOUT    150 (seconds)
DROP_MODE       0 (0:normal, 1:drop)
SW_RESET        0 (1: reset)
DEV_NAME        pcie-0000:05:00.1
DEV_INFO        BlueField-3(Rev 1)
OPN_STR         BF3COMDPU
UP_TIME         41(s)
SECURE_NIC_MODE 0 (0:no, 1:yes)
---------------------------------------
             Log Messages
---------------------------------------
INFO[PSC]: PSC BL1 START
INFO[BL2]: start
INFO[BL2]: boot mode (emmc)
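The check described above (UP_TIME past the 30-second window with no DDR POST passed message) can be applied to a saved RShim dump. This is a minimal Python sketch under stated assumptions: the function name ddr_training_status and the returned keys are hypothetical, and the 30-second constant is taken from the text above rather than read from the firmware. It only flags the pattern shown in the example; a DDR POST that passes after the window has already expired would not be flagged.

```python
import re

# Small ICMC timeout after reset, as stated in the text above.
NIC_FW_TIMEOUT_S = 30


def ddr_training_status(misc_text, timeout_s=NIC_FW_TIMEOUT_S):
    """Summarize UP_TIME and DDR training state from an RShim misc dump."""
    match = re.search(r"UP_TIME\s*(\d+)\s*\(s\)", misc_text)
    up_time = int(match.group(1)) if match else None
    ddr_done = "DDR POST passed" in misc_text
    # Suspicious: past the timeout window and DDR training still not reported.
    likely_timeout = up_time is not None and up_time > timeout_s and not ddr_done
    return {
        "up_time_s": up_time,
        "ddr_post_passed": ddr_done,
        "likely_timeout": likely_timeout,
    }
```

For the second dump in the example, this reports an up time of 41 seconds, no DDR POST message, and a likely timeout.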
How to Proceed
Update to the latest bf-bundle.bfb, which updates the NIC firmware and keeps it in sync with the Arm software.
mlx5_core Timeout Messages
The following mlx5_core timeout messages from the PCIe host usually indicate that the NIC firmware did not get the static ICM memory range from Arm:
2024-03-01T06:59:35,626713+02:00 mlx5_core 0000:83:00.0: wait_fw_init:378:(pid 372247): Waiting for FW initialization, timeout abort in 7159s (0x87020000)
2024-03-01T06:59:35,758718+02:00 mlx5_core 0000:83:00.1: wait_fw_init:378:(pid 306775): Waiting for FW initialization, timeout abort in 7159s (0x87020000)
2024-03-01T06:59:55,628014+02:00 mlx5_core 0000:83:00.0: wait_fw_init:378:(pid 372247): Waiting for FW initialization, timeout abort in 7139s (0x87020000)
2024-03-01T06:59:55,760028+02:00 mlx5_core 0000:83:00.1: wait_fw_init:378:(pid 306775): Waiting for FW initialization, timeout abort in 7139s (0x87020000)
2024-03-01T07:00:15,632300+02:00 mlx5_core 0000:83:00.0: wait_fw_init:378:(pid 372247): Waiting for FW initialization, timeout abort in 7119s (0x87020000)
2024-03-01T07:00:15,761298+02:00 mlx5_core 0000:83:00.1: wait_fw_init:378:(pid 306775): Waiting for FW initialization, timeout abort in 7119s (0x87020000)
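To see at a glance which PCIe functions are stuck waiting for firmware initialization, the countdown values in messages like these can be grouped per device. The following Python sketch is illustrative only: the function name fw_init_countdown is hypothetical and the regular expression is inferred from the sample messages above.

```python
import re
from collections import defaultdict

# Assumed message layout, inferred from the sample log lines.
WAIT_RE = re.compile(
    r"mlx5_core (?P<bdf>\S+): wait_fw_init:\d+:\(pid \d+\): "
    r"Waiting for FW initialization, timeout abort in (?P<secs>\d+)s"
)


def fw_init_countdown(log_text):
    """Return {pci_address: [remaining seconds, in log order]}."""
    remaining = defaultdict(list)
    for match in WAIT_RE.finditer(log_text):
        remaining[match.group("bdf")].append(int(match.group("secs")))
    return dict(remaining)
```

A device whose list keeps decreasing across successive log reads is still waiting on the firmware and has not recovered on its own.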
How to Troubleshoot
Get the NIC firmware version and the Arm software version.
Collect mstdumps. The dumps would confirm that NIC firmware did not get the expected signal from the Arm.
Collect the RShim log:
echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc
cat /dev/rshim0/misc
If the problem is Arm not sending memory information to the NIC, the RShim log could provide more clarity on the cause.
Collect the BlueField console log.
Possible Causes
An ATF or UEFI exception happened before UEFI could send the static ICM information to the NIC. NIC firmware has a 120-second timeout after reset for retrieving the static ICM information. The RShim log would show the exception in this case.
ATF/UEFI boot was too slow and exceeded the NIC firmware's 120-second timeout. Possible causes for the slow boot:
Redfish issue
EEPROM initialization issue
How to Proceed
Issue an Arm reset to work around the issue.
Alternatively, burn the latest bf-bundle.bfb to update both the NIC firmware and the Arm software.
mlx5_core Driver Prints Fatal Error During BFB Installation
During BFB image installation, the mlx5_core driver prints fatal messages on the x86 host, as shown below. This behavior started in 4.7.0/2.7.0, where the BFB image also updates the NIC firmware and the BMC software.
[Tue Apr 9 06:59:36 2024] mlx5_core 0000:03:00.0: poll_health:1037:(pid 0): Fatal error 3 detected
[Tue Apr 9 06:59:36 2024] mlx5_core 0000:03:00.1: poll_health:1037:(pid 0): Fatal error 3 detected
[Tue Apr 9 06:59:36 2024] mlx5_core 0000:03:00.1: mlx5_health_try_recover:339:(pid 2778): handling bad device here
[Tue Apr 9 06:59:36 2024] mlx5_core 0000:03:00.1: mlx5_handle_bad_state:290:(pid 2778): starting teardown
[Tue Apr 9 06:59:36 2024] mlx5_core 0000:03:00.1: mlx5_error_sw_reset:241:(pid 2778): start
[Tue Apr 9 06:59:36 2024] mlx5_core 0000:03:00.1: mlx5_error_sw_reset:274:(pid 2778): end
[Tue Apr 9 06:59:39 2024] mlx5_core 0000:03:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
[Tue Apr 9 06:59:39 2024] mlx5_core 0000:03:00.1: mlx5_wait_for_pages:916:(pid 2778): Skipping wait for vf pages stage
[Tue Apr 9 06:59:39 2024] mlx5_core 0000:03:00.1: mlx5_wait_for_pages:916:(pid 2778): Skipping wait for vf pages stage
[Tue Apr 9 06:59:47 2024] mlx5_core 0000:03:00.0: mlx5_health_try_recover:339:(pid 5577): handling bad device here
[Tue Apr 9 06:59:47 2024] mlx5_core 0000:03:00.0: mlx5_handle_bad_state:290:(pid 5577): starting teardown
[Tue Apr 9 06:59:47 2024] mlx5_core 0000:03:00.0: mlx5_error_sw_reset:241:(pid 5577): start
[Tue Apr 9 06:59:47 2024] mlx5_core 0000:03:00.0: mlx5_error_sw_reset:274:(pid 5577): end
[Tue Apr 9 06:59:47 2024] mlx5_core 0000:03:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
[Tue Apr 9 06:59:47 2024] mlx5_core 0000:03:00.0: mlx5_wait_for_pages:916:(pid 5577): Skipping wait for vf pages stage
[Tue Apr 9 06:59:47 2024] mlx5_core 0000:03:00.0: mlx5_wait_for_pages:916:(pid 5577): Skipping wait for vf pages stage
[Tue Apr 9 06:59:49 2024] mlx5_core 0000:03:00.1: mlx5_health_try_recover:345:(pid 2778): starting health recovery flow
[Tue Apr 9 06:59:49 2024] mlx5_core 0000:03:00.1: mlx5_pci_slot_reset Device state = 2 pci_status: 0. Enter
[Tue Apr 9 06:59:49 2024] mlx5_core 0000:03:00.1: wait vital counter value 0x21100 after 1 iterations
[Tue Apr 9 06:59:49 2024] mlx5_core 0000:03:00.1: mlx5_pci_slot_reset Device state = 2 pci_status: 1. Exit, err = 0, result = 5, recovered
How to Proceed
This behavior is expected. The installation should be followed by a power cycle to recover and activate the NIC firmware and the BMC software.
Failed to Switch from DPU to NIC Mode
There are three ways to switch from DPU mode to NIC mode:
Via the DPU BMC Redfish
Via the BlueField UEFI menu
Via mlxconfig on the x86 host
Please refer to NVIDIA BlueField Modes of Operation for detailed instructions on switching between modes of operation.
How to Troubleshoot
Check the BlueField mode status using the mlxconfig tool on the x86 host:
mlxconfig -d /dev/mst/mt41692_pciconf0 q | grep -ni offload
23: INTERNAL_CPU_OFFLOAD_ENGINE DISABLED(1)
If it is set to DISABLED(1), the configuration was applied successfully and the BlueField firmware is operating in NIC mode. The issue is then that the Arm side is unable to detect NIC mode. In this case, collect the RShim log and the BlueField console log.
If it is set to ENABLED(0), the configuration failed and the NIC firmware is still operating in DPU mode.
If NIC mode was set using mlxconfig, then the problem is likely with the tool.
If NIC mode was set using the BlueField BMC's Redfish or the UEFI menu, then the problem may be on the Arm side. In this case, collect the RShim log and the BlueField console log. Also try setting NIC mode via mlxconfig to see if that resolves the issue.
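The mode check above can be automated by reading the numeric value in parentheses on the INTERNAL_CPU_OFFLOAD_ENGINE line, which the text above associates with NIC mode (1) and DPU mode (0). This is a minimal Python sketch that works on captured mlxconfig output; the function name bluefield_mode and its return strings are hypothetical, and it deliberately keys on the number rather than on the label printed before it.

```python
import re


def bluefield_mode(mlxconfig_output):
    """Classify captured 'mlxconfig ... q' output as 'NIC', 'DPU', or 'unknown'."""
    match = re.search(
        r"INTERNAL_CPU_OFFLOAD_ENGINE\s+\w+\((\d+)\)", mlxconfig_output
    )
    if match is None:
        return "unknown"  # parameter not present in the captured output
    value = int(match.group(1))
    if value == 1:
        return "NIC"
    if value == 0:
        return "DPU"
    return "unknown"
```

If the function returns DPU after a switch to NIC mode was requested, the configuration was not applied, which points back to the method used to set it.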