Connectivity Troubleshooting
The UART cable in the Accessories Kit (OPN: MBF20-DKIT) can be used to connect to the DPU console and identify the stage at which BlueField is hanging.
Follow this procedure:
Connect the UART cable to a USB socket, and find it in your USB devices.
sudo lsusb
Bus 002 Device 003: ID 0403:6001 Future Technology Devices International, Ltd FT232 Serial (UART) IC
Note: For more information on the UART connectivity, please refer to the DPU's hardware user guide under Supported Interfaces > Interfaces Detailed Description > NC-SI Management Interface.
Info: It is good practice to connect the other end of the NC-SI cable to a different host than the one on which the BlueField DPU is installed.
Install the minicom application.
For CentOS/RHEL:
sudo yum install minicom -y
For Ubuntu/Debian:
sudo apt-get install minicom
Open the minicom application.
sudo minicom -s -c on
Go to "Serial port setup".
Enter "F" to change "Hardware Flow Control" to No.
Enter "A", change the serial device to /dev/ttyUSB0, and press Enter.
Press ESC.
Select "Save setup as dfl".
Exit minicom by pressing Ctrl+A, then X. (Ctrl+A, then Z, opens the minicom command summary, which lists the exit key.)
+-----------------------------------------------------------------------+
| A - Serial Device         : /dev/ttyUSB0                              |
|                                                                       |
| C - Callin Program        :                                           |
| D - Callout Program       :                                           |
| E - Bps/Par/Bits          : 115200 8N1                                |
| F - Hardware Flow Control : No                                        |
| G - Software Flow Control : No                                        |
|                                                                       |
|    Change which setting?                                              |
+-----------------------------------------------------------------------+
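The adapter check from the first step can be scripted before opening minicom. The following is a minimal sketch, assuming the cable reports the FTDI ID 0403:6001 shown in the lsusb output above and that the adapter appears as a /dev/ttyUSB* device. The function names has_ftdi_uart and list_ttyusb are hypothetical helpers, not part of any NVIDIA tool.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds if lsusb-style text on stdin lists the
# FTDI FT232 adapter (ID 0403:6001, as in the example output above).
has_ftdi_uart() {
    grep -q 'ID 0403:6001'
}

# Hypothetical helper: prints the ttyUSB device paths found in a directory
# (default: /dev), one per line. Prints nothing if none exist.
list_ttyusb() {
    local dir="${1:-/dev}" f
    for f in "$dir"/ttyUSB*; do
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Usage on a real system:
#   lsusb | has_ftdi_uart && list_ttyusb
```

If more than one ttyUSB device is listed, the serial device entered in the minicom setup may need to be something other than /dev/ttyUSB0.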
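When the driver on the host server fails to load because the BlueField firmware does not finish initializing, dmesg on the host shows output like the following: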
[275604.216789] mlx5_core 0000:af:00.1: 63.008 Gb/s available PCIe bandwidth, limited by 8 GT/s x8 link at 0000:ae:00.0 (capable of 126.024 Gb/s with 16 GT/s x8 link)
[275624.187596] mlx5_core 0000:af:00.1: wait_fw_init:316:(pid 943): Waiting for FW initialization, timeout abort in 100s
[275644.152994] mlx5_core 0000:af:00.1: wait_fw_init:316:(pid 943): Waiting for FW initialization, timeout abort in 79s
[275664.118404] mlx5_core 0000:af:00.1: wait_fw_init:316:(pid 943): Waiting for FW initialization, timeout abort in 59s
[275684.083806] mlx5_core 0000:af:00.1: wait_fw_init:316:(pid 943): Waiting for FW initialization, timeout abort in 39s
[275704.049211] mlx5_core 0000:af:00.1: wait_fw_init:316:(pid 943): Waiting for FW initialization, timeout abort in 19s
[275723.954752] mlx5_core 0000:af:00.1: mlx5_function_setup:1237:(pid 943): Firmware over 120000 MS in pre-initializing state, aborting
[275723.968261] mlx5_core 0000:af:00.1: init_one:1813:(pid 943): mlx5_load_one failed with error code -16
[275723.978578] mlx5_core: probe of 0000:af:00.1 failed with error -16
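The log excerpt above can be checked for automatically. The following is a minimal sketch, assuming the kernel messages use the same wording as the excerpt; the function names failed_mlx5_probes and fw_stuck_pre_init are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads kernel log text on stdin and prints the PCI
# address of each function whose mlx5_core probe failed, one per line.
# Assumes the "probe of <address> failed with error" wording shown above.
failed_mlx5_probes() {
    sed -n 's/.*mlx5_core: probe of \([0-9a-fA-F:.]*\) failed with error.*/\1/p' | sort -u
}

# Hypothetical helper: succeeds if the log on stdin shows the firmware
# stuck in the pre-initializing state.
fw_stuck_pre_init() {
    grep -q 'in pre-initializing state, aborting'
}

# Usage on a real host:
#   sudo dmesg | failed_mlx5_probes
#   sudo dmesg | fw_stuck_pre_init && echo "firmware did not finish initializing"
```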
The driver on the host server is dependent on the Arm side. If the driver on Arm is up, then the driver on the host server will also be up.
Please verify that:
The driver is loaded in the BlueField (Arm)
The Arm is booted into OS
The Arm is not in UEFI Boot Menu
The Arm is not hung
Then:
Perform a graceful shutdown and a power cycle on the host server.
If the problem persists, reset nvconfig (sudo mlxconfig -d /dev/mst/<device> -y reset) and perform a BlueField system reboot.
Note: If your BlueField is VPI capable, be aware that resetting nvconfig sets the link type on the network ports back to IB. To change the network ports' link type to Ethernet, run:
sudo mlxconfig -d <device> s LINK_TYPE_P1=2 LINK_TYPE_P2=2
This configuration change requires performing a BlueField system reboot.
If the problem still persists, install the latest BFB image and then restart the driver on the host server. Please refer to "Upgrading NVIDIA BlueField DPU Software" for more information.
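The link-type command from the note above can be generated for either mode. The following is a minimal sketch that only prints the command and does not run it. The value 2 for Ethernet comes from the command in the note; the value 1 for IB is an assumption based on the usual mlxconfig numbering and should be confirmed against the configuration query output for your device. The function name link_type_cmd is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints (does not execute) the mlxconfig command that
# sets both network ports to the requested link type.
#   eth -> 2 (as in the note above)
#   ib  -> 1 (assumed; confirm for your device)
link_type_cmd() {
    local dev="$1" type="$2" val
    case "$type" in
        eth|ETH) val=2 ;;
        ib|IB)   val=1 ;;
        *) echo "unknown link type: $type" >&2; return 1 ;;
    esac
    printf 'sudo mlxconfig -d %s s LINK_TYPE_P1=%s LINK_TYPE_P2=%s\n' "$dev" "$val" "$val"
}

# Usage (replace <device> with your device path):
#   link_type_cmd "<device>" eth
```

Printing the command instead of running it leaves the final decision, and the required BlueField system reboot, to the operator.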
Verify that the bridge is configured properly on the Arm side.
The following is an example for default configuration:
$ sudo ovs-vsctl show
f6740bfb-0312-4cd8-88c0-a9680430924f
    Bridge ovsbr1
        Port pf0sf0
            Interface pf0sf0
        Port p0
            Interface p0
        Port pf0hpf
            Interface pf0hpf
        Port ovsbr1
            Interface ovsbr1
                type: internal
    Bridge ovsbr2
        Port p1
            Interface p1
        Port pf1sf0
            Interface pf1sf0
        Port pf1hpf
            Interface pf1hpf
        Port ovsbr2
            Interface ovsbr2
                type: internal
    ovs_version: "2.14.1"
If no bridge configuration exists, please refer to "Virtual Switch on BlueField".
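The bridge check above can be automated by comparing the ovs-vsctl show output against the ports expected on each bridge. The following is a minimal sketch, assuming the default layout from the example (p0, pf0hpf, and pf0sf0 on ovsbr1). The function names bridge_ports and bridge_has_ports are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads `ovs-vsctl show` text on stdin and prints one
# "bridge port" pair per line. It matches on the Bridge/Port keywords only,
# so it does not depend on indentation.
bridge_ports() {
    awk '$1 == "Bridge" { br = $2 }
         $1 == "Port" && br != "" { print br, $2 }' | tr -d '"'
}

# Hypothetical helper: succeeds if every expected port is attached to the
# given bridge; reports each missing port on stderr.
bridge_has_ports() {
    local br="$1" pairs port missing=0
    shift
    pairs="$(bridge_ports)"
    for port in "$@"; do
        if ! printf '%s\n' "$pairs" | grep -qxF -- "$br $port"; then
            echo "missing on $br: $port" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage on the Arm side:
#   sudo ovs-vsctl show | bridge_has_ports ovsbr1 p0 pf0hpf pf0sf0
#   sudo ovs-vsctl show | bridge_has_ports ovsbr2 p1 pf1hpf pf1sf0
```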
Please check that the cables are properly connected to the network ports of the DPU and of the peer device.