PCIe
This page offers troubleshooting information for PCIe.
Missing PCIe Device
PCIe Links
There are several stages to the discovery and operation of PCIe devices, and errors at any of these stages can cause a device to be unavailable to the operating system. PCIe devices form a tree hierarchy, with each node connected to its parent via a PCIe link. All of the links between the root port and the endpoint device must be trained and active in order to access the device. Link training is handled by hardware, but if it fails, all downstream devices become unavailable. The lspci tool can be used to check the downstream link status of the root ports and switches.
In the following example, the LnkSta line shows the link operating correctly: TrErr- signifies there were no training errors, and DLActive+ signifies the link is up.
# lspci -vv -s 5:0.0
05:00.0 PCI bridge: Mellanox Technologies MT43244 Family [BlueField-3 SoC PCIe Bridge] (rev 01) (prog-if 00 [Normal decode])
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin ? routed to IRQ 57
        IOMMU group: 5
        Bus: primary=05, secondary=06, subordinate=06, sec-latency=0
        I/O behind bridge: 00000000-00000fff [size=4K]
        Memory behind bridge: 00200000-003fffff [size=2M]
        Prefetchable memory behind bridge: 0000800005000000-00008000051fffff [size=2M]
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [60] Express (v2) Downstream Port (Slot+), MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #1, Speed 32GT/s, Width x2, ASPM not supported
                        ClockPM- Surprise+ LLActRep+ BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (downgraded), Width x2 (ok)
                        TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
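When only the link fields are of interest, it can be quicker to filter the output. This is a minimal sketch; substitute the bus/device/function of the port being debugged:

# lspci -vv -s 05:00.0 | grep -E 'LnkCap|LnkSta'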
Enumeration
The next stage is PCIe enumeration. This is the process software uses to discover all the devices present in the fabric. It does this by reading the first register of every possible device to see which ones respond. The first register of every device contains its vendor ID and device ID, which uniquely identify the device. PCIe enumeration is done twice during boot: once by UEFI and then again by Linux. Every device detected by Linux PCIe enumeration is listed by lspci. If a device shows up here, it is present in the system and responded correctly to a configuration read. This says nothing about the functionality of the device or its driver.
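To see the vendor and device IDs that enumeration reads, and which bridge each device sits behind, the numeric and tree views of lspci are useful. A minimal sketch (output will differ per system):

# lspci -nn        # list devices with [vendor:device] IDs
# lspci -tv        # show the bus topology as a tree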
Resource Allocation
After enumeration, the operating system performs PCIe resource allocation. If resource allocation fails, some devices will be unavailable to the OS. There are three kinds of PCIe resources: bus numbers, I/O space, and memory space. BlueField does not support PCIe I/O space, so only the other two are of interest. Every platform, including BlueField, supports 256 bus numbers, and running out of this resource is unlikely. If PCIe memory space runs out, messages like these are reported by dmesg:
[    0.781698] pci 0000:21:00.0: BAR 6: no space for [mem size 0x00100000 pref]
[    0.781700] pci 0000:21:00.0: BAR 6: failed to assign [mem size 0x00100000 pref]
[    0.781703] pci 0000:21:00.1: BAR 6: no space for [mem size 0x00100000 pref]
[    0.781705] pci 0000:21:00.1: BAR 6: failed to assign [mem size 0x00100000 pref]
There are two types of PCIe memory space: 32-bit and 64-bit. The width refers to the size of the addresses used. BlueField-3 supports 2GB of 32-bit PCIe memory space and 128TB of 64-bit PCIe memory space. The 32-bit space is in the range 0x7fff_0000_0000 to 0x7fff_7fff_ffff. The 64-bit space is in the range 0x8000_0000_0000 to 0xffff_ffff_ffff. Even though the 64-bit space is huge, it is still possible to run out because some devices support a limited number of address bits. Also, because memory space allocations are required to be a power of 2 in size and naturally aligned, large chunks of the address space sometimes cannot be used. If memory space allocation fails, it can be helpful to review the contents of /proc/iomem, which lists all the available ranges and which ranges have been allocated to each device.
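A quick way to spot allocation failures and then inspect the address map is sketched below; the grep pattern matches the messages shown above and may need adjusting for other kernel versions:

# dmesg | grep -E 'BAR [0-9]+: (no space|failed to assign)'
# cat /proc/iomem          # run as root so real addresses are shown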
Depending on the Linux configuration, it may either keep the resource allocation done by UEFI or discard those settings and do its own allocation. This behavior can be controlled by adding "pci=realloc=on" or "pci=realloc=off" to the kernel command line.
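On a GRUB-based installation (an assumption; how the kernel command line is set depends on the distribution and boot flow), the parameter could be added roughly like this:

# Append pci=realloc=on to the kernel command line and regenerate the GRUB config
sudo sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 pci=realloc=on"/' /etc/default/grub
sudo update-grub          # on RHEL-based systems: grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot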
Device Drivers
If enumeration and resource allocation succeed but the device services are still not available, then the issue is probably with the driver. If lspci -v shows a line labeled "Kernel driver in use:", then that driver is successfully attached to the device (the "Kernel modules:" line only lists the modules capable of driving it). In the example below, it is the nvme driver.
# lspci -v -s 6:0.0
06:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller BG4 (DRAM-less) (prog-if 02 [NVM Express])
        Subsystem: KIOXIA Corporation NVMe SSD Controller BG4 (DRAM-less)
        Physical Slot: 0
        Flags: bus master, fast devsel, latency 0, IRQ 61, IOMMU group 6
        Memory at 7fff00200000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Express Endpoint, MSI 00
        Capabilities: [80] Power Management version 3
        Capabilities: [90] MSI: Enable- Count=1/32 Maskable+ 64bit+
        Capabilities: [b0] MSI-X: Enable+ Count=32 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Virtual Channel
        Capabilities: [260] Latency Tolerance Reporting
        Capabilities: [300] Secondary PCI Express
        Capabilities: [400] L1 PM Substates
        Kernel driver in use: nvme
        Kernel modules: nvme
If that line is missing, then the driver is either missing or the attach failed. In either case, searching for the name of the driver in the dmesg output should provide more information.
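For the NVMe example above, such a search might look like the following (the device address is taken from the earlier lspci output):

# dmesg | grep -i nvme
# lspci -k -s 06:00.0      # confirms which driver, if any, is bound to the device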
UEFI Enumeration
If debugging from Linux is difficult or not available, then the UEFI Internal Shell can be used to see the results of PCIe enumeration as done by UEFI. To enter the shell, press Esc on the console when UEFI starts to boot. From the menu, select "Boot Manager" and then scroll down to "EFI Internal Shell". The relevant commands are "pci", "devices" and "drivers". The "help" command will give usage information for each command.
Shell> pci
   Seg  Bus  Dev  Func
   ---  ---  ---  ----
    00   00   00    00 ==> Bridge Device - PCI/PCI bridge
             Vendor 15B3 Device A2DA Prog Interface 0
    00   01   00    00 ==> Bridge Device - PCI/PCI bridge
             Vendor 15B3 Device 197B Prog Interface 0
    00   02   00    00 ==> Bridge Device - PCI/PCI bridge
             Vendor 15B3 Device 197B Prog Interface 0
    00   02   03    00 ==> Bridge Device - PCI/PCI bridge
             Vendor 15B3 Device 197B Prog Interface 0
    00   03   00    00 ==> Network Controller - Ethernet controller
             Vendor 15B3 Device A2DC Prog Interface 0
    00   03   00    01 ==> Network Controller - Ethernet controller
             Vendor 15B3 Device A2DC Prog Interface 0
    00   04   00    00 ==> Bridge Device - PCI/PCI bridge
             Vendor 15B3 Device 197B Prog Interface 0
    00   05   00    00 ==> Bridge Device - PCI/PCI bridge
             Vendor 15B3 Device 197B Prog Interface 0
    00   06   00    00 ==> Mass Storage Controller - Non-volatile memory subsystem
             Vendor 1E0F Device 0001 Prog Interface 2
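The devices and drivers commands can be run in the same way to list the UEFI handles and the drivers bound to them. The -b option paginates long output; it is a common UEFI Shell convention but is noted here as an assumption for your firmware build:

Shell> devices -b
Shell> drivers -b
Shell> help pci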
Missing PCIe Devices
If running lspci on the BlueField produces no output and all PCIe devices are missing, the device is in Livefish mode. In that case, the NIC firmware needs to be reinstalled.
Insufficient Power on the PCIe Slot
If you see the error "Insufficient power on the PCIe slot" in dmesg, please consult the Specifications section of your BlueField Hardware User Guide to ensure that your DPU is receiving the appropriate power supply.
To check the power capacity of your host's PCIe slots, execute the command lspci -vvv | grep PowerLimit. For instance:
# lspci -vvv | grep PowerLimit
Slot #6, PowerLimit 75.000W; Interlock- NoCompl-
Slot #1, PowerLimit 75.000W; Interlock- NoCompl-
Slot #4, PowerLimit 75.000W; Interlock- NoCompl-
Be aware that this command is not supported by all host vendors/types.
Obtaining the Complete PCIe Device Description
The lspci command may not display the complete descriptions for the NVIDIA PCIe devices connected to your host. For example:
# lspci | grep -i Mellanox
a3:00.0 Infiniband controller: Mellanox Technologies Device a2d6 (rev 01)
a3:00.1 Infiniband controller: Mellanox Technologies Device a2d6 (rev 01)
a3:00.2 DMA controller: Mellanox Technologies Device c2d3 (rev 01)
To see the full descriptions for these devices, please run the following command:
# update-pciids
After doing this, you should be able to view the complete details for those devices. For example:
# lspci | grep -i Mellanox
a3:00.0 Infiniband controller: Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller (rev 01)
a3:00.1 Infiniband controller: Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller (rev 01)
a3:00.2 DMA controller: Mellanox Technologies MT42822 BlueField-2 SoC Management Interface (rev 01)
Managing Two BlueField Platforms in the Same Server
This example demonstrates how to manage two BlueField platforms installed in the same server (the process is similar for additional platforms).
This example assumes that the RShim package has already been installed on the host server.
Configuring Management Interface on Host
This example is relevant for CentOS/RHEL operating systems only.
1. Create a br_tmfifo bridge interface under /etc/sysconfig/network-scripts. Run:
   vim /etc/sysconfig/network-scripts/ifcfg-br_tmfifo
2. Inside ifcfg-br_tmfifo, insert the following content:
   DEVICE="br_tmfifo"
   BOOTPROTO="static"
   IPADDR="192.168.100.1"
   NETMASK="255.255.255.0"
   ONBOOT="yes"
   TYPE="Bridge"
3. Create a configuration file for the first BlueField platform, tmfifo_net0. Run:
   vim /etc/sysconfig/network-scripts/ifcfg-tmfifo_net0
4. Inside ifcfg-tmfifo_net0, insert the following content:
   DEVICE=tmfifo_net0
   BOOTPROTO=none
   ONBOOT=yes
   NM_CONTROLLED=no
   BRIDGE=br_tmfifo
5. Create a configuration file for the second BlueField platform, tmfifo_net1, following the same pattern (e.g. vim /etc/sysconfig/network-scripts/ifcfg-tmfifo_net1), and insert the following content:
   DEVICE=tmfifo_net1
   BOOTPROTO=none
   ONBOOT=yes
   NM_CONTROLLED=no
   BRIDGE=br_tmfifo
6. Create the rules for the tmfifo_net interfaces. Run:
   vim /etc/udev/rules.d/91-tmfifo_net.rules
7. Restart the network for the changes to take effect. Run:
   # /etc/init.d/network restart
   Restarting network (via systemctl):  [ OK ]
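After the restart, the bridge and the tmfifo interfaces attached to it can be checked with iproute2. A minimal sketch; the interface names assume the configuration above:

# ip addr show br_tmfifo
# ip link show master br_tmfifo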
Configuring BlueField Platform Side
The BlueField platforms are shipped with the following factory default configuration for tmfifo_net0:
Address | Value
MAC     | 00:1a:ca:ff:ff:01
IP      | 192.168.100.2
Therefore, if you are working with more than one platform, you must change the default MAC and IP addresses.
Updating the RShim Network MAC Address
This procedure is relevant for Ubuntu/Debian (sudo needed) and CentOS BFBs. The procedure only affects the tmfifo_net0 interface on the Arm side.
1. Use a Linux console application (e.g. screen or minicom) to log into each BlueField. For example:
   # sudo screen /dev/rshim<0|1>/console 115200
2. Create a configuration file for the tmfifo_net0 MAC address. Run:
   # sudo vi /etc/bf.cfg
3. Inside bf.cfg, insert the new MAC:
   NET_RSHIM_MAC=00:1a:ca:ff:ff:03
4. Apply the new MAC address. Run:
   sudo bfcfg
5. Repeat this procedure for the second BlueField platform (using a different MAC address).
Info: The Arm must be rebooted for this configuration to take effect. It is recommended to update the IP address before doing so to avoid unnecessary reboots.
For a comprehensive list of the supported parameters to customize bf.cfg during BFB installation, refer to section "bf.cfg Parameters".
Updating an IP Address
For Ubuntu:
1. Access the file 50-cloud-init.yaml and modify the tmfifo_net0 IP address:
   sudo vim /etc/netplan/50-cloud-init.yaml

   tmfifo_net0:
       addresses:
           - 192.168.100.2/30 ===>>> 192.168.100.3/30

2. Reboot the Arm. Run:
   sudo reboot
   Alternatively, the change can be applied without a reboot by running netplan apply.
3. Repeat this procedure for the second BlueField platform (using a different IP address).
Info: The Arm must be rebooted for this configuration to take effect. It is recommended to update the MAC address before doing so to avoid unnecessary reboots.
For CentOS:
1. Access the file ifcfg-tmfifo_net0. Run:
   # vim /etc/sysconfig/network-scripts/ifcfg-tmfifo_net0
2. Modify the value for IPADDR:
   IPADDR=192.168.100.3
3. Reboot the Arm. Run:
   reboot
4. Repeat this procedure for the second BlueField platform (using a different IP address).
Info: The Arm must be rebooted for this configuration to take effect. It is recommended to update the MAC address before doing so to avoid unnecessary reboots.
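Once the Arm comes back up, the new address can be confirmed on each platform, or by pinging it from the host over the bridge. A minimal check; the IP assumes the example value used above:

# ip addr show tmfifo_net0
# ping -c 3 192.168.100.3      # from the host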