The system panics when it is booted with a failed adapter installed.
Malfunctioning hardware component.
NVIDIA adapter is not identified as a PCI device.
Malfunctioning PCI slot or adapter PCI connector.
NVIDIA adapters are not installed in the system.
The installed NVIDIA adapter is misidentified.
Run one of the commands below and check the NVIDIA MAC address to identify the installed NVIDIA adapter:
'lspci | grep Mellanox' or 'lspci -d 15b3:'
Note: NVIDIA MACs start with 00:02:C9:xx:xx:xx, 00:25:8B:xx:xx:xx or F4:52:14:xx:xx:xx.
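The prefix check described in the note can be sketched as a small shell helper. The function name is_nvidia_mac is hypothetical, and the sketch assumes a colon-separated 6-byte MAC address such as the one Linux exposes in /sys/class/net/<interface>/address for Ethernet interfaces.

```shell
# Return 0 if the given MAC address begins with one of the prefixes
# listed in the note above, 1 otherwise.
# is_nvidia_mac is a hypothetical helper name, not part of the driver package.
is_nvidia_mac() {
    # Normalize hex digits to upper case so 00:02:c9 and 00:02:C9 both match.
    nv_mac=$(printf '%s' "$1" | tr 'a-f' 'A-F')
    case "$nv_mac" in
        00:02:C9:*|00:25:8B:*|F4:52:14:*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage on a Linux host (prints the sysfs path of each match):
# for f in /sys/class/net/*/address; do
#     is_nvidia_mac "$(cat "$f")" && echo "${f%/address}"
# done
```

A match only means that the address carries one of the listed prefixes; the lspci output remains the primary way to confirm that the adapter is visible on the PCI bus.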
The default device may vary when invoking user apps (such as ibv_asyncwatch) which run using a specific device.
The default device for such apps is the first device in the device list generated by libibverbs. This first device in the list varies, depending on which and how many InfiniBand devices are installed on the host, which slot the devices are installed on, whether they use SR-IOV, and other factors.
Always specify the desired device explicitly when running userspace apps, by using the provided command line parameter (for example: ibv_asyncwatch -d <dev>).
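As a rough illustration of choosing the device explicitly, the sketch below lists the device names that the kernel exposes under /sys/class/infiniband and checks that a given name exists before it is passed to -d. The helper names are hypothetical, mlx5_0 is a placeholder device name, and the sysfs listing order is not necessarily the order in which libibverbs reports devices.

```shell
# List InfiniBand device names as exposed by the kernel in sysfs.
# The optional argument overrides the directory; it exists for testing.
# list_ib_devices and ib_device_exists are hypothetical helper names.
list_ib_devices() {
    ib_dir="${1:-/sys/class/infiniband}"
    [ -d "$ib_dir" ] || return 1
    ls -1 "$ib_dir"
}

# Succeed only if the named device ($1) is present in the directory ($2).
ib_device_exists() {
    [ -n "$1" ] && [ -e "${2:-/sys/class/infiniband}/$1" ]
}

# Example usage (mlx5_0 is a placeholder device name):
# list_ib_devices
# ib_device_exists mlx5_0 && ibv_asyncwatch -d mlx5_0
```

Checking the name first makes a script fail early when the expected device is missing, instead of silently running against whichever device happens to be first in the list.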
Insufficient memory to be used by udev upon OS boot.
udev is designed to fork() a new process for each event it receives so that it can handle many events in parallel, and each udev instance consumes some RAM.
Limit the udev instances running simultaneously per boot by adding udev.children-max=<number> to the kernel command line in grub.
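After the parameter has been added and the system rebooted, one way to confirm that it reached the kernel is to look for it in /proc/cmdline. The helper below is a minimal sketch with a hypothetical name; it only reads the command line and does not modify the GRUB configuration.

```shell
# Print the udev.children-max value found on a kernel command line,
# or return 1 if the parameter is absent.
# The optional argument is the file to read (default: /proc/cmdline).
# get_udev_children_max is a hypothetical helper name.
get_udev_children_max() {
    cmdline_file="${1:-/proc/cmdline}"
    # Put each parameter on its own line, then strip the key from ours.
    udev_max=$(tr ' ' '\n' < "$cmdline_file" \
        | sed -n 's/^udev\.children-max=//p' | head -n 1)
    [ -n "$udev_max" ] || return 1
    printf '%s\n' "$udev_max"
}

# Example usage:
# get_udev_children_max || echo "udev.children-max is not set"
```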
An operating system running from a root file system located on remote storage (over NVIDIA devices) hangs during reboot/shutdown (errors such as "No such file or directory" appear).
The operating system calls the openibd service script with the 'stop' option. This option unloads the driver stack. Therefore, the OS root file system disappears before the reboot/shutdown procedure is completed, leaving the OS in a hung state.
Disable the openibd 'stop' option by setting 'ALLOW_STOP=no' in the /etc/infiniband/openib.conf configuration file.
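The setting can be applied by editing the file by hand; for scripted setups, a minimal sketch is shown below. The function name is hypothetical, and the sketch assumes that the file holds plain KEY=value lines and ends with a newline. It rewrites an existing ALLOW_STOP line or appends one if none is present.

```shell
# Ensure the given config file contains ALLOW_STOP=no.
# The optional argument is the file path (default: /etc/infiniband/openib.conf).
# set_allow_stop_no is a hypothetical helper name.
set_allow_stop_no() {
    oib_conf="${1:-/etc/infiniband/openib.conf}"
    if grep -q '^ALLOW_STOP=' "$oib_conf"; then
        # Rewrite through a temporary file, then copy the contents back
        # so the original file keeps its ownership and permissions.
        oib_tmp=$(mktemp) || return 1
        sed 's/^ALLOW_STOP=.*/ALLOW_STOP=no/' "$oib_conf" > "$oib_tmp" \
            && cat "$oib_tmp" > "$oib_conf"
        oib_status=$?
        rm -f "$oib_tmp"
        return $oib_status
    else
        printf 'ALLOW_STOP=no\n' >> "$oib_conf"
    fi
}

# Example usage (requires write access to the file):
# set_allow_stop_no
```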
The NVIDIA adapter prints the following warning to dmesg:
Detected insufficient power on the PCIe slot (xxxW).
Insufficient PCI power.
Investigate the cause of the insufficient PCI power.
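As a starting point for the investigation, the warning can be pulled out of the kernel log to see which device reported it and what wattage was detected. The filter below is a minimal sketch with a hypothetical name; it matches the fixed part of the message quoted above.

```shell
# Read kernel log text on stdin and print only the lines that contain
# the insufficient-power warning. Returns 1 if no such line is found.
# find_pcie_power_warnings is a hypothetical helper name.
find_pcie_power_warnings() {
    grep -F 'Detected insufficient power on the PCIe slot'
}

# Example usage:
# dmesg | find_pcie_power_warnings
```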