Reset Flow is activated by default once a "fatal device" error is recognized. Both the HCA and the software are reset, the ULPs and user applications are notified about it, and a recovery process is performed once the event is raised.
- In mlx4 devices, "Reset Flow" is activated by default. It can be disabled using the mlx4_core module parameter internal_err_reset (default value is 1).
- In mlx5 devices, "Reset Flow" is activated by default. Currently, it can be triggered by a firmware assert with Recover Flow Request (RFR) only. Firmware RFR support should be enabled explicitly using mlxconfig commands.
- For mlx4 devices, a “fatal device” error can be a timeout from a firmware command, an error on a firmware closing command, communication channel not being responsive in a VF, etc.
- For mlx5 devices, a “fatal device” error is a firmware assert combined with the Recover Flow Request bit.
To query the current value of SW_RECOVERY_ON_ERRORS, run:
mlxconfig -d /dev/mst/mt4115_pciconf0 query | grep SW_RECOVERY_ON_ERRORS
To enable RFR bit support, run:
mlxconfig -d /dev/mst/mt4115_pciconf0 set SW_RECOVERY_ON_ERRORS=true
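The query above can be narrowed to just the value with a small filter. This is a minimal sketch: mlxconfig_value is a hypothetical helper name, and it assumes the query output lists one parameter per line with the name in the first whitespace-separated column and the value in the second.

```shell
# Print the value column for one parameter from `mlxconfig ... query` output.
# Assumption: each parameter line looks like "<NAME> <VALUE>" (whitespace separated).
# Returns non-zero if the parameter name is not found in the input.
mlxconfig_value() {
    awk -v name="$1" '
        $1 == name { print $2; found = 1; exit }
        END        { exit !found }
    '
}
```

A possible invocation, reusing the device path from the commands above: mlxconfig -d /dev/mst/mt4115_pciconf0 query | mlxconfig_value SW_RECOVERY_ON_ERRORS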
Once a "fatal device" error is recognized, an IB_EVENT_DEVICE_FATAL event is created and ULPs are notified about the incident. Outstanding WQEs are simulated as being returned with a "flush in error" status. This lets each ULP close its resources without getting stuck when its "remove_one" callback is called as part of "Reset Flow".
Once the unload part is complete, each ULP is called with its "add_one" callback, its resources are re-initialized, and it is re-activated.
User Space Applications (IB/RoCE)
Once a "fatal device" error is recognized, an IB_EVENT_DEVICE_FATAL event is created, applications are notified about the incident, and relevant recovery actions are taken.
Applications that ignore this event enter a zombie state, where each command sent to the kernel is returned with an error, and no completion on outstanding WQEs is expected.
Applications are expected to register to receive such events and to recover once the above event is raised. The same behavior is expected if the NIC is unbound from the PCI and later rebound. Applications running over RDMA CM should behave in the same manner once the RDMA_CM_EVENT_DEVICE_REMOVAL event is raised.
The following is an example of using unbind/bind for the NIC defined by "0000:04:00.0":
echo 0000:04:00.0 > /sys/bus/pci/drivers/mlx4_core/unbind
echo 0000:04:00.0 > /sys/bus/pci/drivers/mlx4_core/bind
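When the unbind/bind sequence is repeated during testing, it may be convenient to wrap it in a function. The following is a minimal sketch, not a vendor-provided tool: rebind_pci_device is a hypothetical name, the one-second default pause between the two writes is an assumption, and the driver directory is passed as an argument so the function can be tried against a scratch directory first.

```shell
# Unbind a PCI device from its driver, wait briefly, then bind it again.
#   $1  driver directory, e.g. /sys/bus/pci/drivers/mlx4_core
#   $2  PCI address,      e.g. 0000:04:00.0
#   $3  pause in seconds between unbind and bind (assumed default: 1)
rebind_pci_device() {
    drv_dir=$1
    bdf=$2
    delay=${3:-1}

    # Both control files must exist and be writable (normally requires root).
    [ -w "$drv_dir/unbind" ] && [ -w "$drv_dir/bind" ] || {
        echo "cannot write to $drv_dir/unbind or $drv_dir/bind" >&2
        return 1
    }

    echo "$bdf" > "$drv_dir/unbind" || return 1
    sleep "$delay"
    echo "$bdf" > "$drv_dir/bind"
}
```

A possible invocation for the NIC above: rebind_pci_device /sys/bus/pci/drivers/mlx4_core 0000:04:00.0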
If the Physical Function (PF) recognizes the error, it notifies all the VFs about it by marking their communication channel with that information. Consequently, all the VFs and the PF are reset.
If the VF encounters an error, only that VF is reset, whereas the PF and other VFs continue to work unaffected.
Forcing the VF to Reset
If an outside "reset" is forced by using the PCI sysfs entry for a VF, a reset is executed on that VF once it runs any command over its communication channel.
For example, the below command can be used on a hypervisor to reset a VF defined by 0000:04:00.1:
echo 1 >/sys/bus/pci/devices/0000:04:00.1/reset
Advanced Error Reporting (AER) in ConnectX-3 and ConnectX-3 Pro
AER, a mechanism used by the driver to get notifications upon PCI errors, is supported only in native mode. ULPs are called with remove_one/add_one and are expected to continue working properly after that flow. User space applications work in the same mode as defined in the "Reset Flow" above.
Extended Error Handling (EEH)
Extended Error Handling (EEH) is a PowerPC mechanism that encapsulates AER, thus exposing AER events to the operating system as EEH events.
The behavior of ULPs and user space applications is identical to the behavior of AER.
The CRDUMP feature takes an automatic snapshot of the device CR-Space in case the device's FW/HW fails to function properly.
- ConnectX-3 adapters family - the snapshot is triggered in case the driver detects any of the following issues:
  - Critical event, such as a command timeout
  - Critical FW command failure
  - PCI errors
  - Internal FW error
- ConnectX-4/ConnectX-5 adapters family - the snapshot is triggered after firmware detects a critical issue, requiring a recovery flow (see Reset Flow).
This snapshot can later be investigated and analyzed to track the root cause of the failure.
Currently, only the first snapshot is stored, and is exposed using a temporary virtual file. The virtual file is cleared upon driver reset.
When a critical event is detected, a message indicating CRDUMP collection is printed to the Linux log. The user should then back up the file pointed to in the printed message. The file location format is:
- For mlx4 driver: /proc/driver/mlx4_core/crdump/<pci address>
- For mlx5 driver: /proc/driver/mlx5_core/crdump/<pci address>
Example - the following message is printed to the log:
[257480.719070] mlx4_core 0000:00:05.0: Internal error detected:
[257480.726019] mlx4_core 0000:00:05.0: buf: 0fffffff
[257480.732082] mlx4_core 0000:00:05.0: buf: 00000000
....
[257480.806531] mlx4_core 0000:00:05.0: buf[0f]: 00000000
[257480.811534] mlx4_core 0000:00:05.0: device is going to be reset
[257482.781154] mlx4_core 0000:00:05.0: crdump: Crash snapshot collected to /proc/driver/mlx4_core/crdump/0000:00:05.0
[257483.789230] mlx4_core 0000:00:05.0: device was reset successfully
The snapshot should be copied using standard Linux tools for future investigation.
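Because the virtual file is cleared upon driver reset, it can help to script the backup. The sketch below is one possible approach, not a vendor tool: crdump_path_from_log and backup_crdump are hypothetical helper names, the output file naming scheme is an assumption, and the log pattern is taken from the example message above.

```shell
# Read kernel log text on stdin and print the first snapshot path that
# follows "crdump: Crash snapshot collected to". Prints nothing if absent.
crdump_path_from_log() {
    sed -n 's/.*crdump: Crash snapshot collected to \([^ ]*\).*/\1/p' | head -n 1
}

# Copy the snapshot to a regular file and print the name of the copy.
#   $1  snapshot path reported by the driver
#   $2  destination directory (assumed default: current directory)
backup_crdump() {
    src=$1
    dest_dir=${2:-.}
    # Name the copy after the PCI address plus a timestamp (naming is arbitrary).
    out="$dest_dir/crdump_$(basename "$src")_$(date +%Y%m%d%H%M%S)"
    cat "$src" > "$out" && echo "$out"
}
```

A possible invocation: backup_crdump "$(dmesg | crdump_path_from_log)" /var/tmp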
In the mlx4 driver, CRDUMP is not collected if the internal_err_reset module parameter is set to 0.
This mechanism allows the device's FW/HW to log important events into the event tracing system (/sys/kernel/debug/tracing) without requiring any Mellanox tool.
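Before relying on CRDUMP with the mlx4 driver, the current value of internal_err_reset can be checked. The sketch below assumes the parameter is readable through the standard module parameter location /sys/module/mlx4_core/parameters/internal_err_reset; that path and the helper name crdump_allowed are assumptions, not taken from the text above.

```shell
# Succeed (exit 0) if internal_err_reset is anything other than 0,
# i.e. Reset Flow and therefore CRDUMP collection are not disabled.
#   $1  parameter file (assumed default location shown below)
crdump_allowed() {
    param=${1:-/sys/module/mlx4_core/parameters/internal_err_reset}
    [ -r "$param" ] || { echo "cannot read $param" >&2; return 2; }
    [ "$(cat "$param")" != "0" ]
}
```

A possible invocation: crdump_allowed || echo "CRDUMP will not be collected"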
To be able to use this feature, trace points must be enabled in the kernel.
This feature is enabled by default, and can be controlled using sysfs commands.
To disable the feature:
echo 0 > /sys/kernel/debug/tracing/events/mlx5/fw_tracer/enable
To enable the feature:
echo 1 > /sys/kernel/debug/tracing/events/mlx5/fw_tracer/enable
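The two commands above can be folded into one helper that validates its argument and reports the resulting state. This is a minimal sketch: set_fw_tracer is a hypothetical name, and the tracing root is a parameter only so that the function can be exercised against a scratch directory.

```shell
# Enable (1) or disable (0) the mlx5 fw_tracer event and print the new state.
#   $1  desired state: 0 or 1
#   $2  tracing root (default: /sys/kernel/debug/tracing)
set_fw_tracer() {
    state=$1
    tracing_root=${2:-/sys/kernel/debug/tracing}
    ctl="$tracing_root/events/mlx5/fw_tracer/enable"

    case $state in
        0|1) ;;
        *) echo "state must be 0 or 1" >&2; return 2 ;;
    esac

    # The control file is missing if trace points are not enabled in the
    # kernel or the tracing filesystem is not mounted at this location.
    [ -e "$ctl" ] || { echo "$ctl not found" >&2; return 1; }

    echo "$state" > "$ctl" && cat "$ctl"
}
```

A possible invocation: set_fw_tracer 0 to disable the feature, set_fw_tracer 1 to enable it again.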
To view FW traces using vim text editor: