Reset Flow

Reset Flow is enabled by default. When a "fatal device" error is detected, both the HCA and the software are reset, the ULPs and user applications are notified, and a recovery process runs once the event is raised.

Currently, a reset flow can be triggered only by a firmware assert with a Recover Flow Request (RFR). Firmware RFR support must be enabled explicitly using mlxconfig commands.

To query the current value, run:

            mlxconfig -d /dev/mst/mt4115_pciconf0 query | grep SW_RECOVERY_ON_ERRORS

To enable RFR bit support, run:

            mlxconfig -d /dev/mst/mt4115_pciconf0 set SW_RECOVERY_ON_ERRORS=true
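A script can check the queried value before deciding whether to run the set command. The following is a minimal sketch rather than an NVIDIA-provided tool: the function name rfr_enabled is hypothetical, and it assumes that the query output lists the parameter name in the first column and a value beginning with "True" or "False" (for example "True(1)") in the second.

```shell
#!/bin/sh
# Hypothetical helper: reads `mlxconfig ... query` output on stdin and
# returns success only if SW_RECOVERY_ON_ERRORS is reported as true.
# Assumption: the line looks like "SW_RECOVERY_ON_ERRORS   True(1)".
rfr_enabled() {
    awk '
        $1 == "SW_RECOVERY_ON_ERRORS" {
            found = 1
            ok = (tolower($2) ~ /^true/)
        }
        END {
            if (found && ok) exit 0
            exit 1
        }
    '
}

# Example use (needs the MST device from the commands above):
#   mlxconfig -d /dev/mst/mt4115_pciconf0 query | rfr_enabled \
#       && echo "RFR support is enabled"
```

The function only inspects text, so it can be tried against saved query output without touching the device.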

Kernel ULPs

When a "fatal device" error is detected, an IB_EVENT_DEVICE_FATAL event is generated and the ULPs are notified about the incident. Outstanding WQEs are simulated as returned with a "flush in error" status so that each ULP can close its resources, through a call to its remove_one callback as part of the Reset Flow, without getting stuck.

Once the unload phase completes, each ULP's add_one callback is called, its resources are re-initialized, and it is reactivated.

User Space Applications (IB/RoCE)

When a "fatal device" error is detected, an IB_EVENT_DEVICE_FATAL event is generated, applications are notified about the incident, and the relevant recovery actions are taken.

Applications that ignore this event enter a zombie state, in which every command sent to the kernel returns an error and no completions for outstanding WQEs are expected.

Applications are expected to register for such events and to recover once the above event is raised. The same behavior is expected if the NIC is unbound from the PCI and later rebound. Applications running over RDMA CM should behave in the same manner once the RDMA_CM_EVENT_DEVICE_REMOVAL event is raised.

The following example unbinds and then rebinds the NIC at PCI address 0000:04:00.0:

            echo 0000:04:00.0 > /sys/bus/pci/drivers/mlx5_core/unbind
            echo 0000:04:00.0 > /sys/bus/pci/drivers/mlx5_core/bind
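When exercising application recovery repeatedly, the unbind/bind sequence above can be wrapped in a small helper. This is a minimal sketch under stated assumptions: the function name rebind_nic and the DRIVER_DIR variable are hypothetical names introduced here for illustration, with DRIVER_DIR defaulting to the mlx5_core driver directory used above. Writing to the real sysfs files requires root privileges.

```shell
#!/bin/sh
# Hypothetical helper: unbind a NIC from its driver, then bind it again.
# DRIVER_DIR is an illustrative override; it defaults to the path used above.
DRIVER_DIR="${DRIVER_DIR:-/sys/bus/pci/drivers/mlx5_core}"

rebind_nic() {
    pci_addr="$1"
    # Accept only a full domain:bus:device.function address, e.g. 0000:04:00.0.
    case "$pci_addr" in
        [0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F].[0-7]) ;;
        *)
            echo "invalid PCI address: $pci_addr" >&2
            return 2
            ;;
    esac
    # Stop if the unbind fails, so that bind is never attempted on its own.
    echo "$pci_addr" > "$DRIVER_DIR/unbind" || return 1
    echo "$pci_addr" > "$DRIVER_DIR/bind"
}

# Example use (as root):
#   rebind_nic 0000:04:00.0
```

Validating the address before writing avoids passing a mistyped value to the driver.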

SR-IOV

If the PF detects the error, it notifies all the VFs by marking their communication channels with that information. Consequently, all the VFs and the PF are reset.

If the VF encounters an error, only that VF is reset, whereas the PF and other VFs continue to work unaffected.

Forcing the VF to Reset

If an external reset is forced through the PCIe sysfs entry of a VF, the reset is executed on that VF the next time it runs a command over its communication channel.

For example, the following command can be used on a hypervisor to reset a VF defined by 0000:04:00.1:

            echo 1 >/sys/bus/pci/devices/0000:04:00.1/reset

Extended Error Handling (EEH)

Extended Error Handling (EEH) is a PowerPC mechanism that encapsulates PCIe Advanced Error Reporting (AER), thus exposing AER events to the operating system as EEH events.

With EEH, ULPs and user space applications behave the same way as they do with AER.

CRDUMP

The CRDUMP feature takes an automatic snapshot of the device CR-Space when the device's firmware or hardware fails to function properly.

Snapshot Triggers:

The snapshot is triggered when the firmware detects a critical issue that requires a recovery flow.

This snapshot can later be investigated and analyzed to track the root cause of the failure.

Currently, only the first snapshot is stored, and it is exposed through a temporary virtual file. The virtual file is cleared upon driver reset.

When a critical event is detected, a message indicating CRDUMP collection is printed to the Linux log. The user should then back up the file indicated in the printed message. The file location format is: /proc/driver/mlx5_core/crdump/<pci address>.

Copy the snapshot with a standard Linux tool so that it remains available for later investigation.
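Because the virtual file is cleared upon driver reset, the copy is best taken as soon as the log message appears. The following is a minimal sketch of such a backup step, not an NVIDIA-provided tool: the function name backup_crdump, the CRDUMP_DIR variable, and the timestamped file naming are hypothetical choices made for illustration. CRDUMP_DIR defaults to the location format given above.

```shell
#!/bin/sh
# Hypothetical helper: copy the CRDUMP snapshot of one device to a backup directory.
# CRDUMP_DIR is an illustrative override; it defaults to the documented location.
CRDUMP_DIR="${CRDUMP_DIR:-/proc/driver/mlx5_core/crdump}"

backup_crdump() {
    pci_addr="$1"
    dest_dir="$2"
    src="$CRDUMP_DIR/$pci_addr"

    # Fail early if no snapshot is exposed for this device.
    if [ ! -r "$src" ]; then
        echo "no readable snapshot at $src" >&2
        return 1
    fi

    mkdir -p "$dest_dir" || return 1
    dest="$dest_dir/crdump_${pci_addr}_$(date +%Y%m%d_%H%M%S)"

    # cat reads until end of file, so it does not rely on the size
    # reported for the virtual file.
    cat "$src" > "$dest" || return 1

    # Print the backup path so that callers can record it.
    echo "$dest"
}

# Example use (as root):
#   backup_crdump 0000:04:00.0 /var/tmp/crdump-backups
```

The timestamp in the file name keeps snapshots from separate incidents apart, since only one snapshot is exposed at a time.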

Firmware Tracer

This mechanism allows the device's firmware/hardware to log important events into the event tracing system (/sys/kernel/debug/tracing) without requiring any NVIDIA tool.

Note

To be able to use this feature, trace points must be enabled in the kernel.

This feature is enabled by default, and can be controlled using sysfs commands.

To disable the feature:

            echo 0 > /sys/kernel/debug/tracing/events/mlx5/fw_tracer/enable

To enable the feature:

            echo 1 > /sys/kernel/debug/tracing/events/mlx5/fw_tracer/enable

To view the firmware traces using the vim text editor:

            vim /sys/kernel/debug/tracing/trace
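The enable, disable, and view commands above can be gathered into one helper for scripted use. This is a minimal sketch under stated assumptions: the function name fw_tracer, its subcommand names, and the TRACING_DIR variable are hypothetical and introduced only for illustration. TRACING_DIR defaults to the tracing directory named above, and access to it normally requires root privileges.

```shell
#!/bin/sh
# Hypothetical helper: control and read the mlx5 firmware tracer.
# TRACING_DIR is an illustrative override; it defaults to the path used above.
TRACING_DIR="${TRACING_DIR:-/sys/kernel/debug/tracing}"

fw_tracer() {
    enable_file="$TRACING_DIR/events/mlx5/fw_tracer/enable"
    case "$1" in
        on)     echo 1 > "$enable_file" ;;    # enable the feature
        off)    echo 0 > "$enable_file" ;;    # disable the feature
        status) cat "$enable_file" ;;         # print the current enable value
        show)   cat "$TRACING_DIR/trace" ;;   # dump the trace buffer as text
        *)
            echo "usage: fw_tracer on|off|status|show" >&2
            return 2
            ;;
    esac
}

# Example use (as root):
#   fw_tracer status
#   fw_tracer show | less
```

Printing the trace with cat instead of opening an editor makes the output easy to pipe into other tools.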