Mellanox WinOF VPI Documentation v5.50.52000

Device Self-Healing

The Self-Healing feature allows the WinOF driver to recover from various error states. The feature is responsible for:

  • Detecting errors in the driver, firmware, or hardware

  • Performing the necessary actions for recovery

  • Reporting the error and the action taken

The Self-Healing mechanism consists of two main components:

  • Health-Checker: Determines when to trigger the recovery flow

  • Recovery-Flows: Restarts the miniport driver. This component consists of:

    • Miniport adapter restart
      Recovery from miniport errors: if the error is detected in the miniport driver, only the relevant miniport is restarted

[Figure: Self Healing]

The driver stack runs a continuous, periodic health-checking loop designed to detect the issues described in the “Sensors” section below.
Upon detecting an error, the health-checker reports it to the Self-Healing manager, which determines whether any recovery steps should be performed according to the configured policy.
The time interval of the periodic loop can be set by the user, per miniport driver instance, using the registry key "SHCheckForHangTimeInSeconds".

Registry key location:

HKLM\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\<nn>

For instructions on how to find the interface index <nn> in the registry, please refer to Finding the Index Value of the Network Interface.

Check-for-Hang SHCheckForHangTimeInSeconds Registry Key

Key Name:    SHCheckForHangTimeInSeconds
Key Type:    REG_DWORD
Values:      [1 - MAX_ULONG], Default: 4
Description: The interval, in seconds, for the Check-for-Hang mechanism.
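
As a minimal sketch of how this key might be set, the snippet below builds a `reg add` command line for a given interface index and validates the value against the documented range. The function name, the interface index "0007", and the assumption that MAX_ULONG is the 32-bit maximum are illustrative and not taken from the driver documentation; the command is only printed, not executed.

```python
# Network adapter class key, taken from the registry key location above.
NET_CLASS_KEY = (
    r"HKLM\SYSTEM\CurrentControlSet\Control\Class"
    r"\{4d36e972-e325-11ce-bfc1-08002be10318}"
)
MAX_ULONG = 0xFFFFFFFF  # assumption: ULONG is a 32-bit unsigned value


def build_reg_add_command(interface_index: str, seconds: int) -> str:
    """Return a `reg add` command line that sets SHCheckForHangTimeInSeconds."""
    if not 1 <= seconds <= MAX_ULONG:
        raise ValueError("SHCheckForHangTimeInSeconds must be in [1, MAX_ULONG]")
    key = f"{NET_CLASS_KEY}\\{interface_index}"
    return (
        f'reg add "{key}" /v SHCheckForHangTimeInSeconds '
        f"/t REG_DWORD /d {seconds} /f"
    )


# "0007" is a hypothetical interface index standing in for <nn>.
print(build_reg_add_command("0007", 10))
```

The generated command would need to be run from an elevated prompt on the target host; building it as a string first makes the value check easy to review before anything is changed.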

Sensors

The health state of the driver is examined by the designated sensors. Each sensor can be disabled/enabled independently. In case a specific sensor detects an error, it reports to the Self-Healing manager, logs an ETW message and executes the appropriate recovery-flow action.

In case the sensor is not activated, the error is ignored and no recovery flows are executed, but logs and dumps are generated.

Miniport Driver Sensors

The miniport sensors can be controlled per miniport instance, using the per-miniport registry key paths.

Sensor

Description

Lack of Progress in Hardware for Ethernet Driver Send Queues

The driver has posted a send WR in an Ethernet QP, but the hardware did not respond with a completion notification within a reasonable time period. The time period is determined by the Check-for-Hang mechanism and the "CheckForHangCQMaxNoProgress" registry key as follows:

  • For each cycle where the Check-for-Hang identifies that the hardware does not respond with a completion notification, a dedicated counter for this test will be incremented. When the counter reaches the "CheckForHangCQMaxNoProgress" threshold, an error will be reported to the Self-Healing manager.

  • The threshold for considering a posted WR as stuck is equal to SHCheckForHangTimeInSeconds * CheckForHangCQMaxNoProgress seconds.

  • In case the Head of Queue (HoQ) is disabled, this sensor will be ignored, and only dumps will be generated without performing any recovery steps. When the HoQ is disabled, posted WRs may complete after longer periods of time, due to congestion on the network, so this test is removed to avoid unnecessary recovery operations.

Mask: 0x00000008
ETW Event ID: 30009
Note: In a VF, this sensor is always enabled and cannot be disabled. This setting is defined to allow the Mellanox driver to go down at any time.

Lack of Progress in Software for Ethernet Driver Receive Queues

The hardware has completed a receive WR that the driver posted to an Ethernet QP, but the driver has not processed the completion event within a reasonable time period. The time period is determined by the same logic as described in the “Lack of Progress in Hardware for Ethernet Driver Send Queues” row in this table.
Mask: 0x00000040
ETW Event ID: 30011

Receive Completion Error

The hardware has reported an error in a receive WR.
Mask: 0x00010000
ETW Event ID: 30021

Send Completion Error

The hardware has reported an error in a send WR.
Mask: 0x00020000
ETW Event ID: 30022

Lack of Progress in Software for Ethernet Driver Send Queues

The hardware has completed a send WR that the driver posted to an Ethernet QP, but the driver has not processed the completion event within a reasonable time period. The time period is determined by the same logic as described in the “Lack of Progress in Hardware for Ethernet Driver Send Queues” row in this table, but multiplied by two: (2 * SHCheckForHangTimeInSeconds * CheckForHangCQMaxNoProgress).
Mask: 0x00000010
ETW Event ID: 30029
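
The timing rules in the table above can be illustrated with a short worked calculation. This is only a sketch of the documented arithmetic, not driver code; the function name and parameter names are made up for illustration, and the default arguments mirror the documented defaults of SHCheckForHangTimeInSeconds and CheckForHangCQMaxNoProgress (4 each).

```python
def stuck_threshold_seconds(check_interval_s: int = 4,
                            max_no_progress: int = 4,
                            software_send: bool = False) -> int:
    """Seconds after which a posted WR is considered stuck.

    check_interval_s stands for SHCheckForHangTimeInSeconds and
    max_no_progress for CheckForHangCQMaxNoProgress.
    """
    threshold = check_interval_s * max_no_progress
    # The software send-queue sensor uses twice the base threshold.
    return 2 * threshold if software_send else threshold


print(stuck_threshold_seconds())                    # 16
print(stuck_threshold_seconds(software_send=True))  # 32
```

With the defaults, the hardware send-queue and software receive-queue sensors would report after 16 seconds without progress, and the software send-queue sensor after 32 seconds.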

The Self-Healing feature can be controlled by registry keys. The driver detects registry changes dynamically, and updates the Self-Healing settings automatically without requiring a driver restart.

Miniport Driver Registry Keys

Registry keys location:

HKLM\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\<nn>

For instructions on how to find the interface index <nn> in the registry, please refer to Finding the Index Value of the Network Interface.

Miniport Driver Registry Keys

Key Name:    SHMPResetActiveSensorsMask
Key Type:    REG_DWORD
Values:      [0 - 0xFFFFFFFF], Default: 0xFFFFFFFF
Description: Determines which sensors are active to execute a miniport reset upon error. For the sensor activation values, please refer to Miniport Driver Sensors above.

Key Name:    SHSensorsDumpMask
Key Type:    REG_DWORD
Values:      [0 - 0xFFFFFFFF], Default: 0x0
Description: Determines which sensors are allowed to trigger the "Dump Me Now" feature upon error. For the sensor activation values, please refer to Miniport Driver Sensors above.

Key Name:    CheckForHangCQMaxNoProgress
Key Type:    REG_DWORD
Values:      [1 - 1000], Default: 4
Description: The number of Check-for-Hang cycles with no progress in HW to count before reporting an error to the Self-Healing manager.
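
The two mask keys combine the per-sensor mask bits listed in the Miniport Driver Sensors table. The sketch below shows one way to compose and decode such a value; the dictionary keys are hypothetical labels chosen for readability (the driver does not define these names), while the bit values are the ones documented above.

```python
# Sensor mask bits as listed in the Miniport Driver Sensors table.
# The string labels are illustrative only.
SENSOR_MASKS = {
    "hw_send_no_progress": 0x00000008,
    "sw_send_no_progress": 0x00000010,
    "sw_receive_no_progress": 0x00000040,
    "receive_completion_error": 0x00010000,
    "send_completion_error": 0x00020000,
}


def build_mask(*sensors: str) -> int:
    """OR together the mask bits of the named sensors."""
    mask = 0
    for name in sensors:
        mask |= SENSOR_MASKS[name]
    return mask


def decode_mask(mask: int) -> list[str]:
    """List the known sensors whose bit is set in the mask."""
    return [name for name, bit in SENSOR_MASKS.items() if mask & bit]


mask = build_mask("receive_completion_error", "send_completion_error")
print(f"0x{mask:08X}")          # 0x00030000
print(decode_mask(0x00000048))  # ['hw_send_no_progress', 'sw_receive_no_progress']
```

For example, a value of 0x00030000 in SHSensorsDumpMask would select only the two completion-error sensors for the "Dump Me Now" feature, while the default 0xFFFFFFFF in SHMPResetActiveSensorsMask has every sensor bit set.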

The Self-Healing manager records events in the System Event Viewer. Each record specifies the selected recovery flow and the reason for its execution.

Each sensor issues a unique ETW event upon error. The event can be found in the Windows Event Viewer, under "Applications and Services Logs\Mellanox-Drivers\Operational". The following table contains all event messages:

Logging - Windows Event Viewer Applications Messages

Event ID  Message
30009     <Device name>: Lack of progress in hardware for Ethernet driver send queues sensor detected an error
30029     <Device name>: Lack of progress in software for Ethernet driver send queues sensor detected an error
30011     <Device name>: Lack of progress in software for Ethernet driver receive queues sensor detected an error
30021     <Device name>: Receive completion error sensor detected an error
30022     <Device name>: Send completion error sensor detected an error
30013     <Device name>: VF communication channel error sensor detected an error

The recovery flows recorded by the Self-Healing manager, and the reasons for their execution, are detailed in the following table:

Logging - Windows Event Viewer Messages

Event ID  Message
0x008b    <Device name>: Self Healing - Failed to activate the resiliency flow as a result of a SW reset failure, error=<error id>.
          The error was reported by the sensors <sensors id>.
0x008c    Restart <Interface name> as a result of an error that was reported by the sensors <Sensors mask>
          Self healing state:
          • Restarts count: <n>
0x008d    Stopped <Interface name> activity as a result of an error that was reported by sensors <n>.
0x0100    <Device name>: dump folder (<path>) was created due to a dump-me-now request.

© Copyright 2023, NVIDIA. Last updated on May 23, 2023.