Device Self-Healing
The Self-Healing feature allows the WinOF driver to recover from various error states. The feature is responsible for:
Detecting errors in the driver, firmware, or hardware
Performing the necessary actions for recovery
Reporting the error and the action taken
The Self-Healing mechanism is comprised of two main components:
Health-Checker: Determines when to trigger the recovery flow
Recovery-Flows: Restarts the miniport driver. This component is comprised of:
Miniport adapter restart
Recovery from miniport errors: if the error is detected in the miniport driver, only the relevant miniport is restarted
Self Healing
The driver’s stacks run a continuous periodic loop of health-checking code that is designed to detect issues described in the “Sensors” section below.
Upon detecting an error, the health-checker reports it to the self-healing manager, which determines whether any recovery steps should be performed according to the configured policy.
The time interval for the periodic loop can be controlled by the user, per miniport driver instance, using the registry key: "SHCheckForHangTimeInSeconds".
Registry key location:
HKLM\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\<nn>
For instructions on how to find the interface index <nn> in the registry, please refer to Finding the Index Value of the Network Interface.
Check-for-Hang SHCheckForHangTimeInSeconds Registry Key
Key Name | Key Type | Values | Description |
SHCheckForHangTimeInSeconds | REG_DWORD | [1 - MAX_ULONG] | The interval in seconds for the Check-for-Hang mechanism. |
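The key location and value range above can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, and the code assumes that <nn> is a zero-padded four-digit subkey (for example 0007) and that MAX_ULONG is the largest 32-bit REG_DWORD value. It builds and validates values only; it does not touch the registry.

```python
# Network adapter class key, as given in the registry key location above.
NET_CLASS_KEY = (
    r"SYSTEM\CurrentControlSet\Control\Class"
    r"\{4d36e972-e325-11ce-bfc1-08002be10318}"
)

# Assumption: MAX_ULONG is the largest value a 32-bit REG_DWORD can hold.
MAX_ULONG = 0xFFFFFFFF


def adapter_key_path(index: int) -> str:
    """Return the per-adapter subkey path (relative to HKLM).

    Assumption: <nn> is a zero-padded four-digit subkey such as 0007.
    """
    if not 0 <= index <= 9999:
        raise ValueError(f"interface index out of range: {index}")
    return f"{NET_CLASS_KEY}\\{index:04d}"


def validate_check_for_hang_interval(seconds: int) -> int:
    """Check a candidate SHCheckForHangTimeInSeconds value against [1 - MAX_ULONG]."""
    if not 1 <= seconds <= MAX_ULONG:
        raise ValueError(f"interval must be in [1, {MAX_ULONG}], got {seconds}")
    return seconds


if __name__ == "__main__":
    print("HKLM\\" + adapter_key_path(7))
    print(validate_check_for_hang_interval(5))
```

The resulting path can then be passed to whichever registry tool is in use, once the interface index has been confirmed.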
Sensors
The health state of the driver is examined by designated sensors. Each sensor can be enabled or disabled independently. When a sensor detects an error, it reports to the Self-Healing manager, logs an ETW message, and executes the appropriate recovery-flow action.
If the sensor is not activated, the error is ignored and no recovery flows are executed, but logs and dumps are still generated.
Miniport Driver Sensors
The miniport sensors can be controlled per miniport instance, using the per-miniport registry key paths.
Sensor | Description |
Lack of Progress in Hardware for Ethernet Driver Send Queues | The driver has posted a send WR in an Ethernet QP, but the hardware did not respond with a completion notification within a reasonable time period. The time period is determined by the Check-for-Hang mechanism and the "CheckForHangCQMaxNoProgress" registry key as follows: (SHCheckForHangTimeInSeconds * CheckForHangCQMaxNoProgress). Mask: 0x00000008 |
Lack of Progress in Software for Ethernet Driver Receive Queues | The hardware has completed a receive WR that the driver posted to an Ethernet QP, but the driver has not processed the completion event within a reasonable time period. The time period is determined by the same logic as described in the “Lack of Progress in Hardware for Ethernet Driver Send Queues” row of this table. |
Receive Completion Error | The hardware has reported an error in a receive WR. |
Send Completion Error | The hardware has reported an error in a send WR. |
Lack of Progress in Software for Ethernet Driver Send Queues | The hardware has completed a send WR that the driver posted to an Ethernet QP, but the driver has not processed the completion event within a reasonable time period. The time period is determined by the same logic as described in the “Lack of Progress in Hardware for Ethernet Driver Send Queues” row of this table, but multiplied by two: (2 * SHCheckForHangTimeInSeconds * CheckForHangCQMaxNoProgress). |
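The timing rules in the table can be illustrated with a small Python sketch. The product formula is taken from the table; the `NoProgressCounter` class is a hypothetical model of counting Check-for-Hang cycles without progress, not the driver's actual implementation.

```python
def no_progress_timeout_seconds(check_interval_s: int,
                                max_no_progress_cycles: int,
                                software_send_queue: bool = False) -> int:
    """Time before a lack-of-progress sensor reports an error.

    check_interval_s       -> SHCheckForHangTimeInSeconds
    max_no_progress_cycles -> CheckForHangCQMaxNoProgress
    The software send-queue sensor waits twice as long, per the table.
    """
    base = check_interval_s * max_no_progress_cycles
    return 2 * base if software_send_queue else base


class NoProgressCounter:
    """Hypothetical model: count consecutive cycles with pending work and no progress."""

    def __init__(self, max_no_progress_cycles: int) -> None:
        self.max_cycles = max_no_progress_cycles
        self.last_seen = None      # progress value observed in the previous cycle
        self.stalled_cycles = 0    # consecutive cycles without progress

    def check(self, progress_counter: int, work_pending: bool) -> bool:
        """Call once per Check-for-Hang cycle; True means 'report an error'."""
        if not work_pending or progress_counter != self.last_seen:
            # Progress was made (or nothing is outstanding): start counting again.
            self.last_seen = progress_counter
            self.stalled_cycles = 0
            return False
        self.stalled_cycles += 1
        return self.stalled_cycles >= self.max_cycles


if __name__ == "__main__":
    print(no_progress_timeout_seconds(5, 4))        # hardware send sensor
    print(no_progress_timeout_seconds(5, 4, True))  # software send sensor
```

With an interval of 5 seconds and 4 allowed cycles, for example, the hardware send sensor would report after 20 seconds and the software send sensor after 40 seconds.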
The Self-Healing feature can be controlled by registry keys. The driver detects registry changes dynamically, and updates the Self-Healing settings automatically without requiring a driver restart.
Miniport Driver Registry Keys
Registry keys location:
HKLM\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\<nn>
For instructions on how to find the interface index <nn> in the registry, please refer to Finding the Index Value of the Network Interface.
Miniport Driver Registry Keys
Key Name | Key Type | Values | Description |
SHMPResetActiveSensorsMask | REG_DWORD | [0 - 0xFFFFFFFF] | Determines which sensors are active to execute a miniport reset upon error. |
SHSensorsDumpMask | REG_DWORD | [0 - 0xFFFFFFFF] | Determines which sensors are allowed to trigger the "Dump Me Now" feature upon error. |
CheckForHangCQMaxNoProgress | REG_DWORD | [1 - 1000] | The number of Check-for-Hang cycles with no progress in HW to count before reporting an error to the Self-Healing manager. |
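As a rough illustration of how the two mask keys could combine, the following Python sketch tests a sensor's bit against each mask. Only the 0x00000008 mask comes from the sensors table; the function name, the action labels, and the assumption that an ETW message is logged for every detected error are a simplified reading of the text above, not the driver's actual logic.

```python
# Mask of the "Lack of Progress in Hardware for Ethernet Driver Send Queues"
# sensor, as listed in the sensors table.
HW_SEND_NO_PROGRESS = 0x00000008


def actions_for_sensor_error(sensor_bit: int,
                             reset_mask: int,
                             dump_mask: int) -> list:
    """Return the (hypothetical) actions taken when a sensor reports an error.

    reset_mask -> SHMPResetActiveSensorsMask
    dump_mask  -> SHSensorsDumpMask
    """
    actions = ["log_etw_event"]          # assumption: every error is logged
    if dump_mask & sensor_bit:
        actions.append("dump_me_now")    # sensor may trigger "Dump Me Now"
    if reset_mask & sensor_bit:
        actions.append("miniport_reset") # sensor is active for miniport reset
    return actions


if __name__ == "__main__":
    # Sensor enabled for reset only.
    print(actions_for_sensor_error(HW_SEND_NO_PROGRESS, 0x8, 0x0))
    # Sensor enabled for dump only.
    print(actions_for_sensor_error(HW_SEND_NO_PROGRESS, 0x0, 0x8))
```

Setting a mask to 0 disables the corresponding action for all sensors, and 0xFFFFFFFF enables it for all of them.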
The self-healing manager records the following events in the System Event Viewer. Each record specifies the selected recovery flow and the reason for its execution:
Each sensor issues a unique ETW event upon error. The event can be found in the Windows Event Viewer, under "Applications and Services Logs\Mellanox-Drivers\Operational". The following table contains all event messages:
Logging - Windows Event Viewer Applications Messages
Event ID | Message |
3009 | <Device name>: Lack of progress in hardware for Ethernet driver send queues sensor detected an error |
30029 | <Device name>: Lack of progress in software for Ethernet driver send queues sensor detected an error |
30011 | <Device name>: Lack of progress in software for Ethernet driver receive queues sensor detected an error |
30021 | <Device name>: Receive completion error sensor detected an error |
30022 | <Device name>: Send completion error sensor detected an error |
30013 | <Device name>: VF communication channel error sensor detected an error |
The reasons are detailed in the following table:
Logging - Windows Event Viewer Messages
Event ID | Message |
0x008b | <Device name>: Self Healing - Failed to activate the resiliency flow as a result of a SW reset failure, error=<error id>. |
0x008c | Restart <Interface name> as a result of an error that was reported by the sensors <Sensors mask> |
0x008d | Stopped <Interface name> activity as a result of an error that was reported by sensors <n>. |
0x0100 | <Device name>: dump folder (<path>) was created due to a dump-me-now request. |
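The event IDs in this table are written in hexadecimal, whereas Event Viewer filters take decimal IDs. The Python sketch below converts them; the dictionary name and the short descriptions are paraphrases introduced here for illustration only.

```python
# Event IDs from the table above, keyed by their hexadecimal value.
# The descriptions are shortened paraphrases of the table messages.
SELF_HEALING_EVENTS = {
    0x008B: "resiliency flow not activated: SW reset failure",
    0x008C: "interface restarted after a sensor-reported error",
    0x008D: "interface activity stopped after a sensor-reported error",
    0x0100: "dump folder created due to a dump-me-now request",
}


def decimal_event_ids() -> list:
    """Return the event IDs as sorted decimal integers."""
    return sorted(SELF_HEALING_EVENTS)


def describe_event(event_id: int) -> str:
    """Look up a short description for a self-healing event ID."""
    return SELF_HEALING_EVENTS.get(event_id, "unknown event")


if __name__ == "__main__":
    for event_id in decimal_event_ids():
        print(f"{event_id:>4} (0x{event_id:04x}): {describe_event(event_id)}")
```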