Resiliency
DMN generates dumps and traces from various components, including hardware, firmware and software, in response to user requests, internally detected issues (reported by the resiliency sensors) and ND application requests made through the extended NVIDIA® ND API.
DMN dumps are crucial for offline debugging. Once an issue is hit, the dumps can provide useful information about the NIC's state at the time of the failure. This includes hardware state dumps, firmware traces and various driver component state and resource dumps.
For information on the relevant registry keys for this feature, please refer to Dump Me Now (DMN) Registry Keys.
DMN Triggers and APIs
DMN supports three triggering APIs:
mlx5Cmd.exe can be used to trigger DMN by running the -Dmn sub command:
Mlx5Cmd -Dmn -hh | -Name <adapter name> Submit dump-me-now request
Options:
-hh
Show this help screen
-Name <adapter name>
Network adapter name
-NoMstDump
Run DMN without mst dump
-CoreDumpQP <QP number>
Run DMN with QP Core Dump
ND SPI NVIDIA® extension (defined in ndspi_ext_mlx.h):
API function to generate a general DMN dump from an ND application:
HRESULT Nd2AdapterControlDumpMeNow( __in IND2AdapterControl* pCtrl, __in HANDLE hOverlappedFile, __inout OVERLAPPED* pOverlapped );
API function to generate a QP based DMN dump from an ND application. The function generates a dump that might include more information about the queue pair specified by its number.
HRESULT Nd2AdapterControlDumpQpNow( __in IND2AdapterControl* pCtrl, __in HANDLE hOverlappedFile, __in ULONG Qpn, __inout OVERLAPPED* pOverlapped );
An internal API between different driver components, in order to support generating DMN upon self-detected errors and failures (by the resiliency feature).
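The mlx5Cmd options listed above can be combined from a script. The following Python sketch only assembles the argument list from the documented options and does not run the tool. The helper name build_dmn_command and the adapter name in the usage line are hypothetical, and passing the QP number as a separate argument after -CoreDumpQP is an assumption.

```python
from typing import List, Optional


def build_dmn_command(adapter_name: str,
                      no_mst_dump: bool = False,
                      core_dump_qp: Optional[int] = None) -> List[str]:
    """Assemble an 'mlx5Cmd -Dmn' argument list (hypothetical helper)."""
    if not adapter_name:
        raise ValueError("adapter name is required")
    args = ["mlx5Cmd.exe", "-Dmn", "-Name", adapter_name]
    if no_mst_dump:
        # Run DMN without the mst dump.
        args.append("-NoMstDump")
    if core_dump_qp is not None:
        # Assumption: the QP number follows the flag as its own argument.
        args += ["-CoreDumpQP", str(core_dump_qp)]
    return args


if __name__ == "__main__":
    # "Ethernet 4" is a placeholder adapter name.
    print(build_dmn_command("Ethernet 4", no_mst_dump=True))
```

On a host where mlx5Cmd.exe is installed, the returned list could be handed to subprocess.run(args, check=True).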
Dumps and Incident Folders
DMN generates a directory per incident, where it places all of the needed NIC dump files. There is a mechanism to limit the number of created Incident Directories. For further information, see Cyclic DMN Mechanism.
The DMN incident directory name includes a timestamp, dump type, DMN event source and reason. It uses the following directory naming scheme: dmn-<type of DMN>-<source of DMN trigger>-<reason>-<timestamp>
Example:
dmn-GN-USR-NA-4.13.2017-07.49.02.747
In this example:
GN: The dump type is “General”
USR: The DMN was triggered by mlx5Cmd (user)
NA: In this version of the driver, the reason for the dump is not available when the DMN is triggered by mlx5Cmd
The dump was created on April 13th, 2017 at 747 milliseconds after 7:49:02 AM
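When post-processing many incident folders, the naming scheme above can be parsed mechanically. The following Python sketch is one way to do so. It assumes the timestamp layout is month.day.year-hour.minute.second.milliseconds, as the example suggests, and that the type, source and reason fields contain no hyphens. The names DmnIncident and parse_incident_dir are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class DmnIncident(NamedTuple):
    dump_type: str
    source: str
    reason: str
    timestamp: datetime


def parse_incident_dir(name: str) -> DmnIncident:
    """Split dmn-<type>-<source>-<reason>-<timestamp> into its fields."""
    # Split into at most five parts so the hyphen inside the timestamp survives.
    parts = name.split("-", 4)
    if len(parts) != 5 or parts[0] != "dmn":
        raise ValueError(f"not a DMN incident directory: {name!r}")
    _, dump_type, source, reason, stamp = parts
    # Assumed layout: month.day.year-hour.minute.second.milliseconds
    timestamp = datetime.strptime(stamp, "%m.%d.%Y-%H.%M.%S.%f")
    return DmnIncident(dump_type, source, reason, timestamp)


if __name__ == "__main__":
    print(parse_incident_dir("dmn-GN-USR-NA-4.13.2017-07.49.02.747"))
```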
In this version of the driver, the DMN generates the following dump files upon a DMN event:
IPoIB: The adapter’s IPoIB state
PDDR: The port diagnostics database
General
mst files
Registry
DMN incident dumps are created under the DMN root directory, which can be controlled via the registry. The root directory will include the port identification in its name.
The default is:
Host: "\Systemroot\temp\Mlx5_Dump_Me_Now-<b>-<d>-<f>"
VF: "\Systemroot\temp\Mlx5_Dump_Me_Now-<b>-<d>". See section Dump Me Now (DMN) Registry Keys.
State Dumping (via Dump Me Now)
Upon several types of events, the drivers can produce a set of files reflecting the current state of the adapter.
Automatic state dumps via DMN are done upon the following events:
Event Type | Description | Provider | Default | Tag |
CMD_FAILED | Command failure | Mlx5 | On | FAILED |
CMD_TIMEOUT | Timeout reached on a command | Mlx5 | On | TOUT |
RESILIENCY | Resiliency sensor was activated | Mlx5 | Off | RES |
EQ_STUCK | Driver decided that an event queue is stuck | Mlx5 | On | EQ |
TXCQ_STUCK | Driver decided that a transmit completion queue is stuck | Mlx5 | On | TXCQ |
RXCQ_STUCK | Driver decided that a receive completion queue is stuck | Mlx5 | On | RXCQ |
PORT_STATE | Adapter transitioned to the “port up”, “port down” or “port unknown” state | Mlx5 | On | PORT |
USER | User application asked to generate dump files | Mlx5 | N/A | USR |
where
Provider | The driver creating the set of files. |
Default | Whether or not the state dumps are created by default upon this event. |
Tag | Part of the file name, used to identify the event that has triggered the state dump. |
Dump events can be enabled/disabled by adding DWORD32 parameters into HKLM\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\<nn> as follows:
Dump events can be disabled by adding the MstDumpMode parameter as follows:
MstDumpMode | 0 |
PORT_STATE events can be disabled by adding the EnableDumpOnUnknownLink, EnableDumpOnPortDown and EnableDumpOnPortUp parameters as follows:
EnableDumpOnUnknownLink | 0 |
EnableDumpOnPortDown | 0 |
EnableDumpOnPortUp | 0 |
Warning: As of WinOF-2 v2.10, the registry keys above can be changed dynamically. If an illegal value is supplied, the key falls back to its default value, not to the last value used.
EQ_STUCK, TXCQ_STUCK and RXCQ_STUCK events can be disabled by adding the DisableDumpOnEqStuck, DisableDumpOnTxCqStuck and DisableDumpOnRxCqStuck parameters as follows:
DisableDumpOnEqStuck | 1 |
DisableDumpOnTxCqStuck | 1 |
DisableDumpOnRxCqStuck | 1 |
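The registry parameters above can be applied from a script. The following Python sketch only generates the corresponding reg add command lines and does not execute them. The instance number 0007 stands in for the adapter-specific <nn> and is a placeholder, the helper name reg_add_commands is hypothetical, and REG_DWORD is assumed to be the appropriate type for the DWORD32 parameters.

```python
from typing import Dict, List

# Network adapter class key, as given in the text above.
CLASS_KEY = (r"HKLM\SYSTEM\CurrentControlSet\Control\Class"
             r"\{4d36e972-e325-11ce-bfc1-08002be10318}")


def reg_add_commands(instance: str, values: Dict[str, int]) -> List[str]:
    """Return one 'reg add' command line per parameter (hypothetical helper)."""
    key = CLASS_KEY + "\\" + instance
    commands = []
    for name, value in values.items():
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"{name}: expected a non-negative integer")
        commands.append(
            f'reg add "{key}" /v {name} /t REG_DWORD /d {value} /f'
        )
    return commands


if __name__ == "__main__":
    # "0007" is a placeholder for the adapter's <nn> instance number.
    settings = {"EnableDumpOnPortUp": 0, "DisableDumpOnEqStuck": 1}
    for line in reg_add_commands("0007", settings):
        print(line)
```

Generating the commands as text, instead of writing to the registry directly, lets them be reviewed before they are run from an elevated prompt.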
The set consists of two consecutive mstdump files. These files are created in the same directory as the DMN, and should be sent to NVIDIA® Support for analysis when debugging WinOF-2 driver problems.
Their names have the following format: <event_name>-<dump_mode>_<file_index>.txt
<event_name>
Event name | Description |
poll-tout-<OPCODE> | Timeout reached on command with polling mode, OPCODE is the command opcode in the driver. |
wait-tout-<OPCODE> | Timeout reached on command while waiting, OPCODE is the command opcode in the driver. |
poll-failed-<OPCODE> | Command with polling mode failed, OPCODE is the command opcode in the driver. |
wait-failed-<OPCODE> | Command failed, OPCODE is the command opcode in the driver. |
eth-eq-<EQN>-<EQ_IDX> | EQ stuck, EQN: EQ number, EQ_IDX: EQ index |
eth-txcq-<CQN> | TXCQ is stuck, CQN is the CQ number |
eth-rxcq-<CQN> | RXCQ is stuck, CQN is the CQ number |
eth-<STATE> | PORT change event, STATE: [“up”, “down”, “none”] |
oid | User application asked for the dump |
BugCheck | Bug check event |
resiliency | When resiliency flow is triggered |
<dump_mode>
dump_mode: The mode of collecting the mstdump: “crspace”, “fast-crspace”
<file_index>
file_index: The file number of this type in the set
Example:
Name: wait-failed-936-fast-crspace_1.txt
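The file name format above can be split back into its three fields when sorting collected dumps. The following Python sketch is one way to do this. It assumes the two dump modes are spelled crspace and fast-crspace, and the names MstDumpFile and parse_dump_filename are hypothetical.

```python
import re
from typing import NamedTuple


class MstDumpFile(NamedTuple):
    event_name: str
    dump_mode: str
    file_index: int


# "fast-crspace" is listed first so it is tried before the shorter "crspace";
# the lazy event group stops at the first hyphen that precedes a dump mode.
_PATTERN = re.compile(
    r"^(?P<event>.+?)-(?P<mode>fast-crspace|crspace)_(?P<index>\d+)\.txt$"
)


def parse_dump_filename(filename: str) -> MstDumpFile:
    """Parse <event_name>-<dump_mode>_<file_index>.txt into its fields."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not an mstdump file name: {filename!r}")
    return MstDumpFile(match["event"], match["mode"], int(match["index"]))


if __name__ == "__main__":
    print(parse_dump_filename("wait-failed-936-fast-crspace_1.txt"))
```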
The default number of sets of files for each event is 20. The other dump files are named <DumpType>.log
DumpType can be: PDDR, Registry, General, IPoIB, MiniportProfiling
Cyclic DMN Mechanism
The driver manages the DMN incident dumps in a cyclic fashion, in order to limit the amount of disk space used for saving DMN dumps and to avoid low disk space conditions that can be caused by creating the dumps.
Rather than using a simple cyclic override scheme by replacing the oldest DMN incident folder every time it generates a new one, the driver allows the user to determine whether the first N incident folders should be preserved or not. This means that the driver will maintain a cyclic overriding scheme starting from a given index.
The two registry keys used to control this behavior are DumpMeNowTotalCount, which specifies the maximum number of allowed dumps under the DMN root folder, and DumpMeNowPreservedCount, which specifies the number of reserved incident folders that will not be overridden by the cyclic algorithm.
The following diagram illustrates the cyclic scheme’s work, assuming DumpMeNowPreservedCount=2 and DumpMeNowTotalCount=16:
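The scheme can also be expressed as a small calculation. The following Python sketch is a model of the behavior described above, not the driver's implementation: it assumes that incidents are numbered from 0, that the first DumpMeNowPreservedCount incidents keep their own slots permanently, and that later incidents rotate through the remaining slots in order. The function name incident_slot is hypothetical.

```python
def incident_slot(incident: int, total: int = 16, preserved: int = 2) -> int:
    """Return the folder slot used by the n-th incident (0-based).

    Assumed model: slots [0, preserved) are written once and kept,
    slots [preserved, total) are reused cyclically.
    """
    if incident < 0:
        raise ValueError("incident number must be non-negative")
    if not 0 <= preserved < total:
        raise ValueError("need 0 <= preserved < total")
    if incident < preserved:
        return incident
    # Rotate through the non-preserved slots.
    return preserved + (incident - preserved) % (total - preserved)


if __name__ == "__main__":
    # With the defaults, the 17th incident (index 16) reuses slot 2.
    print([incident_slot(n) for n in range(18)])
```

Under these assumptions, with DumpMeNowPreservedCount=2 and DumpMeNowTotalCount=16, the first two incident folders are never overwritten and the remaining 14 slots are recycled oldest first.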
Configuring DMN-IOV
The DMN-IOV detail level can be configured by the "DmnIovMode" value that is located in device parameters registry key. The default value is 2. The acceptable values are 0-4:
Values | Description |
0 | The feature is disabled |
1 | Major IOV objects and their state will be listed |
2 | All VF hardware resources and their state will be listed in the dump (QPs, CQs, MTTs, etc.) |
3 | All QP-to-Ring mappings will be added (this produces a very large dump) |
4 | All IOV objects and their state will be listed |
Dump PDDR Information
The DMN-PDDR can be configured by the "EnableDumpOnPortUp" and "EnableDumpOnPortDown" values that are located in the device parameters registry key.
The default values of the keys are as follows:
EnableDumpOnPortUp = 0 [capability disabled]
EnableDumpOnPortDown = 1 [capability enabled]
Event Logs
DMN generates an event to the system event log upon the success or failure of the dump file generation.
Reported Driver Event Severity: Error
Event ID | Message |
0x101 | <device name>: Failed to create a full dump me now. Dump me now root directory: <path to root DMN folder> Failure: <Failure description> Status: <status code> |
Reported Driver Event Severity: Warning
For a list of the DMN Warning events, see Reported Driver Events.
The FwTrace feature allows firmware traces to be logged online into WPP tracing without requiring any NVIDIA®-specific tools. It provides an easy way to debug and diagnose issues in production without the need to reproduce the issue. Both the firmware and the driver traces are displayed in the same file. Additionally, FwTrace is used as a platform for core_dump.
System Requirements | |
Firmware versions: |
Configuring FwTrace
FwTrace uses Registry Keys for its configuration. For more information see section FwTrace Registry Keys.
The FwTrace feature can be enabled/disabled dynamically (without requiring an adapter restart) using the FwTracerEnabled registry key.
FwTrace uses a cyclic buffer. The size of the buffer can be configured using the dynamic registry key FwTracerBufferSize. To change the buffer size, set FwTracerBufferSize to the desired value and then restart FwTrace, either by using the FwTracerEnabled registry key or by restarting the adapter.
Resource Dump is a debuggability utility that extracts and prints data segments generated by the firmware/hardware. The driver registers for all the supported resource types (segments) and listens for events sent by the firmware. When such an event arrives, the driver issues a request to collect the resource dump and exports it to the filesystem (using the Dump-Me-Now mechanism).
For further information, see ResourceDump Registry Keys and Resource Dump Utility.
As Resource Dump depends on DMN, its enablement is coupled with the DMN enablement.