NVIDIA WinOF-2 Documentation v24.07.50000

Resiliency

DMN (Dump Me Now) generates dumps and traces from various components, including hardware, firmware and software. Dumps are generated upon user requests, upon issues detected internally by the resiliency sensors, and upon ND application requests made via the extended NVIDIA® ND API.

DMN dumps are crucial for offline debugging. Once an issue is hit, the dumps can provide useful information about the NIC's state at the time of the failure. This includes hardware state dumps, firmware traces and various driver component state and resource dumps.

For information on the relevant registry keys for this feature, please refer to Dump Me Now (DMN) Registry Keys.

DMN Triggers and APIs

DMN supports three triggering APIs:

  1. mlx5Cmd.exe can be used to trigger DMN by running the -Dmn sub command:

    Mlx5Cmd -Dmn -hh | -Name <adapter name>

    Submit dump-me-now request

    Options:

      -hh                      Show this help screen
      -Name <adapter name>     Network adapter name
      -NoMstDump               Run DMN without mst dump
      -CoreDumpQP<QP number>   Run DMN with QP Core Dump

  2. ND SPI NVIDIA® extension (defined in ndspi_ext_mlx.h):

    1. API function to generate a general DMN dump from an ND application:

      HRESULT Nd2AdapterControlDumpMeNow( __in IND2AdapterControl* pCtrl, __in HANDLE hOverlappedFile, __inout OVERLAPPED* pOverlapped );

    2. API function to generate a QP based DMN dump from an ND application. The function generates a dump that might include more information about the queue pair specified by its number.

      HRESULT Nd2AdapterControlDumpQpNow( __in IND2AdapterControl* pCtrl, __in HANDLE hOverlappedFile, __in ULONG Qpn, __inout OVERLAPPED* pOverlapped );

  3. An internal API between driver components, which supports generating DMN upon self-detected errors and failures (by the resiliency feature).

Dumps and Incident Folders

DMN generates a directory per incident, where it places all of the needed NIC dump files. There is a mechanism to limit the number of created Incident Directories. For further information, see Cyclic DMN Mechanism.

The DMN incident directory name includes a timestamp, dump type, DMN event source and reason. It uses the following directory naming scheme: dmn-<type of DMN>-<source of DMN trigger>-<reason>-<timestamp>

Example:

dmn-GN-USR-NA-4.13.2017-07.49.02.747

In this example:

  • GN: The dump type is “General”

  • USR: The DMN was triggered by mlx5Cmd (user)

  • NA: In this version of the driver, the cause of the dump is not available when the DMN is triggered by mlx5Cmd

  • The dump was created on April 13th, 2017 at 747 milliseconds after 7:49:02 AM
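The naming scheme above can be decoded mechanically, for example when sorting collected incident folders. The following is a minimal Python sketch, not part of the driver or its tools. The helper name parse_incident_dir is hypothetical, and the sketch assumes that the type, source and reason fields contain no hyphens and that the timestamp always has the M.D.YYYY-HH.MM.SS.mmm layout seen in the single example above.

```python
import re
from datetime import datetime

# dmn-<type of DMN>-<source of DMN trigger>-<reason>-<timestamp>
# Assumption: the three leading fields contain no hyphens, and the timestamp
# is M.D.YYYY-HH.MM.SS.mmm, as in the single example in this section.
INCIDENT_RE = re.compile(
    r"^dmn-(?P<type>[^-]+)-(?P<source>[^-]+)-(?P<reason>[^-]+)-"
    r"(?P<stamp>\d{1,2}\.\d{1,2}\.\d{4}-\d{2}\.\d{2}\.\d{2}\.\d{3})$"
)


def parse_incident_dir(name):
    """Split a DMN incident directory name into its four fields."""
    match = INCIDENT_RE.match(name)
    if match is None:
        raise ValueError(f"not a DMN incident directory name: {name!r}")
    fields = match.groupdict()
    # %f accepts 1-6 digits, so "747" is read as 747 milliseconds.
    fields["timestamp"] = datetime.strptime(
        fields.pop("stamp"), "%m.%d.%Y-%H.%M.%S.%f"
    )
    return fields


info = parse_incident_dir("dmn-GN-USR-NA-4.13.2017-07.49.02.747")
print(info["type"], info["source"], info["reason"], info["timestamp"].isoformat())
```

Names that do not follow the scheme raise ValueError, so unrelated folders under the DMN root directory can be skipped explicitly rather than misread.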

In this version of the driver, the DMN generates the following dump files upon a DMN event:

  • IPoIB: The adapter’s IPoIB state

  • PDDR: The port diagnostics database

  • General

  • mst files

  • Registry

DMN incident dumps are created under the DMN root directory, which can be controlled via the registry. The root directory will include the port identification in its name.

The default is:

  • Host: "\Systemroot\temp\Mlx5_Dump_Me_Now-<b>-<d>-<f>"

State Dumping (via Dump Me Now)

Upon several types of events, the drivers can produce a set of files reflecting the current state of the adapter.

Automatic state dumps via DMN are done upon the following events:

Event Type    Description                                                        Provider   Default   Tag
-----------   ----------------------------------------------------------------   --------   -------   ------
CMD_FAILED    Command failure                                                    Mlx5       On        FAILED
CMD_TIMEOUT   Timeout reached on a command                                       Mlx5       On        TOUT
RESILIENCY    Resiliency sensor was activated                                    Mlx5       Off       RES
EQ_STUCK      Driver decided that an event queue is stuck                        Mlx5       On        EQ
TXCQ_STUCK    Driver decided that a transmit completion queue is stuck           Mlx5       On        TXCQ
RXCQ_STUCK    Driver decided that a receive completion queue is stuck            Mlx5       On        RXCQ
PORT_STATE    Adapter passed to “port up”, “port down” or “port unknown” state   Mlx5       On        PORT
USER          User application asked to generate dump files                      Mlx5       N/A       USR

where:

  • Provider: The driver creating the set of files.

  • Default: Whether or not the state dumps are created by default upon this event.

  • Tag: Part of the file name, used to identify the event that has triggered the state dump.

Dump events can be enabled/disabled by adding DWORD32 parameters into HKLM\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\<nn> as follows:

  • Dump events can be disabled by adding MstDumpMode parameter as follows:

    MstDumpMode 0

  • PORT_STATE events can be disabled by adding the EnableDumpOnUnknownLink, EnableDumpOnPortDown and EnableDumpOnPortUp parameters as follows:

    EnableDumpOnUnknownLink 0
    EnableDumpOnPortDown 0
    EnableDumpOnPortUp 0

    Note

    As of WinOF-2 v2.10, the registry keys above can be changed dynamically. If an illegal value is supplied, the key falls back to its default value, not to the last value used.

  • EQ_STUCK, TXCQ_STUCK and RXCQ_STUCK events can be disabled by adding DisableDumpOnEqStuck, DisableDumpOnTxCqStuck and DisableDumpOnRxCqStuck parameters as follows:

    DisableDumpOnEqStuck 1
    DisableDumpOnTxCqStuck 1
    DisableDumpOnRxCqStuck 1
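As an illustration of the registry settings listed above, the following Python sketch builds command lines for the standard Windows reg.exe tool that would set such parameters as REG_DWORD values. It only prints the commands and does not modify the registry. The helper name reg_add_commands and the adapter instance "0007" (standing in for <nn>) are hypothetical; the actual instance subkey must be looked up on the target system.

```python
# Network adapter class key from this section; <nn> is the adapter instance.
CLASS_KEY = (
    r"HKLM\SYSTEM\CurrentControlSet\Control\Class"
    r"\{4d36e972-e325-11ce-bfc1-08002be10318}"
)


def reg_add_commands(instance, values):
    """Return one 'reg add' command line per (parameter name, DWORD value)."""
    key = f"{CLASS_KEY}\\{instance}"
    return [
        f'reg add "{key}" /v {name} /t REG_DWORD /d {int(data)} /f'
        for name, data in values.items()
    ]


# "0007" is a hypothetical instance number used only for illustration.
stuck_queue_dumps_off = {
    "DisableDumpOnEqStuck": 1,
    "DisableDumpOnTxCqStuck": 1,
    "DisableDumpOnRxCqStuck": 1,
}
for command in reg_add_commands("0007", stuck_queue_dumps_off):
    print(command)
```

Printing the commands instead of executing them keeps the sketch side-effect free; the output can be reviewed before being run from an elevated prompt.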

The set consists of two consecutive mstdump files. These files are created in the same directory as the DMN dumps and should be sent to NVIDIA® Support for analysis when debugging WinOF-2 driver problems.

Their names have the following format: <event_name>-<dump_mode>_<file_index>.txt

<event_name>

Event name              Description
---------------------   -----------
poll-tout-<OPCODE>      Timeout reached on a command in polling mode. OPCODE is the command opcode in the driver.
wait-tout-<OPCODE>      Timeout reached on a command while waiting. OPCODE is the command opcode in the driver.
poll-failed-<OPCODE>    Command in polling mode failed. OPCODE is the command opcode in the driver.
wait-failed-<OPCODE>    Command failed. OPCODE is the command opcode in the driver.
eth-eq-<EQN>-<EQ_IDX>   EQ is stuck. EQN is the EQ number, EQ_IDX is the EQ index.
eth-txcq-<CQN>          TXCQ is stuck. CQN is the CQ number.
eth-rxcq-<CQN>          RXCQ is stuck. CQN is the CQ number.
eth-<STATE>             PORT change event. STATE: [“up”, “down”, “none”]
oid                     User application requested the dump
BugCheck                Bug check event
resiliency              Resiliency flow was triggered

<dump_mode>: The mode of collecting the mstdump: “crspace” or “fast-crspace”

<file_index>: The file number of this type in the set

Example:

Name: wait-failed-936-fast-crspace_1.txt
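The file name format above can be split back into its parts, for example when grouping the files of one set. The sketch below is illustrative only: the function name parse_mstdump_name is hypothetical, and it assumes the two dump modes are spelled "crspace" and "fast-crspace", as in the example file name.

```python
# Longest mode first, so "fast-crspace" is not mistaken for "crspace".
DUMP_MODES = ("fast-crspace", "crspace")


def parse_mstdump_name(filename):
    """Split <event_name>-<dump_mode>_<file_index>.txt into its three parts."""
    if not filename.endswith(".txt"):
        raise ValueError(f"expected a .txt file: {filename!r}")
    stem = filename[: -len(".txt")]
    body, sep, index = stem.rpartition("_")
    if not sep or not index.isdigit():
        raise ValueError(f"missing _<file_index>: {filename!r}")
    for mode in DUMP_MODES:
        suffix = "-" + mode
        if body.endswith(suffix):
            return {
                "event_name": body[: -len(suffix)],
                "dump_mode": mode,
                "file_index": int(index),
            }
    raise ValueError(f"unknown dump mode: {filename!r}")


print(parse_mstdump_name("wait-failed-936-fast-crspace_1.txt"))
```

The event name is taken as everything before the dump mode, so event names that themselves contain hyphens (such as wait-failed-<OPCODE>) are preserved intact.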

The default number of file sets for each event is 20. The other dump files are named <DumpType>.log

DumpType can be: PDDR, Registry, General, IPoIB, MiniportProfiling

Cyclic DMN Mechanism

The driver manages the DMN incident dumps in a cyclic fashion in order to limit the disk space used for saving DMN dumps and to avoid low disk space conditions caused by creating the dumps.

Rather than using a simple cyclic override scheme that replaces the oldest DMN incident folder every time a new one is generated, the driver lets the user decide whether the first N incident folders should be preserved. In other words, the driver maintains a cyclic overriding scheme starting from a given index.

The two registry keys used to control this behavior are DumpMeNowTotalCount, which specifies the maximum number of allowed dumps under the DMN root folder, and DumpMeNowPreservedCount, which specifies the number of reserved incident folders that will not be overridden by the cyclic algorithm.

The following diagram illustrates how the cyclic scheme works, assuming DumpMeNowPreservedCount=2 and DumpMeNowTotalCount=16:

[Figure: Cyclic DMN Mechanism]
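One way to read the cyclic scheme is as a mapping from the running incident number to a folder slot. The Python sketch below is an interpretation of the text above, not the driver's implementation. It assumes incidents and slots are numbered from 0 and that the cycle wraps back to the first non-preserved slot; the function name incident_slot is hypothetical.

```python
def incident_slot(n, total_count, preserved_count):
    """Return the folder slot used by the n-th incident (0-based).

    The first preserved_count incidents keep their slots forever; later
    incidents cycle over the remaining total_count - preserved_count slots.
    """
    if n < 0:
        raise ValueError("incident number must be non-negative")
    if not 0 <= preserved_count < total_count:
        raise ValueError("need 0 <= preserved_count < total_count")
    if n < preserved_count:
        return n
    cyclic_slots = total_count - preserved_count
    return preserved_count + (n - preserved_count) % cyclic_slots


# DumpMeNowPreservedCount=2 and DumpMeNowTotalCount=16, as in the text above.
for n in (0, 1, 15, 16, 17):
    print(n, incident_slot(n, total_count=16, preserved_count=2))
```

With these values, incidents 0 and 1 are never overridden, incident 15 fills the last slot, and incident 16 wraps around to slot 2 rather than slot 0.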

Configuring DMN-IOV

The DMN-IOV detail level can be configured by the "DmnIovMode" value located in the device parameters registry key. The default value is 2. The acceptable values are 0-4:

Value   Description
-----   -----------
0       The feature is disabled
1       Major IOV objects and their state will be listed
2       All VF hardware resources and their state will be listed in the dump (QPs, CQs, MTTs, etc.)
3       All QP-to-Ring mappings will be added (the huge dump)
4       All IOV objects and their state will be listed


Dump PDDR Information

DMN-PDDR can be configured by the "EnableDumpOnPortUp" and "EnableDumpOnPortDown" values located in the device parameters registry key.

The default values of the keys are as follows:

  • EnableDumpOnPortUp = 0 [capability disabled]

  • EnableDumpOnPortDown = 1 [capability enabled]

Event Logs

DMN writes an event to the system event log upon the success or failure of the dump file generation.

Reported Driver Event Severity: Error

Event ID   Message
--------   -------
0x101      <device name>: Failed to create a full dump me now.
           Dump me now root directory: <path to root DMN folder>
           Failure: <Failure description>
           Status: <status code>


Reported Driver Event Severity: Warning

For a list of the DMN Warning events, see Reported Driver Events.

The FwTrace feature allows firmware traces to be logged online into WPP tracing without requiring any NVIDIA®-specific tools. It provides an easy way to debug and diagnose issues in production without the need to reproduce them. Both the firmware and the driver traces are displayed in the same file. Additionally, FwTrace is used as a platform for core_dump.

System Requirements

Firmware versions:

  • NVIDIA® ConnectX®-4 v12.22.1002

  • NVIDIA® ConnectX®-4 Lx v14.22.1002

  • NVIDIA® ConnectX®-5 v16.22.4020

Configuring FwTrace

FwTrace uses Registry Keys for its configuration. For more information see section FwTrace Registry Keys.

The FwTrace feature can be enabled or disabled dynamically (without requiring an adapter restart) using the FwTracerEnabled registry key.

FwTrace uses a cyclic buffer. The size of the buffer can be configured using the dynamic registry key FwTracerBufferSize. To change the buffer size, set FwTracerBufferSize to the desired value and then restart FwTrace, either by using the FwTracerEnabled registry key or by restarting the adapter.

The NIC Health Monitor is an external tool used to check and monitor the health of the NIC by analyzing the firmware and the diagnostic counters previously collected by the user.

This capability is invoked with the following command and its parameters:

Mlx5Cmd -Dbg -NicHealthMonitor -hh | -Input <CSV file> [-Type N] [-FullName] [-Desc] [-Format TXT | CSV]

where:

-hh
    Show this help screen.

-Input <CSV file>
    File containing the names and values of the counters to be checked.

    Note: This parameter is mandatory. The file can be produced using the typeperf utility. For example:

        typeperf -qx | findstr "Mell*" > c:\Counters.txt
        typeperf -cf c:\Counters.txt -o c:\CounterData.csv -sc 200 -si 2

    The first command creates the list of counters to collect. The second command collects 200 samples of these counters, one sample every 2 seconds.

-Type N
    Bit field selecting the types of results to be shown:

      • 1: errors

      • 2: warnings

      • 3: errors + warnings (default)

      • 4/8: good/unchecked counters

    The tool bases its analysis on an internal list of counters that can indicate issues in the NIC health. There are four possible results of a counter analysis:

      • ERROR - the value of the counter is regarded as an error.

      • WARN - the value of the counter is suspicious.

      • GOOD - the value of the counter is OK.

      • ABSNT - the counter is not in the internal list and was not analyzed.

-FullName
    Print full counter names. The full name of a counter is quite long. Its format is:

        \\node_name\counter_set_name(adapter_instance(es))\counter_name

    By default, the tool prints only counter_name.

-Desc
    Print the description of the counter. The name of a counter is often not enough to understand its purpose, so the tool can print the counter's description to give more information.

-Format TXT | CSV
    Output format. The tool prints the results to stdout in one of two formats: plain text (TXT, the default) or CSV.

-CfgFile <CfgFile>
    Some of the checked counters have two configuration values: threshold and time unit. To see the default values of these parameters, run the tool with -List:

        Mlx5Cmd.exe -Dbg -NicHealthMonitor -List

    The output is plain text that can easily be edited to change the threshold values. The counters can then be checked against the new thresholds:

        Mlx5Cmd -Dbg -NicHealthMonitor -Check -Input <CSV file> -CfgFile config.log
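Because -Type is a bit field, its values combine by addition. The Python sketch below shows how a given value selects result types. It is illustrative only: the names ResultType and selected_results are hypothetical, and the mapping of 4 to good counters and 8 to unchecked counters is an assumption based on the order of the "4/8 good/unchecked" entry above.

```python
from enum import IntFlag


class ResultType(IntFlag):
    """Bits of the -Type value; 4 and 8 follow the assumed reading above."""
    ERRORS = 1
    WARNINGS = 2
    GOOD = 4
    UNCHECKED = 8


def selected_results(type_value):
    """List the result types that a -Type value asks the tool to show."""
    return [flag.name for flag in ResultType if type_value & flag]


for value in (1, 3, 15):
    print(value, selected_results(value))
```

Under this reading, 3 (the default) selects errors and warnings, and 15 selects all four result types.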

The following are a few examples of how to run the command:

  • To print only error counters in default format:

    Mlx5Cmd.exe -Dbg -NicHealthMonitor -input c:\tmp\CounterData.csv -type 1

  • To print only error and warning counters with full name of counters:

    Mlx5Cmd.exe -Dbg -NicHealthMonitor -input c:\tmp\CounterData.csv -type 3 -FullName

  • To print conclusions on all counters, found in the input file, with maximum info and in CSV format:

    Mlx5Cmd.exe -Dbg -NicHealthMonitor -input c:\tmp\CounterData.csv -type 15 -FullName -Desc -Format CSV > output.csv
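When post-processing output produced with -FullName, the full counter name can be split into the components of the format \\node_name\counter_set_name(adapter_instance(es))\counter_name described earlier. The following Python sketch is one possible way to do this. The function name split_counter_name and the sample node, counter set and counter names are hypothetical and used only for illustration.

```python
import re

# \\node_name\counter_set_name(adapter_instance(es))\counter_name
COUNTER_RE = re.compile(
    r"^\\\\(?P<node>[^\\]+)"
    r"\\(?P<counter_set>[^\\(]+)"
    r"\((?P<instance>.*)\)"
    r"\\(?P<counter>[^\\]+)$"
)


def split_counter_name(full_name):
    """Split a full counter name into node, counter set, instance, counter."""
    match = COUNTER_RE.match(full_name)
    if match is None:
        raise ValueError(f"not a full counter name: {full_name!r}")
    return match.groupdict()


# Hypothetical names, chosen only to show the shape of the input.
parts = split_counter_name(r"\\node1\Example Counter Set(Ethernet 2)\Example Counter")
print(parts["counter"])
```

The last component corresponds to the short counter_name that the tool prints by default when -FullName is not given.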

Resource Dump is a debuggability utility that extracts and prints data segments generated by the firmware/hardware. The driver registers for all the supported resource types (segments) and listens for the events that the firmware sends to initiate a resource dump collection. It then exports the collected dump to the filesystem (using the Dump-Me-Now mechanism).

For further information, see ResourceDump Registry Keys and Resource Dump Utility.

Note

As Resource Dump depends on DMN, its enablement is coupled with the DMN enablement.

© Copyright 2024, NVIDIA. Last updated on Aug 14, 2024.