NVIDIA WinOF-2 Documentation v24.07.50000

Management Utilities

The management utilities described in this chapter are used to manage the device's performance, NIC attribute information, and traceability.

The following are the supported management utilities:

mlx5cmd is a general management utility used for configuring the adapter, retrieving its information and collecting its WPP trace.

Usage

mlx5Cmd.exe <tool-name> <tool-arguments>

Performance Tuning Utility

This utility is used mostly for IP forwarding tests to optimize the driver’s configuration to achieve maximum performance when running in IP router mode.

Usage

mlx5cmd.exe -PerfTuning <tool-arguments>


Information Utility

This utility displays information of NVIDIA® NIC attributes. It is the equivalent utility to ibstat and vstat utilities in WinOF.

Usage

mlx5cmd.exe -Stat <tool-arguments>


DriverVersion Utility

The utility can display both the PF's and the VF's driver version.

Usage

mlx5cmd -DriverVersion -hh | -Name <adapter name> | [-PF] | [-VF] <VF number>

The naming format of the VF's driver version differs depending on whether the VM runs a Windows or a Linux OS. If the VF number is not set, the driver versions of all VFs are printed.

  • In a VM that runs on Windows OS, the naming format is: OS version,Driver name,Driver version (e.g., Windows2012R2,WinOF2,2.000.019684)

  • In a VM that runs on Linux OS, the naming format is: OS,Driver,Driver version (e.g., Linux Driver: Linux,mlx5_core,4.003.030211; Linux Inbox Driver: Linux,mlx5_core,3.0-1)
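As a rough illustration of the two naming formats above, the following Python sketch splits a reported VF driver version string into its three comma-separated fields. The function and type names are hypothetical and are not part of mlx5cmd; the sample strings are the ones quoted above.

```python
from typing import NamedTuple


class VfDriverVersion(NamedTuple):
    """Hypothetical container for the three comma-separated fields."""
    os_name: str
    driver_name: str
    driver_version: str


def parse_vf_driver_version(text: str) -> VfDriverVersion:
    """Split an 'OS,Driver,Driver version' string into its fields."""
    parts = [part.strip() for part in text.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated fields, got {len(parts)}")
    return VfDriverVersion(*parts)


if __name__ == "__main__":
    for sample in ("Windows2012R2,WinOF2,2.000.019684",
                   "Linux,mlx5_core,4.003.030211",
                   "Linux,mlx5_core,3.0-1"):
        print(parse_vf_driver_version(sample))
```

Because both formats use the same three-field layout, a single parser is enough; the first field indicates whether the VM runs Windows or Linux.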

Trace Utility

The utility saves the ETW WPP tracing of the driver.

Usage

mlx5cmd.exe -Trace <tool-arguments>


QoS Configuration Utility

The utility configures Quality of Service (QoS) settings.

Usage

mlx5cmd.exe -QoSConfig -Name <Network Adapter Name> <-DefaultUntaggedPriority | -Dcqcn | -SetupRoceQosConfig>

For further information about the parameters, you may refer to RCM Configuration.

Quick RoCE Configuration (One-Click RoCE)

This utility provides a quick RoCE configuration method using the mlx5cmd tool. It enables the user to set different QoS RoCE configurations without any prerequisites.

To set the desired RoCE configuration, run the -Configure <Configuration name> command.

The following configuration types are currently supported:

  • Lossy fabric

  • Lossy fabric with QoS

  • Lossless fabric

Once set, RoCE will be configured with DSCP priority 26 by default, if the -Priority or -Dscp flags are not specified.

When configuring the interface to work in a "Lossy fabric" state, the configuration is returned to its default (out-of-box) settings and the -Dscp and -Priority flags are ignored.

To check the current configuration, run the -Query command.

Detailed usage

mlx5cmd.exe -QosConfig -SetupRoceQosConfig -h

Note

The -Priority option uses VLAN priority (layer 2 priority). To use this option, a VLAN must be configured on the network.

Registry Keys Utility

This utility shows the registry keys that were set in the registry and are read by the driver. The PCI information can be queried from the "General" properties tab under "Location".

Usage

mlx5cmd.exe -RegKeys [-bdf <pci-bus#> <pci-device#> <pci-function#>]

Example

If the "Location" is "PCI Slot 3 (PCI bus 8, device 0, function 0)"

mlx5cmd.exe -RegKeys -bdf 8.0.0
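The conversion from the "Location" text to the -bdf argument in the example above can be sketched in Python as follows. This is an illustration only: the helper name is hypothetical, and the dotted bus.device.function form is taken from the example command rather than from a formal syntax definition.

```python
import re

# Matches the bus/device/function part of the "Location" text shown in the
# adapter's "General" properties tab.
_LOCATION_RE = re.compile(
    r"PCI bus (\d+), device (\d+), function (\d+)", re.IGNORECASE
)


def location_to_bdf(location: str) -> str:
    """Return 'bus.device.function' extracted from a Location string."""
    match = _LOCATION_RE.search(location)
    if match is None:
        raise ValueError(f"no PCI bus/device/function found in: {location!r}")
    bus, device, function = match.groups()
    return f"{int(bus)}.{int(device)}.{int(function)}"


if __name__ == "__main__":
    bdf = location_to_bdf("PCI Slot 3 (PCI bus 8, device 0, function 0)")
    print(f"mlx5cmd.exe -RegKeys -bdf {bdf}")
```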


Non-RSS Traffic Capture Utility

The RssSniffer utility samples packets that did not pass through the RSS engine, whether because they are non-RSS traffic or because the hardware decided, for any other reason, to skip RSS hashing.

The tool generates a packet dump file in a .pcap format. The RSS sampling is performed globally in native RSS mode, or per vPort in virtualization mode, when the hardware vRSS mode is active.

Detailed usage

mlx5cmd.exe -RssSniffer -hh

Note

Note that the tool can be configured to capture only a part of the packet, as well as specific packets in a sequence (N-th).


Sniffer Utility

Sniffer utility provides the user the ability to capture Ethernet, RoCE and IB traffic that flows to and from the NVIDIA® NIC's ports. The tool generates a packet dump file in .pcap format. This file can be read using the Wireshark tool (www.wireshark.org) for graphical traffic analysis. The .pcap file generated by the Sniffer Utility will be limited by default to 10M. Users can change or cancel the limit size per their demand. In order to force the file limit, the oldest captures will be saved in fileNamePrev.pcap and will be deleted when the limit is reached.

Note

In BlueField 2 SmartNIC mode, the sniffer cannot capture VF-to-VF traffic.

Detailed usage

mlx5cmd.exe -sniffer -help

Note

When using the sniffer utility in IPoIB in loopback mode, between VMs and hosts on the same network port, packets are seen twice in the pcap file: once for transmitting and once for receiving.

For multicast packets, packets are seen once for each direction and not for each destination.

Note

When in SR-IOV mode with 2nd PF enabled, on ConnectX-4 adapter cards, the Ethernet Sniffer utility sniffs only the PF’s traffic and not its VF’s traffic.


Link Speed Utility

This utility provides the ability to query supported link speeds by the adapter. Additionally, it enables the user to force set a particular link speed that the adapter can support.

Note

When using this utility, setting the link speed to 56GbE is not supported.

Usage

mlx5cmd.exe -LinkSpeed -Name <Network Adapter Name> -Query

Example

mlx5cmd.exe -LinkSpeed -Name <Network Adapter Name> -Set 1

Detailed usage

mlx5cmd.exe -LinkSpeed -hh


Link FEC Configuration Utility

Forward Error Correction (FEC) is an algorithm for finding and fixing errors in data transmission on a physical link. The NIC can support several algorithms for every link speed. An internal register called PPLM contains information on the FEC algorithms for every link speed.

PPLM register contains two fields for every link speed - ‘cap’ and ‘admin’.

  • ‘cap’ – means ‘capability’ – is a bitmask field, showing several FEC algorithms, supported for this link speed.

  • ‘admin’ – means ‘configured’ – contains the above ‘cap’ field where only one bit is set. It defines the FEC algorithm which is currently configured.
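The relationship between the two fields can be sketched in a few lines of Python. This is only an illustration: the bit positions and sample values are hypothetical, since the mapping of bits to FEC algorithms is not given here, and the helper names are not part of mlx5cmd.

```python
from typing import List


def supported_bits(cap: int) -> List[int]:
    """Return the bit positions that are set in the 'cap' bitmask."""
    return [bit for bit in range(cap.bit_length()) if cap & (1 << bit)]


def is_valid_admin(cap: int, admin: int) -> bool:
    """True if 'admin' has exactly one bit set and that bit is also set in 'cap'."""
    exactly_one_bit = admin != 0 and (admin & (admin - 1)) == 0
    return exactly_one_bit and (admin & cap) == admin


if __name__ == "__main__":
    cap = 0b0110    # hypothetical: two algorithms supported (bits 1 and 2)
    admin = 0b0100  # hypothetical: the algorithm at bit 2 is configured
    print(supported_bits(cap))
    print(is_valid_admin(cap, admin))
    print(is_valid_admin(cap, 0b0001))  # bit 0 is not part of 'cap'
```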

The Link FEC Configuration utility provides the ability to query supported link FEC modes by the adapter for the current link speed and for all supported link speeds.

Additionally, the utility enables the user to change the default FEC algorithm to one of the FEC modes that the adapter supports.

Usage

mlx5cmd.exe -Dbg -LinkSpeed -Name <Network Adapter Name> -Query | -QueryPplm | -Set <value>

Example

mlx5cmd.exe -Dbg -LinkSpeed -Name <Network Adapter Name> -Set RS

Detailed usage

mlx5cmd.exe -Dbg -LinkSpeed -hh


NdStat Utility

This utility enumerates open ND connections. Connections can be filtered by adapter IP or Process ID.

Usage

mlx5cmd -NdStat -hh | [-a <IP address>] [-p <Process Id>] [-e] [-n <count>] [-t <time>]

Example

mlx5cmd -NdStat

Detailed usage

mlx5cmd -NdStat -hh


NdkStat Utility

This utility enumerates open NDK connections. Connections can be filtered by adapter IP or Process ID.

Usage

mlx5cmd -NdkStat -hh | [-a <IP address>] [-e] [-n <count>] [-t <time>]

Example:

mlx5cmd -NdkStat

Detailed usage

mlx5cmd -NdkStat -hh



Debug Utility

This utility exposes driver’s debug information.

Usage

mlx5cmd -Dbg <-PddrInfo | -SwReset> | -hh

Detailed usage

mlx5cmd -Dbg -hh

VF Resources

This tool queries VF MSI-X and EQ count.

Note

This tool is not supported in BlueField 2 SmartNIC mode.

Usage

mlx5cmd -Dbg -VfResources -Name <adapter name>

mlx5cmd -Dbg -VfResources -Name <adapter name> -Vf <vf id>

Detailed usage

mlx5cmd -Dbg -VfResources -hh


Features Status Utility

The utility displays the status of driver features.

Usage

mlx5cmd -Features -hh | -Name <adapter name> [-Json] [-Indentation <count>]

Detailed usage

mlx5cmd -Features -hh


Firmware Capabilities

This tool queries firmware capabilities.

Note

This tool is not supported in BlueField 2 SmartNIC mode.

Usage

mlx5cmd -Dbg -FwCaps -Name <adapter name>

mlx5cmd -Dbg -FwCaps -Name <adapter name> -Vf <vf id>

mlx5cmd -Dbg -FwCaps -Name <adapter name> -Vf <vf id> -DumpAll

Detailed usage

mlx5cmd -FwCaps -hh


Port Diagnostic Database Register (PDDR)

The tool provides troubleshooting and operational information that can assist in debugging physical layer link related issues.

Usage

mlx5cmd -Dbg -PddrInfo [-bdf <pci-bus#> <pci-device#> <pci-function#>] | [-Name <adapter name>] | -hh

Detailed usage

mlx5cmd -Dbg -PddrInfo -hh


Software Reset for Adapter Command

The tool enables the user to execute a software reset on the adapter.

Usage

mlx5cmd -Dbg -SwReset -Name <adapter name>

Detailed usage

mlx5cmd -Dbg -SwReset -hh


Resource Dump

Resource Dump is used to:

  • Query the segments menu (menu mode):

Usage

mlx5cmd -Dbg -ResourceDump -Menu -hh | -Name <adapter name>

Detailed usage

mlx5cmd -Dbg -ResourceDump -Menu -hh

Example

Two menu segment records:

mlx5cmd -Dbg -ResourceDump -Menu -Name "Ethernet"
......
......
__________________________________________________________________

           Segment Type - 0x1301 (EQ_BUFF)

Dump Params                        Applicability    Special Values
--------------------------------   --------------   --------------
index1 -> EQN                      Mandatory        N/A
num_of_obj1                        N/A              N/A
index2 -> EQE                      Optional         N/A
num_of_obj2                        Optional         All
__________________________________________________________________
__________________________________________________________________

           Segment Type - 0x3000 (SX_SLICE)

Dump Params                        Applicability    Special Values
--------------------------------   --------------   --------------
index1 -> SLICE                    Mandatory        N/A
num_of_obj1                        N/A              N/A
index2 -> N/A                      N/A              N/A
num_of_obj2                        N/A              N/A
__________________________________________________________________
__________________________________________________________________
......
......

  • Dump a segment (dump mode):

Usage

mlx5cmd -Dbg -ResourceDump -Menu -hh | -Name <adapter name>

Detailed usage

mlx5cmd -Dbg -ResourceDump -Menu -hh

Example

mlx5cmd -Dbg -ResourceDump -Dump -Name "Ethernet" -Segment 0x1310 -Index1 1
Output file generated at C:\Windows\temp\Mlx5_Dump_Me_Now-7-0-0\PF\dmn-GN-OID-RESDUMP-2020.6.17-19.18.16-Gen6

Note

The tool does not validate any segment parameters; therefore, if any parameter is missing, the tool treats it as zero. If the dump fails, the output file will contain an error message. Hence, we recommend using the menu mode before using this command.

The tool will generate a text file at the printed path, (in our case: “ResourceDump_SegType_0x1310.txt”), and the output text file will contain unparsed text-hex values:

0x0004fffe 0x00000000 0x00000000 0x101b0fb4 0x0005fffa 0x13100000 0x00000001 0x00000000 0x00000000 0x0001fffb
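The unparsed text-hex values can be post-processed with a short script. The following Python sketch is not part of mlx5cmd, and its function name is hypothetical. It only converts the whitespace-separated hexadecimal words from the sample above into integers and makes no attempt to interpret their meaning.

```python
from typing import List


def parse_hex_words(text: str) -> List[int]:
    """Convert whitespace-separated '0x...' tokens into integers."""
    return [int(token, 16) for token in text.split()]


if __name__ == "__main__":
    sample = ("0x0004fffe 0x00000000 0x00000000 0x101b0fb4 0x0005fffa "
              "0x13100000 0x00000001 0x00000000 0x00000000 0x0001fffb")
    words = parse_hex_words(sample)
    print(len(words))
    print([f"0x{word:08x}" for word in words])
```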

Note

Since the Resource Dump feature is used in DMN to generate a directory, DMN uses a mechanism that limits the number of created directories. For further information, see Cyclic DMN Mechanism.


Packet Pacing Capabilities

This tool queries the allocated Packet Pacing objects.

Usage

mlx5cmd -Dbg -FWPacketPacing -Name <adapter name>

mlx5cmd -Dbg -FWPacketPacing -Name <adapter name> -Index <index id>

mlx5cmd -Dbg -FWPacketPacing -Name <adapter name> -UID <uid>

Detailed usage

mlx5cmd -FWPacketPacing -hh

Temperature Utility

The tool queries the external ASIC temperature sensor to get temperature readings. It displays the highest temperature among the ASIC diodes on the adapter in Celsius units.

Usage

mlx5cmd -Temperature -hh | [-Name <adapter name>]

Detailed usage

mlx5cmd -Temperature -hh


Get-NetView Utility

This utility allows the user to collect data on system and network configurations for troubleshooting purposes.

Note

The utility is only supported on Windows Server 2016 and above. For more information, please refer to the Microsoft SDN repository documentation.

Usage

The script is available publicly as part of the Microsoft repository at: https://github.com/microsoft/Get-NetView

To execute the script, simply run the script from PowerShell. Once the script has completed, it will display the output location.

Display RSS Information

RSS information is now displayed from the driver. On Hyper-V, it also displays the vPort VMMQ configurations.

Usage

mlx5cmd -Dbg -RssInfo -Name <adapter name> [-Json <file_name.json>]| -hh


smpquery Utility

smpquery allows querying of various information about the InfiniBand network.

Usage

mlx5cmd -ib -SmpQuery -help


Configuration Validator

This tool validates the configuration of registry keys provided in the configuration file.

Usage

mlx5cmd -ConfigValidator | -Name <Adapter Name> | [-Template] | [-ConfigCompare] | -File <File Name> | -hh

Detailed usage

mlx5cmd -ConfigValidator -hh

Example

Print a Template file:

mlx5cmd -ConfigValidator -Name cx4 -Template -File .\at.json

Compare driver registry configuration with the one in the file:

mlx5cmd -Dbg -ConfigValidator -Name cx4 -ConfigCompare -File .\at.json


VXLAN Offloading Configuration Utility

This tool will allow the user to configure additional ports for VXLAN offloading. The user can also query the VXLAN ports offload configuration of the adapter.

Usage

mlx5cmd -Vxlan -hh | -Name <adapter name> [-add_port <port_num> | -del_port <port_num> | -query]

Detailed usage

mlx5cmd -Vxlan -hh

Notes

  • VXLAN offloading is a global hardware configuration, therefore any modification applies to all adapter ports.

  • VXLAN offloading is always configured on the IANA standard VXLAN port, regardless of OS configuration.


AutoLogger Utility

The AutoLogger is a debuggability capability implemented as part of Mlx5Cmd that automatically collects logs until it detects a trigger defined by the user.

Usage

mlx5cmd -AutoLogger -hh | [-Name <adapter name>] -TriggerType <type>

Detailed usage

mlx5cmd -AutoLogger -hh

Note

This feature is supported on NVIDIA® BlueField®-2 devices only. Using the feature on other devices with REAL_TIME timestamping will result in a wrong PTP clock.

This feature provides a PTP-like ability that lets the user sync the clock by obtaining a PTP-like clock value through DevX commands. To update the system's time, use the value from the “devx_ptp_query_time” option. This feature must be used when REAL_TIME timestamping is enabled.

The DevX API used for this utility is devx_ptp_create(__deref_out struct devx_ptp_context** ppPtpCtx, __in devx_device_ctx* pDevxCtx, __in uint32_t flags).

To enable the feature:

  1. Create the PTP context (devx_ptp_create).

  2. Query the PTP clock (devx_ptp_query_time).

  3. Repeat the process as many times as needed.

  4. Delete the devx PTP context (devx_ptp_destroy).

RoCE Restrict Configuration Utility

This tool limits VM RoCEv2 traffic to a specific IPv6 source address. The subcommand takes the desired IPv6 source address and the desired VFID (if not specified, the first available VF is used) and applies the configuration to that VF.

Expected:

• RoCEv2 traffic with the specific IPv6 source address will be passed

• RoCEv2 traffic with different IPv6 source address will be dropped

• RoCEv2 traffic with IPv4 will be dropped

• All other traffic will be as default

Note

A restore command can be run on the same VF to reset to default behavior.

Usage

mlx5cmd -RoceRestrict -Name <adapter_name> -Set -IPv6SrcAddress <IPv6Address> -VfId <VFID>

Detailed usage

mlx5cmd -RoceRestrict -hh

Example

mlx5cmd -RoceRestrict -Set -Name "SLOT 1 Port 1" -VfId 0 -Ipv6SrcAddr fe80::215:5dff:fe67:123


NicHealthMonitor Utility

NIC Health Monitor is a utility that performs multiple checks on the node, as stated below:

  • Analyzes counters' data and reports detected issues

  • Runs on a live system, and collects dumps and logs periodically for offline troubleshooting until a pre-defined trigger is detected

  • Runs on a live system, scans the system (System event log and Perfmon counters), and reports the status of the NIC, driver, and firmware

Usage

mlx5cmd -Dbg -NicHealthMonitor -AnalyzeCounters -Check -input c:\tmp\CounterData.csv -type 1

Detailed usage

mlx5cmd -Dbg -NicHealthMonitor -CheckNode -hh

Subcommands

-AnalyzeCounters

Analyze counters data and generate a report that includes the detected issues, if applicable.

-SmartTrigger

Run an AutoLogger mechanism on a SmartTrigger.

-CheckNode

Scan the system and report its status

AnalyzeCounters

Checks the NIC health by analyzing the values of the counters found in the input CSV file.

mlx5cmd -Dbg -NicHealthMonitor -AnalyzeCounters -hh | -List | -Check -Input <CSV file> [-Type N] [-FullName] [-Desc] [-Format TXT | CSV] [-CfgFile <Cfg.txt>]

Parameter           Mandatory   Default               Description
-----------------   ---------   -------------------   -----------
List                No          N/A                   Command: print the configurable counters in the format of the <Cfg.txt> file
Check               No          N/A                   Command: check the values of the counters found in the -input file
Input <CSV file>    No          NULL                  File containing the names and values of the counters to be checked; if NULL, the default list is used
Type N              No          3 (errors+warnings)   Bit-field containing the types of results to be shown: 1 - errors, 2 - warnings, 4/8 - good/unchecked counters
FullName            No          N/A                   Print full counter names
Desc                No          N/A                   Print description of the counter
Format TXT | CSV    No          TXT                   The output format
CfgFile <Cfg.txt>   No          N/A                   A text file containing new configuration parameters for the counters printed by the -List command

Examples:

  • Print only error counters in the default format

    mlx5cmd.exe -Dbg -NicHealthMonitor -AnalyzeCounters -Check -input c:\tmp\CounterData.csv -type 1

  • Print only error and warning counters, with the full names of the counters

    mlx5cmd.exe -Dbg -NicHealthMonitor -AnalyzeCounters -Check -input c:\tmp\CounterData.csv -type 3 -FullName

  • Print conclusions on all counters of the input file, with maximum information and in CSV format. 'All counters' requires '-type 15'

    mlx5cmd.exe -Dbg -NicHealthMonitor -AnalyzeCounters -Check -input c:\tmp\CounterData.csv -type 15 -FullName -Desc -Format CSV > output.csv
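The -type values used in these examples (1, 3 and 15) are combinations of the bit-field values listed for the Type parameter. The Python sketch below shows how such a value decomposes. The helper names are hypothetical and not part of mlx5cmd, and reading "4/8-good/unchecked" as 4 = good and 8 = unchecked is an assumption based on the order of the text.

```python
from typing import List

# Bit values of the Type parameter. The 4 = good / 8 = unchecked split is an
# assumption derived from the order in which the two are listed.
RESULT_TYPES = {1: "errors", 2: "warnings", 4: "good", 8: "unchecked"}


def decode_type(value: int) -> List[str]:
    """List the result types selected by a Type bit-field value."""
    if not 0 <= value <= 15:
        raise ValueError("expected a value between 0 and 15")
    return [name for bit, name in RESULT_TYPES.items() if value & bit]


def encode_type(*names: str) -> int:
    """Combine result type names into a single Type value."""
    by_name = {name: bit for bit, name in RESULT_TYPES.items()}
    value = 0
    for name in names:
        value |= by_name[name]
    return value


if __name__ == "__main__":
    for value in (1, 3, 15):
        print(value, decode_type(value))
```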

SmartTrigger

SmartTrigger is a debuggability capability implemented as part of Mlx5Cmd that automatically collects logs until it detects a trigger defined by the user.

mlx5cmd -Dbg -NicHealthMonitor -SmartTrigger -hh | [-Name <adapter name>] -TriggerType <type>

Parameter                 Mandatory                                                Default             Description
-----------------------   ------------------------------------------------------   -----------------   -----------
Name <adapter name>       No                                                       1st adapter         The network adapter name
TriggerType               Yes                                                                          The type of trigger. Valid values: Event | CounterNumeric | CounterProgressPrev | CounterProgressFirst
SampleInterval            No                                                       30                  The interval (in seconds) at which the logs are collected. The legal range is 5-86400.
TriggerQueryInterval      No                                                       10                  The interval (in seconds) at which the tool checks whether a trigger event has been generated. The legal range is 5-14400.
TotalTime                 No                                                       infinitely          The total time (in seconds) the tool will run. By default, it runs until stopped by the user.
TriggerEventID            Yes when TriggerType Event                               N/A                 The event ID of the trigger event. Can only be used with TriggerType Event.
TriggerCounterName        Yes when TriggerType CounterNumeric or CounterProgress   N/A                 The full path of the counter. Can only be used with TriggerType CounterNumeric or CounterProgress.
TriggerCounterThreshold   Yes when TriggerType CounterNumeric or CounterProgress   N/A                 Can only be used with TriggerType CounterNumeric or CounterProgress. If TriggerType is CounterNumeric, valid values are greater than 0; if TriggerType is CounterProgress, valid values are 1-100.
MaxNumLog                 No                                                       3                   The maximum number of logs to be collected (including the post-trigger log). Valid values are 2-10.
LogsPath                  No                                                       %SystemRoot%\Temp   Path to save logs
CopyDMNDirs               No                                                       False               If DMN cannot be generated inside the logger directory, copy the DMN directories from the default path instead of saving their path.
Verbose                   No                                                       Normal              Tool verbosity. Valid values are Normal or Debug.

Example:

  • Start an instance of the tool to collect periodic logs every 5 seconds, and query the Event Log every 10 seconds to see if an event with ID 403 has been logged. If the event has been logged, the tool collects the final logs and exits. The instance runs until the event has been logged or until the user stops it.

    mlx5cmd -Dbg -NicHealthMonitor -SmartTrigger -Name <adapter name> -TriggerType Event -TriggerEventID 403 -SampleInterval 5 -TriggerQueryInterval 10

  • Start an instance of the tool to collect periodic logs every 30 seconds, and query the specified counter every 10 seconds to see if the current value of the counter is >=1000000. If it is, the tool collects the final logs and exits. The tool runs for a maximum of 180 seconds.

    mlx5cmd -Dbg -NicHealthMonitor -SmartTrigger -TriggerType CounterNumeric -TriggerCounterName "\Mellanox WinOF-2 Port Traffic(_Total)\Bytes Total" -TriggerCounterThreshold 1000000 -TotalTime 180
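When SmartTrigger is launched from a script, the documented ranges can be checked before the tool is started. The Python sketch below is a hypothetical wrapper, not part of mlx5cmd. It assumes that every parameter is passed with a leading hyphen, as in the examples above, and that the 1-100 threshold rule applies to both CounterProgress trigger types.

```python
from typing import List, Optional

TRIGGER_TYPES = ("Event", "CounterNumeric",
                 "CounterProgressPrev", "CounterProgressFirst")


def _check_range(name: str, value: int, low: int, high: int) -> None:
    """Raise ValueError if value is outside the inclusive range low-high."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be in the range {low}-{high}, got {value}")


def build_smart_trigger_args(trigger_type: str,
                             sample_interval: int = 30,
                             trigger_query_interval: int = 10,
                             max_num_log: int = 3,
                             event_id: Optional[int] = None,
                             counter_name: Optional[str] = None,
                             counter_threshold: Optional[int] = None) -> List[str]:
    """Validate the values against the documented ranges and return the arguments."""
    if trigger_type not in TRIGGER_TYPES:
        raise ValueError(f"unknown TriggerType: {trigger_type}")
    _check_range("SampleInterval", sample_interval, 5, 86400)
    _check_range("TriggerQueryInterval", trigger_query_interval, 5, 14400)
    _check_range("MaxNumLog", max_num_log, 2, 10)

    args = ["-Dbg", "-NicHealthMonitor", "-SmartTrigger",
            "-TriggerType", trigger_type,
            "-SampleInterval", str(sample_interval),
            "-TriggerQueryInterval", str(trigger_query_interval),
            "-MaxNumLog", str(max_num_log)]

    if trigger_type == "Event":
        if event_id is None:
            raise ValueError("TriggerEventID is mandatory with TriggerType Event")
        args += ["-TriggerEventID", str(event_id)]
    else:
        if counter_name is None or counter_threshold is None:
            raise ValueError("counter name and threshold are mandatory "
                             "for counter triggers")
        if trigger_type == "CounterNumeric":
            if counter_threshold <= 0:
                raise ValueError("TriggerCounterThreshold must be greater than 0")
        else:
            # Assumption: the 1-100 rule covers both CounterProgress variants.
            _check_range("TriggerCounterThreshold", counter_threshold, 1, 100)
        args += ["-TriggerCounterName", counter_name,
                 "-TriggerCounterThreshold", str(counter_threshold)]
    return args


if __name__ == "__main__":
    print(build_smart_trigger_args("Event", sample_interval=5, event_id=403))
```

The returned list can be appended to the tool name and passed to a process launcher; keeping the arguments as a list avoids quoting problems with counter paths that contain spaces.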

CheckNode

Nic Health Monitor estimates the health of the NIC by analyzing the firmware and diagnostic counters, collected previously by the customer.

Usage

mlx5cmd -Dbg -NicHealthMonitor -CheckNode -hh | [-Name <adapter name>]

Detailed usage

mlx5cmd -Dbg -NicHealthMonitor -CheckNode -hh

Example

mlx5cmd -Dbg -NicHealthMonitor -CheckNode

Parameter Descriptions

Parameter            Mandatory             Default             Description
------------------   -------------------   -----------------   -----------
Name                 No                    N/A                 The name of the adapter. If this parameter is not provided, the tool uses the first adapter it finds.
Periodic             No                    False               This flag is used to start a manual CheckNode operation.
OldEventsThreshold   No                    36000               The tool searches for events that were logged in the last OldEventsThreshold seconds. Events older than NewEventsThreshold and newer than OldEventsThreshold are considered OLD.
NewEventsThreshold   Only in manual mode   10800               Events logged in the last NewEventsThreshold seconds are considered NEW. In manual mode, this parameter represents the time since CheckNode was last called (in manual mode), and is mandatory. Note: NewEventsThreshold must be less than OldEventsThreshold.
LogsPath             No                    %SystemRoot%\Temp   The path to save the logs in.
LogToFile            No                    False               Use this parameter to generate a log file instead of printing output to STDOUT.
Verbose              No                    False               Use verbose printing.


Event Log

The tool will check the event log for the following events:

ID    Event
---   -----
2     MLX_EVENT_INIT_BIT_STUCK
8     MLX_EVENT_INIT_BIT_STUCK_ON_SHUTDOWN
12    MLX_EVENT_LOG_NOT_ENOUGH_MSIX_VECTORS
16    MLX_EVENT_LOG_CQ_EVENT_MSG
19    MLX_EVENT_LOG_CQE_ERROR_MSG
20    MLX_EVENT_LOG_EQ_STUCK_MSG
21    MLX_EVENT_LOG_TX_QUEUE_TIMEOUT_MSG
22    MLX_EVENT_LOG_RX_QUEUE_TIMEOUT_MSG
66    MLX_EVENT_FW_HEALTH_REPORT
76    MLX_EVENT_LOG_VF_REACHED_MAX_PAGES
138   MLX_EVENT_ERROR_RESILIENCY_IGNORE_EVENT
149   MLX_EVENT_ERROR_RESILIENCY_START
267   MLX_EVENT_LOG_ERROR_QUERY_HCA_CAP
268   MLX_EVENT_LOG_ERROR_QUERY_ADAPTER
304   MLX_EVENT_LOG_ERROR_FW_CMD_FAILED
307   MLX_EVENT_LOG_ERROR_FW_CMD_EXEC_FAILED
355   MLX_EVENT_LOG_NDIS_RESET_FAILED
356   MLX_EVENT_RECEIVE_HANG
357   MLX_EVENT_TRANSMIT_ENGINE_HANG
363   MLX_EVENT_ADAPTER_RESTART_BY_DEVICE_IS_DISABLED
386   MLX_EVENT_LOG_VPORT_TX_QUEUE_TIMEOUT_MSG
387   MLX_EVENT_LOG_TX_QUEUE_TIMEOUT_MSG
421   MLX_EVENT_STUCK_OID


Auto Mode

When set to Auto mode, the CheckNode command will perform the following:

  1. Query the event log for events from the list that were logged by the driver (events with the source “mlx5”) in the last NewEventsThreshold seconds. These events are considered NEW, and if any were logged, the status will be RED.

  2. Query the event log for events logged more than NewEventsThreshold seconds ago but less than OldEventsThreshold seconds ago. If any are found, they are considered OLD and the status will be YELLOW.

  3. Collect 3 samples of the NVIDIA counters and analyze the output CSV file using the AnalyzeCounters utility. If the status after the analysis of the event log and counters is YELLOW or RED, the tool will collect Dump-me-now, ETLs and event log.
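The event-log part of the steps above can be sketched in Python as follows. This is a simplified illustration, not the tool's implementation: the function names are hypothetical, the inclusive boundaries and the GREEN result when no events are found are assumptions, and the counter analysis of step 3 (which may also affect the status) is left out. The default thresholds are the ones listed in the parameter table.

```python
from typing import Iterable, Optional


def classify_event(age_seconds: float,
                   new_threshold: int = 10800,
                   old_threshold: int = 36000) -> Optional[str]:
    """Label an event as NEW, OLD, or None (outside the search window)."""
    if new_threshold >= old_threshold:
        raise ValueError("NewEventsThreshold must be less than OldEventsThreshold")
    if age_seconds <= new_threshold:
        return "NEW"
    if age_seconds <= old_threshold:
        return "OLD"
    return None


def event_log_status(ages_seconds: Iterable[float],
                     new_threshold: int = 10800,
                     old_threshold: int = 36000) -> str:
    """Return RED if any event is NEW, YELLOW if any is OLD, otherwise GREEN."""
    labels = {classify_event(age, new_threshold, old_threshold)
              for age in ages_seconds}
    if "NEW" in labels:
        return "RED"
    if "OLD" in labels:
        return "YELLOW"
    return "GREEN"


if __name__ == "__main__":
    print(event_log_status([20000, 100]))  # one OLD event and one NEW event
    print(event_log_status([20000]))       # only an OLD event
    print(event_log_status([50000]))       # event outside the search window
```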

Manual Mode (Periodic)

In Periodic mode, if this is the first time the tool is running, it establishes a baseline by collecting one sample of counters and returns a GREEN status. To determine whether a baseline exists, the tool searches for a folder named CheckNodePeriodic in LogsPath. If the folder does not exist, the tool assumes there is no baseline and creates the folder.

If the base line exists, the tool will query the event log for events logged in the last NewEventsThreshold seconds. If any of the events from the list are found, they will be considered as NEW and the result will be RED.

Note

In Manual mode, the tool does NOT check for OLD events.

After processing the event log, the tool collects one sample of counters and analyzes it with the AnalyzeCounters utility by comparing it to the previous sample (collected on the previous call to CheckNode in manual mode).

  • If the status is RED or YELLOW, the tool will collect Dump-me-now, ETLs and event log and will NOT erase the logs from the previous run, for comparison.

  • If the status is GREEN, only the counter data and dump-me-now from the current run will be saved.
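The baseline check used in Periodic mode amounts to testing for a folder and creating it when it is missing. The Python sketch below illustrates that logic only. The function name is hypothetical, the folder name CheckNodePeriodic is the one given above, and a temporary directory stands in for LogsPath.

```python
import os
import tempfile

BASELINE_DIR_NAME = "CheckNodePeriodic"


def ensure_baseline(logs_path: str) -> bool:
    """Return True if the baseline folder already existed.

    If it did not exist, create it and return False, which corresponds to
    the first run that only establishes the baseline.
    """
    baseline_dir = os.path.join(logs_path, BASELINE_DIR_NAME)
    if os.path.isdir(baseline_dir):
        return True
    os.makedirs(baseline_dir)
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as logs_path:
        print(ensure_baseline(logs_path))  # first run: no baseline yet
        print(ensure_baseline(logs_path))  # second run: baseline found
```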

© Copyright 2024, NVIDIA. Last updated on Sep 18, 2024.