NVIDIA Firmware Tools (MFT) Documentation v4.24.0

mlxfwstress Utility

Warning

Support for new devices requires upgrading the tool to its latest version.

mlxfwstress enables/disables various firmware stress flows. It can work in multiple modes:

  • Enable/disable a specific set of stress types

  • Clear all stress types

  • Random mode:

    • Single mode - choose one stress type in each iteration and enable/disable it

    • Wild mode - choose multiple stress types in each iteration and enable/disable them

Each time a stress type is chosen in a random iteration, the opposite operation is done on it (e.g., if a stress type is turned on, in the next iteration it will be turned off and vice versa).

  • Toggle mode:

    • Alternately turns the listed stress types on and off. Can be used with iterations.

      Warning

      To disable a stressor while in toggle mode, you must first stop the mlxfwstress tool, and only then disable the stressor.

  • Clear semaphore:
    Note: This functionality is supported in ConnectX-3 Pro adapter cards only.


# mlxfwstress [-d|--dev <DeviceName>] [-h|--help] [-v|--version] [-o|--operation <Operation>] [--rand-mode <Random mode>] [-t|--stress-type <Stress type>] [--iterations <Iterations>] [--stress-delay <Stress delay>] [--max-rand-on <Max rand on>] [--hang-type <Hang type>] [--seed <seed>] [--toggle-time <x,y>]

where:

-d|--dev <DeviceName>

Perform operation for a specified device

-h|--help

Show this message and exit

-v|--version

Show the executable version and exit

-o|--operation <Operation>

Choose operation: on, off, clear_all, random, query, clear_semaphore

--rand-mode <Random mode>

Choose a random mode: single, wild

-t|--stress-type <Stress type>

Specify a list of stress types separated by comma. (See Stress Types.)

--iterations <Iterations>

Specify the number of iterations.

--stress-delay <Stress delay>

Specify the stress delay in seconds (can be float).

Note: Some stress flows may take more time.

Recommended values: 0-1

--max-rand-on <Max rand on>

Specify the maximal time a stress is allowed to be on in random mode in seconds.

Recommended values: (0,1]

Default is 1

--hang-type <Hang type>

Specify a list of hang types separated by comma. (See Hang Types.)

--seed <seed>

Specify the seed for the random number generator.

--toggle-time <x,y>

Toggle times: x is the time after on and y is the time after off, both in seconds (can be float). If y is not supplied, the tool uses the value of x for both.
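
The options above can be combined programmatically before calling the tool. The following is a minimal sketch, not part of MFT: the helper name build_mlxfwstress_args and its parameters are hypothetical, and it only assembles an argument list from flags documented in this section (it does not run anything).

```python
from typing import List, Optional, Sequence

# Operations and random modes as listed in the option descriptions above.
ALLOWED_OPERATIONS = {"on", "off", "clear_all", "random", "query", "clear_semaphore"}
ALLOWED_RAND_MODES = {"single", "wild"}


def build_mlxfwstress_args(
    device: str,
    operation: str,
    stress_types: Optional[Sequence[str]] = None,
    hang_types: Optional[Sequence[str]] = None,
    rand_mode: Optional[str] = None,
    stress_delay: Optional[float] = None,
    iterations: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Assemble an mlxfwstress argument list (hypothetical helper, builds only)."""
    if operation not in ALLOWED_OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    if rand_mode is not None and rand_mode not in ALLOWED_RAND_MODES:
        raise ValueError(f"unknown random mode: {rand_mode!r}")

    args = ["mlxfwstress", "-d", device, "-o", operation]
    if rand_mode is not None:
        args += ["--rand-mode", rand_mode]
    if stress_types:
        # Stress types are passed as one comma-separated value.
        args += ["-t", ",".join(stress_types)]
    if hang_types:
        # Hang types are passed the same way.
        args += ["--hang-type", ",".join(hang_types)]
    if stress_delay is not None:
        args += ["--stress-delay", str(stress_delay)]
    if iterations is not None:
        args += ["--iterations", str(iterations)]
    if seed is not None:
        args += ["--seed", str(seed)]
    return args
```

The returned list can be joined with spaces for display or passed to a process launcher such as subprocess.run on a host where MFT is installed.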

ConnectX-4/ConnectX-4 Lx/ConnectX-5 Adapter Cards Stress Types

The following are the stress types available for ConnectX-4/ConnectX-4 Lx/ConnectX-5 adapter cards:

Category: Transparent

  • PAUSE_STORM_GENERATION - Generates pause frames from the device toward the network

  • INVALIDATE_INTERNAL_CACHE_RX_1 - Invalidates STE cache

  • INVALIDATE_INTERNAL_CACHE_RX_2 - Invalidates qp L0 cache (RX)

  • INVALIDATE_INTERNAL_CACHE_RX_3 - Invalidates dct L0 cache (RX)

  • INVALIDATE_INTERNAL_CACHE_RX_4 - Invalidates scatter list cache in RX

  • INVALIDATE_INTERNAL_CACHE_CQ - Invalidates CQC cache

  • INVALIDATE_INTERNAL_CACHE_SX1 - Invalidates SXDC cache

  • INVALIDATE_INTERNAL_CACHE_RX_5 - Invalidates LDB cache

  • INVALIDATE_INTERNAL_CACHE_GENERAL_1 - Invalidates RO caches

  • INVALIDATE_INTERNAL_CACHE_SX2 - Invalidates pkey cache (SX)

  • INVALIDATE_INTERNAL_CACHE_SX3 - Invalidates guid cache (SX)

  • INVALIDATE_INTERNAL_CACHE_QP - Invalidates QPC (main QP cache unit)

Category: Hang FW/HW

  • PACKET_DROP - Drops N packets on port x. This type requires the following extra flags:

    • num_of_packets - 8 bit (max 15)

    • port_num - 8 bit (should be 1 or 2)


ConnectX-3 Pro Adapter Cards Stress Types

The following are the stress types available for ConnectX-3 Pro adapter cards:

Warning

Stressors in the "Transparent" category that are active for more than 100 msec may cause resiliency.

Category: Transparent

  • STOP_CE_INSTAGE_EQE - Stops sending EQEs created by the hardware (not the ones created by the firmware).

  • STOP_EDBH - Stops the handling of external doorbells.

  • STOP_IDBH - Stops the handling of internal doorbells.

  • STOP_QPC_MISS_MACHINE_0, STOP_QPC_MISS_MACHINE_1, STOP_QPC_MISS_MACHINE_2, STOP_QPC_MISS_MACHINE_3 - Stops reading a QPC from the ICM on a miss, blocking hardware/firmware that accesses the QPC.

  • LOCK_CEGW - Locks the CQE gateway.

  • LOCK_OBGW_TPT, LOCK_OBGW_TCU, LOCK_OBGW_SXD - Locks the OBGW (access to the host memory gateway).

  • LOCK_QPCGW_RX - Locks QPCGW.

  • LOCK_SEMAPHORE_IPC_RX0, LOCK_SEMAPHORE_IPC_RX1, LOCK_SEMAPHORE_IPC_LDB, LOCK_SEMAPHORE_IPC_SX1 - Locks the IPC semaphore.

  • INVALIDATE_CACHES - Invalidates caches.

Category: Performance

  • STOP_SXP_VL_ARB_PORT1, STOP_SXP_VL_ARB_PORT2 - Stops transmission of packets to the wire. Causes head-of-line packet drop (HLL) if enabled.

  • RX_BACKPRESSURE - Stops the RX pipe, causing back-pressure to the wire and the sending of TX pauses.

  • DROP_PACKETS_TX - Drops packets on the TX side.


Turning On Stress Types

To turn on a specific stress type:


mlxfwstress -d mt4103_pciconf0 -o on -t STOP_CE_INSTAGE_EQE
-------------------------------------------------
Operation: [ON]
-------------------------------------------------
Turning ON stress type: stop_ce_instage_eqe -PASSED

To turn on a set of stress types:


mlxfwstress -d mt4103_pciconf0 -o on -t STOP_CE_INSTAGE_EQE,STOP_QPC_MISS_MACHINE_3,LOCK_SEMAPHORE_IPC_RX1
-------------------------------------------------
Operation: [ON]
-------------------------------------------------
Turning ON stress type: stop_ce_instage_eqe -PASSED
Turning ON stress type: stop_qpc_miss_machine_3 -PASSED
Turning ON stress type: lock_semaphore_ipc_rx1 -PASSED

To turn on all the available stress types:


mlxfwstress -d mt4119_pciconf0 -t ALL -o on
Random seed: [1587969653]
-------------------------------------------------
Operation: [ON]
-------------------------------------------------
Turning ON stress type: INVALIDATE_INTERNAL_CACHE_CQ -PASSED
Turning ON stress type: INVALIDATE_INTERNAL_CACHE_GENERAL_1 -PASSED
Turning ON stress type: INVALIDATE_INTERNAL_CACHE_QP -PASSED
Turning ON stress type: INVALIDATE_INTERNAL_CACHE_RX_1 -PASSED
Turning ON stress type: INVALIDATE_INTERNAL_CACHE_RX_2 -PASSED
Turning ON stress type: INVALIDATE_INTERNAL_CACHE_RX_3 -PASSED
Turning ON stress type: INVALIDATE_INTERNAL_CACHE_RX_4 -PASSED
Turning ON stress type: INVALIDATE_INTERNAL_CACHE_RX_5 -PASSED
Turning ON stress type: INVALIDATE_INTERNAL_CACHE_SX1 -PASSED
Turning ON stress type: INVALIDATE_INTERNAL_CACHE_SX2 -PASSED
Turning ON stress type: INVALIDATE_INTERNAL_CACHE_SX3 -PASSED


Turning Off Stress Types

To turn off a specific stress type:


mlxfwstress -d mt4103_pciconf0 -o off -t STOP_CE_INSTAGE_EQE
-------------------------------------------------
Operation: [OFF]
-------------------------------------------------
Turning OFF stress type: stop_ce_instage_eqe -PASSED

To turn off a set of stress types:


mlxfwstress -d mt4103_pciconf0 -o off -t STOP_CE_INSTAGE_EQE,STOP_QPC_MISS_MACHINE_3,LOCK_SEMAPHORE_IPC_RX1
-------------------------------------------------
Operation: [OFF]
-------------------------------------------------
Turning OFF stress type: stop_ce_instage_eqe -PASSED
Turning OFF stress type: stop_qpc_miss_machine_3 -PASSED
Turning OFF stress type: lock_semaphore_ipc_rx1 -PASSED


Querying the Stress Types

To query the state of all stress types:


mlxfwstress -d mt4117_pciconf0 -o query -t ALL
-------------------------------------------------
Operation: [QUERY]
-------------------------------------------------
Querying stress type: INVALIDATE_INTERNAL_CACHE_CQ -ENABLED
Querying stress type: INVALIDATE_INTERNAL_CACHE_GENERAL_1 -ENABLED
Querying stress type: INVALIDATE_INTERNAL_CACHE_QP -ENABLED
Querying stress type: INVALIDATE_INTERNAL_CACHE_RX_1 -ENABLED
Querying stress type: INVALIDATE_INTERNAL_CACHE_RX_2 -ENABLED
Querying stress type: INVALIDATE_INTERNAL_CACHE_RX_3 -ENABLED
Querying stress type: INVALIDATE_INTERNAL_CACHE_RX_4 -ENABLED
Querying stress type: INVALIDATE_INTERNAL_CACHE_RX_5 -NOT SUPPORTED
Querying stress type: INVALIDATE_INTERNAL_CACHE_SX1 -ENABLED
Querying stress type: INVALIDATE_INTERNAL_CACHE_SX2 -ENABLED
Querying stress type: INVALIDATE_INTERNAL_CACHE_SX3 -ENABLED
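
When the query result is consumed by a script, it may help to turn it into a mapping of type name to state. The sketch below is an illustration only: the function name parse_query_output is hypothetical, and it assumes the tool prints one "Querying stress type: NAME -STATE" (or "Querying hang type: ...") entry per line, as in the example above. The state text is returned as printed, since only ENABLED and NOT SUPPORTED appear in this document.

```python
import re
from typing import Dict

# One entry per line is an assumption based on the example output above.
_QUERY_LINE = re.compile(r"^Querying (?:stress|hang) type:\s*(\S+)\s+-(.+?)\s*$")


def parse_query_output(text: str) -> Dict[str, str]:
    """Map each queried stress/hang type to the state string printed for it."""
    states: Dict[str, str] = {}
    for line in text.splitlines():
        match = _QUERY_LINE.match(line.strip())
        if match:
            # group(1) is the type name, group(2) the text after the dash.
            states[match.group(1)] = match.group(2)
    return states
```

Lines that do not match the pattern (the command echo, separators, the operation banner) are ignored.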


ConnectX-4/ConnectX-4 Lx/ConnectX-5 Adapter Cards Hang Types

The following are the hang types available for ConnectX-4/ConnectX-4 Lx/ConnectX-5 adapter cards:

Category: Hang FW/HW

  • FFSER - Initialize FaultInjector object

    • This hang type is supported on BlueField-2 devices only.

    • No extra flag is required.

  • STOP_RX_PER_PRIO - This type requires the following extra flags:

    • vl_mask - 16 bit

    • port_num - 8 bit


mlxfwstress -d mt4115_pciconf0 -o on --hang-type STOP_RX_PER_PRIO --extra %STOP_RX_PER_PRIO[0x00100FF]
Random seed: [1588056318]
-------------------------------------------------
Operation: [ON]
-------------------------------------------------
Turning ON hang type: STOP_RX_PER_PRIO -PASSED

To turn on this hang type, the command must be executed in the following format:

Example:


mlxfwstress -d mt4115_pciconf0 -o on --hang-type STOP_RX_PER_PRIO --extra % STOP_RX_PER_PRIO [0x000100FF]
output:
Random seed: [1573642282]
-------------------------------------------------
Operation: [ON]
-------------------------------------------------
Turning ON hang type: STOP_RX_PER_PRIO-PASSED
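
The value passed in brackets combines the two extra flags into one hexadecimal number. This document does not state the bit layout. The sketch below assumes that vl_mask occupies the low 16 bits and port_num the next 8 bits, which is only an inference from the example value 0x000100FF (port 1, mask 0x00FF); verify it against your device before relying on it. The function name pack_stop_rx_per_prio is hypothetical.

```python
def pack_stop_rx_per_prio(vl_mask: int, port_num: int) -> str:
    """Pack vl_mask (16 bit) and port_num (8 bit) into one hex string.

    ASSUMPTION: port_num is placed directly above the 16-bit vl_mask.
    This layout is inferred from the documented example 0x000100FF and is
    not confirmed by the tool's documentation.
    """
    if not 0 <= vl_mask <= 0xFFFF:
        raise ValueError("vl_mask must fit in 16 bits")
    if not 0 <= port_num <= 0xFF:
        raise ValueError("port_num must fit in 8 bits")
    value = (port_num << 16) | vl_mask
    # Zero-pad to 8 hex digits to match the form used in the example.
    return f"0x{value:08X}"
```

Under that assumption, pack_stop_rx_per_prio(0x00FF, 1) reproduces the value used in the example command.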


Turning On Hang Types

To turn on a specific hang type:


mlxfwstress -d mt4103_pciconf0 -o on --hang-type HANG_SX1
-------------------------------------------------
Operation: [ON]
-------------------------------------------------
Turning ON hang type: Sx1 -PASSED

To turn on a set of hang types:


mlxfwstress -d mt4103_pciconf0 -o on --hang-type HANG_SX1,HANG_RX1
-------------------------------------------------
Operation: [ON]
-------------------------------------------------
Turning ON hang type: Sx1 -PASSED
Turning ON hang type: Rx1 -PASSED


Turning Off Hang Types

To turn off a specific hang type:


mlxfwstress -d mt4103_pciconf0 -o off --hang-type HANG_SX1
-------------------------------------------------
Operation: [OFF]
-------------------------------------------------
Turning OFF hang type: Sx1 -PASSED

To turn off a set of hang types:


mlxfwstress -d mt4103_pciconf0 -o off --hang-type HANG_SX1,HANG_RX1
-------------------------------------------------
Operation: [OFF]
-------------------------------------------------
Turning OFF hang type: Sx1 -PASSED
Turning OFF hang type: Rx1 -PASSED


Querying the Hang Types

To query the state of all hang types:


mlxfwstress -d mt4103_pciconf0 -o query --hang-type ALL
-------------------------------------------------
Operation: [QUERY]
-------------------------------------------------
Querying hang type: Sx1 -ENABLED
Querying hang type: Rx1 -ENABLED
Querying hang type: Tx -ENABLED
Querying hang type: Rx -ENABLED


To clear all stress/hang types:


mlxfwstress -d mt4103_pciconf0 -o clear_all
-------------------------------------------------
Operation: [CLEAR_ALL]
-------------------------------------------------
Turning OFF hang type: Sx1 -PASSED
Turning OFF hang type: Rx1 -PASSED
Turning OFF hang type: Tx -PASSED
Turning OFF hang type: Rx -PASSED
Turning OFF stress type: stop_ce_instage_eqe -PASSED
Turning OFF stress type: stop_sxp_vl_arb_port1 -PASSED
Turning OFF stress type: stop_sxp_vl_arb_port2 -PASSED
Turning OFF stress type: stop_edbh -PASSED
Turning OFF stress type: stop_idbh -PASSED
Turning OFF stress type: stop_qpc_miss_machine_0 -PASSED
Turning OFF stress type: stop_qpc_miss_machine_1 -PASSED
Turning OFF stress type: stop_qpc_miss_machine_2 -PASSED
Turning OFF stress type: stop_qpc_miss_machine_3 -PASSED
Turning OFF stress type: lock_cegw -PASSED
Turning OFF stress type: lock_obgw_tpt -PASSED
Turning OFF stress type: lock_obgw_tcu -PASSED
Turning OFF stress type: lock_obgw_sxd -PASSED
Turning OFF stress type: lock_qpcgw_rx -PASSED
Turning OFF stress type: lock_semaphore_ipc_sx1 -PASSED
Turning OFF stress type: lock_semaphore_ipc_rx0 -PASSED
Turning OFF stress type: lock_semaphore_ipc_rx1 -PASSED
Turning OFF stress type: lock_semaphore_ipc_ldb -PASSED
Turning OFF stress type: invalidate_caches -PASSED

To clear the semaphore:


mlxfwstress -d mt4103_pciconf0 -o clear_semaphore
-------------------------------------------------
Operation: [CLEAR_SEMAPHORE]
-------------------------------------------------
Semaphore was cleared successfully

There are two random modes you can choose from:

  • Single - given a set of stress types, in each iteration one stress type is chosen and toggled ON/OFF according to its current state

  • Wild - given a set of stress types, in each iteration a random subset of stress types is chosen and toggled ON/OFF according to their current state
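
The difference between the two modes can be illustrated with a small simulation. This is a sketch of the selection logic as described above, not the tool's implementation: the function simulate_random_mode is hypothetical, it assumes every stress type starts OFF and that a wild iteration picks at least one type, and it does not model --stress-delay, --max-rand-on, or the final turn-off step.

```python
import random
from typing import Dict, List, Optional, Sequence


def simulate_random_mode(
    stress_types: Sequence[str],
    iterations: int,
    mode: str = "single",
    seed: Optional[int] = None,
) -> List[Dict[str, bool]]:
    """Return the ON/OFF state of every stress type after each iteration."""
    if not stress_types:
        raise ValueError("at least one stress type is required")
    names = list(stress_types)
    rng = random.Random(seed)
    state = {name: False for name in names}  # assumption: all start OFF
    history: List[Dict[str, bool]] = []
    for _ in range(iterations):
        if mode == "single":
            # Single mode: exactly one stress type per iteration.
            chosen = [rng.choice(names)]
        elif mode == "wild":
            # Wild mode: a random non-empty subset per iteration (assumed).
            chosen = rng.sample(names, rng.randint(1, len(names)))
        else:
            raise ValueError(f"unknown random mode: {mode!r}")
        for name in chosen:
            # The opposite operation is applied to each chosen type.
            state[name] = not state[name]
        history.append(dict(state))
    return history
```

With a single stress type in single mode, the simulated state alternates ON, OFF, ON, OFF, which matches the pattern in the first iterations of the Single Mode example in this section.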

Setting the Random Mode for the Stress Types

To set the Single Mode:


mlxfwstress -d mt4103_pciconf0 -o random --rand-mode single -t STOP_CE_INSTAGE_EQE --stress-delay 0.2 --iterations 10
-------------------------------------------------
Operation: [RANDOM]
-------------------------------------------------
#############################################
Random:
Iterations delay: 0.2 [sec]
Iterations number: 10
Max on time: 1 [sec]
#############################################
RANDOM ITERATION: [1]
[stop_ce_instage_eqe]: [ON] , duration since last operation: 0 [ms]
RANDOM ITERATION: [2]
[stop_ce_instage_eqe]: [OFF], duration since last operation: 200 [ms]
RANDOM ITERATION: [3]
[stop_ce_instage_eqe]: [ON] , duration since last operation: 201 [ms]
RANDOM ITERATION: [4]
[stop_ce_instage_eqe]: [OFF], duration since last operation: 200 [ms]
RANDOM ITERATION: [5]
[stop_ce_instage_eqe]: [ON] , duration since last operation: 200 [ms]
RANDOM ITERATION: [6]
[stop_ce_instage_eqe]: [OFF], duration since last operation: 201 [ms]
RANDOM ITERATION: [7]
[stop_ce_instage_eqe]: [ON] , duration since last operation: 200 [ms]
RANDOM ITERATION: [8]
[stop_ce_instage_eqe]: [OFF], duration since last operation: 201 [ms]
RANDOM ITERATION: [9]
[stop_ce_instage_eqe]: [ON] , duration since last operation: 200 [ms]
Turning OFF stress type: stop_ce_instage_eqe
RANDOM ITERATION: [10]
[stop_ce_instage_eqe]: [ON] , duration since last operation: 200 [ms]
=======================================================
Turning off all stress types after random:
Turning OFF stress type: stop_ce_instage_eqe

  • As seen in the example above, after the specified number of iterations, the tool turns off all the stress types.

  • The default value for stress-delay is 1 second.

  • If no number of iterations is supplied, the user is expected to stop the tool with Ctrl+C. The tool then turns off all the stress types.

To set the Wild Mode:


mlxfwstress -d mt4103_pciconf0 -o random --rand-mode wild -t ALL --stress-delay 0.2 --max-rand-on 1 --iterations 5
-------------------------------------------------
Operation: [RANDOM]
-------------------------------------------------
#############################################
Random:
Iterations delay: 0.2 [sec]
Iterations number: 5
Max on time: 1 [sec]
#############################################

RANDOM ITERATION: [1]
[stop_ce_instage_eqe]: [ON] , duration since last operation: 0 [ms]
[stop_sxp_vl_arb_port2]: [ON] , duration since last operation: 0 [ms]
[stop_edbh]: [ON] , duration since last operation: 0 [ms]
[stop_idbh]: [ON] , duration since last operation: 0 [ms]
[stop_qpc_miss_machine_0]: [ON] , duration since last operation: 0 [ms]
[stop_qpc_miss_machine_3]: [ON] , duration since last operation: 0 [ms]
[lock_cegw]: [ON] , duration since last operation: 0 [ms]
[lock_obgw_tcu]: [ON] , duration since last operation: 0 [ms]
[lock_qpcgw_rx]: [ON] , duration since last operation: 0 [ms]
[lock_semaphore_ipc_sx1]: [ON] , duration since last operation: 0 [ms]

RANDOM ITERATION: [2]
[stop_sxp_vl_arb_port1]: [ON] , duration since last operation: 0 [ms]
[stop_edbh]: [OFF], duration since last operation: 203 [ms]
[stop_idbh]: [OFF], duration since last operation: 203 [ms]
[stop_qpc_miss_machine_3]: [OFF], duration since last operation: 202 [ms]
[lock_cegw]: [OFF], duration since last operation: 202 [ms]
[lock_obgw_tpt]: [ON] , duration since last operation: 0 [ms]
[lock_obgw_tcu]: [OFF], duration since last operation: 203 [ms]
[lock_semaphore_ipc_rx0]: [ON] , duration since last operation: 0 [ms]
[lock_semaphore_ipc_rx1]: [ON] , duration since last operation: 0 [ms]
[lock_semaphore_ipc_ldb]: [ON] , duration since last operation: 0 [ms]

RANDOM ITERATION: [3]
[stop_ce_instage_eqe]: [OFF], duration since last operation: 406 [ms]
[stop_sxp_vl_arb_port2]: [OFF], duration since last operation: 406 [ms]
[stop_edbh]: [ON] , duration since last operation: 203 [ms]
[stop_idbh]: [ON] , duration since last operation: 203 [ms]
[stop_qpc_miss_machine_0]: [OFF], duration since last operation: 406 [ms]
[stop_qpc_miss_machine_2]: [ON] , duration since last operation: 0 [ms]
[lock_obgw_tpt]: [OFF], duration since last operation: 203 [ms]
[lock_obgw_sxd]: [ON] , duration since last operation: 0 [ms]
[lock_semaphore_ipc_sx1]: [OFF], duration since last operation: 405 [ms]
[lock_semaphore_ipc_ldb]: [OFF], duration since last operation: 203 [ms]

RANDOM ITERATION: [4]
[stop_sxp_vl_arb_port2]: [ON] , duration since last operation: 203 [ms]
[stop_edbh]: [OFF], duration since last operation: 202 [ms]
[stop_idbh]: [OFF], duration since last operation: 202 [ms]
[stop_qpc_miss_machine_1]: [ON] , duration since last operation: 0 [ms]
[stop_qpc_miss_machine_3]: [ON] , duration since last operation: 406 [ms]
[lock_obgw_tpt]: [ON] , duration since last operation: 202 [ms]
[lock_obgw_tcu]: [ON] , duration since last operation: 406 [ms]
[lock_obgw_sxd]: [OFF], duration since last operation: 203 [ms]
[lock_semaphore_ipc_sx1]: [ON] , duration since last operation: 203 [ms]
[lock_semaphore_ipc_rx1]: [OFF], duration since last operation: 406 [ms]
[invalidate_caches]: [ON] , duration since last operation: 0 [ms]

Turning OFF stress type: stop_sxp_vl_arb_port1
Turning OFF stress type: stop_sxp_vl_arb_port2
Turning OFF stress type: stop_qpc_miss_machine_1
Turning OFF stress type: stop_qpc_miss_machine_2
Turning OFF stress type: stop_qpc_miss_machine_3
Turning OFF stress type: lock_obgw_tpt
Turning OFF stress type: lock_obgw_tcu
Turning OFF stress type: lock_qpcgw_rx
Turning OFF stress type: lock_semaphore_ipc_sx1
Turning OFF stress type: lock_semaphore_ipc_rx0
Turning OFF stress type: invalidate_caches

RANDOM ITERATION: [5]
[stop_sxp_vl_arb_port2]: [ON] , duration since last operation: 202 [ms]
[stop_idbh]: [ON] , duration since last operation: 322 [ms]
[lock_obgw_tpt]: [ON] , duration since last operation: 202 [ms]
[lock_obgw_tcu]: [ON] , duration since last operation: 202 [ms]
[lock_qpcgw_rx]: [ON] , duration since last operation: 202 [ms]
[invalidate_caches]: [ON] , duration since last operation: 202 [ms]
=======================================================
Turning off all stress types after random:

Turning OFF stress type: stop_sxp_vl_arb_port2
Turning OFF stress type: stop_idbh
Turning OFF stress type: lock_obgw_tpt
Turning OFF stress type: lock_obgw_tcu
Turning OFF stress type: lock_qpcgw_rx
Turning OFF stress type: invalidate_caches


ConnectX-3/ConnectX-3 Pro Adapter Cards Hang Types

The following are the hang types available for ConnectX-3/ConnectX-3 Pro adapter cards:

Category: Hang FW/HW

  • HANG_SX1

  • HANG_RX1

  • HANG_TX

  • HANG_RX

  • ALL - Hang types that require extra flags are not supported when running with the 'ALL' option.


© Copyright 2023, NVIDIA. Last updated on Jun 10, 2024.