NVIDIA Firmware Tools (MFT) Documentation v4.28

mlxfwstress Utility

Note: New devices are supported only after the tool is upgraded to its latest version.

mlxfwstress enables/disables various firmware stress flows. It can work in multiple modes:

  • Enable/disable a specific set of stress types

  • Clear all stress types

  • Random mode:

    • Single mode - choose one stress type in each iteration and enable/disable it

    • Wild mode - choose multiple stress types in each iteration and enable/disable them

Each time a stress type is chosen in a random iteration, the opposite operation is done on it (e.g., if a stress type is turned on, in the next iteration it will be turned off and vice versa).

  • Toggle mode:

    • Alternately turns the listed stress types on and off. Can be used with iterations.

      Note: To disable a stressor while in toggle mode, first stop the mlxfwstress tool, and only then disable the stressor.

  • Clear semaphore:

    Note: This functionality is supported in ConnectX-3 Pro adapter cards only.


# mlxfwstress [-d|--dev <DeviceName>] [-h|--help] [-v|--version] [-o|--operation <Operation>] [--rand-mode <Random mode>] [-t|--stress-type <Stress type>] [--iterations <Iterations>] [--stress-delay <Stress delay>] [--max-rand-on <Max rand on>] [--hang-type <Hang type>] [--seed <seed>] [--toggle-time <x,y>]

where:

-d|--dev <DeviceName>

Perform operation for a specified device

-h|--help

Show this message and exit

-v|--version

Show the executable version and exit

-o|--operation <Operation>

Choose operation: on, off, clear_all, random, query, clear_semaphore

--rand-mode <Random mode>

Choose a random mode: single, wild

-t|--stress-type <Stress type>

Specify a list of stress types separated by commas. (See Stress Types.)

--iterations <Iterations>

Specify the number of iterations.

--stress-delay <Stress delay>

Specify the stress delay in seconds (can be float).

Note: Some stress flows may take more time.

Recommended values: 0-1

--max-rand-on <Max rand on>

Specify the maximum time, in seconds, that a stress is allowed to be on in random mode.

Recommended values: (0,1]

Default is 1

--hang-type <Hang type>

Specify a list of hang types separated by commas. (See Hang Types.)

--seed <seed>

Specify the seed for the random number generator.

--toggle-time <x,y>

Toggle time after off, both in seconds (can be float). If y is not supplied, the tool uses the same value for both x and y.
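The options above can be combined programmatically. The following is a minimal Python sketch that assembles an argument list using only flags listed in this section; the helper name `build_mlxfwstress_args` is hypothetical, and the function only builds the command, it does not run the tool or validate values against a device.

```python
from typing import List, Optional, Sequence


def build_mlxfwstress_args(
    device: str,
    operation: str,
    stress_types: Sequence[str] = (),
    hang_types: Sequence[str] = (),
    rand_mode: Optional[str] = None,
    stress_delay: Optional[float] = None,
    iterations: Optional[int] = None,
) -> List[str]:
    """Assemble an mlxfwstress argument list (hypothetical helper).

    Only flags documented in the synopsis are emitted. Stress and hang
    types are joined with commas, as the option descriptions require.
    """
    args = ["mlxfwstress", "-d", device, "-o", operation]
    if rand_mode is not None:
        args += ["--rand-mode", rand_mode]
    if stress_types:
        args += ["-t", ",".join(stress_types)]
    if hang_types:
        args += ["--hang-type", ",".join(hang_types)]
    if stress_delay is not None:
        args += ["--stress-delay", str(stress_delay)]
    if iterations is not None:
        args += ["--iterations", str(iterations)]
    return args


if __name__ == "__main__":
    # Print the command line instead of executing it.
    print(" ".join(build_mlxfwstress_args(
        "mt4103_pciconf0", "on", stress_types=["STOP_CE_INSTAGE_EQE"])))
```

The returned list can be passed to `subprocess.run` on a host where MFT is installed; building a list rather than a single string avoids shell quoting issues.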

ConnectX-4/ConnectX-4 Lx/ConnectX-5 Adapter Cards Stress Types

The following are the stress types available for ConnectX-4/ConnectX-4 Lx/ConnectX-5 adapter cards:

| Category | Stress Type | Description | Notes |
|---|---|---|---|
| Transparent | PAUSE_STORM_GENERATION | Generates pause frames from the device toward the network | |
| Transparent | INVALIDATE_INTERNAL_CACHE_RX_1 | Invalidates STE cache | |
| Transparent | INVALIDATE_INTERNAL_CACHE_RX_2 | Invalidates qp L0 cache (RX) | |
| Transparent | INVALIDATE_INTERNAL_CACHE_RX_3 | Invalidates dct L0 cache (RX) | |
| Transparent | INVALIDATE_INTERNAL_CACHE_RX_4 | Invalidates scatter list cache in RX | |
| Transparent | INVALIDATE_INTERNAL_CACHE_CQ | Invalidates CQC cache | |
| Transparent | INVALIDATE_INTERNAL_CACHE_SX1 | Invalidates SXDC cache | |
| Transparent | INVALIDATE_INTERNAL_CACHE_RX_5 | Invalidates LDB cache | |
| Transparent | INVALIDATE_INTERNAL_CACHE_GENERAL_1 | Invalidates RO caches | |
| Transparent | INVALIDATE_INTERNAL_CACHE_SX2 | Invalidates pkey cache (SX) | |
| Transparent | INVALIDATE_INTERNAL_CACHE_SX3 | Invalidates guid cache (SX) | |
| Transparent | INVALIDATE_INTERNAL_CACHE_QP | Invalidates QPC (main QP cache unit) | |
| Hang FW/HW | PACKET_DROP | Drops N packets on port x | This type requires the following extra flags: num_of_packets - 8 bit (max 15); port_num - 8 bit (should be 1 or 2) |


ConnectX-3 Pro Adapter Cards Stress Types

The following are the stress types available for ConnectX-3 Pro adapter cards:

Note: Stressors in the "Transparent" category that are active for more than 100 msec may cause resiliency.

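| Category | Stress Type | Description |
|---|---|---|
| Transparent | STOP_CE_INSTAGE_EQE | Stops sending EQEs created by the hardware (not the ones created by the firmware). |
| Transparent | STOP_EDBH | Stops the handling of external doorbells. |
| Transparent | STOP_IDBH | Stops the handling of internal doorbells. |
| Transparent | STOP_QPC_MISS_MACHINE_0 | Stops reading a QPC from the ICM on a miss, blocking hardware/firmware that accesses the QPC. |
| Transparent | STOP_QPC_MISS_MACHINE_1 | Same as STOP_QPC_MISS_MACHINE_0. |
| Transparent | STOP_QPC_MISS_MACHINE_2 | Same as STOP_QPC_MISS_MACHINE_0. |
| Transparent | STOP_QPC_MISS_MACHINE_3 | Same as STOP_QPC_MISS_MACHINE_0. |
| Transparent | LOCK_CEGW | Locks the CQE gateway. |
| Transparent | LOCK_OBGW_TPT | Locks the OBGW (access to the host memory gateway). |
| Transparent | LOCK_OBGW_TCU | Locks the OBGW (access to the host memory gateway). |
| Transparent | LOCK_OBGW_SXD | Locks the OBGW (access to the host memory gateway). |
| Transparent | LOCK_QPCGW_RX | Locks QPCGW. |
| Transparent | LOCK_SEMAPHORE_IPC_RX0 | Locks the IPC semaphore. |
| Transparent | LOCK_SEMAPHORE_IPC_RX1 | Locks the IPC semaphore. |
| Transparent | LOCK_SEMAPHORE_IPC_LDB | Locks the IPC semaphore. |
| Transparent | LOCK_SEMAPHORE_IPC_SX1 | Locks the IPC semaphore. |
| Transparent | INVALIDATE_CACHES | Invalidates caches. |
| Performance | STOP_SXP_VL_ARB_PORT1 | Stops transmission of packets to the wire. Causes head-of-line packet drop (HLL) if enabled. |
| Performance | STOP_SXP_VL_ARB_PORT2 | Stops transmission of packets to the wire. Causes head-of-line packet drop (HLL) if enabled. |
| Performance | RX_BACKPRESSURE | Stops the RX pipe (back-pressure to the wire, sending TX pauses). |
| Performance | DROP_PACKETS_TX | Drops packets on the TX side. |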


Turning On Stress Types

To turn on a specific stress type:

    mlxfwstress -d mt4103_pciconf0 -o on -t STOP_CE_INSTAGE_EQE
    -------------------------------------------------
    Operation: [ON]
    -------------------------------------------------
    Turning ON stress type: stop_ce_instage_eqe -PASSED

To turn on a set of stress types:

    mlxfwstress -d mt4103_pciconf0 -o on -t STOP_CE_INSTAGE_EQE,STOP_QPC_MISS_MACHINE_3,LOCK_SEMAPHORE_IPC_RX1
    -------------------------------------------------
    Operation: [ON]
    -------------------------------------------------
    Turning ON stress type: stop_ce_instage_eqe -PASSED
    Turning ON stress type: stop_qpc_miss_machine_3 -PASSED
    Turning ON stress type: lock_semaphore_ipc_rx1 -PASSED

To turn on all the available stress types:

    mlxfwstress -d mt4119_pciconf0 -t ALL -o on
    Random seed: [1587969653]
    -------------------------------------------------
    Operation: [ON]
    -------------------------------------------------
    Turning ON stress type: INVALIDATE_INTERNAL_CACHE_CQ -PASSED
    Turning ON stress type: INVALIDATE_INTERNAL_CACHE_GENERAL_1 -PASSED
    Turning ON stress type: INVALIDATE_INTERNAL_CACHE_QP -PASSED
    Turning ON stress type: INVALIDATE_INTERNAL_CACHE_RX_1 -PASSED
    Turning ON stress type: INVALIDATE_INTERNAL_CACHE_RX_2 -PASSED
    Turning ON stress type: INVALIDATE_INTERNAL_CACHE_RX_3 -PASSED
    Turning ON stress type: INVALIDATE_INTERNAL_CACHE_RX_4 -PASSED
    Turning ON stress type: INVALIDATE_INTERNAL_CACHE_RX_5 -PASSED
    Turning ON stress type: INVALIDATE_INTERNAL_CACHE_SX1 -PASSED
    Turning ON stress type: INVALIDATE_INTERNAL_CACHE_SX2 -PASSED
    Turning ON stress type: INVALIDATE_INTERNAL_CACHE_SX3 -PASSED


Turning Off Stress Types

To turn off a specific stress type:

    mlxfwstress -d mt4103_pciconf0 -o off -t STOP_CE_INSTAGE_EQE
    -------------------------------------------------
    Operation: [OFF]
    -------------------------------------------------
    Turning OFF stress type: stop_ce_instage_eqe -PASSED

To turn off a set of stress types:

    mlxfwstress -d mt4103_pciconf0 -o off -t STOP_CE_INSTAGE_EQE,STOP_QPC_MISS_MACHINE_3,LOCK_SEMAPHORE_IPC_RX1
    -------------------------------------------------
    Operation: [OFF]
    -------------------------------------------------
    Turning OFF stress type: stop_ce_instage_eqe -PASSED
    Turning OFF stress type: stop_qpc_miss_machine_3 -PASSED
    Turning OFF stress type: lock_semaphore_ipc_rx1 -PASSED


Querying the Stress Types

To query the state of all stress types:

    mlxfwstress -d mt4117_pciconf0 -o query -t ALL
    -------------------------------------------------
    Operation: [QUERY]
    -------------------------------------------------
    Querying stress type: INVALIDATE_INTERNAL_CACHE_CQ -ENABLED
    Querying stress type: INVALIDATE_INTERNAL_CACHE_GENERAL_1 -ENABLED
    Querying stress type: INVALIDATE_INTERNAL_CACHE_QP -ENABLED
    Querying stress type: INVALIDATE_INTERNAL_CACHE_RX_1 -ENABLED
    Querying stress type: INVALIDATE_INTERNAL_CACHE_RX_2 -ENABLED
    Querying stress type: INVALIDATE_INTERNAL_CACHE_RX_3 -ENABLED
    Querying stress type: INVALIDATE_INTERNAL_CACHE_RX_4 -ENABLED
    Querying stress type: INVALIDATE_INTERNAL_CACHE_RX_5 -NOT SUPPORTED
    Querying stress type: INVALIDATE_INTERNAL_CACHE_SX1 -ENABLED
    Querying stress type: INVALIDATE_INTERNAL_CACHE_SX2 -ENABLED
    Querying stress type: INVALIDATE_INTERNAL_CACHE_SX3 -ENABLED
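When the query output needs to be checked by a script, the per-type lines can be parsed into a mapping. The sketch below is a minimal Python example; the function name `parse_query_output` is hypothetical, and it assumes the tool prints each "Querying ... type:" result on its own line in the form shown in the example above.

```python
import re
from typing import Dict

# Matches lines such as:
#   Querying stress type: INVALIDATE_INTERNAL_CACHE_RX_5 -NOT SUPPORTED
#   Querying hang type: Sx1 -ENABLED
_QUERY_LINE = re.compile(r"^\s*Querying (?:stress|hang) type:\s*(\S+)\s+-(.+?)\s*$")


def parse_query_output(text: str) -> Dict[str, str]:
    """Map each queried type name to its reported state (hypothetical helper)."""
    states: Dict[str, str] = {}
    for line in text.splitlines():
        match = _QUERY_LINE.match(line)
        if match:
            states[match.group(1)] = match.group(2)
    return states


if __name__ == "__main__":
    sample = (
        "Operation: [QUERY]\n"
        "Querying stress type: INVALIDATE_INTERNAL_CACHE_CQ -ENABLED\n"
        "Querying stress type: INVALIDATE_INTERNAL_CACHE_RX_5 -NOT SUPPORTED\n"
    )
    for name, state in parse_query_output(sample).items():
        print(name, state)
```

Lines that do not match the pattern, such as the separators and the operation banner, are ignored.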


ConnectX-4/ConnectX-4 Lx/ConnectX-5 Adapter Cards Hang Types

The following are the hang types available for ConnectX-4/ConnectX-4 Lx/ConnectX-5 adapter cards:

| Category | Stress Type | Description | Notes |
|---|---|---|---|
| Hang FW/HW | FFSER | Initialize FaultInjector object | This hang type is supported in BlueField-2 device only. No extra flag is required. |
| Hang FW/HW | STOP_RX_PER_PRIO1 | | This type requires the following extra flags: vl_mask - 16 bit; port_num - 8 bit |

    mlxfwstress -d mt4115_pciconf0 -o on --hang-type STOP_RX_PER_PRIO --extra %STOP_RX_PER_PRIO[0x00100FF]
    Random seed: [1588056318]
    -------------------------------------------------
    Operation: [ON]
    -------------------------------------------------
    Turning ON hang type: STOP_RX_PER_PRIO -PASSED

To turn on this hang type, the command must be executed in the following format:

Example:

    mlxfwstress -d mt4115_pciconf0 -o on --hang-type STOP_RX_PER_PRIO --extra % STOP_RX_PER_PRIO [0x000100FF]
    output:
    Random seed: [1573642282]
    -------------------------------------------------
    Operation: [ON]
    -------------------------------------------------
    Turning ON hang type: STOP_RX_PER_PRIO-PASSED
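The section gives the widths of the two extra flags (vl_mask, 16 bit; port_num, 8 bit) but not how they are packed into the bracketed value. The Python sketch below shows one possible packing; the layout (port_num in bits 16-23, vl_mask in bits 0-15) is an assumption chosen only because it reproduces the example value 0x000100FF for port 1 with mask 0x00FF, and the helper name is hypothetical. Verify the layout against the tool before relying on it.

```python
def pack_stop_rx_per_prio_extra(vl_mask: int, port_num: int) -> str:
    """Pack vl_mask and port_num into one hex word (assumed layout).

    ASSUMPTION: port_num occupies bits 16-23 and vl_mask bits 0-15.
    The source documents only the field widths, not their positions.
    """
    if not 0 <= vl_mask <= 0xFFFF:
        raise ValueError("vl_mask must fit in 16 bits")
    if not 0 <= port_num <= 0xFF:
        raise ValueError("port_num must fit in 8 bits")
    value = (port_num << 16) | vl_mask
    return f"0x{value:08X}"


if __name__ == "__main__":
    # Under the assumed layout, port 1 with mask 0x00FF gives the example value.
    print(pack_stop_rx_per_prio_extra(0x00FF, 1))
```

Range checks are included so that a value wider than the documented field size fails early instead of silently overlapping the other field.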


Turning On Hang Types

To turn on a specific hang type:

    mlxfwstress -d mt4103_pciconf0 -o on --hang-type HANG_SX1
    -------------------------------------------------
    Operation: [ON]
    -------------------------------------------------
    Turning ON hang type: Sx1 -PASSED

To turn on a set of hang types:

    mlxfwstress -d mt4103_pciconf0 -o on --hang-type HANG_SX1,HANG_RX1
    -------------------------------------------------
    Operation: [ON]
    -------------------------------------------------
    Turning ON hang type: Sx1 -PASSED
    Turning ON hang type: Rx1 -PASSED


Turning Off Hang Types

To turn off a specific hang type:

    mlxfwstress -d mt4103_pciconf0 -o off --hang-type HANG_SX1
    -------------------------------------------------
    Operation: [OFF]
    -------------------------------------------------
    Turning OFF hang type: Sx1 -PASSED

To turn off a set of hang types:

    mlxfwstress -d mt4103_pciconf0 -o off --hang-type HANG_SX1,HANG_RX1
    -------------------------------------------------
    Operation: [OFF]
    -------------------------------------------------
    Turning OFF hang type: Sx1 -PASSED
    Turning OFF hang type: Rx1 -PASSED


Querying the Hang Types

To query the state of all hang types:

    mlxfwstress -d mt4103_pciconf0 -o query --hang-type ALL
    -------------------------------------------------
    Operation: [QUERY]
    -------------------------------------------------
    Querying hang type: Sx1 -ENABLED
    Querying hang type: Rx1 -ENABLED
    Querying hang type: Tx -ENABLED
    Querying hang type: Rx -ENABLED


To clear all stress/hang types:

    mlxfwstress -d mt4103_pciconf0 -o clear_all
    -------------------------------------------------
    Operation: [CLEAR_ALL]
    -------------------------------------------------
    Turning OFF hang type: Sx1 -PASSED
    Turning OFF hang type: Rx1 -PASSED
    Turning OFF hang type: Tx -PASSED
    Turning OFF hang type: Rx -PASSED
    Turning OFF stress type: stop_ce_instage_eqe -PASSED
    Turning OFF stress type: stop_sxp_vl_arb_port1 -PASSED
    Turning OFF stress type: stop_sxp_vl_arb_port2 -PASSED
    Turning OFF stress type: stop_edbh -PASSED
    Turning OFF stress type: stop_idbh -PASSED
    Turning OFF stress type: stop_qpc_miss_machine_0 -PASSED
    Turning OFF stress type: stop_qpc_miss_machine_1 -PASSED
    Turning OFF stress type: stop_qpc_miss_machine_2 -PASSED
    Turning OFF stress type: stop_qpc_miss_machine_3 -PASSED
    Turning OFF stress type: lock_cegw -PASSED
    Turning OFF stress type: lock_obgw_tpt -PASSED
    Turning OFF stress type: lock_obgw_tcu -PASSED
    Turning OFF stress type: lock_obgw_sxd -PASSED
    Turning OFF stress type: lock_qpcgw_rx -PASSED
    Turning OFF stress type: lock_semaphore_ipc_sx1 -PASSED
    Turning OFF stress type: lock_semaphore_ipc_rx0 -PASSED
    Turning OFF stress type: lock_semaphore_ipc_rx1 -PASSED
    Turning OFF stress type: lock_semaphore_ipc_ldb -PASSED
    Turning OFF stress type: invalidate_caches -PASSED

To clear the semaphore:

    mlxfwstress -d mt4103_pciconf0 -o clear_semaphore
    -------------------------------------------------
    Operation: [CLEAR_SEMAPHORE]
    -------------------------------------------------
    Semaphore was cleared successfully

There are two random modes you can choose from:

  • Single - given a set of stress types, in each iteration one stress type is chosen and toggled ON/OFF according to its current state

  • Wild - given a set of stress types, in each iteration a random subset of stress types is chosen and toggled ON/OFF according to their current state
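The difference between the two modes can be illustrated with a small simulation. The following Python sketch is a model of the behavior described above, not the tool's actual selection algorithm: the function name `simulate_random_mode` is hypothetical, all types are assumed to start OFF, and the way the wild-mode subset size is drawn is an assumption.

```python
import random
from typing import Dict, List, Sequence


def simulate_random_mode(
    stress_types: Sequence[str], mode: str, iterations: int, seed: int
) -> List[Dict[str, bool]]:
    """Return, per iteration, the types toggled and their new state (True = ON)."""
    if not stress_types:
        raise ValueError("at least one stress type is required")
    rng = random.Random(seed)
    names = list(stress_types)
    state = {name: False for name in names}  # assumption: everything starts OFF
    history: List[Dict[str, bool]] = []
    for _ in range(iterations):
        if mode == "single":
            chosen = [rng.choice(names)]
        elif mode == "wild":
            # Assumption: subset size is uniform between 1 and len(names).
            chosen = rng.sample(names, rng.randint(1, len(names)))
        else:
            raise ValueError("mode must be 'single' or 'wild'")
        for name in chosen:
            state[name] = not state[name]  # opposite of the current state
        history.append({name: state[name] for name in chosen})
    return history


if __name__ == "__main__":
    for step in simulate_random_mode(["stop_edbh", "stop_idbh"], "wild", 3, seed=1):
        print(step)
```

With a single stress type in single mode, the model alternates ON and OFF on every iteration, which matches the pattern in the single-mode example output.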

Setting the Random Mode for the Stress Types

To set the Single Mode:

    mlxfwstress -d mt4103_pciconf0 -o random --rand-mode single -t STOP_CE_INSTAGE_EQE --stress-delay 0.2 --iterations 10
    -------------------------------------------------
    Operation: [RANDOM]
    -------------------------------------------------
    #############################################
    Random:
    Iterations delay: 0.2 [sec]
    Iterations number: 10
    Max on time: 1 [sec]
    #############################################
    RANDOM ITERATION: [1]
    [stop_ce_instage_eqe]: [ON] , duration since last operation: 0 [ms]
    RANDOM ITERATION: [2]
    [stop_ce_instage_eqe]: [OFF], duration since last operation: 200 [ms]
    RANDOM ITERATION: [3]
    [stop_ce_instage_eqe]: [ON] , duration since last operation: 201 [ms]
    RANDOM ITERATION: [4]
    [stop_ce_instage_eqe]: [OFF], duration since last operation: 200 [ms]
    RANDOM ITERATION: [5]
    [stop_ce_instage_eqe]: [ON] , duration since last operation: 200 [ms]
    RANDOM ITERATION: [6]
    [stop_ce_instage_eqe]: [OFF], duration since last operation: 201 [ms]
    RANDOM ITERATION: [7]
    [stop_ce_instage_eqe]: [ON] , duration since last operation: 200 [ms]
    RANDOM ITERATION: [8]
    [stop_ce_instage_eqe]: [OFF], duration since last operation: 201 [ms]
    RANDOM ITERATION: [9]
    [stop_ce_instage_eqe]: [ON] , duration since last operation: 200 [ms]
    Turning OFF stress type: stop_ce_instage_eqe
    RANDOM ITERATION: [10]
    [stop_ce_instage_eqe]: [ON] , duration since last operation: 200 [ms]
    =======================================================
    Turning off all stress types after random:
    Turning OFF stress type: stop_ce_instage_eqe

  • As seen in the example above, after the specified number of iterations, the tool turns off all the stress types.

  • The default value for stress-delay is 1 second.

  • If no number of iterations is supplied, the user is expected to stop the tool with Ctrl+C. The tool then turns off all the stress types.

To set the Wild Mode:

    mlxfwstress -d mt4103_pciconf0 -o random --rand-mode wild -t ALL --stress-delay 0.2 --max-rand-on 1 --iterations 5
    -------------------------------------------------
    Operation: [RANDOM]
    -------------------------------------------------
    #############################################
    Random:
    Iterations delay: 0.2 [sec]
    Iterations number: 5
    Max on time: 1 [sec]
    #############################################

    RANDOM ITERATION: [1]
    [stop_ce_instage_eqe]: [ON] , duration since last operation: 0 [ms]
    [stop_sxp_vl_arb_port2]: [ON] , duration since last operation: 0 [ms]
    [stop_edbh]: [ON] , duration since last operation: 0 [ms]
    [stop_idbh]: [ON] , duration since last operation: 0 [ms]
    [stop_qpc_miss_machine_0]: [ON] , duration since last operation: 0 [ms]
    [stop_qpc_miss_machine_3]: [ON] , duration since last operation: 0 [ms]
    [lock_cegw]: [ON] , duration since last operation: 0 [ms]
    [lock_obgw_tcu]: [ON] , duration since last operation: 0 [ms]
    [lock_qpcgw_rx]: [ON] , duration since last operation: 0 [ms]
    [lock_semaphore_ipc_sx1]: [ON] , duration since last operation: 0 [ms]

    RANDOM ITERATION: [2]
    [stop_sxp_vl_arb_port1]: [ON] , duration since last operation: 0 [ms]
    [stop_edbh]: [OFF], duration since last operation: 203 [ms]
    [stop_idbh]: [OFF], duration since last operation: 203 [ms]
    [stop_qpc_miss_machine_3]: [OFF], duration since last operation: 202 [ms]
    [lock_cegw]: [OFF], duration since last operation: 202 [ms]
    [lock_obgw_tpt]: [ON] , duration since last operation: 0 [ms]
    [lock_obgw_tcu]: [OFF], duration since last operation: 203 [ms]
    [lock_semaphore_ipc_rx0]: [ON] , duration since last operation: 0 [ms]
    [lock_semaphore_ipc_rx1]: [ON] , duration since last operation: 0 [ms]
    [lock_semaphore_ipc_ldb]: [ON] , duration since last operation: 0 [ms]

    RANDOM ITERATION: [3]
    [stop_ce_instage_eqe]: [OFF], duration since last operation: 406 [ms]
    [stop_sxp_vl_arb_port2]: [OFF], duration since last operation: 406 [ms]
    [stop_edbh]: [ON] , duration since last operation: 203 [ms]
    [stop_idbh]: [ON] , duration since last operation: 203 [ms]
    [stop_qpc_miss_machine_0]: [OFF], duration since last operation: 406 [ms]
    [stop_qpc_miss_machine_2]: [ON] , duration since last operation: 0 [ms]
    [lock_obgw_tpt]: [OFF], duration since last operation: 203 [ms]
    [lock_obgw_sxd]: [ON] , duration since last operation: 0 [ms]
    [lock_semaphore_ipc_sx1]: [OFF], duration since last operation: 405 [ms]
    [lock_semaphore_ipc_ldb]: [OFF], duration since last operation: 203 [ms]

    RANDOM ITERATION: [4]
    [stop_sxp_vl_arb_port2]: [ON] , duration since last operation: 203 [ms]
    [stop_edbh]: [OFF], duration since last operation: 202 [ms]
    [stop_idbh]: [OFF], duration since last operation: 202 [ms]
    [stop_qpc_miss_machine_1]: [ON] , duration since last operation: 0 [ms]
    [stop_qpc_miss_machine_3]: [ON] , duration since last operation: 406 [ms]
    [lock_obgw_tpt]: [ON] , duration since last operation: 202 [ms]
    [lock_obgw_tcu]: [ON] , duration since last operation: 406 [ms]
    [lock_obgw_sxd]: [OFF], duration since last operation: 203 [ms]
    [lock_semaphore_ipc_sx1]: [ON] , duration since last operation: 203 [ms]
    [lock_semaphore_ipc_rx1]: [OFF], duration since last operation: 406 [ms]
    [invalidate_caches]: [ON] , duration since last operation: 0 [ms]

    Turning OFF stress type: stop_sxp_vl_arb_port1
    Turning OFF stress type: stop_sxp_vl_arb_port2
    Turning OFF stress type: stop_qpc_miss_machine_1
    Turning OFF stress type: stop_qpc_miss_machine_2
    Turning OFF stress type: stop_qpc_miss_machine_3
    Turning OFF stress type: lock_obgw_tpt
    Turning OFF stress type: lock_obgw_tcu
    Turning OFF stress type: lock_qpcgw_rx
    Turning OFF stress type: lock_semaphore_ipc_sx1
    Turning OFF stress type: lock_semaphore_ipc_rx0
    Turning OFF stress type: invalidate_caches

    RANDOM ITERATION: [5]
    [stop_sxp_vl_arb_port2]: [ON] , duration since last operation: 202 [ms]
    [stop_idbh]: [ON] , duration since last operation: 322 [ms]
    [lock_obgw_tpt]: [ON] , duration since last operation: 202 [ms]
    [lock_obgw_tcu]: [ON] , duration since last operation: 202 [ms]
    [lock_qpcgw_rx]: [ON] , duration since last operation: 202 [ms]
    [invalidate_caches]: [ON] , duration since last operation: 202 [ms]
    =======================================================
    Turning off all stress types after random:

    Turning OFF stress type: stop_sxp_vl_arb_port2
    Turning OFF stress type: stop_idbh
    Turning OFF stress type: lock_obgw_tpt
    Turning OFF stress type: lock_obgw_tcu
    Turning OFF stress type: lock_qpcgw_rx
    Turning OFF stress type: invalidate_caches


ConnectX-3/ConnectX-3 Pro Adapter Cards Hang Types

The following are the hang types available for ConnectX-3/ConnectX-3 Pro adapter cards:

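| Category | Stress Type | Description | Notes |
|---|---|---|---|
| Hang FW/HW | HANG_SX1 | | |
| Hang FW/HW | HANG_RX1 | | |
| Hang FW/HW | HANG_TX | | |
| Hang FW/HW | HANG_RX | | |
| Hang FW/HW | ALL | | Hang types that require extra flags are not supported when running with the 'ALL' option. |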


© Copyright 2024, NVIDIA. Last updated on May 15, 2024.