ibdiagnet InfiniBand Fabric Diagnostic Tool User Manual v2.9.0

Bit Error Rate (BER)

The Bit Error Rate (BER) is the number of bit errors divided by the total number of bits transferred during a studied time interval. BER is a unitless performance measure, often expressed as a percentage or, as in this document, in scientific notation (for example, 10^-12).
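As a worked illustration of this ratio, the following minimal Python sketch computes a BER from an error count and a bit count. The function name and the sample numbers are hypothetical and chosen only for illustration; this is not how ibdiagnet itself derives its values.

```python
def bit_error_rate(bit_errors: int, total_bits: int) -> float:
    """Return bit errors divided by total transferred bits for one interval."""
    if total_bits <= 0:
        raise ValueError("total_bits must be positive")
    return bit_errors / total_bits


# Hypothetical interval: 3 bit errors observed while 10**13 bits were transferred.
example_ber = bit_error_rate(3, 10**13)
print(f"{example_ber:.6e}")
```

A result below a threshold such as 10^-12 would pass a check against that threshold; a result above it would not.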

Parameter: --get_phy_info
Description: Collects BER information for fabric ports and validates the BER against specific thresholds. Errors are reported in the ibdiagnet2.log and ibdiagnet2.db_csv files.
Notes: Applicable to all EDR/HDR and future InfiniBand devices.

Parameter: --ber_test
Description: Deprecated. Runs a BER test for each port: calculates the BER for each port and checks that no BER value exceeds the BER threshold (default threshold: 10^-12).
Notes: This option is available only when using SwitchX/ConnectX-4 and ConnectX-3 devices.

Parameter: --ber_thresh <value>
Description: Deprecated. Specifies the threshold value for the BER test. The value provided must be the reciprocal of the BER. For example, a threshold of 10^-12 is specified as 1000000000000 or 0xe8d4a51000 (10^12). If the given threshold is 0, all BER values for all ports are reported.
Notes: This option is available only when using SwitchX/ConnectX-4 and ConnectX-3 devices.

Parameter: --llr_active_cell <64|128>
Description: Deprecated. Specifies the Link Level Retransmission (LLR) active cell size for the BER test when LLR is active in the fabric.
Notes: This option is available only when using SwitchX/ConnectX-4 and ConnectX-3 devices.
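Because --ber_thresh expects the reciprocal of the BER rather than the BER itself, the conversion is easy to get wrong. The sketch below is a hypothetical helper (not part of ibdiagnet) that takes the exponent n of a threshold of the form 10^-n and returns the reciprocal in both decimal and hexadecimal form.

```python
from __future__ import annotations


def ber_thresh_argument(exponent: int) -> tuple[int, str]:
    """For a BER threshold of 10^-exponent, return its reciprocal as (decimal, hex)."""
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    # Integer arithmetic avoids the rounding issues of computing 1 / 1e-12 in floats.
    reciprocal = 10 ** exponent
    return reciprocal, hex(reciprocal)


decimal_value, hex_value = ber_thresh_argument(12)
print(decimal_value, hex_value)  # 1000000000000 0xe8d4a51000
```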

Example:

ibdiagnet --get_phy_info

For HDR/EDR links, symbol errors (HDR) or effective errors (EDR) are the actual errors seen by the application level after error correction.

The following procedure is recommended as a first step if fabric performance is degraded:

  1. Make sure that significant traffic is running in the fabric.

  2. Reset the counters: ibdiagnet --pc --reset_phy_info -i <mlx_dev>

  3. Wait for 5-10 minutes.

  4. Collect the BER information: ibdiagnet --get_phy_info -i <mlx_dev>

  5. Review ibdiagnet2.log.

  6. Contact Support if the Symbol/Effective BER Check finished with errors.
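The reset, wait, and collect steps above could be scripted along the following lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, the default wait of 600 seconds is simply the upper end of the 5-10 minute range, and the device name in the usage note is a placeholder. Only the ibdiagnet flags shown in the steps are used, and the log still has to be reviewed afterwards.

```python
from __future__ import annotations

import subprocess
import time


def reset_command(device: str) -> list[str]:
    """Command line for step 2 (reset), using the flags from the procedure."""
    return ["ibdiagnet", "--pc", "--reset_phy_info", "-i", device]


def collect_command(device: str) -> list[str]:
    """Command line for step 4 (collect BER information)."""
    return ["ibdiagnet", "--get_phy_info", "-i", device]


def run_ber_check(device: str, wait_seconds: int = 600) -> int:
    """Run steps 2-4 in order and return the exit status of the collect step."""
    subprocess.run(reset_command(device), check=True)
    # Traffic must already be running in the fabric during this interval (step 1).
    time.sleep(wait_seconds)
    result = subprocess.run(collect_command(device))
    return result.returncode
```

For example, `run_ber_check("mlx5_0")` would run the sequence for a device named mlx5_0 (a placeholder; substitute the actual <mlx_dev> value). No meaning is assumed here for the returned exit status; reviewing ibdiagnet2.log remains the step that determines whether the BER check passed.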

For a detailed description of the command-line parameters, see the previous chapter, “Bit Error Rate”.

BER check log file fragment:

-E- Symbol BER Check finished with errors
-E- H-10/U1/P1 - BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12
-E- H-14/U1/P1 - BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-LL-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12
-E- H-3/U1/P1 - BER exceeds threshold - BER type: Symbol BER, FEC mode: MLNX_RS_544_514_PLR, BER value = 1.500000e+01 / threshold = 5.000000e-12
-E- H-7/U1/P1 - BER exceeds threshold - BER type: Symbol BER, FEC mode: MLNX_RS_271_257_PLR, BER value = 1.500000e+01 / threshold = 5.000000e-12
-E- SW-1-0/U1/P4 - BER exceeds threshold - BER type: Symbol BER, FEC mode: RS_FEC_544_514, BER value = 1.500000e+01 / threshold = 5.000000e-12
-E- SW-1-0/U1/P5 - BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-LL-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12

---------------------------------------------
Fabric Summary

Total Nodes : 24
IB Switches : 8
IB Channel Adapters : 16
IB Aggregation Nodes : 0
IB Routers : 0

Total number of links : 32
Links at 4x10 : 32

High BER reported by 6 ports
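When reviewing a large log, it can help to pull the per-port error lines into structured records. The following sketch parses lines of the form shown in the fragment above with a regular expression. The pattern is inferred from this fragment only, so treat the exact line layout as an assumption; the function name and field names are hypothetical.

```python
from __future__ import annotations

import re

# Pattern inferred from the log fragment; the real log layout may differ.
BER_ERROR_LINE = re.compile(
    r"^-E- (?P<port>\S+) - BER exceeds threshold - "
    r"BER type: (?P<ber_type>[^,]+), "
    r"FEC mode: (?P<fec_mode>[^,]+), "
    r"BER value = (?P<value>\S+) / threshold = (?P<threshold>\S+)\s*$"
)


def parse_ber_errors(log_text: str) -> list[dict]:
    """Return one record per 'BER exceeds threshold' line found in log_text."""
    records = []
    for line in log_text.splitlines():
        match = BER_ERROR_LINE.match(line.strip())
        if match is None:
            continue  # Skip headers, summaries, and unrelated lines.
        record = match.groupdict()
        record["value"] = float(record["value"])
        record["threshold"] = float(record["threshold"])
        records.append(record)
    return records


# Sample lines copied from the fragment above.
sample_log = """\
-E- Symbol BER Check finished with errors
-E- H-10/U1/P1 - BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12
-E- SW-1-0/U1/P4 - BER exceeds threshold - BER type: Symbol BER, FEC mode: RS_FEC_544_514, BER value = 1.500000e+01 / threshold = 5.000000e-12
"""
errors = parse_ber_errors(sample_log)
for entry in errors:
    print(entry["port"], entry["fec_mode"], entry["value"], entry["threshold"])
```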

BER check error section in db_csv file:

START_ERRORS_SYMBOL_BER_CHECK
Scope,NodeGUID,PortGUID,PortNumber,EventName,Summary
PORT,0x0002c90000000005,0x0002c90000000006,1,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
PORT,0x0002c90000000015,0x0002c90000000016,1,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-LL-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
PORT,0x0002c90000000025,0x0002c90000000026,1,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: MLNX_RS_544_514_PLR, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
PORT,0x0002c90000000035,0x0002c90000000036,1,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: MLNX_RS_271_257_PLR, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
PORT,0x0002c90000000049,0x0002c90000000049,4,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: RS_FEC_544_514, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
PORT,0x0002c90000000049,0x0002c90000000049,5,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-LL-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
END_ERRORS_SYMBOL_BER_CHECK
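The error section of the db_csv file can be read with a standard CSV parser once the START_/END_ marker lines are located. The sketch below assumes the layout shown in the fragment above: one marker per line, a header row, then one record per line. The function name is hypothetical, and the sample data is copied from the fragment.

```python
from __future__ import annotations

import csv
import io


def read_error_section(text: str, section: str = "ERRORS_SYMBOL_BER_CHECK") -> list[dict]:
    """Return the rows between START_<section> and END_<section> as dictionaries."""
    lines = [line.strip() for line in text.splitlines()]
    start = lines.index(f"START_{section}")    # Raises ValueError if absent.
    end = lines.index(f"END_{section}", start)
    body = "\n".join(lines[start + 1:end])
    # The csv module handles the quoted Summary field, which contains commas.
    return list(csv.DictReader(io.StringIO(body)))


sample_csv = '''\
START_ERRORS_SYMBOL_BER_CHECK
Scope,NodeGUID,PortGUID,PortNumber,EventName,Summary
PORT,0x0002c90000000005,0x0002c90000000006,1,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
PORT,0x0002c90000000049,0x0002c90000000049,4,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: RS_FEC_544_514, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
END_ERRORS_SYMBOL_BER_CHECK
'''
rows = read_error_section(sample_csv)
for row in rows:
    print(row["NodeGUID"], row["PortNumber"], row["EventName"])
```

Each row is keyed by the header names in the section (Scope, NodeGUID, PortGUID, PortNumber, EventName, Summary), so the records can be filtered or grouped by node GUID or port number.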

© Copyright 2023, NVIDIA. Last updated on May 23, 2023.