Bit Error Rate (BER)
The Bit Error Rate (BER) is the number of bit errors divided by the total number of bits transferred during a studied time interval. BER is a unitless performance measure. It is often expressed as a percentage, and ibdiagnet reports it in scientific notation (for example, 5.000000e-12).
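As a minimal illustration of the definition above, the following Python sketch computes a BER from an error count and a transferred-bit count and formats it as a percentage. The function names `bit_error_rate` and `as_percentage` are hypothetical and are not part of ibdiagnet.

```python
def bit_error_rate(bit_errors: int, total_bits: int) -> float:
    """BER = bit errors / total transferred bits in the interval (unitless)."""
    if total_bits <= 0:
        raise ValueError("total_bits must be positive")
    if bit_errors < 0:
        raise ValueError("bit_errors must be non-negative")
    return bit_errors / total_bits


def as_percentage(ber: float) -> str:
    """Format a BER as a percentage (BER x 100)."""
    return f"{ber * 100:.2e} %"


if __name__ == "__main__":
    # Hypothetical counts: 3 bit errors over 10^12 transferred bits.
    ber = bit_error_rate(3, 10**12)
    print(f"BER = {ber:.1e}")              # BER = 3.0e-12
    print(f"BER = {as_percentage(ber)}")   # BER = 3.00e-10 %
```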
| Parameter | Description | Notes |
| --- | --- | --- |
| --get_phy_info | Collects BER information for fabric ports and validates the BER values against specific thresholds. Errors are reported to the ibdiagnet2.log and ibdiagnet2.db_csv files. | Applicable to all EDR/HDR and future InfiniBand devices. |
| --ber_test | Deprecated. Runs a BER test on each port: calculates the BER of each port and checks that no BER value exceeds the BER threshold (default threshold: 10^-12). | Available only with SwitchX/ConnectX-4 and ConnectX-3 devices. |
| --ber_thresh <value> | Deprecated. Specifies the threshold value for the BER test, given as the reciprocal of the BER. For example, a threshold of 10^-12 is specified as 1000000000000 or 0xe8d4a51000 (10^12). If the threshold is 0, the BER values of all ports are reported. | Available only with SwitchX/ConnectX-4 and ConnectX-3 devices. |
| --llr_active_cell <64\|128> | Deprecated. Specifies the Link Level Retransmission (LLR) active cell size for the BER test when LLR is active in the fabric. | Available only with SwitchX/ConnectX-4 and ConnectX-3 devices. |
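Because --ber_thresh takes the reciprocal of the BER rather than the BER itself, the conversion is easy to get wrong. The Python sketch below performs it and reproduces the decimal and hexadecimal values given for 10^-12. The helper name `ber_thresh_value` is hypothetical; exact rational arithmetic (`fractions.Fraction`) is used so that 10^-12 maps to exactly 10^12 without floating-point rounding.

```python
from fractions import Fraction


def ber_thresh_value(ber_threshold: str) -> int:
    """Convert a BER threshold (e.g. "1e-12") to the reciprocal integer
    that --ber_thresh expects (e.g. 10^12).

    Fraction parses the decimal string exactly, so "1e-12" becomes exactly
    1/10^12 and its reciprocal is exactly 10^12.
    """
    ber = Fraction(ber_threshold)
    if ber <= 0:
        raise ValueError("BER threshold must be positive")
    # round() on a Fraction returns an int; rounding only matters when the
    # reciprocal is not a whole number (for example, "3e-12").
    return round(1 / ber)


if __name__ == "__main__":
    value = ber_thresh_value("1e-12")
    print(value, hex(value))  # 1000000000000 0xe8d4a51000
```

The same helper gives 200000000000 for the 5e-12 threshold that appears in the sample log output.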
Example:
ibdiagnet --get_phy_info
For NDR/HDR/EDR links, symbol errors (NDR/HDR) or effective errors (EDR) are the actual errors seen at the application level after error correction.
The following procedure is recommended as a first step if fabric performance is degraded:
1. Make sure significant traffic is running in the fabric.
2. Reset the counters:
   ibdiagnet --pc --reset_phy_info -i <mlx_dev>
3. Wait for some time (5-10 minutes) while the traffic runs.
4. Collect the BER information:
   ibdiagnet --get_phy_info -i <mlx_dev>
5. Review ibdiagnet2.log.
6. Contact Support if the Symbol/Effective BER check finished with errors.
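One way to see why step 3 asks for several minutes of traffic: a BER computed from counters cannot resolve values much smaller than 1 / (bits transferred in the window), because a single error already produces that value. The Python sketch below estimates this floor. It assumes a nominal rate of 40 Gb/s for the 4x10 links in the sample fabric summary and a hypothetical 30% average utilization; encoding and FEC overhead are ignored, so the numbers are rough estimates only. The function name `ber_floor` is hypothetical.

```python
def ber_floor(link_rate_gbps: float, seconds: float, utilization: float = 1.0) -> float:
    """BER that a single bit error would produce over the measurement window.

    Assumes bits transferred = nominal rate x time x utilization; encoding
    and FEC overhead are ignored, so treat the result as a rough estimate.
    """
    if link_rate_gbps <= 0 or seconds <= 0:
        raise ValueError("rate and duration must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    bits = link_rate_gbps * 1e9 * seconds * utilization
    return 1.0 / bits


if __name__ == "__main__":
    # Assumptions: 4x10 link ~ 40 Gb/s nominal, 30% average utilization.
    for minutes in (1, 5, 10):
        floor = ber_floor(40.0, minutes * 60, utilization=0.3)
        print(f"{minutes:>2} min of traffic -> 1 error = BER {floor:.1e}")
```

Longer windows and heavier traffic push the floor lower, which makes a measured value easier to compare against thresholds such as 5e-12.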
For a detailed description of the command-line parameters, see the previous chapter, “Bit Error Rate”.
BER check fragment of the ibdiagnet2.log file:
-E- Symbol BER Check finished with errors
-E- H-10/U1/P1 - BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12
-E- H-14/U1/P1 - BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-LL-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12
-E- H-3/U1/P1 - BER exceeds threshold - BER type: Symbol BER, FEC mode: MLNX_RS_544_514_PLR, BER value = 1.500000e+01 / threshold = 5.000000e-12
-E- H-7/U1/P1 - BER exceeds threshold - BER type: Symbol BER, FEC mode: MLNX_RS_271_257_PLR, BER value = 1.500000e+01 / threshold = 5.000000e-12
-E- SW-1-0/U1/P4 - BER exceeds threshold - BER type: Symbol BER, FEC mode: RS_FEC_544_514, BER value = 1.500000e+01 / threshold = 5.000000e-12
-E- SW-1-0/U1/P5 - BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-LL-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12
---------------------------------------------
Fabric Summary
Total Nodes : 24
IB Switches : 8
IB Channel Adapters : 16
IB Aggregation Nodes : 0
IB Routers : 0
Total number of links : 32
Links at 4x10 : 32
High BER reported by 6 ports
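The error lines in the log fragment above follow a fixed pattern, so they can be collected for triage with a regular expression. The Python sketch below is based only on the sample lines shown here; other line formats in ibdiagnet2.log are assumed to be irrelevant and are skipped. The names `BER_LINE`, `BerViolation`, and `parse_ber_errors` are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple

# Matches lines such as:
# -E- H-10/U1/P1 - BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-RS, ...
BER_LINE = re.compile(
    r"^-E- (?P<port>\S+) - BER exceeds threshold - "
    r"BER type: (?P<ber_type>[^,]+), FEC mode: (?P<fec_mode>[^,]+), "
    r"BER value = (?P<value>\S+) / threshold = (?P<threshold>\S+)"
)


class BerViolation(NamedTuple):
    port: str
    ber_type: str
    fec_mode: str
    value: float
    threshold: float


def parse_ber_errors(lines: Iterable[str]) -> List[BerViolation]:
    """Return one BerViolation per 'BER exceeds threshold' line; skip the rest."""
    violations = []
    for line in lines:
        m = BER_LINE.match(line.strip())
        if m:
            violations.append(BerViolation(
                port=m["port"],
                ber_type=m["ber_type"],
                fec_mode=m["fec_mode"],
                value=float(m["value"]),
                threshold=float(m["threshold"]),
            ))
    return violations


if __name__ == "__main__":
    sample = [
        "-E- Symbol BER Check finished with errors",  # summary line, skipped
        "-E- SW-1-0/U1/P4 - BER exceeds threshold - BER type: Symbol BER, "
        "FEC mode: RS_FEC_544_514, BER value = 1.500000e+01 / threshold = 5.000000e-12",
    ]
    for v in parse_ber_errors(sample):
        print(v.port, v.fec_mode, v.value > v.threshold)
```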
The corresponding BER check error section in the ibdiagnet2.db_csv file:
START_ERRORS_SYMBOL_BER_CHECK
Scope,NodeGUID,PortGUID,PortNumber,EventName,Summary
PORT,0x0002c90000000005,0x0002c90000000006,1,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
PORT,0x0002c90000000015,0x0002c90000000016,1,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-LL-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
PORT,0x0002c90000000025,0x0002c90000000026,1,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: MLNX_RS_544_514_PLR, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
PORT,0x0002c90000000035,0x0002c90000000036,1,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: MLNX_RS_271_257_PLR, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
PORT,0x0002c90000000049,0x0002c90000000049,4,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: RS_FEC_544_514, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
PORT,0x0002c90000000049,0x0002c90000000049,5,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-LL-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
END_ERRORS_SYMBOL_BER_CHECK
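The db_csv section above is delimited by START_ and END_ marker lines, with a CSV header on the first line inside the block. The Python sketch below extracts one such section with the standard `csv` module, which correctly handles the quoted Summary field even though it contains commas. It assumes every section follows the marker layout shown here; the function name `read_db_csv_section` is hypothetical.

```python
import csv
import io
from typing import Dict, List


def read_db_csv_section(text: str, section: str) -> List[Dict[str, str]]:
    """Return the rows between START_<section> and END_<section> as dicts.

    The first line inside the block is treated as the CSV header.
    """
    lines = text.splitlines()
    stripped = [line.strip() for line in lines]
    try:
        first = stripped.index(f"START_{section}") + 1
        last = stripped.index(f"END_{section}", first)
    except ValueError:
        return []  # section (or its end marker) not present
    reader = csv.DictReader(io.StringIO("\n".join(lines[first:last])))
    return list(reader)


if __name__ == "__main__":
    sample = "\n".join([
        "START_ERRORS_SYMBOL_BER_CHECK",
        "Scope,NodeGUID,PortGUID,PortNumber,EventName,Summary",
        'PORT,0x0002c90000000049,0x0002c90000000049,4,BER_EXCEEDS_THRESHOLD,'
        '"BER exceeds threshold - BER type: Symbol BER, FEC mode: RS_FEC_544_514, '
        'BER value = 1.500000e+01 / threshold = 5.000000e-12 "',
        "END_ERRORS_SYMBOL_BER_CHECK",
    ])
    for row in read_db_csv_section(sample, "ERRORS_SYMBOL_BER_CHECK"):
        print(row["NodeGUID"], row["PortNumber"], row["EventName"])
```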