Link Diagnostic Per Port
When debugging a system, it is important to be able to quickly identify the root of a problem. The diagnostic commands provide insight into the physical layer components, showing information such as cable status (plugged/unplugged) or whether Auto-Negotiation has failed.
Below is a list of possible output messages:
Monitor_opcode
Detailed Description
Detailed Mitigation
0—No issue observed
Wait 5 seconds and check again. If the message continues, check peer side.
1—Port is closed by command
PAOS down command; also used for port shutdown, for example.
Check who sent the command to close the port and reopen it.
2—AN failure
Both sides did not agree on speed/FEC or DME is missing.
Debug Steps:
3—AN failure
Ack not received.
Not relevant for NDR.
4—AN failure
Next-page exchange failed.
5—Link training failure.
Frame lock not acquired.
6—Link training failure.
Link inhibit timeout.
7—Link training failure.
Link partner did not set receiver ready.
8—Link training failure.
Tuning did not complete.
9—Logical mismatch between link partners
Did not acquire block lock.
10—Logical mismatch between link partners
Did not acquire AM lock (NO FEC).
11—Logical mismatch between link partners
Did not get align_status. AN is done but the signal is not locked. Very rare.
12—Logical mismatch between link partners
FC FEC is not locked.
13—Logical mismatch between link partners
RS FEC is not locked.
14—Remote fault received
Wait 5 seconds and check again. If the message continues, check peer side.
15—Bad signal integrity
Low Raw BER. Let the link run for a minimum time before checking.
The link is up, but with low Raw BER.
Steps:
16—Cable compliance code mismatch (protocol mismatch between cable and port)
17—Bad signal integrity
Not relevant for NDR.
18, 20—Internal error
19—Internal error
20—Stamping of non-NVIDIA Cables/Modules
Replace the cable with an NVIDIA cable.
21—Down by PortInfo MAD
Need to check who sent the command to close the port and reopen it.
22—Internal error
Not relevant for the field.
23—Internal error
Calibration failure.
24—EDR speed is not allowed due to cable stamping: EDR stamping
Cable is invalid.
Replace the cable with an NVIDIA cable.
25—FDR10 speed is not allowed due to cable stamping: FDR10 stamping
26—Port is closed due to cable stamping: Ethernet_compliance_code_zero
27—Port is closed due to cable stamping: 56GE stamping
28—Port is closed due to cable stamping: non-NVIDIA QSFP28
29—Port is closed due to cable stamping: non-NVIDIA SFP28
30—Port is closed, no backplane enabled speed over backplane channel
Check that the port is configured correctly: same speeds, width, and FECs, or that AN is fully enabled.
31—Port is closed, no passive protocol enabled over passive copper channel
32—Port is closed, no active protocol enabled over active channel
33—Port width does not match the port speed enabled
34—Local Speed degradation
The link is up, but with lower speed than expected.
Steps:
35—Remote Speed degradation
Review remote side status.
36—No Partner detected during force mode.
37—Partial link indication during force mode.
Debug steps:
38—AN failure
FEC mismatch during override.
39—AN failure
No HCD.
40
N/A
Not relevant for NDR.
41—Port is closed, module can’t be set to the enabled rate
42—Bad SI, cable is configured to non optimal rate
Check that the port is configured correctly: same speeds, width, and FECs, or that AN is fully enabled.
43—No Partner Detected in Force Mode and Fast Link Up
Not relevant for NDR.
44-47
N/A
48—Bad signal integrity
49—Bad signal integrity
50—Internal error
51—HST speed mismatch
52—Bad signal integrity
The link is up, but with low Raw BER.
Steps:
1) Wait and test again after some time.
2) Clean the fiber on both sides and the transceivers (including reinsertion).
3) Check the Tx power and Rx power:
a. Low Tx power: check for a transceiver issue.
b. Low Rx power: check the Tx power on the peer side, and clean the fiber and transceiver (both ends).
4) Collect SNR (electrical and optical) from both sides.
a. Available in mlxlink -m and other tools.
5) If the link still shows low BER, test with PRBS.
a. See the steps in the mlxlink help output or in the attached Excel file.
6) Collect mlxlink and mstdump outputs and share them with us.
a. If the PRBS results are successful: we will debug the firmware.
b. If the PRBS results show low BER: we will debug the SerDes.
c. If a specific lane does not lock, the cause is less clear and might be the transceiver, NIC, firmware, or SerDes.
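The branching in step 3 above can be sketched as a small helper. This is only an illustration: the threshold values and the names `OpticalReading` and `triage_optical_power` are hypothetical and do not come from any NVIDIA tool. Real limits depend on the transceiver in use.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds in dBm (assumption); take real limits from the
# transceiver datasheet.
LOW_TX_DBM = -8.0
LOW_RX_DBM = -10.0


@dataclass
class OpticalReading:
    """Tx/Rx optical power read on the local side of the link."""
    tx_power_dbm: float
    rx_power_dbm: float


def triage_optical_power(reading: OpticalReading) -> List[str]:
    """Return the follow-up actions from step 3 for a given reading."""
    actions: List[str] = []
    if reading.tx_power_dbm < LOW_TX_DBM:
        # Step 3a: low Tx power points at the local transceiver.
        actions.append("Check for a transceiver issue")
    if reading.rx_power_dbm < LOW_RX_DBM:
        # Step 3b: low Rx power points at the peer or the fiber path.
        actions.append("Check Tx power on the peer side")
        actions.append("Clean fiber and transceiver on both ends")
    return actions
```

An empty result means neither power level is below its threshold, so the procedure continues with step 4.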
53—Link failure due to MCB at link up
Wait 10 seconds. If the message appears again, share information from both sides and toggle the link.
54—PLR didn't get Rx good non sync cell
55—PSI fatal error
56—module_lanes_frequency_not_synced
Not relevant for NDR
57—signal not detected
59—Did not get module conf done
Power detection in the SerDes is not detected.
58
N/A
Not relevant for NDR.
128—Troubleshooting in process
Wait 3 seconds and run the command again.
1023—Info not available
Wait 10 seconds. If the message appears again, share information from both sides and run a power cycle.
1024—Cable is unplugged
No physical transceiver detected in the cage.
Plug in a transceiver. Also verify that no one has run a command that simulates an unplugged transceiver.
1025—Long Range for non Mellanox cable/module.
Long-range non-NVIDIA cables are not supported.
Replace the cable with an NVIDIA cable.
1026—Bus stuck (I2C Data or clock shorted)
Received failure on the I2C EEPROM communication line.
Reset the transceiver (disable/enable). If the issue continues, collect information and data, then run a power cycle.
1027—Bad/unsupported EEPROM
Failed to read the EEPROM from the transceiver, or the transceiver ID is not recognized.
Test with another approved transceiver. If the issue continues, collect data and share it.
1028—Part number list
Transceiver is not permitted by the vendor list.
Replace the cable with a cable from the supported list.
1029—Unsupported cable.
SFP transceiver is not supported.
1030—Module temperature shutdown
Transceiver temperature exceeded the allowed threshold.
Check the cable temperature and cool the environment if it is indeed too hot.
1031—Shorted cable
Received overcurrent on the transceiver.
Bad transceiver. Test with a different transceiver.
1032—Power budget exceeded
Board power budget has been exceeded.
Review the power supported by the transceiver and the board INI.
1033—Management forced down the port
Module shutdown by server command.
Review the server commands.
1034—Module is disabled by command
Transceiver admin status is disabled.
Enable admin status.
1036—Module’s PMD type is not enabled (see PMTPS).
Transceiver type is not supported.
Replace the transceiver.
1037
N/A
Not relevant for NDR.
1038
N/A
1039
N/A
1040—PCIe system power slot exceeded
1041
N/A
1042—Module state machine fault
1043—Module’s stamping speed degeneration
1044—Module’s stamping speed degeneration
HDR speed is not supported.
Replace the cable with an NVIDIA cable.
1045—Module’s stamping speed degeneration
EDR speed is not supported.
1046—Module’s stamping speed degeneration
FDR10 speed is not supported.
1047—Modules DataPath FSM fault
Failed to configure speed (application) by the transceiver.
Wait 10 seconds. If the message appears again, share information from both sides and run a power cycle.
1048—Modules DataPath FSM fault
Core/Driver (2048—3071):
2048—MPR Violation (Under 64 bytes between two starts).
Wait 10 seconds. If the message appears again, share information from both sides and run a power cycle.
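When post-processing diagnostics in a script, the table above can be carried as a simple lookup. The sketch below copies only a handful of opcodes from the table and is not a complete list. The names `OPCODE_MESSAGES`, `lookup`, and `is_core_driver` are hypothetical, and treating the Core/Driver range 2048 to 3071 as inclusive on both ends is an assumption.

```python
# Partial copy of the opcode table above (illustrative subset only).
OPCODE_MESSAGES = {
    0: "No issue observed",
    14: "Remote fault received",
    128: "Troubleshooting in process",
    1023: "Info not available",
    1024: "Cable is unplugged",
    1030: "Module temperature shutdown",
    1031: "Shorted cable",
    1032: "Power budget exceeded",
}

# Core/Driver opcodes, assumed inclusive of both 2048 and 3071.
CORE_DRIVER_OPCODES = range(2048, 3072)


def lookup(opcode: int) -> str:
    """Return the message for an opcode, or a placeholder if it is not listed."""
    return OPCODE_MESSAGES.get(opcode, f"Unknown opcode {opcode}")


def is_core_driver(opcode: int) -> bool:
    """True if the opcode falls in the Core/Driver range."""
    return opcode in CORE_DRIVER_OPCODES
```

For example, `lookup(1024)` returns the "Cable is unplugged" message, while an opcode missing from the subset yields the placeholder string rather than raising an error.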
show interfaces ib link-diagnostics
show interfaces ib [device/port] link-diagnostics
Displays link diagnostics for a specific InfiniBand module/port, or for all InfiniBand ports.
Configuration Mode
config
History
3.6.4000
Example
switch (config) # show interfaces ib link-diagnostics
show interfaces ib internal leaf link-diagnostics
show interfaces ib internal leaf <module/port> link-diagnostics
Displays link diagnostics for a specific InfiniBand internal leaf module/port.
Configuration Mode
config
History
3.6.4000
Example
switch (config) # show interfaces ib internal leaf 1 link-diagnostics
show interfaces ib internal spine link-diagnostics
show interfaces ib internal spine <module/port> link-diagnostics
Displays link diagnostics for a specific InfiniBand internal spine module/port.
Configuration Mode
config
History
3.6.4000
Example
switch (config) # show interfaces ib internal spine 3/1/1 link-diagnostics
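Several mitigations in the opcode table ask you to wait a few seconds and run the command again. A minimal sketch of that pattern follows. It assumes a caller-supplied `read_opcode` function (hypothetical) that returns the current opcode for a port. How that value is obtained from the switch is deliberately left out, since the command output format is not described here.

```python
import time
from typing import Callable, Tuple


def recheck_opcode(
    read_opcode: Callable[[], int],
    wait_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bool]:
    """Read the opcode, wait, then read it again.

    Returns (latest_opcode, persisted), where persisted is True when the
    same opcode was reported both before and after the wait.
    """
    first = read_opcode()
    sleep(wait_seconds)  # e.g. 3, 5, or 10 seconds, as the table suggests
    second = read_opcode()
    return second, first == second
```

If `persisted` is True, the message did not clear after the wait and the next mitigation step for that opcode applies (for example, checking the peer side or collecting information from both sides).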
