Link Diagnostic Per Port
When debugging a system, it is important to be able to quickly identify the root of a problem. The diagnostic commands provide insight into the physical layer components, showing information such as cable status (plugged/unplugged) or whether Auto-Negotiation has failed.
Below is a list of possible output messages:
Monitor _opcode |
Detailed Description |
Detailed Mitigation |
0—No issue observed |
Wait 5 seconds and check again. If the message continues, check peer side. |
|
1—Port is closed by command |
PAOS down command; also used for port shutdown, for example. |
Check who sent the command to close the port and reopen it. |
2—AN failure |
Both sides did not agree on speed/FEC or DME is missing. |
Debug Steps:
|
3—AN failure |
Ack not received. |
Not relevant for NDR. |
4—AN failure |
Next-page exchange failed. |
|
5—Link training failure. |
Frame lock not acquired. |
|
6—Link training failure. |
Link inhibit timeout. |
|
7—Link training failure. |
Link partner did not set receiver ready. |
|
8—Link training failure. |
Tuning did not complete. |
|
9—Logical mismatch between link partners |
Did not acquire block lock. |
|
10—Logical mismatch between link partners |
Did not acquire AM lock (NO FEC). |
|
11—Logical mismatch between link partners |
Did not get align_status. AN is done but the signal is not locked. Very rare. |
|
12—Logical mismatch between link partners |
FC FEC is not locked. |
|
13—Logical mismatch between link partners |
RS FEC is not locked. |
|
14—Remote fault received |
Wait 5 seconds and check again. If the message continues, check peer side. |
|
15—Bad signal integrity |
Low raw BER. Let the link run for a minimum period before checking. |
The link is up, but with low Raw BER. Steps:
|
16—Cable compliance code mismatch (protocol mismatch between cable and port) |
|
|
17—Bad signal integrity |
Not relevant for NDR. |
|
18, 20—Internal error |
||
19—Internal error |
||
20—Stamping of non-NVIDIA Cables/Modules |
Replace the cable with an NVIDIA cable. |
|
21—Down by PortInfo MAD |
Check who sent the command to close the port and reopen it. |
|
22—Internal error |
Not relevant for the field. |
|
23—Internal error |
Calibration failure. |
|
24—EDR speed is not allowed due to cable stamping: EDR stamping |
Cable is invalid. |
Replace the cable with an NVIDIA cable. |
25—FDR10 speed is not allowed due to cable stamping: FDR10 stamping |
||
26—Port is closed due to cable stamping: Ethernet_compliance_code_zero |
||
27—Port is closed due to cable stamping: 56GE stamping |
||
28—Port is closed due to cable stamping: non-NVIDIA QSFP28 |
||
29—Port is closed due to cable stamping: non-NVIDIA SFP28 |
||
30—Port is closed, no backplane enabled speed over backplane channel |
Check that the port is configured correctly (same speeds, width, and FECs) or that AN is fully enabled. |
|
31—Port is closed, no passive protocol enabled over passive copper channel |
||
32—Port is closed, no active protocol enabled over active channel |
||
33—Port width does not match the enabled port speed |
||
34—Local Speed degradation |
The link is up, but with lower speed than expected. Steps:
|
|
35—Remote Speed degradation |
Review remote side status. |
|
36—No Partner detected during force mode.
37—Partial link indication during force mode. |
Debug steps:
|
|
38—AN failure |
FEC mismatch during override. |
|
39—AN failure |
No HCD. |
|
40 |
N/A |
Not relevant for NDR. |
41—Port is closed, module can’t be set to the enabled rate |
||
42—Bad SI, cable is configured to non optimal rate |
Check that the port is configured correctly (same speeds, width, and FECs) or that AN is fully enabled. |
|
43—No Partner Detected in Force Mode and Fast Link Up |
Not relevant for NDR. |
|
44-47 |
N/A |
|
48—Bad signal integrity |
||
49—Bad signal integrity |
||
50—Internal error |
||
51—HST speed mismatch |
|
|
52—Bad signal integrity |
The link is up, but with low raw BER. Steps:
1) Wait, then test again after some time.
2) Clean the fiber on both sides and the transceivers (including reinsertion).
3) Look at the Tx power and Rx power.
   a. Low Tx power: check for a transceiver issue.
   b. Low Rx power: check the Tx power on the peer side and clean the fiber and transceiver (both ends).
4) Collect SNR (electrical and optical) from both sides.
   a. Available in mlxlink -m and other tools.
5) If the link stays at low BER, test with PRBS.
   a. See the steps in the mlxlink help flag or in the attached Excel file.
6) Collect mlxlink output and mstdumps and share them with us.
   a. If the PRBS results are successful: we will debug the firmware.
   b. If the PRBS results show low BER: we will debug the SerDes.
   c. If a specific lane does not lock, the cause might be the transceiver, NIC, firmware, or SerDes. |
|
53—Link failure due to MCB at link up |
Wait for 10 seconds; if the message reappears, share information from both sides and toggle the link. |
|
54—PLR didn't get Rx good non sync cell |
||
55—PSI fatal error |
||
56—module_lanes_frequency_not_synced |
Not relevant for NDR |
|
57—Signal not detected
59—Did not get module conf done |
Power is not detected in the SerDes. |
|
58 |
N/A |
Not relevant for NDR. |
128—Troubleshooting in process |
Wait 3 seconds and run the command again. |
|
1023—Info not available |
Wait for 10 seconds; if the message reappears, share information from both sides and run a power cycle. |
|
1024—Cable is unplugged |
No physical transceiver detected in the cage. |
Plug in a transceiver. Also verify that no one ran a command that simulates an unplugged transceiver. |
1025—Long Range for non Mellanox cable/module. |
Long-range non-NVIDIA cables are not supported. |
Replace the cable with an NVIDIA cable. |
1026—Bus stuck (I2C Data or clock shorted) |
Received failure on the I2C EEPROM communication line. |
Reset the transceiver (disable/enable). If the issue continues, collect information and data, then run a power cycle. |
1027—Bad/unsupported EEPROM |
Failed to read the EEPROM from the transceiver, or the transceiver ID is not recognized. |
Test with another approved transceiver. If the issue continues, collect data and share it. |
1028—Part number list |
Transceiver is not permitted by the vendor list. |
Replace the cable with a cable from the supported list. |
1029—Unsupported cable. |
SFP transceiver is not supported. |
|
1030—Module temperature shutdown |
Transceiver temperature exceeded the allowed threshold. |
Check the cable temperature and cool the environment if it is indeed too hot. |
1031—Shorted cable |
Overcurrent detected on the transceiver. |
Bad transceiver; test with a different transceiver. |
1032—Power budget exceeded |
The board power budget has been exceeded. |
Review supported power by the transceiver and board INI. |
1033—Management forced down the port |
Module shutdown by server command. |
Review the server commands. |
1034—Module is disabled by command |
Transceiver admin status is disabled. |
Enable admin status. |
1036—Module’s PMD type is not enabled (see PMTPS). |
Transceiver type not supported. |
Replace the transceiver. |
1037 |
N/A |
Not relevant for NDR. |
1038 |
N/A |
|
1039 |
N/A |
|
1040—PCIe system power slot exceeded |
||
1041 |
N/A |
|
1042—Module state machine fault |
||
1043—Module’s stamping speed degeneration |
||
1044—Module’s stamping speed degeneration |
HDR speed is not supported. |
Replace the cable with an NVIDIA cable. |
1045—Module’s stamping speed degeneration |
EDR speed is not supported. |
|
1046—Module’s stamping speed degeneration |
FDR10 speed is not supported. |
|
1047—Modules DataPath FSM fault |
Failed to configure speed (application) by the transceiver. |
Wait for 10 seconds; if the message reappears, share information from both sides and run a power cycle. |
1048—Modules DataPath FSM fault |
||
Core/Driver (2048—3071): |
||
2048—MPR Violation (Under 64 bytes between two starts). |
Wait for 10 seconds; if the message reappears, share information from both sides and run a power cycle. |
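When many ports are triaged from a script, the table above can be treated as a simple lookup. The following is a minimal sketch and not part of the switch software: the names `OPCODES`, `RECHECK_SECONDS`, `in_core_driver_range`, and `describe` are hypothetical, only a subset of the rows is transcribed, and the only range check is the Core/Driver range (2048 to 3071), because that is the only range the table states explicitly.

```python
# Hypothetical triage helper; only a subset of the table rows is transcribed.
OPCODES = {
    0: "No issue observed",
    1: "Port is closed by command",
    14: "Remote fault received",
    128: "Troubleshooting in process",
    1023: "Info not available",
    1024: "Cable is unplugged",
    1030: "Module temperature shutdown",
    2048: "MPR Violation (Under 64 bytes between two starts)",
}

# Wait times (in seconds) that the table gives before checking again.
RECHECK_SECONDS = {0: 5, 14: 5, 53: 10, 128: 3, 1023: 10, 1047: 10, 2048: 10}


def in_core_driver_range(opcode: int) -> bool:
    """Return True if the opcode is in the Core/Driver range named in the table."""
    return 2048 <= opcode <= 3071


def describe(opcode: int) -> str:
    """Return a one-line summary for an opcode, including any recheck delay."""
    message = OPCODES.get(opcode, "not transcribed in this sketch")
    parts = [f"{opcode}: {message}"]
    if in_core_driver_range(opcode):
        parts.append("[Core/Driver]")
    delay = RECHECK_SECONDS.get(opcode)
    if delay is not None:
        parts.append(f"(recheck after {delay} s)")
    return " ".join(parts)


if __name__ == "__main__":
    for code in (128, 1024, 2048):
        print(describe(code))
```

Keeping the recheck delays in a separate mapping makes it easy for a script to sleep for the documented interval before querying the port again, without mixing timing data into the message text.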
show interfaces ib link-diagnostics
show interfaces ib [device/port] link-diagnostics
Displays link diagnostics for a specific InfiniBand module/port or for all InfiniBand ports. |
||
Syntax Description |
N/A |
|
Default |
N/A |
|
Configuration Mode |
config |
|
History |
3.6.4000 |
|
Example |
switch (config) # show interfaces ib link-diagnostics |
|
Related Commands |
||
Notes |
show interfaces ib internal leaf link-diagnostics
show interfaces ib internal leaf <module/port> link-diagnostics
Displays link diagnostics for a specific InfiniBand internal leaf module/port. |
||
Syntax Description |
N/A |
|
Default |
N/A |
|
Configuration Mode |
config |
|
History |
3.6.4000 |
|
Example |
switch (config) # show interfaces ib internal leaf 1 link-diagnostics |
|
Related Commands |
||
Notes |
show interfaces ib internal spine link-diagnostics
show interfaces ib internal spine <module/port> link-diagnostics
Displays link diagnostics for a specific InfiniBand internal spine module/port. |
||
Syntax Description |
N/A |
|
Default |
N/A |
|
Configuration Mode |
config |
|
History |
3.6.4000 |
|
Example |
switch (config) # show interfaces ib internal spine 3/1/1 link-diagnostics |
|
Related Commands |
||
Notes |
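The three commands above differ only in the interface selector, which is convenient when they are issued from an automation script. A minimal sketch, assuming a hypothetical helper named `link_diagnostics_command` that only builds the command string; how the string is sent to the switch and how the output is parsed are outside this sketch.

```python
from typing import Optional

# Command prefixes taken from the three command forms documented above.
_PREFIXES = {
    "external": "show interfaces ib",
    "leaf": "show interfaces ib internal leaf",
    "spine": "show interfaces ib internal spine",
}


def link_diagnostics_command(scope: str = "external",
                             target: Optional[str] = None) -> str:
    """Build a link-diagnostics show command (hypothetical helper).

    scope  -- "external", "leaf", or "spine"
    target -- device/port or module/port, e.g. "3/1/1"; optional only for
              the external form, which then covers all InfiniBand ports
    """
    if scope not in _PREFIXES:
        raise ValueError(f"unknown scope: {scope!r}")
    if target is None and scope != "external":
        raise ValueError("internal leaf/spine commands need a module/port")
    parts = [_PREFIXES[scope]]
    if target is not None:
        parts.append(str(target))
    parts.append("link-diagnostics")
    return " ".join(parts)


if __name__ == "__main__":
    print(link_diagnostics_command())
    print(link_diagnostics_command("leaf", "1"))
    print(link_diagnostics_command("spine", "3/1/1"))
```

The strings produced for the three calls in the `__main__` block match the Example rows of the corresponding command tables.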