Link Diagnostic Per Port
When debugging a system, quickly identifying the root cause of a problem is crucial. The LDPP's diagnostic commands provide valuable insights into the physical layer components, enabling users to access critical information such as cable status (plugged or unplugged) and Auto-Negotiation failures. This facilitates efficient troubleshooting and resolution.
Below is a list of possible output messages:
Code | Technical Description | Detailed Description | Detailed Mitigation |
0 | No issue observed | N/A | Wait 5 seconds and check again. If the message continues, check peer side. |
1 | Port is closed by command (see PAOS) | PAOS down command, also used for port shutdown, for example. | Check who sent the command to close the port and reopen it. |
2 | AN no partner detected | Both sides did not agree on speed/FEC or DME is missing. | Debug Steps:
|
3 | AN failure | AN ack not received | N/A |
4 | AN failure | AN next-page exchange failed | |
5 | Link training failure | KR frame lock not acquired | N/A |
6 | Link training failure | KR link inhibit timeout | |
7 | Link training failure | KR link partner did not set receiver ready | |
8 | Link training failure | KR tuning did not complete | |
9 | Logical mismatch between link partners | PCS did not acquire block lock | N/A |
10 | Logical mismatch between link partners | PCS did not acquire AM lock (NO FEC) | |
11 | Logical mismatch between link partners | PCS did not get align_status | |
12 | Logical mismatch between link partners | FC FEC is not locked | |
13 | Logical mismatch between link partners | RS FEC is not locked | |
14 | Remote fault received | N/A | Wait 5 seconds and check again. If the message continues, check peer side. |
15 | Bad Signal integrity | Low Raw BER. Ensure the system runs for a minimum duration before performing checks. | The link is up, but with low Raw BER. Steps:
|
16 | Compliance code mismatch (protocol mismatch between cable and port) | N/A |
|
17 | Bad signal integrity | Large number of physical errors (high BER) | |
18 | Port is disabled by Ekey | N/A | |
19 | Phase EO failure | N/A | |
20 | Stamping of non-NVIDIA Cables/Modules | N/A | Replace the cable with an NVIDIA cable. |
21 | Down by PortInfo MAD | N/A | Need to check who sent the command to close the port and reopen it. |
22 | Disabled by Verification | N/A | Not relevant for the field. |
23 | Calibration failure | N/A |
|
24 | Cable is invalid | EDR speed is not allowed due to cable stamping: EDR stamping | Replace the cable with an NVIDIA cable. |
25 | Cable is invalid | FDR10 speed is not allowed due to cable stamping: FDR10 stamping | |
26 | Cable is invalid | Port is closed due to cable stamping: Ethernet_compliace_code_zero | |
27 | Cable is invalid | Port is closed due to cable stamping: 56GE stamping | |
28 | Cable is invalid | Port is closed due to cable stamping: non-NVIDIA QSFP28 | |
29 | Cable is invalid | Port is closed due to cable stamping: non-NVIDIA SFP28 | |
30 | Port is closed | No backplane enabled speed over backplane channel | Check that the port is configured correctly: same speeds, width and FECs, or that AN is fully enabled. |
31 | Port is closed | No passive protocol enabled over passive copper channel | |
32 | Port is closed | No active protocol enabled over active channel | |
33 | Port width does not match the port speed enabled | N/A | |
34 | Local speed degradation | N/A | The link is up, but with lower speed than expected. Steps:
|
35 | Remote speed degradation | N/A | Review remote side status. |
36 | No Partner detected during force mode | N/A | Debug steps:
|
37 | Partial link indication during force mode | N/A | |
38 | AN Failure | FEC mismatch during override |
|
39 | AN Failure | No HCD | |
40 | VPI protocols do not match | N/A | N/A |
41 | Port is closed, module cannot be set to the enabled rate | N/A | N/A |
42 | Bad SI, cable is configured to non optimal rate | N/A | Check that the port is configured correctly: same speeds, width and FECs, or that AN is fully enabled. |
1023 | Information not available | N/A | Wait 10 seconds and read the status again. If the message persists, share information from both sides and run a power cycle. |
MNG FW issues (1024-2047) | |||
1024 | Cable is unplugged/powered off | No physical transceiver detected in the cage | Plug in the transceiver. Also verify that no one ran a command simulating an unplugged transceiver. |
1025 | Long Range for non NVIDIA cable/module | No support for long range non-NVIDIA cables | Replace the cable with an NVIDIA cable. |
1026 | Bus stuck (I2C Data or clock shorted) | Received failure on the I2C EEPROM communication line | Reset the transceiver (disable/enable). If the issue continues, collect information and data, then run a power cycle. |
1027 | Bad/unsupported EEPROM | Failed to read EEPROM from transceiver, or transceiver ID is not recognized | Test with another approved transceiver. If the issue continues, collect data and share. |
1028 | Part number list | Transceiver is not permitted by vendor list | Replace the cable with a cable from the supported list. |
1029 | Unsupported cable | SFP transceiver is not supported | |
1030 | Module temperature shutdown | Transceiver temperature exceeded the allowed threshold | Check the cable temperature and cool the environment if it is indeed too hot. |
1031 | Shorted cable | Overcurrent received on the transceiver | Bad transceiver; test with a different transceiver. |
1032 | Power budget exceeded | Board power budget has been exceeded. | Review the power supported by the transceiver and the board INI. |
1033 | Management forced the port down | Module shut down by server command | Review the server commands. |
1034 | Module is disabled by command | Transceiver admin status is disabled | Enable admin status. |
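When status codes are collected from many ports, it can help to map each code to its range and its suggested mitigation in a script. The following is a minimal Python sketch, not part of any NVIDIA tool. The names `Diagnosis`, `KNOWN_CODES`, `classify` and `diagnose` are hypothetical, the dictionary holds only a handful of rows copied from the table above, and the label "general" for codes 0-1023 is an assumption, since the table names only the MNG FW range.

```python
from typing import NamedTuple


class Diagnosis(NamedTuple):
    code: int
    group: str
    description: str
    mitigation: str


# Illustrative subset of the table above; a real lookup would carry every row.
KNOWN_CODES = {
    0: ("No issue observed",
        "Wait 5 seconds and check again. If the message continues, check peer side."),
    20: ("Stamping of non-NVIDIA Cables/Modules",
         "Replace the cable with an NVIDIA cable."),
    35: ("Remote speed degradation", "Review remote side status."),
    1034: ("Module is disabled by command", "Enable admin status."),
}


def classify(code: int) -> str:
    """Return the code range a status code falls into."""
    if 1024 <= code <= 2047:
        return "MNG FW"        # range named in the table
    if 0 <= code <= 1023:
        return "general"       # assumed label; the table does not name this range
    return "out of range"


def diagnose(code: int) -> Diagnosis:
    """Look up a status code, with a placeholder for codes missing from the subset."""
    description, mitigation = KNOWN_CODES.get(
        code, ("Not in this sketch's subset", "Consult the full table.")
    )
    return Diagnosis(code, classify(code), description, mitigation)


if __name__ == "__main__":
    for c in (0, 35, 1034, 1500):
        d = diagnose(c)
        print(f"{d.code:>4} [{d.group}] {d.description} -> {d.mitigation}")
```

Keeping the range check separate from the lookup means a code that is missing from the dictionary still reports which range it belongs to, which is often enough to decide whether to inspect the link or the module first.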