NVIDIA NVOS User Manual for NVLink Switches v25.02.1884

Link Diagnostic Per Port

When debugging a system, it is crucial to identify the root cause of a problem quickly. The Link Diagnostic Per Port (LDPP) commands report the state of the physical-layer components, such as cable status (plugged or unplugged) and Auto-Negotiation (AN) failures, so that link problems can be diagnosed and resolved efficiently.

Below is a list of possible output messages:

Code

Technical Description

Detailed Description

Detailed Mitigation

0

No issue observed

N/A

Wait 5 seconds and check again. If the message continues, check peer side.

1

Port is closed by command (see PAOS)

PAOS down command; also used for port shutdown, for example.

Check who sent the command to close the port and reopen it.

2

AN no partner detected

Both sides did not agree on speed/FEC or DME is missing.

Debug Steps:

  1. Check Tx power and Rx power from both sides.

    1. Low Tx power: Check for a transceiver issue.

    2. Low Rx power: Check Tx power from the peer side and clean the fiber and transceiver (both ends).

  2. Check both sides are configured correctly:

    1. Same speeds and FECs or that AN is fully enabled.

  3. If the issue persists, collect data from both sides of the link and escalate.

3

AN failure

AN ack not received

N/A

4

AN failure

AN next-page exchange failed

5

Link training failure

KR frame lock not acquired

N/A

6

Link training failure

KR link inhibit timeout

7

Link training failure

KR link partner did not set receiver ready

8

Link training failure

KR tuning did not complete

9

Logical mismatch between link partners

PCS did not acquire block lock

N/A

10

Logical mismatch between link partners

PCS did not acquire AM lock (NO FEC)

11

Logical mismatch between link partners

PCS did not get align_status

12

Logical mismatch between link partners

FC FEC is not locked

13

Logical mismatch between link partners

RS FEC is not locked

14

Remote fault received

N/A

Wait 5 seconds and check again. If the message continues, check peer side.

15

Bad Signal integrity

Low Raw BER. Ensure the system runs for a minimum duration before performing checks.

The link is up, but with low Raw BER.

Steps:

  1. Wait and test again after some time.

  2. Clean the fiber on both sides and the transceivers.

  3. Check the Tx power and Rx power.

    1. Low Tx power: Check for a transceiver issue.

    2. Low Rx power: Check Tx power from the peer side and clean the fiber and transceiver (both ends).

  4. Collect SNR (electrical and optical) from both sides.

    1. Available in mlxlink -m and other tools.

  5. If the link still shows low BER, test with PRBS.

    1. See the steps in the mlxlink help flag or in the attached Excel file.

  6. Collect mlxlink output and mstdumps from both sides and share them with us.

    1. In case of successful PRBS results: we will debug the firmware.

    2. In case of low BER PRBS results: we will debug the SerDes.

    3. If a specific lane does not lock, the cause might be the transceiver, NIC, firmware, or SerDes.

16

Compliance code mismatch (protocol mismatch between cable and port)

N/A

  1. Verify that the port speed is configured as expected for the cable.

  2. Verify that the cable is the right one for the port. If it is as expected, collect data.

17

Bad signal integrity

Large number of physical errors (high BER)

18

Port is disabled by Ekey

N/A

19

Phase EO failure

N/A

20

Stamping of non-NVIDIA Cables/Modules

N/A

Replace the cable with an NVIDIA cable.

21

Down by PortInfo MAD

N/A

Check who sent the command to close the port and reopen it.

22

Disabled by Verification

N/A

Not relevant for the field.

23

Calibration failure

N/A

  1. Collect data from both sides.

  2. Run a power cycle and check whether the issue repeats. Send us the information and data.

24

Cable is invalid

EDR speed is not allowed due to cable stamping: EDR stamping

Replace the cable with an NVIDIA cable.

25

Cable is invalid

FDR10 speed is not allowed due to cable stamping: FDR10 stamping

26

Cable is invalid

Port is closed due to cable stamping: Ethernet_compliance_code_zero

27

Cable is invalid

Port is closed due to cable stamping: 56GE stamping

28

Cable is invalid

Port is closed due to cable stamping: non-NVIDIA QSFP28

29

Cable is invalid

Port is closed due to cable stamping: non-NVIDIA SFP28

30

Port is closed

No backplane enabled speed over backplane channel

Check that the port is configured correctly: same speeds, width, and FECs, or that AN is fully enabled.

31

Port is closed

No passive protocol enabled over passive copper channel

32

Port is closed

No active protocol enabled over active channel

33

Port width does not match the port speed enabled

N/A

34

Local speed degradation

N/A

The link is up, but with lower speed than expected.

Steps:

  1. Wait and test again after some time.

  2. Clean the fiber on both sides and the transceivers.

  3. Check the Tx power and Rx power.

    1. Low Tx power: Check for a transceiver issue.

    2. Low Rx power: Check Tx power from the peer side and clean the fiber and transceiver (both ends).

  4. Collect SNR (electrical and optical) from both sides.

    1. Available in mlxlink -m and other tools.

  5. If the link still shows low BER, test with PRBS.

    1. See the steps in the mlxlink help flag or in the attached Excel file.

  6. Collect mlxlink output and mstdumps and share them with us.

    1. In case of successful PRBS results: we will debug the firmware.

    2. In case of low BER PRBS results: we will debug the SerDes.

    3. If a specific lane does not lock, the cause might be the transceiver, NIC, firmware, or SerDes.

35

Remote speed degradation

N/A

Review remote side status.

36

No Partner detected during force mode

N/A

Debug steps:

  1. Check Tx power and Rx power from both sides.

    1. Low Tx power: Check for a transceiver issue.

    2. Low Rx power: Check Tx power from the peer side and clean the fiber and transceiver (both ends).

  2. Check both sides are configured correctly:

    1. Same speeds and FECs or that AN is fully enabled.

  3. If the issue persists, collect data and escalate.

37

Partial link indication during force mode

N/A

38

AN Failure

FEC mismatch during override

  1. Check both sides are configured correctly:

    1. Same speeds and FECs or that AN is fully enabled.

  2. If the issue persists, collect data and escalate.

39

AN Failure

No HCD

40

VPI protocols do not match

N/A

N/A

41

Port is closed, module cannot be set to the enabled rate

N/A

N/A

42

Bad SI, cable is configured to non optimal rate

N/A

Check that the port is configured correctly: same speeds, width, and FECs, or that AN is fully enabled.

1023

Information not available

N/A

Wait 10 seconds. If the message appears again, share information from both sides and run a power cycle.

MNG FW issues (1024-2047)

1024

Cable is unplugged/powered off

No physical transceiver detected in the cage

Plug in the transceiver. Verify that no one ran a command simulating an unplugged transceiver.

1025

Long Range for non NVIDIA cable/module

No support for long-range non-NVIDIA cables

Replace the cable with an NVIDIA cable.

1026

Bus stuck (I2C Data or clock shorted)

Received failure on the I2C EEPROM communication line

Reset the transceiver (disable/enable). If the issue continues, collect information and data, and then run a power cycle.

1027

Bad/unsupported EEPROM

Failed to read the EEPROM from the transceiver, or the transceiver ID is not recognized

Test with another approved transceiver. If the issue continues, collect data and share it.

1028

Part number list

Transceiver is not permitted by the vendor list

Replace the cable with a cable from the supported list.

1029

Unsupported cable

SFP transceiver is not supported

1030

Module temperature shutdown

Transceiver temperature exceeded the allowed threshold

Check the cable temperature and cool the environment if it is indeed too hot.

1031

Shorted cable

Overcurrent received on the transceiver

Bad transceiver; test with a different transceiver.

1032

Power budget exceeded

Board power budget has been exceeded.

Review the power supported by the transceiver and the board INI.

1033

Management forced down the port

Module shutdown by server command

Review the server commands.

1034

Module is disabled by command

Transceiver admin status is disabled

Enable admin status.
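
The table above can be turned into a small triage helper when diagnostic codes are collected from many ports. The following Python sketch is illustrative only and is not part of NVOS or mlxlink. The names `classify`, `power_hint`, and `MESSAGES` are hypothetical, the "link/physical layer" group label is an assumption (the table names only the 1024-2047 range, as MNG FW issues), and the low-power threshold must be supplied by the caller because the table does not state one. Only a subset of the codes is included.

```python
from typing import NamedTuple


class Diagnosis(NamedTuple):
    code: int
    group: str
    message: str


# Subset of the technical descriptions from the table above.
# Extend this mapping with the remaining codes as needed.
MESSAGES = {
    0: "No issue observed",
    2: "AN no partner detected",
    14: "Remote fault received",
    1023: "Information not available",
    1024: "Cable is unplugged/powered off",
    1030: "Module temperature shutdown",
}


def classify(code: int) -> Diagnosis:
    """Map a diagnostic code to its group and message (hypothetical helper)."""
    if 1024 <= code <= 2047:
        group = "MNG FW"  # range named in the table
    elif 0 <= code <= 1023:
        group = "link/physical layer"  # assumed label, not from the table
    else:
        group = "unknown"
    return Diagnosis(code, group, MESSAGES.get(code, "not in this subset"))


def power_hint(tx_dbm: float, rx_dbm: float, low_threshold_dbm: float) -> str:
    """Apply the Tx/Rx power debug steps listed for codes 2, 15, 34, and 36.

    low_threshold_dbm is an assumption supplied by the caller; use the limit
    specified for the transceiver in question.
    """
    if tx_dbm < low_threshold_dbm:
        return "Low Tx power: check for a transceiver issue"
    if rx_dbm < low_threshold_dbm:
        return ("Low Rx power: check Tx power from the peer side and "
                "clean the fiber and transceiver (both ends)")
    return "Power levels OK: check speed, FEC, and AN configuration on both sides"


if __name__ == "__main__":
    print(classify(1030))
    print(power_hint(tx_dbm=-2.0, rx_dbm=-12.0, low_threshold_dbm=-10.0))
```

Tx power is checked before Rx power to follow the order of the debug steps in the table. Codes outside 0-2047 are reported as "unknown" rather than guessed.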

© Copyright 2025, NVIDIA. Last updated on Mar 3, 2025.