Link Diagnostic Per Port

NVIDIA MLNX-OS User Manual v3.11.4002

When debugging a system, it is important to identify the root cause of a problem quickly. The diagnostic commands give insight into the physical-layer components, showing information such as cable status (plugged/unplugged) or whether Auto-Negotiation has failed.

Below is a list of possible output messages:

Monitor_opcode

Detailed Description

Detailed Mitigation

0—No issue observed

Wait 5 seconds and check again. If the message continues, check peer side.

1—Port is closed by command

PAOS down command; also used for port shutdown, for example.

Check who sent the command to close the port and reopen it.

2—AN failure

Both sides did not agree on speed/FEC or DME is missing.

Debug Steps:

  1. Check Tx power and Rx power from both sides.

    1. Low Tx power: Check transceiver issue.

    2. Low Rx power: Check Tx power from the peer side, and clean the fiber and transceiver (both ends).

  2. Check both sides are configured correctly:

    1. Same speeds and FECs or that AN is fully enabled.

  3. If the issue persists, collect data from both sides of the link and escalate.
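
The power checks in step 1 can be sketched as a small triage helper. This is only an illustration, not part of MLNX-OS: the function name, the dBm inputs, and both threshold defaults are hypothetical, and real limits should come from the specification of the transceiver in use.

```python
def triage_optical_power(tx_dbm, rx_dbm, tx_low_dbm=-8.0, rx_low_dbm=-10.0):
    """Suggest checks for one side of a link from its Tx/Rx power readings.

    tx_low_dbm and rx_low_dbm are placeholder thresholds (assumptions);
    replace them with the limits published for the actual transceiver.
    """
    actions = []
    if tx_dbm < tx_low_dbm:
        actions.append("Low Tx power: check for a transceiver issue")
    if rx_dbm < rx_low_dbm:
        actions.append(
            "Low Rx power: check Tx power on the peer side and clean "
            "the fiber and transceiver (both ends)"
        )
    if not actions:
        # Power looks normal, so move on to step 2 of the debug steps.
        actions.append("Power looks normal: check the configuration of both sides")
    return actions
```

Run the helper once per side of the link, since the debug steps ask for Tx and Rx power from both ends.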

3—AN failure

Ack not received.

Not relevant for NDR.

4—AN failure

Next-page exchange failed.

5—Link training failure.

Frame lock not acquired.

6—Link training failure.

Link inhibit timeout.

7—Link training failure.

Link partner did not set receiver ready.

8—Link training failure.

Tuning did not complete.

9—Logical mismatch between link partners

Did not acquire block lock.

  1. Check both sides are configured correctly:

    1. Same speeds and FECs or that AN is fully enabled.

  2. If the issue repeats, collect data from both sides and escalate.
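
The configuration check in step 1 can be expressed as a comparison of the two link partners. The sketch below is an assumption-laden illustration: `PortConfig` and its fields are hypothetical names, and the values would have to be collected from each side by whatever means is available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortConfig:
    """Hypothetical summary of one side's port settings."""
    speeds: frozenset        # enabled speeds, for example frozenset({"HDR"})
    fec: str                 # configured FEC mode, for example "RS"
    an_fully_enabled: bool   # True when Auto-Negotiation is fully enabled


def config_mismatches(local, peer):
    """Return the settings that differ, or an empty list if the check passes.

    Mirrors the rule in the text: same speeds and FECs, or AN fully enabled.
    """
    if local.an_fully_enabled and peer.an_fully_enabled:
        return []
    problems = []
    if local.speeds != peer.speeds:
        problems.append("speed")
    if local.fec != peer.fec:
        problems.append("FEC")
    return problems
```

An empty result means the configuration rule is satisfied; if the code still repeats, the next step is to collect data from both sides and escalate.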

10—Logical mismatch between link partners

Did not acquire AM lock (NO FEC).

11—Logical mismatch between link partners

Did not get align_status. AN is done but the signal is not locked. Very rare.

12—Logical mismatch between link partners

FC FEC is not locked.

13—Logical mismatch between link partners

RS FEC is not locked.

14—Remote fault received

Wait 5 seconds and check again. If the message continues, check peer side.

15—Bad signal integrity

Low Raw BER. Make sure the link has been running for a minimum time before checking.

The link is up, but with low Raw BER.

Steps:

  1. Wait to test again after some time.

  2. Clean the fiber on both sides and the transceivers.

  3. Look at the Tx power and Rx Power.

    1. Low Tx power: Check transceiver issue.

    2. Low Rx power: Check Tx power from the peer side, and clean the fiber and transceiver (both ends).

  4. Collect SNR (electrical and optical) from both sides.

    1. Available in mlxlink -m and other tools.

  5. In case the link stays with low BER, test with PRBS.

    1. See the steps in the mlxlink help flag or in the attached Excel file.

  6. Collect mlxlink and mstdumps from both sides and share with us.

    1. In case of successful PRBS results: we will debug the firmware.

    2. In case of low BER PRBS results: we will debug the SerDes.

    3. If a specific lane does not lock, the cause might be the transceiver, NIC, firmware, or SerDes.
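
The three PRBS outcomes in step 6 map to different debug targets. A minimal sketch of that mapping follows; the function name and its two boolean inputs are hypothetical, and how lane lock and PRBS BER are judged is left to the operator.

```python
def prbs_next_step(all_lanes_locked, prbs_ber_ok):
    """Map a PRBS test outcome to the follow-up named in the steps above."""
    if not all_lanes_locked:
        # A lane that never locks does not isolate a single component.
        return "suspect transceiver, NIC, firmware, or SerDes"
    if prbs_ber_ok:
        # PRBS runs cleanly, so the physical path is sound.
        return "debug the firmware"
    return "debug the SerDes"
```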

16—Cable compliance code mismatch (protocol mismatch between cable and port)

  1. Verify that the port speed is configured as expected for the cable.

  2. Verify that the cable is the right one for the port. If it is as expected, collect data.

17—Bad signal integrity

Not relevant for NDR.

18, 20—Internal error

19—Internal error

20—Stamping of non-NVIDIA Cables/Modules

Replace the cable with an NVIDIA cable.

21—Down by PortInfo MAD

Check who sent the command to close the port and reopen it.

22—Internal error

Not relevant for the field.

23—Internal error

Calibration failure.

  1. Collect data from both sides.

  2. Run a power cycle and check whether the issue repeats. Send us the information and data.

24—EDR speed is not allowed due to cable stamping: EDR stamping

Cable is invalid.

Replace the cable with an NVIDIA cable.

25—FDR10 speed is not allowed due to cable stamping: FDR10 stamping

26—Port is closed due to cable stamping: Ethernet_compliace_code_zero

27—Port is closed due to cable stamping: 56GE stamping

28—Port is closed due to cable stamping: non-NVIDIA QSFP28

29—Port is closed due to cable stamping: non-NVIDIA SFP28

30—Port is closed, no backplane enabled speed over backplane channel

Check that the port is configured correctly: same speeds, width, and FECs, or that AN is fully enabled.

31—Port is closed, no passive protocol enabled over passive copper channel

32—Port is closed, no active protocol enabled over active channel

33—Port width does not match the port speed enabled

34—Local Speed degradation

The link is up, but with lower speed than expected.

Steps:

  1. Wait to test again after some time.

  2. Clean the fiber on both sides and the transceivers.

  3. Look at the Tx power and Rx Power.

    1. Low Tx power: Check transceiver issue.

    2. Low Rx power: Check Tx power from the peer side, and clean the fiber and transceiver (both ends).

  4. Collect SNR (electrical and optical) from both sides.

    1. Available in mlxlink -m and other tools.

  5. In case the link stays with low BER, test with PRBS.

    1. See the steps in the mlxlink help flag or in the attached Excel file.

  6. Collect mlxlink and mstdumps and share with us.

    1. In case of successful PRBS results: we will debug the firmware.

    2. In case of low BER PRBS results: we will debug the SerDes.

    3. If a specific lane does not lock, the cause might be the transceiver, NIC, firmware, or SerDes.

35—Remote Speed degradation

Review remote side status.

36—No Partner detected during force mode.

37—Partial link indication during force mode.

Debug steps:

  1. Check Tx power and Rx power from both sides.

    1. Low Tx power: Check transceiver issue.

    2. Low Rx power: Check Tx power from the peer side, and clean the fiber and transceiver (both ends).

  2. Check both sides are configured correctly:

    1. Same speeds and FECs or that AN is fully enabled.

  3. If the issue persists, collect data and escalate.

38—AN failure

FEC mismatch during override.

  1. Check both sides are configured correctly:

    1. Same speeds and FECs or that AN is fully enabled.

  2. If the issue persists, collect data and escalate.

39—AN failure

No HCD.

40

N/A

Not relevant for NDR.

41—Port is closed, module can’t be set to the enabled rate

42—Bad SI, cable is configured to non-optimal rate

Check that the port is configured correctly: same speeds, width, and FECs, or that AN is fully enabled.

43—No Partner Detected in Force Mode and Fast Link Up

Not relevant for NDR.

44-47

N/A

48—Bad signal integrity

49—Bad signal integrity

50—Internal error

51—HST speed mismatch

  1. Check both sides are configured correctly:

    1. Same speeds and FECs or that AN is fully enabled.

  2. If the issue persists, collect data and escalate.

52—Bad signal integrity

The link is up, but with low Raw BER.

Steps:

  1. Wait to test again after some time.

  2. Clean the fiber on both sides and the transceivers (including reinsertions).

  3. Look at the Tx power and Rx Power.

    1. Low Tx power: Check transceiver issue.

    2. Low Rx power: Check Tx power from the peer side, and clean the fiber and transceiver (both ends).

  4. Collect SNR (electrical and optical) from both sides.

    1. Available in mlxlink -m and other tools.

  5. In case the link stays with low BER, test with PRBS.

    1. See the steps in the mlxlink help flag or in the attached Excel file.

  6. Collect mlxlink and mstdumps and share with us.

    1. In case of successful PRBS results: we will debug the firmware.

    2. In case of low BER PRBS results: we will debug the SerDes.

    3. If a specific lane does not lock, the cause might be the transceiver, NIC, firmware, or SerDes.

53—Link failure due to MCB at link up

Wait 10 seconds. If the message appears again, share information from both sides and toggle the link.

54—PLR didn't get Rx good non sync cell

55—PSI fatal error

56—module_lanes_frequency_not_synced

Not relevant for NDR

57—signal not detected

59—Did not get module conf done

Power is not detected in the SerDes.

  1. Wait to test again after some time.

  2. Clean the fiber on both sides and the transceivers (including reinsertions).

  3. Look at the Tx power and Rx Power.

    1. Low Tx power: Check transceiver issue.

    2. Low Rx power: Check Tx power from the peer side, and clean the fiber and transceiver (both ends).

  4. In case the link stays with low BER, test with PRBS.

    1. See the steps in the mlxlink help flag or in the attached Excel file.

  5. Collect mlxlink and mstdumps and share with us.

    1. In case of successful PRBS results: we will debug the firmware.

    2. In case of low BER PRBS results: we will debug the SerDes.

    3. If a specific lane does not lock, the cause might be the transceiver, NIC, firmware, or SerDes.

58

N/A

Not relevant for NDR.

128—Troubleshooting in process

Wait 3 seconds and run the command again.

1023—Info not available

Wait 10 seconds. If the message appears again, share information from both sides and run a power cycle.

1024—Cable is unplugged

No physical transceiver detected in the cage.

Plug in a transceiver. Verify that no one ran a command simulating an unplugged transceiver.

1025—Long Range for non-Mellanox cable/module.

Long range is not supported for non-NVIDIA cables.

Replace the cable with an NVIDIA cable.

1026—Bus stuck (I2C Data or clock shorted)

Received failure on the I2C EEPROM communication line.

Reset the transceiver (disable/enable). If the issue continues, collect information and data, and then run a power cycle.

1027—Bad/unsupported EEPROM

Failed to read EEPROM from the transceiver, or the transceiver ID is not recognized.

Test with another approved transceiver. If the issue continues, collect data and share.

1028—Part number list

Transceiver is not permitted by the vendor list.

Replace the cable with a cable from the supported list.

1029—Unsupported cable.

SFP transceiver is not supported.

1030—Module temperature shutdown

Transceiver temperature exceeded the allowed threshold.

Check the cable temperature and cool the environment if it is indeed too hot.

1031—Shorted cable

Received overcurrent indication on the transceiver.

Bad transceiver; test with a different transceiver.

1032—Power budget exceeded

Board power budget has been exceeded.

Review supported power by the transceiver and board INI.

1033—Management forced down the port

Module shutdown by server command.

Review the server commands.

1034—Module is disabled by command

Transceiver admin status is disabled.

Enable admin status.

1036—Module’s PMD type is not enabled (see PMTPS).

Transceiver type not supported.

Replace the transceiver.

1037

N/A

Not relevant for NDR.

1038

N/A

1039

N/A

1040—PCIe system power slot exceeded

1041

N/A

1042—Module state machine fault

1043—Module’s stamping speed degeneration

1044—Module’s stamping speed degeneration

HDR speed is not supported.

Replace the cable with an NVIDIA cable.

1045—Module’s stamping speed degeneration

EDR speed is not supported.

1046—Module’s stamping speed degeneration

FDR10 speed is not supported.

1047—Modules DataPath FSM fault

Failed to configure speed (application) by the transceiver.

Wait 10 seconds. If the message appears again, share information from both sides and run a power cycle.

1048—Modules DataPath FSM fault

Core/Driver (2048—3071):

2048—MPR Violation (Under 64 bytes between two starts).

Wait 10 seconds. If the message appears again, share information from both sides and run a power cycle.
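
Several mitigations in the list above amount to "wait, then read the code again". The sketch below gathers those documented wait times into one lookup. The wait values are taken from the mitigation text; the function name and the `read_code` callable are hypothetical stand-ins for however the code is actually retrieved.

```python
import time

# Re-check intervals in seconds, as stated in the mitigation text per code.
RECHECK_WAIT_SECONDS = {
    0: 5,
    14: 5,
    53: 10,
    128: 3,
    1023: 10,
    1047: 10,
    2048: 10,
}


def recheck(read_code, first_code, sleep=time.sleep):
    """Wait the documented interval for first_code, then read the code again.

    read_code is a hypothetical zero-argument callable returning the current
    diagnostic code. Returns (code, persisted); persisted is True when the
    same code is still reported after the wait.
    """
    wait = RECHECK_WAIT_SECONDS.get(first_code)
    if wait is None:
        # No wait-and-retry mitigation is documented for this code.
        return first_code, False
    sleep(wait)
    second_code = read_code()
    return second_code, second_code == first_code
```

When `persisted` is true, the documented follow-up applies, for example checking the peer side for code 14 or sharing information from both sides for code 1023.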

show interfaces ib [device/port] link-diagnostics

Displays a specific InfiniBand module/port or all InfiniBand ports.

Syntax Description

N/A

Default

N/A

Configuration Mode

config

History

3.6.4000

Example

    
switch (config) # show interfaces ib link-diagnostics

----------------------------------------------------------------------

Interface Code Status

----------------------------------------------------------------------

IB1/1 0 The port is Active.

IB1/2 0 The port is Active.

IB1/3 1024 Cable unplugged

IB1/4 1024 Cable unplugged

IB1/5 1024 Cable unplugged

IB1/6 1024 Cable unplugged

IB1/7 1024 Cable unplugged

IB1/8 1024 Cable unplugged

IB1/9 1024 Cable unplugged

IB1/10 1024 Cable unplugged

IB1/11 1024 Cable unplugged

IB1/12 1024 Cable unplugged

IB1/13 1024 Cable unplugged

IB1/14 1024 Cable unplugged

IB1/15 1024 Cable unplugged

IB1/16 1024 Cable unplugged

IB1/17 1024 Cable unplugged

IB1/18 1024 Cable unplugged

IB1/19 1024 Cable unplugged

IB1/20 1024 Cable unplugged

IB1/21 1024 Cable unplugged

IB1/22 1024 Cable unplugged

IB1/23 1024 Cable unplugged

IB1/24 1024 Cable unplugged

IB1/25 1024 Cable unplugged

IB1/26 1024 Cable unplugged

IB1/27 1024 Cable unplugged

IB1/28 1024 Cable unplugged

IB1/29 1024 Cable unplugged

IB1/30 1024 Cable unplugged

IB1/31 1024 Cable unplugged

IB1/32 1024 Cable unplugged

IB1/33 1024 Cable unplugged

IB1/34 1024 Cable unplugged

IB1/35 1 The port is closed by command.

IB1/36 2 Auto-Negotiation failure..

Related Commands

Notes
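
When many ports are listed, it can help to group them by code. The following is a minimal parsing sketch that assumes the command output has been captured as text and that each data row has the form interface, code, status, as in the example above. The function names are hypothetical, and the exact column spacing of real output may differ.

```python
import re
from collections import defaultdict

# One data row: interface name, numeric code, free-text status.
ROW = re.compile(r"^(IB\S+)\s+(\d+)\s+(.*\S)\s*$")


def parse_link_diagnostics(text):
    """Return (interface, code, status) tuples for each data row in text."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:  # header and separator lines do not match and are skipped
            rows.append((match.group(1), int(match.group(2)), match.group(3)))
    return rows


def ports_by_code(rows):
    """Group interface names by diagnostic code."""
    grouped = defaultdict(list)
    for interface, code, _status in rows:
        grouped[code].append(interface)
    return dict(grouped)


sample = """
Interface Code Status
----------------------------------------------------------------------
IB1/1 0 The port is Active.
IB1/3 1024 Cable unplugged
IB1/4 1024 Cable unplugged
IB1/35 1 The port is closed by command.
"""
print(ports_by_code(parse_link_diagnostics(sample)))
```

Ports with a nonzero code can then be looked up in the opcode list to find the matching mitigation.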


show interfaces ib internal leaf <module/port> link-diagnostics

Displays a specific InfiniBand internal leaf module/port.

Syntax Description

N/A

Default

N/A

Configuration Mode

config

History

3.6.4000

Example

    
switch (config) # show interfaces ib internal leaf 1 link-diagnostics

----------------------------------------------------------------------

Interface Code Status

----------------------------------------------------------------------

IB1/1/19 0 No issue was observed

IB1/1/20 0 No issue was observed

IB1/1/21 0 No issue was observed

IB1/1/22 0 No issue was observed

IB1/1/23 0 No issue was observed

IB1/1/24 0 No issue was observed

IB1/1/25 0 No issue was observed

IB1/1/26 0 No issue was observed

IB1/1/27 0 No issue was observed

IB1/1/28 0 No issue was observed

IB1/1/29 0 No issue was observed

IB1/1/30 0 No issue was observed

Related Commands

Notes


show interfaces ib internal spine <module/port> link-diagnostics

Displays a specific InfiniBand internal spine module/port.

Syntax Description

N/A

Default

N/A

Configuration Mode

config

History

3.6.4000

Example

    
switch (config) # show interfaces ib internal spine 3/1/1 link-diagnostics

-----------------------------------------------------------------------

Interface Code Status

-----------------------------------------------------------------------

IB3/1/1 0 No issue was observed

Related Commands

Notes

© Copyright 2024, NVIDIA. Last updated on May 6, 2024.