Proprietary Mellanox Adapter Diagnostics Counters

The proprietary Mellanox adapter diagnostics counter set consists of NIC diagnostics counters. These counters collect information from ConnectX®-3 and ConnectX®-3 Pro firmware flows.

Mellanox Adapter Diagnostics Counter

Description

Requester length errors

Number of local length errors when the local machine generates outbound traffic.

Responder length errors

Number of local length errors when the local machine receives inbound traffic.

Requester QP operation errors

Number of local QP operation errors when the local machine generates outbound traffic.

Responder QP operation errors

Number of local QP operation errors when the local machine receives inbound traffic.

Requester protection errors

Number of local protection errors when the local machine generates outbound traffic.

Responder protection errors

Number of local protection errors when the local machine receives inbound traffic.

Requester CQE errors

Number of local CQEs with errors when the local machine generates outbound traffic.

Responder CQE errors

Number of local CQEs with errors when the local machine receives inbound traffic.

Note: RDMA receivers need to post receive WQEs to handle incoming messages. If the application does not know how many messages to expect (knowledge it could obtain, e.g., by maintaining high-level message credits), it may post more receive WQEs than will actually be used.

On application teardown, if the application did not use up all of its receive WQEs, the device issues a completion with error for each unused WQE to indicate that the hardware does not plan to use it. These completions carry a clear syndrome indication of “Flushed with error”.
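As a rough illustration of the note above, a monitoring script might separate flush completions from other completion errors before raising an alert. The sketch below is not tied to any real API: the `Completion` record and the status strings `"success"`, `"flushed_with_error"`, and `"local_length_error"` are hypothetical names chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Completion:
    """Hypothetical completion record; the field names are assumptions."""
    wqe_id: int
    status: str  # e.g. "success", "flushed_with_error", "local_length_error"


def summarize_completions(completions):
    """Split completions into successes, teardown flushes, and other errors."""
    summary = Counter()
    for c in completions:
        if c.status == "success":
            summary["ok"] += 1
        elif c.status == "flushed_with_error":
            # Unused receive WQEs handed back at teardown: expected, not a fault.
            summary["flushed"] += 1
        else:
            summary["real_errors"] += 1
    return summary


sample = [
    Completion(1, "success"),
    Completion(2, "flushed_with_error"),
    Completion(3, "flushed_with_error"),
    Completion(4, "local_length_error"),
]
summary = summarize_completions(sample)
print(dict(summary))
```

Treating flushes as their own bucket keeps a growing CQE error counter at teardown from being mistaken for a data-path problem.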

Requester Invalid request errors

Number of remote invalid request errors when the local machine generates outbound traffic, i.e. a NAK was received indicating that the other end detected an invalid OpCode request.

Responder Invalid request errors

Number of remote invalid request errors when the local machine receives inbound traffic.

Requester Remote access errors

Number of remote access errors when the local machine generates outbound traffic, i.e. a NAK was received indicating that the other end detected a wrong rkey.

Responder Remote access errors

Number of remote access errors when the local machine receives inbound traffic, i.e. the local machine received an RDMA request with a wrong rkey.

Requester RNR NAK

Number of RNR (Receiver Not Ready) NAKs received when the local machine generates outbound traffic.

Responder RNR NAK

Number of RNR (Receiver Not Ready) NAKs sent when the local machine receives inbound traffic.

Requester out of order sequence NAK

Number of Out of Sequence NAKs received when the local machine generates outbound traffic, i.e. the number of times the local machine received NAKs indicating OOS on the receiving side.

Responder out of order sequence received

Number of Out of Sequence packets received when the local machine receives inbound traffic, i.e. the number of times the local machine received messages that are not consecutive.
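The idea of "not consecutive" can be sketched with a toy model of a responder that tracks the next expected packet sequence number (PSN). This is a simplified illustration, not the firmware's logic: `classify_psns` is a hypothetical helper, and the rule that a PSN ahead of the expected one counts as out of sequence while a PSN behind it counts as a duplicate is an assumption based on the usual split of the 24-bit PSN space into two halves.

```python
PSN_MODULUS = 1 << 24      # InfiniBand PSNs are 24 bits wide and wrap around
HALF_WINDOW = 1 << 23      # assumed boundary between "ahead" and "behind"


def classify_psns(first_expected_psn, received_psns):
    """Return (out_of_sequence, duplicates) for a stream of received PSNs."""
    expected = first_expected_psn
    out_of_sequence = 0
    duplicates = 0
    for psn in received_psns:
        distance = (psn - expected) % PSN_MODULUS
        if distance == 0:
            # In-order packet: advance the expected PSN, wrapping at 2**24.
            expected = (expected + 1) % PSN_MODULUS
        elif distance < HALF_WINDOW:
            # PSN is ahead of what was expected: a packet was skipped.
            out_of_sequence += 1
        else:
            # PSN is behind what was expected: a repeat of an earlier packet.
            duplicates += 1
    return out_of_sequence, duplicates


result = classify_psns(0xFFFFFE, [0xFFFFFE, 0xFFFFFF, 0, 2, 1, 1])
print(result)
```

In this model the packet with PSN 2 arrives while 1 is still expected, so it is counted as out of sequence, and the second copy of PSN 1 is counted as a duplicate.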

Requester resync

Number of resync operations when the local machine generates outbound traffic.

Responder resync

Number of resync operations when the local machine receives inbound traffic.

Requester Remote operation errors

Number of remote operation errors when the local machine generates outbound traffic, i.e. a NAK was received indicating that the other end encountered an error that prevented it from completing the request.

Requester transport retries exceeded errors

Number of transport retries exceeded errors when the local machine generates outbound traffic.

Requester RNR NAK retries exceeded errors

Number of RNR (Receiver Not Ready) NAK retries exceeded errors when the local machine generates outbound traffic.
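The relationship between the Requester RNR NAK counter and the Requester RNR NAK retries exceeded errors counter can be sketched with a toy model of a single send. This is a simplified simulation under stated assumptions, not the adapter's implementation: `send_with_rnr_retry`, `receiver_ready_after`, and `rnr_retry_limit` are hypothetical names, and the retry limit is an assumed configuration parameter.

```python
def send_with_rnr_retry(receiver_ready_after, rnr_retry_limit):
    """Simulate one send against a receiver that answers with an RNR NAK
    for the first `receiver_ready_after` attempts.

    Returns (rnr_naks_received, retries_exceeded).
    """
    rnr_naks = 0
    # One initial attempt plus up to `rnr_retry_limit` retries.
    for attempt in range(rnr_retry_limit + 1):
        if attempt >= receiver_ready_after:
            # The receiver has a receive WQE posted by now; the send succeeds.
            return rnr_naks, False
        # Each RNR NAK would increment "Requester RNR NAK".
        rnr_naks += 1
    # Every retry was used up; this would increment
    # "Requester RNR NAK retries exceeded errors" once.
    return rnr_naks, True


recovered = send_with_rnr_retry(receiver_ready_after=2, rnr_retry_limit=3)
gave_up = send_with_rnr_retry(receiver_ready_after=10, rnr_retry_limit=3)
print(recovered, gave_up)
```

Under this model, a rising RNR NAK count with a flat retries exceeded count suggests a receiver that is slow to post receive WQEs but eventually catches up, while growth in both suggests sends that are failing outright.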

Bad multicast received

Number of bad multicast packets received.

Discarded UD packets

Number of UD packets silently discarded on the receive queue due to a lack of receive descriptors.

Discarded UC packets

Number of UC packets silently discarded on the receive queue due to a lack of receive descriptors.

CQ overflows

Number of CQ overflows.

Note: This value is evaluated for the entire NIC, since a CQ might be associated with both ports. As a result, the value reported on all ports is identical.

EQ overflows

Number of EQ overflows.

Note: This value is evaluated for the entire NIC, since an EQ might be associated with both ports. As a result, the value reported on all ports is identical.
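One practical consequence of the CQ and EQ overflow notes is that these two counters should not be summed across ports when building an adapter-level total. The following is a minimal sketch of that aggregation rule; the sample dictionaries are made-up values, `aggregate_ports` is a hypothetical helper, and how the counters are actually read from the system is outside the scope of the example.

```python
# Counters that the notes above describe as evaluated for the entire NIC.
NIC_WIDE = {"CQ overflows", "EQ overflows"}


def aggregate_ports(per_port_samples):
    """Combine per-port counter dictionaries into one adapter-level view.

    Per-port counters are summed. NIC-wide counters report the same value
    on every port, so summing them would double count; the value is taken
    once instead (max tolerates small skew between port reads).
    """
    totals = {}
    for sample in per_port_samples:
        for name, value in sample.items():
            if name in NIC_WIDE:
                totals[name] = max(totals.get(name, 0), value)
            else:
                totals[name] = totals.get(name, 0) + value
    return totals


port1 = {"CQ overflows": 3, "EQ overflows": 0, "Link down events": 1}
port2 = {"CQ overflows": 3, "EQ overflows": 0, "Link down events": 2}
totals = aggregate_ports([port1, port2])
print(totals)
```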

Bad doorbells

Number of bad doorbells.

Responder duplicate request received (pending firmware implementation)

Number of duplicate requests received when the local machine receives inbound traffic.

Requester time out received (pending firmware implementation)

Number of timeouts received when the local machine generates outbound traffic.

Device detected stalled state

Number of times the device has entered the stalled state (per port).

Packet detected as stalled

Number of events where the device was stalled for longer than the watermark.

Link down events

Number of times that the link operative state changes to down.

TX Copied Packets

Number of packets copied internally after either exceeding the size of the fragment list, or not reaching the minimum packet size.

© Copyright 2023, NVIDIA. Last updated on May 23, 2023.