Proprietary Mellanox Adapter Diagnostics Counters
The proprietary Mellanox adapter diagnostics counter set consists of NIC diagnostics counters. These counters collect information from ConnectX®-3 and ConnectX®-3 Pro firmware flows.
Mellanox Adapter Diagnostics Counter |
Description |
Requester length errors |
Number of local length errors when the local machine generates outbound traffic. |
Responder length errors |
Number of local length errors when the local machine receives inbound traffic. |
Requester QP operation errors |
Number of local QP operation errors when the local machine generates outbound traffic. |
Responder QP operation errors |
Number of local QP operation errors when the local machine receives inbound traffic. |
Requester protection errors |
Number of local protection errors when the local machine generates outbound traffic. |
Responder protection errors |
Number of local protection errors when the local machine receives inbound traffic. |
Requester CQE errors |
Number of local CQEs with errors when the local machine generates outbound traffic. |
Responder CQE errors |
Number of local CQEs with errors when the local machine receives inbound traffic. Note: RDMA receivers need to post receive WQEs to handle incoming messages. If the application does not know how many messages to expect (e.g. because it does not maintain high-level message credits), it may post more receive WQEs than will actually be used. On application teardown, if the application did not use up all of its receive WQEs, the device issues a completion with error for each of these WQEs to indicate that the hardware does not plan to use them. These completions carry a clear syndrome indication of “Flushed with error”. |
Requester Invalid request errors |
Number of remote invalid request errors when the local machine generates outbound traffic, i.e. a NAK was received indicating that the other end detected an invalid OpCode request. |
Responder Invalid request errors |
Number of remote invalid request errors when the local machine receives inbound traffic. |
Requester Remote access errors |
Number of remote access errors when the local machine generates outbound traffic, i.e. a NAK was received indicating that the other end detected a wrong rkey. |
Responder Remote access errors |
Number of remote access errors when the local machine receives inbound traffic, i.e. the local machine received an RDMA request with a wrong rkey. |
Requester RNR NAK |
Number of RNR (Receiver Not Ready) NAKs received when the local machine generates outbound traffic. |
Responder RNR NAK |
Number of RNR (Receiver Not Ready) NAKs sent when the local machine receives inbound traffic. |
Requester out of order sequence NAK |
Number of Out of Sequence (OOS) NAKs received when the local machine generates outbound traffic, i.e. the number of times the local machine received NAKs indicating OOS on the receiving side. |
Responder out of order sequence received |
Number of Out of Sequence packets received when the local machine receives inbound traffic, i.e. the number of times the local machine received messages that are not consecutive. |
Requester resync |
Number of resync operations when the local machine generates outbound traffic. |
Responder resync |
Number of resync operations when the local machine receives inbound traffic. |
Requester Remote operation errors |
Number of remote operation errors when the local machine generates outbound traffic, i.e. a NAK was received indicating that the other end encountered an error that prevented it from completing the request. |
Requester transport retries exceeded errors |
Number of transport retries exceeded errors when the local machine generates outbound traffic. |
Requester RNR NAK retries exceeded errors |
Number of RNR (Receiver Not Ready) NAK retries exceeded errors when the local machine generates outbound traffic. |
Bad multicast received |
Number of bad multicast packets received. |
Discarded UD packets |
Number of UD packets silently discarded on the receive queue due to a lack of receive descriptors. |
Discarded UC packets |
Number of UC packets silently discarded on the receive queue due to a lack of receive descriptors. |
CQ overflows |
Number of CQ overflows. Note: This value is evaluated for the entire NIC because a CQ might be associated with both ports. As a result, the value reported on all ports is identical. |
EQ overflows |
Number of EQ overflows. Note: This value is evaluated for the entire NIC because an EQ might be associated with both ports. As a result, the value reported on all ports is identical. |
Bad doorbells |
Number of bad doorbells. |
Responder duplicate request received (pending firmware implementation) |
Number of duplicate requests received when the local machine receives inbound traffic. |
Requester time out received (pending firmware implementation) |
Number of timeouts received when the local machine generates outbound traffic. |
Device detected stalled state |
Number of times the device has entered the stalled state (per port). |
Packet detected as stalled |
Number of events where the device was stalled for longer than the watermark. |
Link down events |
Number of times the link operational state changed to down. |
TX Copied Packets |
Number of packets copied internally after either exceeding the size of the fragment list, or not reaching the minimum packet size. |
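
When monitoring these counters, it is usually the change between two readings that matters rather than the absolute value. The following is a minimal Python sketch of that idea, not a vendor tool. It assumes the counter values have already been read into per-port dictionaries by some other means, that counters only increase between the two readings, and that "CQ overflows" and "EQ overflows" are the only NIC-wide counters (as stated in the notes above). The names `Snapshot`, `NIC_WIDE`, `counter_deltas`, and `nic_totals` are hypothetical and exist only for this illustration.

```python
from typing import Dict

# Port name -> counter name -> counter value (hypothetical layout).
Snapshot = Dict[str, Dict[str, int]]

# Counters evaluated for the entire NIC: every port reports the same value,
# so they must not be summed across ports.
NIC_WIDE = {"CQ overflows", "EQ overflows"}


def counter_deltas(before: Snapshot, after: Snapshot) -> Snapshot:
    """Return the per-port increase of each counter between two snapshots.

    Assumes counters did not reset or wrap between the two readings.
    """
    deltas: Snapshot = {}
    for port, counters in after.items():
        previous = before.get(port, {})
        deltas[port] = {
            name: value - previous.get(name, 0)
            for name, value in counters.items()
        }
    return deltas


def nic_totals(deltas: Snapshot) -> Dict[str, int]:
    """Combine per-port deltas into one value per counter for the whole NIC.

    Per-port counters are summed. NIC-wide counters are taken once
    (using max) because each port repeats the same value.
    """
    totals: Dict[str, int] = {}
    for counters in deltas.values():
        for name, delta in counters.items():
            if name in NIC_WIDE:
                totals[name] = max(totals.get(name, 0), delta)
            else:
                totals[name] = totals.get(name, 0) + delta
    return totals


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    before = {
        "port1": {"Requester RNR NAK": 10, "CQ overflows": 2, "Link down events": 1},
        "port2": {"Requester RNR NAK": 4, "CQ overflows": 2, "Link down events": 0},
    }
    after = {
        "port1": {"Requester RNR NAK": 15, "CQ overflows": 3, "Link down events": 1},
        "port2": {"Requester RNR NAK": 4, "CQ overflows": 3, "Link down events": 2},
    }
    for name, value in sorted(nic_totals(counter_deltas(before, after)).items()):
        print(f"{name}: +{value}")
```

Taking NIC-wide counters once instead of summing them avoids double-counting a single CQ or EQ overflow that is reported identically on both ports.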