Events

This chapter describes the events that can occur when using the VPI API.
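
The asynchronous events in the first part of this chapter are reported through the device context and retrieved with ibv_get_async_event(). As a point of reference, the following is a minimal sketch of the retrieval loop, assuming ctx is a context opened with ibv_open_device():

    #include <stdio.h>
    #include <infiniband/verbs.h>

    /* Sketch: drain asynchronous events from an open device context. */
    static void poll_async_events(struct ibv_context *ctx)
    {
        struct ibv_async_event event;

        for (;;) {
            /* Blocks until the next asynchronous event is available. */
            if (ibv_get_async_event(ctx, &event))
                break;

            fprintf(stderr, "Async event: %s\n",
                    ibv_event_type_str(event.event_type));

            /* Each event must be acknowledged; destroying the resource it
             * refers to (QP, CQ, SRQ, ...) blocks until this is done. */
            ibv_ack_async_event(&event);
        }
    }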

IBV_EVENT_CQ_ERR

This event is triggered when a Completion Queue (CQ) overrun occurs or, in rare cases, due to a protection error. When this happens, there is no guarantee that completions can be pulled from the CQ. All of the QPs associated with this CQ, through either their Receive or Send Queue, will also get the IBV_EVENT_QP_FATAL event. When this event occurs, the best course of action is for the user to destroy and recreate the resources.

IBV_EVENT_QP_FATAL

This event is generated when an error occurs on a Queue Pair (QP) which prevents the generation of completions while accessing or processing the Work Queue on either the Send or Receive Queues. The user must modify the QP state to Reset for recovery. It is the responsibility of the software to ensure that all error processing is completed prior to calling the modify QP verb to change the QP state to Reset.
If the problem that caused this event is in the CQ of that Work Queue, the appropriate CQ will also receive the IBV_EVENT_CQ_ERR event. In the event of a CQ error, it is best to destroy and recreate the resources.
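
A minimal sketch of the Reset transition mentioned above, assuming qp is the QP reported in the event; re-initializing the QP (Reset to Init to RTR to RTS) before reuse is omitted:

    #include <string.h>
    #include <infiniband/verbs.h>

    /* Sketch: return a QP that hit a fatal error to the Reset state. */
    static int reset_qp(struct ibv_qp *qp)
    {
        struct ibv_qp_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.qp_state = IBV_QPS_RESET;

        /* Only the state is changed here; the QP must be brought back
         * through Init/RTR/RTS before it can be used again. */
        return ibv_modify_qp(qp, &attr, IBV_QP_STATE);
    }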

IBV_EVENT_QP_REQ_ERR

This event is generated when the transport layer of the RDMA device detects a transport error violation on the responder side. The error may be caused by the use of an unsupported or reserved opcode, or the use of an out of sequence opcode.
These errors are rare but may occur when there are problems in the subnet or when an RDMA device sends illegal packets.
When this happens, the QP is automatically transitioned to the IBV_QPS_ERR state by the RDMA device. The user must modify the state of any such QP from the Error state to the Reset state for recovery.
This event applies only to RC QPs.

IBV_EVENT_QP_ACCESS_ERR

This event is generated when the transport layer of the RDMA device detects a request error violation on the responder side. The error may be caused by:

  • Misaligned atomic request

  • Too many RDMA Read or Atomic requests

  • R_Key violation

  • Length errors without immediate data

These errors usually occur because of bugs in the user code.
When this happens, the QP is automatically transitioned to the IBV_QPS_ERR state by the RDMA device. The user must modify the QP state to Reset for recovery.
This event is relevant only to RC QPs.

IBV_EVENT_COMM_EST

This event is generated when communication is established on a given QP. This event implies that a QP whose state is IBV_QPS_RTR has received the first packet in its Receive Queue and the packet was processed without error.
This event is relevant only to connection oriented QPs (RC and UC QPs). It may be generated for UD QPs as well but that is driver implementation specific.

IBV_EVENT_SQ_DRAINED

This event is generated when all outstanding messages have been drained from the Send Queue (SQ) of a QP whose state has changed from IBV_QPS_RTS to IBV_QPS_SQD. For RC QPs, this means that all messages have received acknowledgements, as appropriate.
Generally, this event will be generated when the internal QP state changes from SQD.draining to SQD.drained. The event may also be generated if the transition to the state IBV_QPS_SQD is aborted because of a transition, either by the RDMA device or by the user, into the IBV_QPS_SQE, IBV_QPS_ERR or IBV_QPS_RESET QP states.
After this event, and after ensuring that the QP is in the IBV_QPS_SQD state, it is safe for the user to start modifying the Send Queue attributes, since there are no longer any send messages in progress. Thus it is now safe to modify the operational characteristics of the QP and transition it back to the fully operational RTS state.
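
A sketch of the drain/resume sequence, assuming qp is an RC QP currently in the RTS state; attribute changes between the two calls are left to the application:

    #include <string.h>
    #include <infiniband/verbs.h>

    /* Sketch: ask the device to drain the Send Queue (RTS -> SQD). */
    static int drain_send_queue(struct ibv_qp *qp)
    {
        struct ibv_qp_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.qp_state = IBV_QPS_SQD;
        return ibv_modify_qp(qp, &attr, IBV_QP_STATE);
    }

    /* Sketch: call only after IBV_EVENT_SQ_DRAINED has been received for
     * this QP and any desired attribute changes have been applied. */
    static int resume_send_queue(struct ibv_qp *qp)
    {
        struct ibv_qp_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.qp_state = IBV_QPS_RTS;
        return ibv_modify_qp(qp, &attr, IBV_QP_STATE);
    }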

IBV_EVENT_PATH_MIG

This event is generated when a connection successfully migrates to an alternate path. The event is relevant only for connection oriented QPs, that is, it is relevant only for RC and UC QPs.
When this event is generated, it means that the alternate path attributes are now in use as the primary path attributes. If it is necessary to load attributes for another alternate path, the user may do that after this event is generated.
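
For illustration, a sketch of loading a new alternate path and re-arming automatic migration; qp and the alternate path parameters shown are assumptions and would normally be obtained from the connection manager or the SM:

    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    /* Sketch: after IBV_EVENT_PATH_MIG, load a new alternate path. */
    static int load_alternate_path(struct ibv_qp *qp, uint16_t alt_dlid,
                                   uint8_t alt_port, uint8_t alt_timeout)
    {
        struct ibv_qp_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.alt_ah_attr.dlid = alt_dlid;
        attr.alt_ah_attr.port_num = alt_port;
        attr.alt_port_num = alt_port;
        attr.alt_timeout = alt_timeout;
        attr.path_mig_state = IBV_MIG_REARM;  /* re-arm automatic migration */

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_ALT_PATH | IBV_QP_PATH_MIG_STATE);
    }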

IBV_EVENT_PATH_MIG_ERR

This event is generated when an error occurs which prevents a QP that has alternate path attributes loaded from completing a path migration change. The path migration may have been attempted automatically by the RDMA device or explicitly by the user.
This error usually occurs if the alternate path attributes are not consistent on the two ends of the connection. It could be, for example, that the DLID is not set correctly or that the source port is invalid. The event may also occur if a cable to the alternate port is unplugged.

IBV_EVENT_DEVICE_FATAL

This event is generated when a catastrophic error is encountered on the channel adapter. The port, and possibly the entire channel adapter, becomes unusable.
When this event occurs, the behavior of the RDMA device is undetermined and it is highly recommended to close the process immediately. Trying to destroy the RDMA resources may fail and thus the device may be left in an unstable state.

IBV_EVENT_PORT_ACTIVE

This event is generated when the link on a given port transitions to the active state. The link is now available to send and receive packets.
This event means that port_attr.state has transitioned from one of the following states:

  • IBV_PORT_DOWN

  • IBV_PORT_INIT

  • IBV_PORT_ARMED

to either

  • IBV_PORT_ACTIVE

  • IBV_PORT_ACTIVE_DEFER

This might happen for example when the SM configures the port.
The event is generated by the device only if the IBV_DEVICE_PORT_ACTIVE_EVENT attribute is set in the dev_cap.device_cap_flags.
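
A sketch of this capability check, assuming ctx is an open device context:

    #include <stdio.h>
    #include <infiniband/verbs.h>

    /* Sketch: test whether the device generates IBV_EVENT_PORT_ACTIVE. */
    static void check_port_active_event_support(struct ibv_context *ctx)
    {
        struct ibv_device_attr dev_cap;

        if (ibv_query_device(ctx, &dev_cap))
            return;

        if (dev_cap.device_cap_flags & IBV_DEVICE_PORT_ACTIVE_EVENT)
            printf("IBV_EVENT_PORT_ACTIVE is supported\n");
    }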

IBV_EVENT_PORT_ERR

This event is generated when the link on a given port becomes inactive and is thus unavailable to send/receive packets.
The port_attr.state must have been in either the IBV_PORT_ACTIVE or IBV_PORT_ACTIVE_DEFER state and transitions to one of the following states:

  • IBV_PORT_DOWN

  • IBV_PORT_INIT

  • IBV_PORT_ARMED

This can happen when there are connectivity problems within the IB fabric, for example when a cable is accidentally pulled.
This will not affect the QPs associated with this port, although if this is a reliable connection, the retry count may be exceeded if the link takes a long time to come back up.

IBV_EVENT_LID_CHANGE

The event is generated when the LID on a given port changes. This is done by the SM. If this is not the first time that the SM configures the port LID, it may indicate that there is a new SM on the subnet or that the SM has reconfigured the subnet. If the user cached the structure returned from ibv_query_port(), then these values must be flushed when this event occurs.
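
A minimal sketch of refreshing such a cache, assuming ctx, port_num and cached_attr are the context, port number and cached attribute structure kept by the application:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Sketch: re-read the port attributes after IBV_EVENT_LID_CHANGE;
     * cached_attr->lid then holds the LID newly assigned by the SM. */
    static int refresh_port_cache(struct ibv_context *ctx, uint8_t port_num,
                                  struct ibv_port_attr *cached_attr)
    {
        return ibv_query_port(ctx, port_num, cached_attr);
    }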

IBV_EVENT_PKEY_CHANGE

This event is generated when the P_Key table changes on a given port. The PKEY table is configured by the SM and this also means that the SM can change it. When that happens, an IBV_EVENT_PKEY_CHANGE event is generated.
Since QPs reference the P_Key table by index rather than by absolute value, it is suggested that clients check that the P_Key indexes used by their QPs have not changed as a result of this event.
If a user caches the values of the P_Key table, then these must be flushed when the IBV_EVENT_PKEY_CHANGE event is received.

IBV_EVENT_SM_CHANGE

This event is generated when the SM being used at a given port changes. The user application must re-register with the new SM. This means that all subscriptions previously registered from the given port, such as one to join a multicast group, must be re-registered.

IBV_EVENT_SRQ_ERR

This event is generated when an error occurs on a Shared Receive Queue (SRQ) which prevents the RDMA device from dequeuing WRs from the SRQ and from reporting receive completions.
When an SRQ experiences this error, all the QPs associated with this SRQ will be transitioned to the IBV_QPS_ERR state and the IBV_EVENT_QP_FATAL asynchronous event will be generated for them. Any QPs which have transitioned to the error state must have their state modified to Reset for recovery.

IBV_EVENT_SRQ_LIMIT_REACHED

This event is generated when the limit for the SRQ resources is reached. This means that the number of Work Requests (WRs) posted to the SRQ has dropped below the SRQ limit. The user may use this event as an indication that more WRs need to be posted to the SRQ and that the SRQ limit should be re-armed.
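
A sketch of re-arming the limit after additional WRs have been posted, assuming srq is the affected SRQ and new_limit is a threshold chosen by the application:

    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    /* Sketch: re-arm the SRQ limit after IBV_EVENT_SRQ_LIMIT_REACHED. */
    static int rearm_srq_limit(struct ibv_srq *srq, uint32_t new_limit)
    {
        struct ibv_srq_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.srq_limit = new_limit;

        /* The limit event fires again once the number of posted WRs
         * drops below new_limit. */
        return ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);
    }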

IBV_EVENT_QP_LAST_WQE_REACHED

This event is generated when a QP which is associated with an SRQ is transitioned into the IBV_QPS_ERR state either automatically by the RDMA device or explicitly by the user. This may have happened either because a completion with error was generated for the last WQE, or the QP transitioned into the IBV_QPS_ERR state and there are no more WQEs on the Receive Queue of the QP.
This event actually means that no more WQEs will be consumed from the SRQ by this QP.
If an error occurs on a QP and this event is not generated, the user must destroy all of the QPs associated with this SRQ, as well as the SRQ itself, in order to reclaim all of the WQEs associated with the offending QP. At a minimum, the QP which is in the error state must have its state changed to Reset for recovery.

IBV_EVENT_CLIENT_REREGISTER

This event is generated when the SM sends a request to a given port for client re-registration for all subscriptions previously requested for the port. This could happen if the SM suffers a failure and as a result, loses its own records of the subscriptions. It may also happen if a new SM becomes operational on the subnet.
The event will be generated by the device only if the bit indicating that client re-registration is supported is set in port_attr.port_cap_flags.

IBV_EVENT_GID_CHANGE

This event is generated when a GID changes on a given port. The GID table is configured by the SM and this also means that the SM can change it. When that happens, an IBV_EVENT_GID_CHANGE event is generated. If a user caches the values of the GID table, then these must be flushed when the IBV_EVENT_GID_CHANGE event is received.
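
A minimal sketch of refreshing one cached GID entry, assuming ctx, port_num and gid_index identify the entry:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Sketch: re-read one GID table entry after IBV_EVENT_GID_CHANGE. */
    static int refresh_gid(struct ibv_context *ctx, uint8_t port_num,
                           int gid_index, union ibv_gid *gid)
    {
        return ibv_query_gid(ctx, port_num, gid_index, gid);
    }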

IBV_WC_SUCCESS

The Work Request completed successfully.
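
The status codes in this section are reported in the status field of struct ibv_wc. As a point of reference, a minimal sketch of polling a CQ and checking that field, assuming cq is an existing completion queue:

    #include <stdio.h>
    #include <infiniband/verbs.h>

    /* Sketch: poll completions and report any non-successful status. */
    static void drain_completions(struct ibv_cq *cq)
    {
        struct ibv_wc wc;

        while (ibv_poll_cq(cq, 1, &wc) > 0) {
            if (wc.status != IBV_WC_SUCCESS)
                fprintf(stderr, "WR %llu failed: %s\n",
                        (unsigned long long)wc.wr_id,
                        ibv_wc_status_str(wc.status));
        }
    }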

IBV_WC_LOC_LEN_ERR

This event is generated when the receive buffer is smaller than the incoming send. It is generated on the receiver side of the connection.

IBV_WC_LOC_QP_OP_ERR

This event is generated when a QP error occurs. For example, it may be generated if a user neglects to specify responder_resources and initiator_depth values in struct rdma_conn_param before calling rdma_connect() on the client side and rdma_accept() on the server side.

IBV_WC_LOC_EEC_OP_ERR

This event is generated when there is an error related to the local EEC's receive logic while executing the request packet. The responder is unable to complete the request. This error is not caused by the sender.

IBV_WC_LOC_PROT_ERR

This event is generated when a user attempts to access an address outside of the registered memory region. For example, this may happen if the Lkey does not match the address in the WR.

IBV_WC_WR_FLUSH_ERR

This event is generated when a Work Request is flushed because it was outstanding or in process when the QP transitioned into the error state. No data was transmitted or received for the flushed Work Request.

IBV_WC_MW_BIND_ERR

This event is generated when a memory management operation error occurs. The error may be due to the fact that the memory window and the QP belong to different protection domains. It may also be that the memory window is not allowed to be bound to the specified MR or the access permissions may be wrong.

IBV_WC_BAD_RESP_ERR

This event is generated when an unexpected transport layer opcode is returned by the responder.

IBV_WC_LOC_ACCESS_ERR

This event is generated when a local protection error occurs on a local data buffer during the process of an RDMA Write with Immediate Data operation sent from the remote node.

IBV_WC_REM_INV_REQ_ERR

This event is generated when the receive buffer is smaller than the incoming send. It is generated on the sender side of the connection. It may also be generated if the QP attributes are not set correctly, particularly those governing MR access.

IBV_WC_REM_ACCESS_ERR

This event is generated when a protection error occurs on a remote data buffer to be read by an RDMA Read, written by an RDMA Write or accessed by an atomic operation. The error is reported only on RDMA operations or atomic operations.

IBV_WC_REM_OP_ERR

This event is generated when an operation cannot be completed successfully by the responder. The failure to complete the operation may be due to QP related errors which prevent the responder from completing the request or a malformed WQE on the Receive Queue.

IBV_WC_RETRY_EXC_ERR

This event is generated when a sender is unable to receive feedback from the receiver. This means that either the receiver just never ACKs sender messages in a specified time period, or it has been disconnected or it is in a bad state which prevents it from responding.

IBV_WC_RNR_RETRY_EXC_ERR

This event is generated when the RNR NAK retry count is exceeded. This may be caused by lack of receive buffers on the responder side.

IBV_WC_LOC_RDD_VIOL_ERR

This event is generated when the RDD associated with the QP does not match the RDD associated with the EEC.

IBV_WC_REM_INV_RD_REQ_ERR

This event is generated when the responder detects an invalid incoming RD message. The message may be invalid because it has an invalid Q_Key or because there is a Reliable Datagram Domain (RDD) violation.

IBV_WC_REM_ABORT_ERR

This event is generated when an error occurs on the responder side which causes it to abort the operation.

IBV_WC_INV_EECN_ERR

This event is generated when an invalid End to End Context Number (EECN) is detected.

IBV_WC_INV_EEC_STATE_ERR

This event is generated when an illegal operation is detected in a request for the specified EEC state.

IBV_WC_FATAL_ERR

This event is generated when a fatal transport error occurs. The user may have to restart the RDMA device driver or reboot the server to recover from the error.

IBV_WC_RESP_TIMEOUT_ERR

This event is generated when the responder is unable to respond to a request within the timeout period. It generally indicates that the receiver is not ready to process requests.

IBV_WC_GENERAL_ERR

This event is generated when there is a transport error which cannot be described by the other specific events discussed here.

RDMA_CM_EVENT_ADDR_RESOLVED

This event is generated on the client (active) side in response to rdma_resolve_addr(). It is generated when the system is able to resolve the server address supplied by the client.
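
The RDMA CM events in this section are retrieved from an event channel. As a point of reference, a minimal sketch of the retrieval loop, assuming ec was created with rdma_create_event_channel():

    #include <stdio.h>
    #include <rdma/rdma_cma.h>

    /* Sketch: wait for one CM event and check it against the expected type. */
    static int wait_for_cm_event(struct rdma_event_channel *ec,
                                 enum rdma_cm_event_type expected)
    {
        struct rdma_cm_event *event;
        int ok;

        if (rdma_get_cm_event(ec, &event))
            return -1;

        printf("CM event: %s, status %d\n",
               rdma_event_str(event->event), event->status);
        ok = (event->event == expected);

        /* Every retrieved event must be acknowledged. */
        rdma_ack_cm_event(event);
        return ok ? 0 : -1;
    }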

RDMA_CM_EVENT_ADDR_ERROR

This event is generated on the client (active) side. It is generated in response to rdma_resolve_addr() in the case where an error occurs. This may happen, for example, if the device cannot be found, such as when a user supplies an incorrect device name. Specifically, if the remote device has both Ethernet and IB interfaces, and the client side supplies the Ethernet device name instead of the IB device name of the server side, an RDMA_CM_EVENT_ADDR_ERROR will be generated.

RDMA_CM_EVENT_ROUTE_RESOLVED

This event is generated on the client (active) side in response to rdma_resolve_route(). It is generated when the system is able to resolve a route to the server address supplied by the client.

RDMA_CM_EVENT_ROUTE_ERROR

This event is generated when rdma_resolve_route() fails.

RDMA_CM_EVENT_CONNECT_REQUEST

This is generated on the passive side of the connection to notify the user of a new connection request. It indicates that a connection request has been received.
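
A sketch of handling this event on the passive side, assuming event is the connect request event just retrieved; creating the PD, CQ and QP for the new rdma_cm_id (for example with rdma_create_qp()) is omitted:

    #include <string.h>
    #include <rdma/rdma_cma.h>

    /* Sketch: accept an incoming connection request. */
    static int accept_connection(struct rdma_cm_event *event)
    {
        /* The new rdma_cm_id for this connection is carried in the event. */
        struct rdma_cm_id *id = event->id;
        struct rdma_conn_param param;

        memset(&param, 0, sizeof(param));
        /* A real application would set responder_resources, initiator_depth
         * and any private data here before accepting. */
        return rdma_accept(id, &param);
    }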

RDMA_CM_EVENT_CONNECT_RESPONSE

This event may be generated on the active side of the connection to notify the user that the connection request has been successful. The event is only generated on rdma_cm_ids which do not have a QP associated with them.

RDMA_CM_EVENT_CONNECT_ERROR

This event may be generated on the active or passive side of the connection. It is generated when an error occurs while attempting to establish a connection.

RDMA_CM_EVENT_UNREACHABLE

This event is generated on the active side of a connection. It indicates that the (remote) server is unreachable or unable to respond to a connection request.

RDMA_CM_EVENT_REJECTED

This event may be generated on the client (active) side and indicates that a connection request or response has been rejected by the remote device. This may happen for example if an attempt is made to connect with the remote end point on the wrong port.

RDMA_CM_EVENT_ESTABLISHED

This event is generated on both sides of a connection. It indicates that a connection has been established with the remote end point.

RDMA_CM_EVENT_DISCONNECTED

This event is generated on both sides of the connection in response to rdma_disconnect(). The event will be generated to indicate that the connection between the local and remote devices has been disconnected. Any associated QP will transition to the error state. All posted work requests are flushed. The user must change any such QP's state to Reset for recovery.
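
A sketch of the cleanup that typically follows this event, assuming id is the rdma_cm_id reported in the event and that its QP was created with rdma_create_qp():

    #include <rdma/rdma_cma.h>

    /* Sketch: release the connection's resources after a disconnect. */
    static void teardown_connection(struct rdma_cm_id *id)
    {
        rdma_disconnect(id);  /* harmless if the remote side disconnected first */
        rdma_destroy_qp(id);  /* releases the QP created with rdma_create_qp() */
        rdma_destroy_id(id);
    }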

RDMA_CM_EVENT_DEVICE_REMOVAL

This event is generated when the RDMA CM indicates that the device associated with the rdma_cm_id has been removed. Upon receipt of this event, the user must destroy the related rdma_cm_id.

RDMA_CM_EVENT_MULTICAST_JOIN

This event is generated in response to rdma_join_multicast(). It indicates that the multicast join operation has completed successfully.

RDMA_CM_EVENT_MULTICAST_ERROR

This event is generated when an error occurs while attempting to join a multicast group or on an existing multicast group if the group had already been joined. When this happens, the multicast group will no longer be accessible and must be rejoined if necessary.

RDMA_CM_EVENT_ADDR_CHANGE

This event is generated when the network device associated with this ID through address resolution changes its hardware address. For example, this may happen following bonding fail over. This event may serve to aid applications which want the links used for their RDMA sessions to align with the network stack.

RDMA_CM_EVENT_TIMEWAIT_EXIT

This event is generated when the QP associated with the connection has exited its timewait state and is now ready to be re-used. After a QP has been disconnected, it is maintained in a timewait state to allow any in flight packets to exit the network. After the timewait state has completed, the rdma_cm will report this event.
