Advanced Transport
Atomic Operations in mlx5 Driver
Atomic Operations in Connect-IB® are fully supported on big-endian machines (e.g. PPC). Their support is limited on little-endian machines (e.g. x86)
When using ibv_exp_query_device on little-endian machines with Connect-IB® the attr.exp_atomic_cap is set to IBV_EXP_ATOMIC_HCA_REPLY_BE which indicates that if enabled, the atomic operation replied value is big-endian and contradicts the host endianness.
To enable atomic operation with this endianness contradiction use the ibv_exp_create_qp to create the QP and set the IBV_EXP_QP_CREATE_ATOMIC_BE_REPLY flag on exp_create_flags.
Enhanced Atomic Operations
ConnectX® implements a set of Extended Atomic Operations beyond those defined by the IB spec. Atomicity guarantees, Atomic Ack generation, ordering rules and error behavior for this set of extended Atomic operations is the same as that for IB standard Atomic operations (as defined in section 9.4.5 of the IB spec).
Masked Compare and Swap (MskCmpSwap)
The MskCmpSwap atomic operation is an extension to the CmpSwap operation defined in the IB spec. MskCmpSwap allows the user to select a portion of the 64 bit target data for the "compare" check, as well as to restrict the swap to a (possibly different) portion. The pseudo-code below describes the operation:
The MFetchAdd Atomic operation extends the functionality of the standard IB FetchAdd by allowing the user to split the target into multiple fields of selectable length. The atomic add is done independently on each one of this fields. A bit set in the field_boundary parameter specifies the field boundaries. The pseudo-code below describes the operation:
The additional operands are carried in the Extended Transport Header. Atomic response generation and packet format for MskCmpSwap is as for standard IB Atomic operations.
| atomic_response = *va
| if
(!((compare_add ^ *va) & compare_add_mask)) then
| *va = (*va & ~(swap_mask)) | (swap & swap_mask)
|
| return
atomic_response
Masked Fetch and Add (MFetchAdd)
The MFetchAdd Atomic operation extends the functionality of the standard IB FetchAdd by allowing the user to split the target into multiple fields of selectable length. The atomic add is done independently on each one of this fields. A bit set in the field_boundary parameter specifies the field boundaries. The pseudocode below describes the operation:
XRC - eXtended Reliable Connected Transport Service for InfiniBand
| bit_adder(ci, b1, b2, *co)
| {
| value = ci + b1 + b2
| *co = !!(value & 2
)
|
| return
value & 1
| }
|
| #define MASK_IS_SET(mask, attr) (!!((mask)&(attr)))
| bit_position = 1
| carry = 0
| atomic_response = 0
|
| for
i = 0
to 63
| {
| if
( i != 0
)
| bit_position = bit_position << 1
|
| bit_add_res = bit_adder(carry, MASK_IS_SET(*va, bit_position),
| MASK_IS_SET(compare_add, bit_position), &new_carry)
| if
(bit_add_res)
| atomic_response |= bit_position
|
| carry = ((new_carry) && (!MASK_IS_SET(compare_add_mask, bit_position)))
| }
|
| return
atomic_response
XRC allows significant savings in the number of QPs and the associated memory resources required to establish all to all process connectivity in large clusters.
It significantly improves the scalability of the solution for large clusters of multicore end-nodes by reducing the required resources.
For further details, please refer to the "Annex A14 Supplement to InfiniBand Architecture Specification Volume 1.2.1"
A new API can be used by user space applications to work with the XRC transport. The legacy API is currently supported in both binary and source modes, however it is deprecated. Thus we recommend using the new API.
The new verbs to be used are:
ibv_open_xrcd/ibv_close_xrcd
ibv_create_srq_ex
ibv_get_srq_num
ibv_create_qp_ex
ibv_open_qp
Please use ibv_xsrq_pingpong for basic tests and code reference. For detailed information regarding the various options for these verbs, please refer to their appropriate man pages.
Supported in ConnectX®-4, ConnectX®-4 Lx and Connect-IB adapter cards only.
Dynamically Connected transport (DCT) service is an extension to transport services to enable a higher degree of scalability while maintaining high performance for sparse traffic. Utilization of DCT reduces the total number of QPs required system wide by having Reliable type QPs dynamically connect and disconnect from any remote node. DCT connections only stay connected while they are active. This results in smaller memory footprint, less overhead to set connections and higher on-chip cache utilization and hence increased performance. DCT is supported only in mlx5 driver.
Supported in ConnectX®-5 adapter cards family only.
Tag Matching and Rendezvous Offloads is a technology employed by Mellanox to offload the processing of MPI messages from the host machine onto the network card. Employing this technology enables a zero copy of MPI messages, i.e. messages are scattered directly to the user's buffer without intermediate buffering and copies. It also provides a complete rendezvous progress by Mellanox devices. Such overlap capability enables the CPU to perform the application's computational tasks while the remote data is gathered by the adapter.
For more information Tag Matching Offload, please refer to the Community post "
Understanding
MPI
Tag
Matching
and
Rendezvous
Offloads
(ConnectX-5)
"
.