NVIDIA® MLNX_OFED Documentation Rev 5.1-0.6.6.0
Linux Kernel Upstream Release Notes v5.17

Advanced Transport

Atomic Operations in mlx5 Driver

To enable atomic operation with this endianness contradiction, use the ibv_create_qp to create the QP and set the IBV_QP_CREATE_ATOMIC_BE_REPLY flag on create_flags.

Enhanced Atomic Operations

ConnectX® implements a set of Extended Atomic Operations beyond those defined by the IB spec. Atomicity guarantees, Atomic Ack generation, ordering rules and error behavior for this set of extended Atomic operations is the same as that for IB standard Atomic operations (as defined in section 9.4.5 of the IB spec).

Masked Compare and Swap (MskCmpSwap)

The MskCmpSwap atomic operation is an extension to the CmpSwap operation defined in the IB spec. MskCmpSwap allows the user to select a portion of the 64 bit target data for the "compare" check, as well as to restrict the swap to a (possibly different) portion. The pseudo-code below describes the operation:
The MFetchAdd Atomic operation extends the functionality of the standard IB FetchAdd by allowing the user to split the target into multiple fields of selectable length. The atomic add is done independently on each one of this fields. A bit set in the field_boundary parameter specifies the field boundaries. The pseudo-code below describes the operation:

The additional operands are carried in the Extended Transport Header. Atomic response generation and packet format for MskCmpSwap is as for standard IB Atomic operations.

Copy
Copied!
            

| atomic_response = *va | if (!((compare_add ^ *va) & compare_add_mask)) then | *va = (*va & ~(swap_mask)) | (swap & swap_mask) | | return atomic_response


Masked Fetch and Add (MFetchAdd)

The MFetchAdd Atomic operation extends the functionality of the standard IB FetchAdd by allowing the user to split the target into multiple fields of selectable length. The atomic add is done independently on each one of this fields. A bit set in the field_boundary parameter specifies the field boundaries. The pseudocode below describes the operation:

Copy
Copied!
            

| bit_adder(ci, b1, b2, *co) | { | value = ci + b1 + b2 | *co = !!(value & 2) | | return value & 1 | } | | #define MASK_IS_SET(mask, attr) (!!((mask)&(attr))) | bit_position = 1 | carry = 0 | atomic_response = 0 | | for i = 0 to 63 | { | if ( i != 0 ) | bit_position = bit_position << 1 | | bit_add_res = bit_adder(carry, MASK_IS_SET(*va, bit_position), | MASK_IS_SET(compare_add, bit_position), &new_carry) | if (bit_add_res) | atomic_response |= bit_position | | carry = ((new_carry) && (!MASK_IS_SET(compare_add_mask, bit_position))) | } | | return atomic_response

XRC allows significant savings in the number of QPs and the associated memory resources required to establish all to all process connectivity in large clusters.
It significantly improves the scalability of the solution for large clusters of multicore end-nodes by reducing the required resources.
For further details, please refer to the "Annex A14 Supplement to InfiniBand Architecture Specification Volume 1.2.1"
A new API can be used by user space applications to work with the XRC transport. The legacy API is currently supported in both binary and source modes, however it is deprecated. Thus we recommend using the new API.
The new verbs to be used are:

  • ibv_open_xrcd/ibv_close_xrcd

  • ibv_create_srq_ex

  • ibv_get_srq_num

  • ibv_create_qp_ex

  • ibv_open_qp

Please use ibv_xsrq_pingpong for basic tests and code reference. For detailed information regarding the various options for these verbs, please refer to their appropriate man pages.

Dynamically Connected transport (DCT) service is an extension to transport services to enable a higher degree of scalability while maintaining high performance for sparse traffic. Utilization of DCT reduces the total number of QPs required system wide by having Reliable type QPs dynamically connect and disconnect from any remote node. DCT connections only stay connected while they are active. This results in smaller memory footprint, less overhead to set connections and higher on-chip cache utilization and hence increased performance. DCT is supported only in mlx5 driver.

Warning

Supported in ConnectX®-5 and above adapter cards.

Tag Matching and Rendezvous Offloads is a technology employed by Mellanox to offload the processing of MPI messages from the host machine onto the network card. Employing this technology enables a zero copy of MPI messages, i.e. messages are scattered directly to the user's buffer without intermediate buffering and copies. It also provides a complete rendezvous progress by Mellanox devices. Such overlap capability enables the CPU to perform the application's computational tasks while the remote data is gathered by the adapter.
For more information Tag Matching Offload, please refer to the Community post " Understanding MPI Tag Matching and Rendezvous Offloads (ConnectX-5) " .

© Copyright 2023, NVIDIA. Last updated on Oct 23, 2023.