DOCA Documentation v3.0.0

Advanced Transport

Atomic Operations

The mlx5 driver supports atomic operations; however, due to potential endianness mismatches between requester and responder systems, specific configuration is required to ensure correct behavior.

To enable atomic operations in environments where such an endianness mismatch may exist, the application must explicitly configure the Queue Pair (QP) at creation time.

Use the ibv_create_qp() function and set the following flag in the create_flags field:

IBV_QP_CREATE_ATOMIC_BE_REPLY

This flag ensures that the QP is configured to handle big-endian formatted atomic replies, allowing atomic operations to complete correctly across systems with differing byte orders.
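
As a hedged illustration, the sketch below shows one way this could look in C. It assumes the extended QP creation path (ibv_create_qp_ex()), which exposes the create_flags field through struct ibv_qp_init_attr_ex; the ctx, pd, and cq handles, the queue depths, and the availability of the IBV_QP_CREATE_ATOMIC_BE_REPLY definition in the installed verbs headers are assumptions of the example, not part of this documentation.

/* Minimal sketch: creating an RC QP whose atomic replies are treated as
 * big-endian.  Assumes ctx, pd, and cq have already been set up; the
 * create_flags field is reached via the extended creation attributes. */
#include <string.h>
#include <infiniband/verbs.h>

struct ibv_qp *create_atomic_be_qp(struct ibv_context *ctx,
                                   struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_qp_init_attr_ex attr;

    memset(&attr, 0, sizeof(attr));
    attr.qp_type          = IBV_QPT_RC;
    attr.send_cq          = cq;
    attr.recv_cq          = cq;
    attr.cap.max_send_wr  = 16;      /* illustrative queue depths */
    attr.cap.max_recv_wr  = 16;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;
    attr.pd               = pd;
    attr.comp_mask        = IBV_QP_INIT_ATTR_PD | IBV_QP_INIT_ATTR_CREATE_FLAGS;
    /* Request big-endian handling of atomic replies (flag name per this page). */
    attr.create_flags     = IBV_QP_CREATE_ATOMIC_BE_REPLY;

    return ibv_create_qp_ex(ctx, &attr);   /* NULL on failure */
}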

XRC - eXtended Reliable Connected Transport

XRC (eXtended Reliable Connected transport) is an InfiniBand transport type designed to improve scalability in large-scale deployments by reducing the number of required QPs and associated memory resources. It enables more efficient all-to-all process connectivity, particularly in environments with many multicore end-nodes.

By decoupling the send and receive sides of a QP, XRC significantly reduces memory usage and improves resource management, making it ideal for large clusters.

For detailed technical background, refer to "Annex A14 Supplement to InfiniBand Architecture Specification Volume 1, Release 1.2.1".

API Usage

A new set of verbs APIs has been introduced to support XRC in user-space applications. While the legacy XRC API is still supported in both binary and source form, it is deprecated. Developers are strongly encouraged to use the new API.

Key Verbs for XRC (a usage sketch follows the list):

  • ibv_open_xrcd() / ibv_close_xrcd() – Open and close XRC domain objects

  • ibv_create_srq_ex() – Create an SRQ with extended attributes

  • ibv_get_srq_num() – Retrieve SRQ number

  • ibv_create_qp_ex() – Create a QP with extended attributes (including XRC target/source)

  • ibv_open_qp() – Open an existing shareable QP (e.g., an XRC target QP created by another process in the same XRC domain)
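
The sketch below, loosely modeled on the flow of ibv_xsrq_pingpong, illustrates how these verbs could fit together on the receive side. The ctx, pd, and cq handles, the anonymous XRC domain (fd = -1), and the queue sizes are assumptions of the example; error handling is omitted for brevity.

/* Sketch: receive-side XRC setup -- open an XRC domain, create an XRC SRQ,
 * and retrieve its number so a remote sender can target it.
 * Assumes ctx, pd, and cq already exist; error handling is omitted. */
#include <fcntl.h>
#include <string.h>
#include <infiniband/verbs.h>

void xrc_recv_side_setup(struct ibv_context *ctx, struct ibv_pd *pd,
                         struct ibv_cq *cq)
{
    /* 1. Open (or create) the XRC domain shared by cooperating processes. */
    struct ibv_xrcd_init_attr xrcd_attr;
    memset(&xrcd_attr, 0, sizeof(xrcd_attr));
    xrcd_attr.comp_mask = IBV_XRCD_INIT_ATTR_FD | IBV_XRCD_INIT_ATTR_OFLAGS;
    xrcd_attr.fd        = -1;       /* anonymous domain; a file fd enables sharing */
    xrcd_attr.oflags    = O_CREAT;
    struct ibv_xrcd *xrcd = ibv_open_xrcd(ctx, &xrcd_attr);

    /* 2. Create an SRQ of type XRC bound to that domain. */
    struct ibv_srq_init_attr_ex srq_attr;
    memset(&srq_attr, 0, sizeof(srq_attr));
    srq_attr.comp_mask    = IBV_SRQ_INIT_ATTR_TYPE | IBV_SRQ_INIT_ATTR_XRCD |
                            IBV_SRQ_INIT_ATTR_CQ   | IBV_SRQ_INIT_ATTR_PD;
    srq_attr.srq_type     = IBV_SRQT_XRC;
    srq_attr.xrcd         = xrcd;
    srq_attr.cq           = cq;
    srq_attr.pd           = pd;
    srq_attr.attr.max_wr  = 64;     /* illustrative sizes */
    srq_attr.attr.max_sge = 1;
    struct ibv_srq *srq = ibv_create_srq_ex(ctx, &srq_attr);

    /* 3. The SRQ number is what the sender places in its work requests. */
    uint32_t srq_num;
    ibv_get_srq_num(srq, &srq_num);

    /* 4. Create the XRC target QP (the send side instead uses
     *    IBV_QPT_XRC_SEND with a PD and send CQ). */
    struct ibv_qp_init_attr_ex qp_attr;
    memset(&qp_attr, 0, sizeof(qp_attr));
    qp_attr.qp_type   = IBV_QPT_XRC_RECV;
    qp_attr.comp_mask = IBV_QP_INIT_ATTR_XRCD;
    qp_attr.xrcd      = xrcd;
    struct ibv_qp *tgt_qp = ibv_create_qp_ex(ctx, &qp_attr);
    (void)tgt_qp; (void)srq_num;
}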

Testing and Reference

For basic testing and example usage of the XRC transport, refer to the ibv_xsrq_pingpong utility. It provides a working reference for integrating XRC into user applications.

For complete documentation of each API function and its options, consult the corresponding man pages.

Dynamically Connected Transport (DCT)

Dynamically Connected Transport (DCT) is an advanced RDMA transport service designed to improve scalability and efficiency in environments with sparse or dynamic communication patterns. It allows Reliable Connection (RC)-type QPs to dynamically connect and disconnect from remote nodes as needed, rather than maintaining persistent connections.

This on-demand connection model reduces system-wide resource usage and enhances performance by:

  • Minimizing the total number of QPs required.

  • Lowering the memory footprint.

  • Reducing connection setup overhead.

  • Improving on-chip cache utilization.

As a result, DCT is particularly beneficial for large-scale or highly dynamic deployments where traditional RC QP scaling becomes a bottleneck.

Note

DCT is supported only by the mlx5 driver.

Note
  • NVIDIA® ConnectX®-4 devices support DCT v0.

  • NVIDIA® ConnectX®-5 and later devices support DCT v1.

DCT v0 and DCT v1 are not interoperable. Ensure all participating nodes are using the same DCT version to maintain compatibility.
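
Because DCT is exposed through the mlx5 driver, DC QPs are created via the mlx5 direct-verbs (mlx5dv) interface rather than plain ibv_create_qp(). The hedged sketch below shows one possible way to create a DC target (DCT) with mlx5dv_create_qp(); the ctx, pd, cq, and srq handles and the access-key value are assumptions of the example, and the initiator side would be created analogously with MLX5DV_DCTYPE_DCI.

/* Sketch: creating a DC target (DCT) through the mlx5 direct-verbs API,
 * consistent with the note above that DCT is mlx5-only.
 * Assumes ctx, pd, cq, and srq already exist; error handling is omitted. */
#include <string.h>
#include <infiniband/verbs.h>
#include <infiniband/mlx5dv.h>

#define EXAMPLE_DC_KEY 0x1122334455667788ULL   /* illustrative access key */

struct ibv_qp *create_dc_target(struct ibv_context *ctx, struct ibv_pd *pd,
                                struct ibv_cq *cq, struct ibv_srq *srq)
{
    struct ibv_qp_init_attr_ex attr;
    struct mlx5dv_qp_init_attr dv_attr;

    memset(&attr, 0, sizeof(attr));
    memset(&dv_attr, 0, sizeof(dv_attr));

    attr.qp_type   = IBV_QPT_DRIVER;   /* driver-specific transport type */
    attr.send_cq   = cq;
    attr.recv_cq   = cq;
    attr.srq       = srq;              /* DC targets receive through an SRQ */
    attr.pd        = pd;
    attr.comp_mask = IBV_QP_INIT_ATTR_PD;

    dv_attr.comp_mask                   = MLX5DV_QP_INIT_ATTR_MASK_DC;
    dv_attr.dc_init_attr.dc_type        = MLX5DV_DCTYPE_DCT;
    dv_attr.dc_init_attr.dct_access_key = EXAMPLE_DC_KEY;

    /* The initiator (DCI) side is created the same way with MLX5DV_DCTYPE_DCI. */
    return mlx5dv_create_qp(ctx, &attr, &dv_attr);
}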

MPI Tag Matching and Rendezvous Offloads

Note

Supported in NVIDIA® ConnectX®-5 and above adapter cards.

MPI Tag Matching and Rendezvous Offloads is a technology developed by NVIDIA to offload the processing of MPI (Message Passing Interface) communications from the host CPU to the network adapter. This offload mechanism is designed to optimize message matching and reduce memory overhead in high-performance computing environments.

By leveraging this feature:

  • MPI messages are delivered directly to the user’s buffer with zero-copy semantics, eliminating the need for intermediate memory copies.

  • Rendezvous protocol progress is handled entirely by the network adapter, without host CPU intervention.

  • The CPU is freed to continue executing application-level computation while the adapter concurrently manages data transfers.

This offload model significantly improves performance and scalability for MPI-based applications by overlapping computation with communication and reducing CPU load.
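
At the verbs layer, one host-side entry point to this offload is a tag-matching SRQ (IBV_SRQT_TM), which MPI libraries normally create and drive internally on the application's behalf. The hedged sketch below shows what that setup could look like; the ctx, pd, and cq handles and the capacity values are assumptions of the example.

/* Sketch: creating a tag-matching SRQ -- the verbs-level object that hardware
 * tag matching and rendezvous offload operate on.  MPI libraries typically set
 * this up internally; the values below are illustrative.  Assumes ctx, pd, cq. */
#include <string.h>
#include <infiniband/verbs.h>

struct ibv_srq *create_tm_srq(struct ibv_context *ctx, struct ibv_pd *pd,
                              struct ibv_cq *cq)
{
    struct ibv_srq_init_attr_ex attr;

    memset(&attr, 0, sizeof(attr));
    attr.comp_mask           = IBV_SRQ_INIT_ATTR_TYPE | IBV_SRQ_INIT_ATTR_PD |
                               IBV_SRQ_INIT_ATTR_CQ   | IBV_SRQ_INIT_ATTR_TM;
    attr.srq_type            = IBV_SRQT_TM;
    attr.pd                  = pd;
    attr.cq                  = cq;
    attr.attr.max_wr         = 256;   /* buffers for unexpected messages */
    attr.attr.max_sge        = 1;
    attr.tm_cap.max_num_tags = 64;    /* tagged receives matched in hardware */
    attr.tm_cap.max_ops      = 16;    /* outstanding tag-list add/remove ops */

    /* Tagged receives are then posted with ibv_post_srq_ops()
     * (IBV_WR_TAG_ADD / IBV_WR_TAG_DEL). */
    return ibv_create_srq_ex(ctx, &attr);
}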

For more information, refer to the community post Understanding MPI Tag Matching and Rendezvous Offloads (ConnectX-5).

© Copyright 2025, NVIDIA. Last updated on Aug 28, 2025.