Date: 2026-01-12

Note This feature is in maintenance support mode for adapter cards from ConnectX-5 through ConnectX-7 and is unsupported on ConnectX-8 NICs and newer.

Starting with ConnectX-5, Tag Matching, previously performed in software, can be offloaded by UCX to the HCA. For MPI applications that send messages with numeric tags, this accelerates the processing of incoming messages, leading to better CPU utilization and lower latency for expected messages. In Tag Matching, the software holds a list of matching entries called the matching list. Each matching entry contains a tag and a pointer to an application buffer. The matching list is used to steer arriving messages to a specific buffer according to the message tag. The action of traversing the matching list and finding the matching entry is called Tag Matching, and with the offload it is performed on the HCA instead of the CPU. This is useful when incoming messages are consumed not in the order they arrive, but based on a numeric identifier coordinated with the sender.

Hardware Tag Matching frees the CPU for other application needs. Currently, Hardware Tag Matching is supported for the accelerated RC and DC transports (RC_X and DC_X), and can be enabled in UCX with the following environment parameters:

For the RC_X transport: UCX_RC_MLX5_TM_ENABLE=y

For the DC_X transport: UCX_DC_MLX5_TM_ENABLE=y

By default, only messages larger than a certain threshold are offloaded to the transport. This threshold is managed by the “UCX_TM_THRESH” environment variable (its default value is 1024 bytes).

UCX may also use bounce buffers for hardware Tag Matching, offloading internal pre-registered buffers instead of user buffers up to a certain threshold. This threshold is controlled by the UCX_TM_MAX_BB_SIZE environment variable. The value of this variable must be less than or equal to the segment size, and it must be larger than the value of UCX_TM_THRESH to take effect (the default value is 1024 bytes, meaning that this optimization is disabled by default).

Note With hardware Tag Matching enabled, the Rendezvous threshold is limited by the segment size, which is controlled by the UCX_RC_MLX5_TM_MAX_BCOPY or UCX_DC_MLX5_TM_MAX_BCOPY variables (for the RC_X and DC_X transports, respectively). Thus, the effective Rendezvous threshold is the smaller of the segment size and the value of the UCX_RNDV_THRESH environment variable.
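The threshold rules above can be sketched in shell. This is only an illustration: the helper names bounce_buffers_active and effective_rndv_thresh are hypothetical (they are not part of UCX), the numeric values are examples, and the helpers assume plain byte counts rather than suffixed sizes.

```shell
# Enable hardware Tag Matching for the RC_X transport (variable names from the text).
export UCX_RC_MLX5_TM_ENABLE=y
export UCX_TM_THRESH=1024        # default offload threshold, in bytes
export UCX_TM_MAX_BB_SIZE=4096   # illustrative value; must exceed UCX_TM_THRESH

# Hypothetical helper: bounce buffers take effect only when
# UCX_TM_THRESH < UCX_TM_MAX_BB_SIZE <= segment size.
# Usage: bounce_buffers_active THRESH BB_SIZE SEG_SIZE
bounce_buffers_active() {
    [ "$2" -gt "$1" ] && [ "$2" -le "$3" ]
}

# Hypothetical helper: effective Rendezvous threshold is the smaller of
# the segment size and UCX_RNDV_THRESH.
# Usage: effective_rndv_thresh SEG_SIZE RNDV_THRESH
effective_rndv_thresh() {
    if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Segment size 8192, UCX_RNDV_THRESH 16384: prints 8192
effective_rndv_thresh 8192 16384
```

With the default UCX_TM_MAX_BB_SIZE of 1024, bounce_buffers_active 1024 1024 8192 fails, which matches the statement that the optimization is disabled by default.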

Note Hardware Tag Matching for RoCE is not supported.

For further information, refer to the "Understanding Tag Matching for Developers" post.

SR-IOV is a technology that allows a physical PCIe device to present itself multiple times through the PCIe bus. This technology enables multiple virtual instances of the device with separate resources. These Virtual Functions (VFs) can then be provisioned separately. Each VF can be seen as an additional device connected to the Physical Function. It shares the same resources with the Physical Function, and it has the same number of ports as the Physical Function.

This feature is supported on ConnectX-5 HCAs and above only. To enable SR-IOV in UCX while it is configured in the fabric, use the following environment parameter:

UCX_IB_ADDR_TYPE=ib_global

Notes:

This environment parameter should also be used when running UCX on a fabric with a Socket Direct HCA installed. When working with Socket Direct HCAs, make sure the Multi-Rail feature is enabled as well (refer to Multi-Rail).

SR-IOV is not supported with the dc and dc_x transports in UCX.

Adaptive Routing (AR) enables sending messages between two HCAs over different routes, based on the network load. With static routing, a packet that arrives at the switch is forwarded based on its destination only. With Adaptive Routing, the packet can be forwarded through any of the ports that lead to its destination, so the load is balanced between ports and the fabric adapts to load changes over time. This feature requires support for out-of-order arrival of messages, which UCX has for the rc, rc_x and dc_x transports.

Note To be able to use Adaptive Routing on the fabric, make sure it is enabled in OpenSM and in the switches.

For RoCE adapters, the option UCX_IB_AR_ENABLE=<yes/try/auto/no> enables adaptive routing if the hardware configuration supports it. If set to yes and the network does not support it, an error occurs. If set to try, any lack of support is silently ignored. Setting it to no force-disables adaptive routing or returns an error.

For InfiniBand adapters, the activation is performed with UCX_IB_SL and UCX_IB_AR_ENABLE options as listed in the table below.

Note Enabling Adaptive Routing on a certain SL is done according to the following table.

UCX_IB_SL | AR state on the fabric  | UCX_IB_AR_ENABLE=yes | UCX_IB_AR_ENABLE=no   | UCX_IB_AR_ENABLE=try | UCX_IB_AR_ENABLE=auto
auto      | AR enabled on some SLs  | Use 1st SL with AR   | Use 1st SL without AR | Use 1st SL with AR   | Use SL=0
auto      | AR enabled on all SLs   | Use SL=0             | Failure               | Use SL=0             | Use SL=0
auto      | AR disabled on all SLs  | Failure              | Use SL=0              | Use SL=0             | Use SL=0
<sl>      | AR enabled on <sl>      | Use SL=<sl>          | Failure               | Use SL=<sl>          | Use SL=<sl>
<sl>      | AR disabled on <sl>     | Failure              | Use SL=<sl>           | Use SL=<sl>          | Use SL=<sl>
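As a reading aid, the table above can be expressed as a lookup function. This is a sketch only: ar_sl_choice is a hypothetical helper, not a UCX tool, and the state keywords (some, all, none, enabled, disabled) are labels invented here for the table rows.

```shell
# Hypothetical helper mirroring the SL selection table.
# Usage: ar_sl_choice SL STATE MODE
#   SL:    "auto" or an explicit SL number (UCX_IB_SL)
#   STATE: some|all|none when SL is auto, enabled|disabled for an explicit SL
#   MODE:  yes|no|try|auto (UCX_IB_AR_ENABLE)
ar_sl_choice() {
    sl=$1 state=$2 mode=$3
    case "$mode" in
        yes|no|try|auto) ;;
        *) echo "invalid mode: $mode" >&2; return 2 ;;
    esac
    if [ "$sl" = auto ]; then
        case "$state:$mode" in
            some:yes|some:try) echo "first SL with AR" ;;
            some:no)           echo "first SL without AR" ;;
            some:auto)         echo "SL=0" ;;
            all:no|none:yes)   echo "failure" ;;
            all:*|none:*)      echo "SL=0" ;;
            *) echo "invalid state: $state" >&2; return 2 ;;
        esac
    else
        case "$state:$mode" in
            enabled:no|disabled:yes) echo "failure" ;;
            enabled:*|disabled:*)    echo "SL=$sl" ;;
            *) echo "invalid state: $state" >&2; return 2 ;;
        esac
    fi
}

ar_sl_choice auto some no    # prints: first SL without AR
ar_sl_choice 3 disabled try  # prints: SL=3
```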

Note Adaptive routing is not supported for OpenSHMEM applications.

Error Handling enables UCX to handle errors that occur due to algorithms with fault recovery logic. To handle such errors, a new mode was added that guarantees an accurate status for every sent message. In addition, the process classifies errors by their origin (i.e., local or remote) and severity, allowing the user to decide how to proceed and which recovery method to apply. To use Error Handling in UCX, the user must register with the UCP API (for example, through the ucp_ep_create API function).

CUDA environment support in HPC-X enables the use of NVIDIA’s GPU memory in UCX and HCOLL communication libraries for point-to-point and collective routines, respectively.

System Requirements

CUDA v12.0 or higher. For information on how to install CUDA, please refer to NVIDIA documents for CUDA Toolkit.

GPUDirect support requires either a recent kernel with dmabuf support or a GPUDirect RDMA driver. For more information, please refer to the DOCA-Host documentation.

To install the GPUDirect RDMA driver with DOCA-Host, please refer to the DOCA-Host documentation.

The optional gdrcopy driver optimizes the transfer latency of small messages residing in GPU memory. For information on how to install GDR COPY, refer to its GitHub webpage. Note It is important to make sure that the gdrcopy driver is installed properly on each of the compute nodes taking part in the MPI job.

To check whether the GDR COPY module is loaded, run:

lsmod | grep gdrdrv

Multi-Rail enables users to use more than one of the active ports on the host, making better use of system resources, and allowing increased throughput. When using Socket Direct cards, the Multi-Rail capability becomes essential.

Each process can use up to the first 4 active ports on the host in parallel (the 4-port limit is for performance considerations) if the following parameters are set:

For setting the number of active ports to use for the Eager protocol, i.e. for small messages, please set the following parameter:

% mpirun -mca pml ucx -x UCX_MAX_EAGER_RAILS=4 ...

For setting the number of active ports to use for the Rendezvous protocol, i.e. for large messages, please set the following parameter:

% mpirun -mca pml ucx -x UCX_MAX_RNDV_RAILS=4 ...

Possible values for these parameters are 1, 2, 3 and 4. The default values are UCX_MAX_EAGER_RAILS=1 and UCX_MAX_RNDV_RAILS=2.
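The following sketch combines both settings and rejects values outside the documented 1 to 4 range. The helper multirail_cmd is hypothetical; it only prints the command prefix so that it can be inspected before the application arguments are appended.

```shell
# Hypothetical helper: print an mpirun prefix with both rail settings.
# Usage: multirail_cmd EAGER_RAILS RNDV_RAILS
multirail_cmd() {
    eager=$1 rndv=$2
    for n in "$eager" "$rndv"; do
        case "$n" in
            1|2|3|4) ;;
            *) echo "rail count must be 1, 2, 3 or 4: $n" >&2; return 1 ;;
        esac
    done
    echo "mpirun -mca pml ucx -x UCX_MAX_EAGER_RAILS=$eager -x UCX_MAX_RNDV_RAILS=$rndv"
}

# Prints: mpirun -mca pml ucx -x UCX_MAX_EAGER_RAILS=2 -x UCX_MAX_RNDV_RAILS=4
multirail_cmd 2 4
```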

Note The Multi-Rail feature will be disabled while the Hardware Tag Matching feature is enabled.

Note Starting from HPC-X v2.8, multi-rail is also supported out-of-box for the client-server API. To enable or disable it, use the following environment parameter: UCX_CM_USE_ALL_DEVICES=y/n





The Memory in Chip feature allows on-device memory to be used for sending messages from the UCX layer. This feature is enabled by default on ConnectX-5 HCAs. It is supported only for the rc_x and dc_x transports in UCX.

The environment parameters that control this feature behavior are:

UCX_RC_MLX5_DM_SIZE

UCX_RC_MLX5_DM_COUNT

UCX_DC_MLX5_DM_SIZE

UCX_DC_MLX5_DM_COUNT

For more information on these parameters, please refer to the ucx_info utility: % $HPCX_UCX_DIR/bin/ucx_info -f.

UCX supports the usage of a non-default PKey. To specify which PKey value to use, set the UCX_IB_PKEY environment parameter.

Valid values range from 0 to 0x7fff.

In an environment where the default PKey is not found, the PKey in index 0 will be used.
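A range check for the PKey value can be sketched as follows. The helper valid_pkey is hypothetical and only validates the documented 0 to 0x7fff range before the variable is exported; the value 0x1234 is an arbitrary example.

```shell
# Hypothetical helper: succeed only for values in the range 0..0x7fff.
# Accepts decimal or 0x-prefixed hexadecimal input.
valid_pkey() {
    v=$(printf '%d' "$1" 2>/dev/null) || return 1
    [ "$v" -ge 0 ] && [ "$v" -le 32767 ]   # 32767 == 0x7fff
}

pkey=0x1234   # illustrative value
if valid_pkey "$pkey"; then
    export UCX_IB_PKEY=$pkey
fi
```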

When using the UCX client-server API for connection establishment, it is also possible to have a graceful teardown (i.e., a disconnection) between each client and the server it is connected to at the end of the communication. Either side can initiate the disconnection.

UCX now supports RoCE LAG out-of-box.

UCX is now able to detect a RoCE LAG device and automatically create two RDMA connections to utilize the full bandwidth of LAG interface.

For Ethernet packets, the network switch path is usually determined by a hash function on the packet’s IP and UDP header fields. To force the use of distinct paths for various switch topologies, set the “UCX_ROCE_PATH_FACTOR=n” environment variable to influence the UDP.source_port field: the first connection will use “UDP.source_port=0xC000”, while the second connection will use “UDP.source_port=0xC000+n”. The default value for UCX_ROCE_PATH_FACTOR is 1. This feature is currently supported for the RC transport only.
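The source port arithmetic can be illustrated with a small calculation. This sketch assumes that the second connection's port is 0xC000 plus the path factor n; the helper roce_src_ports is hypothetical and performs no UCX calls.

```shell
# Hypothetical helper: print the UDP source ports of the two LAG connections
# for a given UCX_ROCE_PATH_FACTOR value (assumption: second port = 0xC000 + n).
roce_src_ports() {
    base=49152   # 0xC000
    printf '0x%X 0x%X\n' "$base" $((base + $1))
}

roce_src_ports 1   # prints: 0xC000 0xC001
roce_src_ports 4   # prints: 0xC000 0xC004
```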

This feature is intended to prevent network congestion when many processes send messages to the same destination. To reduce network pressure, the user may limit the amount of data transferred simultaneously by setting the UCX_RC_TX_NUM_GET_BYTES environment variable to a certain value (e.g. 10MB). In addition, to achieve better pipelining of network transfer and data processing, the user may limit the maximal message size that can be transferred using a single RDMA Read operation by setting the UCX_RC_MAX_GET_ZCOPY environment variable to a certain value (e.g. 64KB).
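To get a feel for the effect of the per-operation limit, the sketch below estimates how many RDMA Read operations a message would need. It assumes that a message larger than UCX_RC_MAX_GET_ZCOPY is split into reads of at most that size; rdma_read_count is a hypothetical helper working on plain byte counts.

```shell
# Example values from the text.
export UCX_RC_TX_NUM_GET_BYTES=10MB
export UCX_RC_MAX_GET_ZCOPY=64KB

# Hypothetical helper: ceil(MSG_BYTES / MAX_GET_BYTES).
# Usage: rdma_read_count MSG_BYTES MAX_GET_BYTES
rdma_read_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# A 1 MiB message with a 64 KiB limit: prints 16
rdma_read_count 1048576 65536
```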

UCX supports enabling Relaxed Ordering for PCIe Write transactions in order to improve performance on systems where the PCI bandwidth of relaxed-ordered Writes is higher than that of the default strict-ordered Writes.

The environment variable UCX_IB_PCI_RELAXED_ORDERING can force a specific behavior: “on” enables relaxed ordering; “off” disables it; while “auto” (default) sets relaxed ordering mode based on the system type.

The UCX configuration file enables the user to apply configuration variables defined in the $HPCX_UCX_DIR/etc/ucx/ucx.conf file. A configuration file with initial default values can be created by running "ucx_info -fC > $HPCX_UCX_DIR/etc/ucx/ucx.conf".

The values are applied in the following order of precedence:

1. If an environment variable is set explicitly, it overrides the file's configuration.

2. Otherwise, value from $HPCX_UCX_DIR/etc/ucx/ucx.conf is used if it exists.

3. Otherwise, default (compile-time) value is used.

Note The configuration file applies settings only to the host where it is located.
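The precedence order above can be mimicked with a small lookup function. This is a sketch of the rule, not of the UCX implementation: resolve_ucx_var is a hypothetical helper, the file is assumed to contain NAME=value lines, and the default passed in stands in for the compile-time value.

```shell
# Hypothetical helper: resolve a setting as environment > ucx.conf > default.
# Usage: resolve_ucx_var NAME CONF_FILE DEFAULT
resolve_ucx_var() {
    name=$1 conf=$2 default=$3
    # 1. An explicitly set (exported) environment variable wins.
    if value=$(printenv "$name"); then
        echo "$value"
        return 0
    fi
    # 2. Otherwise use the last uncommented NAME=value line in the file, if any.
    if [ -f "$conf" ]; then
        value=$(sed -n "s/^[[:space:]]*$name=//p" "$conf" | tail -n 1)
        if [ -n "$value" ]; then
            echo "$value"
            return 0
        fi
    fi
    # 3. Otherwise fall back to the default.
    echo "$default"
}

# "all" is an illustrative default, not taken from the text.
resolve_ucx_var UCX_TLS "${HPCX_UCX_DIR:-}/etc/ucx/ucx.conf" all
```

Because the file is read only on the local host, the same lookup may produce different results on different nodes of a job.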





This functionality enables the user to analyze UCX-based applications at runtime. The tool is based on the Filesystem in Userspace (FUSE) interface. If the feature is enabled, a directory for each process using UCX is created in /tmp/ucx. The directory name is the PID of the target process. The process directory contains three sub-directories: UCP, UCT, and UCS.

Note UCX inside HPC-X is built with --with-fuse3 option by default.

While building, UCX checks for fuse3 library presence and enables building the tool. Once UCX is built, the ucx_vfs binary will be created in the install directory and will be used to launch a daemon process and enable UCX-based applications analysis.

You can use the UCX_VFS_ENABLE environment variable to control the feature. It is set to ‘y’ by default. Setting the variable to ‘n’ disables creating the service thread in user’s UCX application.

For the feature to function properly, the following is required:

fuse3 utilities to run the daemon and analyze applications

fuse3 library to build the tool

The ucx_vfs daemon must be started before the target processes. Otherwise, if the number of processes exceeds the limit, fs.inotify.max_user_instances must be increased.

If the user simultaneously starts more than the maximum allowed number of processes and then starts the daemon, only the first processes that fit within the limit will be monitored by the tool.
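Before launching a large job, the current limit can be compared with the planned process count. The sketch below assumes a Linux host that exposes the limit under /proc and assumes that each monitored process counts against it; fits_inotify_limit is a hypothetical helper and the process count is an example.

```shell
# Hypothetical helper: succeed when NPROCS fits within LIMIT.
# Usage: fits_inotify_limit NPROCS LIMIT
fits_inotify_limit() {
    [ "$1" -le "$2" ]
}

limit_file=/proc/sys/fs/inotify/max_user_instances
nprocs=64   # illustrative number of UCX processes on this host
if [ -r "$limit_file" ]; then
    limit=$(cat "$limit_file")
    if fits_inotify_limit "$nprocs" "$limit"; then
        echo "limit $limit is sufficient for $nprocs processes"
    else
        echo "raise fs.inotify.max_user_instances above $limit (for example with sysctl -w, as root)"
    fi
fi
```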

On-demand paging mitigates the limitations of memory registration by allowing applications to avoid pinning down physical pages of the address space and tracking mapping validity.

Instead, the Host Channel Adapter (HCA) requests updated translations from the operating system when pages are absent, and the OS invalidates translations affected by non-present pages or mapping alterations.

With ODP, system memory is not locked and therefore is allowed to be swapped out if necessary.

The feature is controlled by the UCX_REG_NONBLOCK_MEM_TYPES environment variable. UCX supports both ODP versions, ODPv1 and ODPv2. Which ODP version can be used depends on the firmware version and configuration.

To enable ODPv2 in UCX, it is enough to specify UCX_REG_NONBLOCK_MEM_TYPES=host.

To enable ODPv1, in addition to setting UCX_REG_NONBLOCK_MEM_TYPES=host, devx object creation must be disabled using UCX_IB_MLX5_DEVX_OBJECTS="".
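The two variants can be captured in one helper that prints the required assignments. The helper odp_env is hypothetical; the variable names and values are the ones given above.

```shell
# Hypothetical helper: print the environment assignments for an ODP version.
# Usage: odp_env 1|2
odp_env() {
    case "$1" in
        2) echo 'UCX_REG_NONBLOCK_MEM_TYPES=host' ;;
        1) echo 'UCX_REG_NONBLOCK_MEM_TYPES=host'
           echo 'UCX_IB_MLX5_DEVX_OBJECTS=' ;;   # empty value disables devx object creation
        *) echo "unknown ODP version: $1" >&2; return 1 ;;
    esac
}

odp_env 1
```

The printed assignments can be passed to env in front of the application command, for example env $(odp_env 2) followed by the program to run.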

Note This feature is enabled by default on Grace platforms. On all other platforms it is disabled by default and can be activated using environment variables described above.





Multi-Node NVLINK (MNNVL) enables NVLINK communication between processes located on different nodes. MNNVL support is disabled by default in UCX. To enable this feature, set the environment variable UCX_CUDA_IPC_ENABLE_MNNVL=y.

Additionally, for applications that create endpoints with error handling support, UCX_RNDV_PIPELINE_ERROR_HANDLING=y needs to be set.

Note Setting UCX_RNDV_PIPELINE_ERROR_HANDLING=y may result in data corruption during the error handling flow. Once all missing parts of the error handling flow are implemented, this environment variable will be deprecated.



