Unified Communication - X Framework (UCX) is an acceleration library, integrated into Open MPI (as a pml layer) and into OpenSHMEM (as an spml layer), and available as part of HPC-X. It is an open-source communication library designed to achieve the highest performance for HPC applications. UCX includes a broad range of optimizations for achieving low software overheads in the communication path, which allows near native-level performance.
UCX supports receive side tag matching, one-sided communication semantics, efficient memory registration and a variety of enhancements which increase the scalability and performance of HPC applications significantly.
UCX is also highly useful for storage, big-data and cloud domains where client-server based applications are used.
- InfiniBand transports:
- Unreliable Datagram (UD)
- Reliable Connected (RC)
- Dynamically Connected (DC)
DC is supported on Connect-IB®/ConnectX®-4 and above HCAs with MLNX_OFED v5.0 and higher.
- Accelerated verbs
- Shared Memory communication with support for KNEM, CMA and XPMEM
Supported CPU Architectures
Unified Communication - X Framework (UCX) supported CPU architectures are: x86, ARM, PowerPC.
As of HPC-X v2.1, UCX is set as the default pml for Open MPI and the default spml for OpenSHMEM.
Using UCX with Open MPI
UCX is the default pml in Open MPI and the default spml in OpenSHMEM.
To use UCX with Open MPI explicitly:
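The explicit selection command is not shown above; a typical invocation looks like the following (the rank count and binary name are placeholders):

```shell
$ mpirun -mca pml ucx -mca osc ucx -np 64 ./my_app
```

The additional `-mca osc ucx` selects UCX for MPI one-sided (RMA) operations as well.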
To use UCX with OpenSHMEM explicitly:
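A sketch of the corresponding OpenSHMEM invocation (the rank count and binary name are placeholders):

```shell
$ oshrun -mca spml ucx -np 64 ./my_shmem_app
```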
Configuring UCX with XPMEM
By default, UCX library embedded within HPC-X is compiled with an open source version of the XPMEM driver. The recommended version of the XPMEM driver is from: https://github.com/openucx/xpmem.
In order to compile UCX with another version of XPMEM, follow the steps below:
- Make sure your host has XPMEM headers and the userspace library is installed.
Untar the UCX sources available inside the $HPCX_HOME/sources directory, and recompile UCX:
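A sketch of the rebuild steps (the XPMEM path, install prefix, and parallelism level are placeholders to adjust for your system):

```shell
cd $HPCX_HOME/sources
tar xzf ucx-*.tar.gz
cd ucx-*
./contrib/configure-release --with-xpmem=/path/to/xpmem --prefix=/path/to/new/ucx/install
make -j8 install
```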
Note: In case the new UCX version is installed in a different location, use LD_LIBRARY_PATH for Open MPI to use the new location:
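For example (the install path is a placeholder):

```shell
$ mpirun -mca pml ucx -x LD_LIBRARY_PATH=/path/to/new/ucx/install/lib:$LD_LIBRARY_PATH ...
```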
When UCX is compiled from sources, it can be configured for best performance.
To accomplish this, please compile UCX with:
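The exact configure invocation is not shown above; a common way to build an optimized UCX from its sources is via the release configure wrapper shipped with the UCX tree (the prefix path is a placeholder):

```shell
./contrib/configure-release --prefix=/path/to/ucx-install
make -j8 install
```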
Tuning UCX Settings
The default UCX settings are already optimized. To check the available UCX parameters and their default values, run the '$HPCX_UCX_DIR/bin/ucx_info -f' utility.
To check the UCX version, run:
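```shell
$ $HPCX_UCX_DIR/bin/ucx_info -v
```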
UCX parameters can be modified using one of the following methods:
Modifying the default UCX parameter values as part of the mpirun command line:
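For example (the variable and its value here are illustrative):

```shell
$ mpirun -mca pml ucx -x UCX_RNDV_THRESH=16384 ...
```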
Modifying the default UCX parameter values from the shell (when running as part of a resource manager job):
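For example (the variable and its value here are illustrative):

```shell
# Set a UCX parameter in the shell environment; the job launched
# under the resource manager will inherit it.
export UCX_RNDV_THRESH=16384
```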
Selecting the transports to use from the command line:
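For example, selecting shared memory (sm) and accelerated verbs (rc_x):

```shell
$ mpirun -mca pml ucx -x UCX_TLS=sm,rc_x ...
```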
The above command will select pml ucx and set shared memory and accelerated verbs as its transports.
Excluding specific transports from the command line:
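For example, excluding the rc transport:

```shell
$ mpirun -mca pml ucx -x UCX_TLS=^rc ...
```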
The above command line will select pml ucx and use all its available transports except for rc. The rc transport will be excluded from usage.
As of HPC-X v2.5, shared memory has new transport naming. The available shared memory transports are: posix, sysv and xpmem.
The 'device' name for the shared memory transport is 'memory' (for usage in UCX_SHM_DEVICES).
When selecting one of the several devices or interfaces in the server, please use the UCX_NET_DEVICES flag to specify which RDMA device you would like to use.
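For example, to use port 1 of the mlx5_1 device:

```shell
$ mpirun -mca pml ucx -x UCX_NET_DEVICES=mlx5_1:1 ...
```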
The above command will select pml ucx and set the HCA for usage, mlx5_1, port 1.
Improving performance at scale by increasing the value of DC initiator QPs (DCI) number used by the interface when using the DC transport:
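A sketch of such a run; the DCI count of 16 is an illustrative value to tune for your scale:

```shell
$ mpirun -mca pml ucx -x UCX_TLS=dc_x -x UCX_DC_MLX5_NUM_DCI=16 ...
```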
In case the DC transport is not available or disabled on a large scale, UCX will fall back to the UD transport.
The RC transport is disabled after 256 established connections. The counter of established connections can be overridden using the UCX_RC_MAX_NUM_EPS environmental parameter.
To run UCX on a RoCE port:
Configuring the fabric as lossless (see RoCE Deployment Community post), and setting UCX_IB_TRAFFIC_CLASS=106.
Setting the specific port using the UCX_NET_DEVICES environment variable. For example:
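A sketch (the device name is a placeholder for your RoCE port):

```shell
$ mpirun -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_IB_TRAFFIC_CLASS=106 ...
```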
By default, RoCE v2 and IPv4 are used, if available. Otherwise, RoCE v1 with MAC address is used. In order to set a specific RoCE version to use, set UCX_IB_GID_INDEX to the index of the required RoCE version and address type, as reported by “show_gids” command. For example:
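For example (the GID index of 3 is illustrative; use the index reported by "show_gids" for the required RoCE version and address type):

```shell
$ mpirun -mca pml ucx -x UCX_IB_GID_INDEX=3 ...
```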
Setting the threshold for using the Rendezvous protocol in UCX:
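The relevant environment parameter is UCX_RNDV_THRESH; the value below is illustrative:

```shell
# Use the Rendezvous protocol for messages larger than 16 KB
export UCX_RNDV_THRESH=16384
```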
By default, UCX will calculate the optimal threshold on its own, but the value can be overwritten using the above environment parameter.
Setting the threshold for using the zero-copy in UCX:
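The relevant environment parameter is UCX_ZCOPY_THRESH; the value below is illustrative:

```shell
# Use zero-copy for messages larger than 16 KB
export UCX_ZCOPY_THRESH=16384
```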
By default, UCX will calculate the optimal threshold on its own, but the value can be overwritten using the above environment parameter.
UCX_IB_ADDR_TYPE=ib_global when running on a GID-based multi-host setup (see also the Single Root IO Virtualization (SR-IOV) section below).
Enabling various optimizations intended for a homogeneous environment. Enabling this mode implies that the local transport resources/devices of all entities that connect to each other are the same.
-x UCX_IB_SUBNET_PREFIX to filter for the InfiniBand subnet prefix (empty means no filter). This is relevant for the IB link layer only. For example, a filter for the default subnet prefix can be specified as follows: fe80:0:0:0.
Specifying how DC initiator (DCI) is selected by the endpoint with UCX_DC_MLX5_TX_POLICY=<policy> (relevant for DC transport only). The policy options are:
- dcs: The endpoint either uses an already assigned DCI, or a DCI is allocated in LIFO order and is released once it has no outstanding operations.
- dcs_quota: Same as "dcs". In addition, the DCI is scheduled for release in case it has sent more than one quota and there are endpoints waiting for a DCI. The DCI is released once it completes all outstanding operations. This policy ensures that there will be no starvation among endpoints.
- rand: Every endpoint is assigned a randomly selected DCI. Multiple endpoints may share the same DCI.
Using UCX CUDA memory hooks may not work with statically built CUDA applications. As a workaround, extend the configuration with the following options:
Disabling GPU memory staging protocols and using only GPUDirect RDMA, if possible:
Running the application on close NUMA nodes:
New shared memory transport naming:
The available shared memory transport names are: posix, sysv and xpmem.
'sm' and 'mm' are aliases that include all three of the above.
The 'device' name for the shared memory transport is 'memory' (for use in UCX_SHM_DEVICES).
To get more information in case of any error (for troubleshooting purposes), please set the following environment parameter:
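The parameter itself is not shown above; raising the UCX log level is the usual way to get more diagnostic output (the exact level to pick is an assumption; higher levels such as debug print more):

```shell
# Increase UCX log verbosity for troubleshooting
export UCX_LOG_LEVEL=debug
```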
- DC full handshake (FH) can be configured via the environment variables UCX_DC_MLX5_DCI_FULL_HANDSHAKE, UCX_DC_MLX5_DCI_KA_FULL_HANDSHAKE and UCX_DC_MLX5_DCT_FULL_HANDSHAKE. Possible values are: on / off / auto, with the default being "off". In auto mode, FH is used according to the AR configuration of the SL in use (if the SL has AR enabled, FH is used; otherwise, half handshake (HH) is used).
Hardware Tag Matching
Starting with ConnectX-5, Tag Matching, previously done by the software, can be offloaded in UCX to the HCA. For MPI applications, sending messages with numeric tags accelerates the processing of incoming messages, leading to better CPU utilization and lower latency for expected messages. In Tag Matching, the software holds a list of matching entries called the matching list. Each matching entry contains a tag and a pointer to an application buffer. The matching list is used to steer arriving messages to a specific buffer according to the message tag. The action of traversing the matching list and finding the matching entry is called Tag Matching, and it is performed on the HCA instead of the CPU. This is useful for cases where incoming messages are consumed not in the order they arrive, but rather based on a numeric identifier coordinated with the sender.
Hardware Tag Matching frees the CPU for other application needs. Currently, Hardware Tag Matching is supported for the accelerated RC and DC transports (RC_X and DC_X), and can be enabled in UCX with the following environment parameters:
For the RC_X transport:
For the DC_X transport:
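The exact parameters are not shown above; in current UCX releases the accelerated transports expose TM_ENABLE variables, so the two cases can be sketched as:

```shell
# RC_X transport:
$ mpirun -mca pml ucx -x UCX_TLS=rc_x -x UCX_RC_MLX5_TM_ENABLE=y ...
# DC_X transport:
$ mpirun -mca pml ucx -x UCX_TLS=dc_x -x UCX_DC_MLX5_TM_ENABLE=y ...
```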
By default, only messages larger than a certain threshold are offloaded to the transport. This threshold is managed by the “UCX_TM_THRESH” environment variable (its default value is 1024 bytes).
UCX may also use bounce buffers for hardware Tag Matching, offloading internal pre-registered buffers instead of user buffers up to a certain threshold. This threshold is controlled by the UCX_TM_MAX_BB_SIZE environment variable. The value of this variable has to be equal or less than the segment size, and it must be larger than the value of UCX_TM_THRESH to take effect (1024 bytes is the default value, meaning that optimization is disabled by default).
With hardware Tag Matching enabled, the Rendezvous threshold is limited by the segment size, which is controlled by the UCX_RC_MLX5_TM_MAX_BCOPY and UCX_DC_MLX5_TM_MAX_BCOPY variables (for the RC_X and DC_X transports, respectively). Thus, the real Rendezvous threshold is the minimum value between the segment size and the value of the UCX_RNDV_THRESH environment variable.
Hardware Tag Matching for InfiniBand requires MLNX_OFED v4.1-x.x.x.x and above.
Hardware Tag Matching for RoCE is not supported.
For further information, refer to
Single Root IO Virtualization (SR-IOV)
SR-IOV is a technology that allows a physical PCIe device to present itself multiple times through the PCIe bus. This technology enables multiple virtual instances of the device with separate resources. These virtual functions can then be provisioned separately. Each VF can be seen as an additional device connected to the Physical Function. It shares the same resources with the Physical Function, and its number of ports equals that of the Physical Function.
This feature is supported on ConnectX-5 HCAs and above only. To enable SR-IOV in UCX while it is configured in the fabric, use the following environment parameter:
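Consistent with the note on UCX_IB_ADDR_TYPE elsewhere in this chapter, the parameter can be passed as:

```shell
$ mpirun -mca pml ucx -x UCX_IB_ADDR_TYPE=ib_global ...
```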
- This environment parameter should also be used when running UCX on a fabric with Socket Direct HCA installed. When working with Socket Direct HCAs, make sure Multi-Rail feature is enabled as well (refer to 39273289.).
- SR-IOV is not supported with the dc and dc_x transports in UCX.
Adaptive Routing (AR) enables sending messages between two HCAs on different routes, based on the network load. While in static routing a packet that arrives at the switch is forwarded based on its destination only, in Adaptive Routing the packet can be sent through any of the ports it may be forwarded to, so the load is balanced between ports and the fabric adapts to load changes over time. This feature requires support for out-of-order arrival of messages, which UCX has for the rc, rc_x and dc_x transports.
To be able to use Adaptive Routing on the fabric, make sure it is enabled in OpenSM and in the switches.
Enabling Adaptive Routing on a certain SL is done according to the following table.
Adaptive routing is not supported for OpenSHMEM applications.
Error Handling enables UCX to handle errors that occur when running algorithms with fault-recovery logic. To handle such errors, a new mode was added, guaranteeing an accurate status on every sent message. In addition, the process classifies errors by their origin (i.e., local or remote) and severity, thus allowing the user to decide how to proceed and what the recovery method might be. To use Error Handling in UCX, the user must register with the UCP API (for example, the ucp_ep_create API function needs to be addressed).
CUDA environment support in HPC-X enables the use of NVIDIA’s GPU memory in UCX and HCOLL communication libraries for point-to-point and collective routines, respectively.
- CPU architecture: x86
- NVIDIA GPU architectures:
- CUDA v8.0 or higher - for information on how to install CUDA, refer to NVIDIA documents for CUDA Toolkit. This version of HPC-X is compiled with CUDA v11.2.
- Mellanox OFED GPUDirect RDMA plugin module - for information on how to install:
- Mellanox OFED - refer to MLNX_OFED webpage
- GPUDirect RDMA - refer to Mellanox OFED GPUDirect RDMA webpage
Once the NVIDIA software components are installed, it is important to verify that the GPUDirect RDMA kernel module is properly loaded on each of the computing systems where you plan to run the job that requires the GPUDirect RDMA.
To check whether the GPUDirect RDMA module is loaded, run:
service nv_peer_mem status
To run this verification on other Linux flavors:
lsmod | grep nv_peer_mem
- GDR COPY plugin module - GDR COPY is a fast copy library from NVIDIA, used to transfer between HOST and GPU. For information on how to install GDR COPY, refer to its GitHub webpage. Once GDR COPY is installed, it is important to verify that the gdrcopy kernel module is properly loaded on each of the compute systems where you plan to run the job that requires GDR COPY.
To check whether the GDR COPY module is loaded, run:
lsmod | grep gdrdrv
Multi-Rail enables users to use more than one of the active ports on the host, making better use of system resources, and allowing increased throughput. When using Socket Direct cards, the Multi-Rail capability becomes essential.
Each process would be able to use up to the first 4 active ports on the host in parallel (this 4 port limitation is for performance considerations), if the following parameters are set:
For setting the number of active ports to use for the Eager protocol, i.e. for small messages, please set the following parameter:
For setting the number of active ports to use for the Rendezvous protocol, i.e. for large messages, please set the following parameter:
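Using the parameter names listed below, enabling up to 4 ports for both protocols looks like:

```shell
# Use up to 4 active ports for small (eager) and large (rendezvous) messages
export UCX_MAX_EAGER_LANES=4
export UCX_MAX_RNDV_LANES=4
```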
Possible values for these parameters are 1, 2, 3 and 4. The default values are UCX_MAX_EAGER_LANES=1 and UCX_MAX_RNDV_LANES=2.
The Multi-Rail feature will be disabled while the Hardware Tag Matching feature is enabled.
Starting from HPC-X v2.8, multi-rail is also supported out-of-box for the client-server API. To enable or disable it, use the following environment parameter:
Memory in Chip (MEMIC)
The Memory in Chip feature allows using on-device memory for sending messages from the UCX layer. This feature is enabled by default on ConnectX-5 HCAs. It is supported only for the rc_x and dc_x transports in UCX.
The environment parameters that control this feature behavior are:
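The parameter names are not listed above; in current UCX releases the device-memory knobs for the accelerated transports are the DM_COUNT/DM_SIZE variables. Treat the names and values below as assumptions to verify with ucx_info -f:

```shell
# Assumed MEMIC-related knobs: number and size of device-memory segments
$ mpirun -mca pml ucx -x UCX_RC_MLX5_DM_COUNT=2 -x UCX_RC_MLX5_DM_SIZE=2k ...
```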
For more information on these parameters, please refer to the ucx_info utility: % $HPCX_UCX_DIR/bin/ucx_info -f.
UCX supports the usage of a non-default PKey. In order to specify which PKEY value to use, please set it with the following environment parameter:
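The parameter is UCX_IB_PKEY; the value below is illustrative and must exist in the fabric:

```shell
# Use a non-default PKey (valid values: 0 - 0x7fff)
export UCX_IB_PKEY=0x2
```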
Valid values are between 0 - 0x7fff.
In an environment where the default PKey is not found, the PKey in index 0 will be used.
When using the UCX client-server API for connection establishment, it is also possible to have a graceful teardown, i.e., a disconnection, between each client and the server it is connected to, at the end of the communication. Either side can initiate the disconnection.
UCX now supports RoCE LAG out-of-box.
UCX is now able to detect a RoCE LAG device and automatically create two RDMA connections to utilize the full bandwidth of LAG interface.
For Ethernet packets, the network switch path is usually determined by a hash function on the packet's IP and UDP header fields. In order to force using distinct paths for various switch topologies, it is possible to set the UCX_ROCE_PATH_FACTOR=<n> environment variable to influence the UDP.source_port field: the first connection will use UDP.source_port=0xC000, while the second connection will use UDP.source_port=0xC000+<n>.
The default value for UCX_ROCE_PATH_FACTOR is 1. This feature is currently supported for RC transport only.
Flow Control for RDMA Read Operations
This feature is intended to prevent network congestion when many processes send messages to the same destination. To reduce network pressure, the user may limit the number of simultaneously transferred data by setting UCX_RC_TX_NUM_GET_BYTES environment variable to a certain value (e.g. 10MB). In addition, to achieve better pipelining of network transfer and data processing, the user may limit the maximal message size which can be transferred using RDMA Read operation by setting UCX_RC_MAX_GET_ZCOPY environment variable to a certain value (e.g. 64KB).
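The two limits described above can be set together; the values are the illustrative ones from the text:

```shell
# Cap the amount of simultaneously transferred RDMA Read data
export UCX_RC_TX_NUM_GET_BYTES=10MB
# Cap the maximal message size transferred by a single RDMA Read
export UCX_RC_MAX_GET_ZCOPY=64KB
```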
PCIe Relaxed Ordering Support
UCX supports enabling Relaxed Ordering for PCIe Write transactions in order to improve performance on systems where the PCI bandwidth of relaxed-ordered Writes is higher than that of the default strict-ordered Writes.
The environment variable UCX_IB_PCI_RELAXED_ORDERING can force a specific behavior: “on” enables relaxed ordering; “off” disables it; while “auto” (default) sets relaxed ordering mode based on the system type.
UCX Configuration File
The UCX configuration file enables the user to apply configuration variables set by the user in the $HPCX_UCX_DIR/etc/ucx/ucx.conf file. A configuration file can be created with initial default values by running
"ucx_info -fC > $HPCX_UCX_DIR/etc/ucx/ucx.conf".
The values are applied in the following order of precedence:
1. If an environment variable is set explicitly, it overrides the file's configuration.
2. Otherwise, value from $HPCX_UCX_DIR/etc/ucx/ucx.conf is used if it exists.
3. Otherwise, default (compile-time) value is used.
The configuration file applies settings only to the host where it is located.
Instrumentation and Monitoring FUSE-based Tool
This new functionality enables the user to analyze UCX-based applications in runtime. The tool is based on Filesystem in Userspace (FUSE) interface. If the feature is enabled, a directory for each process using UCX will be created in
/tmp/ucx. The directory name is the PID of the target process. The process directory contains three sub-directories: UCP, UCT, UCS.
This feature requires rebuilding UCX with the "--with-fuse3" flag in the configure line. UCX inside HPC-X is not built with this option by default.
While building, UCX checks for the presence of the fuse3 library and enables building the tool. Once UCX is built, the ucx_vfs binary is created in the install directory and is used to launch a daemon process and enable analysis of UCX-based applications.
You can use the UCX_VFS_ENABLE environment variable to control the feature. It is set to 'y' by default. Setting the variable to 'n' disables creating the service thread in the user's UCX application.
For the feature to function properly, the following is required:
- fuse3 utilities to run the daemon and analyze applications
- fuse3 library to build the tool
- The ucx_vfs daemon must be started before the target processes. If the number of processes exceeds the limit, fs.inotify.max_user_instances must be increased.
- If the user starts simultaneously more than the maximum allowed number of processes and then starts the daemon, only the first processes that meet the limit will be monitored by the tool.
ucx_perftest is a client-server based application designed to test UCX's performance and perform sanity checks.
To run it, two terminals are required to be opened, one on the server side and one on the client side.
The working flow is as follows:
- The server listens to the request coming from the client.
- Once a connection is established, UCX sends and receives messages between the two sides according to what the client requested.
- The results of the communications are displayed.
For further information, run: $HPCX_HOME/ucx/bin/ucx_perftest -help.
- From the server side, run:
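The server command was not included above; ucx_perftest acts as the server when started without a host name argument, e.g.:

```shell
$HPCX_HOME/ucx/bin/ucx_perftest -t ucp_am_bw
```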
- From the client side, run:
$HPCX_HOME/ucx/bin/ucx_perftest <server_host_name> -t ucp_am_bw
Among other parameters, you can specify the test you would like to run, the message size and the number of iterations.
Generating UCX Statistics for Open MPI/OpenSHMEM
In order to generate statistics, the statistics destination and trigger should be set, and they can optionally be filtered and/or formatted.
The destination is set by the UCX_STATS_DEST environment variable, whose value can be one of the following:
- "" (empty string) - statistics are not reported
- stdout - print to standard output
- stderr - print to standard error
- file:<filename> - save to a file; the following substitutions are made: %h: host, %p: pid, %c: cpu, %t: time, %e: exe
- udp:<host>[:<port>] - send over UDP to the given host:port
The trigger is set by the UCX_STATS_TRIGGER environment variable. It can be one of the following:
- exit - dump statistics just before exiting the program
- timer:<interval> - dump statistics periodically; the interval is given in seconds
- signal:<signo> - dump statistics when the process is signaled
It is possible to filter the counters in the report using the UCX_STATS_FILTER environment parameter. It accepts a comma-separated list of glob patterns specifying counters to display. The statistics summary will contain only the matching counters. The order is not meaningful. Each expression in the list may contain any of the following options:
- * - matches any number of any characters, including none (prints a full report)
- ? - matches any single character
- [abc] - matches one character given in the bracket
- [a-z] - matches one character from the range given in the bracket
More information about this parameter can be found at: https://github.com/openucx/ucx/wiki/Statistics
It is possible to control the formatting of the statistics using the UCX_STATS_FORMAT parameter:
- full - each counter is displayed in a separate line
- agg - each counter is displayed in a separate line, with aggregation of similar counters
- summary - all counters are printed in the same line
The statistics feature is only enabled when UCX is compiled with the --enable-stats flag. This flag is set to 'No' by default. Therefore, in order to use the statistics feature, please recompile UCX using the contrib/configure-prof file, or use the 'debug' version of UCX, which can be found in $HPCX_UCX_DIR/debug:
$ mpirun -mca pml ucx -x LD_PRELOAD=$HPCX_UCX_DIR/debug/lib/libucp.so ...
Please note that recompiling UCX using the aforementioned methods may impact the performance.