Unified Communication - X Framework Library
Unified Communication - X Framework (UCX) is an acceleration library, integrated into the Open MPI (as a pml layer) and to OpenSHMEM (as an spml layer) and available as part of HPC-X. It is an open source communication library designed to achieve the highest performance for HPC applications. UCX has a broad range of optimizations for achieving low-software overheads in communication path which allows near native-level performance.
UCX supports receive side tag matching, one-sided communication semantics, efficient memory registration and a variety of enhancements which increase the scalability and performance of HPC applications significantly.
UCX is also highly useful for storage, big-data and cloud domains where client-server based applications are used.
UCX supports:
InfiniBand transports:
Reliable Connected (RC)
Unreliable Datagram (UD)
Dynamically Connected (DC)
NoteDC is supported on Connect-IB®/ConnectX®-4 and above HCAs with MLNX_OFED v5.0 and higher.
Accelerated verbs
Shared memory communication, including CMA and XPMEM
RoCE
TCP
CUDA/gdrcopy
For further information on UCX, please refer to https://github.com/openucx/ucx and http://www.openucx.org/
Supported CPU Architectures
Unified Communication - X Framework (UCX) supported CPU architectures are: x86_64 and aarch64.
As of HPC-X v2.1, UCX is set as the default pml for Open MPI, default spml for OpenSHMEM.
Using UCX with OpenMPI
UCX is the default pml in Open MPI and the default spml in OpenSHMEM.
$mpirun --mca pml ucx -mca osc ucx ...
$oshrun --mca spml ucx -mca atomic ucx ...
Configuring UCX with XPMEM
By default, UCX library embedded within HPC-X is compiled with an open source version of the XPMEM driver. The recommended version of the XPMEM driver is from: https://github.com/openucx/xpmem.
In order to compile UCX with another version of XPMEM, follow the steps below:
Make sure your host has XPMEM headers and the userspace library is installed.
Untar the UCX sources available inside the $HPCX_HOME/sources directory, and recompile UCX:
% ./autogen.sh % ./contrib/configure-release --with-xpmem=/path/to/xpmem --prefix=/path/to/
new
/ucx/install % make -j8 installNote: In case the new UCX version is installed in a different location, use LD_LIBRARY_PATH for Open MPI to use the new location:
% mpirun -mca pml ucx -x LD_LIBRARY_PATH=/path/to/
new
/ucx/install/lib:$LD_LIBRARY_PATH ...NoteWhen UCX is compiled from sources, it can be compiled with tuning for the specific CPU on the system (instead of the default build, which aims to be compatible with a broad range of CPU models).
To accomplish this, please compile UCX with:./contrib/configure-release --enable-optimizations
The default UCX settings are already optimized for out-of-box performance. To check the available UCX parameters and their default values, run the '$HPCX_UCX_DIR/bin/ucx_info -f' utility.
To check the UCX version, run:
$HPCX_UCX_DIR/bin/ucx_info -v
UCX parameters can be modified using one of the following methods:
Modifying the default UCX parameters value as part of the mpirun:
mpirun -x UCX_RC_VERBS_RX_MAX_BUFS=
128000
<...>
Modifying the default UCX parameters value from SHELL (when running as part of a resource manager job):
export UCX_RC_VERBS_RX_MAX_BUFS=
128000
mpirun <...>(when running as part of a resource manager job):
Selecting the transports to use from the command line:
mpirun -mca pml ucx -x UCX_TLS=sm,rc_x ...
The above command will select pml ucx and set its transports for usage, shared memory, and accelerated verbs.
Excluding specific transports from the command line:
mpirun -mca pml ucx -x UCX_TLS=^rc ...
The above command line will select pml ucx and use all its available transports except for rc. The rc transport will be excluded from usage.
NoteAs of HPC-X v2.5, shared memory has new transport naming. The available shared memory transports are: posix, sysv and xpmem.
The 'device' name for the shared memory transport is 'memory' (for usage in UCX_SHM_DEVICES).
When selecting one of the several devices or interfaces in the server, please use the UCX_NET_DEVICES flag to specify which RDMA device you would like to use.
$mpirun -mca pml ucx -x UCX_NET_DEVICES=mlx5_1:
1
The above command will select pml ucx and set the HCA for usage, mlx5_1, port 1.
Improving performance at scale by increasing the value of DC initiator QPs (DCI) number used by the interface when using the DC transport:
mpirun -mca pml ucx -x UCX_TLS=sm,dc_x -x UCX_DC_MLX5_NUM_DCI=
16
In case the DC transport is not available or disabled on a large scale, UCX will fall back to the UD transport.
The RC transport is disabled after 256 established connections. The counter of established connections can be overridden using the UCX_RC_MAX_NUM_EPS environmental parameter.
Running UCX on a RoCE port, by:
Configuring the fabric as lossless (see RoCE Deployment Community post), and setting UCX_IB_TRAFFIC_CLASS=106.
OR
Setting the specific port using the UCX_NET_DEVICES environment variable. For example:
mpirun -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:
1
By default, RoCE v2 and IPv4 are used, if available. Otherwise, RoCE v1 with MAC address is used. In order to set a specific RoCE version to use, set UCX_IB_GID_INDEX to the index of the required RoCE version and address type, as reported by “show_gids” command. For example:
mpirun -x UCX_NET_DEVICES=mlx5_0:
1
-x UCX_TRAFFIC_CLASS=106
-x UCX_IB_GID_INDEX=3
Setting the threshold for using the Rendezvous protocol in UCX:
mpirun -mca pml ucx -x UCX_RNDV_THRESH=
16384
By default, UCX will calculate the optimal threshold on its own, but the value can be overwritten using the above environment parameter.
Setting the threshold for using the zero-copy in UCX:
mpirun -mca pml ucx -x UCX_ZCOPY_THRESH=
16384
By default, UCX will calculate the optimal threshold on its own, but the value can be overwritten using the above environment parameter.
Setting UCX_IB_ADDR_TYPE=ib_global when running on GID-based multi-host setup (see also Single Root IO Virtualization (SR-IOV) section below).
Enabling various optimizations intended for homogeneous environment. Enabling this mode implies that the local transport resources/devices of all entities that connect to each other are the same.
UCX_UNIFIED_MODE=y
Using -x UCX_IB_SUBNET_PREFIX to filter for the InfiniBand subnet prefix (empty means no filter). This is relevant for IB link layer only. For example, a filter for the default subnet prefix can be specified as follows: fe80:0:0:0.
Specifying how DC initiator (DCI) is selected by the endpoint with UCX_DC_MLX5_TX_POLICY=<policy> (relevant for DC transport only). The policy options are:
Policy
Description
dcs
The endpoint either uses already assigned DCI, or DCI is allocated in a LIFO order and gets released once it has no outstanding operations
dcs_quota
Same as "dcs". In addition, the DCI is scheduled for release in case it has sent more than one quota and there are endpoints waiting for a DCI. The DCI is released once it completes all outstanding operations. This policy ensures that there will be no starvation among endpoints
rand
Every endpoint is assigned with a randomly selected DCI. Multiple endpoints may share the same DCI
Using UCX CUDA memory hooks may not work with static building CUDA applications. As a workaround, extend the configuration with the following options:
-x UCX_MEMTYPE_CACHE=
0
-x HCOLL_GPU_CUDA_MEMTYPE_CACHE_ENABLE=0
-x HCOLL_GPU_ENABLE=1
Disabling GPU memory staging protocols and using only GPUDirectRDMA, if possible:
-x UCX_RNDV_SCHEME=get_zcopy
Running the application on close NUMA nodes:
mpirun -mca rmaps_dist_device <HCA name> -mca rmaps_base_mapping_policy dist:span
The shared memory new transport naming:
The available shared memory transport names are: posix, sysv and xpmem.
'sm' and 'mm' will include all the three mentioned above.
The 'device' name for the shared memory transport is 'memory' (for usage in UCX_SHM_DEVICES)To get more information in case of any error (for troubleshooting purposes), please set the following environment parameter:
mpirun -mca pml ucx -x UCX_LOG_LEVEL=diag ...
DC full handshake config can be set by the environment variables UCX_DC_MLX5_DCI_FULL_HANDSHAKE, UCX_DC_MLX5_DCI_KA_FULL_HANDSHAKE, UCX_DC_MLX5_DCT_FULL_HANDSHAKE. Possible values are: on / off / auto, with the default being “off”. In auto mode, FH will be used according to the AR config of the SL in use (if the SL is with AR – FH will be used, otherwise – HH).
Hardware Tag Matching
Starting ConnectX-5, Tag Matching previously done by the software, can now be offloaded in UCX to the HCA. For MPI applications, sending messages with numeric tags accelerates the processing of incoming messages, leading to better CPU utilization and lower latency for expected messages. In Tag Matching, the software holds a list of matching entries called matching list. Each matching entry contains a tag and a pointer to an application buffer. The matching list is used to steer arriving messages to a specific buffer according to the message tag. The action of traversing the matching list and finding the matching entry is called Tag Matching, and it is performed on the HCA instead of the CPU. This is useful for cases where incoming messages are consumed not in the order they arrive, but rather based on numeric identifier coordinated with the sender.
Hardware Tag Matching avails the CPU for other application needs. Currently, Hardware Tag Matching is supported for the accelerated RC and DC transports (RC_X and DC_X), and can be enabled in UCX with the following environment parameters:
For the RC_X transport:
UCX_RC_MLX5_TM_ENABLE=y
For the DC_X transport:
UCX_DC_MLX5_TM_ENABLE=y
By default, only messages larger than a certain threshold are offloaded to the transport. This threshold is managed by the “UCX_TM_THRESH” environment variable (its default value is 1024 bytes).
UCX may also use bounce buffers for hardware Tag Matching, offloading internal pre-registered buffers instead of user buffers up to a certain threshold. This threshold is controlled by the UCX_TM_MAX_BB_SIZE environment variable. The value of this variable has to be equal or less than the segment size, and it must be larger than the value of UCX_TM_THRESH to take effect (1024 bytes is the default value, meaning that optimization is disabled by default).
With hardware Tag Matching enabled, the Rendezvous threshold is limited by the segment size, which is controlled by UCX_RC_MLX5_TM_MAX_BCOPY or UCX_DC_MLX5_TM_MAX_BCOPY variables (for RC_X and DC_X transports, respectively). Thus, the real Rendezvous threshold is the minimum value between the segment size and the value of UCX_RNDV_THRESH environment variable.
Hardware Tag Matching for InfiniBand requires MLNX_OFED v4.1-x.x.x.x and above.
Hardware Tag Matching for RoCE is not supported.
For further information, refer to Understanding Tag Matching for Developers post.
Single Root IO Virtualization (SR-IOV)
SR-IOV is a technology that allows a physical PCIe device to present itself multiple times through the PCIe bus. This technology enables multiple virtual instances of the device with separate resources. These virtual functions can then be provisioned separately. Each VF can be seen as an additional device connected to the Physical Function. It shares the same resources with the Physical Function, and its number of ports equals those of the Physical Function.
This feature is supported on ConnectX-5 HCAs and above only.To enable SR-IOV in UCX while it is configured in the fabric, use the following environment parameter:
UCX_IB_ADDR_TYPE=ib_global
Notes:
- This environment parameter should also be used when running UCX on a fabric with Socket Direct HCA installed. When working with Socket Direct HCAs, make sure Multi-Rail feature is enabled as well (refer to Multi-Rail.).
- SRI-OV is not supported with dc and dc_x transports in UCX.
Adaptive Routing
Adaptive Routing (AR) enables sending messages between two HCAs on different routes, based on the network load. While in static routing, a packet that arrives to the switch is forwarded based on its destination only, in Adaptive Routing, the packet is loaded to all possible ports that the packet can be forwarded to, resulting in the load being balanced between ports, and the fabric adapting to load changes over time. This feature requires support for out-of-order arrival of messages, which UCX has for the RC, rc_x and dc_x transports.
To be able to use Adaptive Routing on the fabric, make sure it is enabled in OpenSM and in the switches.
Enabling Adaptive Routing on a certain SL is done according to the following table.
UCX_IB_AR_ENABLE=yes |
UCX_IB_AR_ENABLE=no |
UCX_IB_AR_ENABLE=try |
UCX_IB_AR_ENABLE=auto |
||
UCX_IB_SL=auto |
AR enabled on some SLs |
Use 1st SL with AR |
Use 1st SL without AR |
Use 1st SL with AR |
Use SL=0 |
AR enabled on all SLs |
Use SL=0 |
Failure |
Use SL=0 |
Use SL=0 |
|
AR disabled on all SLs |
Failure |
Use SL=0 |
Use SL=0 |
Use SL=0 |
|
UCX_IB_SL=<sl> |
AR enabled on <sl> |
Use SL=<sl> |
Failure |
Use SL=<sl> |
Use SL=<sl> |
AR disabled on <sl> |
Failure |
Use SL=<sl> |
Use SL=<sl> |
Use SL=<sl> |
Adaptive routing is not supported for OpenSHMEM applications.
Error Handling
Error Handling enables UCX to handle errors that occur due to algorithms with fault recovery logic. To handle such errors, a new mode was added, guaranteeing an accurate status on every sent message. In addition, the process classifies errors by their origin (i.e. local or remote) and severity, thus allowing the user to decide how to proceed and what would that possibly recovery method be. To use Error Handling in UCX, the user must register with the UCP API (the ucp_ep_create API function needs to be addressed, for example)
CUDA GPU
Overview
CUDA environment support in HPC-X enables the use of NVIDIA’s GPU memory in UCX and HCOLL communication libraries for point-to-point and collective routines, respectively.
System Requirements
- CUDA v12.0 or higher - for information on how to install CUDA, refer to NVIDIA documents for CUDA Toolkit.
- For GPUDirect support, need either a recent kernel with dmabuf support or a GPUDirect RDMA driver.
To install GPUDirect RDMA driver with MLNX_OFED:
- MLNX_OFED - refer to MLNX_OFED webpage
GPUDirect RDMA - refer to MLNX_OFED GPUDirect RDMA webpageOnce the NVIDIA software components are installed, it is important to verify that the GPUDirect RDMA kernel module is properly loaded on each of the computing systems where you plan to run the job that requires the GPUDirect RDMA.
To check whether the GPUDirect RDMA module is loaded, run: service nv_peer_mem status To run this verification on other Linux flavors:
lsmod | grep nv_peer_mem
The optional gdrcopy driver optimizes the transfer latency of small messages residing in GPU memory. For information on how to install GDR COPY, refer to its GitHub webpage.
NoteNOTE: it is important to make sure that the gdrcopy driver is installed properly on each of the compute nodes taking part in the MPI job.
To check whether the GDR COPY module is loaded, run:
lsmod | grep gdrdrv
Multi-Rail
Multi-Rail enables users to use more than one of the active ports on the host, making better use of system resources, and allowing increased throughput. When using Socket Direct cards, the Multi-Rail capability becomes essential.
Each process would be able to use up to the first 4 active ports on the host in parallel (this 4 port limitation is for performance considerations), if the following parameters are set:
For setting the number of active ports to use for the Eager protocol, i.e. for small messages, please set the following parameter:
% mpirun -mca pml ucx -x UCX_MAX_EAGER_RAILS=4
...
For setting the number of active ports to use for the Rendezvous protocol, i.e. for large messages, please set the following parameter:
% mpirun -mca pml ucx -x UCX_MAX_RNDV_RAILS=4
...
Possible values for these parameters are 1, 2, 3 and 4. The default values are UCX_MAX_EAGER_LANES =1, and UCX_MAX_RNDV_LANES = 2.
The Multi-Rail feature will be disabled while the Hardware Tag Matching feature is enabled.
Starting from HPC-X v2.8, multi-rail is also supported out-of-box for the client-server API. To enable or disable it, use the following environment parameter:
UCX_CM_USE_ALL_DEVICES=y/n
Memory in Chip (MEMIC)
Memory in chip feature allows for using on-device memory for sending messages from the UCX layer. This feature is enabled by default on ConnectX-5 HCAs. It is supported only for the rc_x and dc_x transports in UCX.
The environment parameters that control this feature behavior are:
- UCX_RC_MLX5_DM_SIZE
- UCX_RC_MLX5_DM_COUNT
- UCX_DC_MLX5_DM_SIZE
- UCX_DC_MLX5_DM_COUNT
For more information on these parameters, please refer to the ucx_info utility: % $HPCX_UCX_DIR/bin/ucx_info -f.
PKey Support
UCX supports the usage of a non-default PKey. In order to specify which PKEY value to use, please set it with the following environment parameter: UCX_IB_PKEY.Valid values are between 0 - 0x7fff.In an environment where the default PKey is not found, the PKey in index 0 will be used.
Close Protocol
When using the UCX client-server API for connection establishment, it is also possible to have a graceful teardown, i.e a disconnection, between each pair of client and the server it's connected to, at the end of the communication. Either side can be the initiator of the disconnection.
RoCE LAG
UCX now supports RoCE LAG out-of-box.UCX is now able to detect a RoCE LAG device and automatically create two RDMA connections to utilize the full bandwidth of LAG interface.For Ethernet packets, the network switch path is usually determined by a hash function on the packet’s IP and UDP header fields. In order to force using distinct paths for various switch topologies, it is possible to set “UCX_ROCE_PATH_FACTOR=n” environment variable to influence UDP.source_port field: the first connection will use “UDP.source_port=0xC000”, while the second connection will use “UDP.source_port=0xC000+<n>”.The default value for UCX_ROCE_PATH_FACTOR is 1. This feature is currently supported for RC transport only.
Flow Control for RDMA Read Operations
This feature is intended to prevent network congestion when many processes send messages to the same destination. To reduce network pressure, the user may limit the number of simultaneously transferred data by setting UCX_RC_TX_NUM_GET_BYTES environment variable to a certain value (e.g. 10MB). In addition, to achieve better pipelining of network transfer and data processing, the user may limit the maximal message size which can be transferred using RDMA Read operation by setting UCX_RC_MAX_GET_ZCOPY environment variable to a certain value (e.g. 64KB).
PCIe Relaxed Ordering Support
UCX supports enabling Relaxed Ordering for PCIe Write transactions in order to improve performance on systems where the PCI bandwidth of relaxed-ordered Writes is higher than that of the default strict-ordered Writes.The environment variable UCX_IB_PCI_RELAXED_ORDERING can force a specific behavior: “on” enables relaxed ordering; “off” disables it; while “auto” (default) sets relaxed ordering mode based on the system type.
UCX Configuration File
The UCX configuration file enables the user to apply configuration variables set by the user in the $HPCX_UCX_DIR/etc/ucx/ucx.conf file. A configuration file can be created with initial default values by running "ucx_info -fC > $HPCX_UCX_DIR/etc/ucx/ucx.conf".The values are applied in the following order of precedence:1. If an environment variable is set explicitly, it overrides the file's configuration.2. Otherwise, value from $HPCX_UCX_DIR/etc/ucx/ucx.conf is used if it exists.3. Otherwise, default (compile-time) value is used.
The configuration file applies settings only to the host where it is located.
Instrumentation and Monitoring FUSE-based Tool
This new functionality enables the user to analyze UCX-based applications in runtime. The tool is based on Filesystem in Userspace (FUSE) interface. If the feature is enabled, a directory for each process using UCX will be created in /tmp/ucx. The directory name is the PID of the target process. The process directory contains three sub-directories: UCP, UCT, UCS.
UCX inside HPC-X is built with "--with-fuse3" option by default.
While building, UCX checks for fuse3 library presence and enables building the tool. Once UCX is built, the ucx_vfs binary will be created in the install directory and will be used to launch a daemon process and enable UCX-based applications analysis.
You can use the UCX_VFS_ENABLE environment variable to control the feature. It is set to ‘y’ by default. Setting the variable to ‘n’ disables creating the service thread in user’s UCX application.
Requirements
For the feature to function properly, the following is required:
- fuse3 utilities to run the daemon and analyze applications
- fuse3 library to build the tool
Limitations
- ucx_vfs daemon must be started before the target processes. Otherwise, if the number of processes exceeds the limit, fs.inotify.max_user_instances are increased.
- If the user starts simultaneously more than the maximum allowed number of processes and then starts the daemon, only the first processes that meet the limit will be monitored by the tool.
On-demand Paging (ODP)
On-demand paging mitigates the limitations of memory registration by allowing applications to avoid pinning down physical pages of the address space and tracking mapping validity.
Instead, the Host Channel Adapter (HCA) requests updated translations from the operating system when pages are absent, and the OS invalidates translations affected by non-present pages or mapping alterations.
With ODP, system memory is not locked and therefore is allowed to be swapped out if necessary.
The feature is controlled by UCX_REG_NONBLOCK_MEM_TYPES environment variable. UCX supports both ODP versions - ODPv1 and ODPv2. Which ODP version can be used depends on FW version and configuration.
To enable ODPv2 in UCX it is enough to specify UCX_REG_NONBLOCK_MEM_TYPES=host.
To enable ODPv1, in addition to setting UCX_REG_NONBLOCK_MEM_TYPES=host , devx objects creation must be disabled using UCX_IB_MLX5_DEVX_OBJECTS="".
This feature is enabled by default on Grace platforms. On all other platforms it is disabled by default and can be activated using environment variables described above.
ucx_perftest
A client-server based application which is designed to test UCX's performance and sanity checks.
To run it, two terminals are required to be opened, one on the server side and one on the client side.
The working flow is as follow:
The server listens to the request coming from the client.
Once a connection is established, UCX sends and receives messages between the two sides according to what the client requested.
The results of the communications are displayed.
For further information, run: $HPCX_HOME/ucx/bin/ucx_perftest -help .
Examples:
From the server side, run:
$HPCX_HOME/ucx/bin/ucx_perftestFrom the client side, run:
$HPCX_HOME/ucx/bin/ucx_perftest <server_host_name> -t ucp_am_bw
Among other parameters, you can specify the test you would like to run, the message size and the number of iterations.
In order to generate statistics, the statistics destination and trigger should be set, and they can optionally be filtered and/or formatted.
The destination is set by UCX_STATS_DEST environment variable whose values can be one of the following:
Value
Description
empty string
Statistics are not reported
stdout
Print to standard output
stderr
Print to standard error
file:<filename>
Save to a file. Following substitutions are made: %h: host, %p:pid, %c:cpu, %t: time, %e:exe
udp:<host>[:<port>]
Send over UDP to the given host:port
Example:
$ export UCX_STATS_DEST=
"file:ucx_%h_%e_%p.stats"
$ export UCX_STATS_DEST="stdout"
Trigger is set by UCX_STATS_TRIGGER environment variables. It can be one of the following:
Environment Variable
Description
exit
Dump statistics just before exiting the program
timer:<interval>
Dump statistics periodically, interval is given in seconds
signal:<signo>
Dump when processes signaled
Example:
$ export UCX_STATS_TRIGGER=exit $ export UCX_STATS_TRIGGER=timer:
3.5
It is possible to filter the counters in the report using the UCX_STATS_FILTER environment parameter. It accepts a comma-separated list of glob patterns specifying counters to display. Statistics summary will contain only the matching counters. The order is not meaningful. Each expression in the list may contain any of the following options:
Environment Variable
Description
*
Matches any number of any characters including none (prints a full report)
?
Matches any single character
[abc]
Matches one character given in the bracket
[a-z]
Matches one character from the range given in the bracket
More information about this parameter can be found at: https://github.com/openucx/ucx/wiki/StatisticsIt is possible to filter the counters in the report using the UCX_STATS_FILTER environment parameter. It accepts a comma-separated list of glob patterns specifying counters to display. Statistics summary will contain only the matching counters. The order is not meaningful. Each expression in the list may contain any of the following options:
It is possible to control the formatting of the statistics using the UCX_STATS_FORMAT parameter:
Environment Variable
Description
full
Each counter will be displayed in a separate line
agg
Each counter will be displayed in a separate line. However, there will also be an aggregation between similar counters
summary
All counters will be printed in the same line
NoteThe statistics feature is only enabled when UCX is compiled with the enable-stats flag. This flag is set to 'No' by default. Therefore, in order to use the statistics feature, please recompile UCX using the contrib/configure-prof file, or use the 'debug' version of UCX, which can be found in $HPCX_UCX_DIR/debug:
$ mpirun -mca pml ucx -x LD_PRELOAD=$HPCX_UCX_DIR/debug/lib/libucp.so ...
Please note that recompiling UCX using the aforementioned methods may impact the performance.
When there are several devices or interfaces in the server, please use the UCX_NET_DEVICES flag to specify which RDMA device you would like to use.