The following is a list of general limitations and known issues of the various components of this HPC-X release.

Reference Number / Issue

Description: To get the best performance when running on a ConnectX-7 NDR400 fabric, the following parameter should be set with mpirun.


Workaround: N/A
Keywords: ConnectX-7; UCX; mpirun
Discovered in Version: 2.11 (UCX 1.13)


Description: UCX job may hang when the DC transport is used.


Workaround:
  • Exclude RoCE LAG devices from the list of available devices (managed by the UCX_NET_DEVICES environment variable) and make sure UCX_IB_NUM_PATH is set to 1.
  • Exclude DC from the list of available transports (managed by the UCX_TLS environment variable), e.g. set UCX_TLS=sm,self,rc,tcp.
Keywords: UCX
Discovered in Version: 2.9 (UCX 1.11)
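
As a minimal sketch of the UCX_TLS option above, the snippet below removes every DC entry from a transport list before exporting it. The helper name strip_dc and the starting list are illustrative assumptions, not part of UCX or HPC-X.

```shell
#!/bin/sh
# Hypothetical helper (not part of UCX): remove entries that start with "dc"
# (for example dc, dc_x) from a comma-separated transport list.
strip_dc() {
    printf '%s\n' "$1" | tr ',' '\n' | grep -v '^dc' | paste -sd, -
}

# The starting list is an example value only.
UCX_TLS=$(strip_dc "sm,self,rc,dc_x,tcp")
export UCX_TLS
echo "UCX_TLS=$UCX_TLS"   # prints: UCX_TLS=sm,self,rc,tcp
```

Filtering an existing list, rather than hard-coding a new one, keeps any other transports a site already relies on.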


Description: Once the TCP transport detects a “Connection reset by peer” failure on a connection, it stops sending data, and the MPI/SHMEM application hangs.

Error printouts from UCP/UCT can be seen in the log.

Workaround: At small scale, change UCX_TLS=tcp to UCX_TLS=sm,tcp. At larger scale, this workaround is not applicable.

Keywords: UCX hang
Discovered in Version: 2.9 (UCX 1.11)


Description: NCCL plugin works only with NCCL v2.8 or higher.

Workaround: Build plugin version v2.0 from the following source.

Keywords: NCCL Plugin
Discovered in Version: 2.7 (NCCL 2.1)
Description: UD timeout error may appear.
Workaround: Disable the UD transport and use DC instead. Set UCX_TLS=dc_x,self,sm.
Keywords: UD, DC, timeout, UCX
Discovered in Version: 2.7 (UCX 1.9)
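
The workaround above can be applied through the environment of the launching shell. The sketch below also shows one possible sanity check, assuming the ucx_info utility is on PATH. The check is skipped when it is not.

```shell
#!/bin/sh
# Use DC instead of UD, as in the workaround above.
export UCX_TLS=dc_x,self,sm

# Optional sanity check: ucx_info -c prints the effective UCX configuration.
# Skipped quietly when ucx_info is not on PATH.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -c | grep '^UCX_TLS=' || true
fi
echo "UCX_TLS=$UCX_TLS"
```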

Description: On some platforms, GPUDirect RDMA does not work reliably when the path between HCA and GPU traverses QPI link.

Workaround: Disable GPUDirect support in UCX by setting UCX_IB_GPU_DIRECT_RDMA=n.

Keywords: GPUDirect, RDMA, UCX
Discovered in Version: 2.7 (UCX 1.9)

Description: UCX may fail to compile with Clang version 9 if the --dynamic-list-data flag is used during compilation.

(Github issue:

Workaround: [optional] Compile UCX without this flag. Note, however, that ucx_perftest will then be unavailable.
Keywords: Clang compiler, UCX
Discovered in Version: 2.6 (UCX 1.8)
Description: When using GPU memory on an InfiniBand network with GPUDirect enabled but without the gdrcopy library, small-message performance can be low.
Workaround: Use the Rendezvous protocol by setting the UCX_RNDV_THRESH parameter to 0.
Keywords: GPU, GPUDirect, memory
Discovered in Version: 2.6 (UCX 1.8)
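
One way to apply this workaround only where it is needed is sketched below. It assumes that gdrcopy is installed as libgdrapi and is visible in the dynamic linker cache. UCX may locate the library by other means, so treat the check as a heuristic rather than as the test UCX itself performs. The function name has_gdrcopy is hypothetical.

```shell
#!/bin/sh
# Heuristic (assumption): gdrcopy is present if libgdrapi appears in the
# dynamic linker cache. Returns non-zero when ldconfig is unavailable.
has_gdrcopy() {
    command -v ldconfig >/dev/null 2>&1 &&
        ldconfig -p 2>/dev/null | grep -q 'libgdrapi'
}

# Without gdrcopy, force the Rendezvous protocol for all message sizes.
if ! has_gdrcopy; then
    export UCX_RNDV_THRESH=0
fi
echo "UCX_RNDV_THRESH=${UCX_RNDV_THRESH:-unset}"
```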

Description: Adaptive Routing is not supported with OpenSHMEM applications.

(Github issue:

Workaround: N/A
Keywords: Adaptive Routing, AR, OpenSHMEM, OSHMEM
Discovered in Version: 2.5 (OpenSHMEM 1.4)


Description: On ConnectX-4 and Connect-IB HCAs, when the DC transport is used at large scale, “Retry exceeded” messages may be printed from UCX.

Workaround: Configure SL2VL in the fabric's OpenSM, and make UCX use SL=1 for the InfiniBand transports by passing '-x UCX_IB_SL=1' to mpirun.

Keywords: UCX, DC transport, ConnectX-4, Connect-IB

Discovered in Version: 2.1 (UCX 1.3)
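
A sketch of how the '-x UCX_IB_SL=1' option might appear on an mpirun command line follows. The rank count and the application name ./my_app are placeholders, and the script only prints the command instead of launching it.

```shell
#!/bin/sh
# Placeholders (assumptions): 4 ranks and an application binary ./my_app.
NP=4
APP=./my_app

# -x exports the variable to every launched rank.
CMD="mpirun -np $NP -x UCX_IB_SL=1 $APP"
echo "$CMD"   # prints: mpirun -np 4 -x UCX_IB_SL=1 ./my_app
```

The SL2VL mapping itself is set on the OpenSM side and is not covered by this sketch.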


Description: When UCX requires more shared memory segments than the limit defined in /proc/sys/kernel/shmmni, the following message is printed from UCX:

“... total number of segments in the system (%lu) would exceed the limit in /proc/sys/kernel/shmmni (=%lu)... please check shared memory limits by 'ipcs -l'”.

Workaround: Follow the instructions in the error message above and increase the maximum number of shared memory segments in /proc/sys/kernel/shmmni.

Keywords: UCX, memory

Discovered in Version: 2.1 (UCX 1.3)
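
To compare the current limit with actual usage before changing anything, a short Linux-only check such as the following may help. The value 8192 in the closing comment is an arbitrary example, not a recommendation from this release.

```shell
#!/bin/sh
# Linux only: both files below are kernel interfaces and may be missing elsewhere.
if [ -r /proc/sys/kernel/shmmni ]; then
    limit=$(cat /proc/sys/kernel/shmmni)
else
    limit=unknown
fi

# /proc/sysvipc/shm holds one header line plus one line per existing segment.
if [ -r /proc/sysvipc/shm ]; then
    in_use=$(( $(wc -l < /proc/sysvipc/shm) - 1 ))
else
    in_use=unknown
fi

echo "shmmni limit: $limit, segments in use: $in_use"

# Raising the limit needs root. 8192 is an arbitrary example value:
#   sysctl -w kernel.shmmni=8192
```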


Description: UCX currently does not support canceling send requests.

(Github issue:

Workaround: N/A

Keywords: UCX

Discovered in Version: 2.0


Description: UCX job hangs with SocketDirect/MultiHost/SR-IOV.

Workaround: Set UCX_IB_ADDR_TYPE=ib_global

Keywords: UCX


Description: Because the UCX embedded in HPC-X is compiled with AVX support, it cannot run on hosts without AVX support.

Workaround: Recompile the UCX shipped with HPC-X with AVX disabled (--with-avx=no):

$ ./utils/ --rebuild-ucx --ucx-extra-config "--with-avx=no"

Keywords: UCX
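
Whether a host needs the rebuild can be checked from the CPU flags. The sketch below is Linux-only and reads /proc/cpuinfo. The function name has_avx and its optional file argument are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helper: succeed if "avx" appears as a whole word in the CPU
# flags. The optional argument (default /proc/cpuinfo) exists for testing.
has_avx() {
    grep -qw avx "${1:-/proc/cpuinfo}" 2>/dev/null
}

if has_avx; then
    echo "AVX available"
else
    echo "AVX not detected: rebuild UCX with --with-avx=no"
fi
```

Matching on a whole word avoids false positives from flags such as avx2 on a flags line that lacks plain avx.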