Date: 2026-01-12
Known Issues
The following is a list of general limitations and known issues of the various components of this HPC-X release.
Reference Number
Issue
4693674
Description: Performance degradation may occur when running
Workaround: Set the environment variable
Keywords: Performance; MPI: collectives; GPU
Discovered in Version: 2.25
4662365
Description: In certain cases, performance degradation may occur with
Workaround: Set the environment variable
Keywords: Performance; MPI; Collectives
Discovered in Version: 2.25
4546016
Description: HCOLL is not supported on GB200 and GB300 systems, and there are no plans to support it in the future.
Workaround: N/A
Keywords: HCOLL; GB200; GB300
Discovered in Version: 2.24
4422894
Description: When sending relatively large amounts of data, the following error may occur:
rc_verbs_iface.c:128 send completion with error: remote access error
Workaround: To avoid this issue, exclude the
Keywords: UCX; remote access error
Discovered in Version: 2.23
4177839
Description: When using a bidirectional traffic pattern in setups with 4 NICs, each connected to a separate NUMA node, a 10% degradation in bandwidth performance is observed compared to the full wire speed (FWS).
Workaround: Use a single lane for the RNDV operation instead of the default two lanes by setting the following environment variable:
Keywords: NUMA; BW; FWS
Discovered in Version: 2.22.x
3995982
Description: GPU device variables (obtained from
Workaround: Copy the contents of the device buffer to a bounce buffer allocated by
Keywords:
Discovered in Version: 2.21.0
4050321
Description: Significant bandwidth degradation occurs when the Global VA feature is enabled (by setting
Workaround: Avoid setting UCX_GVA_ENABLE=y to prevent potential bandwidth degradation.
Keywords: Global VA; GVA; ODP
Discovered in Version: 2.21.0
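A job wrapper script can guard against the setting being inherited from the environment. The following is a minimal sketch, assuming a POSIX shell wrapper runs before the job launch; only the variable name UCX_GVA_ENABLE comes from the entry above, and the warning text is illustrative.

```shell
#!/bin/sh
# Sketch: clear UCX_GVA_ENABLE if it was inherited with the value "y".
# Only the variable name is taken from the entry above; the message is illustrative.
if [ "${UCX_GVA_ENABLE:-}" = "y" ]; then
    echo "UCX_GVA_ENABLE=y is set; unsetting it to avoid bandwidth degradation" >&2
    unset UCX_GVA_ENABLE
fi
```

Unsetting the variable leaves UCX on its default behavior rather than forcing a specific value.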
4097336
Description: Enabling HW DCS (by setting
Workaround: Avoid setting
Keywords: DC; DCS; hang
Discovered in Version: 2.21.0
4139280
Description: Asynchronously allocated CUDA memory may not work correctly with the gdr_copy transport, potentially resulting in an error such as:
Workaround: Set the
Keywords: gdr_copy; memory registration; Stream Ordered CUDA Allocator
Discovered in Version: 2.21.0
4026461
Description: UCX atomic operations on Grace CPUs may fail with a remote access error.
Workaround: Disable DevX and KSM memory registration by setting
Keywords: Atomic; Grace
Discovered in Version: 2.20.0
3884209
Description: In certain scenarios, a significant performance degradation can be observed due to excessive memory registrations.
Workaround: Switch back to legacy protocols implementation by setting
Keywords: UCC, Performance
Discovered in Version: 2.19.0
3606732
Description: In some cases, when using CUDA buffers for intra-node transfers, the program may crash with an assertion `
In other cases, the error message "
Workaround: N/A
Keywords:
Discovered in Version: 2.17.0
3586369
Description: When UD transport is being used explicitly, the MPI or SHMEM job may hang during cleanup or
Workaround: Disable adaptive progress optimization by setting the environment variable
Keywords: Hang, UD, Flush
Discovered in Version: 2.17.0
3653404
Description: When registering a large memory region
Workaround: Disable multi-thread registration by setting the environment variable "
Keywords: Multi-Threaded, Indirect, Key Registration
Discovered in Version: 2.17.0
3606445
Description: The performance of
Workaround: Revert to previous thresholds selection logic by setting the environment variable to
Keywords: Performance,
Discovered in Version: 2.17.0
-
Description: In order to get the best performance when running on ConnectX-7 NDR400 fabric, the following parameter should be set with mpirun.
Workaround: N/A
Keywords: ConnectX-7; UCX; mpirun
Discovered in Version: 2.11 (UCX 1.13)
-
Description: Once the TCP detects a
Error printouts from the UCP/UCT can be seen in the log.
Workaround: On small scale cases, change the
Keywords: UCX hang
Discovered in Version: 2.9 (UCX 1.11)
-
Description: The NCCL plugin works only with NCCL v2.8 or higher.
Workaround: Build plugin version v2.0 from the following source:
https://github.com/Mellanox/nccl-rdma-sharp-plugins/tree/v2.0.x
Keywords: NCCL Plugin
Discovered in Version: 2.7 (NCCL 2.1)
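The build described in this entry might look like the sketch below. It assumes that the v2.0.x branch follows the usual autotools flow (autogen, configure, make); the function name and the install prefix are hypothetical, and the configure step may need additional options (for example, CUDA or SHARP locations) depending on the host, so check the README in the repository before relying on it.

```shell
#!/bin/sh
# Sketch: fetch and build the v2.0.x branch of the plugin named in the entry above.
# The function is only defined here; call it explicitly on a build host with
# git, autotools, and a compiler installed. It runs in a subshell so the
# caller's working directory is not changed.
build_nccl_plugin_v2() (
    prefix=${1:-"$HOME/nccl-rdma-sharp-plugins-v2.0"}   # hypothetical install prefix
    git clone --branch v2.0.x --depth 1 \
        https://github.com/Mellanox/nccl-rdma-sharp-plugins.git &&
    cd nccl-rdma-sharp-plugins &&
    ./autogen.sh &&                     # assumed autotools bootstrap step
    ./configure --prefix="$prefix" &&
    make &&
    make install
)
```

Calling `build_nccl_plugin_v2 /opt/nccl-plugin-v2` would install under that prefix instead of the default.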
-
Description: A UD timeout error may appear.
Workaround:
Keywords: UD, DC, timeout, UCX
Discovered in Version: 2.7 (UCX 1.9)
-
Description: When using GPU memory on an InfiniBand network with GPUDirect enabled but without the gdrcopy library, small-message performance can be low.
Workaround: Use the Rendezvous protocol by setting the UCX_RNDV_THRESH parameter to 0.
Keywords: GPU, GPUDirect, memory
Discovered in Version: 2.6 (UCX 1.8)
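The workaround in this entry can be applied from the shell before launching the job. This is a minimal sketch; the application name and process count in the commented launch line are placeholders, and the -x option assumes the Open MPI mpirun launcher.

```shell
# Force the rendezvous protocol for all message sizes, as described in the entry above.
export UCX_RNDV_THRESH=0

# Hypothetical launch line (./my_gpu_app and -np 2 are placeholders);
# -x forwards the exported variable to the ranks with Open MPI's mpirun:
#   mpirun -np 2 -x UCX_RNDV_THRESH ./my_gpu_app
echo "UCX_RNDV_THRESH=${UCX_RNDV_THRESH}"
```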
3672903/Github 4105
Description: Adaptive Routing is not supported with OpenSHMEM applications.
(Github issue: https://github.com/openucx/ucx/issues/4105)
Workaround: Enable strong synchronization by adding
Keywords: Adaptive Routing, AR, OpenSHMEM, OSHMEM
Discovered in Version: 2.5 (OpenSHMEM 1.4)
-
Description: When UCX requires more shared memory segments than the limit defined in /proc/sys/kernel/shmmni, UCX prints the following message:
"... total number of segments in the system (%lu) would exceed the limit in /proc/sys/kernel/shmmni (=%lu)... please check shared memory limits by 'ipcs -l'".
Workaround: Follow the instructions in the error message above and increase the maximum number of shared memory segments in /proc/sys/kernel/shmmni.
Keywords: UCX, memory
Discovered in Version: 2.1 (UCX 1.3)
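The shared memory segment limit from this entry can be inspected as sketched below. The script only reads the current value; the new limit shown in the trailing comment is an illustrative number, not a recommendation, and changing it requires root privileges.

```shell
#!/bin/sh
# Sketch: inspect the system-wide limit on the number of shared memory segments.
limit_file=/proc/sys/kernel/shmmni
if [ -r "$limit_file" ]; then
    shmmni=$(cat "$limit_file")
    echo "current kernel.shmmni: $shmmni"
else
    shmmni=""
    echo "$limit_file is not readable on this system" >&2
fi

# Show all shared memory limits, as the UCX message suggests (skipped if ipcs is absent).
if command -v ipcs >/dev/null 2>&1; then
    ipcs -l || true
fi

# Raising the limit requires root; 8192 is an illustrative value only:
#   sysctl -w kernel.shmmni=8192
```

A value set with sysctl -w does not survive a reboot unless it is also added to the system's sysctl configuration.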
1162
Description: UCX currently does not support canceling send requests.
(Github issue: https://github.com/openucx/ucx/issues/1162)
Workaround: N/A
Keywords: UCX
Discovered in Version: 2.0
-
Description: UCX job hangs with SocketDirect/MultiHost/SR-IOV.
Workaround: Set UCX_IB_ADDR_TYPE=ib_global
Keywords: UCX
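The variable from this entry can also be set for a single command rather than for the whole shell session. In the sketch below, printenv stands in for the real job launch (for example, an mpirun command line), which is not shown in the entry.

```shell
# Set the variable for one command only, leaving the calling shell unchanged.
# printenv is a stand-in for the actual job launch command.
env UCX_IB_ADDR_TYPE=ib_global printenv UCX_IB_ADDR_TYPE
```

Scoping the setting this way keeps it from leaking into unrelated jobs started from the same shell.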
-
Description: The UCX build embedded in HPC-X is compiled with AVX support, so it cannot run on hosts that lack AVX.
If AVX is not available, recompile the UCX sources shipped with HPC-X using the option --with-avx=no.
Workaround: Recompile UCX with AVX disabled:
$ ./utils/hpcx_rebuild.sh --rebuild-ucx --ucx-extra-config "--with-avx=no"
Keywords: UCX