Known Issues
The following table lists general limitations and known issues in the components of this HPC-X release.
| Reference Number | Issue | 
| 3995982 | Description: GPU device variables (obtained from  | 
| Workaround: Copy the contents of the device buffer to a bounce buffer allocated by  | |
| Keywords:  | |
| Discovered in Version: 2.21.0 | |
| 4050321 | Description: Significant bandwidth degradation occurs when the Global VA feature is enabled (by setting  | 
| Workaround: Avoid setting UCX_GVA_ENABLE=y to prevent potential bandwidth degradation. | |
| Keywords: Global VA; GVA; ODP | |
| Discovered in Version: 2.21.0 | |
| 4097336 | Description: Enabling HW DCS (by setting  | 
| Workaround: Avoid setting  | |
| Keywords: DC; DCS; hang | |
| Discovered in Version: 2.21.0 | |
| 4139280 | Description: Asynchronously allocated CUDA memory may not work correctly with the gdr_copy transport, potentially resulting in an error such as: 
 | 
| Workaround: Set the  | |
| Keywords: gdr_copy; memory registration; Stream Ordered CUDA Allocator | |
| Discovered in Version: 2.21.0 | |
| 4026461 | Description: UCX atomic operations on Grace CPU may fail with a Remote Access error. | 
| Workaround: Disable DevX and KSM memory registration by setting  | |
| Keywords: Atomic; Grace | |
| Discovered in Version: 2.20.0 | |
| 3884209 | Description: In certain scenarios, a significant performance degradation can be observed due to excessive memory registrations. | 
| Workaround: Switch back to legacy protocols implementation by setting  | |
| Keywords: UCC, Performance | |
| Discovered in Version: 2.19.0 | |
| 3606732 | Description: In some cases, when using CUDA buffers for intra-node transfers, the program may crash with an assertion ` In other cases, the error message " | 
| Workaround: N/A | |
| Keywords:  | |
| Discovered in Version: 2.17.0 | |
| 3586369 | Description: When UD transport is being used explicitly, the MPI or SHMEM job may hang during cleanup or  | 
| Workaround: Disable adaptive progress optimization by setting the environment variable  | |
| Keywords: Hang, UD, Flush | |
| Discovered in Version: 2.17.0 | |
| 3653404 | Description: When registering a large memory region  | 
| Workaround: Disable multi-thread registration by setting the environment variable " | |
| Keywords: Multi-Threaded, Indirect, Key Registration | |
| Discovered in Version: 2.17.0 | |
| 3606445 | Description: The performance of  | 
| Workaround: Revert to previous thresholds selection logic by setting the environment variable to  | |
| Keywords: Performance,  | |
| Discovered in Version: 2.17.0 | |
| - | Description: To get the best performance when running on a ConnectX-7 NDR400 fabric, set the following parameter with mpirun. 
 | 
| Workaround: N/A | |
| Keywords: ConnectX-7; UCX; mpirun | |
| Discovered in Version: 2.11 (UCX 1.13) | |
| - | Description: Once the TCP detects a  Error printouts from the UCP/UCT can be seen in the log. | 
| Workaround: On small scale cases, change the  | |
| Keywords: UCX hang | |
| Discovered in Version: 2.9 (UCX 1.11) | |
| - | Description: NCCL plugin works only with NCCL v2.8 or higher. | 
| Workaround: Build plugin version v2.0 from the following source: https://github.com/Mellanox/nccl-rdma-sharp-plugins/tree/v2.0.x | |
| Keywords: NCCL Plugin | |
| Discovered in Version: 2.7 (NCCL 2.1) | |
| - | Description: UD timeout error may appear. | 
| Workaround: 
 | |
| Keywords: UD, DC, timeout, UCX | |
| Discovered in Version: 2.7 (UCX 1.9) | |
| - | Description: When using GPU memory on an InfiniBand network with GPUDirect enabled but without the gdrcopy library, small-message performance can be low. | 
| Workaround: Use the Rendezvous protocol by setting the UCX_RNDV_THRESH parameter to 0. | |
| Keywords: GPU, GPUDirect, memory | |
| Discovered in Version: 2.6 (UCX 1.8) | |
| 3672903/GitHub 4105 | Description: Adaptive Routing is not supported when used with OpenSHMEM applications. (GitHub issue: https://github.com/openucx/ucx/issues/4105) | 
| Workaround: Enable strong synchronization by adding  | |
| Keywords: Adaptive Routing, AR, OpenSHMEM, OSHMEM | |
| Discovered in Version: 2.5 (OpenSHMEM 1.4) | |
| - | Description: When UCX requires more shared memory segments than the limit defined in the /proc/sys/kernel/shmmni file, UCX prints the following message: “... total number of segments in the system (%lu) would exceed the limit in /proc/sys/kernel/shmmni (=%lu)... please check shared memory limits by 'ipcs -l'”. | 
| Workaround: Follow the instructions in the error message above and increase the maximum number of shared memory segments in the /proc/sys/kernel/shmmni file. | |
| Keywords: UCX, memory | |
| Discovered in Version: 2.1 (UCX 1.3) | |
| 1162 | Description: UCX currently does not support canceling send requests. (GitHub issue: https://github.com/openucx/ucx/issues/1162) | 
| Workaround: N/A | |
| Keywords: UCX | |
| Discovered in Version: 2.0 | |
| - | Description: UCX job hangs with SocketDirect/MultiHost/SR-IOV. | 
| Workaround: Set UCX_IB_ADDR_TYPE=ib_global | |
| Keywords: UCX | |
| - | Description: The UCX embedded in HPC-X is compiled with AVX support and therefore cannot run on hosts without AVX. If AVX is not available, recompile the UCX shipped with HPC-X using the option --with-avx=no. | 
| Workaround: Recompile UCX with AVX disabled: $ ./utils/hpcx_rebuild.sh --rebuild-ucx --ucx-extra-config "--with-avx=no" | |
| Keywords: UCX |
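
Several of the workarounds above are plain environment settings. The following is a minimal sketch of how they might be applied in a job script. It uses only the values that appear in full in the table (`UCX_RNDV_THRESH=0` and `UCX_IB_ADDR_TYPE=ib_global`). The `mpirun` line, the application name `./my_app`, and the example limit `8192` are illustrative assumptions, not part of the release notes.

```shell
#!/bin/sh
# Sketch only: apply the workarounds whose values are spelled out in the table.

# GPU memory with GPUDirect but without gdrcopy: use the Rendezvous protocol.
export UCX_RNDV_THRESH=0

# Job hangs with SocketDirect/MultiHost/SR-IOV: use global IB addressing.
export UCX_IB_ADDR_TYPE=ib_global

# Shared memory segment limit: inspect the current value before raising it.
if [ -r /proc/sys/kernel/shmmni ]; then
    shmmni=$(cat /proc/sys/kernel/shmmni)
    echo "current kernel.shmmni limit: ${shmmni}"
fi

# Raising the limit needs root; 8192 is a hypothetical value, pick one
# that suits the system:
#   sysctl -w kernel.shmmni=8192

# Assumed launch line (hypothetical application name). With Open MPI's mpirun,
# -x forwards an environment variable to the launched ranks:
#   mpirun -x UCX_RNDV_THRESH -x UCX_IB_ADDR_TYPE ./my_app
```

Exporting the variables in the job script keeps each workaround scoped to the affected job, so it does not change behavior for every job on the node.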