The following is a list of general limitations and known issues of the various components of this HPC-X release.
Reference Number | Issue |
3819771 | Description: In certain scenarios, RDMA operations involving CUDA memory may fail with the following error: UCX ERROR ibv_reg_dmabuf_mr(address=0xfff939e00000, length=16, access=0xf) failed: Invalid argument. |
Workaround: Disable the gdr_copy transport by setting the environment variable UCX_TLS=^gdr_copy. Alternatively, disable DMA buffer registrations by setting the environment variable UCX_CUDA_COPY_DMABUF=no. | |
Keywords: DMA buffer, memory registration, ibv_reg_dmabuf_mr | |
Discovered in Version: 2.19.0 | |
3884209 | Description: In certain scenarios, a significant performance degradation can be observed due to excessive memory registrations. |
Workaround: Switch back to the legacy protocol implementation by setting the environment variable UCX_PROTO_ENABLE=n. | |
Keywords: UCC, Performance | |
Discovered in Version: 2.19.0 | |
3606732 | Description: In some cases, when using CUDA buffers for intra-node transfers, the program may crash in cuda_ipc with the failed assertion “offset <= key->b_len”. This happens due to a conflict between cuda_ipc and gdr_copy memory registration on the same buffer. In other cases, the error message “gdr_map failed” may be printed. |
Workaround: N/A | |
Keywords: gdr_copy, cuda_ipc | |
Discovered in Version: 2.17.0 | |
3586369 | Description: When the UD transport is selected explicitly, the MPI or SHMEM job may hang during cleanup or MPI_Finalize while waiting for a UCX endpoint flush operation to complete. |
Workaround: Disable the adaptive progress optimization by setting the environment variable UCX_ADAPTIVE_PROGRESS=n, or do not select the UD transport explicitly. | |
Keywords: Hang, UD, Flush | |
Discovered in Version: 2.17.0 | |
3653404 | Description: When registering a large memory region with ucp_mem_map(), and peer failure handling support is enabled on the UCX endpoint, the process may crash with the error “LRU push returned Unsupported operation” while sending a buffer belonging to that region. The issue happens because multi-threaded registration is being used for large regions, and it does not work well with peer failure support. |
Workaround: Disable multi-thread registration by setting the environment variable “UCX_REG_MT_THRESH=inf”. | |
Keywords: Multi-Threaded, Indirect, Key Registration | |
Discovered in Version: 2.17.0 | |
3606445 | Description: The performance of osu_mbw_mr for some message sizes can be worse than in the previous release. This can happen because of different default protocol thresholds. |
Workaround: Revert to the previous threshold selection logic by setting the environment variable UCX_PROTO_ENABLE=n. | |
Keywords: Performance, osu_mbw_mr | |
Discovered in Version: 2.17.0 | |
- | Description: To get the best performance when running on a ConnectX-7 NDR400 fabric, set the following parameters with mpirun: mpirun -x UCX_MAX_RNDV_LANES=4 -x UCX_RNDV_THRESH=20k … |
Workaround: N/A | |
Keywords: ConnectX-7; UCX; mpirun | |
Discovered in Version: 2.11 (UCX 1.13) | |
- | Description: Once the TCP transport detects a “Connection reset by peer” failure on a connection, it stops sending data, and the MPI/SHMEM application hangs. Error printouts from UCP/UCT can be seen in the log. |
Workaround: At small scale, change the “UCX_TLS=tcp” parameter to “UCX_TLS=sm,tcp”. At larger scales, this workaround is not applicable. | |
Keywords: UCX hang | |
Discovered in Version: 2.9 (UCX 1.11) | |
- | Description: The NCCL plugin works only with NCCL v2.8 or higher. |
Workaround: Build plugin version v2.0 from the following source: https://github.com/Mellanox/nccl-rdma-sharp-plugins/tree/v2.0.x | |
Keywords: NCCL Plugin | |
Discovered in Version: 2.7 (NCCL 2.1) | |
- | Description: A UD timeout error may appear. |
Workaround: Disable the UD transport and use DC instead by setting UCX_TLS=dc_x,self,sm. | |
Keywords: UD, DC, timeout, UCX | |
Discovered in Version: 2.7 (UCX 1.9) | |
- | Description: When using GPU memory on an InfiniBand network with GPUDirect enabled but without the gdrcopy library, small-message performance can be low. |
Workaround: Use the Rendezvous protocol for all message sizes by setting the UCX_RNDV_THRESH parameter to 0. | |
Keywords: GPU, GPUDirect, memory | |
Discovered in Version: 2.6 (UCX 1.8) | |
4105 | Description: Adaptive Routing is not supported when used with OpenSHMEM applications. (GitHub issue: https://github.com/openucx/ucx/issues/4105) |
Workaround: N/A | |
Keywords: Adaptive Routing, AR, OpenSHMEM, OSHMEM | |
Discovered in Version: 2.5 (OpenSHMEM 1.4) | |
- | Description: In ConnectX-4 and Connect-IB HCAs, when the DC transport is used on a large scale, “Retry exceeded” messages may be printed from UCX. |
Workaround: Configure SL2VL on your OpenSM in the fabric and make UCX use SL=1 when using the InfiniBand transports via '-x UCX_IB_SL=1'. | |
Keywords: UCX, DC transport, ConnectX-4, Connect-IB | |
Discovered in Version: 2.1 (UCX 1.3) | |
- | Description: When UCX requires more shared memory segments than the limit defined in the /proc/sys/kernel/shmmni file, the following message is printed from UCX: “... total number of segments in the system (%lu) would exceed the limit in /proc/sys/kernel/shmmni (=%lu)... please check shared memory limits by ‘ipcs -l’”. |
Workaround: Follow the instructions in the error message above and increase the maximum number of shared memory segments in the /proc/sys/kernel/shmmni file. | |
Keywords: UCX, memory | |
Discovered in Version: 2.1 (UCX 1.3) | |
1162 | Description: UCX currently does not support canceling send requests. (GitHub issue: https://github.com/openucx/ucx/issues/1162) |
Workaround: N/A | |
Keywords: UCX | |
Discovered in Version: 2.0 | |
- | Description: UCX job hangs with SocketDirect/MultiHost/SR-IOV. |
Workaround: Set UCX_IB_ADDR_TYPE=ib_global | |
Keywords: UCX | |
- | Description: The UCX embedded in HPC-X is compiled with AVX support, so it cannot run on hosts without AVX support. |
Workaround: Recompile the UCX that is shipped with HPC-X with AVX disabled (--with-avx=no): $ ./utils/hpcx_rebuild.sh --rebuild-ucx --ucx-extra-config "--with-avx=no" | |
Keywords: UCX |
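
Most workarounds in the table above are UCX environment variables, which can be passed to the ranks with the mpirun -x option, as in the ConnectX-7 and UCX_IB_SL entries. The following is a minimal sketch of how several of them might be combined into one command line. The helper name ucx_flags, the choice of which workarounds to combine, and the application path ./my_app are hypothetical and only serve the illustration; the variable names and values are taken from the table. The script prints the command instead of running it, so it can be tried on a host without MPI installed.

```shell
#!/bin/sh
# Sketch only: build an mpirun command line from workaround settings.
# ucx_flags and ./my_app are hypothetical names, not part of HPC-X.
set -eu

# Print "-x NAME=VALUE " for every NAME=VALUE argument.
ucx_flags() {
    for setting in "$@"; do
        printf '%s %s ' -x "$setting"
    done
}

# Settings taken from the table above:
#   UCX_PROTO_ENABLE=n       legacy protocol selection (3884209, 3606445)
#   UCX_ADAPTIVE_PROGRESS=n  avoids the UD hang during cleanup (3586369)
#   UCX_REG_MT_THRESH=inf    disables multi-threaded registration (3653404)
cmd="mpirun $(ucx_flags \
    UCX_PROTO_ENABLE=n \
    UCX_ADAPTIVE_PROGRESS=n \
    UCX_REG_MT_THRESH=inf)./my_app"

# Show the resulting command line; run it manually once it looks right.
echo "$cmd"
```

Only the settings that match an issue actually observed should be kept, since each one changes default UCX behavior and may affect performance.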