NVIDIA HPC-X Software Toolkit Rev 2.17.0

Known Issues

The following is a list of general limitations and known issues of the various components of this HPC-X release.

Reference Number	Issue
3633383	Description: On some FW versions, allocating device memory (MEMIC) by passing the `memheap_base_device_nic_mem_seg_size` parameter to the SHMEM runner may crash the process with the error message "failed to allocate 4096 bytes on using md ib". In such cases, avoid using MEMIC.
	Workaround: N/A
	Keywords: MEMIC, SHMEM, Allocation
	Discovered in Version: 2.17.0
3606732	Description: In some cases, when using CUDA buffers for intra-node transfers, the program may crash with the assertion `offset <= key->b_len` failing in `cuda_ipc`. This happens due to a conflict between `cuda_ipc` and `gdrcopy` memory registration on the same buffer. In other cases, the error message "`gdr_map failed`" may be printed.
	Workaround: N/A
	Keywords: `gdr_copy`, `cuda_ipc`
	Discovered in Version: 2.17.0
3586369	Description: When the UD transport is selected explicitly, the MPI or SHMEM job may hang during cleanup or `MPI_Finalize` while waiting for a UCX endpoint flush operation to complete.
	Workaround: Disable adaptive progress optimization by setting the environment variable `UCX_ADAPTIVE_PROGRESS=n`, or do not select the UD transport explicitly.
	Keywords: Hang, UD, Flush
	Discovered in Version: 2.17.0
3653404	Description: When a large memory region is registered with `ucp_mem_map()` and peer failure handling support is enabled on the UCX endpoint, the process may crash with the error "LRU push returned Unsupported operation" while sending a buffer belonging to that region. The issue happens because multi-threaded registration is used for large regions, and it does not work well with peer failure support.
	Workaround: Disable multi-threaded registration by setting the environment variable `UCX_REG_MT_THRESH=inf`.
	Keywords: Multi-Threaded, Indirect, Key Registration
	Discovered in Version: 2.17.0
3606445	Description: The performance of `osu_mbw_mr` for some message sizes can be worse than in the previous release. This can happen because of different default protocol thresholds.
	Workaround: Revert to the previous threshold selection logic by setting the environment variable `UCX_PROTO_ENABLE=n`.
	Keywords: Performance, `osu_mbw_mr`
	Discovered in Version: 2.17.0
-	Description: To get the best performance when running on a ConnectX-7 NDR400 fabric, set the following parameters with mpirun: `mpirun -x UCX_MAX_RNDV_LANES=4 -x UCX_RNDV_THRESH=20k …`
	Workaround: N/A
	Keywords: ConnectX-7; UCX; mpirun
	Discovered in Version: 2.11 (UCX 1.13)
2705762	Description: UCX job may hang when the DC transport is used.
	Workaround: Exclude RoCE LAG devices from the list of available devices (managed by UCX_NET_DEVICES environment variable) and make sure UCX_IB_NUM_PATH is set to 1. Exclude DC from the list of available transports managed by the UCX_TLS environment variable (e.g. set UCX_TLS=sm,self,rc,tcp).
	Keywords: UCX
	Discovered in Version: 2.9 (UCX 1.11)
-	Description: Once the TCP transport detects a "Connection reset by a peer" failure on a connection, it stops sending data, and the MPI/SHMEM application hangs. Error printouts from UCP/UCT can be seen in the log.
	Workaround: At small scale, change the `UCX_TLS=tcp` setting to `UCX_TLS=sm,tcp`. At larger scales, this workaround is not applicable.
	Keywords: UCX hang
	Discovered in Version: 2.9 (UCX 1.11)
-	Description: The NCCL plugin works only with NCCL v2.8 or higher.
	Workaround: For older NCCL versions, build plugin version v2.0 from the following source: https://github.com/Mellanox/nccl-rdma-sharp-plugins/tree/v2.0.x
	Keywords: NCCL Plugin
	Discovered in Version: 2.7 (NCCL 2.1)
-	Description: A UD timeout error may appear.
	Workaround: Disable the UD transport and use DC instead by setting `UCX_TLS=dc_x,self,sm`.
	Keywords: UD, DC, timeout, UCX
	Discovered in Version: 2.7 (UCX 1.9)
2235234	Description: On some platforms, GPUDirect RDMA does not work reliably when the path between the HCA and the GPU traverses a QPI link.
	Workaround: Disable GPUDirect support in UCX by setting UCX_IB_GPU_DIRECT_RDMA=n.
	Keywords: GPUDirect, RDMA, UCX
	Discovered in Version: 2.7 (UCX 1.9)
4549	Description: UCX may fail to compile with Clang compiler version 9 if the `--dynamic-list-data` flag is used in the compilation. (GitHub issue: https://github.com/openucx/ucx/issues/4549)
	Workaround: [optional] Compile UCX without this flag. Note, however, that ucx_perftest will then not be available.
	Keywords: Clang compiler, UCX
	Discovered in Version: 2.6 (UCX 1.8)
-	Description: When using GPU memory on an InfiniBand network with GPUDirect enabled but without the gdrcopy library, the performance of small messages can be low.
	Workaround: Use the Rendezvous protocol by setting the UCX_RNDV_THRESH parameter to 0.
	Keywords: GPU, GPUDirect, memory
	Discovered in Version: 2.6 (UCX 1.8)
4105	Description: Adaptive Routing is not supported with OpenSHMEM applications. (GitHub issue: https://github.com/openucx/ucx/issues/4105)
	Workaround: N/A
	Keywords: Adaptive Routing, AR, OpenSHMEM, OSHMEM
	Discovered in Version: 2.5 (OpenSHMEM 1.4)
-	Description: In ConnectX-4 and Connect-IB HCAs, when the DC transport is used on a large scale, “Retry exceeded” messages may be printed from UCX.
	Workaround: Configure SL2VL on your OpenSM in the fabric and make UCX use SL=1 when using the InfiniBand transports via '-x UCX_IB_SL=1'.
	Keywords: UCX, DC transport, ConnectX-4, Connect-IB
	Discovered in Version: 2.1 (UCX 1.3)
-	Description: When UCX requires more shared memory segments than the limit defined in the /proc/sys/kernel/shmmni file, the following message is printed from UCX: "... total number of segments in the system (%lu) would exceed the limit in /proc/sys/kernel/shmmni (=%lu)... please check shared memory limits by 'ipcs -l'".
	Workaround: Follow the instructions in the error message above and increase the shared memory segment limit in the /proc/sys/kernel/shmmni file.
	Keywords: UCX, memory
	Discovered in Version: 2.1 (UCX 1.3)
1162	Description: UCX currently does not support canceling send requests. (GitHub issue: https://github.com/openucx/ucx/issues/1162)
	Workaround: N/A
	Keywords: UCX
	Discovered in Version: 2.0
-	Description: UCX job hangs with SocketDirect/MultiHost/SR-IOV.
	Workaround: Set UCX_IB_ADDR_TYPE=ib_global
	Keywords: UCX
-	Description: Because the UCX embedded in HPC-X is compiled with AVX support, it cannot run on hosts without AVX support. If AVX is not available, recompile the UCX provided with HPC-X using the option `--with-avx=no`.
	Workaround: Recompile UCX with AVX disabled: `./utils/hpcx_rebuild.sh --rebuild-ucx --ucx-extra-config "--with-avx=no"`
	Keywords: UCX
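
Two of the issues above depend on host properties that can be checked before launching a job: whether the CPU supports AVX, and the current shared memory segment limit in /proc/sys/kernel/shmmni. The following is a minimal sketch of such a pre-flight check, assuming a Linux host. The helper names `has_avx` and `shmmni_limit` are hypothetical and are not part of HPC-X or UCX, and the value 8192 in the final comment is only an illustrative example.

```shell
#!/bin/sh
# Pre-flight sketch for two host-level issues in the table above.
# Assumes Linux; has_avx and shmmni_limit are hypothetical helper names.

# Succeeds if the CPU flags line lists "avx" as a whole word
# (so "avx2" or "avx512f" alone do not count).
# An optional file argument replaces /proc/cpuinfo, e.g. for testing.
has_avx() {
    cpuinfo="${1:-/proc/cpuinfo}"
    grep -E '^flags[[:space:]]*:' "$cpuinfo" 2>/dev/null | grep -qw 'avx'
}

# Prints the system-wide limit on the number of shared memory segments,
# or "unknown" if the file cannot be read.
shmmni_limit() {
    cat "${1:-/proc/sys/kernel/shmmni}" 2>/dev/null || echo "unknown"
}

if has_avx; then
    echo "AVX: available"
else
    echo "AVX: not available; rebuild UCX with --with-avx=no"
fi

echo "shmmni limit: $(shmmni_limit)"

# Raising the limit requires root and is not run by this sketch.
# Example only (8192 is an arbitrary illustrative value):
#   sysctl -w kernel.shmmni=8192
```

Running `ipcs -l`, as the UCX error message suggests, shows the same limit alongside the other System V shared memory limits.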