NVIDIA HPC-X Software Toolkit Rev 2.24

Bug Fixes History

Internal Reference Number

Issue

4173667

Description: Fixed discrepancy in UCX perftest results between client and server.

Keywords: UCX; perftest

Discovered in Release: 2.21

Fixed in Release: 2.22.1

4231305/4238964

Description: Fixed a statistical issue causing a hang or time-out in ConnectX-8 when using DDP (Dynamic Device Pooling), which is enabled by default.

Keywords: DDP; ConnectX-8; time-out

Discovered in Release: 2.22.0

Fixed in Release: 2.22.1

4088373

Description: Fixed an issue where the application could fail with the error: symbol lookup error: libuct_cuda_gdrcopy.so.0: undefined symbol: gdr_get_info_v2, caused by a version mismatch in the gdrcopy library.

Keywords: gdrcopy

Discovered in Release: 2.20.0

Fixed in Release: 2.21.0

4025026

Description: Fixed a FETCH_ADD remote access error for ODP regions.

Keywords: Atomic operations; ODP; UCX

Discovered in Release: 2.19.0 (UCX 1.17)

Fixed in Release: 2.21.0

3955117

Description: Fixed an issue where a segmentation fault could take place in applications using cuda_ipc transport across multiple UCX contexts, due to incorrect handling of the connectivity map data structure.

Keywords: cuda_ipc; segfault

Discovered in Release: 2.19.0

Fixed in Release: 2.21.0

3763160

Description: Fixed an issue where MPI_Init experienced significant delays when there were many files in the /tmp directory. This occurred due to the use of the inotify mechanism for synchronization with a statistics monitoring tool.

Keywords: VFS; MPI_Init

Discovered in Release: 2.17.0

Fixed in Release: 2.21.0

3664432

Description: Fixed an issue where a multi-threaded MPI application using its own lock to synchronize MPI calls could experience crashes or data corruption, even when calling MPI_Init_thread with MPI_THREAD_SERIALIZED mode. The problem was caused by incorrect synchronization of the BlueFlame register.

Keywords: Data corruption; segfault; crash; multi-thread; MPI_THREAD_SERIALIZED

Discovered in Release: Open MPI 4.1

Fixed in Release: 2.21.0

3819771

Description: Fixed the issue where in certain scenarios, RDMA operations involving CUDA memory could encounter a failure, resulting in the following error: UCX ERROR ibv_reg_dmabuf_mr(address=0xfff939e00000, length=16, access=0xf) failed: Invalid argument.

Keywords: DMA buffer, memory registration, ibv_reg_dmabuf_mr

Discovered in Release: 2.19.0

Fixed in Release: 2.21.0

3653404

Description: When registering a large memory region with ucp_mem_map(), and peer failure handling support is enabled on the UCX endpoint, the process may crash with the error "LRU push returned Unsupported operation" while sending a buffer belonging to that region. The issue happens because multi-threaded registration is being used for large regions, and it does not work well with peer failure support.

Keywords: Multi-Threaded, Indirect, Key Registration

Discovered in Release: 2.17.0

Fixed in Release: 2.18.0

3837556

Description: Fixed UCX to not create an SRQ on RDMA network devices that do not support it. Before this fix, the application could fail with the error message "ibv_create_srq() failed: Operation not supported".

Keywords: SRQ, UCX

Discovered in Release: 2.17.0

Fixed in Release: 2.18.0

3774158

Description: Fixed a failure with the message "Local length error". The issue is caused by some compilers replacing direct assignments with memmove() function, leading to corruption while writing to IO memory.

Keywords: UCX, Local length error

Discovered in Release: 2.17.0

Fixed in Release: 2.18.0
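The memmove() substitution described in 3774158 can be illustrated with a minimal sketch. It assumes a descriptor made of 64-bit words and shows one common way to keep a compiler from lowering a run of assignments into a library copy call: storing through a volatile-qualified pointer, so that each store is performed as written. The names copy_to_io_qwords and DESC_QWORDS are hypothetical, and this is not necessarily how the fix was implemented in UCX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor size, for illustration only: 8 x 64-bit words. */
#define DESC_QWORDS 8

/*
 * Copy n 64-bit words to dst one store at a time.
 * Because dst is volatile-qualified, every store is an observable access
 * that the compiler has to emit individually and in order; it cannot merge
 * the loop into a memcpy()/memmove() call with unknown access widths.
 */
static void copy_to_io_qwords(volatile uint64_t *dst, const uint64_t *src,
                              size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}
```

In real code the destination would be a mapped device region; an ordinary array works as a stand-in for experimenting with the helper.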

3774153

Description: Fixed the issue where in some cases, there could be a race condition between RDMA_WRITE and shared memory write, leading to MPI receiving invalid data with large messages or collective operations between ranks on the same node.

Keywords: RDMA_WRITE

Discovered in Release: 2.17.0

Fixed in Release: 2.18.0

3762227

Description: Fixed the issue where the application may crash in UCX remote key packing procedure after failed memory registration.

Keywords: UCX, assertion

Discovered in Release: 2.17.1

Fixed in Release: 2.18.0

3748762

Description: Fixed the issue where the application may crash in UCX remote key packing procedure after failed memory registration.

Keywords: UCX, assertion

Discovered in Release: 2.17.1

Fixed in Release: 2.18.0

3712109

Description: Fixed a UCC error in PyTorch 23.12 caused by the HPC-X 2.17.0 upgrade.

Keywords: UCC error, PyTorch, Upgrade

Discovered in Release: 2.17.0

Fixed in Release: 2.18.0

3436244

Description: On rare occasions, a 'group join' request may reach a timeout.

Keywords: NDR Switch, SHARP

Discovered in Version: 2.16

Fixed in Version: 2.16.2

3479712

Description: In virtualized environments, the performance of large messages can drop due to repeated failures to create indirect-atomic key (KSM).

Keywords: Virtualized Environments; Failure; Indirect-atomic Key; KSM

Discovered in Version: 2.15

Fixed in Version: 2.16

3268964

Description: Improved performance in MPI_Bcast on AMD Genoa.

Note: To make use of these improvements, make sure UCC is explicitly enabled using:

--mca coll_ucc_enable 1 --mca coll_ucc_priority 99 --mca coll ucc,basic,libnbc --mca coll_ucc_cls basic,hier

Keywords: MPI_Bcast; AMD Genoa; UCC

Discovered in Version: 2.14

Fixed in Version: 2.15

3255925

Description: Fixed the issue where MPI_Init was creating an internal CUDA context on GPU0, which could affect the behavior of CUDA applications.

Keywords: CUDA; MPI

Discovered in Version: 2.13

Fixed in Version: 2.14

3223214

Description: Fixed the issue where shmem_ulong_wait_until() unsigned comparison was not working as expected.

Keywords: SHMEM

Discovered in Version: 2.13

Fixed in Version: 2.14
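The unsigned-comparison problem described in 3223214 can be illustrated in plain C, without OpenSHMEM. The sketch below is not the library's code: gt_unsigned and gt_signed are hypothetical helpers showing how a "greater than" test on unsigned long operands changes its answer if the operands are treated as signed. The signed conversion of an out-of-range value is implementation-defined; the comments assume the usual two's-complement behavior of GCC and Clang.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: correct "greater than" test for unsigned long. */
static bool gt_unsigned(unsigned long value, unsigned long threshold)
{
    return value > threshold;
}

/*
 * Hypothetical helper: the same test after reinterpreting both operands
 * as signed. Values with the top bit set become negative, so the result
 * can be wrong. Shown only to illustrate the failure mode.
 */
static bool gt_signed(unsigned long value, unsigned long threshold)
{
    return (long)value > (long)threshold;
}

/* Small self-check on a value with the top bit set. */
static void check_unsigned_compare(void)
{
    assert(gt_unsigned(ULONG_MAX, 1UL));  /* ULONG_MAX is greater than 1 */
    assert(!gt_signed(ULONG_MAX, 1UL));   /* as signed it reads as -1 */
}
```

For small values both helpers agree, which is why this kind of bug tends to show up only when a waited-on variable crosses LONG_MAX.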

3261844

Description: Fixed the issue where the TCP transport was used on RDMA-capable setups, leading to lower performance and occasional hangs during MPI_Finalize.

Keywords: TCP; RDMA; MPI; performance

Discovered in Version: 2.13

Fixed in Version: 2.13.1 LTS

3139906

Description: Port counters were not updated for UCX traffic when creating QP with DevX.

Keywords: UCX; QP; DevX

Discovered in Version: 2.13

Fixed in Version: 2.13.1 LTS

3084053

Description: Fixed the issue where performance of some applications was lower compared with HPC-X v2.10 and earlier.

Keywords: Performance

Discovered in Version: 2.12

Fixed in Version: 2.13

3163697

Description: Fixed the issue where libsharp could not use any additional file descriptors once the client application used more than 1024 of them (the range limit defined by FD_SETSIZE). libsharp now uses poll() instead of select(), which allows it to use the full range of file descriptors permitted by Linux.

Keywords: File descriptor; libsharp; HCOLL; HPC-X

Discovered in Version: 2.12

Fixed in Version: 2.13
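The select() versus poll() difference behind 3163697 can be sketched as follows. This is an illustration, not libsharp code: wait_readable and move_above_fd_setsize are hypothetical helpers. poll() receives each descriptor as a plain int in a struct pollfd, so it is not bound by FD_SETSIZE, whereas an fd_set used with select() can only track descriptor numbers below FD_SETSIZE.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Wait until fd is readable.
 * Returns 1 if readable, 0 on timeout, -1 on error or unexpected event.
 */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc;

    assert(fd >= 0);
    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0) {
        return -1;
    }
    if (rc == 0) {
        return 0;
    }
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/*
 * Hypothetical helper: try to move fd to a descriptor number at or above
 * FD_SETSIZE, a range select() cannot track. If the process limit
 * (RLIMIT_NOFILE) does not allow it, the original fd is returned unchanged.
 */
static int move_above_fd_setsize(int fd)
{
    int high = fcntl(fd, F_DUPFD, FD_SETSIZE);

    if (high < 0) {
        return fd;
    }
    close(fd);
    return high;
}
```

A caller would create a descriptor, optionally push it above FD_SETSIZE with move_above_fd_setsize(), and then call wait_readable() on it; the same descriptor could not safely be added to an fd_set.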

3208615

Description: Fixed Data Integrity failure in Broadcast when using sparse subarray data type in OMPI with hcoll library by using the TRUE extent of the datatype, which includes any additional padding the datatype may require.

Keywords: OMPI; HCOLL; data integrity

Discovered in Version: 1.12

Fixed in Version: 2.13

4549

Description: Fixed the issue where UCX may have failed to compile with Clang compiler version 9 if the --dynamic-list-data flag was used in the compilation.

(Github issue: https://github.com/openucx/ucx/issues/4549)

Keywords: Clang compiler, UCX

Discovered in Version: 2.6 (UCX 1.8)

Fixed in Version: 2.11 (UCX 1.13)

-

Description: DevX does not work on architectures without "Write combining" support, such as some flavors of ARM, prompting the following error message:

UCX ERROR mlx5dv_devx_alloc_uar() failed: Operation not supported

Keywords: DevX, UCX, ARM

Discovered in Version: 2.8 (UCX 1.10)

Fixed in Version: 2.9 (UCX 1.11)

-

Description: NVIDIA SHARP library is not available in HPC-X for the Community OFED and Inbox OFED.

Keywords: NVIDIA SHARP library

Discovered in Version: 2.0

Fixed in Version: 2.9 (UCX 1.11)

2190337

Description: Fixed the issue where errors from the UCX TCP transport about refused connection may have appeared.

Keywords: UCX_TLS, UCX, TCP

Discovered in Version: 2.7 (UCX 1.9)

Fixed in Version: 2.8 (UCX 1.10)

2131893

Description: Fixed the issue where OpenSHMEM or MPI applications may have failed with the following error:

“Fatal: endpoint reconfiguration not supported yet”

This could happen when running in heterogeneous environment, such as when different nodes in the job had different types of HCAs or PCI atomics configuration.

Keywords: OpenSHMEM, UCX, MPI

Discovered in Version: 2.7 (UCX 1.9)

Fixed in Version: 2.8 (UCX 1.10)

2084450

Description: Fixed the issue where the osu_ialltoallw and osu_iallgather benchmarks may not have performed well over RoCE with the ud_x transport, starting from message sizes of 8192 bytes.

Keywords: osu_ialltoallw, osu_iallgather, ud_x transport, RoCE, UCX

Discovered in Version: 2.6 (UCX 1.8)

Fixed in Version: 2.8 (UCX 1.10)

1886580

Description: Fixed the issue where the below error messages might have been received when running OMPI with ‘direct modex’, i.e. when the following command line parameters were used:

-mca pmix_base_async_modex 1 -mca mpi_add_procs_cutoff 0 -mca pmix_base_collect_data 0

Error messages:

  • PMIX ERROR: NOT-FOUND in file server/pmix_server_get.c at line 751

  • PMIX ERROR: NOT-FOUND in file client/pmix_client_get.c at line 334

Keywords: OMPI, pmix, direct modex, full modex

Discovered in Version: 2.5 (OpenMPI 4.0.x)

Fixed in Version: 2.7 (OpenMPI 4.0.x)

4710

Description: Fixed the issue where using UCX with the XPMEM module on kernels 4.10 and above might have resulted in a "Bus error" due to an issue in the XPMEM driver.

(Github issue: https://github.com/openucx/ucx/issues/4710)

Keywords: UCX, XPMEM

Discovered in Version: 2.6 (UCX 1.8)

Fixed in Version: 2.7 (UCX 1.9)

2096036

Description: Fixed the issue where the verifier test may have failed with the following error when using the ud_x transport:

ib_mlx5_log.c:139 Local QP operation on mlx5_0:1/IB (synd 0x2 vend 0x68 hw_synd 0/66)

ib_mlx5_log.c:139 UD QP 0x37161 wqe[368]: SEND --- [rqpn 0x36a01 rlid 93] [inl len 16]

Keywords: ud_x transport, UCX

Discovered in Version: 2.6 (UCX 1.8)

Fixed in Version: 2.7 (UCX 1.9)

2095618

Description: Fixed the issue where the host may have run out of memory when enabling Hardware Tag-Matching.

Keywords: Hardware Tag-Matching, UCX

Discovered in Version: 2.6 (UCX 1.8)

Fixed in Version: 2.7 (UCX 1.9)

3758

Description: Fixed the issue where the following error message might have appeared when running UCX with the TCP transport on more than 16 hosts with full PPN (processes per node):

sock.c:228 UCX ERROR recv(fd=1377) failed: 104

(Github issue: https://github.com/openucx/ucx/issues/3758)

Keywords: TCP, UCX, backlog

Discovered in Version: 2.5 (UCX 1.7)

Fixed in Version: 2.6 (UCX 1.8)

1582208

Description: Fixed the issue where sending data over multiple SHMEM contexts may lead to memory corruption or segmentation fault.

Keywords: Open SHMEM, segmentation fault

Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)

Fixed in Version: 2.5 (Open MPI v4.0.x, OpenSHMEM v1.4)

2934

Description: Fixed the issue where OpenMPI and OpenSHMEM applications may hang with DC transport.

(Github issue: https://github.com/openucx/ucx/issues/2934)

Keywords: UCX, Open MPI, DC

Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)

Fixed in Version: 2.5 (Open MPI v4.0.x, OpenSHMEM v1.4)

1307243

Description: Fixed the issue where one-sided tests may fail with a segmentation fault.

Keywords: OSC UCX, Open MPI, one-sided

Discovered in Version: 2.1 (Open MPI 3.1.x)

Fixed in Version: 2.5 (Open MPI 4.0.x)

-

Description: Fixed the issue where OpenSHMEM atomic operations AND/OR/XOR for datatypes int32/int64/uint32/uint64 were not implemented, which might have caused build failures.

Keywords: OpenSHMEM atomic, Open MPI

Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)

Fixed in Version: 2.4 (Open MPI v4.0.x, OpenSHMEM v1.4)

2226

Description: Fixed the issue where the following assertion may have failed in certain cases:

Assertion `ep->rx.ooo_pkts.head_sn == neth->psn' failed

(Github issue: https://github.com/openucx/ucx/issues/2226)

Keywords: UCX, assertion

Discovered in Version: 2.1 (UCX 1.3)

Fixed in Version: 2.4 (UCX 1.6)

-

Description: Fixed the issue where zero-length OpenSHMEM collectives might have failed due to incomplete implementation.

Keywords: OpenSHMEM atomic, Open MPI

Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)

Fixed in Version: 2.4 (Open MPI v4.0.x, OpenSHMEM v1.4)

-

Description: Fixed the issue where OSC UCX module was not selected by default on ConnectX-4/ConnectX-5 HCAs.

Keywords: OSC UCX, one-sided, Open MPI

Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)

Fixed in Version: 2.4 (Open MPI v4.0.x, OpenSHMEM v1.4)

-

Description: Fixed the issue where using UCX on ARM hosts may result in hangs due to a known issue in Open MPI when running on ARM.

Keywords: UCX

Discovered in Version: 1.3 (Open MPI 1.8.2)

Fixed in Version: 2.3 (Open MPI 4.0.x)

-

Description: MCA options rmaps_dist_device and rmaps_base_mapping_policy are now functional.

Keywords: Process binding policy, NUMA/HCA locality

Discovered in Version: 2.0 (Open MPI 3.0.0)

Fixed in Version: 2.3 (Open MPI 4.0.x)

2111

Description: Fixed the issue where the osu_latency_mt test might have taken a long time to complete when UCX was used in multi-threaded mode.

(Github issue: https://github.com/openucx/ucx/issues/2111)

Keywords: UCX, multi-threaded

Discovered in Version: 2.1 (UCX 1.3)

Fixed in Version: 2.3 (UCX 1.5)

2267

Description: Fixed the issue where the following error message might have appeared when running at the scale of 256 ranks with the RC transport, when UD is used for wireup only:

“Fatal: send completion with error: Endpoint timeout”

(Github issue: https://github.com/openucx/ucx/issues/2267)

Keywords: UCX

Discovered in Version: 2.1 (UCX 1.3)

Fixed in Version: 2.3 (UCX 1.5)

2702

Description: Fixed the issue where the following error messages may have been printed when using the Hardware Tag Matching feature:

  • “rcache.c:481 UCX WARN failed to register region 0xdec25a0 [0x2b7139ae0020..0x2b7139ae2020]: Input/output error”

  • “ucp_mm.c:105 UCX ERROR failed to register address 0x2b7139ae0020 length 8192 on md[1]=ib/mlx5_0: Input/output error”

  • “ucp_request.c:259 UCX ERROR failed to register user buffer datatype 0x20 address 0x2b7139ae0020 len 8192: Input/output error”

(Github issue: https://github.com/openucx/ucx/issues/2702)

Keywords: Hardware Tag Matching

Discovered in Version: 2.2 (UCX 1.4)

Fixed in Version: 2.3 (UCX 1.5)

2454

Description: Fixed the issue where some one-sided benchmarks may have hung when using “osc ucx”.

For example: osu-micro-benchmarks-5.3.2/osu_get_acc_latency (Latency Test for accumulate with Active/Passive Synchronization).

(Github issue: https://github.com/openucx/ucx/issues/2454)

Keywords: UCX, one_sided

Discovered in Version: 2.2 (UCX 1.4)

Fixed in Version: 2.3 (UCX 1.5)

2670

Description: Fixed the issue where the following error message may have been printed, due to the increased threshold for BCOPY messages, when enabling the Hardware Tag Matching feature on a large scale:

“mpool.c:177 UCX ERROR Failed to allocate memory pool chunk: Out of memory.”

(Github issue: https://github.com/openucx/ucx/issues/2670)

Keywords: Hardware Tag Matching

Discovered in Version: 2.2 (UCX 1.4)

Fixed in Version: 2.3 (UCX 1.5)

1295679

Description: Fixed the issue where the OpenSHMEM group cache had a default limit of 100 entries, which might have resulted in an OpenSHMEM application exiting with the following message: “group cache overflow on rank xxx: cache_size = 100”.

Keywords: OpenSHMEM, Open MPI

Discovered in Version: 2.1 (Open MPI 3.1.x)

Fixed in Version: 2.2 (Open MPI 3.1.x)

-

Description: Fixed the issue where UCX did not work out-of-the-box with CUDA support.

Keywords: UCX, CUDA

Discovered in Version: 2.2 (UCX 1.4)

Fixed in Version: 2.1 (UCX 1.3)

1926

Description: Fixed the issue where, when using multiple transports, invalid data was sent out-of-sync with Hardware Tag Matching traffic.

(Github issue: https://github.com/openucx/ucx/issues/1926)

Keywords: Hardware Tag Matching

Discovered in Version: 2.1 (UCX 1.3)

Fixed in Version: 2.2 (UCX 1.4)

1949

Description: Fixed the issue where Hardware Tag Matching might not have functioned properly with UCX over DC transport.

(Github issue: https://github.com/openucx/ucx/issues/1949)

Keywords: UCX, Hardware Tag Matching, DC transport

Discovered in Version: 2.0

Fixed in Version: 2.1

-

Description: Fixed job data transfer from SD to libsharp.

Keywords: NVIDIA SHARP library

Discovered in Release: 1.9

Fixed in Release: 1.9.7

884482

Description: Fixed internal HCOLL datatype mapping.

Keywords: HCOLL, FCA

Discovered in Release: 1.7.405

Fixed in Release: 1.7.406

884508

Description: Fixed internal HCOLL datatype lower bound calculation.

Keywords: HCOLL, FCA

Discovered in Release: 1.7.405

Fixed in Release: 1.7.406

884490

Description: Fixed allgather unpacking issues.

Keywords: HCOLL, FCA

Discovered in Release: 1.7.405

Fixed in Release: 1.7.406

885009

Description: Fixed wrong answer in alltoallv.

Keywords: HCOLL, FCA

Discovered in Release: 1.7.405

Fixed in Release: 1.7.406

882193

Description: Fixed mcast group leak in HCOLL.

Keywords: HCOLL, FCA

Discovered in Release: 1.7.405

Fixed in Release: 1.7.406

-

Description: Added IN_PLACE support for alltoall, alltoallv, and allgatherv.

Keywords: HCOLL, FCA

Discovered in Release: 1.7.405

Fixed in Release: 1.7.406

-

Description: Fixed an issue related to multi-threaded MPI_Bcast.

Keywords: HCOLL, FCA

Discovered in Release: 1.7.405

Fixed in Release: 1.7.406

Salesforce: 316541

Description: Fixed a memory barrier issue in MPI_Barrier on Power PPC systems.

Keywords: HCOLL, FCA

Discovered in Release: 1.7.405

Fixed in Release: 1.7.406

Salesforce: 316547

Description: Fixed multi-threaded MPI_COMM_DUP and MPI_COMM_SPLIT hanging issues.

Keywords: HCOLL, FCA

Discovered in Release: 1.7.405

Fixed in Release: 1.7.406

894346

Description: Fixed Quantum Espresso hanging issues.

Keywords: HCOLL, FCA

Discovered in Release: 1.7.405

Fixed in Release: 1.7.406

898283

Description: Fixed an issue which caused CP2K applications to hang when HCOLL was enabled.

Keywords: HCOLL, FCA

Discovered in Release: 1.7.405

Fixed in Release: 1.7.406

906155

Description: Fixed an issue which caused VASP applications to hang in MPI_Allreduce.

Keywords: HCOLL, FCA

Discovered in Release: 1.6

Fixed in Release: 1.7.406

© Copyright 2025, NVIDIA. Last updated on Sep 9, 2025.