Bug Fixes History
Internal Reference Number: 3436244
Description: On rare occasions, a 'group join' request may reach a timeout.
Keywords: NDR Switch, SHARP
Discovered in Version: 2.16
Fixed in Version: 2.16.2

Internal Reference Number: 3479712
Description: In virtualized environments, the performance of large messages can drop due to repeated failures to create an indirect-atomic key (KSM).
Keywords: Virtualized Environments, Failure, Indirect-atomic Key, KSM
Discovered in Version: 2.15
Fixed in Version: 2.16

Internal Reference Number: 3268964
Description: Improved performance of MPI_Bcast on AMD Genoa. Note: To benefit from these improvements, make sure UCC is explicitly enabled using: --mca coll_ucc_enable 1 --mca coll_ucc_priority 99 --mca coll ucc,basic,libnbc --mca coll_ucc_cls basic,hier (see the example below).
Keywords: MPI_Bcast, AMD Genoa, UCC
Discovered in Version: 2.14
Fixed in Version: 2.15

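For illustration, a full launch command with UCC enabled might look as follows; the flags are the ones listed above, while the rank count and application name are placeholders:

    mpirun -np 64 \
        --mca coll_ucc_enable 1 \
        --mca coll_ucc_priority 99 \
        --mca coll ucc,basic,libnbc \
        --mca coll_ucc_cls basic,hier \
        ./my_mpi_app
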
Internal Reference Number: 3255925
Description: Fixed the issue where MPI_Init was creating an internal CUDA context on GPU0, which could affect the behavior of CUDA applications.
Keywords: CUDA, MPI
Discovered in Version: 2.13
Fixed in Version: 2.14

Internal Reference Number: 3223214
Description: Fixed the issue where the unsigned comparison in shmem_ulong_wait_until() was not working as expected.
Keywords: SHMEM
Discovered in Version: 2.13
Fixed in Version: 2.14

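To illustrate the call in question, here is a minimal OpenSHMEM sketch; the threshold is chosen so that a signed comparison would misbehave, which is exactly the case this fix addresses:

    #include <shmem.h>

    /* Symmetric flag; PE 1 releases PE 0 by writing a value with the
     * high bit set, which must be compared as UNSIGNED. */
    static unsigned long flag = 0;

    int main(void) {
        shmem_init();
        if (shmem_my_pe() == 0) {
            /* Blocks until flag > 2^63 holds as an unsigned comparison. */
            shmem_ulong_wait_until(&flag, SHMEM_CMP_GT, 0x8000000000000000UL);
        } else if (shmem_my_pe() == 1) {
            shmem_ulong_p(&flag, 0xFFFFFFFFFFFFFFFFUL, 0);
        }
        shmem_finalize();
        return 0;
    }
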
Internal Reference Number: 3261844
Description: Fixed an issue where using the TCP transport on an RDMA-capable setup led to lower performance and occasional hangs during MPI_Finalize.
Keywords: TCP, RDMA, MPI, performance
Discovered in Version: 2.13
Fixed in Version: 2.13.1 LTS

Internal Reference Number: 3139906
Description: Fixed the issue where port counters were not updated for UCX traffic when creating QPs with DevX.
Keywords: UCX, QP, DevX
Discovered in Version: 2.13
Fixed in Version: 2.13.1 LTS

Internal Reference Number: 3084053
Description: Fixed the issue where performance of some applications was lower compared with HPC-X v2.10 and earlier.
Keywords: Performance
Discovered in Version: 2.12
Fixed in Version: 2.13

Internal Reference Number: 3163697
Description: Fixed an issue where, once the client application used more than 1024 file descriptors (the limit defined by FD_SETSIZE), libsharp was prevented from using any additional file descriptors. libsharp now uses poll() instead of select(), enabling the full range of file descriptors allowed by Linux.
Keywords: File descriptor, libsharp, HCOLL, HPC-X
Discovered in Version: 2.12
Fixed in Version: 2.13

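The select()-to-poll() change matters because select() cannot monitor descriptors whose numeric value is at or above FD_SETSIZE, while poll() has no such limit. A minimal C sketch of the difference (the helper name is illustrative, not part of libsharp):

    #include <poll.h>

    /* Wait until fd is readable. Works for any descriptor value;
     * with select(), FD_SET(fd, &set) is undefined behavior once
     * fd >= FD_SETSIZE (typically 1024). */
    int wait_readable(int fd, int timeout_ms) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
        return poll(&pfd, 1, timeout_ms);
    }
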
Internal Reference Number: 3208615
Description: Fixed a data integrity failure in Broadcast when using a sparse subarray datatype in OMPI with the HCOLL library. The fix uses the TRUE extent of the datatype, which includes any additional padding the datatype may require.
Keywords: OMPI, HCOLL, data integrity
Discovered in Version: 2.12
Fixed in Version: 2.13

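For context, the "extent" and the "true extent" of an MPI datatype can differ for sparse types; a short sketch querying both (the function name is illustrative):

    #include <mpi.h>
    #include <stdio.h>

    /* Print both extents of a datatype; buffers sized by the true
     * extent account for any padding the type requires. */
    void print_extents(MPI_Datatype dtype) {
        MPI_Aint lb, extent, true_lb, true_extent;
        MPI_Type_get_extent(dtype, &lb, &extent);
        MPI_Type_get_true_extent(dtype, &true_lb, &true_extent);
        printf("extent=%ld true_extent=%ld\n", (long)extent, (long)true_extent);
    }
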
Internal Reference Number: 4549
Description: Fixed the issue where UCX may have failed to compile with Clang compiler version 9 if the --dynamic-list-data flag was used in the compilation. (Github issue: https://github.com/openucx/ucx/issues/4549)
Keywords: Clang compiler, UCX
Discovered in Version: 2.6 (UCX 1.8)
Fixed in Version: 2.11 (UCX 1.13)

Internal Reference Number: -
Description: Fixed the issue where DevX did not work on architectures without "write combining" support, such as some flavors of ARM, resulting in the following error message: "UCX ERROR mlx5dv_devx_alloc_uar() failed: Operation not supported".
Keywords: DevX, UCX, ARM
Discovered in Version: 2.8 (UCX 1.10)
Fixed in Version: 2.9 (UCX 1.11)

Internal Reference Number: -
Description: Fixed the issue where the NVIDIA SHARP library was not available in HPC-X for the Community OFED and Inbox OFED.
Keywords: NVIDIA SHARP library
Discovered in Version: 2.0
Fixed in Version: 2.9 (UCX 1.11)

Internal Reference Number: 2190337
Description: Fixed the issue where errors about refused connections may have appeared from the UCX TCP transport.
Keywords: UCX_TLS, UCX, TCP
Discovered in Version: 2.7 (UCX 1.9)
Fixed in Version: 2.8 (UCX 1.10)

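As a hypothetical workaround on affected versions (not part of the official fix), the TCP transport can be excluded via the UCX_TLS environment variable so that UCX uses only RDMA and shared-memory transports; the transport list and application name below are examples:

    UCX_TLS=rc,ud,sm,self mpirun -np 64 ./my_mpi_app
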
Internal Reference Number: 2131893
Description: Fixed the issue where OpenSHMEM or MPI applications may have failed with the following error: "Fatal: endpoint reconfiguration not supported yet". This could happen when running in a heterogeneous environment, such as when different nodes in the job had different types of HCAs or PCI atomics configuration.
Keywords: OpenSHMEM, UCX, MPI
Discovered in Version: 2.7 (UCX 1.9)
Fixed in Version: 2.8 (UCX 1.10)

Internal Reference Number: 2084450
Description: Fixed the issue where the osu_ialltoallw and osu_iallgather benchmarks may not have performed well over RoCE with the ud_x transport, starting from messages of 8192 bytes.
Keywords: osu_ialltoallw, osu_iallgather, ud_x transport, RoCE, UCX
Discovered in Version: 2.6 (UCX 1.8)
Fixed in Version: 2.8 (UCX 1.10)

Internal Reference Number: 1886580
Description: Fixed the issue where error messages might have been received when running OMPI with 'direct modex', i.e., when the following command-line parameters were used: -mca pmix_base_async_modex 1 -mca mpi_add_procs_cutoff 0 -mca pmix_base_collect_data 0
Keywords: OMPI, pmix, direct modex, full modex
Discovered in Version: 2.5 (OpenMPI 4.0.x)
Fixed in Version: 2.7 (OpenMPI 4.0.x)

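For illustration, a complete 'direct modex' launch using those parameters might look as follows (rank count and application name are placeholders):

    mpirun -np 64 \
        -mca pmix_base_async_modex 1 \
        -mca mpi_add_procs_cutoff 0 \
        -mca pmix_base_collect_data 0 \
        ./my_mpi_app
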
Internal Reference Number: 4710
Description: Fixed an issue where using UCX with the XPMEM module on kernels 4.10 and above might have resulted in a "Bus error" due to an issue in the XPMEM driver. (Github issue: https://github.com/openucx/ucx/issues/4710)
Keywords: UCX, XPMEM
Discovered in Version: 2.6 (UCX 1.8)
Fixed in Version: 2.7 (UCX 1.9)

Internal Reference Number: 2096036
Description: Fixed the issue where the verifier test may have failed with the following error when using the ud_x transport:
ib_mlx5_log.c:139 Local QP operation on mlx5_0:1/IB (synd 0x2 vend 0x68 hw_synd 0/66)
ib_mlx5_log.c:139 UD QP 0x37161 wqe[368]: SEND --- [rqpn 0x36a01 rlid 93] [inl len 16]
Keywords: ud_x transport, UCX
Discovered in Version: 2.6 (UCX 1.8)
Fixed in Version: 2.7 (UCX 1.9)

Internal Reference Number: 2095618
Description: Fixed the issue where the host may have run out of memory when enabling Hardware Tag-Matching.
Keywords: Hardware Tag-Matching, UCX
Discovered in Version: 2.6 (UCX 1.8)
Fixed in Version: 2.7 (UCX 1.9)

Internal Reference Number: 3758
Description: Fixed an issue where running UCX with the TCP transport on more than 16 hosts with full PPN (processes per node) might have produced the following error message: "sock.c:228 UCX ERROR recv(fd=1377) failed: 104". (Github issue: https://github.com/openucx/ucx/issues/3758)
Keywords: TCP, UCX, backlog
Discovered in Version: 2.5 (UCX 1.7)
Fixed in Version: 2.6 (UCX 1.8)

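Error 104 is ECONNRESET. Since the keywords point at the listen backlog, a plausible but unverified mitigation on affected versions is to raise the kernel's backlog ceiling on all hosts before launching; the value below is only an example:

    sysctl -w net.core.somaxconn=4096
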
Internal Reference Number: 1582208
Description: Fixed the issue where sending data over multiple SHMEM contexts might have led to memory corruption or a segmentation fault.
Keywords: Open SHMEM, segmentation fault
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)
Fixed in Version: 2.5 (Open MPI v4.0.x, OpenSHMEM v1.4)

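For reference, the multiple-contexts pattern this fix concerns looks roughly as follows; each context carries its own ordering and completion state, so operations on one context progress independently of another (the function name is illustrative):

    #include <shmem.h>

    void ctx_example(long *dst, long value, int target_pe) {
        shmem_ctx_t ctx;
        if (shmem_ctx_create(0, &ctx) != 0) {
            /* Fall back to the default context on failure. */
            shmem_long_p(dst, value, target_pe);
            return;
        }
        shmem_ctx_long_p(ctx, dst, value, target_pe);
        shmem_ctx_quiet(ctx);    /* complete outstanding ops on this context */
        shmem_ctx_destroy(ctx);
    }
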
Internal Reference Number: 2934
Description: Fixed the issue where Open MPI and OpenSHMEM applications might have hung with the DC transport. (Github issue: https://github.com/openucx/ucx/issues/2934)
Keywords: UCX, Open MPI, DC
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)
Fixed in Version: 2.5 (Open MPI v4.0.x, OpenSHMEM v1.4)

Internal Reference Number: 1307243
Description: Fixed the issue where one-sided tests might have failed with a segmentation fault.
Keywords: OSC UCX, Open MPI, one-sided
Discovered in Version: 2.1 (Open MPI 3.1.x)
Fixed in Version: 2.5 (Open MPI 4.0.x)

Internal Reference Number: -
Description: Fixed the issue where OpenSHMEM atomic operations AND/OR/XOR for datatypes int32/int64/uint32/uint64 were not implemented, which might have caused build failures.
Keywords: OpenSHMEM atomic, Open MPI
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)
Fixed in Version: 2.4 (Open MPI v4.0.x, OpenSHMEM v1.4)

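For reference, these bitwise atomics are part of the OpenSHMEM 1.4 API; a minimal sketch using one of them, in which each PE clears its own bit in a mask owned by PE 0:

    #include <shmem.h>
    #include <stdint.h>

    static uint64_t mask = UINT64_MAX;  /* symmetric variable */

    int main(void) {
        shmem_init();
        /* Atomically AND out this PE's bit in PE 0's copy of mask. */
        shmem_uint64_atomic_and(&mask, ~((uint64_t)1 << shmem_my_pe()), 0);
        shmem_barrier_all();
        shmem_finalize();
        return 0;
    }
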
Internal Reference Number: 2226
Description: Fixed the issue where the following assertion may have failed in certain cases: Assertion `ep->rx.ooo_pkts.head_sn == neth->psn' failed. (Github issue: https://github.com/openucx/ucx/issues/2226)
Keywords: UCX, assertion
Discovered in Version: 2.1 (UCX 1.3)
Fixed in Version: 2.4 (UCX 1.6)

Internal Reference Number: -
Description: Fixed the issue where zero-length OpenSHMEM collectives might have failed due to an incomplete implementation.
Keywords: OpenSHMEM collectives, Open MPI
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)
Fixed in Version: 2.4 (Open MPI v4.0.x, OpenSHMEM v1.4)

Internal Reference Number: -
Description: Fixed the issue where the OSC UCX module was not selected by default on ConnectX-4/ConnectX-5 HCAs.
Keywords: OSC UCX, one-sided, Open MPI
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4)
Fixed in Version: 2.4 (Open MPI v4.0.x, OpenSHMEM v1.4)

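On affected versions, the module could still be requested explicitly; a hypothetical invocation forcing the UCX one-sided component (rank count and application name are placeholders):

    mpirun -np 64 --mca osc ucx ./my_one_sided_app
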
Internal Reference Number: -
Description: Fixed the issue where using UCX on ARM hosts might have resulted in hangs, due to a known issue in Open MPI when running on ARM.
Keywords: UCX
Discovered in Version: 1.3 (Open MPI 1.8.2)
Fixed in Version: 2.3 (Open MPI 4.0.x)

Internal Reference Number: -
Description: MCA options rmaps_dist_device and rmaps_base_mapping_policy are now functional.
Keywords: Process binding policy, NUMA/HCA locality
Discovered in Version: 2.0 (Open MPI 3.0.0)
Fixed in Version: 2.3 (Open MPI 4.0.x)

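For illustration, these options map ranks by their distance from a given device; a sketch of such a launch, in which the device name mlx5_0, the dist:span policy value, and the rank count are placeholders for your setup:

    mpirun -np 64 \
        --mca rmaps_base_mapping_policy dist:span \
        --mca rmaps_dist_device mlx5_0 \
        ./my_mpi_app
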
Internal Reference Number: 2111
Description: Fixed an issue where, when UCX was used in multi-threaded mode, the osu_latency_mt test might have taken a long time to complete. (Github issue: https://github.com/openucx/ucx/issues/2111)
Keywords: UCX, multi-threaded
Discovered in Version: 2.1 (UCX 1.3)
Fixed in Version: 2.3 (UCX 1.5)

Internal Reference Number: 2267
Description: Fixed the issue where the following error message might have appeared when running at a scale of 256 ranks with the RC transport, with UD used for wireup only: "Fatal: send completion with error: Endpoint timeout". (Github issue: https://github.com/openucx/ucx/issues/2267)
Keywords: UCX
Discovered in Version: 2.1 (UCX 1.3)
Fixed in Version: 2.3 (UCX 1.5)

Internal Reference Number: 2702
Description: Fixed an issue where error messages may have been printed when using the Hardware Tag Matching feature. (Github issue: https://github.com/openucx/ucx/issues/2702)
Keywords: Hardware Tag Matching
Discovered in Version: 2.2 (UCX 1.4)
Fixed in Version: 2.3 (UCX 1.5)

Internal Reference Number: 2454
Description: Fixed the issue where some one-sided benchmarks may have hung when using "osc ucx", for example osu-micro-benchmarks-5.3.2/osu_get_acc_latency (Latency Test for accumulate with Active/Passive Synchronization). (Github issue: https://github.com/openucx/ucx/issues/2454)
Keywords: UCX, one-sided
Discovered in Version: 2.2 (UCX 1.4)
Fixed in Version: 2.3 (UCX 1.5)

Internal Reference Number: 2670
Description: Fixed an issue where, when enabling the Hardware Tag Matching feature at a large scale, the following error message may have been printed due to the increased threshold for BCOPY messages: "mpool.c:177 UCX ERROR Failed to allocate memory pool chunk: Out of memory". (Github issue: https://github.com/openucx/ucx/issues/2670)
Keywords: Hardware Tag Matching
Discovered in Version: 2.2 (UCX 1.4)
Fixed in Version: 2.3 (UCX 1.5)

Internal Reference Number: 1295679
Description: Fixed the issue where the OpenSHMEM group cache had a default limit of 100 entries, which might have resulted in an OpenSHMEM application exiting with the following message: "group cache overflow on rank xxx: cache_size = 100".
Keywords: OpenSHMEM, Open MPI
Discovered in Version: 2.1 (Open MPI 3.1.x)
Fixed in Version: 2.2 (Open MPI 3.1.x)

Internal Reference Number: -
Description: Fixed the issue where UCX did not work out-of-the-box with CUDA support.
Keywords: UCX, CUDA
Discovered in Version: 2.1 (UCX 1.3)
Fixed in Version: 2.2 (UCX 1.4)

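As a hypothetical check on versions with working CUDA support, UCX's CUDA transports can be requested explicitly via UCX_TLS; the transport list, rank count, and application name below are examples only:

    UCX_TLS=rc,sm,self,cuda_copy,cuda_ipc mpirun -np 2 ./my_cuda_app
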
Internal Reference Number: 1926
Description: Fixed an issue where, when using multiple transports, invalid data was sent out-of-sync with Hardware Tag Matching traffic. (Github issue: https://github.com/openucx/ucx/issues/1926)
Keywords: Hardware Tag Matching
Discovered in Version: 2.1 (UCX 1.3)
Fixed in Version: 2.2 (UCX 1.4)

Internal Reference Number: 1949
Description: Fixed the issue where Hardware Tag Matching might not have functioned properly with UCX over the DC transport. (Github issue: https://github.com/openucx/ucx/issues/1949)
Keywords: UCX, Hardware Tag Matching, DC transport
Discovered in Version: 2.0
Fixed in Version: 2.1

Internal Reference Number: -
Description: Fixed job data transfer from SD to libsharp.
Keywords: NVIDIA SHARP library
Discovered in Version: 1.9
Fixed in Version: 1.9.7

Internal Reference Number: 884482
Description: Fixed internal HCOLL datatype mapping.
Keywords: HCOLL, FCA
Discovered in Version: 1.7.405
Fixed in Version: 1.7.406

Internal Reference Number: 884508
Description: Fixed internal HCOLL datatype lower-bound calculation.
Keywords: HCOLL, FCA
Discovered in Version: 1.7.405
Fixed in Version: 1.7.406

Internal Reference Number: 884490
Description: Fixed allgather unpacking issues.
Keywords: HCOLL, FCA
Discovered in Version: 1.7.405
Fixed in Version: 1.7.406

Internal Reference Number: 885009
Description: Fixed a wrong answer in alltoallv.
Keywords: HCOLL, FCA
Discovered in Version: 1.7.405
Fixed in Version: 1.7.406

Internal Reference Number: 882193
Description: Fixed a multicast group leak in HCOLL.
Keywords: HCOLL, FCA
Discovered in Version: 1.7.405
Fixed in Version: 1.7.406

Internal Reference Number: -
Description: Added IN_PLACE support for alltoall, alltoallv, and allgatherv.
Keywords: HCOLL, FCA
Discovered in Version: 1.7.405
Fixed in Version: 1.7.406

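For reference, "in place" here refers to MPI's MPI_IN_PLACE mode, in which each rank's contribution is taken from (and the result written back into) the receive buffer; a minimal sketch (the wrapper name is illustrative):

    #include <mpi.h>

    /* In-place alltoall: sendcount/sendtype are ignored when
     * MPI_IN_PLACE is passed as the send buffer. */
    void inplace_alltoall(int *buf, int count, MPI_Comm comm) {
        MPI_Alltoall(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                     buf, count, MPI_INT, comm);
    }
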
Internal Reference Number: -
Description: Fixed an issue related to multi-threaded MPI_Bcast.
Keywords: HCOLL, FCA
Discovered in Version: 1.7.405
Fixed in Version: 1.7.406

Salesforce: 316541
Description: Fixed a memory barrier issue in MPI_Barrier on PowerPC systems.
Keywords: HCOLL, FCA
Discovered in Version: 1.7.405
Fixed in Version: 1.7.406

Salesforce: 316547
Description: Fixed hangs in multi-threaded MPI_Comm_dup and MPI_Comm_split.
Keywords: HCOLL, FCA
Discovered in Version: 1.7.405
Fixed in Version: 1.7.406

Internal Reference Number: 894346
Description: Fixed Quantum Espresso hanging issues.
Keywords: HCOLL, FCA
Discovered in Version: 1.7.405
Fixed in Version: 1.7.406

Internal Reference Number: 898283
Description: Fixed an issue which caused CP2K applications to hang when HCOLL was enabled.
Keywords: HCOLL, FCA
Discovered in Version: 1.7.405
Fixed in Version: 1.7.406

Internal Reference Number: 906155
Description: Fixed an issue which caused VASP applications to hang in MPI_Allreduce.
Keywords: HCOLL, FCA
Discovered in Version: 1.6
Fixed in Version: 1.7.406