Bug Fixes History
Internal Reference Number | Issue |
- | Description: DevX does not work on architectures without "Write combining" support, such as some flavors of ARM, prompting the following error message. |
Keywords: DevX, UCX, ARM | |
Discovered in Version: 2.8 (UCX 1.10) | |
Fixed in Version: 2.9 (UCX 1.11) | |
- | Description: Mellanox SHARP library is not available in HPC-X for the Community OFED and Inbox OFED. |
Keywords: Mellanox SHARP library | |
Discovered in Version: 2.0 | |
Fixed in Version: 2.9 (UCX 1.11) | |
2190337 | Description: Fixed the issue where connection-refused errors from the UCX TCP transport may have appeared; an illustrative UCX_TLS command-line sketch follows this table. |
Keywords: UCX_TLS, UCX, TCP | |
Discovered in Version: 2.7 (UCX 1.9) | |
Fixed in Version: 2.8 (UCX 1.10) | |
2131893 | Description: Fixed the issue where OpenSHMEM or MPI applications may have failed with the following error: “Fatal: endpoint reconfiguration not supported yet”. This could happen when running in a heterogeneous environment, such as when different nodes in the job had different types of HCAs or different PCI atomics configurations. |
Keywords: OpenSHMEM, UCX, MPI | |
Discovered in Version: 2.7 (UCX 1.9) | |
Fixed in Version: 2.8 (UCX 1.10) | |
2084450 | Description: Fixed the issue where the osu_ialltoallw and osu_iallgather benchmarks may not have performed well over RoCE with the ud_x transport, starting from messages of 8192 bytes. |
Keywords: osu_ialltoallw, osu_iallgather, ud_x transport, RoCE, UCX | |
Discovered in Version: 2.6 (UCX 1.8) | |
Fixed in Version: 2.8 (UCX 1.10) | |
1886580 | Description: Fixed the issue where error messages might have been received when running OMPI with ‘direct modex’, i.e. when the corresponding command line parameters were used. |
Keywords: OMPI, pmix, direct modex, full modex | |
Discovered in Version: 2.5 (OpenMPI 4.0.x) | |
Fixed in Version: 2.7 (OpenMPI 4.0.x) | |
4710 | Description: Fixed the issue where using UCX with the XPMEM module on kernels 4.10 and above might have resulted in a "Bus error" due to an issue in the XPMEM driver; an illustrative UCX_TLS exclusion sketch follows this table. (Github issue: https://github.com/openucx/ucx/issues/4710) |
Keywords: UCX, XPMEM | |
Discovered in Version: 2.6 (UCX 1.8) | |
Fixed in Version: 2.7 (UCX 1.9) | |
2096036 | Description: Fixed the issue where the verifier test may have failed with the following error when using the ud_x transport: ib_mlx5_log.c:139 Local QP operation on mlx5_0:1/IB (synd 0x2 vend 0x68 hw_synd 0/66) |
Keywords: ud_x transport, UCX | |
Discovered in Version: 2.6 (UCX 1.8) | |
Fixed in Version: 2.7 (UCX 1.9) | |
2095618 | Description: Fixed the issue where the host may have run out of memory when enabling Hardware Tag-Matching. |
Keywords: Hardware Tag-Matching, UCX | |
Discovered in Version: 2.6 (UCX 1.8) | |
Fixed in Version: 2.7 (UCX 1.9) | |
3758 | Description: Fixed the issue where running UCX with the TCP transport on more than 16 hosts with full PPN (processes per node) might have produced the following error message: sock.c:228 UCX ERROR recv(fd=1377) failed: 104 (Github issue: https://github.com/openucx/ucx/issues/3758) |
Keywords: TCP, UCX, backlog | |
Discovered in Version: 2.5 (UCX 1.7) | |
Fixed in Version: 2.6 (UCX 1.8) | |
1582208 | Description: Fixed the issue where sending data over multiple SHMEM contexts might have led to memory corruption or a segmentation fault. |
Keywords: Open SHMEM, segmentation fault | |
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4) | |
Fixed in Version: 2.5 (Open MPI v4.0.x, OpenSHMEM v1.4) | |
2934 | Description: Fixed the issue where OpenMPI and OpenSHMEM applications might have hung with the DC transport. (Github issue: https://github.com/openucx/ucx/issues/2934) |
Keywords: UCX, Open MPI, DC | |
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4) | |
Fixed in Version: 2.5 (Open MPI v4.0.x, OpenSHMEM v1.4) | |
1307243 | Description: Fixed the issue where one-sided tests might have failed with a segmentation fault. |
Keywords: OSC UCX, Open MPI, one-sided | |
Discovered in Version: 2.1 (Open MPI 3.1.x) | |
Fixed in Version: 2.5 (Open MPI 4.0.x) | |
- | Description: Fixed the issue where OpenSHMEM atomic operations AND/OR/XOR for datatypes int32/int64/uint32/uint64 were not implemented, which might have caused build failures. |
Keywords: OpenSHMEM atomic, Open MPI | |
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4) | |
Fixed in Version: 2.4 (Open MPI v4.0.x, OpenSHMEM v1.4) | |
2226 | Description: Fixed the issue where the following assertion may have failed in certain cases: Assertion `ep->rx.ooo_pkts.head_sn == neth->psn' failed (Github issue: https://github.com/openucx/ucx/issues/2226) |
Keywords: UCX, assertion | |
Discovered in Version: 2.1 (UCX 1.3) | |
Fixed in Version: 2.4 (UCX 1.6) | |
- | Description: Fixed the issue where zero-length OpenSHMEM collectives might have failed due to incomplete implementation. |
Keywords: OpenSHMEM atomic, Open MPI | |
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4) | |
Fixed in Version: 2.4 (Open MPI v4.0.x, OpenSHMEM v1.4) | |
- | Description: Fixed the issue where OSC UCX module was not selected by default on ConnectX-4/ConnectX-5 HCAs. |
Keywords: OSC UCX, one-sided, Open MPI | |
Discovered in Version: 2.3 (Open MPI v4.0.x, OpenSHMEM v1.4) | |
Fixed in Version: 2.4 (Open MPI v4.0.x, OpenSHMEM v1.4) | |
- | Description: Fixed the issue where using UCX on ARM hosts might have resulted in hangs due to a known issue in Open MPI when running on ARM. |
Keywords: UCX | |
Discovered in Version: 1.3 (Open MPI 1.8.2) | |
Fixed in Version: 2.3 (Open MPI 4.0.x) | |
- | Description: MCA options rmaps_dist_device and rmaps_base_mapping_policy are now functional; an illustrative mpirun sketch follows this table. |
Keywords: Process binding policy, NUMA/HCA locality | |
Discovered in Version: 2.0 (Open MPI 3.0.0) | |
Fixed in Version: 2.3 (Open MPI 4.0.x) | |
2111 | Description: Fixed the issue where, when UCX was used in multi-threaded mode, the osu_latency_mt test might have taken a long time to complete. (Github issue: https://github.com/openucx/ucx/issues/2111) |
Keywords: UCX, multi-threaded | |
Discovered in Version: 2.1 (UCX 1.3) | |
Fixed in Version: 2.3 (UCX 1.5) | |
2267 | Description: Fixed the issue where the following error message might have appeared when running at the scale of 256 ranks with the RC transport, when UD was used for wireup only: “Fatal: send completion with error: Endpoint timeout”. (Github issue: https://github.com/openucx/ucx/issues/2267) |
Keywords: UCX | |
Discovered in Version: 2.1 (UCX 1.3) | |
Fixed in Version: 2.3 (UCX 1.5) | |
2702 | Description: Fixed the issue where error messages may have been printed when using the Hardware Tag Matching feature. (Github issue: https://github.com/openucx/ucx/issues/2702) |
Keywords: Hardware Tag Matching | |
Discovered in Version: 2.2 (UCX 1.4) | |
Fixed in Version: 2.3 (UCX 1.5) | |
2454 | Description: Fixed the issue where some one-sided benchmarks may have hung when using “osc ucx”. (Github issue: https://github.com/openucx/ucx/issues/2454) |
Keywords: UCX, one_sided | |
Discovered in Version: 2.2 (UCX 1.4) | |
Fixed in Version: 2.3 (UCX 1.5) | |
2670 | Description: Fixed the issue where enabling the Hardware Tag Matching feature at large scale may have led to the following error message being printed, due to the increased threshold for BCOPY messages: “mpool.c:177 UCX ERROR Failed to allocate memory pool chunk: Out of memory.” (Github issue: https://github.com/openucx/ucx/issues/2670) |
Keywords: Hardware Tag Matching | |
Discovered in Version: 2.2 (UCX 1.4) | |
Fixed in Version: 2.3 (UCX 1.5) | |
1295679 | Description: Fixed the issue where the OpenSHMEM group cache had a default limit of 100 entries, which might have resulted in an OpenSHMEM application exiting with the following message: “group cache overflow on rank xxx: cache_size = 100”. |
Keywords: OpenSHMEM, Open MPI | |
Discovered in Version: 2.1 (Open MPI 3.1.x) | |
Fixed in Version: 2.2 (Open MPI 3.1.x) | |
- | Description: Fixed the issue where UCX did not work out-of-the-box with CUDA support; an illustrative CUDA build sketch follows this table. |
Keywords: UCX, CUDA | |
Discovered in Version: 2.1 (UCX 1.3) | |
Fixed in Version: 2.2 (UCX 1.4) | |
1926 | Description: Fixed the issue where, when using multiple transports, invalid data was sent out-of-sync with Hardware Tag Matching traffic. (Github issue: https://github.com/openucx/ucx/issues/1926) |
Keywords: Hardware Tag Matching | |
Discovered in Version: 2.1 (UCX 1.3) | |
Fixed in Version: 2.2 (UCX 1.4) | |
1949 | Description: Fixed the issue where Hardware Tag Matching might not have functioned properly with UCX over DC transport. (Github issue: https://github.com/openucx/ucx/issues/1949) |
Keywords: UCX, Hardware Tag Matching, DC transport | |
Discovered in Version: 2.0 | |
Fixed in Version: 2.1 | |
- | Description: Fixed job data transfer from SD to libsharp. |
Keywords: Mellanox SHARP library | |
Discovered in Release: 1.9 | |
Fixed in Release: 1.9.7 | |
884482 | Description: Fixed internal HCOLL datatype mapping. |
Keywords: HCOLL, FCA | |
Discovered in Release: 1.7.405 | |
Fixed in Release: 1.7.406 | |
884508 | Description: Fixed internal HCOLL datatype lower bound calculation. |
Keywords: HCOLL, FCA | |
Discovered in Release: 1.7.405 | |
Fixed in Release: 1.7.406 | |
884490 | Description: Fixed allgather unpacking issues. |
Keywords: HCOLL, FCA | |
Discovered in Release: 1.7.405 | |
Fixed in Release: 1.7.406 | |
885009 | Description: Fixed wrong answer in alltoallv. |
Keywords: HCOLL, FCA | |
Discovered in Release: 1.7.405 | |
Fixed in Release: 1.7.406 | |
882193 | Description: Fixed mcast group leak in HCOLL. |
Keywords: HCOLL, FCA | |
Discovered in Release: 1.7.405 | |
Fixed in Release: 1.7.406 | |
- | Description: Added IN_PLACE support for alltoall, alltoallv, and allgatherv. |
Keywords: HCOLL, FCA | |
Discovered in Release: 1.7.405 | |
Fixed in Release: 1.7.406 | |
- | Description: Fixed an issue related to multi-threaded MPI_Bcast. |
Keywords: HCOLL, FCA | |
Discovered in Release: 1.7.405 | |
Fixed in Release: 1.7.406 | |
Salesforce: 316541 | Description: Fixed a memory barrier issue in MPI_Barrier on PowerPC (PPC) systems. |
Keywords: HCOLL, FCA | |
Discovered in Release: 1.7.405 | |
Fixed in Release: 1.7.406 | |
Salesforce: 316547 | Description: Fixed multi-threaded MPI_COMM_DUP and MPI_COMM_SPLIT hanging issues. |
Keywords: HCOLL, FCA | |
Discovered in Release: 1.7.405 | |
Fixed in Release: 1.7.406 | |
894346 | Description: Fixed Quantum Espresso hanging issues. |
Keywords: HCOLL, FCA | |
Discovered in Release: 1.7.405 | |
Fixed in Release: 1.7.406 | |
898283 | Description: Fixed an issue which caused CP2K applications to hang when HCOLL was enabled. |
Keywords: HCOLL, FCA | |
Discovered in Release: 1.7.405 | |
Fixed in Release: 1.7.406 | |
906155 | Description: Fixed an issue which caused VASP applications to hang in MPI_Allreduce. |
Keywords: HCOLL, FCA | |
Discovered in Release: 1.6 | |
Fixed in Release: 1.7.406 |
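
The following is a minimal sketch of how the UCX transport selection referenced in the UCX_TLS/TCP entry above can be pinned to TCP; the process count and benchmark name are illustrative assumptions, not taken from the entry itself.

    # Illustrative only: restrict UCX to the TCP, shared-memory, and loopback
    # transports by exporting UCX_TLS to all ranks through mpirun.
    mpirun -np 64 -x UCX_TLS=tcp,self,sm ./osu_latency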
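
As a sketch of the kind of workaround typically applied before the XPMEM fix above, UCX's "^" exclusion syntax for UCX_TLS can keep a specific transport out of the selection; the application name is an illustrative assumption.

    # Illustrative only: let UCX pick transports as usual, but exclude xpmem
    # so the affected XPMEM driver path is never exercised.
    mpirun -np 16 -x UCX_TLS=^xpmem ./my_app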
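
A minimal sketch of how the rmaps_dist_device and rmaps_base_mapping_policy MCA options mentioned above are typically combined to place ranks close to a given HCA; the device name mlx5_0, the dist:span policy value, and the application name are illustrative assumptions.

    # Illustrative only: map ranks by their NUMA distance to the HCA mlx5_0.
    mpirun -np 32 --mca rmaps_base_mapping_policy dist:span \
           --mca rmaps_dist_device mlx5_0 ./my_app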
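
A sketch of building UCX with CUDA support, related to the out-of-the-box CUDA entry above, assuming the CUDA toolkit is installed under /usr/local/cuda; the install prefix is an illustrative assumption.

    # Illustrative only: configure and build UCX with CUDA memory support.
    ./contrib/configure-release --prefix=$HOME/ucx-cuda --with-cuda=/usr/local/cuda
    make -j8 && make install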