Bug Fixes in this Version
Ref # |
Description |
3712109 |
Description: Fixed a UCC error in PyTorch 23.12 that occurred after upgrading to HPC-X 2.17.0. |
Keywords: UCC error, PyTorch, Upgrade |
|
Discovered in Release: 2.17.0 |
|
3653404 |
Description: Fixed the issue where, after registering a large memory region with ucp_mem_map() while peer failure handling support is enabled on the UCX endpoint, the process could crash with the error "LRU push returned Unsupported operation" when sending a buffer belonging to that region. The crash occurred because large regions are registered with multi-threaded registration, which does not work well with peer failure support. |
Keywords: Multi-Threaded, Indirect, Key Registration |
|
Discovered in Release: 2.17.0 |
|
3837556 |
Description: Fixed UCX to not create an SRQ on RDMA network devices that do not support it. Before this fix, the application could fail with the error message "ibv_create_srq() failed: Operation not supported". |
Keywords: SRQ, UCX |
|
Discovered in Release: 2.18.0 |
|
3774158 |
Description: Fixed a failure reported with the message "Local length error". The issue was caused by some compilers replacing direct assignments with calls to memmove(), which corrupted data written to IO memory. |
Keywords: UCX, Local length error |
|
Discovered in Release: 2.18.0 |
|
3774153 |
Description: Fixed a race condition between RDMA_WRITE and shared memory writes that, in some cases, caused MPI to receive invalid data for large messages or collective operations between ranks on the same node. |
Keywords: RDMA_WRITE |
|
Discovered in Release: 2.18.0 |
|
3762227 |
Description: Fixed the issue where the application could crash in the UCX remote key packing procedure after a failed memory registration. |
Keywords: UCX, assertion |
|
Discovered in Release: 2.17.1 |
|
3748762 |
Description: Fixed the issue where the application could crash in the UCX remote key packing procedure after a failed memory registration. |
Keywords: UCX, assertion |
|
Discovered in Release: 2.17.1 |
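
The compiler behavior described for reference 3774158 can be illustrated with a minimal C sketch. It is not the actual UCX change. It only shows one common way to keep a compiler from merging a run of direct assignments into a single memmove() call: performing every store through a volatile-qualified pointer. The helper name io_write_words is hypothetical, and in the usage note an ordinary array stands in for real IO memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a UCX API): copy `count` 64-bit words into a
 * destination that stands in for memory-mapped IO.
 *
 * Each store goes through a volatile-qualified lvalue. Volatile accesses
 * are observable side effects, so the compiler is expected to emit them
 * one by one, in order, instead of replacing the loop with memmove().
 * Note that volatile alone does not order the stores with respect to the
 * device; real IO paths typically also need hardware memory barriers.
 */
static void io_write_words(volatile uint64_t *dst, const uint64_t *src,
                           size_t count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; ++i) {
        dst[i] = src[i];
    }
}
```

A caller would pass the mapped IO address as dst. For a quick check, a plain uint64_t array can be passed instead and compared against the source words afterwards.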