Bug Fixes in this Version

NVIDIA HPC-X Software Toolkit Rev 2.19.0

Ref #

Description

3712109

Description: Fixed a UCC error that occurred in PyTorch 23.12 following the HPC-X 2.17.0 upgrade.

Keywords: UCC error, PyTorch, Upgrade

Discovered in Release: 2.17.0

3653404

Description: Fixed the issue where, when a large memory region is registered with ucp_mem_map() and peer failure handling support is enabled on the UCX endpoint, the process may crash with the error “LRU push returned Unsupported operation” while sending a buffer belonging to that region. The issue occurred because multi-threaded registration is used for large regions, and it does not work well with peer failure support.

Keywords: Multi-Threaded, Indirect, Key Registration

Discovered in Release: 2.17.0

3837556

Description: Fixed UCX to not create an SRQ on RDMA network devices that do not support it. Before this fix, the application could fail with the error message “ibv_create_srq() failed: Operation not supported”.

Keywords: SRQ, UCX

Discovered in Release: 2.18.0

3774158

Description: Fixed a failure with the message “Local length error”. The issue was caused by some compilers replacing direct assignments with calls to memmove(), which corrupted data written to I/O memory.

Keywords: UCX, Local length error

Discovered in Release: 2.18.0
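
The class of problem described in 3774158 can be illustrated with a minimal C sketch. It is not the actual UCX change, which this note does not describe; the descriptor type io_desc_t and the function io_desc_write() are hypothetical names introduced only for this example. The sketch assumes that writing through a volatile-qualified pointer, one word at a time, is an acceptable way to keep the compiler from substituting a memmove() or memcpy() call for the stores.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-byte descriptor; the layout is illustrative only. */
typedef struct {
    uint64_t words[4];
} io_desc_t;

/*
 * Copy a descriptor into (possibly memory-mapped) I/O memory one 64-bit
 * word at a time. Because the destination is volatile-qualified, the
 * compiler must emit each store as written and cannot collapse the loop
 * into a library copy routine such as memmove().
 */
void io_desc_write(volatile io_desc_t *dst, const io_desc_t *src)
{
    size_t i;

    for (i = 0; i < sizeof(src->words) / sizeof(src->words[0]); ++i) {
        dst->words[i] = src->words[i];
    }
}
```

In a quick check, an ordinary buffer in RAM can stand in for the device memory; on real hardware the destination would be a mapped device region.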

3774153

Description: Fixed a race condition between RDMA_WRITE and shared memory writes that, in some cases, caused MPI to receive invalid data with large messages or collective operations between ranks on the same node.

Keywords: RDMA_WRITE

Discovered in Release: 2.18.0

3762227

Description: Fixed the issue where the application may crash in the UCX remote key packing procedure after a failed memory registration.

Keywords: UCX, assertion

Discovered in Release: 2.17.1

3748762

Description: Fixed the issue where the application may crash in the UCX remote key packing procedure after a failed memory registration.

Keywords: UCX, assertion

Discovered in Release: 2.17.1

© Copyright 2024, NVIDIA. Last updated on May 6, 2024.