Bug Fixes in this Version

Internal Ref.

Issue

3609384

Description: Fixed issues with sharp_am connection creation to rank-zero clients of active jobs during a restart when UCX is enabled.

Keywords: sharp_am, libsharp, restart

Discovered in Version: 3.4.0

Fixed in Release: 3.5.0

3541153

Description: Fixed an issue with job cleanup after abnormal client termination. When a client application terminates abnormally before calling sharp_coll_finalize, sharp_am is supposed to detect this automatically and clean up the job resources. With UCX, however, only one such termination was detected per cycle, leaving job cleanup incomplete. Similarly, with NCCL on hosts with multiple GPUs/HCAs, each HCA gets its own SHARP job, so sharp_am took several cycles to detect all the jobs that required cleaning. As a result, hosts used by the previous application could not start a new SHARP job until sharp_am had detected and cleaned all of those jobs.

Keywords: sharp_am, NCCL, UCX

Discovered in Version: 3.4.0

Fixed in Release: 3.5.0
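
Illustration (not part of the product): the delay described for issue 3541153 can be approximated with a small toy model. The sketch below assumes that each detection cycle finds a fixed number of terminated jobs. The function name cycles_to_clean and both of its parameters are hypothetical and do not correspond to any sharp_am setting or API. The second call is shown only for comparison; these notes do not describe how the fix changes detection.

```python
import math


def cycles_to_clean(stale_jobs: int, detected_per_cycle: int) -> int:
    """Toy model: detection cycles needed before every stale job is cleaned.

    Both parameters are hypothetical and are not sharp_am options.
    """
    if stale_jobs < 0 or detected_per_cycle < 1:
        raise ValueError("stale_jobs must be >= 0 and detected_per_cycle >= 1")
    return math.ceil(stale_jobs / detected_per_cycle)


# Assumed example: a host with 8 HCAs under NCCL holds 8 SHARP jobs,
# one per HCA, and all of them terminate abnormally at once.
print(cycles_to_clean(stale_jobs=8, detected_per_cycle=1))  # 8
print(cycles_to_clean(stale_jobs=8, detected_per_cycle=8))  # 1
```

With one detection per cycle, the host in this example stays blocked for eight cycles before it can start a new SHARP job.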

© Copyright 2023, NVIDIA. Last updated on Nov 16, 2023.