Bug Fixes History
The following table provides a list of bugs fixed in this SHARP version.
Internal Ref. |
Issue |
3609384 |
Description: Fixed issues with sharp_am connection creation to the rank-zero clients of active jobs during a restart when UCX is enabled. |
Keywords: sharp_am, libsharp, restart |
|
Discovered in Version: 3.4.0 |
|
Fixed in Release: 3.5.0 |
|
3541153 |
Description: Fixed an issue with job cleanup after abnormal client termination. When a client application terminates abnormally before calling sharp_coll_finalize, sharp_am is supposed to detect this automatically and clean up the job resources. With UCX, however, only one such termination was detected per cycle, leading to incomplete job cleanup. Similarly, when using NCCL on hosts with multiple GPUs/HCAs, each HCA gets its own SHARP job, so sharp_am took several cycles to detect all the jobs that required cleaning. As a consequence, hosts used by the previous application could not initiate a new SHARP job until sharp_am had detected and cleaned all the necessary jobs. |
Keywords: sharp_am, NCCL, UCX |
|
Discovered in Version: 3.4.0 |
|
Fixed in Release: 3.5.0 |
|
3400293 |
Description: Fixed an issue in libsharp where it failed to respond to messages from the SM while searching for Service Records, causing the SM to print timeout messages. |
Keywords: sharp_am; OpenSM |
|
Discovered in Version: 3.1.0 |
|
Fixed in Release: 3.4.0 |
|
3479721 |
Description: Fixed the issue where sharp_am mishandled hypercube topologies, incorrectly treating different switches as duplicates. |
Keywords: sharp_am; hypercube |
|
Discovered in Version: 3.3.0 |
|
Fixed in Release: 3.4.0 |
|
3496440 |
Description: Fixed the issue in sharp_am where excessive log messages were printed for each disconnected or restarted compute host. The information is now consolidated into a single log message, either as a summary of the disconnected hosts or as a list of those hosts. The complete list of hosts is still printed at the DEBUG level. |
Keywords: sharp_am |
|
Discovered in Version: 3.3.0 |
|
Fixed in Release: 3.4.0 |
|
3336788 |
Description: Fixed the issue in the firmware where MAD error responses might have been received in libsharp. |
Keywords: sharp_am; libsharp |
|
Discovered in Version: 3.2.0 |
|
Fixed in Release: 3.3.0 (Quantum-2 Firmware 31.2010.6064) |
|
3343503 |
Description: Fixed the issue where sharp_am installed from MLNX_OFED used an invalid range of job IDs, resulting in occasional errors when trying to establish new SHARP jobs. |
Keywords: MLNX_OFED; sharp_am |
|
Discovered in Version: 3.2.0 |
|
Fixed in Release: 3.3.0 |
|
3368381 |
Description: Fixed the issue where libsharp did not retry failed GroupJoin MADs enough times, causing SHARP jobs to fail before they even started. |
Keywords: libsharp; MADs |
|
Discovered in Release: 3.0.0 |
|
Fixed in Release: 3.3.0 |
|
3393902 |
Description: Fixed the issue where re-created virtual ports were not recognized by sharp_am, so the correct tree was not built for them. This resulted in SAT jobs getting ibv_poll_cq failures in libsharp. |
Keywords: Virtual port; sharp_am; libsharp; SAT; ibv_poll_cq |
|
Discovered in Version: 3.2.0 |
|
Fixed in Release: 3.3.0 |
|
3404474 |
Description: Fixed an issue where an application allocation requested via the /app/sharp/resources REST API returned a successful job instead of an error when the allocation failed for all hosts. |
Keywords: REST API; allocation |
|
Discovered in Release: 3.2.0 |
|
Fixed in Release: 3.3.0 |
|
3406186 |
Description: Fixed an issue where SHARP AM failed to handle reports from OpenSM if some switch ports were down or isolated. |
Keywords: Aggregation Manager; Aggregation Node; OpenSM |
|
Discovered in Release: 3.2.0 |
|
Fixed in Release: 3.3.0 |
|
3236363 |
Description: Fixed the way physical link failures between switches are handled. In the event of a link failure, a SHARP job utilizing the link has to be stopped; however, this has no effect on other current or future jobs. |
Keywords: Aggregation Manager; sharp_am; Link Failure |
|
Discovered in Release: 3.1.0 |
|
Fixed in Release: 3.2.0 |
|
3230585 |
Description: Fixed the issue where, when operating in dynamic trees mode, ibdiagnet might have printed warning messages about the existence of multiple distinct trees with the same tree ID. |
Keywords: Dynamic tree; ibdiagnet |
|
Discovered in Version: 3.1.0 |
|
Fixed in Release: 3.2.0 |
|
3226743 |
Description: Fixed the issue where, when a management host was not connected to a leaf switch, sharp_am might have printed a number of warning messages about trees that could not reach all aggregation nodes. As of SHARP v3.2.0, the active management host is automatically identified and is not treated as a potential compute host. Note that this does not include standby management hosts, for which a warning message still appears. These management hosts can be added to a list of GUIDs to ignore via the ignore_host_guids_file parameter. |
Keywords: Aggregation Manager; sharp_am; leaf; GUID |
|
Discovered in Release: 3.0.1 |
|
Fixed in Release: 3.2.0 |
|
3274564 |
Description: Fixed an issue where the sharp_benchmark bash script did not work with all bash versions. |
Keywords: sharp_benchmark |
|
Discovered in Release: 3.1.1 |
|
Fixed in Release: 3.2.0 |
|
3262936 |
Description: Fixed the issue where a crash took place during sharp_am reboot while physical links were hanging between switches in the fabric. |
Keywords: sharp_am; physical links; crash |
|
Discovered in Release: 3.1.0 |
|
Fixed in Release: 3.1.1 LTS |
|
3192770 |
Description: Fixed the issue where SHARP jobs failed when using virtual interfaces configured with SR-IOV. |
Keywords: SR-IOV |
|
Discovered in Release: 3.0.0 |
|
Fixed in Release: 3.1.0 |
|
3163697 |
Description: Fixed the issue where, when the client application used more than 1024 file descriptors (the limit defined by FD_SETSIZE), libsharp could not use any more file descriptors. Using poll() instead of select() allows libsharp to use the full range of file descriptors allowed by Linux. |
Keywords: File descriptor; libsharp; HCOLL; HPC-X |
|
Discovered in Release: 3.0.0 |
|
Fixed in Release: 3.1.0 |
|
3192770 |
Description: Fixed the issue where SHARP jobs failed when using virtual interfaces configured with SR-IOV. |
Keywords: SR-IOV |
|
Discovered in Release: 3.0.0 |
|
Fixed in Release: 3.0.1 |
|
3163697 |
Description: Fixed the issue where, when the client application used more than 1024 file descriptors (the limit defined by FD_SETSIZE), libsharp could not use any more file descriptors. Using poll() instead of select() allows libsharp to use the full range of file descriptors allowed by Linux. |
Keywords: File descriptor; libsharp; HCOLL |
|
Discovered in Release: 3.0.0 |
|
Fixed in Release: 3.0.1 |
|
2995739 |
Description: The sharp_am daemon is no longer removed when performing an rpm upgrade; it is overwritten instead. |
Keywords: Aggregation Manager; rpm |
|
Discovered in Release: 2.6.1 |
|
Fixed in Release: 2.7.0 |
|
2972970 |
Description: Fixed the issue where completing the SHARP installation with the sharp_daemons_setup.sh script depended on Python being available. |
Keywords: Aggregation Manager |
|
Discovered in Release: 2.6.1 |
|
Fixed in Release: 2.7.0 |
|
2749073 |
Description: SHARP AM reports the rediscovery of aggregation nodes on every topology change. |
Keywords: Aggregation Manager |
|
Workaround: N/A |
|
Discovered in Release: 2.5.0 |
|
2736102 |
Description: SHARP AM and SHARPD overwrite backlog files after restart when log rotation is enabled. |
Keywords: Aggregation Manager, SHARPD, log file |
|
Workaround: N/A |
|
Discovered in Release: 2.5.0 |
|
2700530 |
Description: Terminating a job process during job initialization, before a job request is sent to the Aggregation Manager, might result in job resource leakage in the SHARP Aggregation Manager. |
Workaround: N/A |
|
Keywords: SHARPD, Aggregation Manager |
|
Discovered in Release: 2.5.0 |
|
2726821 |
Description: Terminating SHARPD while the job process is still running results in job resource leakage in the SHARP Aggregation Manager. |
Workaround: Terminate SHARPD after terminating the job processes. |
|
Keywords: SHARPD, Aggregation Manager |
|
2795902 |
Description: SHARPD might allocate handlers on GPU when running with UCX. |
Keywords: SHARPD, SMX, UCX |
|
Discovered in Release: 2.5.0 |
|
Workaround: Disable UCX |
|
2770210 |
Description: Syslog verbosity depends on log file verbosity. |
Keywords: SHARPD, Aggregation Manager |
|
Discovered in Release: 2.5.0 |
|
Workaround: None |
|
2825519 |
Description: Aggregation Manager continues to run after SM failover. |
Keywords: Aggregation Manager |
|
Discovered in Release: 2.5.0 |
|
Workaround: Stop AM daemon manually |
|
2754175 |
Description: SHARP Aggregation Manager might allocate bad links for jobs after receiving timeouts from Aggregation Nodes. |
Workaround: Restart corresponding switch or restart SHARP Aggregation Manager. |
|
Keywords: Aggregation Manager |
|
Discovered in Release: 2.5.0 |
|
2796317 |
Description: SHARP jobs may hang when running in reservations mode (i.e., SHARP allocation is enabled) if the reservation is created with a limited PKEY and configuring the reservation PKEY on the tree is enabled. |
Workaround: The PKEY used for creating the reservation should be "full" (the most significant bit should be set, e.g., 0x805c instead of 0x5a). |
|
Keywords: Aggregation Manager, Reservations, PKEY, UFM |
|
Discovered in Release: 2.5.0 |
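
The PKEY workaround for issue 2796317 above depends on the most significant bit of the 16-bit PKEY value. The following is a minimal sketch of that check in Python, assuming a 16-bit PKEY; the helper names (is_full_member, to_full_member) are hypothetical and are not part of SHARP or UFM.

```python
# Illustrative helpers only; these names are not part of any SHARP or UFM API.
FULL_MEMBERSHIP_BIT = 0x8000  # most significant bit of a 16-bit PKEY


def is_full_member(pkey: int) -> bool:
    """Return True if the PKEY has its most significant bit set ("full")."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("PKEY must fit in 16 bits")
    return bool(pkey & FULL_MEMBERSHIP_BIT)


def to_full_member(pkey: int) -> int:
    """Return the same PKEY with the most significant bit set."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("PKEY must fit in 16 bits")
    return pkey | FULL_MEMBERSHIP_BIT
```

With these helpers, is_full_member(0x805c) is true and is_full_member(0x5a) is false, which matches the two values named in the workaround.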
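
The file-descriptor fix listed above (3163697) replaced select() with poll(). select() can only watch descriptors numbered below FD_SETSIZE, while poll() takes an explicit list of descriptors and has no such ceiling. The sketch below illustrates the poll() pattern with Python's standard select module on Linux. It is not libsharp code, and wait_readable is a hypothetical helper name.

```python
import select
import socket


def wait_readable(fds, timeout_ms=1000):
    """Return the descriptors from fds that become readable within timeout_ms.

    Uses poll(), which accepts any valid descriptor number, rather than
    select(), which rejects descriptors at or above FD_SETSIZE.
    """
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    events = poller.poll(timeout_ms)
    return [fd for fd, event in events if event & select.POLLIN]
```

A caller can pass the file numbers of any open sockets or pipes; a descriptor is returned only if data is waiting on it before the timeout expires.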