Bug Fixes History

NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.7.0

The following table provides a list of bugs fixed in this SHARP version.

Internal Ref.

Issue

3686321

Description: Fixed an issue where, after upgrading UFM from a previous version to UFM 6.15.x, the sharp_am persistent directory specified in the configuration file pointed to a path that does not exist.

As a result, reservation and job information could not be saved, so after a restart sharp_am was unable to retrieve the required information and return to its previous state.

Keywords: sharp_am, UFM, upgrade

Discovered in Version: 3.5.0

Fixed in Release: 3.6.0

3724093

Description: Fixed the issue where libsharp, when communicating with sharp_am via UCX, automatically selected the first available IB adapter instead of the adapter specified for the data path.

Keywords: libsharp, UCX

Discovered in Version: 3.5.1

Fixed in Release: 3.6.0

3665349

Description: Fixed an issue where sharp_am failed to detect an abnormal termination of an application executing a SHARP job, which resulted in the failure to properly clean up its resources.

Keywords: sharp_am, libsharp

Discovered in Version: 3.6.0

Fixed in Release: 3.6.0

3646010

Description: Fixed an issue where sharp_am failed to support virtual ports when OpenSM topology policies were used and sharp_am was configured to use only one of the sub-topologies.

Keywords: sharp_am, Virtual Ports, OpenSM, Topology Policy

Discovered in Version: 3.6.0

Fixed in Release: 3.6.0

3609384

Description: Fixed issues with sharp_am connection creation to rank-zero clients of active jobs during a restart when UCX is enabled.

Keywords: sharp_am, libsharp, restart

Discovered in Version: 3.4.0

Fixed in Release: 3.5.0

3541153

Description: Fixed an issue with job cleanup after abnormal termination. When a client application terminates abnormally before calling sharp_coll_finalize, sharp_am is supposed to automatically detect this and clean up the job resources. However, with UCX, only one such termination was detected per cycle, leading to incomplete job cleanup. Similarly, when using NCCL on hosts with multiple GPUs/HCAs, each HCA gets its own SHARP job, so sharp_am took several cycles to detect all the jobs that required cleanup. As a consequence, hosts that had been used by the previous application could not initiate a new SHARP job until sharp_am had detected and cleaned up all the necessary jobs.

Keywords: sharp_am, NCCL, UCX

Discovered in Version: 3.4.0

Fixed in Release: 3.5.0

3400293

Description: Fixed an issue in libsharp where it failed to respond to messages from the SM while searching for Service Records, causing the SM to print timeout messages.

Keywords: sharp_am; OpenSM

Discovered in Version: 3.1.0

Fixed in Release: 3.4.0

3479721

Description: Fixed the issue where sharp_am did not handle hypercube topologies correctly, causing it to treat different switches as duplicates.

Keywords: sharp_am; hypercube

Discovered in Version: 3.3.0

Fixed in Release: 3.4.0

3496440

Description: Fixed the issue in sharp_am where excessive log messages were printed for each disconnected or restarted compute host. The information is now consolidated into a single log message containing either a summary of the disconnected hosts or a list of those hosts.

The complete list of hosts is still printed at the DEBUG level for more comprehensive details.

Keywords: sharp_am

Discovered in Version: 3.3.0

Fixed in Release: 3.4.0

3336788

Description: Fixed a firmware issue where libsharp might have received MAD error responses.

Keywords: sharp_am; libsharp

Discovered in Version: 3.2.0

Fixed in Release: 3.3.0 (Quantum-2 Firmware 31.2010.6064)

3343503

Description: Fixed the issue where sharp_am installed from MLNX_OFED used an invalid range of job IDs, resulting in occasional errors when trying to establish new SHARP jobs.

Keywords: MLNX_OFED; sharp_am

Discovered in Version: 3.2.0

Fixed in Release: 3.3.0

3368381

Description: Fixed the issue where failed libsharp GroupJoin MADs were not retried enough times, causing SHARP jobs to fail before they even started.

Keywords: libsharp; MADs

Discovered in Release: 3.0.0

Fixed in Release: 3.3.0

3393902

Description: Fixed the issue where re-created virtual ports were not recognized by sharp_am, so the correct tree was not built for them. As a result, SAT jobs encountered ibv_poll_cq failures in libsharp.

Keywords: Virtual port; sharp_am; libsharp; SAT; ibv_poll_cq

Discovered in Version: 3.2.0

Fixed in Release: 3.3.0

3404474

Description: Fixed an issue where a failure to allocate all hosts for an application via the /app/sharp/resources REST API returned a successful job instead of an error.

Keywords: REST API; allocation

Discovered in Release: 3.2.0

Fixed in Release: 3.3.0

3406186

Description: Fixed an issue where SHARP AM failed to handle reports from OpenSM if some switch ports were down or isolated.

Keywords: Aggregation Manager; Aggregation Node; OpenSM

Discovered in Release: 3.2.0

Fixed in Release: 3.3.0

3236363

Description: Fixed the way physical link failures between switches are handled. In the event of a link failure, a SHARP job using the link has to be stopped; however, this has no effect on other current or future jobs.

Keywords: Aggregation Manager; sharp_am; Link Failure

Discovered in Release: 3.1.0

Fixed in Release: 3.2.0

3230585

Description: Fixed the issue where, when operating in dynamic trees mode, ibdiagnet might have printed warning messages about the existence of multiple distinct trees with the same tree ID.

Keywords: Dynamic tree; ibdiagnet

Discovered in Version: 3.1.0

Fixed in Release: 3.2.0

3226743

Description: Fixed the issue where, when a management host was not connected to a leaf switch, sharp_am might have printed a number of warning messages about trees that could not reach all aggregation nodes.

As of SHARP v3.2.0, the active management host is automatically identified and is not treated as a potential compute host.

Note that this does not include standby management hosts, for which a warning message still appears. These management hosts can be listed among the GUIDs to ignore via the ignore_host_guids_file parameter.

Keywords: Aggregation Manager; sharp_am; leaf; GUID

Discovered in Release: 3.0.1

Fixed in Release: 3.2.0

3274564

Description: Fixed an issue where the sharp_benchmark bash script did not work on some bash versions.

Keywords: sharp_benchmark

Discovered in Release: 3.1.1

Fixed in Release: 3.2.0

3262936

Description: Fixed the issue where a crash took place during sharp_am reboot while physical links were hanging between switches in the fabric.

Keywords: sharp_am; physical links; crash

Discovered in Release: 3.1.0

Fixed in Release: 3.1.1 LTS

3192770

Description: Fixed the issue where SHARP jobs failed when using virtual interfaces configured with SR-IOV.

Keywords: SR-IOV

Discovered in Release: 3.0.0

Fixed in Release: 3.1.0

3163697

Description: Fixed the issue where, when the client application used more than 1024 file descriptors (the limit defined by FD_SETSIZE), libsharp could not use any additional file descriptors. Using poll() instead of select() allows the full range of file descriptors permitted by Linux to be used.

Keywords: File descriptor; libsharp; HCOLL; HPC-X

Discovered in Release: 3.0.0

Fixed in Release: 3.1.0
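
The difference between select() and poll() described in the entry above can be illustrated with a minimal Python sketch. This is not libsharp code; the helper name wait_readable is hypothetical and exists only for illustration. select() tracks descriptors in a fixed-size set that is bounded by FD_SETSIZE, whereas poll() receives the descriptors to watch as a list of values, so a descriptor numbered above that limit can still be monitored.

```python
import os
import select


def wait_readable(fds, timeout_ms=1000):
    """Return the descriptors from fds that become readable within timeout_ms.

    Hypothetical helper for illustration only; it is not part of libsharp.
    """
    poller = select.poll()
    for fd in fds:
        # POLLIN: report the descriptor once data is available to read.
        poller.register(fd, select.POLLIN)
    # poll() is given descriptors by value, so FD_SETSIZE does not apply.
    return [fd for fd, events in poller.poll(timeout_ms) if events & select.POLLIN]


if __name__ == "__main__":
    read_fd, write_fd = os.pipe()
    os.write(write_fd, b"x")
    # The read end now has one byte pending, so it is reported as readable.
    print(wait_readable([read_fd]))
    os.close(read_fd)
    os.close(write_fd)
```

The sketch relies on select.poll(), which is available on Linux but not on every platform.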

3192770

Description: Fixed the issue where SHARP jobs failed when using virtual interfaces configured with SR-IOV.

Keywords: SR-IOV

Discovered in Release: 3.0.0

Fixed in Release: 3.0.1

3163697

Description: Fixed the issue where, when the client application used more than 1024 file descriptors (the limit defined by FD_SETSIZE), libsharp could not use any additional file descriptors. Using poll() instead of select() allows the full range of file descriptors permitted by Linux to be used.

Keywords: File descriptor; libsharp; HCOLL

Discovered in Release: 3.0.0

Fixed in Release: 3.0.1

2995739

Description: The sharp_am daemon is no longer removed when performing an rpm upgrade; it is overwritten instead.

Keywords: Aggregation Manager; rpm

Discovered in Release: 2.6.1

Fixed in Release: 2.7.0

2972970

Description: Fixed the issue where completing the SHARP installation with the sharp_daemons_setup.sh script depended on Python being available.

Keywords: Aggregation Manager

Discovered in Release: 2.6.1

Fixed in Release: 2.7.0

2749073

Description: SHARP AM reports the rediscovery of aggregation nodes on every topology change.

Keywords: Aggregation Manager

Workaround: N/A

Discovered in Release: 2.5.0

2736102

Description: SHARP AM and SHARPD overwrite backlog files after restart when log rotation is enabled.

Keywords: Aggregation Manager, SHARPD, log file

Workaround: N/A

Discovered in Release: 2.5.0

2700530

Description: Terminating a job process during job initialization, before a job request is sent to the Aggregation Manager, might result in job resource leakage in the SHARP Aggregation Manager.

Workaround: N/A

Keywords: SHARPD, Aggregation Manager

Discovered in Release: 2.5.0

2726821

Description: Terminating SHARPD while the job process is still running results in job resource leakage in the SHARP Aggregation Manager.

Workaround: Terminate SHARPD after terminating the job processes.

Keywords: SHARPD, Aggregation Manager

2795902

Description: SHARPD might allocate handlers on GPU when running with UCX.

Keywords: SHARPD, SMX, UCX

Discovered in Release: 2.5.0

Workaround: Disable UCX

2770210

Description: Syslog verbosity depends on log file verbosity.

Keywords: SHARPD, Aggregation Manager

Discovered in Release: 2.5.0

Workaround: None

2825519

Description: Aggregation Manager continues to run after SM failover.

Keywords: Aggregation Manager

Discovered in Release: 2.5.0

Workaround: Stop the AM daemon manually.

2754175

Description: SHARP Aggregation Manager might allocate bad links for jobs after receiving timeouts from Aggregation Nodes.

Workaround: Restart the corresponding switch or restart the SHARP Aggregation Manager.

Keywords: Aggregation Manager

Discovered in Release: 2.5.0

2796317

Description: SHARP jobs may hang when running in reservations mode (i.e., SHARP allocation is enabled), the reservation is created with a limited PKEY, and configuring the reservation PKEY on the tree is enabled.

Workaround: The PKEY used for creating the reservation should be "full", i.e., its most significant bit should be set (e.g., 0x805c instead of 0x5a).

Keywords: Aggregation Manager, Reservations, PKEY, UFM

Discovered in Release: 2.5.0
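
The "full" versus "limited" distinction in the workaround above comes down to the most significant bit of the 16-bit PKEY value. The following is a minimal sketch of that check; the constant and function names are hypothetical and are not part of any SHARP or UFM interface.

```python
FULL_MEMBERSHIP_BIT = 0x8000  # most significant bit of a 16-bit PKEY


def is_full_member(pkey):
    """Return True if the PKEY has the full-membership bit set."""
    return bool(pkey & FULL_MEMBERSHIP_BIT)


def as_full_member(pkey):
    """Return the same PKEY with the full-membership bit set."""
    return (pkey | FULL_MEMBERSHIP_BIT) & 0xFFFF


if __name__ == "__main__":
    print(is_full_member(0x805C))       # True: the top bit is set
    print(is_full_member(0x5A))         # False: limited membership
    print(hex(as_full_member(0x5A)))    # 0x805a
```

A check like this could be applied to the PKEY before a reservation is created, so that a limited value is caught early.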

© Copyright 2024, NVIDIA. Last updated on May 6, 2024.