NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.5.0

Bug Fixes History

The following table provides a list of bugs fixed in this SHARP version.

Internal Ref.

Issue

3400293

Description: Fixed an issue in libsharp where it failed to respond to messages from the SM while searching for Service Records, causing the SM to print timeout messages.

Keywords: sharp_am; openSM

Discovered in Version: 3.1.0

Fixed in Release: 3.4.0

3479721

Description: Fixed the issue where sharp_am did not handle hypercube topologies correctly, causing it to treat different switches as duplicates.

Keywords: sharp_am; hypercube

Discovered in Version: 3.3.0

Fixed in Release: 3.4.0

3496440

Description: Fixed the issue in sharp_am where excessive log messages were printed for each disconnected or restarted compute host. The information is now consolidated into a single log message containing either a summary of the disconnected hosts or a list of those hosts.

The complete list of hosts is still available and is printed at the DEBUG level.

Keywords: sharp_am

Discovered in Version: 3.3.0

Fixed in Release: 3.4.0
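
The consolidated logging behavior described for issue 3496440 can be illustrated with a minimal Python sketch. This is not sharp_am code: the logger name, the function `report_disconnected`, the `max_listed` parameter, and the message wording are all assumptions made for illustration. It only shows the general pattern of one summary message at a normal level plus the full list at DEBUG.

```python
import logging

# Hypothetical logger name; sharp_am's real logging setup is not shown in these notes.
logger = logging.getLogger("am_sketch")


def report_disconnected(hosts, max_listed=5):
    """Emit one summary message for all disconnected hosts.

    The full host list is logged only at DEBUG level. Returns the summary
    string, or None if there is nothing to report.
    """
    hosts = sorted(hosts)
    if not hosts:
        return None
    shown = ", ".join(hosts[:max_listed])
    extra = len(hosts) - max_listed
    suffix = f" (+{extra} more)" if extra > 0 else ""
    summary = f"{len(hosts)} compute host(s) disconnected: {shown}{suffix}"
    logger.warning(summary)
    # Complete list, visible only when DEBUG logging is enabled.
    logger.debug("Disconnected hosts (full list): %s", ", ".join(hosts))
    return summary
```

With eight hosts and `max_listed=3`, the sketch emits a single warning naming the first three hosts and counting the remaining five, instead of eight separate messages.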

3336788

Description: Fixed the issue in Firmware where MAD error responses might have been received in libsharp.

Keywords: sharp_am; libsharp

Discovered in Version: 3.2.0

Fixed in Release: 3.3.0 (Quantum-2 Firmware 31.2010.6064)

3343503

Description: Fixed the issue where sharp_am installed from MLNX_OFED used an invalid range of job IDs, resulting in occasional errors when trying to establish new SHARP jobs.

Keywords: MLNX_OFED; sharp_am

Discovered in Version: 3.2.0

Fixed in Release: 3.3.0

3368381

Description: Fixed the issue where failed libsharp GroupJoin MADs were not resent enough times, causing SHARP jobs to fail before they started.

Keywords: libsharp; MADs

Discovered in Release: 3.0.0

Fixed in Release: 3.3.0
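
As a generic illustration of the bounded-retry pattern that issue 3368381 refers to, the following Python sketch resends a failed request a fixed number of times. It is not libsharp's implementation: `send_with_retries`, the `send` callable, `max_attempts`, and `delay_s` are hypothetical names, and the actual retry counts and timing used for GroupJoin MADs are not stated in these notes.

```python
import time


def send_with_retries(send, max_attempts=5, delay_s=0.1):
    """Call send() until it succeeds or max_attempts is reached.

    `send` is a hypothetical callable that returns True on success.
    Returns the attempt number that succeeded, or None if all attempts failed.
    """
    for attempt in range(1, max_attempts + 1):
        if send():
            return attempt
        # Pause before the next attempt, but not after the final one.
        if attempt < max_attempts:
            time.sleep(delay_s)
    return None
```

A caller would treat a `None` result as a hard failure; too small a `max_attempts` makes transient errors look like permanent ones.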

3393902

Description: Fixed the issue where re-created virtual ports were not recognized by sharp_am, so the correct tree was not built for them. This resulted in SAT jobs encountering ibv_poll_cq failures in libsharp.

Keywords: Virtual port; sharp_am; libsharp; SAT; ibv_poll_cq

Discovered in Version: 3.2.0

Fixed in Release: 3.3.0

3404474

Description: Fixed an issue where a failed application allocation of all hosts, requested via the /app/sharp/resources REST API, returned a successful job instead of an error.

Keywords: REST API; allocation

Discovered in Release: 3.2.0

Fixed in Release: 3.3.0

3406186

Description: Fixed an issue where SHARP AM failed to handle reports from OpenSM if some switch ports were down or isolated.

Keywords: Aggregation Manager; Aggregation Node; OpenSM

Discovered in Release: 3.2.0

Fixed in Release: 3.3.0

3236363

Description: Fixed the way physical link failures between switches are handled. In the event of a link failure, a SHARP job utilizing the link has to be stopped; however, this has no effect on other current or future jobs.

Keywords: Aggregation Manager; sharp_am; Link Failure

Discovered in Release: 3.1.0

Fixed in Release: 3.2.0

3230585

Description: Fixed the issue where, when operating in Dynamic trees mode, ibdiagnet might have printed warning messages about the existence of multiple distinct trees with the same tree ID.

Keywords: Dynamic tree; ibdiagnet

Discovered in Version: 3.1.0

Fixed in Release: 3.2.0

3226743

Description: Fixed the issue where, when a management host was not connected to a leaf switch, sharp_am might have printed a number of warning messages about trees that could not reach all aggregation nodes.

As of SHARP v3.2.0, the active management host is automatically identified and is not treated as a potential compute host.

Note, however, that this does not include standby management hosts, for which a warning message still appears. These management hosts can be listed among the GUIDs to ignore via the ignore_host_guids_file parameter.

Keywords: Aggregation Manager; sharp_am; leaf; GUID

Discovered in Release: 3.0.1

Fixed in Release: 3.2.0

3274564

Description: Fixed an issue where the sharp_benchmark bash script did not work on some bash versions.

Keywords: sharp_benchmark

Discovered in Release: 3.1.1

Fixed in Release: 3.2.0

3262936

Description: Fixed the issue where a crash took place during sharp_am reboot while physical links were hanging between switches in the fabric.

Keywords: sharp_am; physical links; crash

Discovered in Release: 3.1.0

Fixed in Release: 3.1.1 LTS

3192770

Description: Fixed the issue where SHARP jobs failed when using virtual interfaces configured with SR-IOV.

Keywords: SR-IOV

Discovered in Release: 3.0.0

Fixed in Release: 3.1.0

3163697

Description: Fixed the issue where libsharp could not use any more file descriptors once the client application used more than 1024 of them (the range limit defined by FD_SETSIZE). Using poll() instead of select() allows the full range of file descriptors permitted by Linux to be used.

Keywords: File descriptor; libsharp; HCOLL; HPC-X

Discovered in Release: 3.0.0

Fixed in Release: 3.1.0
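
The select() versus poll() distinction behind issue 3163697 can be sketched in Python, whose `select` module wraps the same system calls. select() works on fixed-size descriptor sets and cannot track descriptor numbers at or above FD_SETSIZE, while poll() takes an explicit list of descriptors and has no such ceiling. The helper name `wait_readable` and its parameters are assumptions for illustration; this is not libsharp code, and it assumes a platform where `select.poll` is available, such as Linux.

```python
import select


def wait_readable(fds, timeout_ms=1000):
    """Return the descriptors in `fds` that are ready for reading.

    Uses poll(), which accepts arbitrary descriptor numbers, rather than
    select(), which is limited to descriptors below FD_SETSIZE.
    """
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    # poll() returns (fd, event_mask) pairs for descriptors with pending events.
    return [fd for fd, events in poller.poll(timeout_ms) if events & select.POLLIN]
```

For example, after `r, w = os.pipe()` and `os.write(w, b"x")`, calling `wait_readable([r])` reports `r` as readable regardless of how large the descriptor number is.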

3192770

Description: Fixed the issue where SHARP jobs failed when using virtual interfaces configured with SR-IOV.

Keywords: SR-IOV

Discovered in Release: 3.0.0

Fixed in Release: 3.0.1

3163697

Description: Fixed the issue where libsharp could not use any more file descriptors once the client application used more than 1024 of them (the range limit defined by FD_SETSIZE). Using poll() instead of select() allows the full range of file descriptors permitted by Linux to be used.

Keywords: File descriptor; libsharp; HCOLL

Discovered in Release: 3.0.0

Fixed in Release: 3.0.1

2995739

Description: The sharp_am daemon is no longer removed when performing an rpm upgrade; it is overridden instead.

Keywords: Aggregation Manager; rpm

Discovered in Release: 2.6.1

Fixed in Release: 2.7.0

2972970

Description: Fixed the issue where completing the SHARP installation using the sharp_daemons_setup.sh script depended on Python being available.

Keywords: Aggregation Manager

Discovered in Release: 2.6.1

Fixed in Release: 2.7.0

2749073

Description: SHARP AM reports the rediscovery of aggregation nodes on every topology change.

Keywords: Aggregation Manager

Workaround: N/A

Discovered in Release: 2.5.0

2736102

Description: SHARP AM and SHARPD override backlog files after restart when log rotation is enabled.

Keywords: Aggregation Manager, SHARPD, log file

Workaround: N/A

Discovered in Release: 2.5.0

2700530

Description: Terminating a job process during job initialization, before a job request is sent to the Aggregation Manager, might result in job resource leakage in the SHARP Aggregation Manager.

Workaround: N/A

Keywords: SHARPD, Aggregation Manager

Discovered in Release: 2.5.0

2726821

Description: Terminating SHARPD while the job process is still running will result in job resource leakage in SHARP Aggregation Manager.

Workaround: Terminate SHARPD after terminating the job processes.

Keywords: SHARPD, Aggregation Manager

2795902

Description: SHARPD might allocate handlers on GPU when running with UCX.

Keywords: SHARPD, SMX, UCX

Discovered in Release: 2.5.0

Workaround: Disable UCX

2770210

Description: Syslog verbosity depends on log file verbosity.

Keywords: SHARPD, Aggregation Manager

Discovered in Release: 2.5.0

Workaround: None

2825519

Description: Aggregation Manager continues to run after SM failover.

Keywords: Aggregation Manager

Discovered in Release: 2.5.0

Workaround: Stop AM daemon manually

2754175

Description: SHARP Aggregation Manager might allocate bad links for jobs after receiving timeouts from Aggregation Nodes.

Workaround: Restart corresponding switch or restart SHARP Aggregation Manager.

Keywords: Aggregation Manager

Discovered in Release: 2.5.0

2796317

Description: SHARP jobs may hang when running in reservations mode (i.e., SHARP allocation is enabled), the reservation is created with a limited PKEY, and configuring the reservation PKEY on the tree is enabled.

Workaround: The PKEY used for creating the reservation should be "full", i.e., its most significant bit should be set (e.g., 0x805c instead of 0x5a).

Keywords: Aggregation Manager, Reservations, PKEY, UFM

Discovered in Release: 2.5.0
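
The "full" PKEY requirement in the workaround for issue 2796317 comes down to one bit: an InfiniBand PKEY is a 16-bit value whose most significant bit marks full membership. The Python sketch below shows how that bit can be checked and set. The helper names `is_full_member` and `to_full_member` are hypothetical and are not part of any SHARP or UFM tool.

```python
FULL_MEMBERSHIP_BIT = 0x8000  # most significant bit of the 16-bit PKEY


def is_full_member(pkey):
    """Return True if the 16-bit PKEY has the full-membership bit set."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("PKEY must fit in 16 bits")
    return bool(pkey & FULL_MEMBERSHIP_BIT)


def to_full_member(pkey):
    """Return the same PKEY with the full-membership bit set."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("PKEY must fit in 16 bits")
    return pkey | FULL_MEMBERSHIP_BIT
```

Here `is_full_member(0x805c)` is true and `is_full_member(0x5a)` is false; `to_full_member(0x005c)` yields `0x805c`.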

© Copyright 2023, NVIDIA. Last updated on Nov 16, 2023.