NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.12.0

Bug Fixes in this Version

Internal Reference Number

Issue

4507679

Description: Fixed an issue where SHARP_am ignored SharpError traps because of missing syndrome information, which caused a flood of repeated traps. SHARP_am now handles these traps correctly and suppresses redundant resends.

Keywords: SHARP_am; traps

Discovered in Version: 3.11.0

Fixed in Release: 3.12.0
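
The note above does not describe how SHARP_am implements this fix. As a generic illustration of the behavior it describes (handle a trap even when its syndrome is missing, and stop repeats from being processed again), the sketch below acknowledges every incoming trap and handles each distinct trap only once. All names here (`Trap`, `TrapHandler`, `on_trap`) are hypothetical and are not part of SHARP.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

TrapKey = Tuple[str, int, Optional[int]]


@dataclass(frozen=True)
class Trap:
    """Hypothetical trap record; `syndrome` may be absent."""
    source: str
    trap_id: int
    syndrome: Optional[int] = None


class TrapHandler:
    """Illustrative handler, not the SHARP_am implementation."""

    def __init__(self) -> None:
        self.seen: Set[TrapKey] = set()
        self.handled: List[Trap] = []
        self.acks: List[TrapKey] = []

    def on_trap(self, trap: Trap) -> bool:
        """Return True if the trap was handled, False if it was a repeat."""
        key = (trap.source, trap.trap_id, trap.syndrome)
        # Acknowledge every trap, including repeats and traps without a
        # syndrome, so that the sender has no reason to keep resending.
        self.acks.append(key)
        if key in self.seen:
            return False  # repeat: acknowledged but not handled again
        self.seen.add(key)
        self.handled.append(trap)
        return True
```

In this sketch a trap with no syndrome is still recorded and acknowledged, so a sender that repeats it three times produces three acknowledgements but only one handled event.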

4507678

Description: Fixed an issue where SHARP jobs failed to start due to improper handling of timeouts in QP Allocation and Confirmation MADs. The updated logic adds retries with shorter timeouts and re-attempts the full QP allocation sequence, which makes job start more resilient and reduces the risk of failure from network timeouts.

Keywords: libsharp; timeouts

Discovered in Version: 3.11.0

Fixed in Release: 3.12.0
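
The retry pattern described in this entry (several short-timeout attempts per request, then a restart of the whole sequence) can be sketched as follows. This is a minimal illustration under stated assumptions, not the libsharp code: `MadTimeout`, `send_with_retries`, `allocate_qps`, and all default values are hypothetical, and each step is modeled as a callable that takes a timeout in seconds.

```python
from typing import Callable, List, Sequence

Step = Callable[[float], str]


class MadTimeout(Exception):
    """Hypothetical error raised when a request gets no reply in time."""


def send_with_retries(send: Step, timeout_s: float, retries: int) -> str:
    """Try one request up to `retries` times with a short timeout."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return send(timeout_s)
        except MadTimeout:
            if attempt == retries - 1:
                raise  # out of retries for this step
    raise AssertionError("unreachable")


def allocate_qps(
    steps: Sequence[Step],
    timeout_s: float = 0.5,        # assumed value, for illustration only
    retries_per_step: int = 3,     # assumed value
    sequence_attempts: int = 2,    # assumed value
) -> List[str]:
    """Run all steps in order; on failure, restart the full sequence."""
    if sequence_attempts < 1:
        raise ValueError("sequence_attempts must be at least 1")
    for attempt in range(sequence_attempts):
        try:
            return [
                send_with_retries(step, timeout_s, retries_per_step)
                for step in steps
            ]
        except MadTimeout:
            if attempt == sequence_attempts - 1:
                raise  # every full-sequence attempt failed
    raise AssertionError("unreachable")
```

The design point is that a single lost reply costs only one short timeout, while a step that keeps failing triggers a fresh pass through the whole sequence instead of failing the job immediately.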

© Copyright 2025, NVIDIA. Last updated on Aug 25, 2025.