NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.2.0 Release Notes

Release Notes Change History

Changes and New Features History

Feature/Change	Description
Rev 3.1.1 LTS
SHARP Cleanup	Added the ability to clean up all SHARP-related definitions either to spare resources or to contribute to the recovery from an error.
General	Updated MLNX_OFED and firmware versions in General Information section.
Bug Fixes	See Bug Fixes section.
Rev 3.1.0
Aggregation Manager (AM)	Added support for dynamic creation of trees instead of static allocation when SHARP is initialized.
Rev 3.0.1
Bug Fixes	See Bug Fixes section.
Rev 3.0.0
General	Added support for executing multiple jobs that aggregate data through the same set of switches, while each job utilizes a different set of links.
	SHARP logic is now application-aware with UFM capabilities. SHARP jobs can be assigned an App-ID, which can be used as a reference to the customer application performing these jobs. For further information, please refer to UFM SLURM Integration Appendix in UFM UM.
	Added the option to limit the SHARP resources that applications are allowed to consume. For further information, please refer to UFM SLURM Integration Appendix in UFM UM.
AM	Modified the default resources provided to LLT and SAT jobs. This allows a larger number of SAT jobs to run in parallel with a few LLT jobs (see the first three Rev 3.0.0 entries in the Parameters Change History table below).
libsharp	SHARP jobs are now executed in exclusive lock mode by default (please see SHARP_COLL_JOB_REQ_EXCLUSIVE_LOCK_MODE in the table below).
Rev 2.7.0
Switches	Added support for NVIDIA Quantum-2 switches with NDR speed
Adapter Cards	Added support for NVIDIA ConnectX-7 adapter card with 400 Gb/s speed
SHARPD	sharpd daemon process has been removed. sharpd-related activity is now performed from the user application process
AM	Upon restart of AM, it no longer needs to wait for all concurrent jobs to finish before being able to accept new jobs
AM	Added a mechanism that periodically checks for errors in Aggregation Trees and attempts to fix them
General	Added support for new data types BFLOAT16, INT8, and UINT8 for performing reduction operations
Rev 2.6.1
General	Added support for running libsharp_coll from SHARP 2.6.1 with SHARPD from SHARP 2.4.0 – 2.6.1
General	Added information about updatable configuration parameters in the configuration file and help menu
Network	Added support for keep-alive on connections to SHARPD
Network	Added support for asynchronous connections
Network	Disabled UCX listener as default in SHARP Aggregation Manager
AM	Added support for the non-default subnet prefix
AM	Added support for DF+ topologies with more than two-level islands
SHARPD	Added support for caching AM address
Rev 2.5.0
Resource Management	Added support for exclusive lock requests for streaming aggregation jobs.
Network	Enabled connection keep-alive between SHARPD and Aggregation Manager.
Rev 2.4.3
General	Added support for identifying Aggregation Nodes based on SMDB.
General	Improved minhop tables calculation.
General	Added a new API for querying events.
Rev 2.1.4
sharp_am/sharpd/libsharp_coll: Streaming Aggregation	Added support for Streaming Aggregation over ConnectX-6 adapter card and Quantum switch.
libsharp_coll: GPU Accelerator	Added support for NVIDIA GPU buffers.
sharp_am: OOB	Added support for identifying the topology type from the OpenSM SMDB file.
sharp_am: Reboot	Fixed an issue where recovery failed after reboot of all switches in the cluster.
Rev 2.0.0
sharp_am/sharpd/libsharp_coll	Added support for the following NVIDIA Quantum switch capabilities: data operations on new data types (unsigned short, short, and short floating point), and 1K OST payload.
sharp_am/sharpd: Resource Management	Added support for enabling and disabling reproducibility on the job level.
sharp_am/sharpd: Subnet Management	Added support for controlling the SA key for SA operations.
libsharp_coll: GPUDirect	Added support for CUDA GPUDirect and GPUDirect RDMA.
Rev 1.8.1
Aggregation Manager (sharp_am): Resiliency	Added support for waiting for jobs to end prior to performing fabric reinitialization on AM startup.
Mellanox SHARP Daemon (sharpd): Out-of-Box Improvements	Socket-based activation is now enabled by default when installed from RPM/MLNX_OFED.
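
The Rev 3.0.0 libsharp entry above changes the default of SHARP_COLL_JOB_REQ_EXCLUSIVE_LOCK_MODE, so a job that should keep the earlier non-exclusive behavior has to set the variable explicitly. The sketch below shows one way a launcher script might do that, assuming libsharp reads the parameter from the process environment. The helper name with_lock_mode is hypothetical and not part of SHARP. Only the values 0 (no exclusive lock) and 2 (force exclusive lock) are described in this document.

```python
import os
from typing import Dict, Mapping, Optional

# Parameter name taken from the Parameters Change History table.
LOCK_MODE_VAR = "SHARP_COLL_JOB_REQ_EXCLUSIVE_LOCK_MODE"


def with_lock_mode(mode: int,
                   base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of an environment with the lock mode set.

    Hypothetical helper: `mode` is written as-is; this document only
    describes 0 (no exclusive lock) and 2 (force exclusive lock).
    """
    # Copy so that neither os.environ nor the caller's mapping is modified.
    env = dict(os.environ if base_env is None else base_env)
    env[LOCK_MODE_VAR] = str(mode)
    return env


if __name__ == "__main__":
    # Restore the pre-3.0.0 default (no exclusive lock) for one job.
    env = with_lock_mode(0)
    print(LOCK_MODE_VAR, "=", env[LOCK_MODE_VAR])
```

The returned mapping can be passed as the env argument of subprocess.run when starting the application.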

Parameters Change History

Parameter	Component	Description
Rev 3.1.1 LTS
clean_and_exit	sharp_am	New parameter: A boolean parameter. When set to TRUE, sharp_am does not operate normally, but instead cleans SHARP resources from all switches and exits. Default: False - Operate normally.
Rev 3.1.0
dynamic_tree_allocation	sharp_am	New parameter: A boolean parameter that controls whether trees are allocated dynamically for each SHARP job or allocated during sharp_am initialization. Default: False
max_trees_to_build	sharp_am	Update: If dynamic_tree_allocation is set to True, this parameter has no effect on the number of trees allocated; sharp_am determines that value from the number of trees the switches can hold. In dynamic trees mode, however, this parameter does affect the number of skeleton trees that sharp_am uses, and the recommended minimum is the number of root switches in the fabric. If dynamic_tree_allocation is set to False, this parameter controls the number of trees that sharp_am builds. Default:
SHARP_COLL_IB_TIMEOUT	libsharp	New parameter: Transport timeout on SHARP QP Default: 18
SHARP_COLL_IB_RETRY_COUNT	libsharp	New parameter: Transport retries on SHARP QP Default: 7
SHARP_COLL_IB_RNR_TIMER	libsharp	New parameter: RNR timeout on SHARP QP Default: 12
SHARP_COLL_IB_RNR_RETRY	libsharp	New parameter: RNR retries on SHARP QP Default: 7
SHARP_COLL_IB_SL	libsharp	New parameter: SL Default: 0
SHARP_COLL_ENABLE_MCAST_TARGET	libsharp	Update: Modified the default value from True to False. Default: False
Rev 3.0.0
per_prio_default_quota	sharp_am	Update: This parameter controls only the default percentage provided to LLT jobs. Its default value is modified from 3 to 20
per_prio_default_sat_quota	sharp_am	New parameter: Default percentage of quota (OSTs, Buffers and Groups) per aggregation node per tree, to be requested for a single SAT job by its priority. If no explicit quota request is submitted, this parameter will set the quota percentage to be used. Format: prio_0_quota, [prio_1_quota, ..., prio_9_quota] Note that if only one value is set, it will be applied to all priorities. Default: 3
sat_jobs_default_absolute_osts	sharp_am	New parameter: Default number of OSTs to be allocated for SAT jobs per aggregation node per tree. Zero value means that no absolute value should be used, and the default percentage value is used instead. Note that the number of OSTs also affects the number of groups. Default: 0
app_resources_default_limit	sharp_am	New parameter: A numerical parameter, applicable only when reservation_mode is set to true. Sets the default max number of trees allowed to be used in parallel by a single app. This default value can be overridden per app upon reservation request. A value of 0 means no allowed resources, which means an app cannot execute any sharp job. Default: 1
force_app_id_match	sharp_am	New parameter: A boolean parameter, applicable only when reservation_mode is set to true. When set to true, an application ID must be provided upon job request, and it must match the application ID provided upon reservation request. Otherwise, the job will be denied. Default: False
SHARP_COLL_JOB_REQ_EXCLUSIVE_LOCK_MODE	libsharp	Update: Changed default value from 0 (no exclusive lock) to 2 (force exclusive lock)
Rev 2.7.0
recovery_retry_interval	sharp_am	New parameter: A timeout in seconds for trees recovery retries. A value of 0 means do not try to recover trees. Default: 300
enable_seamless_restart	sharp_am	New parameter: A boolean flag. If enabled, AM tries to recover state from last AM run and continue the operation of the current jobs. Default: True
seamless_restart_trees_file	sharp_am	New parameter: Sets the SHARP trees file used in seamless restart. Specify only the file name; the full path is constructed using dump_dir. Default: sharp_am_trees_structure.dump
seamless_restart_max_retries	sharp_am	New parameter: Sets the number of consecutive retries of seamless restart. If seamless restart fails more than this number of times in a row, it will be disabled in the next run. Default: 3
max_tree_radix	sharp_am	Update: Changed default to 252
ib_sat_max_mtu	sharp_am	Update: Changed default to 5, to support the MAD value that represents 4K MTU.
per_prio_default_quota	sharp_am	Update: Changed default to 3 instead of 20, enabling more SAT jobs to take place in parallel on each switch.
Rev 2.6.1
dump_dir	sharp_am	Update: Changed default to /var/log
smx_enabled_protocols	sharp_am	Update: Changed default from 7 to 6 (disable UCX by default)
ib_mad_timeout	sharp_am	Update: Changed default from 200 to 500
sr_mad_timeout	sharpd	New parameter: Control timeout for ServiceRecord queries Default: 10000 milliseconds
sr_mad_retries	sharpd	New parameter: Control number of retries for ServiceRecord queries Default: 3 retries
Rev 2.5.0
smx_keepalive_interval	sharp_am/sharpd	New parameter: Keep-alive interval in seconds; 0 disables keep-alive. Default: 60 seconds
smx_incoming_conn_keepalive_interval	sharp_am	New parameter: Keep-alive interval for incoming connections; 0 disables keep-alive. Default: 300 seconds
enable_exclusive_lock	sharp_am	New parameter: Enable/Disable exclusive lock feature. Default: True
enable_topology_api	sharp_am	New parameter: Enable/Disable Topology API feature. Default: True
max_trees_to_build	sharp_am	New parameter: Control number of trees for AM to build Default: 126
Rev 2.4.3
ib_max_mads_on_wire	sharp_am	Modified behavior: Changed default from 100 to 4096
ib_qpc_local_ack_timeout	sharp_am	Modified behavior: Changed default from 0x1F to 0x12
ib_sat_qpc_local_ack_timeout	sharp_am	Modified behavior: Changed default from 0x1F to 0x12
ib_qpc_timeout_retry_limit	sharp_am	Modified behavior: Changed default from 7 to 6
ib_sat_qpc_timeout_retry_limit	sharp_am	Modified behavior: Changed default from 7 to 6
Rev 2.0.0
control_path_version	sharp_am	New parameter Default
max_compute_ports_per_agg_node	sharp_am	Modified behavior: When set to 0, AN radix is set to maximal radix value. Default: 0
default_reproducibility	sharp_am	New parameter: Control default reproducibility mode for jobs. Default: TRUE
ib_sa_key	sharp_am	New parameter: Control SA key for SA operations. Default: 0x1
coll_job_quota_max_payload_per_ost	sharp_job_quota	Modified behavior: Change default value to 1024.
SHARP_COLL_MAX_PAYLOAD_SIZE	libsharp_coll	Removed
SHARP_COLL_NUM_SHARP_COLL_REQ	libsharp_coll	Removed
SHARP_COLL_ENABLE_REPRODUCIBLE_MODE	libsharp_coll	New parameter: Control job reproducibility mode: 0 – Use default. 1 – No reproducibility. 2 – Reproducibility.
SHARP_COLL_ENABLE_CUDA	libsharp_coll	New parameter: Enables CUDA GPUDirect.
SHARP_COLL_ENABLE_GPU_DIRECT_RDMA	libsharp_coll	New parameter: Enables GPUDirect RDMA.
Rev 1.8.1
pending_mode_timeout	sharp_am	New parameter: Defines AM waiting time for jobs to complete prior to fabric re-initialization upon startup.
job_info_polling_interval	sharp_am	New parameter: Defines job status polling interval when waiting for jobs to complete upon startup.
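
The format given for per_prio_default_sat_quota in the Rev 3.0.0 entries (a single value applied to all priorities, or one value per priority from prio_0 to prio_9) can be illustrated with a small parser. This is a minimal sketch, not sharp_am's actual parsing code. The function name parse_per_prio_quota is hypothetical, and two of its rules are assumptions the table does not state: rejecting lists of any other length, and rejecting values outside 0 to 100.

```python
from typing import List

# prio_0 .. prio_9, per the format string in the table above.
NUM_PRIORITIES = 10


def parse_per_prio_quota(value: str) -> List[int]:
    """Expand a per-priority quota string into one percentage per priority.

    Hypothetical helper illustrating the documented format:
      "3"                  -> the same quota for all ten priorities
      "q0, q1, ..., q9"    -> one quota per priority
    """
    parts = [p.strip() for p in value.split(",")]
    if any(not p for p in parts):
        raise ValueError("empty quota entry in %r" % value)

    quotas = [int(p) for p in parts]

    # Assumption: quotas are percentages, so they must lie in 0..100.
    if any(q < 0 or q > 100 for q in quotas):
        raise ValueError("quota out of range 0..100 in %r" % value)

    # Documented rule: a single value is applied to all priorities.
    if len(quotas) == 1:
        return quotas * NUM_PRIORITIES

    # Assumption: anything other than 1 or 10 values is rejected.
    if len(quotas) != NUM_PRIORITIES:
        raise ValueError(
            "expected 1 or %d values, got %d" % (NUM_PRIORITIES, len(quotas))
        )
    return quotas


if __name__ == "__main__":
    print(parse_per_prio_quota("3"))
    print(parse_per_prio_quota("20, 10, 5, 5, 5, 5, 5, 5, 5, 5"))
```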
