Release Notes Change History
Feature/Change | Description |
Rev 3.0.1 | |
Bug Fixes | See Bug Fixes section. |
Rev 3.0.0 | |
General | Added support for executing multiple jobs that aggregate data through the same set of switches, while each job utilizes a different set of links. |
SHARP logic is now application-aware with UFM capabilities. SHARP jobs can be assigned an App-ID, which can be used as a reference to the customer application performing these jobs. For further information, please refer to UFM SLURM Integration Appendix in UFM UM. | |
Added the option to limit the SHARP resources that applications are allowed to consume. For further information, please refer to UFM SLURM Integration Appendix in UFM UM. | |
AM | Modified the default resources provided to LLT & SAT jobs. This enables operation of a larger number of SAT jobs in parallel with a few LLT jobs (please see the first three entries in the table below). |
libsharp | SHARP jobs are now executed in exclusive lock mode by default (please see SHARP_COLL_JOB_REQ_EXCLUSIVE_LOCK_MODE in the table below). |
Rev 2.7.0 | |
Switches | Added support for NVIDIA Quantum-2 switches with NDR speed |
Adapter Cards | Added support for NVIDIA ConnectX-7 adapter card with 400 Gb/s speed |
SHARPD | The sharpd daemon process has been removed. sharpd-related activity is now performed from the user application process. |
AM | Upon restart, AM no longer needs to wait for all concurrent jobs to finish before it can accept new jobs. |
Added a mechanism that periodically checks for errors in Aggregation Trees and attempts to fix them | |
General | Added support for new data types BFLOAT16, INT8 and UINT8 for performing reduction operations |
Rev 2.6.1 | |
General | Added support for running libsharp_coll from SHARP 2.6.1 with SHARPD from SHARP 2.4.0 – 2.6.1 |
General | Added information about updatable configuration parameters in the configuration file and help menu |
Network | Added support for keep-alive on connections to SHARPD |
Network | Added support for asynchronous connections |
Network | Disabled UCX listener as default in SHARP Aggregation Manager |
AM | Added support for the non-default subnet prefix |
AM | Added support for DF+ topologies with more than two-level islands |
SHARPD | Added support for caching AM address |
Rev 2.5.0 | |
Resource Management | Added support for exclusive lock requests for streaming aggregation jobs. |
Network | Enabled connection keep-alive between SHARPD and Aggregation Manager. |
Rev 2.4.3 | |
General | Added support for identifying Aggregation Nodes based on SMDB. |
General | Improved minhop tables calculation. |
General | Added a new API for querying events. |
Rev 2.1.4 | |
sharp_am/sharpd/libsharp_coll: Streaming Aggregation | Added support for Streaming Aggregation over ConnectX-6 adapter card and Quantum switch. |
libsharp_coll: GPU Accelerator | Added support for NVIDIA GPU buffers. |
sharp_am: OOB | Added support for identifying the topology type from the OpenSM SMDB file. |
sharp_am: Reboot | Fixed an issue where recovery failed after reboot of all switches in the cluster. |
Rev 2.0.0 | |
sharp_am/sharpd/libsharp_coll | Added support for the following NVIDIA Quantum switch capabilities:
|
sharp_am/sharpd: Resource Management | Added support for enabling and disabling reproducibility on the job level. |
sharp_am/sharpd: Subnet Management | Added support for controlling the SA key for SA operations. |
libsharp_coll: GPUDirect | Added support for CUDA GPUDirect and GPUDirect RDMA. |
Rev 1.8.1 | |
Aggregation Manager (sharp_am): Resiliency | Added support for waiting for jobs to end prior to performing fabric reinitialization on AM startup. |
Mellanox SHARP Daemon (sharpd): Out-of-Box Improvements | Socket-based activation is now enabled by default when installed from RPM/MLNX_OFED. |
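The Rev 3.0.0 entry for libsharp changes the default of SHARP_COLL_JOB_REQ_EXCLUSIVE_LOCK_MODE. The following is a minimal sketch of pinning that value explicitly for a launched job, assuming libsharp parameters are read from the process environment. The helper name `sharp_job_env` and the constant names are hypothetical; only the values 0 and 2 are described in these notes, so the sketch rejects anything else.

```python
import os

# Values documented in these release notes for
# SHARP_COLL_JOB_REQ_EXCLUSIVE_LOCK_MODE. Other values are not described here.
NO_EXCLUSIVE_LOCK = 0     # default before Rev 3.0.0
FORCE_EXCLUSIVE_LOCK = 2  # default starting with Rev 3.0.0


def sharp_job_env(lock_mode, base_env=None):
    """Return a copy of the environment with the lock mode set explicitly.

    Hypothetical helper: assumes libsharp picks the parameter up from an
    environment variable of the same name.
    """
    if lock_mode not in (NO_EXCLUSIVE_LOCK, FORCE_EXCLUSIVE_LOCK):
        raise ValueError(f"lock mode {lock_mode!r} is not described in these notes")
    env = dict(os.environ if base_env is None else base_env)
    env["SHARP_COLL_JOB_REQ_EXCLUSIVE_LOCK_MODE"] = str(lock_mode)
    return env


# Example: keep the pre-3.0.0 behavior for one job.
env = sharp_job_env(NO_EXCLUSIVE_LOCK, base_env={"PATH": "/usr/bin"})
print(env["SHARP_COLL_JOB_REQ_EXCLUSIVE_LOCK_MODE"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the application, which leaves the parent environment untouched.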
Parameter | Component | Description |
Rev 3.0.0 | ||
per_prio_default_quota | sharp_am | Update: This parameter controls only the default percentage provided to LLT jobs. Its default value is modified from 3 to 20 |
per_prio_default_sat_quota | sharp_am | New parameter: Default percentage of quota (OSTs, Buffers and Groups) per aggregation node per tree, to be requested for a single SAT job by its priority. If no explicit quota request is submitted, this parameter will set the quota percentage to be used. Format: prio_0_quota, [prio_1_quota, ..., prio_9_quota] Note that if only one value is set, it will be applied to all priorities. Default: 3 |
sat_jobs_default_absolute_osts | sharp_am | New parameter: Default number of OSTs to be allocated for SAT jobs per aggregation node per tree. Zero value means that no absolute value should be used, and the default percentage value is used instead. Note that the number of OSTs also affects the number of groups. Default: 0 |
app_resources_default_limit | sharp_am | New parameter: A numerical parameter, applicable only when reservation_mode is set to true. Sets the default maximum number of trees that a single app may use in parallel. This default value can be overridden per app upon reservation request. A value of 0 means no resources are allowed, so the app cannot execute any SHARP job. Default: 1 |
force_app_id_match | sharp_am | New parameter: A boolean parameter, applicable only when reservation_mode is set to true. When set to true, an application ID must be provided upon job request, and it must match the application ID provided upon reservation request. Otherwise, the job will be denied. Default: False |
SHARP_COLL_JOB_REQ_EXCLUSIVE_LOCK_MODE | libsharp | Update: Changed default value from 0 (no exclusive lock) to 2 (force exclusive lock) |
Rev 2.7.0 | ||
recovery_retry_interval | sharp_am | New parameter: A timeout in seconds for trees recovery retries. A value of 0 means do not try to recover trees. Default: 300 |
enable_seamless_restart | sharp_am | New parameter: A boolean flag. If enabled, AM tries to recover state from last AM run and continue the operation of the current jobs. Default: True |
seamless_restart_trees_file | sharp_am | New parameter: Set the SHARP trees file used in seamless restart. Specify only the file name; the full path is constructed using ‘dump_dir’. Default: sharp_am_trees_structure.dump |
seamless_restart_max_retries | sharp_am | New parameter: Set the number of consecutive retries of seamless restart. If seamless restart fails more than this number of times in a row, it will be disabled in the next run. Default: 3 |
max_tree_radix | sharp_am | Update: Change default to 252 |
ib_sat_max_mtu | sharp_am | Update: Change default to 5, to support the MAD value that represents a 4K MTU. |
per_prio_default_quota | sharp_am | Update: Changed default to 3 instead of 20, enabling more SAT jobs to take place in parallel on each switch. |
Rev 2.6.1 | ||
dump_dir | sharp_am | Update: Changed default to /var/log |
smx_enabled_protocols | sharp_am | Update: Changed default from 7 to 6 (disable UCX by default) |
ib_mad_timeout | sharp_am | Update: Change default from 200 to 500 |
sr_mad_timeout | sharpd | New parameter: Control timeout for ServiceRecord queries. Default: 10000 milliseconds |
sr_mad_retries | sharpd | New parameter: Control number of retries for ServiceRecord queries. Default: 3 retries |
Rev 2.5.0 | ||
smx_keepalive_interval | sharp_am/sharpd | New parameter: Keep-alive interval in seconds; 0 disables keep-alive. Default: 60 seconds |
smx_incoming_conn_keepalive_interval | sharp_am | New parameter: Keep-alive interval for incoming connections; 0 disables keep-alive. Default: 300 seconds |
enable_exclusive_lock | sharp_am | New parameter: Enable/Disable exclusive lock feature. Default: True |
enable_topology_api | sharp_am | New parameter: Enable/Disable Topology API feature. Default: True |
max_trees_to_build | sharp_am | New parameter: Control number of trees for AM to build. Default: 126 |
Rev 2.4.3 | ||
ib_max_mads_on_wire | sharp_am | Modified behavior: Changed default from 100 to 4096 |
ib_qpc_local_ack_timeout | sharp_am | Modified behavior: Changed default from 0x1F to 0x12 |
ib_sat_qpc_local_ack_timeout | sharp_am | Modified behavior: Changed default from 0x1F to 0x12 |
ib_qpc_timeout_retry_limit | sharp_am | Modified behavior: Changed default from 7 to 6 |
ib_sat_qpc_timeout_retry_limit | sharp_am | Modified behavior: Changed default from 7 to 6 |
Rev 2.0.0 | ||
control_path_version | sharp_am | New parameter |
max_compute_ports_per_agg_node | sharp_am | Modified behavior: When set to 0, AN radix is set to maximal radix value. Default: 0 |
default_reproducibility | sharp_am | New parameter: Control default reproducibility mode for jobs. Default: TRUE |
ib_sa_key | sharp_am | New parameter: Control SA key for SA operations. Default: 0x1 |
coll_job_quota_max_payload_per_ost | sharp_job_quota | Modified behavior: Change default value to 1024. |
SHARP_COLL_MAX_PAYLOAD_SIZE | libsharp_coll | Removed |
SHARP_COLL_NUM_SHARP_COLL_REQ | libsharp_coll | Removed |
SHARP_COLL_ENABLE_REPRODUCIBLE_MODE | libsharp_coll | New parameter: Control job reproducibility mode: 0 – Use default. 1 – No reproducibility. 2 – Reproducibility. |
SHARP_COLL_ENABLE_CUDA | libsharp_coll | New parameter: Enables CUDA GPUDirect. |
SHARP_COLL_ENABLE_GPU_DIRECT_RDMA | libsharp_coll | New parameter: Enables GPUDirect RDMA. |
Rev 1.8.1 | ||
pending_mode_timeout | sharp_am | New parameter: Defines AM waiting time for jobs to complete prior to fabric re-initialization upon startup. |
job_info_polling_interval | sharp_am | New parameter: Defines job status polling interval when waiting for jobs to complete upon startup. |
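Several values in the parameter table above are encoded rather than literal. The sketch below illustrates three of them; all function names are hypothetical and none of this is taken from the SHARP sources. It assumes (1) the per-priority quota format expands a single value to all ten priorities, as stated for per_prio_default_sat_quota, and rejects partial lists, whose handling is not described in these notes; (2) the MTU value follows the standard InfiniBand MTU enumeration, where code 5 is 4096 bytes; and (3) the QPC local ACK timeout values follow the standard InfiniBand encoding of 4.096 µs × 2^value.

```python
def expand_prio_quota(value, num_priorities=10):
    """Expand 'prio_0_quota[, prio_1_quota, ..., prio_9_quota]' to a full list.

    A single value applies to all priorities (stated in the notes). Rejecting
    partial lists is an assumption of this sketch.
    """
    quotas = [int(p) for p in str(value).split(",") if p.strip()]
    if len(quotas) == 1:
        return quotas * num_priorities
    if len(quotas) != num_priorities:
        raise ValueError(f"expected 1 or {num_priorities} values, got {len(quotas)}")
    return quotas


def ib_mtu_bytes(code):
    """Map an InfiniBand MTU enumeration code (1..5) to bytes: 256 ... 4096."""
    if code not in range(1, 6):
        raise ValueError(f"unsupported MTU code {code}")
    return 256 << (code - 1)


def local_ack_timeout_seconds(value):
    """InfiniBand local ACK timeout: 4.096 microseconds * 2**value."""
    if not 1 <= value <= 31:
        raise ValueError("timeout exponent must be in 1..31")
    return 4.096e-6 * (2 ** value)


print(expand_prio_quota("3"))                      # [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
print(ib_mtu_bytes(5))                             # 4096
print(round(local_ack_timeout_seconds(0x12), 3))   # about 1.074 seconds
print(round(local_ack_timeout_seconds(0x1F)))      # about 8796 seconds
```

Under these assumptions, the Rev 2.4.3 change from 0x1F to 0x12 shortens the per-attempt ACK timeout from roughly 2.4 hours to roughly one second.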