Feature/Change |
Description |
Rev 2.7.0 |
|
Switches |
Added support for NVIDIA Quantum-2 switches with NDR speed |
Adapter Cards |
Added support for NVIDIA ConnectX-7 adapter card with 400 Gb/s speed |
SHARPD |
sharpd daemon process has been removed. sharpd-related activity is now performed from the user application process |
AM |
After a restart, AM no longer needs to wait for all concurrent jobs to finish before it can accept new jobs |
Added a mechanism that periodically checks for errors in Aggregation Trees and attempts to fix them |
|
General |
Added support for new data types BFLOAT16, INT8 and UINT8 for performing reduction operations |
Rev 2.6.1 |
|
General |
Added support for running libsharp_coll from SHARP 2.6.1 with SHARPD from SHARP 2.4.0 – 2.6.1 |
General |
Added information about updatable configuration parameters in the configuration file and help menu |
Network |
Added support for keep-alive on connections to SHARPD |
Network |
Added support for asynchronous connections |
Network |
Disabled the UCX listener by default in SHARP Aggregation Manager |
AM |
Added support for the non-default subnet prefix |
AM |
Added support for DF+ topologies with more than two-level islands |
SHARPD |
Added support for caching AM address |
Rev 2.5.0 |
|
Resource Management |
Added support for exclusive lock requests for streaming aggregation jobs. |
Network |
Enabled connection keep-alive between SHARPD and Aggregation Manager. |
Rev 2.4.3 |
|
General |
Added support for identifying Aggregation Nodes based on SMDB. |
General |
Improved minhop tables calculation. |
General |
Added a new API for querying events. |
Rev 2.1.4 |
|
sharp_am/sharpd/libsharp_coll: Streaming Aggregation |
Added support for Streaming Aggregation over ConnectX-6 adapter card and Quantum switch. |
libsharp_coll: GPU Accelerator |
Added support for NVIDIA GPU buffers. |
sharp_am: OOB |
Added support for identifying the topology type from the OpenSM SMDB file. |
sharp_am: Reboot |
Fixed an issue where recovery failed after reboot of all switches in the cluster. |
Rev 2.0.0 |
|
sharp_am/sharpd/libsharp_coll |
Added support for the following NVIDIA Quantum switch capabilities:
|
sharp_am/sharpd: Resource Management |
Added support for enabling and disabling reproducibility on the job level. |
sharp_am/sharpd: Subnet Management |
Added support for controlling the SA key for SA operations. |
libsharp_coll: GPUDirect |
Added support for CUDA GPUDirect and GPUDirect RDMA. |
Rev 1.8.1 |
|
Aggregation Manager (sharp_am): Resiliency |
Added support for waiting for jobs to end prior to performing fabric reinitialization on AM startup. |
Mellanox SHARP Daemon (sharpd): Out-of-Box Improvements |
Socket-based activation is now enabled by default when installed from RPM/MLNX_OFED. |
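The Rev 2.6.1 compatibility entry above (libsharp_coll from SHARP 2.6.1 with SHARPD from SHARP 2.4.0 – 2.6.1) amounts to an inclusive version-range check. A minimal Python sketch of that check follows. The helper names `parse_version` and `sharpd_is_compatible` are hypothetical and are not part of any SHARP API:

```python
# Hypothetical helpers for illustration only; SHARP does not ship these.

# Inclusive SHARPD range usable with libsharp_coll 2.6.1, per the Rev 2.6.1 entry.
SHARPD_MIN = (2, 4, 0)
SHARPD_MAX = (2, 6, 1)


def parse_version(text):
    """Turn a dotted version string such as '2.5.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def sharpd_is_compatible(sharpd_version):
    """Return True if the SHARPD version lies within the supported range."""
    return SHARPD_MIN <= parse_version(sharpd_version) <= SHARPD_MAX


if __name__ == "__main__":
    for version in ("2.4.0", "2.5.0", "2.6.1", "2.1.4"):
        print(version, sharpd_is_compatible(version))
```

Comparing tuples of integers avoids the pitfalls of comparing version strings lexically.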
Parameter |
Component |
Description |
Rev 2.7.0 |
||
recovery_retry_interval |
sharp_am |
New parameter: Timeout, in seconds, for tree recovery retries. A value of 0 disables tree recovery. Default: 300 |
enable_seamless_restart |
sharp_am |
New parameter: A boolean flag. If enabled, AM tries to recover its state from the last AM run and continue operating the current jobs. Default: True |
seamless_restart_trees_file |
sharp_am |
New parameter: Sets the SHARP trees file used in seamless restart. Specify only the file name; the full path is constructed using ‘dump_dir’. Default: sharp_am_trees_structure.dump |
seamless_restart_max_retries |
sharp_am |
New parameter: Sets the maximum number of consecutive seamless restart retries. If seamless restart fails more than this many times in a row, it is disabled in the next run. Default: 3 |
max_tree_radix |
sharp_am |
Update: Changed default to 252 |
ib_sat_max_mtu |
sharp_am |
Update: Changed default to 5, to support the MAD value that represents 4K MTU. |
per_prio_default_quota |
sharp_am |
Update: Changed default to 3 instead of 20, enabling more SAT jobs to take place in parallel on each switch. |
Rev 2.6.1 |
||
dump_dir |
sharp_am |
Update: Changed default to /var/log |
smx_enabled_protocols |
sharp_am |
Update: Changed default from 7 to 6 (disable UCX by default) |
ib_mad_timeout |
sharp_am |
Update: Changed default from 200 to 500 |
sr_mad_timeout |
sharpd |
New parameter: Controls the timeout for ServiceRecord queries. Default: 10000 milliseconds |
sr_mad_retries |
sharpd |
New parameter: Controls the number of retries for ServiceRecord queries. Default: 3 retries |
Rev 2.5.0 |
||
smx_keepalive_interval |
sharp_am/sharpd |
New parameter: Keep-alive interval in seconds; 0 disables keep-alive. Default: 60 seconds |
smx_incoming_conn_keepalive_interval |
sharp_am |
New parameter: Keep-alive interval for incoming connections; 0 disables keep-alive. Default: 300 seconds |
enable_exclusive_lock |
sharp_am |
New parameter: Enable/Disable exclusive lock feature. Default: True |
enable_topology_api |
sharp_am |
New parameter: Enable/Disable Topology API feature. Default: True |
max_trees_to_build |
sharp_am |
New parameter: Controls the number of trees for AM to build. Default: 126 |
Rev 2.4.3 |
||
ib_max_mads_on_wire |
sharp_am |
Modified behavior: Changed default from 100 to 4096 |
ib_qpc_local_ack_timeout |
sharp_am |
Modified behavior: Changed default from 0x1F to 0x12 |
ib_sat_qpc_local_ack_timeout |
sharp_am |
Modified behavior: Changed default from 0x1F to 0x12 |
ib_qpc_timeout_retry_limit |
sharp_am |
Modified behavior: Changed default from 7 to 6 |
ib_sat_qpc_timeout_retry_limit |
sharp_am |
Modified behavior: Changed default from 7 to 6 |
Rev 2.0.0 |
||
control_path_version |
sharp_am |
New parameter Default |
max_compute_ports_per_agg_node |
sharp_am |
Modified behavior: When set to 0, AN radix is set to maximal radix value. Default: 0 |
default_reproducibility |
sharp_am |
New parameter: Controls the default reproducibility mode for jobs. Default: TRUE |
ib_sa_key |
sharp_am |
New parameter: Control SA key for SA operations. Default: 0x1 |
coll_job_quota_max_payload_per_ost |
sharp_job_quota |
Modified behavior: Changed default value to 1024. |
SHARP_COLL_MAX_PAYLOAD_SIZE |
libsharp_coll |
Removed |
SHARP_COLL_NUM_SHARP_COLL_REQ |
libsharp_coll |
Removed |
SHARP_COLL_ENABLE_REPRODUCIBLE_MODE |
libsharp_coll |
New parameter: Control job reproducibility mode: 0 – Use default. 1 – No reproducibility. 2 – Reproducibility. |
SHARP_COLL_ENABLE_CUDA |
libsharp_coll |
New parameter: Enables CUDA GPU direct. |
SHARP_COLL_ENABLE_GPU_DIRECT_RDMA |
libsharp_coll |
New parameter: Enables GPU direct RDMA. |
Rev 1.8.1 |
||
pending_mode_timeout |
sharp_am |
New parameter: Defines AM waiting time for jobs to complete prior to fabric re-initialization upon startup. |
job_info_polling_interval |
sharp_am |
New parameter: Defines job status polling interval when waiting for jobs to complete upon startup. |
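The Rev 2.7.0 seamless-restart parameters listed above interact with each other: `seamless_restart_trees_file` holds only a file name that is joined with `dump_dir`, and `seamless_restart_max_retries` bounds the number of consecutive failures. The Python sketch below illustrates that relationship using the defaults from the table. The function names and the exact failure-counting rule are assumptions made for illustration and do not describe the actual sharp_am implementation. The sketch also assumes that `dump_dir` still defaults to /var/log, as set in Rev 2.6.1.

```python
import posixpath

# Defaults taken from the parameter table (Rev 2.7.0; dump_dir from Rev 2.6.1).
DEFAULTS = {
    "dump_dir": "/var/log",
    "enable_seamless_restart": True,
    "seamless_restart_trees_file": "sharp_am_trees_structure.dump",
    "seamless_restart_max_retries": 3,
}


def trees_file_path(config):
    """Join dump_dir and the trees file name into a full path."""
    merged = {**DEFAULTS, **config}
    return posixpath.join(merged["dump_dir"], merged["seamless_restart_trees_file"])


def seamless_restart_allowed(config, consecutive_failures):
    """Assumed rule: allowed until failures exceed the configured maximum."""
    merged = {**DEFAULTS, **config}
    if not merged["enable_seamless_restart"]:
        return False
    return consecutive_failures <= merged["seamless_restart_max_retries"]


if __name__ == "__main__":
    print(trees_file_path({}))
    print(seamless_restart_allowed({}, consecutive_failures=4))
```

Overriding only `dump_dir` in the hypothetical `config` dictionary moves the trees file without changing its name, which matches the description of `seamless_restart_trees_file`.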