Changes and New Features History
Feature/Change | Description |
---|---|
Rev 2.7.0 | |
Switches | Added support for NVIDIA Quantum-2 switches with NDR speed |
Adapter Cards | Added support for NVIDIA ConnectX-7 adapter card with 400 Gb/s speed |
SHARPD | The sharpd daemon process has been removed. sharpd-related activity is now performed from the user application process |
AM | After a restart, AM no longer needs to wait for all concurrent jobs to finish before it can accept new jobs |
AM | Added a mechanism that periodically checks for errors in Aggregation Trees and attempts to fix them |
General | Added support for the new data types BFLOAT16, INT8 and UINT8 for performing reduction operations |
Rev 2.6.1 | |
General | Added support for running libsharp_coll from SHARP 2.6.1 with SHARPD from SHARP 2.4.0 – 2.6.1 |
General | Added information about updatable configuration parameters in the configuration file and help menu |
Network | Added support for keep-alive on connections to SHARPD |
Network | Added support for asynchronous connections |
Network | Disabled the UCX listener by default in SHARP Aggregation Manager |
AM | Added support for the non-default subnet prefix |
AM | Added support for DF+ topologies with more than two-level islands |
SHARPD | Added support for caching AM address |
Rev 2.5.0 | |
Resource Management | Added support for exclusive lock requests for streaming aggregation jobs. |
Network | Enabled connection keep-alive between SHARPD and Aggregation Manager. |
Rev 2.4.3 | |
General | Added support for identifying Aggregation Nodes based on SMDB. |
General | Improved minhop tables calculation. |
General | Added a new API for querying events. |
Rev 2.1.4 | |
sharp_am/sharpd/libsharp_coll: Streaming Aggregation | Added support for Streaming Aggregation over ConnectX-6 adapter card and Quantum switch. |
libsharp_coll: GPU Accelerator | Added support for NVIDIA GPU buffers. |
sharp_am: OOB | Added support for identifying the topology type from the OpenSM SMDB file. |
sharp_am: Reboot | Fixed an issue where recovery failed after reboot of all switches in the cluster. |
Rev 2.0.0 | |
sharp_am/sharpd/libsharp_coll | Added support for the following NVIDIA Quantum switch capabilities:
|
sharp_am/sharpd: Resource Management | Added support for enabling and disabling reproducibility on the job level. |
sharp_am/sharpd: Subnet Management | Added support for controlling the SA key for SA operations. |
libsharp_coll: GPUDirect | Added support for CUDA GPUDirect and GPUDirect RDMA. |
Rev 1.8.1 | |
Aggregation Manager (sharp_am): Resiliency | Added support for waiting for jobs to end prior to performing fabric reinitialization on AM startup. |
Mellanox SHARP Daemon (sharpd): Out-of-Box Improvements | Socket-based activation is now enabled by default when installed from RPM/MLNX_OFED. |
Parameters Change History
Parameter | Component | Description |
---|---|---|
Rev 2.7.0 | ||
recovery_retry_interval | sharp_am | New parameter: A timeout, in seconds, for tree recovery retries. A value of 0 disables tree recovery. Default: 300 |
enable_seamless_restart | sharp_am | New parameter: A boolean flag. If enabled, AM tries to recover its state from the last AM run and continue operating the current jobs. Default: True |
seamless_restart_trees_file | sharp_am | New parameter: Sets the SHARP trees file used in seamless restart. Specify only the file name; the full path is constructed using ‘dump_dir’. Default: sharp_am_trees_structure.dump |
seamless_restart_max_retries | sharp_am | New parameter: Sets the number of consecutive seamless restart retries. If seamless restart fails more than this number of times in a row, it is disabled in the next run. Default: 3 |
max_tree_radix | sharp_am | Update: Changed default to 252 |
ib_sat_max_mtu | sharp_am | Update: Changed default to 5, to support the MAD value that represents a 4K MTU. |
per_prio_default_quota | sharp_am | Update: Changed default to 3 instead of 20, enabling more SAT jobs to take place in parallel on each switch. |
Rev 2.6.1 | ||
dump_dir | sharp_am | Update: Changed default to /var/log |
smx_enabled_protocols | sharp_am | Update: Changed default from 7 to 6 (disable UCX by default) |
ib_mad_timeout | sharp_am | Update: Changed default from 200 to 500 |
sr_mad_timeout | sharpd | New parameter: Controls the timeout for ServiceRecord queries. Default: 10000 milliseconds |
sr_mad_retries | sharpd | New parameter: Controls the number of retries for ServiceRecord queries. Default: 3 retries |
Rev 2.5.0 | ||
smx_keepalive_interval | sharp_am/sharpd | New parameter: Keep-alive interval in seconds. Set to 0 to disable keep-alive. Default: 60 seconds |
smx_incoming_conn_keepalive_interval | sharp_am | New parameter: Keep-alive interval for incoming connections. Set to 0 to disable. Default: 300 seconds |
enable_exclusive_lock | sharp_am | New parameter: Enable/Disable exclusive lock feature. Default: True |
enable_topology_api | sharp_am | New parameter: Enable/Disable Topology API feature. Default: True |
max_trees_to_build | sharp_am | New parameter: Controls the number of trees for AM to build. Default: 126 |
Rev 2.4.3 | ||
ib_max_mads_on_wire | sharp_am | Modified behavior: Changed default from 100 to 4096 |
ib_qpc_local_ack_timeout | sharp_am | Modified behavior: Changed default from 0x1F to 0x12 |
ib_sat_qpc_local_ack_timeout | sharp_am | Modified behavior: Changed default from 0x1F to 0x12 |
ib_qpc_timeout_retry_limit | sharp_am | Modified behavior: Changed default from 7 to 6 |
ib_sat_qpc_timeout_retry_limit | sharp_am | Modified behavior: Changed default from 7 to 6 |
Rev 2.0.0 | ||
control_path_version | sharp_am | New parameter |
max_compute_ports_per_agg_node | sharp_am | Modified behavior: When set to 0, AN radix is set to maximal radix value. Default: 0 |
default_reproducibility | sharp_am | New parameter: Controls the default reproducibility mode for jobs. Default: TRUE |
ib_sa_key | sharp_am | New parameter: Control SA key for SA operations. Default: 0x1 |
coll_job_quota_max_payload_per_ost | sharp_job_quota | Modified behavior: Changed default value to 1024. |
SHARP_COLL_MAX_PAYLOAD_SIZE | libsharp_coll | Removed |
SHARP_COLL_NUM_SHARP_COLL_REQ | libsharp_coll | Removed |
SHARP_COLL_ENABLE_REPRODUCIBLE_MODE | libsharp_coll | New parameter: Controls job reproducibility mode: 0 – Use default. 1 – No reproducibility. 2 – Reproducibility. |
SHARP_COLL_ENABLE_CUDA | libsharp_coll | New parameter: Enables CUDA GPUDirect. |
SHARP_COLL_ENABLE_GPU_DIRECT_RDMA | libsharp_coll | New parameter: Enables GPUDirect RDMA. |
Rev 1.8.1 | ||
pending_mode_timeout | sharp_am | New parameter: Defines AM waiting time for jobs to complete prior to fabric re-initialization upon startup. |
job_info_polling_interval | sharp_am | New parameter: Defines job status polling interval when waiting for jobs to complete upon startup. |
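
Two of the parameter descriptions above can be illustrated with a short sketch: how the seamless restart trees file path is built from `dump_dir` plus a bare file name, and how the three documented values of `SHARP_COLL_ENABLE_REPRODUCIBLE_MODE` map to an environment entry. This is not SHARP code. The helper names `seamless_restart_trees_path` and `reproducibility_env` and the `ReproducibleMode` enum are hypothetical, and the sketch assumes POSIX-style path joining and that no later revision changed the `dump_dir` default.

```python
import posixpath
from enum import IntEnum

# Defaults copied from the parameter tables above.
DEFAULT_DUMP_DIR = "/var/log"                         # dump_dir default since Rev 2.6.1
DEFAULT_TREES_FILE = "sharp_am_trees_structure.dump"  # Rev 2.7.0 default


class ReproducibleMode(IntEnum):
    """The three values listed for SHARP_COLL_ENABLE_REPRODUCIBLE_MODE."""
    USE_DEFAULT = 0
    NO_REPRODUCIBILITY = 1
    REPRODUCIBILITY = 2


def seamless_restart_trees_path(dump_dir=DEFAULT_DUMP_DIR,
                                file_name=DEFAULT_TREES_FILE):
    """Hypothetical helper: build the full path from dump_dir and a bare file name."""
    # The parameter takes a file name only, so reject empty values and paths.
    if not file_name or posixpath.basename(file_name) != file_name:
        raise ValueError("expected a file name only, not a path")
    return posixpath.join(dump_dir, file_name)


def reproducibility_env(mode):
    """Hypothetical helper: environment entry for a libsharp_coll job."""
    # ReproducibleMode(mode) raises ValueError for anything other than 0, 1 or 2.
    value = ReproducibleMode(mode)
    return {"SHARP_COLL_ENABLE_REPRODUCIBLE_MODE": str(int(value))}


if __name__ == "__main__":
    print(seamless_restart_trees_path())
    print(reproducibility_env(ReproducibleMode.REPRODUCIBILITY))
```

Validating the mode through an enum keeps an out-of-range value such as 3 from being exported silently; whether libsharp_coll itself rejects such values is not stated in the tables.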