Release Notes Revision History
Internal Ref. | Issue |
2678295 | Description: SHARP AM does not support reconnecting an HCA port to a different switch. As a result, streaming aggregation jobs might fail. |
Keywords: Aggregation Manager | |
Discovered in Release: 2.4.3 | |
Fixed in version: 2.5.0 | |
2439054 | Description: SHARP AM might hang when the SMX sub-component is under stress, due to the operating system socket buffer sizes. |
Keywords: Aggregation Manager, SMX | |
Discovered in Release: 2.4.3 | |
Fixed in version: 2.5.0 | |
1179747 | Description: Changing the smx_sock_interface configuration parameter is not supported. |
Keywords: SHARP Daemon | |
Fixed in version: 2.5.0 |
Feature/Change | Description |
Rev 2.5.0 | |
Resource Management | Added support for exclusive lock requests for streaming aggregation jobs. |
Network | Enabled connection keep-alive between SHARPD and Aggregation Manager. |
Rev 2.4.3 | |
General | Added support for identifying Aggregation Nodes based on SMDB. |
General | Improved minhop tables calculation. |
General | Added a new API for querying events. |
Rev 2.1.4 | |
sharp_am/sharpd/libsharp_coll: Streaming Aggregation | Added support for Streaming Aggregation over ConnectX-6 adapter card and Quantum switch. |
libsharp_coll: GPU Accelerator | Added support for NVIDIA GPU buffers. |
sharp_am: OOB | Added support for identifying the topology type from the OpenSM SMDB file. |
sharp_am: Reboot | Fixed an issue where recovery failed after reboot of all switches in the cluster. |
Rev 2.0.0 | |
sharp_am/sharpd/libsharp_coll | Added support for the following NVIDIA Quantum switch capabilities:
|
sharp_am/sharpd: Resource Management | Added support for enabling and disabling reproducibility on the job level. |
sharp_am/sharpd: Subnet Management | Added support for controlling the SA key for SA operations. |
libsharp_coll: GPUDirect | Added support for CUDA GPUDirect and GPUDirect RDMA. |
Rev 1.8.1 | |
Aggregation Manager (sharp_am): Resiliency | Added support for waiting for jobs to end prior to performing fabric reinitialization on AM startup. |
Mellanox SHARP Daemon (sharpd): Out-of-Box Improvements | Socket-based activation is now enabled by default when installed from RPM/MLNX_OFED. |
Parameter | Component | Description |
Rev 2.5.0 | ||
smx_keepalive_interval | sharp_am/sharpd | New parameter: Keep-alive interval in seconds. Set to 0 to disable keep-alive. Default: 60 seconds |
smx_incoming_conn_keepalive_interval | sharp_am | New parameter: Keep-alive interval for incoming connections. Set to 0 to disable. Default: 300 seconds |
enable_exclusive_lock | sharp_am | New parameter: Enable/Disable exclusive lock feature. Default: True |
enable_topology_api | sharp_am | New parameter: Enable/Disable Topology API feature. Default: True |
max_trees_to_build | sharp_am | New parameter: Controls the number of trees for AM to build. Default: 126 |
Rev 2.4.3 | ||
ib_max_mads_on_wire | sharp_am | Modified behavior: Changed default from 100 to 4096 |
ib_qpc_local_ack_timeout | sharp_am | Modified behavior: Changed default from 0x1F to 0x12 |
ib_sat_qpc_local_ack_timeout | sharp_am | Modified behavior: Changed default from 0x1F to 0x12 |
ib_qpc_timeout_retry_limit | sharp_am | Modified behavior: Changed default from 7 to 6 |
ib_sat_qpc_timeout_retry_limit | sharp_am | Modified behavior: Changed default from 7 to 6 |
Rev 2.0.0 | ||
control_path_version | sharp_am | New parameter |
max_compute_ports_per_agg_node | sharp_am | Modified behavior: When set to 0, AN radix is set to maximal radix value. Default: 0 |
default_reproducibility | sharp_am | New parameter: Control default reproducibility mode for jobs. Default: TRUE |
ib_sa_key | sharp_am | New parameter: Control SA key for SA operations. Default: 0x1 |
coll_job_quota_max_payload_per_ost | sharp_job_quota | Modified behavior: Changed default value to 1024. |
SHARP_COLL_MAX_PAYLOAD_SIZE | libsharp_coll | Removed |
SHARP_COLL_NUM_SHARP_COLL_REQ | libsharp_coll | Removed |
SHARP_COLL_ENABLE_REPRODUCIBLE_MODE | libsharp_coll | New parameter: Control job reproducibility mode: 0 – Use default. 1 – No reproducibility. 2 – Reproducibility. |
SHARP_COLL_ENABLE_CUDA | libsharp_coll | New parameter: Enables CUDA GPUDirect. |
SHARP_COLL_ENABLE_GPU_DIRECT_RDMA | libsharp_coll | New parameter: Enables GPUDirect RDMA. |
Rev 1.8.1 | ||
pending_mode_timeout | sharp_am | New parameter: Defines AM waiting time for jobs to complete prior to fabric re-initialization upon startup. |
job_info_polling_interval | sharp_am | New parameter: Defines job status polling interval when waiting for jobs to complete upon startup. |
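
The value conventions in the parameter table can be illustrated with a short Python sketch. It is not part of SHARP and uses no SHARP API. The helper names (`ReproducibleMode`, `reproducible_mode`, `keepalive_enabled`) are hypothetical, and treating an unset SHARP_COLL_ENABLE_REPRODUCIBLE_MODE as 0 is an assumption of this sketch, not documented behavior.

```python
import os
from enum import IntEnum


class ReproducibleMode(IntEnum):
    """The three values the table lists for SHARP_COLL_ENABLE_REPRODUCIBLE_MODE."""
    USE_DEFAULT = 0
    NO_REPRODUCIBILITY = 1
    REPRODUCIBILITY = 2


def reproducible_mode(env=None):
    """Hypothetical helper: read the mode from an environment mapping.

    Assumption: an unset variable is treated as 0 (use default).
    Raises ValueError for anything other than 0, 1, or 2.
    """
    env = os.environ if env is None else env
    raw = env.get("SHARP_COLL_ENABLE_REPRODUCIBLE_MODE", "0")
    return ReproducibleMode(int(raw))


def keepalive_enabled(interval_seconds):
    """Per the table, an interval of 0 disables keep-alive; other values enable it."""
    return interval_seconds != 0


if __name__ == "__main__":
    print(reproducible_mode({"SHARP_COLL_ENABLE_REPRODUCIBLE_MODE": "2"}).name)
    print(keepalive_enabled(60), keepalive_enabled(0))
```

Passing a plain dictionary instead of the real environment keeps the sketch easy to check without touching a running job's settings.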