NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) v2.6.1
1.0

Release Notes Revision History

Internal Ref.

Issue

2678295

Description: SHARP AM does not support reconnecting an HCA port to a different switch. As a result, streaming aggregation jobs might fail.

Keywords: Aggregation Manager

Discovered in Release: 2.4.3

Fixed in version: 2.5.0

2439054

Description: SHARP AM might hang when the SMX sub-component is under stress, due to the operating system socket buffer sizes.

Keywords: Aggregation Manager, SMX

Discovered in Release: 2.4.3

Fixed in version: 2.5.0

1179747

Description: Changing the smx_sock_interface configuration parameter is not supported.

Keywords: SHARP Daemon

Fixed in version: 2.5.0

Feature/Change

Description

Rev 2.5.0

Resource Management

Added support for exclusive lock requests for streaming aggregation jobs.

Network

Enabled connection keep-alive between SHARPD and Aggregation Manager.

Rev 2.4.3

General

Added support for identifying Aggregation Nodes based on SMDB.

General

Improved minhop tables calculation.

General

Added a new API for querying events.

Rev 2.1.4

sharp_am/sharpd/libsharp_coll: Streaming Aggregation

Added support for Streaming Aggregation over ConnectX-6 adapter card and Quantum switch.

libsharp_coll: GPU Accelerator

Added support for NVIDIA GPU buffers.

sharp_am: OOB

Added support for identifying the topology type from the OpenSM SMDB file.

sharp_am: Reboot

Fixed an issue where recovery failed after reboot of all switches in the cluster.

Rev 2.0.0

sharp_am/sharpd/libsharp_coll

Added support for the following NVIDIA Quantum switch capabilities:

  • Performing data operations on new data types (unsigned short, short, and short floating point data types)

  • 1K OST payload
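As a rough illustration of what the new 16-bit data types mean for a single payload, the sketch below counts how many elements of each type fit into one payload. It assumes that "1K" means 1024 bytes and that "short floating point" means IEEE 754 half precision; the names `PAYLOAD_BYTES`, `FORMATS`, and `elements_per_payload` are hypothetical and not part of SHARP.

```python
import struct

PAYLOAD_BYTES = 1024  # assumption: "1K OST payload" means 1024 bytes

# struct format codes: H = unsigned short, h = short, e = half-precision float
FORMATS = {"unsigned short": "<H", "short": "<h", "half float": "<e"}


def elements_per_payload(fmt: str, payload_bytes: int = PAYLOAD_BYTES) -> int:
    """Return how many elements of the given struct format fit in one payload."""
    return payload_bytes // struct.calcsize(fmt)


for name, fmt in FORMATS.items():
    print(f"{name}: {elements_per_payload(fmt)} elements per payload")
```

Under these assumptions each of the three types occupies 2 bytes, so one payload holds 512 elements.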

sharp_am/sharpd: Resource Management

Added support for enabling and disabling reproducibility on the job level.

sharp_am/sharpd: Subnet Management

Added support for controlling the SA key for SA operations.

libsharp_coll: GPUDirect

Added support for CUDA GPUDirect and GPUDirect RDMA.

Rev 1.8.1

Aggregation Manager (sharp_am): Resiliency

Added support for waiting for jobs to end prior to performing fabric reinitialization on AM startup.

Mellanox SHARP Daemon (sharpd): Out-of-Box Improvements

Socket-based activation is now enabled by default when installed from RPM/MLNX_OFED.

Parameter

Component

Description

Rev 2.5.0

smx_keepalive_interval

sharp_am/sharpd

New parameter: Keep-alive interval, in seconds. Set to 0 to disable keep-alive.

Default: 60 seconds

smx_incoming_conn_keepalive_interval

sharp_am

New parameter: Keep-alive interval for incoming connections, in seconds. Set to 0 to disable.

Default: 300 seconds
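The "0 disables, any positive value is an interval in seconds" convention used by smx_keepalive_interval and smx_incoming_conn_keepalive_interval can be sketched with ordinary TCP socket options. This is only an illustration of the convention, not how SHARP implements keep-alive; the function name `apply_keepalive` is hypothetical, and mapping the interval to `TCP_KEEPIDLE` is an assumption.

```python
import socket


def apply_keepalive(sock: socket.socket, interval_s: int) -> bool:
    """Enable TCP keep-alive with the given idle interval; 0 disables it.

    Returns True if keep-alive is enabled on the socket afterwards.
    """
    if interval_s < 0:
        raise ValueError("interval must be >= 0")
    if interval_s == 0:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 0)
        return False
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE is not available on every platform; skip it where missing.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, interval_s)
    return True
```

With the documented defaults, such a helper would be called with 60 for outgoing connections and 300 for incoming ones.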

enable_exclusive_lock

sharp_am

New parameter: Enables or disables the exclusive lock feature.

Default: True

enable_topology_api

sharp_am

New parameter: Enables or disables the Topology API feature.

Default: True

max_trees_to_build

sharp_am

New parameter: Controls the number of trees that the AM builds.

Default: 126

Rev 2.4.3

ib_max_mads_on_wire

sharp_am

Modified behavior: Changed default from 100 to 4096

ib_qpc_local_ack_timeout

sharp_am

Modified behavior: Changed default from 0x1F to 0x12

ib_sat_qpc_local_ack_timeout

sharp_am

Modified behavior: Changed default from 0x1F to 0x12

ib_qpc_timeout_retry_limit

sharp_am

Modified behavior: Changed default from 7 to 6

ib_sat_qpc_timeout_retry_limit

sharp_am

Modified behavior: Changed default from 7 to 6
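To put the change from 0x1F to 0x12 in perspective, the sketch below converts a timeout exponent to seconds. It assumes that ib_qpc_local_ack_timeout and ib_sat_qpc_local_ack_timeout map to the InfiniBand QP Local ACK Timeout field, whose nominal value is 4.096 µs × 2^value, with 0 meaning the timeout is disabled. The function name is hypothetical.

```python
def local_ack_timeout_seconds(value: int) -> float:
    """Convert a 5-bit InfiniBand Local ACK Timeout exponent to seconds."""
    if not 0 <= value <= 0x1F:
        raise ValueError("Local ACK Timeout is a 5-bit field (0..31)")
    if value == 0:
        return float("inf")  # 0 disables the timeout (wait indefinitely)
    return 4.096e-6 * (2 ** value)


for v in (0x1F, 0x12):
    print(f"0x{v:02X}: {local_ack_timeout_seconds(v):.3f} s")
```

Under that assumption, the old default corresponds to a nominal timeout of roughly 8796 seconds (about 2.4 hours), while the new default is roughly 1.07 seconds per attempt.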

Rev 2.0.0

control_path_version

sharp_am

New parameter
Default

max_compute_ports_per_agg_node

sharp_am

Modified behavior: When set to 0, the AN radix is set to the maximum radix value.

Default: 0

default_reproducibility

sharp_am

New parameter: Controls the default reproducibility mode for jobs.

Default: TRUE

ib_sa_key

sharp_am

New parameter: Controls the SA key for SA operations.

Default: 0x1

coll_job_quota_max_payload_per_ost

sharp_job_quota

Modified behavior: Changed default value to 1024.

SHARP_COLL_MAX_PAYLOAD_SIZE

libsharp_coll

Removed

SHARP_COLL_NUM_SHARP_COLL_REQ

libsharp_coll

Removed

SHARP_COLL_ENABLE_REPRODUCIBLE_MODE

libsharp_coll

New parameter: Control job reproducibility mode:

0 – Use default.

1 – No reproducibility.

2 – Reproducibility.
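The three values of SHARP_COLL_ENABLE_REPRODUCIBLE_MODE listed above can be captured in a small helper that prepares the environment for a job. This is a minimal sketch: the enum and function names are hypothetical, and it assumes the variable is read from the process environment of the application that uses libsharp_coll.

```python
import os
from enum import IntEnum
from typing import Mapping, Optional


class ReproducibleMode(IntEnum):
    USE_DEFAULT = 0          # 0: use the default
    NO_REPRODUCIBILITY = 1   # 1: no reproducibility
    REPRODUCIBILITY = 2      # 2: reproducibility


def job_environment(
    mode: ReproducibleMode, base: Optional[Mapping[str, str]] = None
) -> dict:
    """Return a copy of the environment with the reproducibility mode set."""
    env = dict(os.environ if base is None else base)
    env["SHARP_COLL_ENABLE_REPRODUCIBLE_MODE"] = str(int(mode))
    return env
```

The resulting dictionary could be passed as the `env` argument of `subprocess.run` when starting the job; the launch command itself is site-specific and is not shown here.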

SHARP_COLL_ENABLE_CUDA

libsharp_coll

New parameter: Enables CUDA GPUDirect.

SHARP_COLL_ENABLE_GPU_DIRECT_RDMA

libsharp_coll

New parameter: Enables GPUDirect RDMA.

Rev 1.8.1

pending_mode_timeout

sharp_am

New parameter: Defines AM waiting time for jobs to complete prior to fabric re-initialization upon startup.

job_info_polling_interval

sharp_am

New parameter: Defines job status polling interval when waiting for jobs to complete upon startup.
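The interaction between pending_mode_timeout and job_info_polling_interval can be pictured as a bounded polling loop: check the job status at a fixed interval, and give up once the timeout expires. The sketch below is an illustration of that pattern only, not the AM's actual logic. The `active_jobs` callback is hypothetical, and both parameters are assumed to be expressed in seconds.

```python
import time
from typing import Callable


def wait_for_jobs(
    active_jobs: Callable[[], int],
    pending_mode_timeout: float,
    job_info_polling_interval: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until no jobs remain or the timeout expires.

    Returns True if all jobs finished, False if the timeout was reached.
    """
    deadline = clock() + pending_mode_timeout
    while True:
        if active_jobs() == 0:
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(job_info_polling_interval, remaining))
```

The `clock` and `sleep` arguments are injectable so the loop can be exercised without real delays. A caller would proceed with fabric re-initialization once the function returns, whichever way it ends.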

© Copyright 2023, NVIDIA. Last updated on May 23, 2023.