NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 2.7.0

Release Notes Change History

Feature/Change

Description

Rev 2.6.1

General

Added support for running libsharp_coll from SHARP 2.6.1 with SHARPD from SHARP 2.4.0 – 2.6.1

General

Added information about updatable configuration parameters in the configuration file and help menu

Network

Added support for keep-alive on connections to SHARPD

Network

Added support for asynchronous connections

Network

Disabled the UCX listener by default in the SHARP Aggregation Manager

AM

Added support for a non-default subnet prefix

AM

Added support for DF+ topologies with more than two-level islands

SHARPD

Added support for caching the AM address

Rev 2.5.0

Resource Management

Added support for exclusive lock requests for streaming aggregation jobs.

Network

Enabled connection keep-alive between SHARPD and Aggregation Manager.

Rev 2.4.3

General

Added support for identifying Aggregation Nodes based on SMDB.

General

Improved minhop tables calculation.

General

Added a new API for querying events.

Rev 2.1.4

sharp_am/sharpd/libsharp_coll: Streaming Aggregation

Added support for Streaming Aggregation over ConnectX-6 adapter card and Quantum switch.

libsharp_coll: GPU Accelerator

Added support for NVIDIA GPU buffers.

sharp_am: OOB

Added support for identifying the topology type from the OpenSM SMDB file.

sharp_am: Reboot

Fixed an issue where recovery failed after reboot of all switches in the cluster.

Rev 2.0.0

sharp_am/sharpd/libsharp_coll

Added support for the following NVIDIA Quantum switch capabilities:

  • Performing data operations on new data types (unsigned short, short, and short floating point data types)

  • 1K OST payload

sharp_am/sharpd: Resource Management

Added support for enabling and disabling reproducibility on the job level.

sharp_am/sharpd: Subnet Management

Added support for controlling the SA key for SA operations.

libsharp_coll: GPUDirect

Added support for CUDA GPUDirect and GPUDirect RDMA.

Rev 1.8.1

Aggregation Manager (sharp_am): Resiliency

Added support for waiting for jobs to end prior to performing fabric reinitialization on AM startup.

Mellanox SHARP Daemon (sharpd): Out-of-Box Improvements

Socket-based activation is now enabled by default when installed from RPM/MLNX_OFED.

Parameter

Component

Description

Rev 2.6.1

dump_dir

sharp_am

Update: Changed default to /var/log

smx_enabled_protocols

sharp_am

Update: Changed default from 7 to 6 (disable UCX by default)

ib_mad_timeout

sharp_am

Update: Changed default from 200 to 500

sr_mad_timeout

sharpd

New parameter: Control timeout for ServiceRecord queries

Default: 10000 milliseconds

sr_mad_retries

sharpd

New parameter: Control number of retries for ServiceRecord queries

Default: 3 retries
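
The smx_enabled_protocols change in Rev 2.6.1 can be read as a bitmask update. The Python sketch below only performs the arithmetic implied by the note (default 7 changed to 6, described as disabling UCX). The names UCX_BIT and protocol_enabled are illustrative assumptions, not SHARP identifiers, and the meaning of the remaining bits is not stated in these notes.

```python
# Illustrative arithmetic only: the names below are assumptions, not SHARP identifiers.
OLD_DEFAULT = 7  # 0b111, default before Rev 2.6.1
NEW_DEFAULT = 6  # 0b110, default from Rev 2.6.1

# The single bit that differs between the two defaults is the one
# the release note associates with UCX.
UCX_BIT = OLD_DEFAULT ^ NEW_DEFAULT


def protocol_enabled(mask: int, bit: int) -> bool:
    """Return True if the given bit is set in the protocol mask."""
    return (mask & bit) != 0


if __name__ == "__main__":
    print("bit cleared by the new default:", bin(UCX_BIT))
    print("UCX enabled with old default:", protocol_enabled(OLD_DEFAULT, UCX_BIT))
    print("UCX enabled with new default:", protocol_enabled(NEW_DEFAULT, UCX_BIT))
```

A deployment that still relies on the UCX listener would presumably need to set the parameter back to a mask that includes this bit.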

Rev 2.5.0

smx_keepalive_interval

sharp_am/sharpd

New parameter: Keep-alive interval in seconds. Set to 0 to disable keep-alive.

Default: 60 seconds

smx_incoming_conn_keepalive_interval

sharp_am

New parameter: Keep-alive interval for incoming connections. Set to 0 to disable.

Default: 300 seconds

enable_exclusive_lock

sharp_am

New parameter: Enable/Disable exclusive lock feature.

Default: True

enable_topology_api

sharp_am

New parameter: Enable/Disable Topology API feature.

Default: True

max_trees_to_build

sharp_am

New parameter: Control number of trees for AM to build

Default: 126

Rev 2.4.3

ib_max_mads_on_wire

sharp_am

Modified behavior: Changed default from 100 to 4096

ib_qpc_local_ack_timeout

sharp_am

Modified behavior: Changed default from 0x1F to 0x12

ib_sat_qpc_local_ack_timeout

sharp_am

Modified behavior: Changed default from 0x1F to 0x12

ib_qpc_timeout_retry_limit

sharp_am

Modified behavior: Changed default from 7 to 6

ib_sat_qpc_timeout_retry_limit

sharp_am

Modified behavior: Changed default from 7 to 6
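
To give a sense of scale for the Rev 2.4.3 ACK timeout change, the sketch below converts the old and new defaults to seconds. It assumes that ib_qpc_local_ack_timeout and ib_sat_qpc_local_ack_timeout follow the standard InfiniBand Local ACK Timeout encoding, in which a field value v stands for 4.096 microseconds multiplied by 2 to the power v. These notes do not confirm that encoding, and the function name is hypothetical.

```python
# Assumption: the parameter uses the InfiniBand Local ACK Timeout encoding,
# where a field value v (1..31) corresponds to 4.096 microseconds * 2**v.
BASE_SECONDS = 4.096e-6


def local_ack_timeout_seconds(field: int) -> float:
    """Convert a 5-bit Local ACK Timeout field value to seconds."""
    if not 1 <= field <= 31:
        raise ValueError("field must be in 1..31 (0 means no timeout in this encoding)")
    return BASE_SECONDS * (2 ** field)


if __name__ == "__main__":
    for label, value in (("old default", 0x1F), ("new default", 0x12)):
        print(f"{label} {value:#x}: {local_ack_timeout_seconds(value):.3f} s")
```

Under that assumption, 0x12 corresponds to roughly 1.07 seconds per attempt, while 0x1F corresponds to roughly 8796 seconds, so the new default lets a lost connection be detected far sooner.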

Rev 2.0.0

control_path_version

sharp_am

New parameter
Default

max_compute_ports_per_agg_node

sharp_am

Modified behavior: When set to 0, AN radix is set to maximal radix value.

Default: 0

default_reproducibility

sharp_am

New parameter: Control default reproducibility mode for jobs.

Default: TRUE

ib_sa_key

sharp_am

New parameter: Control SA key for SA operations.

Default: 0x1

coll_job_quota_max_payload_per_ost

sharp_job_quota

Modified behavior: Changed default value to 1024.

SHARP_COLL_MAX_PAYLOAD_SIZE

libsharp_coll

Removed

SHARP_COLL_NUM_SHARP_COLL_REQ

libsharp_coll

Removed

SHARP_COLL_ENABLE_REPRODUCIBLE_MODE

libsharp_coll

New parameter: Control job reproducibility mode:

0 – Use default.

1 – No reproducibility.

2 – Reproducibility.

SHARP_COLL_ENABLE_CUDA

libsharp_coll

New parameter: Enables CUDA GPUDirect.

SHARP_COLL_ENABLE_GPU_DIRECT_RDMA

libsharp_coll

New parameter: Enables GPUDirect RDMA.
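
The libsharp_coll parameters listed for Rev 2.0.0 are named like environment variables, so one plausible way to apply them is through the environment of the launched job. The sketch below builds such an environment for SHARP_COLL_ENABLE_REPRODUCIBLE_MODE using the three values listed above. The helper sharp_coll_env and the labels "default", "off", and "on" are hypothetical conveniences, not part of SHARP.

```python
import os

# Values for SHARP_COLL_ENABLE_REPRODUCIBLE_MODE as listed in these release notes.
# The dictionary keys are illustrative labels, not SHARP terminology.
REPRODUCIBLE_MODES = {"default": 0, "off": 1, "on": 2}


def sharp_coll_env(mode: str, base_env=None) -> dict:
    """Return a copy of the environment with the reproducibility mode set."""
    if mode not in REPRODUCIBLE_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["SHARP_COLL_ENABLE_REPRODUCIBLE_MODE"] = str(REPRODUCIBLE_MODES[mode])
    return env


if __name__ == "__main__":
    # Start from an empty environment so the printed result is easy to read.
    print(sharp_coll_env("on", base_env={}))
```

The returned dictionary can be passed as the env argument of subprocess.run when starting the application. Whether the job launcher forwards the variable to all ranks depends on the launcher and is outside the scope of this sketch.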

Rev 1.8.1

pending_mode_timeout

sharp_am

New parameter: Defines AM waiting time for jobs to complete prior to fabric re-initialization upon startup.

job_info_polling_interval

sharp_am

New parameter: Defines job status polling interval when waiting for jobs to complete upon startup.

© Copyright 2023, NVIDIA. Last updated on May 23, 2023.