Release Notes Change History

NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.0.0

Feature/Change

Description

Rev 2.7.0

Switches

Added support for NVIDIA Quantum-2 switches with NDR speed

Adapter Cards

Added support for NVIDIA ConnectX-7 adapter card with 400 Gb/s speed

SHARPD

sharpd daemon process has been removed. sharpd-related activity is now performed from the user application process

AM

After a restart, AM no longer needs to wait for all concurrent jobs to finish before it can accept new jobs

Added a mechanism that periodically checks for errors in Aggregation Trees and attempts to fix them

General

Added support for new data types BFLOAT16, INT8 and UINT8 for performing reduction operations
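
For readers unfamiliar with BFLOAT16, the sketch below shows how the format relates to a 32-bit float: it keeps the sign, the 8 exponent bits and the top 7 mantissa bits. This is an illustration of the data type only, not of how SHARP or the switch performs reductions. The helper names are hypothetical, and truncation is used here for simplicity; real hardware may round differently.

```python
import struct


def float_to_bfloat16_bits(x: float) -> int:
    """Return a 16-bit BFLOAT16 pattern for x (illustrative, truncating)."""
    # Encode as IEEE 754 float32, then keep only the upper 16 bits.
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16


def bfloat16_bits_to_float(bits16: int) -> float:
    """Expand a 16-bit BFLOAT16 pattern back to a Python float."""
    # Place the 16 bits in the upper half of a float32 and decode it.
    (x,) = struct.unpack(">f", struct.pack(">I", (bits16 & 0xFFFF) << 16))
    return x


def bfloat16_sum(values):
    """Sum values after passing each one through BFLOAT16 precision."""
    return sum(bfloat16_bits_to_float(float_to_bfloat16_bits(v)) for v in values)


# Value ranges of the two 8-bit integer types named in the release note.
INT8_RANGE = (-128, 127)
UINT8_RANGE = (0, 255)
```

Values such as 1.0 and 2.5 survive the conversion unchanged, while a value such as 3.14159 loses its low-order mantissa bits.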

Rev 2.6.1

General

Added support for running libsharp_coll from SHARP 2.6.1 with SHARPD from SHARP 2.4.0 – 2.6.1

General

Added information about updatable configuration parameters in the configuration file and help menu

Network

Added support for keep-alive on connections to SHARPD

Network

Added support for asynchronous connections

Network

Disabled UCX listener as default in SHARP Aggregation Manager

AM

Added support for the non-default subnet prefix

AM

Added support for DF+ topologies with more than two-level islands

SHARPD

Added support for caching AM address

Rev 2.5.0

Resource Management

Added support for exclusive lock requests for streaming aggregation jobs.

Network

Enabled connection keep-alive between SHARPD and Aggregation Manager.

Rev 2.4.3

General

Added support for identifying Aggregation Nodes based on SMDB.

General

Improved minhop tables calculation.

General

Added a new API for querying events.

Rev 2.1.4

sharp_am/sharpd/libsharp_coll: Streaming Aggregation

Added support for Streaming Aggregation over ConnectX-6 adapter card and Quantum switch.

libsharp_coll: GPU Accelerator

Added support for NVIDIA GPU buffers.

sharp_am: OOB

Added support for identifying the topology type from the OpenSM SMDB file.

sharp_am: Reboot

Fixed an issue where recovery failed after reboot of all switches in the cluster.

Rev 2.0.0

sharp_am/sharpd/libsharp_coll

Added support for the following NVIDIA Quantum switch capabilities:

  • Performing data operations on new data types (unsigned short, short, and short floating point data types)

  • 1K OST payload

sharp_am/sharpd: Resource Management

Added support for enabling and disabling reproducibility on the job level.

sharp_am/sharpd: Subnet Management

Added support for controlling the SA key for SA operations.

libsharp_coll: GPUDirect

Added support for CUDA GPUDirect and GPUDirect RDMA.

Rev 1.8.1

Aggregation Manager (sharp_am): Resiliency

Added support for waiting for jobs to end prior to performing fabric reinitialization on AM startup.

Mellanox SHARP Daemon (sharpd): Out-of-Box Improvements

Socket-based activation is now enabled by default when installed from RPM/MLNX_OFED.

Parameter

Component

Description

Rev 2.7.0

recovery_retry_interval

sharp_am

New parameter: Interval, in seconds, between tree recovery retries. A value of 0 disables tree recovery.

Default: 300

enable_seamless_restart

sharp_am

New parameter: A boolean flag. If enabled, AM tries to recover the state from its last run and continue operating the current jobs.

Default: True

seamless_restart_trees_file

sharp_am

New parameter: Sets the SHARP trees file used in seamless restart. Specify only the file name; the full path is constructed using 'dump_dir'.

Default: sharp_am_trees_structure.dump

seamless_restart_max_retries

sharp_am

New parameter: Sets the maximum number of consecutive seamless restart retries. If seamless restart fails more than this many times in a row, it is disabled in the next run.

Default: 3

max_tree_radix

sharp_am

Update: Changed default to 252

ib_sat_max_mtu

sharp_am

Update: Changed default to 5, to support the MAD value that represents 4K MTU.

per_prio_default_quota

sharp_am

Update: Changed default to 3 instead of 20, enabling more SAT jobs to take place in parallel on each switch.
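
As a quick reference, the sketch below collects the Rev 2.7.0 sharp_am defaults listed above and shows one plausible way the seamless restart trees path is derived from 'dump_dir'. The dictionary and helper names are hypothetical, the join rule is an assumption based on the parameter description, and the 'dump_dir' value is the Rev 2.6.1 default from this table.

```python
import posixpath

# Defaults copied from the parameter table; the dict itself is illustrative.
SHARP_AM_DEFAULTS = {
    "dump_dir": "/var/log",  # Rev 2.6.1 default
    "recovery_retry_interval": 300,  # seconds; 0 means no tree recovery
    "enable_seamless_restart": True,
    "seamless_restart_trees_file": "sharp_am_trees_structure.dump",
    "seamless_restart_max_retries": 3,
    "max_tree_radix": 252,
    "ib_sat_max_mtu": 5,
    "per_prio_default_quota": 3,
}


def effective_config(overrides=None):
    """Merge user overrides on top of the listed defaults."""
    return {**SHARP_AM_DEFAULTS, **(overrides or {})}


def seamless_restart_trees_path(overrides=None):
    """Assumed rule: the full path is dump_dir joined with the file name."""
    cfg = effective_config(overrides)
    return posixpath.join(cfg["dump_dir"], cfg["seamless_restart_trees_file"])


def tree_recovery_enabled(overrides=None):
    """Per the table, recovery_retry_interval = 0 disables tree recovery."""
    return effective_config(overrides)["recovery_retry_interval"] != 0
```

With no overrides, the helper resolves the trees file under /var/log; overriding only 'dump_dir' moves the file without changing its name.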

Rev 2.6.1

dump_dir

sharp_am

Update: Changed default to /var/log

smx_enabled_protocols

sharp_am

Update: Changed default from 7 to 6 (disable UCX by default)

ib_mad_timeout

sharp_am

Update: Changed default from 200 to 500

sr_mad_timeout

sharpd

New parameter: Controls the timeout for ServiceRecord queries

Default: 10000 milliseconds

sr_mad_retries

sharpd

New parameter: Controls the number of retries for ServiceRecord queries

Default: 3 retries
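
The change of the 'smx_enabled_protocols' default from 7 to 6 can be read as clearing a single bit. The sketch below assumes the value is a bitmask and derives the UCX bit from the two defaults given in the table; the meaning of the remaining bits is not stated here and is left uninterpreted. The function names are hypothetical.

```python
OLD_DEFAULT = 7  # default before Rev 2.6.1
NEW_DEFAULT = 6  # default from Rev 2.6.1 (UCX disabled)

# Assumption: the value is a bitmask, so the bit that differs is the UCX bit.
UCX_BIT = OLD_DEFAULT ^ NEW_DEFAULT


def ucx_enabled(mask: int) -> bool:
    """Return True if the assumed UCX bit is set in the mask."""
    return bool(mask & UCX_BIT)


def disable_ucx(mask: int) -> int:
    """Clear the assumed UCX bit and leave all other bits untouched."""
    return mask & ~UCX_BIT


def enable_ucx(mask: int) -> int:
    """Set the assumed UCX bit, for deployments that still want UCX."""
    return mask | UCX_BIT
```

Under this reading, restoring the pre-2.6.1 behavior amounts to setting the parameter back to 7.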

Rev 2.5.0

smx_keepalive_interval

sharp_am/sharpd

New parameter: Keep-alive interval in seconds. Set to 0 to disable keep-alive.

Default: 60 seconds

smx_incoming_conn_keepalive_interval

sharp_am

New parameter: Keep-alive interval in seconds for incoming connections. Set to 0 to disable.

Default: 300 seconds

enable_exclusive_lock

sharp_am

New parameter: Enable/Disable exclusive lock feature.

Default: True

enable_topology_api

sharp_am

New parameter: Enable/Disable Topology API feature

Default: True

max_trees_to_build

sharp_am

New parameter: Controls the number of trees for AM to build

Default: 126

Rev 2.4.3

ib_max_mads_on_wire

sharp_am

Modified behavior: Changed default from 100 to 4096

ib_qpc_local_ack_timeout

sharp_am

Modified behavior: Changed default from 0x1F to 0x12

ib_sat_qpc_local_ack_timeout

sharp_am

Modified behavior: Changed default from 0x1F to 0x12

ib_qpc_timeout_retry_limit

sharp_am

Modified behavior: Changed default from 7 to 6

ib_sat_qpc_timeout_retry_limit

sharp_am

Modified behavior: Changed default from 7 to 6
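
To give a sense of scale for the local ACK timeout change, the sketch below converts the encoded values to seconds using the standard InfiniBand rule for the QP local ACK timeout field (4.096 µs × 2^value, with 0 meaning wait indefinitely). It assumes that 'ib_qpc_local_ack_timeout' and 'ib_sat_qpc_local_ack_timeout' are written to that field unchanged, which the release note does not state. The function name is hypothetical.

```python
IB_ACK_TIMEOUT_BASE_SECONDS = 4.096e-6  # 4.096 microseconds


def local_ack_timeout_seconds(value: int) -> float:
    """Convert an encoded 5-bit local ACK timeout value to seconds."""
    if not 0 <= value <= 0x1F:
        raise ValueError("local ACK timeout is a 5-bit field (0..31)")
    if value == 0:
        # In InfiniBand, 0 is a special value meaning "wait indefinitely".
        return float("inf")
    return IB_ACK_TIMEOUT_BASE_SECONDS * (2 ** value)


old_timeout = local_ack_timeout_seconds(0x1F)  # previous default
new_timeout = local_ack_timeout_seconds(0x12)  # default from Rev 2.4.3
```

Under that assumption, the new default corresponds to roughly one second per attempt, compared with well over two hours for the previous value.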

Rev 2.0.0

control_path_version

sharp_am

New parameter

Default

max_compute_ports_per_agg_node

sharp_am

Modified behavior: When set to 0, AN radix is set to maximal radix value.

Default: 0

default_reproducibility

sharp_am

New parameter: Control default reproducibility mode for jobs.

Default: TRUE

ib_sa_key

sharp_am

New parameter: Control SA key for SA operations.

Default: 0x1

coll_job_quota_max_payload_per_ost

sharp_job_quota

Modified behavior: Changed default value to 1024.

SHARP_COLL_MAX_PAYLOAD_SIZE

libsharp_coll

Removed

SHARP_COLL_NUM_SHARP_COLL_REQ

libsharp_coll

Removed

SHARP_COLL_ENABLE_REPRODUCIBLE_MODE

libsharp_coll

New parameter: Controls the job reproducibility mode:

0 – Use default.

1 – No reproducibility.

2 – Reproducibility.

SHARP_COLL_ENABLE_CUDA

libsharp_coll

New parameter: Enables CUDA GPUDirect.

SHARP_COLL_ENABLE_GPU_DIRECT_RDMA

libsharp_coll

New parameter: Enables GPUDirect RDMA.
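
The three values of SHARP_COLL_ENABLE_REPRODUCIBLE_MODE listed above can be captured in a small helper. The sketch below assumes that libsharp_coll parameters are passed to the application as environment variables; the enum and function names are hypothetical and not part of any SHARP API.

```python
import os
from enum import IntEnum

ENV_VAR = "SHARP_COLL_ENABLE_REPRODUCIBLE_MODE"


class ReproducibleMode(IntEnum):
    """Values as listed in the parameter table."""
    USE_DEFAULT = 0
    NO_REPRODUCIBILITY = 1
    REPRODUCIBILITY = 2


def env_with_reproducible_mode(mode, base_env=None):
    """Return a copy of the environment with the reproducibility mode set.

    Raises ValueError if mode is not one of the three documented values.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[ENV_VAR] = str(int(ReproducibleMode(mode)))
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching a job, so the setting applies to that job only.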

Rev 1.8.1

pending_mode_timeout

sharp_am

New parameter: Defines AM waiting time for jobs to complete prior to fabric re-initialization upon startup.

job_info_polling_interval

sharp_am

New parameter: Defines job status polling interval when waiting for jobs to complete upon startup.

© Copyright 2023, NVIDIA. Last updated on Feb 15, 2024.