NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.7.0

Changes and New Features

Feature/Change

Description

Expanded SHARP jobs capacity

The number of SHARP jobs that can run concurrently was previously capped at 1023. The limit is now determined by the size of the cluster and the available switch resources.

Added support for security QKey

SHARP now supports compute nodes on which a security QKey is enabled.

SHARP Telemetry Reports

Added support for telemetry reports, which sharp_am writes at a configurable interval (see telemetry_interval and telemetry_file_path).

Bug Fixes

See Bug Fixes in this Version

Parameter

Component

Description

rdma_sr_enable

sharp_am

New parameter: A boolean parameter. Specifies whether sharp_am should publish its own service record via the rdmacm service, which allows libsharp to find sharp_am even when a security QKey is enabled.

Default: True.

telemetry_interval

sharp_am

New parameter: A decimal parameter. Specifies the interval, in seconds, between sharp_am telemetry updates.

A value of 0 disables telemetry reports. Valid values: 0, or 10 to 3600.

Default: 60 seconds.
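The range rule above can be checked before a value is written into the sharp_am configuration. The following is a minimal sketch, assuming the interval is a whole number of seconds; the function name validate_telemetry_interval is hypothetical and is not part of SHARP.

```python
def validate_telemetry_interval(seconds: int) -> int:
    """Return the value unchanged if it is 0 or within 10..3600, else raise.

    Hypothetical helper: it mirrors the documented range only and does not
    call into sharp_am.
    """
    if seconds == 0 or 10 <= seconds <= 3600:
        return seconds
    raise ValueError(
        f"telemetry_interval must be 0 or between 10 and 3600, got {seconds}"
    )


def telemetry_reports_enabled(seconds: int) -> bool:
    """A validated interval of 0 means that no telemetry reports are produced."""
    return validate_telemetry_interval(seconds) != 0
```

With the documented default of 60, telemetry_reports_enabled(60) is true, while a value such as 5 is rejected because it falls between 0 and the minimum of 10.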

telemetry_file_path

sharp_am

New parameter: A string parameter. Specifies the full path of the sharp_am telemetry output file.

An empty path or (null) disables telemetry reports.

Default in UFM: /opt/ufm/log/sharp_am_telemetry.dump

Default in non-UFM systems: (null)
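The defaulting rules for telemetry_file_path can be sketched as below. This is only an illustration of the behavior described above, not sharp_am code: the function name and the running_under_ufm flag are hypothetical, and None stands for "the parameter was not set at all".

```python
from typing import Optional

# Default path documented for UFM systems.
UFM_DEFAULT_PATH = "/opt/ufm/log/sharp_am_telemetry.dump"


def effective_telemetry_path(configured: Optional[str],
                             running_under_ufm: bool) -> Optional[str]:
    """Return the telemetry output path, or None when reports are disabled."""
    if configured is None:
        # Parameter not set: fall back to the documented defaults.
        return UFM_DEFAULT_PATH if running_under_ufm else None
    if configured in ("", "(null)"):
        # An empty path or (null) means no telemetry reports.
        return None
    return configured
```

Under these assumptions, a non-UFM system produces no telemetry file unless a path is set explicitly.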

smx_sock_addr_family

sharp_am

Determines which address family SMX's sockets use.

A new option has been added. The possible options are now: auto, ipv4, ipv6.

With the new "auto" option, both IPv4 and IPv6 can be used where applicable. If only one of them is configured on the management host, that address family is used.

Default: auto.
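The selection described for smx_sock_addr_family can be pictured with the sketch below. It is a hedged illustration only: the function name is hypothetical, the two booleans stand for "this family is configured on the management host", and the handling of an explicit ipv4 or ipv6 setting (always honoring it) is an assumption rather than documented behavior.

```python
from typing import List


def select_address_families(option: str, has_ipv4: bool,
                            has_ipv6: bool) -> List[str]:
    """Return the address families that SMX sockets may use under `option`."""
    configured = [name
                  for name, present in (("ipv4", has_ipv4), ("ipv6", has_ipv6))
                  if present]
    if option == "auto":
        # Both families when both are configured, otherwise the single one.
        return configured
    if option in ("ipv4", "ipv6"):
        # Assumption: an explicit setting is used as given.
        return [option]
    raise ValueError(f"unknown smx_sock_addr_family option: {option!r}")
```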

SHARP_SMX_SOCK_ADDR_FAMILY

libsharp

Parameter removed.

This environment variable controlled the socket address family that libsharp used (IPv4/IPv6).

It has been removed because the selection is now automatic, based on the address family that sharp_am supports.

SHARP_USE_USER_QKEY

libsharp

New parameter: A boolean parameter. Specifies whether libsharp should use a user QKey for MAD QPs.

If a compute node is configured with a security QKey enabled, SHARP should use a user QKey, and this environment variable should be set to true.

Default: False
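A job launcher could prepare the libsharp environment along the following lines. This is a minimal sketch under stated assumptions: the helper name is hypothetical, the literal value "true" follows the wording above (other accepted spellings are not documented here), and dropping SHARP_SMX_SOCK_ADDR_FAMILY simply reflects that the variable has been removed.

```python
from typing import Dict, Mapping


def build_libsharp_env(base: Mapping[str, str],
                       security_qkey_enabled: bool) -> Dict[str, str]:
    """Return a copy of `base` adjusted for the libsharp changes in this release."""
    env = dict(base)
    # Removed in this release: the address family is now selected automatically.
    env.pop("SHARP_SMX_SOCK_ADDR_FAMILY", None)
    if security_qkey_enabled:
        # Assumption: "true" is an accepted spelling of the boolean value.
        env["SHARP_USE_USER_QKEY"] = "true"
    return env
```

The result can be passed as the env argument of subprocess.run, for example build_libsharp_env(os.environ, security_qkey_enabled=True) after importing os.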

SHARP_SR_QUERY_SOURCE

libsharp

New parameter: Defines the source from which libsharp fetches the sharp_am service record.

Possible values:

0 - Fetch only from the SA (OpenSM). This was the only supported option before SHARP version 3.7.0.

1 - Fetch only from sharp_am itself (requires sharp_am to be configured with rdma_sr_enable=true).

2 - Try both options: first the SA (OpenSM) and, if that is not successful, sharp_am.

Default: 2 - Try both options.
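The three SHARP_SR_QUERY_SOURCE values amount to a small fallback scheme, sketched below. The code does not use any libsharp API: query_sa and query_sharp_am are hypothetical callables supplied by the caller, each returning a service record or None when the lookup is not successful.

```python
from typing import Callable, Optional

# Hypothetical lookup type: returns a service record string or None.
Lookup = Callable[[], Optional[str]]


def fetch_service_record(source: int, query_sa: Lookup,
                         query_sharp_am: Lookup) -> Optional[str]:
    """Fetch the sharp_am service record according to `source` (0, 1 or 2)."""
    if source == 0:
        return query_sa()            # SA (OpenSM) only
    if source == 1:
        return query_sharp_am()      # sharp_am only; needs rdma_sr_enable=true
    if source == 2:
        record = query_sa()          # try the SA first
        if record is not None:
            return record
        return query_sharp_am()      # fall back to sharp_am
    raise ValueError(f"SHARP_SR_QUERY_SOURCE must be 0, 1 or 2, got {source}")
```

With the default value of 2, a lookup that fails against the SA is retried against sharp_am.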

© Copyright 2024, NVIDIA. Last updated on May 6, 2024.