NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.0.0

Changes and New Features

Feature/Change

Description

General

Added support for executing multiple jobs that aggregate data through the same set of switches, while each job utilizes a different set of links.

SHARP logic is now application-aware with UFM capabilities. SHARP jobs can be assigned an App-ID, which can be used as a reference to the customer application performing these jobs.

For further information, refer to the UFM SLURM Integration appendix in the UFM User Manual.

Added the option to limit the SHARP resources that applications are allowed to consume.

For further information, refer to the UFM SLURM Integration appendix in the UFM User Manual.

AM

Modified the default resources provided to low-latency tree (LLT) and streaming aggregation tree (SAT) jobs. This allows a larger number of SAT jobs to run in parallel with a few LLT jobs (see the first three entries in the table below).

libsharp

SHARP jobs are now executed in exclusive lock mode by default (please see SHARP_COLL_JOB_REQ_EXCLUSIVE_LOCK_MODE in the table below).

Parameter

Component

Description

per_prio_default_quota

sharp_am

Update: This parameter now controls only the default percentage provided to LLT jobs. Its default value changed from 3 to 20.

per_prio_default_sat_quota

sharp_am

New parameter: Default percentage of quota (OSTs, buffers, and groups) per aggregation node per tree to request for a single SAT job, according to the job's priority.

If no explicit quota request is submitted, this parameter will set the quota percentage to be used.

Format: prio_0_quota, [prio_1_quota, ..., prio_9_quota]

Note that if only one value is set, it will be applied to all priorities.

Default: 3
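The format rule above can be illustrated with a small sketch. It is not sharp_am code: the function name `expand_per_prio_quota` is hypothetical, and rejecting a list that has neither 1 nor 10 entries is an assumption, since these notes do not say how a partial list is handled.

```python
from typing import List

NUM_PRIORITIES = 10  # priorities 0..9, per the documented format


def expand_per_prio_quota(value: str) -> List[int]:
    """Expand a comma-separated quota string into one percentage per priority."""
    quotas = [int(item) for item in value.split(",") if item.strip()]
    if not quotas:
        raise ValueError("at least one quota value is required")
    if any(q < 0 or q > 100 for q in quotas):
        raise ValueError("quota percentages must be between 0 and 100")
    if len(quotas) == 1:
        # Documented behavior: a single value is applied to all priorities.
        return quotas * NUM_PRIORITIES
    if len(quotas) != NUM_PRIORITIES:
        # Assumption: partial lists are rejected in this sketch.
        raise ValueError("expected 1 or %d values" % NUM_PRIORITIES)
    return quotas
```

With the default value, `expand_per_prio_quota("3")` yields a quota of 3 percent for each of the ten priorities.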

sat_jobs_default_absolute_osts

sharp_am

New parameter: Default number of OSTs to be allocated for SAT jobs per aggregation node per tree.

Zero value means that no absolute value should be used, and the default percentage value is used instead.

Note that the number of OSTs also affects the number of groups.

Default: 0
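The interaction between the absolute OST setting and the percentage quota might be modeled as follows. This is a sketch only: `total_osts` (the OSTs available on an aggregation node for one tree) is a hypothetical input, and rounding the percentage down is an assumption.

```python
def sat_job_default_osts(total_osts: int, absolute_osts: int, quota_percent: int) -> int:
    """Return the default number of OSTs for a SAT job on one node of one tree."""
    if absolute_osts > 0:
        # A non-zero sat_jobs_default_absolute_osts takes precedence.
        return absolute_osts
    # Zero means "no absolute value": fall back to the default percentage.
    # Assumption: the result is rounded down.
    return total_osts * quota_percent // 100
```

For example, with 256 available OSTs, an absolute value of 0, and the default quota of 3 percent, the sketch returns 7.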

app_resources_default_limit

sharp_am

New parameter: A numerical parameter, applicable only when reservation_mode is set to true. Sets the default maximum number of trees that a single app may use in parallel. This default can be overridden per app in the reservation request.

A value of 0 allows no resources, so the app cannot execute any SHARP job.

Default: 1
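A minimal sketch of how the default limit and a per-app override could combine, assuming reservation_mode is true. Both function names are hypothetical and the logic is an illustration of the description above, not the sharp_am implementation.

```python
from typing import Optional


def effective_tree_limit(default_limit: int, requested_limit: Optional[int] = None) -> int:
    """Use the per-app limit from the reservation request if given, else the default."""
    return default_limit if requested_limit is None else requested_limit


def may_use_another_tree(trees_in_use: int, limit: int) -> bool:
    """True if the app can use one more tree in parallel without exceeding its limit."""
    return trees_in_use < limit
```

With a limit of 0, `may_use_another_tree(0, 0)` is false, which matches the statement that such an app cannot execute any SHARP job.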

force_app_id_match

sharp_am

New parameter: A boolean parameter, applicable only when reservation_mode is set to true. When set to true, an application ID must be provided upon job request, and it must match the application ID provided upon reservation request. Otherwise, the job will be denied.

Default: False
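The matching rule can be expressed as a single predicate. The sketch below models only this one check, and the function and argument names are hypothetical; a real job request is subject to other admission criteria that are not shown.

```python
from typing import Optional


def app_id_check_passes(
    reservation_mode: bool,
    force_app_id_match: bool,
    reservation_app_id: Optional[str],
    job_app_id: Optional[str],
) -> bool:
    """Return False only when the app ID rule described above denies the job."""
    if not (reservation_mode and force_app_id_match):
        # The rule applies only when both settings are true.
        return True
    # An app ID must be provided and must equal the one given at reservation time.
    return job_app_id is not None and job_app_id == reservation_app_id
```

With the default setting (False), a job request without an application ID passes this check.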

SHARP_COLL_JOB_REQ_EXCLUSIVE_LOCK_MODE

libsharp

Update: Changed the default value from 0 (no exclusive lock) to 2 (force exclusive lock).
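Jobs that relied on the previous default may need the old value set explicitly. The sketch below assumes the parameter is supplied as an environment variable to the process that loads libsharp. The helper name is hypothetical, and it accepts only the two values described in these notes.

```python
import os
from typing import Dict, Optional

LOCK_MODE_VAR = "SHARP_COLL_JOB_REQ_EXCLUSIVE_LOCK_MODE"
DOCUMENTED_MODES = (0, 2)  # 0: no exclusive lock, 2: force exclusive lock


def env_with_lock_mode(mode: int, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the exclusive lock mode set."""
    if mode not in DOCUMENTED_MODES:
        # Other values may exist but are not described in these notes.
        raise ValueError("unsupported lock mode in this sketch: %r" % mode)
    env = dict(os.environ if base is None else base)
    env[LOCK_MODE_VAR] = str(mode)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the application, for example `env_with_lock_mode(0)` to request the pre-3.0.0 default.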

© Copyright 2023, NVIDIA. Last updated on Feb 15, 2024.