NVIDIA SHARP Collective Library

1.0

The NVIDIA SHARP distribution provides a collective library implementation with a high-level API that integrates easily into other communication runtime stacks, such as MPI, NCCL, and others.

NVIDIA SHARP Resource Tuning for Low Latency Operations

The following SHARP library flags can be used when running NVIDIA SHARP collectives.

SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST

Maximum payload per OST (outstanding transaction). A value of 0 means “allocate the default value”. For example:

  • 256B on Switch-IB 2

  • 2048B on Quantum

Default: 0 (max: 1024)

Collective requests larger than this size are pipelined.

SHARP_COLL_JOB_QUOTA_OSTS

Maximum per-job (per-tree) OST quota request. A value of 0 means “allocate the default quota”.

Default: 0

SHARP_COLL_JOB_QUOTA_MAX_GROUPS

Maximum number of groups (comms) quota request. Value 0 means “allocate default value”.

Default: 0

SHARP_COLL_OSTS_PER_GROUP

Number of OSTs per group.

Default: Dynamic (minimum: 2)

SHARP_COLL_JOB_QUOTA_MAX_QPS_PER_PORT

Maximum QPs/port quota request. Value 0 means “allocate default value”.
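These flags are set as environment variables before launching the job. A minimal sketch, using the flag names from the table above; the specific values here are illustrative assumptions, not recommendations:

```shell
# Illustrative low-latency tuning; values are examples, not tuned recommendations.
export SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256  # requests larger than 256B are pipelined
export SHARP_COLL_JOB_QUOTA_OSTS=0               # keep the default per-tree OST quota
export SHARP_COLL_JOB_QUOTA_MAX_GROUPS=0         # keep the default group (comm) quota
export SHARP_COLL_OSTS_PER_GROUP=2               # the documented minimum OSTs per group
```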

NVIDIA SHARP Streaming Aggregation

The following NVIDIA SHARP library flags can be used to enable Streaming Aggregation Tree (SAT) and tuning.

SHARP_COLL_ENABLE_SAT

Enables SAT capabilities.

Default: 0 (Disabled)

The maximum message size supported by the SAT protocol is 1073741792 bytes (32B less than 1GB).

SHARP_COLL_SAT_THRESHOLD

Message size threshold to use SAT on generic allreduce collective requests.

Default: 16384

SHARP_COLL_SAT_LOCK_BATCH_SIZE

SAT lock batch size. Set this to “1” if multiple communicators need to share SAT resources.

Default: Infinity

SHARP_COLL_LOCK_ON_COMM_INIT

Acquires the SAT lock resource during communicator initialization if the lock batch size is Infinity. Returns a failure if the lock cannot be acquired.

Default: 0 (disabled); 1 (enabled) with the NCCL SHARP plugin

SHARP_COLL_NUM_COLL_GROUP_RESOURCE_ALLOC_THRESHOLD

Lazy group resource allocation. Possible values are:

  • 0 - disables lazy allocation; group resources are allocated at communicator creation time

  • #n - allocates SHARP group resources after #n collective calls have been requested on the group

SHARP_COLL_JOB_REQ_EXCLUSIVE_LOCK_MODE

SAT (Streaming Aggregation Tree) exclusive lock mode for job. Possible values are:

  • 0 - no exclusive lock

  • 1 - try exclusive lock

  • 2 - force exclusive lock
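A typical SAT-enabled job configuration can be sketched with the flags above. The values are illustrative assumptions; note that SHARP_COLL_SAT_THRESHOLD=16384 is simply the documented default restated:

```shell
# Illustrative SAT configuration; values are examples, not tuned recommendations.
export SHARP_COLL_ENABLE_SAT=1                   # enable streaming aggregation (default is 0)
export SHARP_COLL_SAT_THRESHOLD=16384            # use SAT for allreduce messages >= 16 KiB
export SHARP_COLL_SAT_LOCK_BATCH_SIZE=1          # allow multiple communicators to share SAT
export SHARP_COLL_JOB_REQ_EXCLUSIVE_LOCK_MODE=1  # try, but do not force, an exclusive lock
```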

SHARP Miscellaneous Tuning

SHARP_COLL_ENABLE_CUDA

Enables CUDA GPU support.

Default: 2 (0 - disable, 1 - enable, 2 - try)

SHARP_COLL_PIPELINE_DEPTH

Size of the fragmentation pipeline for larger collective payloads.

Default: 64

SHARP_COLL_ENABLE_MCAST_TARGET

Enables the MCAST target on NVIDIA SHARP collective operations.

Default: 1 (enabled)

SHARP_COLL_MCAST_TARGET_GROUP_SIZE_THRESHOLD

Group size threshold to enable mcast target.

Default: 2

SHARP_COLL_POLL_BATCH

Defines the number of CQ completions to poll on at once.

Default: 4 (maximum: 16)

SHARP_COLL_ERROR_CHECK_INTERVAL

Interval, in milliseconds, between error checks.

Setting the interval to 0 disables error checking.

SHARP_COLL_JOB_NUM_TREES

Number of SHARP trees to request. A value of 0 requests a number of trees based on the number of rails and the number of channels.

Default: 0

SHARP_COLL_GROUPS_PER_COMM

Number of NVIDIA SHARP groups per user communicator.

Default: 1

SHARP_COLL_JOB_PRIORITY

Job priority.

Default: 0

SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING

Enables PCIe relaxed-ordering memory access.

Default: 0 (Disable)
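For a GPU-aware run, the miscellaneous flags above are likewise set as environment variables. A minimal sketch; the values simply restate the documented defaults except for the relaxed-ordering opt-in, which is an illustrative assumption:

```shell
# Illustrative GPU-aware configuration; check platform support before opting in.
export SHARP_COLL_ENABLE_CUDA=2                  # "try": use CUDA if available, else fall back
export SHARP_COLL_PIPELINE_DEPTH=64              # default fragmentation pipeline depth
export SHARP_COLL_POLL_BATCH=4                   # default number of CQ completions per poll
export SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1  # opt in to PCIe relaxed-ordering access
```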

Warning

For the complete list of SHARP_COLL tuning options, run the sharp_coll_dump_config utility:
$HPCX_SHARP_DIR/bin/sharp_coll_dump_config -f

© Copyright 2023, NVIDIA. Last updated on May 23, 2023.