NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.5.2 LTS

NVIDIA SHARP Collective Library

The NVIDIA SHARP distribution provides a collective library with a high-level API that integrates easily into other communication runtime stacks, such as MPI and NCCL.

The SHARP collective library offers collective operations such as Barrier, Allreduce, Reduce, Bcast, Reduce-scatter, and Allgather. It supports 16-, 32-, and 64-bit integer and floating-point datatypes, as well as 16-bit bfloat and 8-bit integer.

NVIDIA SHARP Library Flags

NVIDIA SHARP Configuration Flags

As of NVIDIA SHARP version 2.7.0, the sharpd daemon no longer exists; its work is now performed within the application process.

Settings that were previously part of the sharpd configuration are now passed from the application command line using the following flags.

Flag	Description
`SHARP_LOG_VERBOSITY`	Log verbosity level: 1 - Errors, 2 - Warnings, 3 - Info, 4 - Debug, 5 - Trace. Default: 2
`SHARP_LOG_FILE`	Log file. Default: stdout. The log file name accepts the following modifiers to create a unique file: %D - date as DDMMYYYY, %T - thread ID, %H - host name.
`SHARP_SMX_SOCK_INTERFACE`	Network interface to be used by SMX. An empty string (the default) uses the interface used for the AM connection. Default: (null)
`SHARP_SMX_SOCK_ADDR_FAMILY`	Address family used for SMX sockets. The value must be one of: ipv4, ipv6. IPv4 support is required even when ipv6 is selected. Default: ipv6
`SHARP_SMX_UCX_INTERFACE`	Network interface to be used by SMX for UCX connections. An empty string (the default) uses the interface used for the AM connection. Default: (null)
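
As a rough sketch of how a job script might apply two of these flags, the snippet below assumes the flags are read from the environment of the application process; that is an assumption of this example, not something the table above states. The log path is a placeholder, and `sharp_check_addr_family` is a hypothetical helper, not part of NVIDIA SHARP.

```shell
#!/bin/sh
# Assumption: SHARP flags are picked up from the application's environment.

# Hypothetical helper: accept only the two documented address families.
sharp_check_addr_family() {
    case "$1" in
        ipv4|ipv6) return 0 ;;
        *) echo "invalid address family: $1" >&2; return 1 ;;
    esac
}

if sharp_check_addr_family ipv4; then
    export SHARP_SMX_SOCK_ADDR_FAMILY=ipv4
fi

# %H (host name) and %D (date as DDMMYYYY) are left unexpanded here;
# per the table, the library substitutes them to build a unique file name.
export SHARP_LOG_FILE='/tmp/sharp_%H_%D.log'
```

Single quotes keep the `%H` and `%D` modifiers literal, so they reach the library unchanged.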

NVIDIA SHARP Resource Tuning for Low Latency Operations

The following SHARP library flags can be used when running NVIDIA SHARP collectives.

Flag	Description
`SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST`	Maximum payload per OST (outstanding transaction). Value 0 means "allocate default value". Valid values: 0 (default), 128-1024
`SHARP_COLL_JOB_QUOTA_OSTS`	Maximum job (per tree) OST quota request. Value 0 means "allocate default quota". Default: 0
`SHARP_COLL_JOB_QUOTA_MAX_GROUPS`	Maximum number of groups (comms) quota request. Value 0 means "allocate default value". Default: 0
`SHARP_COLL_OSTS_PER_GROUP`	Number of OSTs per group. Default: 8
`SHARP_COLL_JOB_QUOTA_MAX_QPS_PER_PORT`	Maximum QPs/port quota request. Value 0 means "allocate default value".
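
The documented value range for `SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST` (0, or 128 to 1024) can be checked before a job is launched. The following is a minimal sketch, assuming the flag is set through the environment; `sharp_valid_payload_per_ost` is a hypothetical helper written for this example, and 256 is an arbitrary illustrative value.

```shell
#!/bin/sh
# Hypothetical helper: succeed for 0 (library default) or a value in 128-1024.
sharp_valid_payload_per_ost() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject anything that is not a non-negative integer
    esac
    [ "$1" -eq 0 ] || { [ "$1" -ge 128 ] && [ "$1" -le 1024 ]; }
}

payload=256   # illustrative value, not a recommendation
if sharp_valid_payload_per_ost "$payload"; then
    # Assumption: the flag is read from the application's environment.
    export SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST="$payload"
else
    echo "payload per OST out of range: $payload" >&2
fi
```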

NVIDIA SHARP Streaming Aggregation

The following NVIDIA SHARP library flags can be used to enable and tune the Streaming Aggregation Tree (SAT).

Flag	Description
`SHARP_COLL_ENABLE_SAT`	Enables SAT capabilities. Default: 0 (Disabled). The maximum message size supported by the SAT protocol is 1073741792 bytes (32 bytes less than 1 GB).
`SHARP_COLL_SAT_THRESHOLD`	Message size threshold for using SAT on generic allreduce collective requests. Default: 16384
`SHARP_COLL_SAT_LOCK_BATCH_SIZE`	SAT lock batch size. Set this to “1” if multiple communicators need to use SAT resources. Valid range: 1-65535. Default: 65535 (Infinity)
`SHARP_COLL_LOCK_ON_COMM_INIT`	Acquires the SAT lock resource during communicator initialization if the lock batch size is Infinity, and returns failure if the lock cannot be acquired. Default: 0 (Disabled); 1 (Enabled) with the NCCL SHARP plugin
`SHARP_COLL_NUM_COLL_GROUP_RESOURCE_ALLOC_THRESHOLD`	Lazy group resource allocation. 0 - disable lazy allocation and allocate group resources at communicator creation time; n - allocate SHARP group resources after n collective calls have been requested on the group. Default: 1
`SHARP_COLL_JOB_REQ_EXCLUSIVE_LOCK_MODE`	SAT exclusive lock mode for the job. Possible values: 0 - no exclusive lock, 1 - try exclusive lock, 2 (default) - force exclusive lock
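
To illustrate how the threshold and the maximum message size bound the range of SAT-eligible messages, here is a small sketch. It assumes the flags are set through the environment and that a message at or above the threshold qualifies; the actual selection logic is internal to the library and may differ. `sharp_sat_eligible` and `SAT_MAX_MSG_BYTES` are hypothetical names used only in this example.

```shell
#!/bin/sh
# Assumption: SHARP flags are read from the application's environment.
export SHARP_COLL_ENABLE_SAT=1
export SHARP_COLL_SAT_THRESHOLD=16384       # documented default
export SHARP_COLL_SAT_LOCK_BATCH_SIZE=1     # lets multiple communicators use SAT resources

# Documented SAT maximum: 2^30 - 32 bytes.
SAT_MAX_MSG_BYTES=$(( (1 << 30) - 32 ))

# Hypothetical helper: succeed if a message of $1 bytes falls between the
# threshold and the SAT maximum (inclusive bounds are an assumption).
sharp_sat_eligible() {
    [ "$1" -ge "$SHARP_COLL_SAT_THRESHOLD" ] && [ "$1" -le "$SAT_MAX_MSG_BYTES" ]
}

for size in 4096 65536; do
    if sharp_sat_eligible "$size"; then
        echo "$size bytes: within the SAT range"
    else
        echo "$size bytes: outside the SAT range"
    fi
done
```

With the values above, 4096 bytes falls below the threshold and 65536 bytes falls inside the range.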

SHARP Miscellaneous Tuning

Flag	Description
`SHARP_COLL_ENABLE_CUDA`	Enables CUDA GPU support. Possible values: 0 - disable, 1 - enable, 2 (default) - try
`SHARP_COLL_PIPELINE_DEPTH`	Size of the fragmentation pipeline for larger collective payloads. Default: 64
`SHARP_COLL_ENABLE_MCAST_TARGET`	Enables MCAST target on NVIDIA SHARP collective operations. Possible values: 0 (default) - disable, 1 - enable
`SHARP_COLL_MCAST_TARGET_GROUP_SIZE_THRESHOLD`	Group size threshold to enable MCAST target. Default: 2
`SHARP_COLL_POLL_BATCH`	Number of CQ completions to poll at once. Valid range: 1-16. Default: 4
`SHARP_COLL_ERROR_CHECK_INTERVAL`	Interval, in milliseconds, between error checks. If the interval is set to 0, no error check is performed. Default: 180,000
`SHARP_COLL_JOB_NUM_TREES`	Number of SHARP trees to request. 0 means the number of trees requested is based on the number of rails and the number of channels. Default: 0
`SHARP_COLL_GROUPS_PER_COMM`	Number of NVIDIA SHARP groups per user communicator. Default: 1
`SHARP_COLL_JOB_PRIORITY`	Job priority. Valid values: 0-10. Default: 0
`SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING`	Enables PCI relaxed ordering memory access. Possible values: 0 - disable, 1 - enable, 2 (default) - auto
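
One possible way to pass several of these flags at launch time is sketched below. It assumes the job is started with Open MPI's `mpirun`, whose `-x` option exports an environment variable to the launched processes, and that the library reads the flags from the environment. `./my_app` is a placeholder and the flag values are illustrative only. The script only builds and prints the command line; it does not run it.

```shell
#!/bin/sh
# Illustrative values: CUDA forced on, 8 CQ completions per poll,
# error check every 60000 ms (the documented default is 180,000 ms).
sharp_flags="SHARP_COLL_ENABLE_CUDA=1 SHARP_COLL_POLL_BATCH=8 SHARP_COLL_ERROR_CHECK_INTERVAL=60000"

# Assumption: Open MPI's mpirun is the launcher; -x NAME=value exports
# the variable to the application processes.
launch_cmd="mpirun"
for kv in $sharp_flags; do
    launch_cmd="$launch_cmd -x $kv"
done
launch_cmd="$launch_cmd ./my_app"   # ./my_app is a placeholder application

echo "$launch_cmd"
```

Keeping the flags in a single list makes it easy to review exactly which non-default settings a job was launched with.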

Note

For the complete list of SHARP_COLL tuning options, run the sharp_coll_dump_config utility:
$HPCX_SHARP_DIR/bin/sharp_coll_dump_config