NVIDIA SHARP Collective Library
The NVIDIA SHARP distribution provides a collective library implementation with a high-level API that is easy to integrate into other communication runtime stacks, such as MPI and NCCL.
NVIDIA SHARP Resource Tuning for Low Latency Operations
The following SHARP library flags can be used when running NVIDIA SHARP collectives; a usage sketch follows the table.
| Flag | Description |
|------|-------------|
| SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST | Maximum payload per OST (outstanding transaction). A value of 0 means "allocate the default value". Collective requests larger than this size are pipelined. Default: 0 (maximum: 1024) |
| SHARP_COLL_JOB_QUOTA_OSTS | Maximum job (per-tree) OST quota request. A value of 0 means "allocate the default quota". Default: 0 |
| SHARP_COLL_JOB_QUOTA_MAX_GROUPS | Maximum number of groups (communicators) quota request. A value of 0 means "allocate the default value". Default: 0 |
| SHARP_COLL_OSTS_PER_GROUP | Number of OSTs per group. Default: dynamic (minimum: 2) |
| SHARP_COLL_JOB_QUOTA_MAX_QPS_PER_PORT | Maximum QPs-per-port quota request. A value of 0 means "allocate the default value". |
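Because these flags are ordinary environment variables, they can be exported before launch. The sketch below is illustrative only: it assumes an Open MPI launcher (whose `-x` option forwards an environment variable to all ranks) and a hypothetical `./my_mpi_app` binary; the values shown are examples, not tuned recommendations.

```bash
# Illustrative values only; 0 keeps the library default.
export SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256  # requests larger than this are pipelined
export SHARP_COLL_JOB_QUOTA_OSTS=0               # 0 = allocate the default per-tree quota
export SHARP_COLL_OSTS_PER_GROUP=2               # documented minimum

# Open MPI: -x forwards the named environment variable to every rank.
mpirun -np 64 \
    -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST \
    -x SHARP_COLL_JOB_QUOTA_OSTS \
    -x SHARP_COLL_OSTS_PER_GROUP \
    ./my_mpi_app
```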
NVIDIA SHARP Streaming Aggregation
The following NVIDIA SHARP library flags can be used to enable and tune the Streaming Aggregation Tree (SAT); a usage sketch follows the table.
| Flag | Description |
|------|-------------|
| SHARP_COLL_ENABLE_SAT | Enables SAT capabilities. The maximum message size supported by the SAT protocol is 1073741792 bytes (32 bytes less than 1 GB). Default: 0 (disabled) |
| SHARP_COLL_SAT_THRESHOLD | Message size threshold above which SAT is used for generic allreduce collective requests. Default: 16384 |
| SHARP_COLL_SAT_LOCK_BATCH_SIZE | SAT lock batch size. Set this to 1 if multiple communicators need to use SAT resources. Default: Infinity |
| SHARP_COLL_LOCK_ON_COMM_INIT | Acquires the SAT lock resource during communicator initialization when the lock batch size is Infinity, and returns a failure if the lock cannot be acquired. Default: 0 (disabled); 1 (enabled) with the NCCL SHARP plugin |
| SHARP_COLL_NUM_COLL_GROUP_RESOURCE_ALLOC_THRESHOLD | Lazy group resource allocation. 0 - disable lazy allocation and allocate group resources at communicator creation time; #n - allocate SHARP group resources after #n collective calls have been requested on the group |
| SHARP_COLL_JOB_REQ_EXCLUSIVE_LOCK_MODE | SAT (Streaming Aggregation Tree) exclusive lock mode for the job. Possible values are: |
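As a concrete illustration, the sketch below enables SAT for an NCCL-style workload. It assumes the same Open MPI launcher as above and a hypothetical `./my_nccl_app` binary; the threshold and batch-size values are examples rather than recommendations.

```bash
export SHARP_COLL_ENABLE_SAT=1           # SAT is disabled by default
export SHARP_COLL_SAT_THRESHOLD=4096     # use SAT for allreduce >= 4096 bytes (default: 16384)
export SHARP_COLL_SAT_LOCK_BATCH_SIZE=1  # let multiple communicators share SAT resources

mpirun -np 8 \
    -x SHARP_COLL_ENABLE_SAT \
    -x SHARP_COLL_SAT_THRESHOLD \
    -x SHARP_COLL_SAT_LOCK_BATCH_SIZE \
    ./my_nccl_app
```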
SHARP Miscellaneous Tuning
| Flag | Description |
|------|-------------|
| SHARP_COLL_ENABLE_CUDA | Enables CUDA GPU support. Default: 2 (0 - disable, 1 - enable, 2 - try) |
| SHARP_COLL_PIPELINE_DEPTH | Size of the fragmentation pipeline for larger collective payloads. Default: 64 |
| SHARP_COLL_ENABLE_MCAST_TARGET | Enables the MCAST target for NVIDIA SHARP collective operations. Default: 1 (enabled) |
| SHARP_COLL_MCAST_TARGET_GROUP_SIZE_THRESHOLD | Group size threshold for enabling the MCAST target. Default: 2 |
| SHARP_COLL_POLL_BATCH | Number of CQ completions to poll at once. Default: 4 (maximum: 16) |
| SHARP_COLL_ERROR_CHECK_INTERVAL | Interval, in milliseconds, between error checks. Setting the interval to 0 disables error checking. |
| SHARP_COLL_JOB_NUM_TREES | Number of SHARP trees to request. 0 means the number of trees is derived from the number of rails and the number of channels. Default: 0 |
| SHARP_COLL_GROUPS_PER_COMM | Number of NVIDIA SHARP groups per user communicator. Default: 1 |
| SHARP_COLL_JOB_PRIORITY | Job priority. Default: 0 |
| SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING | Enables PCI relaxed-ordering memory access. Default: 0 (disabled) |
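Putting a few of these together, the following sketch configures a CUDA-enabled run. Again, the launcher syntax is Open MPI's, `./my_gpu_app` is a hypothetical binary, and the values are illustrative rather than recommended settings.

```bash
export SHARP_COLL_ENABLE_CUDA=1                  # force CUDA support on (default: 2 = try)
export SHARP_COLL_PIPELINE_DEPTH=128             # deeper fragmentation pipeline (default: 64)
export SHARP_COLL_ERROR_CHECK_INTERVAL=0         # 0 disables periodic error checks
export SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1  # enable PCI relaxed-ordering access

mpirun -np 16 \
    -x SHARP_COLL_ENABLE_CUDA \
    -x SHARP_COLL_PIPELINE_DEPTH \
    -x SHARP_COLL_ERROR_CHECK_INTERVAL \
    -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING \
    ./my_gpu_app
```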
For the complete list of SHARP_COLL tuning options, run the sharp_coll_dump_config utility:
$HPCX_SHARP_DIR/bin/sharp_coll_dump_config -f
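The dump is long, so piping it through grep is a convenient way to inspect one family of options; for example, to list only the SAT-related flags (assuming the HPC-X environment scripts have set `$HPCX_SHARP_DIR`):

```bash
$HPCX_SHARP_DIR/bin/sharp_coll_dump_config -f | grep SAT
```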