NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.9.0

Using NVIDIA SHARP with UCC

The NVIDIA SHARP library is integrated into the Unified Collective Communication (UCC) library to offload collective operations in MPI applications.

The following UCC flags should be set in the environment to enable the NVIDIA SHARP protocol in the UCC middleware. They can be used when running NVIDIA SHARP collectives with the mpirun utility.

Flag: Description

OMPI_UCC_CL_BASIC_TLS
    Currently, the list of TLs supported/compiled in UCC is: ucp,cuda,nccl,sharp
    Note: By default, "sharp" is disabled.
    Possible values:
        ucp,sharp: Adds the "sharp" TL along with "ucp".
        ucp,cuda,sharp: Adds the "sharp" TL along with "ucp" and "cuda".

UCC_TL_SHARP_MIN_TEAM_SIZE
    Minimum UCC team size for which the "sharp" TL can be used.
    Default: 2

UCC_TL_SHARP_DEVICES
    Comma-separated list of HCAs to be used with the SHARP TL.

UCC_TL_SHARP_UPROGRESS_NUM_POLLS
    Number of unsuccessful polling loops in libsharp coll for a blocking collective wait before calling user progress (UCC, OMPI).
    Default: 999

UCC_TL_SHARP_ENABLE_LAZY_GROUP_ALLOC
    Enables lazy group resource allocation.
    Default: N
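As a minimal sketch of how the flags above might be set in a job script, the snippet below exports three of them and prints the resulting environment. The values `4` and `Y` are illustrative assumptions: `4` is simply an arbitrary non-default team size, and `Y` is assumed to be the accepted counterpart of the documented default `N`.

```shell
#!/bin/sh
# Add the "sharp" TL along with "ucp" (a value listed in the table above).
export OMPI_UCC_CL_BASIC_TLS=ucp,sharp

# Illustrative assumption: only use the SHARP TL for teams of 4 or more ranks
# (the documented default is 2).
export UCC_TL_SHARP_MIN_TEAM_SIZE=4

# Illustrative assumption: "Y" is taken to be the opposite of the default "N".
export UCC_TL_SHARP_ENABLE_LAZY_GROUP_ALLOC=Y

# Print the SHARP-related UCC settings that are now in the environment.
env | grep -E '^(OMPI_UCC_CL_BASIC_TLS|UCC_TL_SHARP_)' | sort
```

When the job is launched with mpirun, these variables can be forwarded to the ranks with `-x NAME`, in the same way the Allreduce example passes `-x OMPI_UCC_CL_BASIC_TLS=ucp,sharp`.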

Example of Allreduce with Default Settings and SHARP Enabled


$ mpirun --bind-to core --map-by node -np 8 -x LD_LIBRARY_PATH -x SHARP_COLL_LOG_LEVEL=3 -x SHARP_COLL_ENABLE_SAT=1 -x OMPI_UCC_CL_BASIC_TLS=ucp,sharp $HPCX_OSU_DIR/osu_allreduce
[elsa01:0:1852929 - context.c:687][2024-10-28 22:24:47] INFO job (ID: 139896884886908) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[elsa01:0:1852929 - context.c:889][2024-10-28 22:24:47] INFO sharp_job_id:3    resv_key: tree_type:LLT tree_idx:0  treeID:2 caps:0x66 quota:(osts:23 user_data_per_ost:1024 max_groups:23 max_qps:1 max_group_channels:1)
[elsa01:0:1852929 - context.c:896][2024-10-28 22:24:47] INFO sharp_job_id:3    tree_type:SAT tree_idx:1  treeID:514 caps:0x76
[elsa01:0:1852929 - comm.c:413][2024-10-28 22:24:47] INFO [group#:0] job_id:3 group id:0 tree idx:0 tree_type:LLT rail_idx:0 group size:6 quota: (osts:8 user_data_per_ost:1024) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
[elsa01:0:1852929 - comm.c:413][2024-10-28 22:24:47] INFO [group#:1] job_id:3 group id:0 tree idx:1 tree_type:SAT rail_idx:0 group size:6 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0

# OSU MPI Allreduce Latency Test v7.4
# Datatype: MPI_FLOAT.
# Size       Avg Latency(us)
4                       4.95
8                       4.80
16                      5.08
32                      5.10
64                      5.62
128                     5.74
256                     7.27
512                     7.85
1024                    8.28
2048                    8.59
4096                   14.64
8192                   11.11
16384                  10.31
32768                  11.42
65536                  13.84
131072                 20.07
262144                 37.93
524288                 33.89
1048576                64.64
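The INFO lines in the output above can also be checked from a script. The sketch below counts the `[group#:N]` lines and lists the tree types they mention, using two lines copied (and shortened) from the example. Treating these lines as a sign that SHARP groups were created is an interpretation of this log, not a documented contract, and the variable names are hypothetical.

```shell
#!/bin/sh
# Two group lines copied (shortened) from the example output above. In
# practice the text would come from the job's captured output.
sample_log='[elsa01:0:1852929 - comm.c:413][2024-10-28 22:24:47] INFO [group#:0] job_id:3 group id:0 tree idx:0 tree_type:LLT rail_idx:0 group size:6
[elsa01:0:1852929 - comm.c:413][2024-10-28 22:24:47] INFO [group#:1] job_id:3 group id:0 tree idx:1 tree_type:SAT rail_idx:0 group size:6'

# Count the lines that report a SHARP group.
n_groups=$(printf '%s\n' "$sample_log" | grep -c 'INFO \[group#:')

# Collect the distinct tree types named on those lines, space-separated.
tree_types=$(printf '%s\n' "$sample_log" \
  | grep 'INFO \[group#:' \
  | grep -o 'tree_type:[A-Z]*' \
  | sort -u \
  | paste -sd' ' -)

echo "SHARP groups: $n_groups ($tree_types)"
```

These lines are only expected when the log level is high enough, as with `SHARP_COLL_LOG_LEVEL=3` in the example command.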

© Copyright 2024, NVIDIA. Last updated on Nov 11, 2024.