Using NVIDIA SHARP with Open MPI

NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.7.0

The NVIDIA SHARP library is integrated into the HCOLL collective library to offload collective operations in MPI applications.

The following basic flags should be set in the environment to enable the NVIDIA SHARP protocol in the HCOLL middleware. For the remaining flags, refer to the NVIDIA SHARP Release Notes.

These HCOLL flags can be used when running NVIDIA SHARP collectives with the mpirun utility.

Flag

Description

HCOLL_ENABLE_SHARP

Sets whether SHARP should be used.

Possible values:

  • 0 (default) - do not use NVIDIA SHARP

  • 1 - probe NVIDIA SHARP availability and use it if available

  • 2 - force the use of NVIDIA SHARP

  • 3 - force the use of NVIDIA SHARP for all MPI communicators

  • 4 - force the use of NVIDIA SHARP for all MPI communicators and for all supported collectives (barrier, allreduce)
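As a minimal sketch of how one of these values might be passed to a job, the helper below checks that the requested mode is one of the documented values 0 to 4 and prints the matching `-x` argument for mpirun. The function name `sharp_enable_arg` is hypothetical and not part of HCOLL or Open MPI. It only builds a string and does not launch anything.

```shell
#!/bin/sh
# Hypothetical helper, not part of HCOLL or Open MPI: validate a
# HCOLL_ENABLE_SHARP value and print the mpirun argument that exports it.
sharp_enable_arg() {
    case "$1" in
        0|1|2|3|4)
            printf '%s\n' "-x HCOLL_ENABLE_SHARP=$1"
            ;;
        *)
            echo "unsupported HCOLL_ENABLE_SHARP value: $1" >&2
            return 1
            ;;
    esac
}

# Example: probe for NVIDIA SHARP and use it (value 1).
sharp_enable_arg 1
```

Rejecting values outside the documented range before the job starts is a design choice of this sketch, not a behavior of HCOLL itself.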

SHARP_COLL_LOG_LEVEL

NVIDIA SHARP coll logging level. Messages whose level is higher than or equal to the selected level are printed.

Possible values:

  • 0 - fatal

  • 1 - error

  • 2 (default) - warn

  • 3 - info

  • 4 - debug

  • 5 - trace
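Because the variable takes a number rather than a name, a small lookup can make job scripts easier to read. The sketch below maps the level names listed above to their numeric values and exports the result. The function name `sharp_log_level` is hypothetical and not part of NVIDIA SHARP.

```shell
#!/bin/sh
# Hypothetical helper: translate a level name from the list above into the
# numeric value expected by SHARP_COLL_LOG_LEVEL.
sharp_log_level() {
    case "$1" in
        fatal) echo 0 ;;
        error) echo 1 ;;
        warn)  echo 2 ;;
        info)  echo 3 ;;
        debug) echo 4 ;;
        trace) echo 5 ;;
        *) echo "unknown log level: $1" >&2; return 1 ;;
    esac
}

# Export the variable so that child processes of this shell inherit it.
SHARP_COLL_LOG_LEVEL=$(sharp_log_level info)
export SHARP_COLL_LOG_LEVEL
echo "SHARP_COLL_LOG_LEVEL=$SHARP_COLL_LOG_LEVEL"
```

When launching with mpirun, the variable would typically still be forwarded to the ranks with `-x`, in the same way the other flags are passed in the Allreduce example on this page.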

HCOLL_SHARP_NP

Threshold on the number of nodes (node leaders) in the communicator for creating an NVIDIA SHARP group and using NVIDIA SHARP collectives.

Default: 4

HCOLL_SHARP_UPROGRESS_NUM_POLLS

Number of unsuccessful polling loops in libsharp coll during a blocking collective wait before calling user progress (HCOLL, OMPI).

Default: 999

HCOLL_ALLREDUCE_SHARP_MAX

(or)

HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX

Maximum allreduce size run through NVIDIA SHARP. Messages larger than the value specified by this parameter fall back to non-SHARP-based algorithms (multicast based or non-multicast based).

The threshold is calculated based on the group resources.

Threshold = #OSTS * Payload_per_ost

Default: Dynamic
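To make the threshold formula concrete, the sketch below evaluates it with shell integer arithmetic. The function name `sharp_allreduce_threshold` and both input values are hypothetical, and the payload is assumed to be expressed in bytes. The real values depend on the resources of the group and are not stated here.

```shell
#!/bin/sh
# Threshold = #OSTS * Payload_per_ost, using shell integer arithmetic.
# Hypothetical helper; real inputs depend on the group resources.
sharp_allreduce_threshold() {
    osts="$1"              # number of OSTs available to the group
    payload_per_ost="$2"   # payload per OST (assumed to be in bytes)
    echo $(( osts * payload_per_ost ))
}

# Illustrative inputs only: 12 OSTs with 256 bytes of payload each.
sharp_allreduce_threshold 12 256
```

With these illustrative inputs the computed threshold is 3072, so under the rule described above a larger allreduce would fall back to a non-SHARP-based algorithm.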

Example of Allreduce with Default Settings with SHARP Enabled

$ mpirun -np 128 -map-by ppr:1:node -x UCX_TLS=dc,shm,self -x HCOLL_ENABLE_SHARP=3 -x SHARP_COLL_ENABLE_SAT=1 $HPCX_OSU_DIR/osu_allreduce

# OSU MPI Allreduce Latency Test v5.6.2
# Size       Avg Latency(us)
4                       7.44
8                       8.43
16                      7.81
32                      8.55
64                      9.06
128                     8.44
256                     9.41
512                     8.50
1024                    9.03
2048                   10.43
4096                   42.61
8192                   37.93
16384                  15.48
32768                  16.26
65536                  17.62
131072                 23.09
262144                 33.90
524288                 58.98
1048576               101.53
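When comparing runs, it can help to pull a single data point out of the benchmark output. The sketch below reads osu_allreduce output on standard input, skips the `#` header lines, and prints the average latency for one message size. The function name `latency_for_size` is hypothetical, and the sample input is a few rows copied from the run above.

```shell
#!/bin/sh
# Hypothetical helper: read osu_allreduce output on stdin, skip the "#"
# header lines, and print the average latency reported for one message size.
latency_for_size() {
    awk -v size="$1" '!/^#/ && $1 == size { print $2 }'
}

# A few rows copied from the example output, used here as sample input.
sample='# OSU MPI Allreduce Latency Test v5.6.2
# Size       Avg Latency(us)
4                       7.44
2048                   10.43
4096                   42.61'

printf '%s\n' "$sample" | latency_for_size 4096
```

In practice the function would be fed the output of the mpirun command directly through a pipe instead of a sample string.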


© Copyright 2024, NVIDIA. Last updated on May 6, 2024.