
Control Flags

The following basic flags should be used in the mpirun command line to enable the Mellanox SHARP protocol in the HCOLL middleware. For the remaining flags, refer to the Mellanox SHARP Release Notes.

Flags and Values

HCOLL_ENABLE_SHARP

Default : 0

Possible values:

  • 0 - Do not use Mellanox SHARP (default)
  • 1 - Probe Mellanox SHARP availability and use it
  • 2 - Force the use of Mellanox SHARP
  • 3 - Force the use of Mellanox SHARP for all MPI communicators
  • 4 - Force the use of Mellanox SHARP for all MPI communicators and for all supported collectives (Barrier, Allreduce)

SHARP_COLL_LOG_LEVEL

Default : 2

Mellanox SHARP coll logging level. Messages at the selected severity or more severe are printed; for example, level 2 prints warn, error, and fatal messages.

Possible values:

  • 0 - fatal
  • 1 - error
  • 2 - warn
  • 3 - info
  • 4 - debug
  • 5 - trace

SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST

Default : 128 (Max: 256)

Maximum payload per OST quota request. A value of 0 means allocate the default value.

For example: 

% $OMPI_HOME/bin/mpirun --display-map --bind-to core --map-by node -H host01,host02,host03 -np 3 -mca pml yalla -mca btl_openib_warn_default_gid_prefix 0 -mca rmaps_dist_device mlx5_0:1 -mca rmaps_base_mapping_policy dist:span -x MXM_RDMA_PORTS=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 -x MXM_ASYNC_INTERVAL=1800s -x HCOLL_ENABLE_SHARP=1 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=128 <PATH/osu_allreduce> -i 10000 -x 1000 -f -m 256
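Because a mistyped flag value may only surface after the job has started, it can help to check the basic flags before launching. The following is a minimal sketch under stated assumptions: check_range and check_basic_sharp_flags are hypothetical helper names, not part of HCOLL or Mellanox SHARP, and the ranges are simply the ones listed in the table above.

```shell
#!/bin/sh
# Hypothetical pre-launch check (not part of HCOLL or SHARP): verify that a
# value is a non-negative integer inside the given inclusive range.
check_range() {
    name=$1 value=$2 min=$3 max=$4
    case $value in
        ''|*[!0-9]*)
            echo "$name: '$value' is not a non-negative integer" >&2
            return 1 ;;
    esac
    if [ "$value" -lt "$min" ] || [ "$value" -gt "$max" ]; then
        echo "$name: $value is outside $min..$max" >&2
        return 1
    fi
}

# Check the three basic flags. Unset variables fall back to the defaults
# documented in the table above.
check_basic_sharp_flags() {
    check_range HCOLL_ENABLE_SHARP "${HCOLL_ENABLE_SHARP:-0}" 0 4 &&
    check_range SHARP_COLL_LOG_LEVEL "${SHARP_COLL_LOG_LEVEL:-2}" 0 5 &&
    check_range SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST \
        "${SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST:-128}" 0 256
}
```

A launch script could call check_basic_sharp_flags first and skip the mpirun invocation when it returns a non-zero status.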

The following HCOLL flags can be used when running Mellanox SHARP collectives with the mpirun utility:

Flags and Values

HCOLL_SHARP_NP

Default : 2

Threshold on the number of nodes (node leaders) in a communicator required to create a Mellanox SHARP group and use Mellanox SHARP collectives.

HCOLL_SHARP_UPROGRESS_NUM_POLLS

Default: 999

Number of unsuccessful polling loops in libsharp coll for blocking collective wait before calling user progress (HCOLL, OMPI).

HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX

Default : 256

Maximum allreduce message size run through Mellanox SHARP. Messages larger than this value fall back to non-SHARP algorithms (multicast based or non-multicast based).

SHARP_COLL_MAX_PAYLOAD_SIZE

Default : 256 (Max)

Maximum payload size of a Mellanox SHARP collective request. Collective requests larger than this size are pipelined.

SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST

Default : 128 (Max: 256)

Maximum payload per OST quota request. A value of 0 means allocate the default value.

SHARP_COLL_GROUP_RESOURCE_POLICY

Default : 1

Mellanox SHARP Job resource sharing policy between the groups (communicators)

Values:

  • 1 - equal
  • 2 - take_all by first group
  • 3 - User input percent using SHARP_COLL_USER_GROUP_QUOTA_PERCENT

SHARP_COLL_USER_GROUP_QUOTA_PERCENT

Percentage of the job quota to be allocated for each Mellanox SHARP group.

SHARP_COLL_JOB_QUOTA_OSTS

Default : 0

Maximum job (per tree) OST quota request. A value of 0 means allocate the default quota.

SHARP_COLL_JOB_QUOTA_MAX_GROUPS

Default: 0

Maximum number of groups (communicators) quota request. A value of 0 means allocate the default value.

SHARP_COLL_JOB_QUOTA_MAX_QPS_PER_PORT

Maximum QPs per port quota request. A value of 0 means allocate the default value.

SHARP_COLL_PIPELINE_DEPTH

Default : 8

Size of fragmentation pipeline for larger collective payload

SHARP_COLL_STATS_FILE

Default : ""

Destination to send statistics to. Possible values are:

  • stdout - print to standard output.
  • stderr - print to standard error.
  • file:<filename> - save to a file (%h: host, %p: pid, %t: time, %u: user, %e: exe)

SHARP_COLL_STATS_TRIGGER

Default : exit

Trigger to dump statistics:

  • exit - dump just before the program exits.
  • signal:<signo> - dump when process is signaled (Not fully supported)

SHARP_COLL_STATS_DUMP_MODE

Default : 1

Stats dump mode. Possible values:

  • 1 - dump per process stats
  • 2 - dump accumulative (per job) stats

NOTE: For accumulative mode (2), it is the user's responsibility to call sharp_coll_dump_stats() while OOB is still active.

SHARP_COLL_ENABLE_MCAST_TARGET

Default: 1

Enables MCAST target on Mellanox SHARP collective ops.

SHARP_COLL_MCAST_TARGET_GROUP_SIZE_THRESHOLD

Default: 2

Group size threshold to enable the mcast target.

SHARP_COLL_POLL_BATCH

Default: 4

Defines the number of CQ completions to poll on at once. Maximum: 16

SHARP_COLL_ERROR_CHECK_INTERVAL

Default: 180000

Interval, in milliseconds, between error checks. If the interval is set to 0, the error check is not performed.

SHARP_COLL_JOB_NUM_TREES

Default: 0

Number of SHARP trees to request. A value of 0 requests a number of trees based on the number of rails and the number of channels.

SHARP_COLL_GROUPS_PER_COMM

Default: 1

Number of Mellanox SHARP groups per user communicator

SHARP_COLL_JOB_PRIORITY

Default: 0

Job priority

SHARP_COLL_OSTS_PER_GROUP

Default: 2

Number of OSTs per group
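Several flags in the table above carry limits or depend on one another: SHARP_COLL_GROUP_RESOURCE_POLICY=3 reads its percentage from SHARP_COLL_USER_GROUP_QUOTA_PERCENT, SHARP_COLL_POLL_BATCH has a maximum of 16, and SHARP_COLL_MAX_PAYLOAD_SIZE has a maximum of 256. The sketch below shows one way these could be checked before a run. check_sharp_tuning is a hypothetical helper, not a Mellanox utility, and it assumes that the numeric variables, when set, hold integers.

```shell
#!/bin/sh
# Hypothetical helper, not part of HCOLL or SHARP: report settings that
# contradict the limits listed in the table above. Returns 0 when no
# problem is found and 1 otherwise.
check_sharp_tuning() {
    status=0
    case ${SHARP_COLL_GROUP_RESOURCE_POLICY:-1} in
        1|2) ;;
        3)  # Policy 3 takes its percentage from a second variable.
            if [ -z "${SHARP_COLL_USER_GROUP_QUOTA_PERCENT:-}" ]; then
                echo "policy 3 needs SHARP_COLL_USER_GROUP_QUOTA_PERCENT" >&2
                status=1
            fi ;;
        *)  echo "SHARP_COLL_GROUP_RESOURCE_POLICY must be 1, 2 or 3" >&2
            status=1 ;;
    esac
    # Documented maximum for SHARP_COLL_POLL_BATCH is 16 (default 4).
    if [ "${SHARP_COLL_POLL_BATCH:-4}" -gt 16 ]; then
        echo "SHARP_COLL_POLL_BATCH exceeds the maximum of 16" >&2
        status=1
    fi
    # Documented maximum for SHARP_COLL_MAX_PAYLOAD_SIZE is 256 (default 256).
    if [ "${SHARP_COLL_MAX_PAYLOAD_SIZE:-256}" -gt 256 ]; then
        echo "SHARP_COLL_MAX_PAYLOAD_SIZE exceeds the maximum of 256" >&2
        status=1
    fi
    # Only dump modes 1 (per process) and 2 (per job) are documented.
    case ${SHARP_COLL_STATS_DUMP_MODE:-1} in
        1|2) ;;
        *)  echo "SHARP_COLL_STATS_DUMP_MODE must be 1 or 2" >&2
            status=1 ;;
    esac
    return "$status"
}
```

The function collects every problem it finds instead of stopping at the first one, so a single run reports all conflicting settings.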

Running Mellanox SHARP with HCOLL - Example 

% $OMPI_HOME/bin/mpirun --bind-to core --map-by node -hostfile /tmp/hostfile -np 4 -mca pml yalla -mca btl_openib_warn_default_gid_prefix 0 -mca rmaps_dist_device mlx5_0:1 -mca rmaps_base_mapping_policy dist:span -x MXM_RDMA_PORTS=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 -x MXM_ASYNC_INTERVAL=1800s -x MXM_LOG_LEVEL=ERROR -x HCOLL_ML_DISABLE_REDUCE=1 -x HCOLL_ENABLE_MCAST_ALL=1 -x HCOLL_MCAST_NP=1 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:$HPCX_SHARP_DIR/lib -x LD_PRELOAD=$HPCX_SHARP_DIR/lib/libsharp.so:$HPCX_SHARP_DIR/lib/libsharp_coll.so -x HCOLL_ENABLE_SHARP=2 -x SHARP_COLL_LOG_LEVEL=3 -x SHARP_COLL_GROUP_RESOURCE_POLICY=1 -x SHARP_COLL_MAX_PAYLOAD_SIZE=256 -x HCOLL_SHARP_UPROGRESS_NUM_POLLS=999 -x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096 -x SHARP_COLL_PIPELINE_DEPTH=32 -x SHARP_COLL_JOB_QUOTA_OSTS=32 -x SHARP_COLL_JOB_QUOTA_MAX_GROUPS=4 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256 taskset -c 1 numactl --membind=0  <PATH/osu_allreduce> -i 100 -x 100 -f -m 4096:4096
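A command line of this length is easy to mistype. One possible way to keep it manageable is to list the settings one per line and generate the -x options from that list. The sketch below assumes that no value contains whitespace; sharp_x_options is a hypothetical helper name, and the settings shown are a subset copied from the example above.

```shell
#!/bin/sh
# Sketch: turn NAME=VALUE lines read from standard input into the
# "-x NAME=VALUE" options used in the example above.
# The function name is illustrative and not part of any Mellanox tool.
sharp_x_options() {
    opts=
    while IFS= read -r setting; do
        # Skip blank lines and comment lines.
        case $setting in
            ''|'#'*) continue ;;
        esac
        # Append "-x NAME=VALUE", separated by a space after the first item.
        opts="$opts${opts:+ }-x $setting"
    done
    printf '%s\n' "$opts"
}

# Usage: a subset of the settings from the example command.
sharp_x_options <<'EOF'
# SHARP-related settings
HCOLL_ENABLE_SHARP=2
SHARP_COLL_LOG_LEVEL=3
SHARP_COLL_GROUP_RESOURCE_POLICY=1
SHARP_COLL_MAX_PAYLOAD_SIZE=256
EOF
# prints: -x HCOLL_ENABLE_SHARP=2 -x SHARP_COLL_LOG_LEVEL=3 -x SHARP_COLL_GROUP_RESOURCE_POLICY=1 -x SHARP_COLL_MAX_PAYLOAD_SIZE=256
```

The printed string can be captured in a variable and expanded unquoted on the mpirun command line, which is why values containing whitespace are out of scope for this sketch.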

For the complete list of SHARP_COLL tuning options, run the sharp_coll_dump_config utility.

$HPCX_SHARP_DIR/bin/sharp_coll_dump_config -f