Control Flags
The following basic flags should be used on the mpirun command line to enable the Mellanox SHARP protocol in the HCOLL middleware. For the rest of the flags, refer to the Mellanox SHARP Release Notes.
Flag | Values |
---|---|
HCOLL_ENABLE_SHARP | Default: 0. Possible values: 0 - do not use Mellanox SHARP; 1 - probe Mellanox SHARP availability and use it; 2 - force Mellanox SHARP usage. |
SHARP_COLL_LOG_LEVEL | Default: 2. Mellanox SHARP coll logging level; messages with a level higher than or equal to the selected level are printed. Possible values: 0 - fatal, 1 - error, 2 - warn, 3 - info, 4 - debug, 5 - trace. |
SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST | Default: 128 (max: 256). Maximum payload per OST quota request. The value 0 means allocate the default value. |
For example:
% $OMPI_HOME/bin/mpirun --display-map --bind-to core --map-by node -H host01,host02,host03 -np 3 -mca pml yalla -mca btl_openib_warn_default_gid_prefix 0 -mca rmaps_dist_device mlx5_0:1 -mca rmaps_base_mapping_policy dist:span -x MXM_RDMA_PORTS=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 -x MXM_ASYNC_INTERVAL=1800s -x HCOLL_ENABLE_SHARP=1 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=128 <PATH/osu_allreduce> -i 10000 -x 1000 -f -m 256
The following HCOLL flags can be used when running Mellanox SHARP collectives with the mpirun utility:
Flag | Values |
---|---|
HCOLL_SHARP_NP | Default: 2. Threshold on the number of nodes (node leaders) in a communicator for creating a Mellanox SHARP group and using Mellanox SHARP collectives. |
HCOLL_SHARP_UPROGRESS_NUM_POLLS | Default: 999. Number of unsuccessful polling loops in libsharp coll, for a blocking collective wait, before calling user progress (HCOLL, OMPI). |
HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX | Default: 256. Maximum allreduce size run through Mellanox SHARP. Messages larger than this fall back to non-SHARP-based algorithms (multicast based or non-multicast based). |
SHARP_COLL_MAX_PAYLOAD_SIZE | Default: 256 (max). Maximum payload size of a Mellanox SHARP collective request. Collective requests larger than this size are pipelined. |
SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST | Default: 128 (max: 256). Maximum payload per OST quota request. The value 0 means allocate the default value. |
SHARP_COLL_GROUP_RESOURCE_POLICY | Default: 1. Mellanox SHARP job resource sharing policy between the groups (communicators). For the possible values, refer to the Mellanox SHARP Release Notes. |
SHARP_COLL_USER_GROUP_QUOTA_PERCENT | Percentage of the job quota to be allocated to each Mellanox SHARP group. |
SHARP_COLL_JOB_QUOTA_OSTS | Default: 0. Maximum job (per-tree) OST quota request. The value 0 means allocate the default quota. |
SHARP_COLL_JOB_QUOTA_MAX_GROUPS | Default: 0. Maximum number of groups (communicators) quota request. The value 0 means allocate the default value. |
SHARP_COLL_JOB_QUOTA_MAX_QPS_PER_PORT | Maximum QPs-per-port quota request. The value 0 means allocate the default value. |
SHARP_COLL_PIPELINE_DEPTH | Default: 8. Size of the fragmentation pipeline for larger collective payloads. |
SHARP_COLL_STATS_FILE | Default: "". Destination for statistics output. For the possible values, refer to the Mellanox SHARP Release Notes. |
SHARP_COLL_STATS_TRIGGER | Default: exit. Trigger for dumping statistics. For the possible values, refer to the Mellanox SHARP Release Notes. |
SHARP_COLL_STATS_DUMP_MODE | Default: 1. Statistics dump mode: 1 - dump per-process stats; 2 - dump accumulative (per-job) stats. NOTE: In accumulative mode (2), it is the user's responsibility to call sharp_coll_dump_stats() while OOB is still active. |
SHARP_COLL_ENABLE_MCAST_TARGET | Default: 1. Enables the MCAST target in Mellanox SHARP collective operations. |
SHARP_COLL_MCAST_TARGET_GROUP_SIZE_THRESHOLD | Default: 2. Group size threshold for enabling the MCAST target. |
SHARP_COLL_POLL_BATCH | Default: 4. Number of CQ completions to poll on at once. Maximum: 16. |
SHARP_COLL_ERROR_CHECK_INTERVAL | Default: 180000. Interval, in milliseconds, between error checks. If set to 0, error checking is not performed. |
SHARP_COLL_JOB_NUM_TREES | Default: 0. Number of SHARP trees to request. The value 0 means request a number of trees based on the number of rails and the number of channels. |
SHARP_COLL_GROUPS_PER_COMM | Default: 1. Number of Mellanox SHARP groups per user communicator. |
SHARP_COLL_JOB_PRIORITY | Default: 0. Job priority. |
SHARP_COLL_OSTS_PER_GROUP | Default: 2. Number of OSTs per group. |
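The tuning flags above are plain environment variables, so they can be collected into a small launch script instead of a long `-x` chain. A minimal sketch; all values here are illustrative examples, not recommended settings:

```shell
# Sketch: gather the SHARP tuning knobs from the table above in one place.
# The values below are illustrative, not tuning recommendations.
export HCOLL_ENABLE_SHARP=1                       # probe SHARP and use it if available
export HCOLL_SHARP_NP=2                           # min. node leaders to form a SHARP group
export HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096    # larger allreduces fall back to non-SHARP
export SHARP_COLL_MAX_PAYLOAD_SIZE=256            # larger requests are pipelined
export SHARP_COLL_PIPELINE_DEPTH=32               # fragmentation pipeline depth
# mpirun inherits exported variables; with Open MPI they can also be passed
# explicitly via -x, as in the examples in this section.
echo "HCOLL_ENABLE_SHARP=$HCOLL_ENABLE_SHARP"
echo "SHARP_COLL_PIPELINE_DEPTH=$SHARP_COLL_PIPELINE_DEPTH"
```

Sourcing such a script before mpirun keeps the command line itself short while leaving the per-flag documentation in one place.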
Running Mellanox SHARP with HCOLL - Example
% $OMPI_HOME/bin/mpirun --bind-to core --map-by node -hostfile /tmp/hostfile -np 4 -mca pml yalla -mca btl_openib_warn_default_gid_prefix 0 -mca rmaps_dist_device mlx5_0:1 -mca rmaps_base_mapping_policy dist:span -x MXM_RDMA_PORTS=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 -x MXM_ASYNC_INTERVAL=1800s -x MXM_LOG_LEVEL=ERROR -x HCOLL_ML_DISABLE_REDUCE=1 -x HCOLL_ENABLE_MCAST_ALL=1 -x HCOLL_MCAST_NP=1 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:$HPCX_SHARP_DIR/lib -x LD_PRELOAD=$HPCX_SHARP_DIR/lib/libsharp.so:$HPCX_SHARP_DIR/lib/libsharp_coll.so -x HCOLL_ENABLE_SHARP=2 -x SHARP_COLL_LOG_LEVEL=3 -x SHARP_COLL_GROUP_RESOURCE_POLICY=1 -x SHARP_COLL_MAX_PAYLOAD_SIZE=256 -x HCOLL_SHARP_UPROGRESS_NUM_POLLS=999 -x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=4096 -x SHARP_COLL_PIPELINE_DEPTH=32 -x SHARP_COLL_JOB_QUOTA_OSTS=32 -x SHARP_COLL_JOB_QUOTA_MAX_GROUPS=4 -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256 taskset -c 1 numactl --membind=0 <PATH/osu_allreduce> -i 100 -x 100 -f -m 4096:4096
For the complete list of SHARP_COLL tuning options, run the sharp_coll_dump_config utility:
% $HPCX_SHARP_DIR/bin/sharp_coll_dump_config -f
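Since every SHARP_COLL option is read from the environment, a quick way to cross-check a job's setup against the tables above is to list which SHARP_COLL_* variables are currently set. A small sketch; the exported value is an illustrative example:

```shell
# Sketch: show which SHARP_COLL_* tuning variables are set in the current
# environment before launching mpirun. The export is an illustrative example.
export SHARP_COLL_LOG_LEVEL=3
env | grep '^SHARP_COLL_' | sort
```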