NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) v2.6.1

Testing NVIDIA SHARP Setup

Run the ibdiagnet utility with the SHARP diagnostics option.


$ ibdiagnet --sharp --fabric_summary

Check the fabric summary table in the ibdiagnet output for the number of identified aggregation nodes. For example:


Fabric Summary

Total Nodes             : 24
IB Switches             : 4
IB Channel Adapters     : 16
IB Aggregation Nodes    : 4
IB Routers              : 0

Total number of links   : 24
Links at 4x50           : 24

Master SM: Port=1 LID=1 GUID=0x248a070300a28c4d devid=4119 Priority:0 Node_Type=CA Node_Description=pnemo HCA-2
Standby SM : No Standby SM
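The aggregation node count can also be extracted automatically, for instance in a health-check script. The following is a minimal sketch, not part of the SHARP distribution: the function name aggregation_node_count is hypothetical, and the sketch assumes the label "IB Aggregation Nodes" is followed by a colon and an integer, as in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with SHARP): print the value that follows
# "IB Aggregation Nodes :" in ibdiagnet output read from stdin.
aggregation_node_count() {
    # The label is matched anywhere on a line, so the pattern works whether
    # each counter is on its own line or several share one line.
    sed -n 's/.*IB Aggregation Nodes[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage (assumes ibdiagnet is installed and the fabric is reachable):
#   ibdiagnet --sharp --fabric_summary | aggregation_node_count
```

A result of 0, or no output at all, would suggest that no aggregation nodes were identified in the fabric.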

Check the summary table in the ibdiagnet output for errors in the SHARP diagnostics stage. For example:


Summary
-I- Stage                     Warnings   Errors     Comment
-I- Discovery                 0          0
-I- Lids Check                0          0
-I- Links Check               0          0
-I- Subnet Manager            0          0
-I- Port Counters             0          0
-I- Nodes Information         0          0
-I- Speed / Width checks      0          0
-I- Alias GUIDs               0          0
-I- Virtualization            0          0
-I- Partition Keys            0          0
-I- Temperature Sensing       0          0
-I- SHARP                     0          0
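The SHARP row of the summary table can be checked by a script as well. This is a rough sketch under stated assumptions: the function name sharp_stage_clean is hypothetical, and it assumes each stage is printed on its own line in the form "-I- SHARP <warnings> <errors>", as in the example output.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with SHARP): read an ibdiagnet summary
# table on stdin and succeed only if the SHARP stage row reports zero errors.
sharp_stage_clean() {
    awk '
        # Assumed row layout: "-I- SHARP <warnings> <errors>"
        $1 == "-I-" && $2 == "SHARP" && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ {
            found = 1
            printf "SHARP stage: warnings=%s errors=%s\n", $3, $4
            if ($4 != 0) bad = 1
        }
        # Fail when the row is missing or when it reports errors.
        END { exit (found && !bad) ? 0 : 1 }
    '
}

# Usage (assumes ibdiagnet is installed):
#   ibdiagnet --sharp --fabric_summary | sharp_stage_clean || echo "SHARP stage needs attention" >&2
```

The helper treats a missing SHARP row as a failure, on the reasoning that a run without the SHARP stage has not validated anything.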

Check the SHARP diagnostics output file (/var/tmp/ibdiagnet2/ibdiagnet2.sharp) to confirm that SHARP aggregation trees are configured in the subnet.

For example, count the number of aggregation trees constructed by the Aggregation Manager using the grep command:


$ cat /var/tmp/ibdiagnet2/ibdiagnet2.sharp | grep -c TreeID
126
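For unattended checks, the same count can be turned into a pass/fail result. The sketch below is illustrative only: count_sharp_trees is a hypothetical name, and, like the grep command above, it counts lines containing TreeID rather than parsing the file format.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with SHARP): print how many lines of a
# SHARP diagnostics file mention TreeID, and return non-zero when there are
# none or when the file cannot be read.
count_sharp_trees() {
    local file="${1:-/var/tmp/ibdiagnet2/ibdiagnet2.sharp}"
    local count
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 2; }
    # grep -c prints 0 and exits with status 1 when nothing matches;
    # keep the printed count in either case.
    count=$(grep -c TreeID "$file") || true
    echo "$count"
    [ "$count" -gt 0 ]
}

# Usage:
#   count_sharp_trees || echo "no aggregation trees configured" >&2
```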

The NVIDIA SHARP distribution provides the sharp_hello test utility for testing SHARP's end-to-end functionality on a compute node. It creates a single SHARP job and sends a barrier request to the SHARP Aggregation node.



$ sharp_hello -h
usage: sharp_hello <-d | --ib_dev> <device> [OPTIONS]
OPTIONS:
    [-d | --ib_dev]     - HCA to use
    [-v | --verbose]    - libsharp coll verbosity level(default:2)
                          Levels: (0-fatal 1-err 2-warn 3-info 4-debug 5-trace)
    [-V | --version]    - print program version
    [-h | --help]       - show this usage

Example #1


$ sharp_hello -d mlx5_0:1 -v 3
[thor001:0:15042 - context.c:581] INFO job (ID: 12159720107860141553) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[thor001:0:15042 - context.c:751] INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: ( osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
[thor001:0:15042 - comm.c:393] INFO [group#:0] group id:a tree idx:0 tree_type:LLT rail_idx:0 group size:1 quota: (osts:2 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x3f020000000a) mlid:c007
Test Passed.

Example #2


$ SHARP_COLL_ENABLE_SAT=1 sharp_hello -d mlx5_0:1 -v 3
[swx-dgx01:0:59023 - context.c:581] INFO job (ID: 15134963379905498623) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[swx-dgx01:0:59023 - context.c:751] INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: ( osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
[swx-dgx01:0:59023 - context.c:755] INFO tree_info: type:SAT tree idx:1 treeID:0x3f caps:0x16
[swx-dgx01:0:59023 - comm.c:393] INFO [group#:0] group id:3c tree idx:0 tree_type:LLT rail_idx:0 group size:1 quota: (osts:2 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0xd6060000003c) mlid:c004
[swx-dgx01:0:59023 - comm.c:393] INFO [group#:1] group id:3c tree idx:1 tree_type:SAT rail_idx:0 group size:1 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
Test Passed
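Both examples above end with a "Test Passed" line, which a wrapper can look for when sharp_hello is run from a script. The following is a sketch only: sharp_hello_passed is a hypothetical name, and relying on the "Test Passed" marker (rather than on the exit status of sharp_hello, which this section does not describe) is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with SHARP): read sharp_hello output on
# stdin, echo it through unchanged, and succeed only if it contains the
# "Test Passed" marker shown in the examples.
sharp_hello_passed() {
    local output
    output=$(cat)
    printf '%s\n' "$output"
    case "$output" in
        *"Test Passed"*) return 0 ;;
        *)               return 1 ;;
    esac
}

# Usage (device name taken from the examples; 2>&1 is used because the
# stream that carries the log lines is not specified here):
#   sharp_hello -d mlx5_0:1 -v 3 2>&1 | sharp_hello_passed \
#       || echo "SHARP end-to-end test failed" >&2
```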

The NVIDIA SHARP distribution provides source code for a benchmark that tests native SHARP low-level performance for allreduce and barrier operations.

Source code:


$ module load hpcx
$HPCX_SHARP_DIR/share/sharp/examples/mpi/coll/

Build and run instructions:


$ module load hpcx
$HPCX_SHARP_DIR/opt/Mellanox/sharp/share/sharp/examples/mpi/coll/README

NVIDIA SHARP Benchmark Script

The NVIDIA SHARP distribution provides a test script that runs the OSU allreduce and barrier benchmarks with and without NVIDIA SHARP. To run the NVIDIA SHARP benchmark script, the following packages must be installed:

  • ssh

  • pdsh

  • environment-modules.x86_64

After loading the HPC-X module, you can find this script at $HPCX_SHARP_DIR/sbin/sharp_benchmark.sh. Launch the script from a host running the SM and the Aggregation Manager. It receives the list of compute nodes from the SLURM allocation or from the “hostlist” environment variable. “hostlist” is a comma-separated list, and using it requires the hca environment variables to be supplied. The script runs the OSU allreduce and barrier benchmarks with and without NVIDIA SHARP.



This script includes OSU benchmarks for MPI_Allreduce and MPI_Barrier blocking collective operations.
Both benchmarks run with and without using SHARP technology.

Usage: sharp_benchmark.sh [-t] [-d] [-h] [-f]
    -t - tests list (e.g. sharp:barrier)
    -d - dry run
    -h - display this help and exit
    -f - supress error in prerequsites checking

Configuration:
  Runtime:
    sharp_ppn - number of processes per compute node (default 1)
    sharp_ib_dev - Infiniband device used for communication. Format <device_name>:<port_number>.
                   For example: sharp_ib_dev="mlx5_0:1"
                   This is a mandatory parameter. If it's absent, sharp_benchmark.sh tries to use the first active device on local machine
    sharp_groups_num - number of groups per communicator. (default is the number of devices in sharp_ib_dev)
    sharp_num_trees - number of trees to request. (default num tress based on the #rails and #channels)
    sharp_job_members_type - type of sharp job members list. (default is SHARP_MEMBER_LIST_PROCESSES_DATA)
    sharp_hostlist - hostnames of compute nodes used in the benchmark. The list may include normal host names, a range of hosts in hostlist format. Under SLURM allocation, SLURM_NODELIST is used as a default
    sharp_test_iters - number of test iterations (default 10000)
    sharp_test_skip_iters - number of test iterations (default 1000)
    sharp_test_max_data - max data size used for testing (default and maximum 4096)
  Environment:
    SHARP_INI_FILE - takes configuration from given file instead of /labhome/danielk/.sharp_benchmark.ini
    SHARP_TMP_DIR - store temporary files here instead of /tmp
    HCOLL_INSTALL - use specified hcoll install instead from hpcx

Examples:
    sharp_ib_dev="mlx5_0:1" sharp_benchmark.sh # run using "mlx5_0:1" IB port. Rest parameters are loaded from /labhome/danielk/.sharp_benchmark.ini or default
    SHARP_INI_FILE=~/benchmark.ini sharp_benchmark.sh # Override default configuration file
    SHARP_INI_FILE=~/benchmark.ini sharp_hostlist=ajna0[2-3] sharp_ib_dev="mlx5_0:1" sharp_benchmark.sh # Use specific host list
    sharp_ppn=1 sharp_hostlist=ajna0[1-8] sharp_ib_dev="mlx5_0:1" sharp_benchmark.sh -d # Print commands without actual run

Dependencies:
    This script uses "python-hostlist" package. Visit https://www.nsc.liu.se/~kent/python-hostlist/ for details

© Copyright 2023, NVIDIA. Last updated on May 23, 2023.