Testing NVIDIA SHARP Setup
Run the ibdiagnet utility with the SHARP diagnostics option.
$ ibdiagnet --sharp
Check the fabric summary table in the ibdiagnet output for the number of identified aggregation nodes. For example:
Fabric Summary
Total Nodes : 24
IB Switches : 4
IB Channel Adapters : 16
IB Aggregation Nodes : 4
IB Routers : 0
Total number of links : 24
Links at 4x50 : 24
Master SM: Port=1 LID=1 GUID=0x248a070300a28c4d devid=4119 Priority:0 Node_Type=CA Node_Description=pnemo HCA-2
Standby SM : No Standby SM
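The node-count check above can be scripted. The following is a minimal sketch, assuming the console output of ibdiagnet --sharp has been saved to a file and that the fabric summary uses the "IB Aggregation Nodes : N" line format shown in the example. The function name sharp_an_count and the file name in the usage comment are illustrative and are not part of the SHARP distribution.

```shell
# Hypothetical helper (not part of the SHARP distribution).
# Prints the aggregation-node count found in a saved copy of ibdiagnet output,
# or 0 when no "IB Aggregation Nodes" line is present.
sharp_an_count() {
    awk -F: '
        /IB Aggregation Nodes/ {
            gsub(/[ \t]/, "", $2)   # strip padding around the number
            print $2
            found = 1
            exit
        }
        END { if (!found) print 0 }
    ' "$1"
}

# Usage (assumed file name):
#   ibdiagnet --sharp > /tmp/ibdiagnet.out
#   sharp_an_count /tmp/ibdiagnet.out
```

A result of 0 suggests that no aggregation nodes were discovered, in which case the later checks are unlikely to succeed.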
Check the summary table in the ibdiagnet output for errors in the SHARP diagnostics stage. For example:
Summary
-I- Stage                     Warnings   Errors   Comment
-I- Discovery                 0          0
-I- Lids Check                0          0
-I- Links Check               0          0
-I- Subnet Manager            0          0
-I- Port Counters             0          0
-I- Nodes Information         0          0
-I- Speed / Width checks      0          0
-I- Alias GUIDs               0          0
-I- Virtualization            0          0
-I- Partition Keys            0          0
-I- Temperature Sensing       0          0
-I- SHARP                     0          0
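The SHARP row of the summary can also be checked automatically. The sketch below assumes the ibdiagnet console output was saved to a file and that the row has the layout shown in the example, with the stage name in the second field, warnings in the third, and errors in the fourth. The function name sharp_stage_ok is illustrative only.

```shell
# Hypothetical helper (not part of the SHARP distribution).
# Succeeds (exit status 0) only if the summary contains a SHARP row
# whose Errors column is 0. Warnings alone do not cause a failure.
sharp_stage_ok() {
    awk '
        $1 == "-I-" && $2 == "SHARP" {
            seen = 1
            if ($4 + 0 > 0) bad = 1   # field 4 is the Errors column
        }
        END { if (seen && !bad) exit 0; exit 1 }
    ' "$1"
}

# Usage (assumed file name):
#   sharp_stage_ok /tmp/ibdiagnet.out && echo "SHARP stage clean"
```

A missing SHARP row is treated as a failure here, on the assumption that the stage did not run.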
In the SHARP diagnostics output file (/var/tmp/ibdiagnet2/ibdiagnet2.sharp), check that SHARP aggregation trees are configured in the subnet.
For example, count the aggregation trees constructed by the Aggregation Manager using the grep command:
$ cat /var/tmp/ibdiagnet2/ibdiagnet2.sharp | grep -c TreeID
126
Note that when operating in dynamic trees mode, ibdiagnet may print warning messages about the existence of multiple distinct trees with the same tree ID. In dynamic trees mode, this is a valid situation and these warnings should be ignored.
Warning example:
-W- <> - In Node <> found root tree (parent qpn <>) which is already exists for treeID: <>
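When running in dynamic trees mode, it may help to separate these expected warnings from any others. The following is a rough sketch, assuming saved ibdiagnet output in which each warning occupies a single line beginning with -W- and the duplicate-tree message contains the text "found root tree (parent qpn" as in the example above. The function name is hypothetical.

```shell
# Hypothetical helper (not part of the SHARP distribution).
# Prints the number of -W- lines that are NOT the duplicate-tree warning
# expected in dynamic trees mode.
sharp_unexpected_warning_count() {
    # First grep keeps warning lines; second counts those that do not match
    # the duplicate-tree message. "|| true" hides grep's non-zero status
    # when the count is 0.
    grep -e '^-W-' "$1" | grep -v -c 'found root tree (parent qpn' || true
}

# Usage (assumed file name):
#   sharp_unexpected_warning_count /tmp/ibdiagnet.out
```

A non-zero count points to warnings that still need to be reviewed by hand.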
The NVIDIA SHARP distribution provides the sharp_hello test utility for testing SHARP's end-to-end functionality on a compute node. It creates a single SHARP job and sends a barrier request to the SHARP Aggregation Node.
Help
$ sharp_hello -h
usage: sharp_hello <-d | --ib_dev> <device> [OPTIONS]
OPTIONS:
[-d | --ib_dev] - HCA to use
[-v | --verbose] - libsharp coll verbosity level(default:2)
                   Levels: (0-fatal 1-err 2-warn 3-info 4-debug 5-trace)
[-V | --version] - print program version
[-h | --help] - show this usage
Example #1
$ sharp_hello -d mlx5_0:1 -v 3
[thor001:0:15042 - context.c:581] INFO job (ID: 12159720107860141553) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[thor001:0:15042 - context.c:751] INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: ( osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
[thor001:0:15042 - comm.c:393] INFO [group#:0] group id:a tree idx:0 tree_type:LLT rail_idx:0 group size:1 quota: (osts:2 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x3f020000000a) mlid:c007
Test Passed.
Example #2
$ SHARP_COLL_ENABLE_SAT=1 sharp_hello -d mlx5_0:1 -v 3
[swx-dgx01:0:59023 - context.c:581] INFO job (ID: 15134963379905498623) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[swx-dgx01:0:59023 - context.c:751] INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: ( osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
[swx-dgx01:0:59023 - context.c:755] INFO tree_info: type:SAT tree idx:1 treeID:0x3f caps:0x16
[swx-dgx01:0:59023 - comm.c:393] INFO [group#:0] group id:3c tree idx:0 tree_type:LLT rail_idx:0 group size:1 quota: (osts:2 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0xd6060000003c) mlid:c004
[swx-dgx01:0:59023 - comm.c:393] INFO [group#:1] group id:3c tree idx:1 tree_type:SAT rail_idx:0 group size:1 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
Test Passed
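The outcome of a sharp_hello run can be verified from a saved copy of its output. The sketch below assumes the output of a run with -v 3 was redirected to a file, that each log record occupies one line, and that the result line and tree_info records have the form shown in the examples above. Both function names and the file name are hypothetical.

```shell
# Hypothetical helpers (not part of the SHARP distribution).

# Succeeds if the saved sharp_hello output contains the "Test Passed" line.
sharp_hello_passed() {
    grep -q 'Test Passed' "$1"
}

# Prints the tree type of every tree_info record, one per line
# (for example LLT, and SAT when a SAT tree is reported as in Example #2).
sharp_hello_tree_types() {
    sed -n 's/.*tree_info: type:\([A-Z]*\).*/\1/p' "$1"
}

# Usage (assumed file name):
#   SHARP_COLL_ENABLE_SAT=1 sharp_hello -d mlx5_0:1 -v 3 > /tmp/sharp_hello.out 2>&1
#   sharp_hello_passed /tmp/sharp_hello.out && sharp_hello_tree_types /tmp/sharp_hello.out
```

The tree-type listing relies on the INFO records, so it prints nothing if the run used a verbosity level below 3.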
The NVIDIA SHARP distribution provides source code for the benchmark, which tests native SHARP low-level performance for allreduce and barrier operations.
Source code:
$ module load hpcx
$HPCX_SHARP_DIR/share/sharp/examples/mpi/coll/
Build and run instructions:
$ module load hpcx
$HPCX_SHARP_DIR/opt/Mellanox/sharp/share/sharp/examples/mpi/coll/README