Testing NVIDIA SHARP Setup

Aggregation Trees Diagnostics

Run the ibdiagnet utility with the SHARP diagnostics option:

$ ibdiagnet --sharp

Check the fabric summary table in the ibdiagnet output for the number of identified aggregation nodes. For example:

Fabric Summary
 
Total Nodes            : 24
IB Switches            : 4
IB Channel Adapters    : 16
IB Aggregation Nodes   : 4
IB Routers             : 0
 
Total number of links  : 24
Links at 4x50          : 24
 
Master SM: Port=1 LID=1 GUID=0x248a070300a28c4d devid=4119 Priority:0 Node_Type=CA Node_Description=pnemo HCA-2
Standby SM : No Standby SM
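
The check above can be scripted. The following is a minimal sketch, assuming the fabric summary is available on standard input in the format shown above; the function names `agg_node_count` and `has_agg_nodes` are hypothetical helpers, not part of ibdiagnet.

```shell
# Print the "IB Aggregation Nodes" count from ibdiagnet output read on stdin.
# Assumes the "IB Aggregation Nodes : N" line format shown in the example above.
agg_node_count() {
    awk -F: '/IB Aggregation Nodes/ { print $2 + 0; exit }'
}

# Succeed only if at least one aggregation node was reported.
has_agg_nodes() {
    count=$(agg_node_count)
    [ -n "$count" ] && [ "$count" -gt 0 ]
}
```

For example, `ibdiagnet --sharp | agg_node_count` would print the count, provided the summary is written to standard output on your installation.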

Check the summary table in the ibdiagnet output for errors in the SHARP diagnostics stage. For example:

Summary
-I- Stage                    Warnings   Errors     Comment  
-I- Discovery                0          0         
-I- Lids Check               0          0         
-I- Links Check              0          0         
-I- Subnet Manager           0          0         
-I- Port Counters            0          0         
-I- Nodes Information        0          0         
-I- Speed / Width checks     0          0        
-I- Alias GUIDs              0          0         
-I- Virtualization           0          0         
-I- Partition Keys           0          0         
-I- Temperature Sensing      0          0         
-I- SHARP                    0          0
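
To automate this check, the Errors column of the SHARP row can be extracted from the table. This is a sketch under the assumption that the row has the layout shown above (stage name, warnings, errors); `sharp_stage_errors` and `sharp_stage_clean` are hypothetical helper names.

```shell
# Print the Errors column of the SHARP stage from the ibdiagnet summary table (stdin).
# Fields on that row: "-I-", "SHARP", <warnings>, <errors>.
sharp_stage_errors() {
    awk '$1 == "-I-" && $2 == "SHARP" { print $4; exit }'
}

# Succeed only if the SHARP stage row was found and reports zero errors.
sharp_stage_clean() {
    errors=$(sharp_stage_errors)
    [ "$errors" = "0" ]
}
```

A missing SHARP row is treated as a failure, since an empty result is not equal to "0".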

Check the SHARP diagnostics output file (/var/tmp/ibdiagnet2/ibdiagnet2.sharp) to confirm that SHARP aggregation trees are configured in the subnet.

For example, count the aggregation trees constructed by the Aggregation Manager using the grep command:

$ grep -c TreeID /var/tmp/ibdiagnet2/ibdiagnet2.sharp
126

Note that when operating in dynamic trees mode, ibdiagnet may print warning messages about the existence of multiple distinct trees with the same tree ID. In dynamic trees mode, this is a valid situation and these warnings should be ignored.

Warning example:

-W- <> - In Node <> found root tree (parent qpn <>) which is already exists for treeID: <>
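
When reviewing ibdiagnet warnings in dynamic trees mode, it can help to filter these expected messages out so that other warnings stand out. The sketch below assumes the warning text matches the example above; `drop_duplicate_tree_warnings` is a hypothetical helper name.

```shell
# Remove the expected "duplicate tree ID" warnings from ibdiagnet output (stdin).
# The pattern is taken from the warning example above and is an assumption
# about the exact message text on your version.
drop_duplicate_tree_warnings() {
    # grep -v exits with status 1 when every line was filtered out;
    # treat that as success and let real errors (status 2) propagate.
    grep -v 'found root tree .* which is already exists for treeID' || [ $? -eq 1 ]
}
```

Use this filter only when the fabric is known to run in dynamic trees mode; in other modes the warning may indicate a real problem.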

NVIDIA SHARP Hello

The NVIDIA SHARP distribution provides the sharp_hello test utility for testing SHARP's end-to-end functionality on a compute node. It creates a single SHARP job and sends a barrier request to the SHARP Aggregation Node.

Help

$ sharp_hello -h
usage:  sharp_hello <-d | --ib_dev> <device> [OPTIONS]
OPTIONS:
        [-d | --ib_dev]      - HCA to use
        [-v | --verbose]     - libsharp coll verbosity level(default:2)
                                  Levels: (0-fatal 1-err 2-warn 3-info 4-debug 5-trace)
        [-V | --version]     - print program version
        [-h | --help]        - show this usage

Example #1

$ sharp_hello -d mlx5_0:1 -v 3
[thor001:0:15042 - context.c:581] INFO job (ID: 12159720107860141553) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[thor001:0:15042 - context.c:751] INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: ( osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
[thor001:0:15042 - comm.c:393] INFO [group#:0] group id:a tree idx:0 tree_type:LLT rail_idx:0 group size:1 quota: (osts:2 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x3f020000000a) mlid:c007
Test Passed.

Example #2

$ SHARP_COLL_ENABLE_SAT=1 sharp_hello -d mlx5_0:1 -v 3
 
[swx-dgx01:0:59023 - context.c:581] INFO job (ID: 15134963379905498623) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[swx-dgx01:0:59023 - context.c:751] INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: ( osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
[swx-dgx01:0:59023 - context.c:755] INFO tree_info: type:SAT tree idx:1 treeID:0x3f caps:0x16
[swx-dgx01:0:59023 - comm.c:393] INFO [group#:0] group id:3c tree idx:0 tree_type:LLT rail_idx:0 group size:1 quota: (osts:2 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0xd6060000003c) mlid:c004
[swx-dgx01:0:59023 - comm.c:393] INFO [group#:1] group id:3c tree idx:1 tree_type:SAT rail_idx:0 group size:1 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
Test Passed
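
The two examples can be checked from a script by scanning the sharp_hello output. This is a minimal sketch that assumes the log format shown above; `sharp_hello_passed` and `tree_types` are hypothetical helper names, and whether the log lines go to standard output or standard error is not stated here, so the usage redirects both.

```shell
# Succeed if sharp_hello output (stdin) contains the "Test Passed" line.
sharp_hello_passed() {
    grep -q 'Test Passed'
}

# Print the tree types (for example LLT, SAT) found in tree_info lines (stdin).
# Relies on the "tree_info: type:<TYPE> " layout seen in the examples above.
tree_types() {
    sed -n 's/.*tree_info: type:\([A-Z]*\) .*/\1/p'
}
```

For example, `SHARP_COLL_ENABLE_SAT=1 sharp_hello -d mlx5_0:1 -v 3 2>&1 | tree_types` would be expected to list both LLT and SAT when a SAT tree is available, as in Example #2.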

NVIDIA SHARP Benchmark

The NVIDIA SHARP distribution provides the source code of a benchmark that tests native SHARP low-level performance for allreduce and barrier operations.

Source code:

$ module load hpcx
$HPCX_SHARP_DIR/share/sharp/examples/mpi/coll/

Build and run instructions:

$ module load hpcx
$HPCX_SHARP_DIR/opt/Mellanox/sharp/share/sharp/examples/mpi/coll/README

NVIDIA SHARP Benchmark Script

The NVIDIA SHARP distribution provides a test script that runs the OSU allreduce and barrier benchmarks with and without NVIDIA SHARP. To run the NVIDIA SHARP benchmark script, the following packages must be installed:

ssh
pdsh
environment-modules.x86_64
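
Before launching the script, the presence of these tools can be verified with a short pre-flight check. This is a sketch only; `missing_tools` is a hypothetical helper, and it assumes that the environment-modules package exposes a `module` command or shell function in the current shell, which may not hold in non-interactive shells.

```shell
# Print the name of every given command that cannot be found in the current shell.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, `missing_tools ssh pdsh module` prints nothing when all three commands are available.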

The script is located at $HPCX_SHARP_DIR/sbin/sharp_benchmark.sh after the HPC-X module is loaded. Launch it from a host running the SM and the Aggregation Manager. It takes the list of compute nodes from the SLURM allocation or from the "hostlist" environment variable. "hostlist" is a comma-separated list of hosts; when it is used, the hca environment variables must also be supplied. The script then runs the OSU allreduce and barrier benchmarks with and without NVIDIA SHARP.