Aggregation Trees Diagnostics
Run the ibdiagnet utility with the SHARP diagnostics option.
Check the fabric summary table in the ibdiagnet output for the number of identified aggregation nodes. For example:
Check the summary table in the ibdiagnet output for errors in the SHARP diagnostics stage. For example:
Check in the SHARP diagnostics output file (/var/tmp/ibdiagnet2/ibdiagnet2.sharp) that SHARP aggregation trees are configured in the subnet.
For example, count the number of configured aggregation trees constructed by the Aggregation Manager using the grep command:
Note that when operating in dynamic trees mode, ibdiagnet may print warning messages about the existence of multiple distinct trees with the same tree ID. In dynamic trees mode, this is a valid situation and these warnings should be ignored.
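The tree count described above can also be scripted. The following is a minimal Python sketch, not part of the NVIDIA SHARP distribution, that counts matching lines in the diagnostics file in the same way `grep -c` would. The marker string `TreeID` and the helper name `count_matching_lines` are assumptions made for illustration; check the actual contents of the file and adjust the marker accordingly.

```python
from pathlib import Path

# Path of the SHARP diagnostics output file, as given in the text above.
SHARP_DIAG_FILE = "/var/tmp/ibdiagnet2/ibdiagnet2.sharp"

# Assumption: every configured tree entry in the file contains this marker.
# Inspect the file and change the marker if its format differs.
TREE_MARKER = "TreeID"


def count_matching_lines(path, marker=TREE_MARKER):
    """Return the number of lines in `path` that contain `marker`."""
    with open(path, "r", errors="replace") as handle:
        return sum(1 for line in handle if marker in line)


if __name__ == "__main__":
    if Path(SHARP_DIAG_FILE).exists():
        print(count_matching_lines(SHARP_DIAG_FILE))
    else:
        print(f"{SHARP_DIAG_FILE} not found; run ibdiagnet with SHARP diagnostics first")
```

Because this counts tree entries rather than distinct tree IDs, the result in dynamic trees mode may be larger than the number of unique IDs, which is consistent with the note above.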
NVIDIA SHARP Hello
The NVIDIA SHARP distribution provides the sharp_hello test utility for testing SHARP's end-to-end functionality on a compute node. It creates a single SHARP job and sends a barrier request to the SHARP aggregation node.
NVIDIA SHARP Benchmark
The NVIDIA SHARP distribution provides source code for a benchmark that tests native SHARP low-level performance for allreduce and barrier operations.
Build and run instructions:
NVIDIA SHARP Benchmark Script
The NVIDIA SHARP distribution provides a test script that runs the OSU allreduce and barrier benchmarks with and without NVIDIA SHARP. To run the NVIDIA SHARP benchmark script, the following packages must be installed.
You can find this script at $HPCX_SHARP_DIR/sbin/sharp_benchmark.sh after loading the HPC-X module. Launch the script from a host running the SM and the Aggregation Manager. It receives the list of compute nodes either from the SLURM allocation or from the “hostlist” environment variable. “hostlist” is a comma-separated list of nodes; when it is used, the hca environment variables must also be supplied.
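As an illustration of the “hostlist” format, here is a minimal Python sketch that builds the comma-separated value from a list of node names and places it in an environment mapping that could be passed to the script. The helper names `build_hostlist` and `benchmark_env` and the node names in the test data are hypothetical. The hca environment variables that the script also requires are not set here, because their names are not given in this section.

```python
import os


def build_hostlist(nodes):
    """Join node names into the comma-separated form used by "hostlist"."""
    cleaned = [name.strip() for name in nodes]
    for name in cleaned:
        # Reject names that would corrupt the comma-separated list.
        if not name or "," in name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid node name: {name!r}")
    return ",".join(cleaned)


def benchmark_env(nodes, base_env=None):
    """Return a copy of the environment with "hostlist" set.

    The hca variables required alongside "hostlist" must still be added
    by the caller; they are intentionally not guessed here.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["hostlist"] = build_hostlist(nodes)
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when launching the benchmark script, once the hca variables have been added to it.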