NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.12.0

Appendix C: HCA-to-TOR Benchmark Tool

SHARP can measure the bandwidth of the connections between a host HCA and its directly connected switch.

This bandwidth test is useful for link validation and for identifying faults in individual cables.

In contrast to typical bandwidth testing tools that assess traffic between two hosts (from one host to a switch and then to another host), SHARP benchmarks a single connection between a host and a switch, making it easier to identify any faulty connections.

The benchmark tool is provided in the SHARP installation folder at bin/sharp_coll_test.

It is also included with the SHARP library in both DOCA-HOST and HPC-X. You can find it in the SHARP bin directory at: $HPCX_SHARP_DIR/bin/sharp_coll_test.

To access the application help:


$ $HPCX_SHARP_DIR/bin/sharp_coll_test -h
Usage: sharp_coll_test[_mpi] [OPTIONS]
Options:
  -h, --help         Show this help message and exit
  -d, --ib-dev       Use IB device <dev:port> (default first device found)
  -j, --jobid        Explicit Job ID
  -i, --iters        Number of iterations to run perf benchmark
  -x, --skips        Number of warmup iterations to run perf benchmark
  -M, --mem_type     Memory type(host,cuda,null) used in the communication buffers
                     format: <src memtype>:<recv memtype>
  -s, --size         Set the minimum and/or the maximum message size.
                     format:[MIN:]MAX Default:<4:32M>
  -f, --stepfactor   <increment factor> multiplication factor between sizes. Default : 2

The mem_type parameter selects where the test data comes from: host CPU memory (host), GPU memory (cuda), or data generated directly by the HCA (null). Because the null option involves no host or GPU buffers, it provides a pure network check.

When sending data directly from the HCA, the message size must be at least 16K.
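The size rule above can be checked before a run is launched. The following is a minimal sketch and not part of SHARP: the helper names parse_size, parse_size_spec, and null_sizes_ok are hypothetical. It assumes that the K and M suffixes are binary multiples (64M corresponds to the 67108864 bytes shown in the tool output; K = 1024 is assumed by analogy) and that an omitted MIN falls back to the default minimum of 4 bytes listed in the help text.

```python
# Hypothetical pre-flight check for the -s value used with "-M null".
# Assumption: K and M are binary multiples (64M == 67108864 bytes).
SUFFIXES = {"K": 1024, "M": 1024 ** 2}

NULL_MIN_BYTES = 16 * 1024  # "at least 16K" when data comes from the HCA


def parse_size(token):
    """Convert a size token such as '16K' or '64M' to a byte count."""
    token = token.strip().upper()
    if token and token[-1] in SUFFIXES:
        return int(token[:-1]) * SUFFIXES[token[-1]]
    return int(token)


def parse_size_spec(spec, default_min=4):
    """Split a '[MIN:]MAX' value into (min_bytes, max_bytes).

    Assumption: when MIN is omitted, the tool uses its default minimum (4).
    """
    parts = spec.split(":")
    if len(parts) == 1:
        return default_min, parse_size(parts[0])
    if len(parts) == 2:
        return parse_size(parts[0]), parse_size(parts[1])
    raise ValueError(f"expected [MIN:]MAX, got {spec!r}")


def null_sizes_ok(spec):
    """Return True if the smallest message size meets the 16K minimum."""
    min_bytes, max_bytes = parse_size_spec(spec)
    return NULL_MIN_BYTES <= min_bytes <= max_bytes
```

For example, null_sizes_ok("64M:64M") accepts the range used in the examples in this appendix, while null_sizes_ok("8K:1M") rejects a range whose minimum is below 16K.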

Example Test Run Using the Host CPU


$ $HPCX_SHARP_DIR/bin/sharp_coll_test -M host -d mlx5_0 -s 64M:64M

HCA <--> TOR Bandwidth test. mem_type: HOST:HOST
#size(bytes)   Avg lat(us)   Min lat(us)   Max lat(us)   Avg BW(Gb/s)   iters
67108864       1510.23       1466.12       1570.44       355.49         100


Example Test Run Using the Host GPU


$ $HPCX_SHARP_DIR/bin/sharp_coll_test -M cuda -d mlx5_0 -s 64M:64M

HCA <--> TOR Bandwidth test. mem_type: CUDA:CUDA
#size(bytes)   Avg lat(us)   Min lat(us)   Max lat(us)   Avg BW(Gb/s)   iters
67108864       1489.44       1431.78       1533.91       360.45         100


Example Test Run Sending Data Directly from the HCA


$ $HPCX_SHARP_DIR/bin/sharp_coll_test -M null -d mlx5_0 -s 64M:64M

HCA <--> TOR Bandwidth test. mem_type: HCA:HCA
#size(bytes)   Avg lat(us)   Min lat(us)   Max lat(us)   Avg BW(Gb/s)   iters
67108864       1409.41       1392.64       1432.58       380.92         100
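The bandwidth figures in the three sample runs are consistent with message size in bits divided by average latency. This is an observation from the sample output, not a documented formula. The sketch below parses one result row and recomputes the value, which can help when comparing links across many hosts. The function names parse_result_row and bandwidth_gbps are hypothetical, and the column order is assumed to match the header shown in the examples.

```python
# Hypothetical helpers for reading one sharp_coll_test result row.
# Assumed column order (from the sample output header):
#   size(bytes)  Avg lat(us)  Min lat(us)  Max lat(us)  Avg BW(Gb/s)  iters


def parse_result_row(line):
    """Parse a whitespace-separated result row into a dictionary."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    return {
        "size_bytes": int(fields[0]),
        "avg_lat_us": float(fields[1]),
        "min_lat_us": float(fields[2]),
        "max_lat_us": float(fields[3]),
        "avg_bw_gbps": float(fields[4]),
        "iters": int(fields[5]),
    }


def bandwidth_gbps(size_bytes, avg_lat_us):
    """Bits per microsecond equals Mb/s; divide by 1000 to get Gb/s."""
    return size_bytes * 8 / avg_lat_us / 1000.0
```

Applied to the host CPU row, bandwidth_gbps(67108864, 1510.23) agrees with the reported 355.49 Gb/s to two decimal places. A row whose recomputed value diverges from the reported one is more likely a parsing problem than a link fault.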


In systems configured with partition keys (PKEYs), SHARP should be set to reservation mode. This mode restricts SHARP jobs to compute nodes included within specific reservations. The benchmark tool, therefore, can only be executed on nodes that are both part of a reservation and associated with the appropriate PKEY in their reservation data.

Multiple instances of sharp_coll_test can run simultaneously, which enables extensive parallel testing. However, to keep performance consistent and to avoid overloading sharp_am, stagger the test starts so that requests do not arrive in large surges.

For environments using TCP/IP communication between clients and sharp_am, it is recommended to limit the number of sharp_coll_test processes starting simultaneously to approximately 240 (across 40 compute nodes, each with 8 HCAs). For systems using UCX communication, the recommended limit is lower, around 80 tests (10 compute nodes, each with 8 HCAs), and a configuration adjustment in the sharp.cfg file is advised.
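One way to respect these limits is to split the full list of host and HCA pairs into batches and start each batch after a delay. The following is a minimal planning sketch, not a SHARP feature: the host names, the device names other than mlx5_0, the 30-second delay, and the function names are all assumptions chosen for illustration. It only computes the schedule; how each test is started on a remote node is left to the site's own launcher.

```python
from itertools import product


def build_targets(hosts, devices):
    """Return every (host, device) pair, grouped host by host."""
    return list(product(hosts, devices))


def stagger_batches(targets, max_concurrent):
    """Split targets into consecutive batches of at most max_concurrent."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    return [
        targets[i:i + max_concurrent]
        for i in range(0, len(targets), max_concurrent)
    ]


def launch_schedule(batches, delay_s):
    """Return (start_offset_seconds, host, device) for every test."""
    schedule = []
    for index, batch in enumerate(batches):
        for host, device in batch:
            schedule.append((index * delay_s, host, device))
    return schedule


# Example with hypothetical names: 25 nodes, 8 HCAs each, UCX limit of 80.
hosts = [f"node{i:02d}" for i in range(25)]
devices = [f"mlx5_{i}" for i in range(8)]
batches = stagger_batches(build_targets(hosts, devices), max_concurrent=80)
plan = launch_schedule(batches, delay_s=30.0)
```

Because the pairs are grouped host by host, a batch of 80 covers 10 nodes with 8 HCAs each, which matches the UCX figure above.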

To enable this limit under UCX, add the following line to the sharp.cfg file (the position within the file does not matter) and restart sharp_am:


enable_async_send FALSE
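When the setting is applied by a provisioning script, it helps to make the change idempotent so that repeated runs do not append duplicate lines. The sketch below works on the file contents as a string and is only an illustration: ensure_setting is a hypothetical helper, and it assumes that sharp.cfg holds one whitespace-separated "key value" pair per line, as in the line shown above. The location of sharp.cfg is not given here, so reading and writing the file is left to the caller, and sharp_am still has to be restarted afterwards.

```python
def ensure_setting(cfg_text, key, value):
    """Return cfg_text containing exactly one 'key value' line for key.

    An existing line for the key is rewritten in place, later duplicates
    are dropped, and the line is appended if the key is absent.
    """
    wanted = f"{key} {value}"
    lines = []
    found = False
    for line in cfg_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == key:
            if not found:
                lines.append(wanted)
                found = True
            # Later lines with the same key are skipped.
        else:
            lines.append(line)
    if not found:
        lines.append(wanted)
    return "\n".join(lines) + "\n"
```

Calling ensure_setting(text, "enable_async_send", "FALSE") twice returns the same result as calling it once.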

© Copyright 2025, NVIDIA. Last updated on Aug 25, 2025.