NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.9.0

HCA-to-TOR Benchmark Tool

SHARP can measure the bandwidth of the connection between a host HCA and its directly connected switch.

This bandwidth test is useful for link validation and for identifying faults in individual cables.

In contrast to typical bandwidth testing tools that assess traffic between two hosts (from one host to a switch and then to another host), SHARP benchmarks a single connection between a host and a switch, making it easier to identify any faulty connections.

The benchmark tool is provided in the SHARP installation folder at bin/sharp_coll_test.

It is also included with the SHARP library in both DOCA-HOST and HPC-X. You can find it in the SHARP bin directory at: $HPCX_SHARP_DIR/bin/sharp_coll_test.
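
For example, with the HPC-X environment loaded, you can verify that the binary is present:

ls $HPCX_SHARP_DIR/bin/sharp_coll_test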

To access the application help:

$HPCX_SHARP_DIR/bin/sharp_coll_test -h
Usage: sharp_coll_test[_mpi] [OPTIONS]
Options:
  -h, --help        Show this help message and exit
  -d, --ib-dev      Use IB device <dev:port> (default first device found)
  -j, --jobid       Explicit Job ID
  -i, --iters       Number of iterations to run perf benchmark
  -x, --skips       Number of warmup iterations to run perf benchmark
  -M, --mem_type    Memory type (host, cuda, null) used in the communication buffers
                    format: <src memtype>:<recv memtype>
  -s, --size        Set the minimum and/or the maximum message size
                    format: [MIN:]MAX  Default: <4:32M>
  -f, --stepfactor  Multiplication factor between sizes. Default: 2

Note that the mem_type parameter lets you run tests with data sourced from the host CPU (host), from the GPU (cuda), or generated directly by the HCA (null), enabling a pure network check.

When sending data directly from the HCA (mem_type null), the message size must be at least 16K.
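
For example, a minimal run at that size with HCA-generated data might look like the following sketch (the device name mlx5_0 is an assumption; 16384 bytes equals 16K):

$HPCX_SHARP_DIR/bin/sharp_coll_test -M null -d mlx5_0 -s 16384:16384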

Example Test Run Using the Host CPU

$HPCX_SHARP_DIR/bin/sharp_coll_test -M host -d mlx5_0 -s 64M:64M

HCA < -- > TOR Bandwidth test. mem_type: HOST:HOST
#size(bytes)   Avg lat(us)   Min lat(us)   Max lat(us)   Avg BW(Gb/s)   iters
67108864       1510.23       1466.12       1570.44       355.49         100


Example Test Run Using the GPU

$HPCX_SHARP_DIR/bin/sharp_coll_test -M cuda -d mlx5_0 -s 64M:64M

HCA < -- > TOR Bandwidth test. mem_type: CUDA:CUDA
#size(bytes)   Avg lat(us)   Min lat(us)   Max lat(us)   Avg BW(Gb/s)   iters
67108864       1489.44       1431.78       1533.91       360.45         100


Example Test Run Sending Data Directly from the HCA

$HPCX_SHARP_DIR/bin/sharp_coll_test -M null -d mlx5_0 -s 64M:64M

HCA < -- > TOR Bandwidth test. mem_type: HCA:HCA
#size(bytes)   Avg lat(us)   Min lat(us)   Max lat(us)   Avg BW(Gb/s)   iters
67108864       1409.41       1392.64       1432.58       380.92         100


In systems configured with partition keys (PKEYs), SHARP should be set to reservation mode. This mode restricts SHARP jobs to compute nodes included within specific reservations. The benchmark tool, therefore, can only be executed on nodes that are both part of a reservation and associated with the appropriate PKEY in their reservation data.

The sharp_coll_test tool allows for multiple simultaneous instances, enabling extensive parallel testing. However, for optimal performance and to prevent excessive load on the sharp_am, it’s best to stagger test starts to avoid large surges in requests.
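
One simple approach is to insert a short delay between launches. The following shell sketch (the device names and the one-second delay are illustrative assumptions) runs the test on several local HCAs in parallel without starting them all at the same instant:

# Illustrative sketch: stagger sharp_coll_test launches across local HCAs
for dev in mlx5_0 mlx5_1 mlx5_2 mlx5_3; do
    $HPCX_SHARP_DIR/bin/sharp_coll_test -M null -d "$dev" -s 64M:64M &
    sleep 1   # brief offset so sharp_am is not hit by a surge of requests
done
wait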

For environments using TCP/IP communication between clients and sharp_am, it is recommended to limit the number of sharp_coll_test processes starting simultaneously to approximately 240 (for example, 40 compute nodes with 8 HCAs each). For systems using UCX communication, the limit should be reduced to around 80 tests (10 compute nodes with 8 HCAs each), and a configuration adjustment in the sharp.cfg file is advised.

To enable this limit under UCX, add the following line to the sharp.cfg file (the position within the file does not matter) and restart sharp_am:

enable_async_send FALSE
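
On deployments where sharp_am runs as a systemd service, the restart might look like this (the unit name is an assumption and may vary by deployment):

# Assumed systemd unit name; adjust for your deployment
sudo systemctl restart sharp_am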

© Copyright 2024, NVIDIA. Last updated on Nov 11, 2024.