ClusterKit is a multifaceted node assessment tool for high-performance clusters. Currently, ClusterKit can test latency, bandwidth, effective bandwidth, memory bandwidth, GFLOPS by node, and per-rack collective performance, as well as bandwidth and latency between GPUs and local/remote memory. ClusterKit employs well-known techniques and tests to arrive at these performance metrics and is intended to give the user a general view of the health and performance of a cluster.

Running ClusterKit

After loading the HPC-X package, and in a job allocation, issue one of the following commands.

To test a specific network device: 

mpirun -x UCX_NET_DEVICES=mlx5_4:1 $HPCX_CLUSTERKIT_DIR/bin/clusterkit

To allow UCX to choose the network device/devices: 

mpirun $HPCX_CLUSTERKIT_DIR/bin/clusterkit

Note that multi-rail is enabled by default.

When not using a job scheduler, add the mpirun command-line arguments that specify the hosts.
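For example, assuming HPC-X's Open MPI mpirun (the hostnames, process count, and hostfile name below are placeholders, not values from this document):

```shell
# Hosts given explicitly on the command line (placeholder names):
mpirun --host node01,node02 -np 2 $HPCX_CLUSTERKIT_DIR/bin/clusterkit

# Or via a hostfile containing one hostname per line:
mpirun --hostfile myhosts.txt -np 2 $HPCX_CLUSTERKIT_DIR/bin/clusterkit
```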

The application runs with the default set of tests. Run with --help to see all command-line options. During the run, interim results for each test are printed so that you can track progress. This is particularly useful on very large clusters with thousands of nodes.

Towards the end of the program output, you will see the name of the output directory, which is based on the date and time and should be similar to the following:

Output directory: 20190915_061634/

The output directory is created automatically, and .json and .txt results are written to it for each test.

The .txt files are human-readable; the .json files are intended for import into the UFM-hosted viewer. At small scale, the .txt files generally suffice, but for larger clusters, the UFM-hosted viewer is recommended for viewing the .json results.
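As a quick check without the viewer, the .txt reports can be read directly. This sketch uses the example directory name from above; substitute the directory printed by your own run:

```shell
# List the per-test result files, then read the human-readable reports.
ls 20190915_061634/
cat 20190915_061634/*.txt
```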

Running ClusterKit via Script

ClusterKit can also be run using the supplied convenience script, which provides a simple interface for configuring some internal UCX parameters.

./ [options] <parameters>

        -v|--verbose                  Set verbose mode
        -f|--hostfile <hostfile>      File with newline separated hostnames to run tests on.
        -r|--hpcx_dir <path>          Path to HPCX installation root folder (or use env HPCX_DIR)

        -p|--ppn <number>             Select number of processes per hostname (default: 1)
        -d|--hca_list "string"        Comma separated list of HCAs to use (default: autoselect)
        -t|--transport_list "string"  List of RDMA transports to use (rc,dc,ud) (default: autoselect best)
        -z|--traffic <nn>             Run traffic for 'nn' minutes
        -s|--ssh                      Use ssh for process launching (default: autoselect)
        -h|--help                     Show help message
        -n|--dry-run                  Dry run (do nothing, only print)
        -m|--map-by                   [node|core|socket]  (Used in MPI argument: --map-by ppr:ppn:map-by)
        -y|--bycore                   Run on ALL cores, not just a single core per node
        -k|--test_intra_node          Run intra-node tests for bandwidth and latency (default: skip intra-node)
        -U|--unidirectional           Run unidirectional bandwidth tests (default: bidirectional)
        -e|--mapper                   shell script that maps local MPI rank to a core and one or more HCAs
                                      e.g. for testing machines with multiple HCAs, where each HCA needs to be tested
        -g|--gpu                      Run GPU lat/bw/neighbor tests
        -G|--gpudirect                Run GPU tests with GPU-Direct
        -w|--rdma_write               Use RDMA-write to pass data to the remote host
        -o|--rdma_read                Use RDMA-read to access data from the remote host
        -P|--performance              Set CPU scaling governor to 'performance'. Set back to 'powersave' after execution
        -a|--output                   Generate zip of heatmaps and tgz of JSON files from output. Overrides -k.
        output options:
                -l|--normalize        Normalize latency results                       default: false
                -C|--clean            Erase output cache directory                    default: false
        -x|--exe_opt                  Options for clusterkit.
        -i|--mpi_opt                  Options for mpirun.

    To pass additional MPI options, use the mpi_opt environment variable.
    To pass additional options to the clusterkit executable, use the exe_opt environment variable.

               % ./ --ssh --hostfile hostfile.txt

               % ./ --hca_list "mlx5_0:1,mlx5_2:1" --hostfile hostfile.txt

               % exe_opt="--gpudirect"                  ./ --hca_list "mlx5_0:1,mlx5_2:1" --hostfile hostfile.txt

               % mpi_opt="-x UCX_RNDV_SCHEME=get_zcopy" ./ --hca_list "mlx5_0:1,mlx5_2:1" --hostfile hostfile.txt