ClusterKit
ClusterKit is a multipurpose node assessment tool for high-performance clusters. It is capable of performing general assessments, GPU communication tests, collectives evaluations, bisectional bandwidth measurements, as well as CPU and GPU stress tests.
General assessments: Latency; bandwidth; effective bandwidth; memory bandwidth; ordered ring bandwidth; and random ring bandwidth
GPU communication tests: Memory bandwidth; GPU-GPU latency and bandwidth; GPU-Host latency and bandwidth; and NCCL bandwidth and latency
Collectives evaluations: Barrier; allreduce; broadcast; alltoall; and NCCL
Bisectional bandwidth
CPU/GPU stress
It is recommended to install ClusterKit in a shared directory.
If no such directory exists, make sure that all scripts are available on all hosts under exactly the same path.
ClusterKit requires either SLURM or passwordless ssh connectivity across the hosts.
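The passwordless ssh requirement can be checked before a run. The following is a minimal sketch, not part of ClusterKit: it assumes an OpenSSH client and a file with one hostname per line. The function name check_passwordless_ssh and the SSH_CMD override are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Check that every host in a hostfile accepts a non-interactive ssh login.
# SSH_CMD is a hypothetical override (defaults to ssh) used for illustration.
check_passwordless_ssh() {
    hostfile=$1
    failed=0
    while IFS= read -r host || [ -n "$host" ]; do
        [ -n "$host" ] || continue   # skip blank lines
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        # stdin is redirected so ssh does not consume the rest of the hostfile.
        if ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$host" true \
            </dev/null >/dev/null 2>&1; then
            echo "ok   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done < "$hostfile"
    return "$failed"
}

# Run the check only when a hostfile is given on the command line.
if [ "$#" -gt 0 ]; then
    check_passwordless_ssh "$1"
fi
```

The function returns a non-zero status if any host fails, so it can gate a launch script.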
ClusterKit is a multifaceted node assessment tool for high-performance clusters. Currently, ClusterKit can test latency, bandwidth, effective bandwidth, memory bandwidth, GFLOPS by node, per-rack collective performance, as well as bandwidth and latency between GPUs and local/remote memory. ClusterKit employs well-known techniques and tests to arrive at these performance metrics and is intended to give the user a general view of the health and performance of a cluster.
After loading the HPC-X package, and in a job allocation, issue one of the following commands.
mpirun -x UCX_NET_DEVICES=mlx5_4:1 $HPCX_CLUSTERKIT_DIR/bin/clusterkit
By default, ClusterKit runs pairwise test cases, which require at least two nodes.
mpirun $HPCX_CLUSTERKIT_DIR/bin/clusterkit
Note that multi-rail is enabled by default.
When not using a job scheduler, add the mpirun command-line arguments that specify the hosts.
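As an illustration of a run without a job scheduler, the sketch below builds an mpirun command line that names the hosts through a hostfile. It assumes that the mpirun provided by HPC-X is Open MPI, whose --hostfile and --map-by ppr:N:node options are used here. The helper name build_clusterkit_cmd and the file name hostfile.txt are hypothetical. The command is only printed, not executed.

```shell
#!/bin/sh
# Print (but do not run) an mpirun command line for ClusterKit.
# Assumes Open MPI's mpirun; the hostfile path is a placeholder.
build_clusterkit_cmd() {
    hostfile=$1
    ppn=${2:-1}   # processes per node, default 1
    # The format string is single-quoted so that $HPCX_CLUSTERKIT_DIR stays
    # literal and is expanded only when the printed command is executed.
    printf 'mpirun --hostfile %s --map-by ppr:%s:node $HPCX_CLUSTERKIT_DIR/bin/clusterkit\n' \
        "$hostfile" "$ppn"
}

build_clusterkit_cmd hostfile.txt
```

Printing the command first makes it easy to review the host and mapping arguments before launching on a large cluster.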
The application runs the default set of tests. Run it with --help to see all command-line options. During the run, interim results for each test are printed so that you can track progress, which is particularly useful on very large clusters with thousands of nodes.
Towards the end of the program output, you will see the name of the output directory, which is based on the time and date, and should be similar to the following.
Output directory: 20190915_061634/
The output directory is automatically created, and .json and .txt results are written for each test.
The .txt files are human readable, while the .json files are intended for import into the UFM-hosted viewer. At small scale, the .txt files generally suffice; for larger clusters, viewing the .json files in the UFM-hosted viewer is recommended.
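Because the output directory is named after the date and time of the run, the most recent results can be located by sorting directory names as text. The sketch below assumes that names follow the YYYYMMDD_HHMMSS pattern of the example above. The helper name latest_output_dir is hypothetical.

```shell
#!/bin/sh
# Print the most recent ClusterKit output directory under a base directory.
# Assumes directory names like 20190915_061634, which sort chronologically
# when sorted as plain text.
latest_output_dir() {
    base=${1:-.}
    for d in "$base"/[0-9]*_[0-9]*/; do
        # An unmatched glob stays literal and fails the -d test.
        [ -d "$d" ] && printf '%s\n' "${d%/}"
    done | sort | tail -n 1
}

# List the human-readable results of the latest run, if there is one.
latest=$(latest_output_dir .)
if [ -n "$latest" ]; then
    ls "$latest"/*.txt
fi
```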
ClusterKit can also be run using the supplied clusterkit.sh convenience script. This script provides a simple interface for configuring some internal UCX parameters.
./clusterkit.sh [options] <parameters>
Parameters:
-v|--verbose Set verbose mode
-f|--hostfile <hostfile> File with newline separated hostnames to run tests on.
-r|--hpcx_dir <path> Path to HPCX installation root folder (or use env HPCX_DIR)
Options:
-p|--ppn <number> Select number of processes per hostname (default: 1)
-d|--hca_list "string" Comma separated list of HCAs to use (default: autoselect)
-t|--transport_list "string" List of RDMA transports to use (rc,dc,ud) (default: autoselect best)
-z|--traffic <nn> Run traffic for 'nn' minutes
-s|--ssh Use ssh for process launching (default: autoselect)
-h|--help Show help message
-n|--dry-run Dry run (do nothing, only print)
-m|--map-by [node|core|socket] (Used in MPI argument: --map-by ppr:ppn:map-by)
-y|--bycore Run on ALL cores, not just a single core per node
-k|--test_intra_node Run intra-node tests for bandwidth and latency (default: skip intra-node)
-U|--unidirectional Run unidirectional bandwidth tests (default: bidirectional)
-e|--mapper Shell script that maps local MPI rank to a core and one or more HCAs, e.g. for testing machines with multiple HCAs, where each HCA needs to be tested
-g|--gpu Run GPU lat/bw/neighbor tests
-G|--gpudirect Run GPU tests with GPU-Direct.
-w|--rdma_write Use RDMA-write to pass data to the remote host.
-o|--rdma_read Use RDMA-read to access data from the remote host.
-P|--performance Set CPU scaling governor to 'performance'. Set back to 'powersave' after execution
-a|--output Generate zip of heatmaps and tgz of JSON files from output. Overrides -k.
Output options:
-l|--normalize Normalize latency results (default: false)
-C|--clean Erase output cache directory (default: false)
-x|--exe_opt Options for clusterkit.
-i|--mpi_opt Options for mpirun.
To pass additional MPI options, use the mpi_opt environment variable.
To pass additional options to the clusterkit executable, use the exe_opt environment variable.
Examples:
% ./clusterkit.sh --ssh --hostfile hostfile.txt
% ./clusterkit.sh --hca_list "mlx5_0:1,mlx5_2:1" --hostfile hostfile.txt
% exe_opt="--gpudirect " ./clusterkit.sh --hca_list "mlx5_0:1,mlx5_2:1" --hostfile hostfile.txt
% mpi_opt="-x UCX_RNDV_SCHEME=get_zcopy" ./clusterkit.sh --hca_list "mlx5_0:1,mlx5_2:1" --hostfile hostfile.txt
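The examples above rely on a hostfile of newline-separated hostnames. The sketch below shows one way to produce such a file. The hostnames node01 and node02 and the helper name write_hostfile are hypothetical. The final comment shows how the resulting file could be previewed with the --dry-run option listed above.

```shell
#!/bin/sh
# Write a newline-separated hostfile from hostnames given as arguments,
# dropping duplicates while keeping first-seen order.
write_hostfile() {
    out=$1
    shift
    printf '%s\n' "$@" | awk '!seen[$0]++' > "$out"
}

# Hypothetical hostnames; replace them with the nodes under test.
write_hostfile hostfile.txt node01 node02 node01
cat hostfile.txt

# Preview what the script would launch, without running anything:
# ./clusterkit.sh --dry-run --hostfile hostfile.txt
```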