ClusterKit is a multifaceted node assessment tool for high-performance clusters. Currently, ClusterKit can test latency, bandwidth, effective bandwidth, memory bandwidth, GFLOPS by node, and per-rack collective performance, as well as bandwidth and latency between GPUs and local/remote memory. ClusterKit employs well-known techniques and tests to arrive at these performance metrics and is intended to give the user a general view of the health and performance of a cluster.
After loading the HPC-X package, issue one of the following commands from within a job allocation.
To test a specific network device:
To allow UCX to choose the network device/devices:
Note that multi-rail is enabled by default.
When not using a job scheduler, add the mpirun command-line arguments that specify the hosts.
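As a rough illustration of specifying hosts without a scheduler, the sketch below builds an mpirun command line using Open MPI's `--host` option and prints it rather than launching it. The host names and the `CLUSTERKIT_BIN` variable are placeholders introduced here for illustration, not values from the HPC-X package; substitute the actual path to the ClusterKit executable on your system.

```shell
#!/bin/sh
# Sketch only: CLUSTERKIT_BIN is a hypothetical placeholder for the real
# path to the ClusterKit executable in your HPC-X installation.
CLUSTERKIT_BIN="${CLUSTERKIT_BIN:-clusterkit}"

# Print an mpirun command that names the hosts explicitly.
# $1 is a comma-separated host list (Open MPI --host syntax).
clusterkit_cmd() {
    printf 'mpirun --host %s %s\n' "$1" "$CLUSTERKIT_BIN"
}

# Dry run: show the command for two example (hypothetical) hosts.
clusterkit_cmd "node01,node02"
```

For longer host lists, Open MPI's `--hostfile` option, which takes a file with one host per line, is usually more convenient than an inline list.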
By default, the application runs the default set of tests. Run it with --help to see all command-line options. While the program runs, interim results for each test are printed so that you can track progress. This is particularly important on very large clusters with thousands of nodes.
Towards the end of the program output, you will see the name of the output directory, which is based on the time and date, and should be similar to the following.
The output directory is automatically created, and .json and .txt results are written for each test.
The .txt files are human-readable; the .json files are intended for import into the UFM-hosted viewer. For small clusters, the .txt files generally suffice, but for larger clusters, the UFM-hosted viewer is recommended for viewing the .json files.
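To inspect the results from the command line, a small helper such as the one below can list the per-test files of one type in an output directory. This is a minimal sketch: the function name `list_results` and the `CLUSTERKIT_OUTDIR` variable are assumptions made for this example, and it assumes only that the result files sit directly in the output directory with the .txt and .json extensions described above.

```shell
#!/bin/sh
# Sketch: list result files of one type in a ClusterKit output directory.
# $1 = output directory (the name printed near the end of the run)
# $2 = extension to list: "txt" or "json"
list_results() {
    for f in "$1"/*."$2"; do
        # Skip the unexpanded pattern when no file matches.
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Example: list the human-readable reports. CLUSTERKIT_OUTDIR is a
# hypothetical variable; it falls back to the current directory.
list_results "${CLUSTERKIT_OUTDIR:-.}" txt
```

Calling `list_results <output-dir> json` instead gives the files to import into the UFM-hosted viewer.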
Running ClusterKit via Script
ClusterKit can also be run using the supplied clusterkit.sh convenience script. This script provides a simple interface for configuring some internal UCX parameters.