Certain ClusterKit tests are pairwise and use specific logic for processing their results, as described below.

ClusterKit repeats each type of test a number of times across the appropriate population of ranks, using the given message sizes (default or specified), and collects performance statistics.

In pairwise experiments run in FULL test mode, ClusterKit repeats the experiment on the pairs it selects in each round.

A system with n nodes has up to n(n-1)/2 distinct pairs, and up to n/2 communicating pairs can be tested simultaneously.

ClusterKit runs n-1 rounds, pairing each source with a different destination in every round.
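The arithmetic above (n-1 rounds of n/2 simultaneous pairs covering all n(n-1)/2 distinct pairs) can be illustrated with a standard round-robin "circle method" schedule. This is only a sketch: the text does not say how ClusterKit actually selects pairs, and the function name `round_robin_rounds` and the node names are hypothetical.

```python
def round_robin_rounds(nodes):
    """Return a list of rounds; each round is a list of disjoint (a, b) pairs.

    Illustrative circle-method schedule, not ClusterKit's own algorithm.
    """
    nodes = list(nodes)
    if len(nodes) % 2:
        # With an odd node count, whoever is paired with None sits the round out.
        nodes.append(None)
    n = len(nodes)
    rounds = []
    for _ in range(n - 1):
        # Pair the i-th node from the front with the i-th node from the back.
        pairs = [(nodes[i], nodes[n - 1 - i]) for i in range(n // 2)]
        rounds.append([p for p in pairs if None not in p])
        # Keep the first node fixed and rotate the others by one position.
        nodes = [nodes[0]] + [nodes[-1]] + nodes[1:-1]
    return rounds


# Hypothetical node names, used only for illustration.
schedule = round_robin_rounds(["node%02d" % i for i in range(1, 7)])
for number, pairs in enumerate(schedule, start=1):
    print(number, pairs)
```

For 6 nodes this produces 5 rounds of 3 pairs each, and every one of the 15 distinct pairs appears exactly once.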

A 'tolerance' is a specified percentage of an extreme value (minimum or maximum) of the observed performance distribution. Results that fall beyond it characterize a component's performance as 'unacceptable.'

Message latencies that are 2.1 times (by default) above the minimum are considered 'bad.'

Message bandwidths (BWs) less than 93% (by default) of the maximum are also deemed 'bad.'

Different tolerance values can be supplied.
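As a rough illustration of the default tolerances, the sketch below flags pairs from a set of measurements. It assumes "2.1 times above the minimum" means a latency greater than 2.1 x the minimum observed latency; the function names and the sample measurements are hypothetical and are not ClusterKit output.

```python
LATENCY_FACTOR = 2.1        # default from the text: bad if above 2.1 x the minimum
BANDWIDTH_FRACTION = 0.93   # default from the text: bad if below 93% of the maximum


def bad_latency_pairs(latencies, factor=LATENCY_FACTOR):
    """Return the pairs whose latency exceeds factor x the minimum latency."""
    best = min(latencies.values())
    return sorted(pair for pair, value in latencies.items() if value > factor * best)


def bad_bandwidth_pairs(bandwidths, fraction=BANDWIDTH_FRACTION):
    """Return the pairs whose bandwidth is below fraction x the maximum bandwidth."""
    best = max(bandwidths.values())
    return sorted(pair for pair, value in bandwidths.items() if value < fraction * best)


# Made-up measurements keyed by node pair (units are arbitrary here).
latencies = {("n1", "n2"): 1.0, ("n1", "n3"): 1.5, ("n2", "n3"): 2.5}
bandwidths = {("n1", "n2"): 100.0, ("n1", "n3"): 95.0, ("n2", "n3"): 90.0}

print(bad_latency_pairs(latencies))     # pairs slower than 2.1 x the best latency
print(bad_bandwidth_pairs(bandwidths))  # pairs below 93% of the best bandwidth
```

With these sample values only the pair (n2, n3) is flagged in both cases: its latency of 2.5 exceeds 2.1 x 1.0, and its bandwidth of 90.0 is below 93% of 100.0.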

Pairs whose performance falls outside the tolerance are selected to participate in subsequent evaluation rounds for re-testing. In each subsequent round, the total number of pairs is halved. In the final round, pairs that continue to perform outside the tolerance are declared definitively 'bad.'

The -x option for the ClusterKit binary disables the retesting logic, causing ClusterKit to test all possible pairs once and then stop.

In pairwise experiments conducted in QUICK mode, n/2 communicating pairs go through one round of evaluation. To run in QUICK mode, pass -q (--quick) to the ClusterKit binary.

In CUSTOM mode, the pairs to test in each round are specified in an input file.

To run a pairwise test in CUSTOM mode, pass -f <file> (--fromfile=<file>) to the ClusterKit binary.

The file consists of lines, with each line formatted as:

    <round_num> <node_1> <node_2>

All pairs (links between <node_1> and <node_2>) with the same <round_num> will be tested in parallel. The <round_num> values should appear in non-descending order. For example:

    1 machine02 machine10
    1 machine03 machine07
    2 machine02 machine03

This will test (machine02, machine10) and (machine03, machine07) in round 1 and (machine02, machine03) in round 2.
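Before handing such a file to ClusterKit, it can be useful to check it against the format described above. The following is a minimal sketch of such a check, not part of ClusterKit: the function name `parse_pair_file` is hypothetical, and skipping blank lines is an assumption rather than documented behavior.

```python
def parse_pair_file(text):
    """Group '<round_num> <node_1> <node_2>' lines by round number.

    Raises ValueError on malformed lines or on descending round numbers.
    """
    rounds = {}
    previous = None
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 3:
            raise ValueError("line %d: expected 3 fields, got %d" % (line_number, len(fields)))
        round_num = int(fields[0])  # raises ValueError if not an integer
        if previous is not None and round_num < previous:
            raise ValueError("line %d: round numbers must be non-descending" % line_number)
        previous = round_num
        rounds.setdefault(round_num, []).append((fields[1], fields[2]))
    return rounds


example = """\
1 machine02 machine10
1 machine03 machine07
2 machine02 machine03
"""

for round_num, pairs in parse_pair_file(example).items():
    print("round", round_num, "tests", pairs)
```

Applied to the example file, this groups (machine02, machine10) and (machine03, machine07) under round 1 and (machine02, machine03) under round 2, matching the description above.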