Test Descriptions and Options
All command-line options mentioned in the test descriptions are applicable to the ClusterKit binary (see Running ClusterKit).
Bandwidth Test (-d bw)
The bandwidth test utilizes nonblocking MPI_Isend and MPI_Irecv calls.
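A minimal sketch of this exchange pattern is shown below. It assumes an even number of ranks paired 0-1, 2-3, ..., and uses the documented defaults (32 MB messages, 16 iterations); the timing and reporting details are illustrative, not ClusterKit's actual implementation.

    /* Sketch of a bidirectional bandwidth measurement between paired ranks
     * using nonblocking MPI_Isend/MPI_Irecv. Assumes an even rank count. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const size_t bsize  = 32UL << 20;  /* 32 MB, the documented default  */
        const int    biters = 16;          /* documented default iterations  */
        char *sendbuf = malloc(bsize);
        char *recvbuf = malloc(bsize);
        int peer = rank ^ 1;               /* partner rank in the pair       */

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < biters; i++) {
            MPI_Request req[2];
            MPI_Irecv(recvbuf, (int)bsize, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[0]);
            MPI_Isend(sendbuf, (int)bsize, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[1]);
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        }
        double elapsed = MPI_Wtime() - t0;

        /* Each iteration moves bsize bytes in each direction. */
        if (rank == 0)
            printf("~%.2f GB/s per direction\n",
                   (double)biters * bsize / elapsed / 1e9);

        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }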
Options:
- Iterations: -b <iters>, --biters=<iters> (default: 16)
- Message Size: -B <size>, --bsize=<size> (default: 32 MB)
- Unidirectional: -U, --unidirectional (send data in one direction only; the default is bidirectional)
- Tolerance: -u <tol>, --btol=<tol> (specify the tolerance; see ClusterKit Evaluation Logic for Pairwise Tests)
Latency Test (-d lat)
The latency test is performed with a series of MPI_Send and MPI_Recv calls, where one partner sends a message to the other, which then sends a message back. This process is repeated <iters> times.
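A minimal sketch of this ping-pong pattern is shown below, assuming ranks are paired 0-1, 2-3, ...; latency is usually reported as half the average round-trip time. The sketch illustrates the pattern only, not ClusterKit's implementation.

    /* Ping-pong latency sketch: the lower rank of each pair sends, the
     * higher rank replies, repeated iters times. Assumes an even rank count. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 1024;      /* documented default */
        const int peer  = rank ^ 1;  /* partner rank in the pair */
        char dummy = 0;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank < peer) {       /* the lower rank initiates the ping */
                MPI_Send(&dummy, 0, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(&dummy, 0, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {                 /* the higher rank replies (the pong) */
                MPI_Recv(&dummy, 0, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&dummy, 0, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            }
        }
        /* Half of the average round trip is the usual latency estimate. */
        if (rank == 0)
            printf("~%.2f us latency\n",
                   (MPI_Wtime() - t0) / iters / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }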
Options:
- Iterations: -l <iters>, --liters=<iters> (default: 1024)
- Message Size: -L <size>, --lsize=<size> (default: 0 bytes)
- Tolerance: -t <tol>, --ltol=<tol> (specify the tolerance; see ClusterKit Evaluation Logic for Pairwise Tests)
GPU-GPU Latency Test (-d gpu_gpu_lat)
Measures the latency of GPU-to-GPU communication with MPI_Isend and MPI_Irecv.
Options:
- Iterations: -k, --gpulati=<iters> (default: 1024)
- Message Size: -K, --gpulats=<size> (default: 0 bytes)
- Tolerance: -t <tol>, --ltol=<tol> (specify the tolerance; see ClusterKit Evaluation Logic for Pairwise Tests)
- Per-GPU test: -z, --bygpu (test corresponding GPU pairs: GPU0-to-GPU0, GPU1-to-GPU1, etc.)
- Use GPUDIRECT: -G, --gpudirect (use GPUDIRECT; the default is to copy from GPU memory to the host)
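The -G/--gpudirect option above selects between two transfer paths. The sketch below contrasts them, assuming a CUDA-aware MPI library when GPUDIRECT is requested; the function and variable names are illustrative, not ClusterKit internals.

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    /* Send 'bytes' bytes that currently live in GPU memory to rank 'peer'. */
    void send_from_gpu(void *dev_buf, size_t bytes, int peer, int use_gpudirect)
    {
        if (use_gpudirect) {
            /* GPUDIRECT path: a CUDA-aware MPI reads the device buffer directly. */
            MPI_Send(dev_buf, (int)bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        } else {
            /* Default path: stage the data through a host buffer first. */
            void *host_buf = malloc(bytes);
            cudaMemcpy(host_buf, dev_buf, bytes, cudaMemcpyDeviceToHost);
            MPI_Send(host_buf, (int)bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            free(host_buf);
        }
    }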
GPU-GPU Bandwidth Test (-d gpu_gpu_bw)
Measures the bandwidth of GPU-to-GPU communication with MPI_Isend and MPI_Irecv.
Options:
- Iterations: -a, --gpubwi=<iters> (default: 64)
- Message Size: -A, --gpubws=<size> (default: 1 MB)
- Tolerance: -u <tol>, --btol=<tol> (specify the tolerance; see ClusterKit Evaluation Logic for Pairwise Tests)
- Per-GPU test: -z, --bygpu (test corresponding GPU pairs on different nodes: GPU0-to-GPU0, GPU1-to-GPU1, etc.)
- Use GPUDIRECT: -G, --gpudirect (use GPUDIRECT; the default is to copy from GPU memory to the host)
NCCL GPU-GPU Bandwidth Test (-d nccl_bw)
Measures the bandwidth of GPU-to-GPU communication with NCCL communication primitives.
Options:
- Iterations: -a, --gpubwi=<iters> (default: 64)
- Message Size: -A, --gpubws=<size> (default: 1 MB)
- Tolerance: -u <tol>, --btol=<tol> (specify the tolerance; see ClusterKit Evaluation Logic for Pairwise Tests)
NCCL GPU-GPU Latency Test (-d nccl_lat)
Measures the latency of GPU-to-GPU communication with NCCL communication primitives.
Options:
- Iterations: -k, --gpulati=<iters> (default: 1024)
- Message Size: -K, --gpulats=<size> (default: 0 bytes)
Collective Tests
Collective tests perform selected collective operations across all nodes in a defined scope.
Types of tests (each is set as an argument to the -d option):
- barrier
- allreduce
- bcast
- alltoall
Options:
- Iterations: -n, --niter=<iters> (default: 10000)
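As an illustration of how a collective can be timed over the configured iteration count, here is a minimal sketch using MPI_Allreduce; the averaging and reporting are illustrative rather than ClusterKit's actual logic.

    /* Time a collective by repeating it niter times and averaging. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int niter = 10000;   /* documented default */
        double in = 1.0, out = 0.0;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < niter; i++)
            MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double avg_us = (MPI_Wtime() - t0) / niter * 1e6;

        if (rank == 0)
            printf("MPI_Allreduce: ~%.2f us per call\n", avg_us);
        MPI_Finalize();
        return 0;
    }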
NCCL Collective Tests
Performs NCCL collective operations among nodes in the same scope.
Types of Tests:
- nccl_bcast
- nccl_allreduce
- nccl_reduce
- nccl_allgather
- nccl_reducescatter
Options:
- Iterations: -n, --niter=<iters> (default: 10000)
Bisectional Bandwidth Test (-d bisect_bw)
Measures bisectional bandwidth by enabling communication between corresponding nodes in different scopes, assessing potential interference.
Options:
- Iterations: -b <iters>, --biters=<iters> (default: 16)
- Message Size: -B <size>, --bsize=<size> (default: 32 MB)
- Unidirectional: -U, --unidirectional (send data in one direction only)
- Scope Order: --scope_order=<scope_order> (sets the order of scopes for testing)
Scope Order File Format: The file consists of lines in the following format:

    <pass_num>,<scope1>,<scope2>

Example:

    1,scope01,scope02
    1,scope03,scope04
    2,scope02,scope03
    3,scope01,scope04
    3,scope02,scope03
This example instructs ClusterKit to execute three passes, testing the scope pairs listed for each pass number.
Memory Bandwidth Test (-d mb)
The memory bandwidth test can be conducted with one of the following operations:
- ADD: a[i] = b[i] + c[i]
- COPY: a[i] = b[i]
- SCALE: a[i] = D * b[i]
- TRIAD: a[i] = b[i] + D * c[i]
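As an illustration, here is a minimal STREAM-style sketch of the TRIAD operation listed above; the array length is arbitrary (ClusterKit sizes its arrays from the L3 cache, per the options below) and the timing is illustrative only.

    /* STREAM-style TRIAD sketch: a[i] = b[i] + D * c[i], timed over the arrays. */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const size_t n = 1 << 24;          /* illustrative array length */
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        double *c = malloc(n * sizeof *c);
        const double D = 3.0;

        for (size_t i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + D * c[i];        /* TRIAD: reads b and c, writes a */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double gbps = 3.0 * n * sizeof(double) / sec / 1e9;  /* three arrays touched */
        printf("TRIAD: ~%.2f GB/s (a[0]=%.1f)\n", gbps, a[0]);

        free(a); free(b); free(c);
        return 0;
    }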
Options:
- Iterations: -I <iters>, --mbiters=<iters> (default: 16)
- Array Size: -I <size>, --mbsize=<size> (default: 4 * L3 cache size)
- Test Type: -m <type>, --memtest=add|copy|scale|triad (default: TRIAD)
Effective Bandwidth Ordered Test (-d beff_o)
Rings of doubling size are formed, starting with rings of two nodes, and messages are passed in one direction around each ring according to rank order.
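A minimal sketch of one ring step is shown below, assuming every rank in MPI_COMM_WORLD belongs to the ring: each rank sends to its next neighbor and receives from its previous one, so data flows in one direction around the ring. Ring construction (the doubling sizes, ordering, and subcommunicators) is omitted; the message size and iteration count echo the documented defaults but are otherwise illustrative.

    /* One-directional ring traffic sketch using MPI_Sendrecv. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const size_t beffs = 32UL << 20;   /* 32 MB, documented default      */
        const int    beffi = 512;          /* documented default iterations  */
        char *sendbuf = malloc(beffs);
        char *recvbuf = malloc(beffs);
        int next = (rank + 1) % size;      /* one direction around the ring  */
        int prev = (rank - 1 + size) % size;

        for (int i = 0; i < beffi; i++)
            MPI_Sendrecv(sendbuf, (int)beffs, MPI_CHAR, next, 0,
                         recvbuf, (int)beffs, MPI_CHAR, prev, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }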
Options:
- Iterations: -e, --beffi=<iters> (default: 512)
- Message Size: -E, --beffs=<size> (default: 32 MB)
- Tolerance: -u <tol>, --btol=<tol> (specify the tolerance; nodes with results worse than max * tolerance are considered ‘bad’)
Effective Bandwidth Random Test (-d beff_or)
Similar to the ordered test, but rings are created randomly.
Options:
- Iterations: -e, --beffi=<iters> (default: 512)
- Message Size: -E, --beffs=<size> (default: 32 MB)
- Tolerance: -u <tol>, --btol=<tol> (specify the tolerance; nodes with results worse than max * tolerance are considered ‘bad’)
GPU Memory Bandwidth Test (-d gpumb)
Measures bandwidth for host-to-GPU and GPU-to-host memory transfers.
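A minimal CUDA-runtime sketch of the host-to-GPU direction is shown below; the GPU-to-host direction only changes the cudaMemcpy kind. The pinned-buffer size and iteration count here are illustrative, not ClusterKit's defaults.

    /* Host-to-GPU transfer bandwidth sketch using CUDA events for timing. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        const size_t bytes = 256UL << 20;   /* illustrative 256 MB buffer */
        const int    iters = 16;
        void *host, *dev;
        cudaMallocHost(&host, bytes);       /* pinned host memory */
        cudaMalloc(&dev, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        for (int i = 0; i < iters; i++)
            cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("host-to-GPU: ~%.2f GB/s\n", iters * bytes / (ms / 1e3) / 1e9);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(dev);
        cudaFreeHost(host);
        return 0;
    }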
Options:
- Iterations: -j, --gpumbi=<iters> (default: 16)
- Message Size: -J, --gpumbs=<size> (default: 0 bytes)
- Tolerance: -u <tol>, --btol=<tol> (specify the tolerance; nodes with results worse than max * tolerance are considered ‘bad’)
GPU Neighbor Latency Test (-d gpu_neighbor_lat)
A restricted variant of the GPU-GPU latency test that measures communication only between GPUs on neighboring nodes.
Options:
- Iterations: -k, --gpulati=<iters> (default: 1024)
- Message Size: -K, --gpulats=<size> (default: 0 bytes)
- Use GPUDIRECT: -G, --gpudirect (use GPUDIRECT; the default is to copy from GPU memory to the host)
GPU Neighbor Bandwidth Test (-d gpu_neighbor_bw)
A restricted variant of the GPU-GPU bandwidth test that measures communication only between GPUs on neighboring nodes.
Options:
- Iterations: -a, --gpubwi=<iters> (default: 64)
- Message Size: -A, --gpubws=<size> (default: 1 MB)
- Use GPUDIRECT: -G, --gpudirect (use GPUDIRECT; the default is to copy from GPU memory to the host)