Microbenchmarks#

The NVIDIA HPC Benchmarks package includes microbenchmarks designed to assess system readiness before running large-scale benchmarks.

Microbenchmarks and their corresponding run scripts are located in the microbenchmarks directory.

NCCL tests#

The NCCL tests microbenchmark checks both the performance and the correctness of NCCL operations. The NCCL tests are built from the NCCL-tests repository.

The NVIDIA HPC Benchmarks package includes the microbenchmarks/nccl_tests.sh script, which provides a convenient wrapper for executing various NCCL communication operations across multiple nodes and GPUs.

The script microbenchmarks/nccl_tests.sh accepts the following parameters:

  • --op <operation> Specifies the NCCL operation to test (Required)

  • --test-params "<params>" Parameters to pass to the underlying NCCL test executable (Optional)

  • --gpu-affinity <string> Colon-separated list of GPU indices (Optional)

  • --cpu-affinity <string> Colon-separated list of CPU index ranges (Optional)

  • --mem-affinity <string> Colon-separated list of memory indices (Optional)

  • --ucx-affinity <string> Colon-separated list of UCX devices (Optional)

  • --ucx-tls <string> UCX transport layer specification (Optional)

Supported NCCL operations:

  • allreduce - NCCL All-Reduce test

  • allgather - NCCL All-Gather test

  • alltoall - NCCL All-to-All test

  • broadcast - NCCL Broadcast test

  • gather - NCCL Gather test

  • hypercube - NCCL Hypercube test

  • reduce - NCCL Reduce test

  • reduce_scatter - NCCL Reduce-Scatter test

  • scatter - NCCL Scatter test

  • sendrecv - NCCL Send/Receive test

Note: The sendrecv operation requires exactly two MPI processes. Using more than two processes will result in an exception.

Examples:

Run NCCL All-Reduce test on 4 nodes with 8 GPUs per node:

srun -N 4 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nccl_tests.sh --op allreduce --test-params "-b 8 -e 128M -f 2 -g 1"

Run NCCL All-Gather test on NVIDIA GB200 NVL72 with custom parameters and affinity settings:

srun -N 2 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nccl_tests.sh --op allgather --test-params "-b 1K -e 1G -f 2" \
    --cpu-affinity 0-35:36-71:72-107:108-143 \
    --mem-affinity 0:0:1:1
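
Run the NCCL Send/Receive test. Because the sendrecv operation requires exactly two MPI processes, the sketch below uses one task on each of two nodes; the test parameters mirror the All-Reduce example above and can be adjusted as needed:

srun -N 2 --ntasks-per-node=1 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nccl_tests.sh --op sendrecv --test-params "-b 8 -e 128M -f 2 -g 1"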

NVSHMEM performance tests#

The NVSHMEM performance tests microbenchmark assesses system performance for NVSHMEM (NVIDIA Symmetric Hierarchical Memory) operations. The NVSHMEM tests are built from the NVSHMEM performance tests package.

The NVIDIA HPC Benchmarks package includes two NVSHMEM test scripts:

  • microbenchmarks/nvshmem_host_tests.sh - NVSHMEM host-initiated operations

  • microbenchmarks/nvshmem_device_tests.sh - NVSHMEM device-initiated operations

Both scripts accept the following parameters:

  • --op <operation> Specifies the NVSHMEM operation to test (Required)

  • --test-params "<params>" Parameters to pass to the underlying NVSHMEM test executable (Optional)

  • --gpu-affinity <string> Colon-separated list of GPU indices (Optional)

  • --cpu-affinity <string> Colon-separated list of CPU index ranges (Optional)

  • --mem-affinity <string> Colon-separated list of memory indices (Optional)

  • --ucx-affinity <string> Colon-separated list of UCX devices (Optional)

  • --ucx-tls <string> UCX transport layer specification (Optional)

NVSHMEM Host Tests Operations:

Collective Operations:
  • alltoall - NVSHMEM Host All-to-All on stream test

  • broadcast - NVSHMEM Host Broadcast on stream test

Point-to-Point Operations:
  • bw - NVSHMEM Host bandwidth test

Note: Point-to-point operations require exactly two MPI processes. Using more than two processes will result in an exception.

NVSHMEM Device Tests Operations:

Collective Operations:
  • alltoall - NVSHMEM All-to-All latency test

  • bcast - NVSHMEM Broadcast latency test

Point-to-Point Operations:
  • get-bw - NVSHMEM Get Bandwidth test

  • put-bw - NVSHMEM Put Bandwidth test

Note: Point-to-point operations require exactly two MPI processes. Using more than two processes will result in an exception.

Examples:

Run NVSHMEM host alltoall test on 4 nodes with 8 GPUs per node:

srun -N 4 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nvshmem_host_tests.sh --op alltoall --test-params "-s 1024 -e 1048576"
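
Run the NVSHMEM host bandwidth test. Because bw is a point-to-point operation, exactly two MPI processes are required; the sketch below uses one task per node, and the message-size parameters are illustrative:

srun -N 2 --ntasks-per-node=1 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nvshmem_host_tests.sh --op bw --test-params "-s 1024 -e 1048576"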

Run NVSHMEM device put bandwidth test on NVIDIA GB200 NVL72 with affinity settings:

srun -N 2 --ntasks-per-node=1 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nvshmem_device_tests.sh --op put-bw --test-params "-s 1 -e 1048576" \
    --cpu-affinity 0-71 \
    --mem-affinity 0

OSU MPI benchmark#

The OSU MPI benchmarks are a comprehensive suite of microbenchmarks designed to evaluate the performance of MPI implementations. The OSU MPI benchmarks are built from the OSU MPI benchmark suite.

The NVIDIA HPC Benchmarks package includes the microbenchmarks/osu_mpi_tests.sh script, which provides a convenient wrapper for executing various OSU MPI benchmark operations across multiple nodes and processes.

The script microbenchmarks/osu_mpi_tests.sh accepts the following parameters:

  • --op <operation> Specifies the OSU MPI operation to test (Required)

  • --test-params "<params>" Parameters to pass to the underlying OSU MPI test executable (Optional)

  • --gpu-affinity <string> Colon-separated list of GPU indices (Optional)

  • --cpu-affinity <string> Colon-separated list of CPU index ranges (Optional)

  • --mem-affinity <string> Colon-separated list of memory indices (Optional)

  • --ucx-affinity <string> Colon-separated list of UCX devices (Optional)

  • --ucx-tls <string> UCX transport layer specification (Optional)

Supported OSU MPI Benchmark Operations:

Collective Operations:
  • allgather - OSU MPI Allgather test

  • allgatherv - OSU MPI Allgatherv test

  • allreduce - OSU MPI Allreduce test

  • alltoall - OSU MPI Alltoall test

  • alltoallv - OSU MPI Alltoallv test

  • alltoallw - OSU MPI Alltoallw test

  • barrier - OSU MPI Barrier test

  • bcast - OSU MPI Bcast test

  • gather - OSU MPI Gather test

  • gatherv - OSU MPI Gatherv test

  • iallgather - OSU MPI Iallgather test

  • iallgatherv - OSU MPI Iallgatherv test

  • ialltoall - OSU MPI Ialltoall test

  • ialltoallv - OSU MPI Ialltoallv test

  • ialltoallw - OSU MPI Ialltoallw test

  • ibarrier - OSU MPI Ibarrier test

  • ibcast - OSU MPI Ibcast test

  • igather - OSU MPI Igather test

  • igatherv - OSU MPI Igatherv test

  • ineighbor-allgather - OSU MPI Ineighbor_allgather test

  • ineighbor-allgatherv - OSU MPI Ineighbor_allgatherv test

  • ineighbor-alltoall - OSU MPI Ineighbor_alltoall test

  • ineighbor-alltoallv - OSU MPI Ineighbor_alltoallv test

  • ineighbor-alltoallw - OSU MPI Ineighbor_alltoallw test

  • ireduce - OSU MPI Ireduce test

  • ireduce-scatter - OSU MPI Ireduce_scatter test

  • ireduce-scatter-block - OSU MPI Ireduce_scatter_block test

  • iscatter - OSU MPI Iscatter test

  • iscatterv - OSU MPI Iscatterv test

  • neighbor-allgather - OSU MPI Neighbor_allgather test

  • neighbor-allgatherv - OSU MPI Neighbor_allgatherv test

  • neighbor-alltoall - OSU MPI Neighbor_alltoall test

  • neighbor-alltoallv - OSU MPI Neighbor_alltoallv test

  • neighbor-alltoallw - OSU MPI Neighbor_alltoallw test

  • reduce - OSU MPI Reduce test

  • reduce-scatter - OSU MPI Reduce_scatter test

  • reduce-scatter-block - OSU MPI Reduce_scatter_block test

  • scatter - OSU MPI Scatter test

  • scatterv - OSU MPI Scatterv test

Point-to-Point Operations:
  • bibw - OSU MPI Bidirectional Bandwidth test

  • bw - OSU MPI Bandwidth test

  • latency - OSU MPI Latency test

  • latency-mp - OSU MPI Latency Multi-Pair test

  • latency-mt - OSU MPI Latency Multi-Thread test

  • mbw-mr - OSU MPI Multiple Bandwidth / Multiple Round test

  • multi-lat - OSU MPI Multi-pair Latency test

Note: Point-to-point operations require exactly two MPI processes. Using more than two processes will result in an exception.

Congestion Tests:
  • bw-fan-in - OSU MPI Bandwidth Fan-in test

  • bw-fan-out - OSU MPI Bandwidth Fan-out test

One-Sided Operations:
  • acc-latency - OSU MPI Accumulate Latency test

  • cas-latency - OSU MPI Compare-and-Swap Latency test

  • fop-latency - OSU MPI Fetch-and-Op Latency test

  • get-acc-latency - OSU MPI Get-Accumulate Latency test

  • get-bw - OSU MPI Get Bandwidth test

  • get-latency - OSU MPI Get Latency test

  • put-bibw - OSU MPI Put Bidirectional Bandwidth test

  • put-bw - OSU MPI Put Bandwidth test

  • put-latency - OSU MPI Put Latency test

Examples:

Run OSU MPI Allreduce test on 4 nodes with 8 processes per node:

srun -N 4 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/osu_mpi_tests.sh --op allreduce --test-params "-m 1024:1048576"

Run OSU MPI Allgather test on NVIDIA GB200 NVL72 with affinity settings:

srun -N 2 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/osu_mpi_tests.sh --op allgather --test-params "-m 1024:1048576" \
    --cpu-affinity 0-35:36-71:72-107:108-143 \
    --mem-affinity 0:0:1:1
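
Run the OSU MPI Latency test. Because latency is a point-to-point operation, exactly two MPI processes are required; the sketch below uses one task per node, and the message-size range is illustrative:

srun -N 2 --ntasks-per-node=1 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/osu_mpi_tests.sh --op latency --test-params "-m 1024:1048576"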

GEMM (matrix-matrix multiplication) benchmark#

The GEMM benchmark evaluates GPU performance for matrix-matrix multiplication kernels used in the NVIDIA HPL benchmark. It operates in two modes:
  • Performance measurement only (gemm executable parameter --mode=perf)

  • Performance measurement with parallel correctness verification (gemm executable parameter --mode=check)

The NVIDIA HPC Benchmarks package includes the microbenchmarks/gemm_tests.sh script, which provides a convenient wrapper for executing the GEMM benchmark across multiple nodes and GPUs.

The script gemm_tests.sh accepts the following parameters:

  • --gpu-affinity <string> Colon-separated list of GPU indices (Optional)

  • --cpu-affinity <string> Colon-separated list of CPU index ranges (Optional)

  • --mem-affinity <string> Colon-separated list of memory indices (Optional)

  • --ucx-affinity <string> Colon-separated list of UCX devices (Optional)

  • --params "<list>" Parameters to pass to the gemm executable

The GEMM benchmark executable gemm, located in the gemm_tests directory, accepts the following parameters:

  • --mode=VAL Set the benchmark mode: performance (perf) or correctness (check)

  • --type=VAL Set the GEMM type (fp64 / fp64emu), default fp64. fp64emu is supported on NVIDIA Blackwell-architecture GPUs or newer

  • --mantissa-bits=VAL Set number of mantissa bits used in FP64 emulation GEMM, default 53

  • --m=VAL Set m dimension, default 65536

  • --n=VAL Set n dimension, default 16384

  • --k=VAL Set k dimension, default 1024

  • --lda=VAL Set leading dimension of matrix A

  • --ldb=VAL Set leading dimension of matrix B

  • --ldc=VAL Set leading dimension of matrix C

  • --zero-beta Use beta equal to 0 in GEMM operation C = A * B + beta * C

  • --sec=VAL Test duration in seconds

  • --tol=VAL Tolerance value, default 1e-14

  • --iter=VAL Number of iterations

  • --warmup=VAL Warmup iterations, default 32

  • --verbose Print every GEMM call

  • --help Show this help

Additionally, the NVIDIA HPC Benchmarks package includes the utility script gemm_parse_results.sh for analyzing GEMM benchmark output. The script accepts the following parameters:

  • --method Whether to use min or mean to calculate statistics (default mean)

  • --list Print sorted list of ranks by performance

  • --hist Print a histogram of performance values

  • --hist-bin-size VAL Granularity of histogram bins (default 200)

  • --drops Print data points with a significant drop in performance

  • --correctness Print any correctness errors (requires running with ./gemm --mode=check)

  • --stats Print min/max/mean/median of the performance values

  • --help Print help

  • <list of input files> The list of files generated by the GEMM microbenchmark

Examples:

Run the GEMM benchmark in performance mode on 8 nodes with 8 GPUs per node:

srun -N 8 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix --output="gemm-perf-%J-%N-%t.out" \
    ./microbenchmarks/gemm_tests.sh --params "--mode=perf --iter=128 --m=4096 --n=8192 --k=2048 --verbose"

Run the GEMM benchmark in correctness mode on each GPU of NVIDIA GB200 NVL72 for 60 seconds:

srun -N 18 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix --output="gemm-check-%J-%N-%t.out" \
    ./microbenchmarks/gemm_tests.sh --params "--mode=check --sec=60 --m=4096 --n=8192 --k=2048 --verbose" \
    --cpu-affinity 0-35:36-71:72-107:108-143 \
    --mem-affinity 0:0:1:1 \
    --gpu-affinity 0:1:2:3

Here, --cpu-affinity maps processes to specific cores on the local node, --mem-affinity maps them to NUMA nodes on the same node, and --gpu-affinity assigns each process a GPU index.
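
Run the GEMM benchmark using FP64 emulation (a sketch; fp64emu is supported only on NVIDIA Blackwell-architecture GPUs or newer, and the node, task, and matrix-size values shown here are illustrative):

srun -N 1 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix --output="gemm-fp64emu-%J-%N-%t.out" \
    ./microbenchmarks/gemm_tests.sh --params "--mode=perf --type=fp64emu --mantissa-bits=53 --iter=128 --m=4096 --n=8192 --k=2048"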

Examples of analyzing GEMM benchmark output:

./microbenchmarks/gemm_parse_results.sh --hist gemm-check-*

./microbenchmarks/gemm_parse_results.sh --correctness gemm-check-*
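
Print summary statistics (min/max/mean/median) of the collected performance values, for example from the performance-mode run above:

./microbenchmarks/gemm_parse_results.sh --stats gemm-perf-*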