Microbenchmarks#
The NVIDIA HPC Benchmarks package includes microbenchmarks designed to assess system readiness before running large-scale benchmarks:
- NCCL tests
- NVSHMEM performance tests
- OSU MPI benchmark
- GEMM (matrix-matrix multiplication) benchmark
Microbenchmarks and their corresponding run scripts are located in the microbenchmarks directory.
NCCL tests#
The NCCL tests microbenchmark checks both the performance and the correctness of NCCL operations. The NCCL tests are built from the NCCL-tests repository.
The NVIDIA HPC Benchmarks package includes the microbenchmarks/nccl_tests.sh script, which provides a convenient wrapper for executing various NCCL communication operations across multiple nodes and GPUs.
The script microbenchmarks/nccl_tests.sh accepts the following parameters:
--op <operation> - Specifies the NCCL operation to test (Required)
--test-params "<params>" - Parameters to pass to the underlying NCCL test executable (Optional)
--gpu-affinity <string> - Colon-separated list of GPU indices (Optional)
--cpu-affinity <string> - Colon-separated list of CPU index ranges (Optional)
--mem-affinity <string> - Colon-separated list of memory indices (Optional)
--ucx-affinity <string> - Colon-separated list of UCX devices (Optional)
--ucx-tls <string> - UCX transport layer specification (Optional)
Supported NCCL operations:
allreduce - NCCL All-Reduce test
allgather - NCCL All-Gather test
alltoall - NCCL All-to-All test
broadcast - NCCL Broadcast test
gather - NCCL Gather test
hypercube - NCCL Hypercube test
reduce - NCCL Reduce test
reduce_scatter - NCCL Reduce-Scatter test
scatter - NCCL Scatter test
sendrecv - NCCL Send/Receive test
Note: The sendrecv operation requires exactly two MPI processes. Using more than two processes will result in an exception (see the sendrecv example below).
Examples:
Run NCCL All-Reduce test on 4 nodes with 8 GPUs per node:
srun -N 4 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nccl_tests.sh --op allreduce --test-params "-b 8 -e 128M -f 2 -g 1"
Run NCCL All-Gather test on NVIDIA GB200 NVL72 with custom parameters and affinity settings:
srun -N 2 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nccl_tests.sh --op allgather --test-params "-b 1K -e 1G -f 2" \
    --cpu-affinity 0-35:36-71:72-107:108-143 \
    --mem-affinity 0:0:1:1
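Run NCCL Send/Receive test between exactly two processes (a minimal sketch, assuming the same Slurm/PMIx launch style as the examples above; the --test-params message-size sweep is illustrative):
srun -N 2 --ntasks-per-node=1 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nccl_tests.sh --op sendrecv --test-params "-b 8 -e 128M -f 2 -g 1"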
NVSHMEM performance tests#
The NVSHMEM performance tests microbenchmark assesses system performance for NVSHMEM (NVIDIA Symmetric Hierarchical Memory) operations. The NVSHMEM tests are built from the NVSHMEM performance tests package.
The NVIDIA HPC Benchmarks package includes two NVSHMEM test scripts:
microbenchmarks/nvshmem_host_tests.sh - NVSHMEM host-initiated operations
microbenchmarks/nvshmem_device_tests.sh - NVSHMEM device-initiated operations
Both scripts accept the following parameters:
--op <operation> - Specifies the NVSHMEM operation to test (Required)
--test-params "<params>" - Parameters to pass to the underlying NVSHMEM test executable (Optional)
--gpu-affinity <string> - Colon-separated list of GPU indices (Optional)
--cpu-affinity <string> - Colon-separated list of CPU index ranges (Optional)
--mem-affinity <string> - Colon-separated list of memory indices (Optional)
--ucx-affinity <string> - Colon-separated list of UCX devices (Optional)
--ucx-tls <string> - UCX transport layer specification (Optional)
NVSHMEM Host Tests Operations:
- Collective Operations:
  alltoall - NVSHMEM Host All-to-All on stream test
  broadcast - NVSHMEM Host Broadcast on stream test
- Point-to-Point Operations:
  bw - NVSHMEM Host bandwidth test
Note: Point-to-point operations require exactly two MPI processes. Using more than two processes will result in an exception (see the host bandwidth example below).
NVSHMEM Device Tests Operations:
- Collective Operations:
  alltoall - NVSHMEM All-to-All latency test
  bcast - NVSHMEM Broadcast latency test
- Point-to-Point Operations:
  get-bw - NVSHMEM Get Bandwidth test
  put-bw - NVSHMEM Put Bandwidth test
Note: Point-to-point operations require exactly two MPI processes. Using more than two processes will result in an exception.
Examples:
Run NVSHMEM host alltoall test on 4 nodes with 8 GPUs per node:
srun -N 4 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nvshmem_host_tests.sh --op alltoall --test-params "-s 1024 -e 1048576"
Run NVSHMEM device put bandwidth test on NVIDIA GB200 NVL72 with affinity settings:
srun -N 2 --ntasks-per-node=1 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nvshmem_device_tests.sh --op put-bw --test-params "-s 1 -e 1048576" \
    --cpu-affinity 0-71 \
    --mem-affinity 0
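Run NVSHMEM host bandwidth test between exactly two processes, as point-to-point operations require (a minimal sketch, assuming the same launch style as above; the --test-params message-size range is carried over from the alltoall example and may need adjusting for the bw executable):
srun -N 2 --ntasks-per-node=1 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nvshmem_host_tests.sh --op bw --test-params "-s 1024 -e 1048576"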
OSU MPI benchmark#
The OSU MPI benchmarks are a comprehensive suite of micro-benchmarks designed to evaluate the performance of MPI implementations. The OSU MPI benchmarks are built from the OSU MPI benchmark suite.
The NVIDIA HPC Benchmarks package includes the microbenchmarks/osu_mpi_tests.sh script, which provides a convenient wrapper for executing various OSU MPI benchmark operations across multiple nodes and processes.
The script microbenchmarks/osu_mpi_tests.sh accepts the following parameters:
--op <operation> - Specifies the OSU MPI operation to test (Required)
--test-params "<params>" - Parameters to pass to the underlying OSU MPI test executable (Optional)
--gpu-affinity <string> - Colon-separated list of GPU indices (Optional)
--cpu-affinity <string> - Colon-separated list of CPU index ranges (Optional)
--mem-affinity <string> - Colon-separated list of memory indices (Optional)
--ucx-affinity <string> - Colon-separated list of UCX devices (Optional)
--ucx-tls <string> - UCX transport layer specification (Optional)
Supported OSU MPI Benchmark Operations:
- Collective Operations:
  allgather - OSU MPI Allgather test
  allgatherv - OSU MPI Allgatherv test
  allreduce - OSU MPI Allreduce test
  alltoall - OSU MPI Alltoall test
  alltoallv - OSU MPI Alltoallv test
  alltoallw - OSU MPI Alltoallw test
  barrier - OSU MPI Barrier test
  bcast - OSU MPI Bcast test
  gather - OSU MPI Gather test
  gatherv - OSU MPI Gatherv test
  iallgather - OSU MPI Iallgather test
  iallgatherv - OSU MPI Iallgatherv test
  ialltoall - OSU MPI Ialltoall test
  ialltoallv - OSU MPI Ialltoallv test
  ialltoallw - OSU MPI Ialltoallw test
  ibarrier - OSU MPI Ibarrier test
  ibcast - OSU MPI Ibcast test
  igather - OSU MPI Igather test
  igatherv - OSU MPI Igatherv test
  ineighbor-allgather - OSU MPI Ineighbor_allgather test
  ineighbor-allgatherv - OSU MPI Ineighbor_allgatherv test
  ineighbor-alltoall - OSU MPI Ineighbor_alltoall test
  ineighbor-alltoallv - OSU MPI Ineighbor_alltoallv test
  ineighbor-alltoallw - OSU MPI Ineighbor_alltoallw test
  ireduce - OSU MPI Ireduce test
  ireduce-scatter - OSU MPI Ireduce_scatter test
  ireduce-scatter-block - OSU MPI Ireduce_scatter_block test
  iscatter - OSU MPI Iscatter test
  iscatterv - OSU MPI Iscatterv test
  neighbor-allgather - OSU MPI Neighbor_allgather test
  neighbor-allgatherv - OSU MPI Neighbor_allgatherv test
  neighbor-alltoall - OSU MPI Neighbor_alltoall test
  neighbor-alltoallv - OSU MPI Neighbor_alltoallv test
  neighbor-alltoallw - OSU MPI Neighbor_alltoallw test
  reduce - OSU MPI Reduce test
  reduce-scatter - OSU MPI Reduce_scatter test
  reduce-scatter-block - OSU MPI Reduce_scatter_block test
  scatter - OSU MPI Scatter test
  scatterv - OSU MPI Scatterv test
- Point-to-Point Operations:
  bibw - OSU MPI Bidirectional Bandwidth test
  bw - OSU MPI Bandwidth test
  latency - OSU MPI Latency test
  latency-mp - OSU MPI Latency Multi-Pair test
  latency-mt - OSU MPI Latency Multi-Thread test
  mbw-mr - OSU MPI Multiple Bandwidth / Multiple Round test
  multi-lat - OSU MPI Multi-pair Latency test
Note: Point-to-point operations require exactly two MPI processes. Using more than two processes will result in an exception (see the latency example below).
- Congestion Tests:
  bw-fan-in - OSU MPI Bandwidth Fan-in test
  bw-fan-out - OSU MPI Bandwidth Fan-out test
- One-Sided Operations:
  acc-latency - OSU MPI Accumulate Latency test
  cas-latency - OSU MPI Compare-and-Swap Latency test
  fop-latency - OSU MPI Fetch-and-Op Latency test
  get-acc-latency - OSU MPI Get-Accumulate Latency test
  get-bw - OSU MPI Get Bandwidth test
  get-latency - OSU MPI Get Latency test
  put-bibw - OSU MPI Put Bidirectional Bandwidth test
  put-bw - OSU MPI Put Bandwidth test
  put-latency - OSU MPI Put Latency test
Examples:
Run OSU MPI Allreduce test on 4 nodes with 8 processes per node:
srun -N 4 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/osu_mpi_tests.sh --op allreduce --test-params "-m 1024:1048576"
Run OSU MPI Allgather test on NVIDIA GB200 NVL72 with affinity settings:
srun -N 2 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/osu_mpi_tests.sh --op allgather --test-params "-m 1024:1048576" \
    --cpu-affinity 0-35:36-71:72-107:108-143 \
    --mem-affinity 0:0:1:1
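Run OSU MPI Latency test between exactly two processes, as point-to-point operations require (a minimal sketch, assuming the same Slurm/PMIx launch style as above; the -m message-size range is illustrative):
srun -N 2 --ntasks-per-node=1 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/osu_mpi_tests.sh --op latency --test-params "-m 1024:1048576"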
GEMM (matrix-matrix multiplication) benchmark#
The GEMM benchmark evaluates GPU performance for matrix-matrix multiplication kernels used in the NVIDIA HPL benchmark. It operates in two modes:
- Performance measurement only (gemm executable parameter --mode=perf)
- Performance measurement with parallel correctness verification (gemm executable parameter --mode=check)
The NVIDIA HPC Benchmarks package includes the gemm_tests.sh script, which can be used to run the GEMM benchmark across multiple nodes and GPUs.
The script gemm_tests.sh accepts the following parameters:
--gpu-affinity <string> - Colon-separated list of GPU indices (Optional)
--cpu-affinity <string> - Colon-separated list of CPU index ranges (Optional)
--mem-affinity <string> - Colon-separated list of memory indices (Optional)
--ucx-affinity <string> - Colon-separated list of UCX devices (Optional)
--params "<list>" - The list of parameters to pass to the gemm executable
The GEMM benchmark executable gemm, located in the gemm_tests directory, accepts the following parameters:
--mode=VAL - Set the benchmark mode: performance (perf) or correctness (check)
--type=VAL - Set the GEMM type (fp64/fp64emu), default fp64; fp64emu is supported on NVIDIA Blackwell-architecture GPUs or newer
--mantissa-bits=VAL - Set the number of mantissa bits used in FP64 emulation GEMM, default 53
--m=VAL - Set the m dimension, default 65536
--n=VAL - Set the n dimension, default 16384
--k=VAL - Set the k dimension, default 1024
--lda=VAL - Set the leading dimension of matrix A
--ldb=VAL - Set the leading dimension of matrix B
--ldc=VAL - Set the leading dimension of matrix C
--zero-beta - Use beta equal to 0 in the GEMM operation C = A * B + beta * C
--sec=VAL - Test duration in seconds
--tol=VAL - Tolerance value, default 1e-14
--iter=VAL - Number of iterations
--warmup=VAL - Number of warmup iterations, default 32
--verbose - Print every GEMM call
--help - Show this help
Additionally, the NVIDIA HPC Benchmarks package includes the utility script gemm_parse_results.sh for analyzing GEMM benchmark output. The script accepts the following parameters:
--method - Whether to use min or mean to calculate statistics (default mean)
--list - Print a sorted list of ranks by performance
--hist - Print a histogram of performance
--hist-bin-size VAL - Granularity of histogram bins (default 200)
--drops - Print data points with a significant drop in performance
--correctness - Print any correctness errors (requires running with ./gemm --mode=check)
--stats - Print min/max/mean/median of the performance values
--help - Print help
<list of input files> - The list of files generated by the GEMM microbenchmark
Examples:
Run the GEMM benchmark in performance mode on 8 nodes with 8 GPUs per node:
srun -N 8 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix --output="gemm-perf-%J-%N-%t.out" \
    ./microbenchmarks/gemm_tests.sh --params "--mode=perf --iter=128 --m=4096 --n=8192 --k=2048 --verbose"
Run the GEMM benchmark on each GPU of NVIDIA GB200 NVL72 for 60 seconds:
srun -N 18 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix --output="gemm-check-%J-%N-%t.out" \
    ./microbenchmarks/gemm_tests.sh --params "--mode=check --sec=60 --m=4096 --n=8192 --k=2048 --verbose" --cpu-affinity 0-35:36-71:72-107:108-143 \
    --mem-affinity 0:0:1:1 --gpu-affinity 0:1:2:3
Here, --cpu-affinity maps processes to specific cores on the local node, while --mem-affinity maps them to NUMA nodes on the same node.
Example of analyzing GEMM benchmark output:
./microbenchmarks/gemm_parse_results.sh --hist gemm-check-*
./microbenchmarks/gemm_parse_results.sh --correctness gemm-check-*
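Run the GEMM benchmark in FP64 emulation mode (a minimal sketch using the --type and --mantissa-bits parameters documented above; fp64emu is supported only on NVIDIA Blackwell-architecture GPUs or newer, and the node count, problem sizes, and output file pattern are illustrative):
srun -N 2 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix --output="gemm-emu-%J-%N-%t.out" \
    ./microbenchmarks/gemm_tests.sh --params "--mode=perf --type=fp64emu --mantissa-bits=53 --iter=128 --m=4096 --n=8192 --k=2048"
The resulting per-rank output files can then be summarized with, for example, ./microbenchmarks/gemm_parse_results.sh --stats gemm-emu-*.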