Microbenchmarks#

The NVIDIA HPC Benchmarks package includes microbenchmarks designed to assess system readiness before running large-scale benchmarks.

Microbenchmarks and their corresponding run scripts are located in the microbenchmarks directory.

NCCL tests#

The NCCL tests microbenchmark checks both the performance and the correctness of NCCL operations. The NCCL tests are built from the NCCL-tests repository.

The NVIDIA HPC Benchmarks package includes the microbenchmarks/nccl_tests.sh script, which provides a convenient wrapper for executing various NCCL communication operations across multiple nodes and GPUs.

The script microbenchmarks/nccl_tests.sh accepts the following parameters:

  • --op <operation> Specifies the NCCL operation to test (Required)

  • --test-params "<params>" Parameters to pass to the underlying NCCL test executable (Optional)

  • --gpu-affinity <string> Colon-separated list of GPU indices (Optional)

  • --cpu-affinity <string> Colon-separated list of CPU index ranges (Optional)

  • --mem-affinity <string> Colon-separated list of memory indices (Optional)

  • --ucx-affinity <string> Colon-separated list of UCX devices (Optional)

  • --ucx-tls <string> UCX transport layer specification (Optional)

Supported NCCL operations:

  • allreduce - NCCL All-Reduce test

  • allgather - NCCL All-Gather test

  • alltoall - NCCL All-to-All test

  • broadcast - NCCL Broadcast test

  • gather - NCCL Gather test

  • hypercube - NCCL Hypercube test

  • reduce - NCCL Reduce test

  • reduce_scatter - NCCL Reduce-Scatter test

  • scatter - NCCL Scatter test

  • sendrecv - NCCL Send/Receive test

Note: The sendrecv operation requires exactly two MPI processes. Using more than two processes will result in an exception.

Examples:

Run NCCL All-Reduce test on 4 nodes with 8 GPUs per node:

srun -N 4 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nccl_tests.sh --op allreduce --test-params "-b 8 -e 128M -f 2 -g 1"

Run NCCL All-Gather test on NVIDIA GB200 NVL72 with custom parameters and affinity settings:

srun -N 2 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nccl_tests.sh --op allgather --test-params "-b 1K -e 1G -f 2" \
    --cpu-affinity 0-35:36-71:72-107:108-143 \
    --mem-affinity 0:0:1:1
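
Run the NCCL Send/Receive test. Because the sendrecv operation requires exactly two MPI processes, the sketch below uses one task on each of two nodes; the test parameters mirror the All-Reduce example above and can be adjusted as needed:

srun -N 2 --ntasks-per-node=1 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nccl_tests.sh --op sendrecv --test-params "-b 8 -e 128M -f 2 -g 1"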

NVSHMEM performance tests#

The NVSHMEM performance tests microbenchmark assesses system performance for NVSHMEM (NVIDIA Symmetric Hierarchical Memory) operations. The NVSHMEM tests are built from the NVSHMEM performance tests package.

The NVIDIA HPC Benchmarks package includes two NVSHMEM test scripts:

  • microbenchmarks/nvshmem_host_tests.sh - NVSHMEM host-initiated operations

  • microbenchmarks/nvshmem_device_tests.sh - NVSHMEM device-initiated operations

Both scripts accept the following parameters:

  • --op <operation> Specifies the NVSHMEM operation to test (Required)

  • --test-params "<params>" Parameters to pass to the underlying NVSHMEM test executable (Optional)

  • --gpu-affinity <string> Colon-separated list of GPU indices (Optional)

  • --cpu-affinity <string> Colon-separated list of CPU index ranges (Optional)

  • --mem-affinity <string> Colon-separated list of memory indices (Optional)

  • --ucx-affinity <string> Colon-separated list of UCX devices (Optional)

  • --ucx-tls <string> UCX transport layer specification (Optional)

NVSHMEM Host Tests Operations:

Collective Operations:
  • alltoall - NVSHMEM Host All-to-All on stream test

  • broadcast - NVSHMEM Host Broadcast on stream test

Point-to-Point Operations:
  • bw - NVSHMEM Host bandwidth test

Note: Point-to-point operations require exactly two MPI processes. Using more than two processes will result in an exception.

NVSHMEM Device Tests Operations:

Collective Operations:
  • alltoall - NVSHMEM All-to-All latency test

  • bcast - NVSHMEM Broadcast latency test

Point-to-Point Operations:
  • get-bw - NVSHMEM Get Bandwidth test

  • put-bw - NVSHMEM Put Bandwidth test

Note: Point-to-point operations require exactly two MPI processes. Using more than two processes will result in an exception.

Examples:

Run NVSHMEM host alltoall test on 4 nodes with 8 GPUs per node:

srun -N 4 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nvshmem_host_tests.sh --op alltoall --test-params "-s 1024 -e 1048576"
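
Run the NVSHMEM host bandwidth test. Because bw is a point-to-point operation, exactly two MPI processes are required; the sketch below uses one task per node, and the message-size parameters are illustrative:

srun -N 2 --ntasks-per-node=1 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nvshmem_host_tests.sh --op bw --test-params "-s 1024 -e 1048576"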

Run NVSHMEM device put bandwidth test on NVIDIA GB200 NVL72 with affinity settings:

srun -N 2 --ntasks-per-node=1 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/nvshmem_device_tests.sh --op put-bw --test-params "-s 1 -e 1048576" \
    --cpu-affinity 0-71 \
    --mem-affinity 0

OSU MPI benchmark#

The OSU MPI benchmarks are a comprehensive suite of microbenchmarks designed to evaluate the performance of MPI implementations. The OSU MPI benchmarks are built from the OSU MPI benchmark suite.

The NVIDIA HPC Benchmarks package includes the microbenchmarks/osu_mpi_tests.sh script, which provides a convenient wrapper for executing various OSU MPI benchmark operations across multiple nodes and processes.

The script microbenchmarks/osu_mpi_tests.sh accepts the following parameters:

  • --op <operation> Specifies the OSU MPI operation to test (Required)

  • --test-params "<params>" Parameters to pass to the underlying OSU MPI test executable (Optional)

  • --gpu-affinity <string> Colon-separated list of GPU indices (Optional)

  • --cpu-affinity <string> Colon-separated list of CPU index ranges (Optional)

  • --mem-affinity <string> Colon-separated list of memory indices (Optional)

  • --ucx-affinity <string> Colon-separated list of UCX devices (Optional)

  • --ucx-tls <string> UCX transport layer specification (Optional)

Supported OSU MPI Benchmark Operations:

Collective Operations:
  • allgather - OSU MPI Allgather test

  • allgatherv - OSU MPI Allgatherv test

  • allreduce - OSU MPI Allreduce test

  • alltoall - OSU MPI Alltoall test

  • alltoallv - OSU MPI Alltoallv test

  • alltoallw - OSU MPI Alltoallw test

  • barrier - OSU MPI Barrier test

  • bcast - OSU MPI Bcast test

  • gather - OSU MPI Gather test

  • gatherv - OSU MPI Gatherv test

  • iallgather - OSU MPI Iallgather test

  • iallgatherv - OSU MPI Iallgatherv test

  • ialltoall - OSU MPI Ialltoall test

  • ialltoallv - OSU MPI Ialltoallv test

  • ialltoallw - OSU MPI Ialltoallw test

  • ibarrier - OSU MPI Ibarrier test

  • ibcast - OSU MPI Ibcast test

  • igather - OSU MPI Igather test

  • igatherv - OSU MPI Igatherv test

  • ineighbor-allgather - OSU MPI Ineighbor_allgather test

  • ineighbor-allgatherv - OSU MPI Ineighbor_allgatherv test

  • ineighbor-alltoall - OSU MPI Ineighbor_alltoall test

  • ineighbor-alltoallv - OSU MPI Ineighbor_alltoallv test

  • ineighbor-alltoallw - OSU MPI Ineighbor_alltoallw test

  • ireduce - OSU MPI Ireduce test

  • ireduce-scatter - OSU MPI Ireduce_scatter test

  • ireduce-scatter-block - OSU MPI Ireduce_scatter_block test

  • iscatter - OSU MPI Iscatter test

  • iscatterv - OSU MPI Iscatterv test

  • neighbor-allgather - OSU MPI Neighbor_allgather test

  • neighbor-allgatherv - OSU MPI Neighbor_allgatherv test

  • neighbor-alltoall - OSU MPI Neighbor_alltoall test

  • neighbor-alltoallv - OSU MPI Neighbor_alltoallv test

  • neighbor-alltoallw - OSU MPI Neighbor_alltoallw test

  • reduce - OSU MPI Reduce test

  • reduce-scatter - OSU MPI Reduce_scatter test

  • reduce-scatter-block - OSU MPI Reduce_scatter_block test

  • scatter - OSU MPI Scatter test

  • scatterv - OSU MPI Scatterv test

Point-to-Point Operations:
  • bibw - OSU MPI Bidirectional Bandwidth test

  • bw - OSU MPI Bandwidth test

  • latency - OSU MPI Latency test

  • latency-mp - OSU MPI Latency Multi-Pair test

  • latency-mt - OSU MPI Latency Multi-Thread test

  • mbw-mr - OSU MPI Multiple Bandwidth / Multiple Round test

  • multi-lat - OSU MPI Multi-pair Latency test

Note: Point-to-point operations require exactly two MPI processes. Using more than two processes will result in an exception.

Congestion Tests:
  • bw-fan-in - OSU MPI Bandwidth Fan-in test

  • bw-fan-out - OSU MPI Bandwidth Fan-out test

One-Sided Operations:
  • acc-latency - OSU MPI Accumulate Latency test

  • cas-latency - OSU MPI Compare-and-Swap Latency test

  • fop-latency - OSU MPI Fetch-and-Op Latency test

  • get-acc-latency - OSU MPI Get-Accumulate Latency test

  • get-bw - OSU MPI Get Bandwidth test

  • get-latency - OSU MPI Get Latency test

  • put-bibw - OSU MPI Put Bidirectional Bandwidth test

  • put-bw - OSU MPI Put Bandwidth test

  • put-latency - OSU MPI Put Latency test

Examples:

Run OSU MPI Allreduce test on 4 nodes with 8 processes per node:

srun -N 4 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/osu_mpi_tests.sh --op allreduce --test-params "-m 1024:1048576"

Run OSU MPI Allgather test on NVIDIA GB200 NVL72 with affinity settings:

srun -N 2 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/osu_mpi_tests.sh --op allgather --test-params "-m 1024:1048576" \
    --cpu-affinity 0-35:36-71:72-107:108-143 \
    --mem-affinity 0:0:1:1
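
Run the OSU MPI Latency test. Because latency is a point-to-point operation, exactly two MPI processes are required; the sketch below uses one task per node, and the message-size range is illustrative:

srun -N 2 --ntasks-per-node=1 --cpu-bind=none --mpi=pmix \
    ./microbenchmarks/osu_mpi_tests.sh --op latency --test-params "-m 1024:1048576"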

GEMM (matrix-matrix multiplication) benchmark#

The GEMM benchmark evaluates GPU performance for matrix-matrix multiplication kernels used in the NVIDIA HPL benchmark. It operates in two modes:
  • Performance measurement only (gemm executable parameter --mode=perf)

  • Performance measurement with parallel correctness verification (gemm executable parameter --mode=check)

The NVIDIA HPC Benchmarks package includes the microbenchmarks/gemm_tests.sh script, which provides a convenient wrapper for executing the GEMM benchmark across multiple nodes and GPUs.

The script gemm_tests.sh accepts the following parameters:

  • --gpu-affinity <string> Colon-separated list of GPU indices (Optional)

  • --cpu-affinity <string> Colon-separated list of CPU index ranges (Optional)

  • --mem-affinity <string> Colon-separated list of memory indices (Optional)

  • --ucx-affinity <string> Colon-separated list of UCX devices (Optional)

  • --params "<list>" Parameters to pass to the gemm executable

The GEMM benchmark executable gemm, located in the gemm_tests directory, accepts the following parameters:

  • --mode=VAL Set the benchmark mode: performance (perf) or correctness (check)

  • --type=VAL Set the GEMM type (fp64 / fp64emu), default fp64. fp64emu is supported on NVIDIA Blackwell-architecture GPUs or newer

  • --mantissa-bits=VAL Set number of mantissa bits used in FP64 emulation GEMM, default 53

  • --m=VAL Set m dimension, default 65536

  • --n=VAL Set n dimension, default 16384

  • --k=VAL Set k dimension, default 1024

  • --lda=VAL Set leading dimension of matrix A

  • --ldb=VAL Set leading dimension of matrix B

  • --ldc=VAL Set leading dimension of matrix C

  • --zero-beta Use beta equal to 0 in GEMM operation C = A * B + beta * C

  • --sec=VAL Test duration in seconds

  • --tol=VAL Tolerance value, default 1e-14

  • --iter=VAL Number of iterations

  • --warmup=VAL Warmup iterations, default 32

  • --verbose Print every GEMM call

  • --help Show this help

Additionally, the NVIDIA HPC Benchmarks package includes the utility script gemm_parse_results.sh for analyzing GEMM benchmark output. The script accepts the following parameters:

  • --method Whether to use min or mean to calculate statistics (default mean)

  • --list Print sorted list of ranks by performance

  • --hist Print a histogram of performance values

  • --hist-bin-size VAL Granularity of histogram bins (default 200)

  • --drops Print data points with a significant drop in performance

  • --correctness Print any correctness errors (requires running with ./gemm --mode=check)

  • --stats Print min/max/mean/median of the performance values

  • --help Print help

  • <list of input files> The list of files generated by the GEMM microbenchmark

Examples:

Run the GEMM benchmark in performance mode on 8 nodes with 8 GPUs per node:

srun -N 8 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix --output="gemm-perf-%J-%N-%t.out" \
    ./microbenchmarks/gemm_tests.sh --params "--mode=perf --iter=128 --m=4096 --n=8192 --k=2048 --verbose"

Run the GEMM benchmark in correctness mode on each GPU of NVIDIA GB200 NVL72 for 60 seconds:

srun -N 18 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix --output="gemm-check-%J-%N-%t.out" \
    ./microbenchmarks/gemm_tests.sh --params "--mode=check --sec=60 --m=4096 --n=8192 --k=2048 --verbose" \
    --cpu-affinity 0-35:36-71:72-107:108-143 \
    --mem-affinity 0:0:1:1 \
    --gpu-affinity 0:1:2:3

Here, --cpu-affinity maps processes to specific cores on the local node, --mem-affinity maps them to NUMA nodes on the same node, and --gpu-affinity assigns each process a GPU index.
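
Run the GEMM benchmark using FP64 emulation (a sketch; fp64emu is supported only on NVIDIA Blackwell-architecture GPUs or newer, and the node, task, and matrix-size values shown here are illustrative):

srun -N 1 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix --output="gemm-fp64emu-%J-%N-%t.out" \
    ./microbenchmarks/gemm_tests.sh --params "--mode=perf --type=fp64emu --mantissa-bits=53 --iter=128 --m=4096 --n=8192 --k=2048"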

Examples of analyzing GEMM benchmark output:

./microbenchmarks/gemm_parse_results.sh --hist gemm-check-*

./microbenchmarks/gemm_parse_results.sh --correctness gemm-check-*
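
Print summary statistics (min/max/mean/median) of the collected performance values, for example from the performance-mode run above:

./microbenchmarks/gemm_parse_results.sh --stats gemm-perf-*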