NVIDIA Matmul Heuristics#

NVIDIA Matmul Heuristics (nvMatmulHeuristics) is a GPU optimization module that provides fast, analytic heuristics for GPU tensor operations, particularly matrix multiplications (GEMMs). It analyzes tensor operation parameters and hardware capabilities to determine optimal kernel configurations for maximum performance.

Key features#

  • Delivers optimized kernel configurations for tensor operations based on problem dimensions and hardware characteristics.

  • Supports a wide range of precision formats including FP16, BF16, FP32, TF32, FP64, FP8 (E4M3, E5M2), INT8, INT4, and complex numbers.

  • Compatible with multiple GPU architectures including:

    • Ampere (A100, A10, A30, RTX 3090, RTX A6000,…)

    • Ada (L40, L40S, L4, RTX 6000)

    • Hopper (H100, H200)

    • Blackwell

  • nvMatmulHeuristics supports CUTLASS (versions 2 and 3) and powers some of the cuBLAS heuristics.

  • Provides sophisticated performance modeling and prediction capabilities, including:

    • Runtime estimation

    • Memory bandwidth and throughput analysis

    • L2 cache hit rate prediction

    • Energy consumption estimation

  • Supports tuning the heuristic to the actual kernel implementation through a “discovery” process, in which the heuristic learns how the kernels perform from a few benchmarks run by the user.

  • C++ and Python APIs.

nvMatmulHeuristics enables GPU applications to achieve peak performance without manual tuning by automatically selecting the best kernel implementation strategy for specific workloads and hardware configurations.

Getting Started#

In C++#

Install nvMatmulHeuristics by downloading and extracting the archive.

To compile (adding -I${NVMMH_HOME}/include and/or -L${NVMMH_HOME}/lib if necessary, where NVMMH_HOME is the root of your nvMatmulHeuristics installation):

g++ -std=c++17 -o test test.cpp -lnvMatmulHeuristics
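If nvMatmulHeuristics is installed outside the compiler's default search paths, the include and library directories can be passed explicitly, for example:

g++ -std=c++17 -I${NVMMH_HOME}/include -L${NVMMH_HOME}/lib -o test test.cpp -lnvMatmulHeuristics

At runtime, ${NVMMH_HOME}/lib may also need to be added to LD_LIBRARY_PATH so the shared library can be found.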

This code shows how to run nvMatmulHeuristics to predict the best configuration and runtime for a 128x128x400000 GEMM:

#include <cassert>
#include <iostream>
#include <nvMatmulHeuristics/nvMatmulHeuristics.h>

#include "sampleHelpers.h"

/**
 * This sample shows the best way to use nvMatmulHeuristics to get a GEMM kernel configuration.
 */

int main() {
    nvmmhHandle_t handle = nullptr;
    // In case the user does not want to manage handles, a nullptr can be used as a handle, which will use an internal global handle.
    // This is only recommended when it is known that no other library is using nvMatmulHeuristics.
    NVMMH_CHECK(nvMatmulHeuristicsCreate(&handle));


    // We can create a hardware descriptor to specify what hardware nvMatmulHeuristics will target.
    // The hardware descriptor is optional; the user can pass a nullptr, which will cause nvMatmulHeuristics to target the current GPU, if there is one.
    nvmmhHardwareDescriptor_t descr = nullptr;
    NVMMH_CHECK(nvMatmulHeuristicsHardwareDescriptorCreate(&descr));
    // Here we set A100 SXM properties.
    NVMMH_CHECK(nvMatmulHeuristicsHardwareDescriptorSetPredefinedGpu(descr, NVMMH_NVGPU_A100_SXM_80GB));

    // See header for precision string convention. HSH means FP16 A/B, FP32 computation and FP16 C/D.
    const char* precision = "HSH";
    constexpr int kernelCount = 8;
    constexpr auto layout = NVMMH_MATMUL_LAYOUT_TN_ROW_MAJOR;
    constexpr auto target = NVMMH_TARGET_CUTLASS;
    // Some matmul problem
    constexpr nvmmhMatmulProblem_t p = {
            .M = 128,
            .N = 128,
            .K = 400000,
            .batchSize = 1,
            .matmulLayout = static_cast<uint8_t>(layout),
    };

    // Loads the internal discovery data (silicon performance scans) to tune nvMatmulHeuristics to the actual kernel implementation.
    // This allows for a quick cold start. If you are using customized kernels, or kernels not included in nvMatmulHeuristics, you need to run the discovery manually.
    if (nvMatmulHeuristicsLoadInternalDiscoverySet(handle, precision, target, layout, descr) != NVMMH_STATUS_SUCCESS) {
        std::cout << "Please check sample #2 to see how to pass the data to nvMatmulHeuristics manually." << std::endl;
        std::cout << "We can continue without the tuning data." << std::endl;
    }

    nvmmhKernelConfiguration_t configs[kernelCount];
    // NVMMH_FLAG_PERF_MODEL_BASED_AUTO_TUNING is the recommended flag.
    const int count = nvMatmulHeuristicsGetGemmConfig(handle, precision, NVMMH_FLAG_PERF_MODEL_BASED_AUTO_TUNING, target, &p, configs, kernelCount, descr);

    // nvMatmulHeuristics might return fewer kernels than requested.
    assert(count <= kernelCount);

    // Printing the kernels
    for (int i = 0; i < count; ++i) {
        // nvMatmulHeuristicsEstimateRuntime might use a different path than the heuristic's internal ordering.
        // The estimate is returned in seconds; convert to milliseconds.
        const double runtime = nvMatmulHeuristicsEstimateRuntime(handle, precision, target, &p, &configs[i], descr) * 1000.;
        std::cout << '[' << i << "] " << to_string(configs[i]) << ", runtime: " << runtime << " ms" << std::endl;
    }

    // Freeing memory. We pass a pointer so that nvMatmulHeuristics can set it to nullptr, avoiding use-after-free/double-free.
    NVMMH_CHECK(nvMatmulHeuristicsHardwareDescriptorDestroy(&descr));
    NVMMH_CHECK(nvMatmulHeuristicsDestroy(&handle));
}

See API reference.

In Python#

Install nvMatmulHeuristics from the nvidia-matmul-heuristics Python wheel:

pip install nvidia-matmul-heuristics
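To verify the installation, check that the module imports (the module name matches the import used in the example below):

python -c "import nvMatmulHeuristics"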

The example below uses nvMatmulHeuristics to predict the top-8 configurations for a 4000x16x32768 half-precision GEMM:

from nvMatmulHeuristics import (
    NvMatmulHeuristicsInterface,
    NvMatmulHeuristicsTarget,
    NvMatmulHeuristicsMatmulLayout,
    NvMatmulHeuristicsFlags,
    NvMatmulHeuristicsNvidiaGpu,
)

# Load interface
nvmmh = NvMatmulHeuristicsInterface(NvMatmulHeuristicsTarget.CUTLASS3,
                                    precision='HSH',
                                    flags=NvMatmulHeuristicsFlags.PERF_MODEL_BASED_AUTO_TUNING)

# Create Hardware descriptor
# hw can be None to use the system's GPU instead
hw = nvmmh.createHardwareDescriptor()
nvmmh.setHardwarePredefinedGpu(hw, NvMatmulHeuristicsNvidiaGpu.H200_SXM)

# Select layout
layout = NvMatmulHeuristicsMatmulLayout.NN_ROW_MAJOR

# Load internal discovery set for improved accuracy
assert nvmmh.loadInternalDiscoverySet(layout, hw)

# Get best configurations
configs = nvmmh.get_with_mnk(4000, 16, 32768, layout, 8, hw)

# Print results
print(f"Found {len(configs)} configurations:\n")
for i, config in enumerate(sorted(configs, key=lambda d: d['runtime']), 1):
    print(f"Configuration {i}:")
    print(f"  Kernel: {config['kernel']}")
    # Runtime estimates are in seconds; convert to milliseconds.
    print(f"  Estimated runtime: {config['runtime'] * 1000:.6f} ms")

Running it will print:

Found 8 configurations:

Configuration 1:
  Kernel: layout(NN_ROW) stages(6) cta(128 16 128) warp(64 8 128) instr(64 8 16) splitK(4) swizz(1) ctaOrder(0) cluster(2 1)
  Estimated runtime: 0.083215 ms
...
Configuration 8:
  Kernel: layout(NN_ROW) stages(8) cta(64 16 64) warp(64 8 64) instr(64 8 16) splitK(1) swizz(1) ctaOrder(0) cluster(4 1)
  Estimated runtime: 0.102996 ms

See Python API reference.