Architecture#

Nsight Python’s architecture consists of three stages:

  1. Collection: Runs your benchmark under NVIDIA Nsight Compute. See Collection.

  2. Extraction: Parses Nsight reports using ncu-report and associates metrics with annotations/configs. See Extraction.

  3. Visualization: Converts data to a pandas DataFrame and optionally plots results via matplotlib. See Visualization.

Internally, Nsight Python:

  • Inserts NVTX ranges for each annotation

  • Profiles each configuration across multiple runs

  • Associates collected metrics with annotations

  • Supports TFLOPs, speedup, and other derived metrics
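
These pieces come together in a single decorated function. A minimal sketch (the matmul body is illustrative, and it assumes the annotate block sits inside the decorated function, as in the Multiple Annotations example below):

import torch
import nsight

@nsight.analyze.kernel(configs=[(512, 512), (1024, 1024)])
def benchmark(m, n):
    # Each config tuple is unpacked into the function's parameters.
    a = torch.randn(m, n, device="cuda")
    b = torch.randn(n, m, device="cuda")
    # The annotation becomes an NVTX range in the Nsight Compute report.
    with nsight.annotate("matmul"):
        torch.matmul(a, b)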

Advanced Options#

Metric Selection

Nsight Python collects gpu__time_duration.sum by default. To collect another NVIDIA Nsight Compute metric:

@nsight.analyze.kernel(metric="sm__throughput.avg.pct_of_peak_sustained_elapsed")
def benchmark(...):
    ...
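
Any metric name that NVIDIA Nsight Compute understands should work here; ncu --query-metrics prints the available names.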

Derived Metrics

Define a Python function that computes metrics like TFLOPs based on runtime and input configuration:

def tflops(t, m, n, k):
    # t is the measured runtime in nanoseconds; 2*m*n*k is the FLOP count
    return 2 * m * n * k / (t / 1e9) / 1e12

@nsight.analyze.kernel(configs=[(1024, 1024, 64)], derive_metric=tflops)
def benchmark(m, n, k):
    ...
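
In this example, tflops receives the measured runtime t as its first argument, followed by the config parameters m, n, and k; dividing by 1e9 converts nanoseconds to seconds before computing TFLOPs.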

Relative Metrics

Compare performance against a baseline annotation:

@nsight.analyze.kernel(normalize_against="torch.einsum")
def benchmark(...):
    ...

Multiple Annotations

Profile multiple implementations side by side:

with nsight.annotate("torch"):
    torch_impl(...)

with nsight.annotate("cutlass4"):
    cutlass_impl(...)
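
Annotations pair naturally with normalize_against. A sketch, assuming the annotate blocks live inside the decorated benchmark and that torch_impl and cutlass_impl stand in for real implementations:

@nsight.analyze.kernel(configs=[(1024, 1024, 64)], normalize_against="torch")
def benchmark(m, n, k):
    # Baseline annotation: other annotations are reported relative to it.
    with nsight.annotate("torch"):
        torch_impl(m, n, k)
    with nsight.annotate("cutlass4"):
        cutlass_impl(m, n, k)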

Multiple Config Parameters

Nsight Python supports multi-dimensional config tuples, which can contain arbitrary Python objects:

import itertools
configs = list(itertools.product([512, 1024], [64, 128]))  # (seqlen, head_dim)

@nsight.analyze.kernel(configs=configs)
def benchmark(seqlen, head_dim):
    ...
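
Because config entries are ordinary Python objects, non-numeric parameters such as dtypes can be swept as well. A sketch, assuming a PyTorch benchmark body:

import itertools
import torch

# Sweep sequence length and dtype together; the torch.dtype objects are
# passed through to the benchmark unchanged.
configs = list(itertools.product([512, 1024], [torch.float16, torch.bfloat16]))

@nsight.analyze.kernel(configs=configs)
def benchmark(seqlen, dtype):
    x = torch.randn(seqlen, seqlen, device="cuda", dtype=dtype)
    with nsight.annotate("matmul"):
        torch.matmul(x, x)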