Architecture
Nsight Python’s architecture consists of:
Collection: Runs your benchmark under NVIDIA Nsight Compute. See Collection.
Extraction: Parses Nsight reports using ncu-report and associates metrics with annotations/configs. See Extraction.
Visualization: Converts data to a pandas DataFrame and optionally plots results via matplotlib. See Visualization.
Internally, Nsight Python:
Inserts NVTX ranges for each annotation
Profiles each configuration for multiple runs
Associates collected metrics with annotations
Supports TFLOPs, speedup, and other derived metrics
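A minimal sketch of how these pieces fit together, assuming the decorator and annotation APIs shown under Advanced Options below (the matmul body and sizes are illustrative, not part of the library):

import torch
import nsight

# Each config tuple is unpacked into the benchmark's parameters,
# and the annotated region is what gets profiled.
@nsight.analyze.kernel(configs=[(512, 512, 64), (1024, 1024, 64)])
def benchmark(m, n, k):
    a = torch.randn(m, k, device="cuda")
    b = torch.randn(k, n, device="cuda")
    with nsight.annotate("torch.matmul"):
        torch.matmul(a, b)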
Advanced Options
Metric Selection
Nsight Python collects gpu__time_duration.sum by default. To collect another NVIDIA Nsight Compute metric:
@nsight.analyze.kernel(metric="sm__throughput.avg.pct_of_peak_sustained_elapsed")
def benchmark(...):
    ...
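The full set of metric names supported by your GPU can be listed with ncu --query-metrics.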
Derived Metrics
Define a Python function that computes metrics like TFLOPs based on runtime and input configuration:
def tflops(t, m, n, k):
    # t is the measured runtime in nanoseconds (the default
    # gpu__time_duration.sum metric); 2 * m * n * k is the FLOP count
    # of an (m x k) @ (k x n) matmul, converted here to TFLOP/s.
    return 2 * m * n * k / (t / 1e9) / 1e12

@nsight.analyze.kernel(configs=[(1024, 1024, 64)], derive_metric=tflops)
def benchmark(m, n, k):
    ...
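As a sanity check: for the (1024, 1024, 64) config, 2 * m * n * k is about 1.34e8 FLOPs, so a measured runtime of t = 100,000 ns (0.1 ms) yields roughly 1.34 TFLOP/s.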
Relative Metrics
Compare performance against a baseline annotation:
@nsight.analyze.kernel(normalize_against="torch.einsum")
def benchmark(...):
    ...
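The string passed to normalize_against names the baseline annotation; other annotations are then reported relative to it (for example, as speedup). A combined sketch follows the Multiple Annotations example below.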
Multiple Annotations
Profile multiple implementations side by side:
with nsight.annotate("torch"):
torch_impl(...)
with nsight.annotate("cutlass4"):
cutlass_impl(...)
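A sketch combining annotations with normalization, assuming the APIs above (the einsum and matmul calls are illustrative stand-ins for two implementations):

import torch
import nsight

@nsight.analyze.kernel(configs=[(1024, 1024, 64)],
                       normalize_against="torch.einsum")
def benchmark(m, n, k):
    a = torch.randn(m, k, device="cuda")
    b = torch.randn(k, n, device="cuda")
    with nsight.annotate("torch.einsum"):  # baseline annotation
        torch.einsum("ik,kj->ij", a, b)
    with nsight.annotate("torch.matmul"):  # reported relative to the baseline
        torch.matmul(a, b)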
Multiple Config Parameters
Nsight Python supports multi-dimensional config tuples, which can contain arbitrary Python objects:
import itertools

configs = list(itertools.product([512, 1024], [64, 128]))  # (seqlen, head_dim)

@nsight.analyze.kernel(configs=configs)
def benchmark(seqlen, head_dim):
    ...
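Because config tuples can hold arbitrary Python objects, sizes can be paired with, say, dtypes. A hypothetical sketch (the dtype parameter and matmul body are illustrative):

import itertools
import torch
import nsight

# Hypothetical: pair sequence lengths with torch dtypes in each config tuple.
configs = list(itertools.product([512, 1024], [torch.float16, torch.float32]))

@nsight.analyze.kernel(configs=configs)
def benchmark(seqlen, dtype):
    x = torch.randn(seqlen, seqlen, device="cuda", dtype=dtype)
    with nsight.annotate("matmul"):
        torch.matmul(x, x)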