nsight.analyze#

class nsight.analyze.kernel(
_func: Callable[..., Any] | None = None,
*,
configs: Sequence[Sequence[Any]] | None = None,
runs: int = 1,
derive_metric: Callable[..., float] | None = None,
normalize_against: str | None = None,
output: Literal['quiet', 'progress', 'verbose'] = 'progress',
metric: str = 'gpu__time_duration.sum',
ignore_kernel_list: Sequence[str] | None = None,
clock_control: Literal['base', 'none'] = 'none',
cache_control: Literal['all', 'none'] = 'all',
replay_mode: Literal['kernel', 'range'] = 'kernel',
thermal_control: bool = True,
combine_kernel_metrics: Callable[[float, float], float] | None = None,
output_prefix: str | None = None,
output_csv: bool = False,
)#

A decorator that collects profiling data using NVIDIA Nsight Compute.

Can be used with or without parentheses:
  • @nsight.analyze.kernel (no parentheses)

  • @nsight.analyze.kernel() (empty parentheses)

  • @nsight.analyze.kernel(configs=..., runs=10) (with arguments)
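
A minimal sketch of all three forms (the function bodies are placeholders for the GPU work being profiled):

import nsight

@nsight.analyze.kernel               # no parentheses
def bench_a(n):
    ...                              # launch the GPU work to profile

@nsight.analyze.kernel()             # empty parentheses
def bench_b(n):
    ...

@nsight.analyze.kernel(configs=[(1024,), (2048,)], runs=10)
def bench_c(n):
    ...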

The decorator returns a wrapped version of your function with the following signature:

def wrapped_function(*args, configs=None, **kwargs) -> ProfileResults

Where:
  • *args: Original function arguments (when providing a single config)

  • configs: Optional list of configurations (overrides decorator-time configs)

  • **kwargs: Original function keyword arguments

  • Returns ProfileResults object containing profiling data
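
For example, reusing the hypothetical functions from the sketch above:

# Single configuration: pass the original arguments directly.
results = bench_a(4096)

# Multiple configurations at call time (overrides any decorator-time configs).
results = bench_c(configs=[(1024,), (2048,), (4096,)])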

Parameters:
  • configs (Optional[Sequence[Sequence[Any]]]) – A sequence of configurations to run the function with. Each configuration is a tuple of arguments for the decorated function. Nsight Python invokes the decorated function len(configs) * runs times. If the configs are not provided at decoration time, they must be provided when calling the decorated function.

  • runs (int) – Number of times each configuration should be executed.

  • derive_metric (Optional[Callable[..., float]]) – A function to transform the collected metric, used to compute derived metrics such as TFLOP/s that ncu cannot capture directly. The function receives the metric value followed by the arguments of the decorated function and returns the new metric; see the first sketch after this parameter list for a concrete use case.

  • normalize_against (Optional[str]) – Annotation name to normalize metrics against. This is useful for computing relative metrics such as speedup (see the first sketch after this parameter list).

  • metric (str) – The metric to collect. By default, kernel runtimes in nanoseconds are collected. Default: "gpu__time_duration.sum". To see the available metrics on your system, use the command: ncu --query-metrics.

  • ignore_kernel_list (Optional[Sequence[str]]) – List of kernel names to ignore. If you call a library within an annotated range context, you may not have precise control over which kernels are launched, or how many. Kernels that should be excluded from the profile can be listed here (see the range-replay sketch after this parameter list). Default: None

  • combine_kernel_metrics (Optional[Callable[[float, float], float]]) – By default, Nsight Python expects one kernel launch per annotation. If an annotated region launches multiple kernels, you can specify how to combine the collected metrics into a single number instead of failing the profiling run. For example, when profiling runtime, the times of all kernels can be summed with combine_kernel_metrics = lambda x, y: x + y. The function takes two values and returns a single value (see the range-replay sketch after this parameter list). Default: None.

  • clock_control (Literal['base', 'none']) –

    Control the behavior of the GPU clocks during profiling. Allowed values:

    • "base": GPC and memory clocks are locked to their respective base frequency during profiling. This has no impact on thermal throttling. Note that actual clocks might still vary, depending on the level of driver support for this feature. As an alternative, use nvidia-smi to lock the clocks externally and set this option to "none".

    • "none": No GPC or memory frequencies are changed during profiling.

    Default: "none"

  • cache_control (Literal['all', 'none']) –

    Control the behavior of the GPU caches during profiling. Allowed values:

    • "all": All GPU caches are flushed before each kernel replay iteration during profiling. While metric values in the execution environment of the application might be slightly different without invalidating the caches, this mode offers the most reproducible metric results across the replay passes and also across multiple runs of the target application.

    • "none": No GPU caches are flushed during profiling. This can improve performance and better replicates the application behavior if only a single kernel replay pass is necessary for metric collection. However, some metric results will vary depending on prior GPU work, and between replay iterations. This can lead to inconsistent and out-of-bounds metric values.

    Default: "all"

  • replay_mode (Literal['kernel', 'range']) –

    Mechanism used for replaying a kernel launch multiple times to collect selected metrics. Allowed values:

    • "kernel": Replay individual kernel launches during the execution of the application.

    • "range": Replay range of kernel launches during the execution of the application. Ranges are defined using nsight.annotate.

    Default: "kernel"

  • thermal_control (bool) – Whether to enable thermal control during profiling. Default: True

  • output (Literal['quiet', 'progress', 'verbose']) –

    Controls the verbosity level of the output.

    • "quiet": Suppresses all output.

    • "progress": Shows a progress bar along with details about profiling and data extraction progress.

    • "verbose": Displays the progress bar, configuration-specific logs, and profiler logs.

  • output_prefix (Optional[str]) –

    When specified, all intermediate profiler files are created with this prefix. For example, if output_prefix="/home/user/run1_", the profiler will generate:

    • /home/user/run1_ncu-output-<name_of_decorated_function>-<run_id>.log

    • /home/user/run1_ncu-output-<name_of_decorated_function>-<run_id>.ncu-rep

    • /home/user/run1_processed_data-<name_of_decorated_function>-<run_id>.csv

    • /home/user/run1_profiled_data-<name_of_decorated_function>-<run_id>.csv

    Where <run_id> is a counter that increments each time the decorated function is called within the same Python process (0, 1, 2, …). This allows calling the same decorated function multiple times without overwriting previous results.

    If None, the intermediate profiler files are created in a directory under <TEMP_DIR> prefixed with nspy, where <TEMP_DIR> is the system's temporary directory ($TMPDIR or /tmp on Linux, %TEMP% on Windows).

  • output_csv (bool) –

    Controls whether to dump raw and processed profiling data to CSV files. Default: False. When enabled, two CSV files are generated:

    Raw Data CSV (profiled_data-<function_name>-<run_id>.csv): Contains unprocessed profiling data with one row per run per configuration. Columns include:

    • Annotation: Name of the annotated region being profiled

    • Value: Raw metric value collected by the profiler

    • Metric: The metric being collected (e.g., gpu__time_duration.sum)

    • Transformed: Name of the function used to transform the metric (specified via derive_metric), or False if no transformation was applied. For lambda functions, this shows "<lambda>"

    • Kernel: Name of the GPU kernel(s) launched

    • GPU: GPU device name

    • Host: Host machine name

    • ComputeClock: GPU compute clock frequency during profiling

    • MemoryClock: GPU memory clock frequency during profiling

    • <param_name>: One column for each parameter of the decorated function

    Processed Data CSV (processed_data-<function_name>-<run_id>.csv): Contains aggregated statistics across multiple runs. Columns include:

    • Annotation: Name of the annotated region being profiled

    • <param_name>: One column for each parameter of the decorated function

    • AvgValue: Average metric value across all runs

    • StdDev: Standard deviation of the metric across runs

    • MinValue: Minimum metric value observed

    • MaxValue: Maximum metric value observed

    • NumRuns: Number of runs used for aggregation

    • CI95_Lower: Lower bound of the 95% confidence interval

    • CI95_Upper: Upper bound of the 95% confidence interval

    • RelativeStdDevPct: Standard deviation as a percentage of the mean

    • StableMeasurement: Boolean indicating whether the measurement is stable (low variance). A measurement is considered stable if RelativeStdDevPct < 2%.

    • Metric: The metric being collected

    • Transformed: Name of the function used to transform the metric (specified via derive_metric), or False if no transformation was applied. For lambda functions, this shows "<lambda>"

    • Kernel: Name of the GPU kernel(s) launched

    • GPU: GPU device name

    • Host: Host machine name

    • ComputeClock: GPU compute clock frequency

    • MemoryClock: GPU memory clock frequency

  • _func (Callable[..., Any] | None) – The function to decorate. This is populated automatically when the decorator is used without parentheses and should not be passed explicitly.
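
The sketch below illustrates derive_metric and normalize_against for a hypothetical matrix-multiply benchmark; the kernel launches are placeholders, and the exact call shape of nsight.annotate is assumed:

import nsight

def to_tflops(value, n):
    # value is the collected metric (runtime in ns by default);
    # the remaining arguments mirror those of the decorated function.
    flops = 2 * n**3                     # FLOP count of an n x n x n matmul
    return flops / (value * 1e-9) / 1e12

@nsight.analyze.kernel(
    configs=[(1024,), (2048,), (4096,)],
    runs=5,
    derive_metric=to_tflops,
    normalize_against="baseline",        # report values relative to this annotation
)
def matmul_bench(n):
    with nsight.annotate("baseline"):    # annotation API shape assumed
        ...                              # launch the reference kernel
    with nsight.annotate("optimized"):
        ...                              # launch the candidate kernel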
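
And a range-replay sketch for an annotated region that launches several kernels; the kernel name in ignore_kernel_list is hypothetical:

@nsight.analyze.kernel(
    configs=[(1 << 20,)],
    replay_mode="range",                        # replay whole annotated ranges
    combine_kernel_metrics=lambda x, y: x + y,  # sum the runtimes of all kernels
    ignore_kernel_list=["memset_kernel"],       # hypothetical helper kernel to exclude
)
def library_bench(n):
    with nsight.annotate("library_call"):       # annotation API shape assumed
        ...                                     # library call launching multiple kernels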

Return type:

Callable[..., ProfileResults] | Callable[[Callable[..., Any]], Callable[..., ProfileResults]]

class nsight.analyze.plot(
filename: str = 'plot.png',
*,
title: str = 'Nsight Analyze Kernel Plot Results',
ylabel: str | None = None,
annotate_points: bool = False,
show_aggregate: str | None = None,
plot_type: str = 'line',
plot_width: int = 6,
plot_height: int = 4,
row_panels: Sequence[str] | None = None,
col_panels: Sequence[str] | None = None,
x_keys: Sequence[str] | None = None,
print_data: bool = False,
variant_fields: Sequence[str] | None = None,
variant_annotations: Sequence[str] | None = None,
plot_callback: Callable[[Figure], None] | None = None,
)#

A decorator that plots the result of a profile-decorated function. It is intended to be used only on functions that have been decorated with @nsight.analyze.kernel.

The decorator returns a wrapped version of your function that maintains the same signature as the underlying @nsight.analyze.kernel decorated function:

def wrapped_function(*args, configs=None, **kwargs) -> ProfileResults

The function returns ProfileResults and generates a plot as a side effect.

Example usage:

@nsight.analyze.plot(title="My Plot")
@nsight.analyze.kernel
def my_func(...):
    ...

Parameters:
  • filename (str) – Filename to save the plot. Default: 'plot.png'

  • title (str) – Title for the plot. Default: 'Nsight Analyze Kernel Plot Results'

  • ylabel (Optional[str]) – Label for the y-axis in the generated plot. Default: f'{metric} (avg: {runs} runs)'

  • annotate_points (bool) – If True, annotate the points with their numeric value in the plot. Default: False

  • show_aggregate (Optional[str]) – If "avg", show the average value in the plot. If "geomean", show the geometric mean value in the plot. Default: None

  • plot_type (str) – Type of plot to generate. Options are 'line' or 'bar'. Default: 'line'

  • plot_width (int) – Width of the plot in inches. Default: 6

  • plot_height (int) – Height of the plot in inches. Default: 4

  • row_panels (Optional[Sequence[str]]) – Generates a separate row of subplots for each unique value of the listed function parameters. The provided strings must each match one argument of the nsight.analyze.kernel-decorated function. Default: None

  • col_panels (Optional[Sequence[str]]) – Generates a separate column of subplots for each unique value of the listed function parameters. The provided strings must each match one argument of the nsight.analyze.kernel-decorated function (see the sketch after this parameter list). Default: None

  • x_keys (Optional[Sequence[str]]) – List of fields to use for the x-axis. By default, all parameters of the decorated function are used except those listed in row_panels and col_panels.

  • print_data (bool) – If True, print the data used for plotting. Default: False

  • variant_fields (Optional[Sequence[str]]) – List of config fields whose unique values are plotted as separate variants (for example, separate lines in a line plot).

  • variant_annotations (Optional[Sequence[str]]) – List of annotated range names for which to apply variant splitting. The provided strings must each match one of the names defined using nsight.annotate.

  • plot_callback (Callable[[Figure], None] | None) – Optional callback that receives the generated matplotlib Figure, allowing custom modifications to the plot.
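
A sketch combining several of these options (the benchmark body is a placeholder):

@nsight.analyze.plot(
    filename="matmul.png",
    title="Matmul throughput",
    ylabel="TFLOP/s",
    plot_type="bar",
    col_panels=["dtype"],    # one subplot column per unique dtype value
    x_keys=["n"],            # problem size on the x-axis
)
@nsight.analyze.kernel(
    configs=[(1024, "fp16"), (2048, "fp16"), (1024, "fp32"), (2048, "fp32")],
    runs=5,
)
def matmul_bench(n, dtype):
    ...                      # hypothetical kernel launches inside annotated ranges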

Return type:

Callable[[Callable[..., ProfileResults]], Callable[..., ProfileResults]]

class nsight.analyze.ignore_failures#

Context manager that ignores errors in a code block.

Useful when you want failures in the block to be suppressed so they do not propagate and cause the decorated function to fail.

Return type:

Any
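
For example (a sketch; instantiating the context manager with parentheses is assumed here):

@nsight.analyze.kernel(configs=[(1024,), (1 << 24,)])
def bench(n):
    with nsight.analyze.ignore_failures():
        ...   # work that may fail for some configs; errors here are suppressed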