Samples#

Once CUPTI Python is installed, the CUPTI samples are located under the site-packages/cupti-python-samples directory. You can determine the location of your site-packages directory by executing the following command:

$ python3 -m site
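The same lookup can be done programmatically. The following is a minimal sketch that uses only the standard-library `site` module; the helper name `find_samples_dirs` is hypothetical and is not part of CUPTI Python, and the directory name is taken from the text above.

```python
import site
from pathlib import Path

# Directory name as given in the text above.
SAMPLES_DIR_NAME = "cupti-python-samples"


def find_samples_dirs(search_roots=None):
    """Return every existing <site-packages>/cupti-python-samples directory.

    Hypothetical helper for illustration; not part of CUPTI Python.
    """
    if search_roots is None:
        # getsitepackages() may be missing in some older virtualenv setups,
        # so fall back to an empty list in that case.
        search_roots = list(getattr(site, "getsitepackages", lambda: [])())
        search_roots.append(site.getusersitepackages())
    found = []
    for root in search_roots:
        candidate = Path(root) / SAMPLES_DIR_NAME
        if candidate.is_dir():
            found.append(candidate)
    return found


if __name__ == "__main__":
    for path in find_samples_dirs():
        print(path)
```

If nothing is printed, the samples are probably installed in a different environment from the interpreter that ran the script.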

Setting up Numba CUDA#

The CUPTI Python Numba samples require the numba-cuda package along with the dependencies for CUDA 13.x. You can install numba-cuda using the following command:

$ pip install numba-cuda[cu13]

Samples#

The CuptiVectorAdd* samples contain simple code that performs element-wise vector addition.
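As a rough illustration of the workload, the sketch below shows a CPU reference for element-wise addition and the ceiling division that is commonly used to pick a grid size. The function names are hypothetical and are not taken from the samples. The numbers come from the example output later on this page (vector_length 1048576, threads_per_block 128, blocks_per_grid 8192), which are consistent with this formula, though the samples may compute the grid size differently.

```python
def blocks_needed(vector_length, threads_per_block):
    """Ceiling division: enough blocks so that every element gets one thread."""
    return (vector_length + threads_per_block - 1) // threads_per_block


def vector_add_reference(a, b):
    """CPU reference for element-wise addition: out[i] = a[i] + b[i]."""
    if len(a) != len(b):
        raise ValueError("input vectors must have the same length")
    return [x + y for x, y in zip(a, b)]


if __name__ == "__main__":
    # Values printed by the samples in the example output on this page.
    print(blocks_needed(1048576, 128))  # 8192
    print(vector_add_reference([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```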

CuptiVectorAddNumba.py#

A CUPTI Python sample that shows how to use the CUPTI Activity APIs. This sample uses numba-cuda.

Command line options:

--profile, -p

Enable CUPTI based profiling. Default: OFF

--output, -o OUTPUT_TYPE

Select the profiler output format. OUTPUT_TYPE can be: brief, detailed, or none. Default: brief

--help, -h

Shows the usage.
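For readers who want to build a similar command line in their own scripts, the options above could be declared with `argparse` roughly as follows. This is a hypothetical sketch based only on the documented flags and defaults; it is not the parser used by the sample, and `build_parser` is an illustrative name.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented options (not the sample's code)."""
    parser = argparse.ArgumentParser(
        description="Vector addition with optional CUPTI based profiling"
    )
    # --profile/-p is a switch; profiling stays off unless it is passed.
    parser.add_argument(
        "--profile", "-p", action="store_true",
        help="Enable CUPTI based profiling. Default: OFF",
    )
    # --output/-o accepts one of the three documented formats.
    parser.add_argument(
        "--output", "-o", metavar="OUTPUT_TYPE",
        choices=["brief", "detailed", "none"], default="brief",
        help="Profiler output format: brief, detailed, or none. Default: brief",
    )
    # argparse adds --help/-h automatically.
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```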

CuptiVectorAddNumbaCallback.py#

A CUPTI Python sample that shows how to use the CUPTI Callback APIs. This sample uses numba-cuda.

Command line options:

--profile, -p

Enable CUPTI based profiling. Default: OFF

--output, -o OUTPUT_TYPE

Select the profiler output format. OUTPUT_TYPE can be: brief, detailed, or none. Default: brief

--help, -h

Shows the usage.

CuptiVectorAddDrv.py#

A CUPTI Python sample that shows how to use the CUPTI Activity APIs. This sample uses the CUDA Python Driver APIs from cuda-bindings. It also shows how to use the CUDA profiler start and stop APIs to define the range of code to be profiled.

This sample uses NVRTC (NVIDIA Runtime Compilation) to compile CUDA kernel code to PTX at runtime. The sample demonstrates:

Using cuda.bindings.nvrtc to compile CUDA kernel source code to PTX
Using cuda.bindings.driver APIs to load the PTX module and launch kernels
Using CUPTI Activity APIs to profile the CUDA operations

To ensure that cuda-bindings is set up correctly along with the necessary CUDA Toolkit (CTK) components (including NVRTC), refer to the cuda-bindings runtime requirements documentation.

Command line options:

--profile, -p

Enable CUPTI based profiling. Default: OFF

--define-profile-range, -r

Include CUDA profiler start and stop APIs to define the range of code to be profiled. Default: OFF

--output, -o OUTPUT_TYPE

Select the profiler output format. OUTPUT_TYPE can be: brief, detailed, or none. Default: brief

--help, -h

Shows the usage.

CuptiPmSamplingVectorAddNumba.py#

A CUPTI Python sample that shows how to use the PM Sampling and Profiler Host APIs through the Pythonic layer (cupti.pm_sampling and cupti.profiler_host). This sample uses numba-cuda.

It supports two workflows, selected by which flags are passed: if any Query Metrics flag is given, the Query workflow runs and the sample exits after printing; otherwise the Collection workflow runs.

Query workflow: inspect what the current GPU exposes, including supported chips, base metrics, per-metric properties (description, units, scope, sub-metrics), single-pass metric sets, and the metrics in a given single-pass set.
Collection workflow: configure PM Sampling for a metric set, run a CUDA vector-add workload, stop and decode the PM Sampling data, and print sampled metric values with timestamps.

Query Metrics options (run the Query workflow; sample exits after printing):

--supported-chips

List the supported chips.

--base-metrics

List base metrics available on the selected device.

--metric-properties <metric[,metric…]>

Print metric properties (description, unit, scope, sub-metrics, etc.) for each comma-separated metric name.

--single-pass-metric-sets

List single-pass metric sets available on the selected device.

--metrics-in-single-pass-sets <set>

List metrics in the given single-pass metric set.

PM Sampling options (control the Collection workflow):

--metrics <comma-separated metric list>

Metrics to collect. Default: gr__cycles_elapsed.max,sm__cycles_elapsed.sum.

--hardware-buffer-size SIZE_BYTES

Hardware buffer size in bytes. Default: 536870912.

--sampling-interval INTERVAL

Sampling interval for the selected trigger mode. Default: 10000.

--trigger-mode MODE

Sampling trigger mode. One of gpu_time_interval, gpu_sysclk_interval. Default: gpu_time_interval.

--device-index INDEX

CUDA device index to use. Default: 0.

--help, -h

Shows the usage.
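The rule that any Query Metrics flag selects the Query workflow, and that the Collection workflow runs otherwise, can be sketched with `argparse` as below. This is an illustrative assumption about how such a selection might be written, not the sample's actual code. The function names are hypothetical; the flags and defaults are copied from the option list above.

```python
import argparse

# Destination names argparse derives from the documented Query Metrics flags.
QUERY_FLAGS = (
    "supported_chips",
    "base_metrics",
    "metric_properties",
    "single_pass_metric_sets",
    "metrics_in_single_pass_sets",
)


def build_pm_sampling_parser():
    """Hypothetical parser mirroring the documented options."""
    p = argparse.ArgumentParser()
    # Query Metrics options.
    p.add_argument("--supported-chips", action="store_true")
    p.add_argument("--base-metrics", action="store_true")
    p.add_argument("--metric-properties", default=None)
    p.add_argument("--single-pass-metric-sets", action="store_true")
    p.add_argument("--metrics-in-single-pass-sets", default=None)
    # PM Sampling options with the documented defaults.
    p.add_argument("--metrics", default="gr__cycles_elapsed.max,sm__cycles_elapsed.sum")
    p.add_argument("--hardware-buffer-size", type=int, default=536870912)  # 512 MiB
    p.add_argument("--sampling-interval", type=int, default=10000)
    p.add_argument(
        "--trigger-mode",
        choices=["gpu_time_interval", "gpu_sysclk_interval"],
        default="gpu_time_interval",
    )
    p.add_argument("--device-index", type=int, default=0)
    return p


def select_workflow(args):
    """Return 'query' if any Query Metrics flag was passed, else 'collection'."""
    for name in QUERY_FLAGS:
        if getattr(args, name) not in (None, False):
            return "query"
    return "collection"


if __name__ == "__main__":
    args = build_pm_sampling_parser().parse_args()
    print(select_workflow(args), args.metrics.split(","))
```

The default hardware buffer size of 536870912 bytes equals 512 × 1024 × 1024, that is, 512 MiB.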

cupyprof.py#

A CUPTI Python sample that shows how to profile a CUDA Python application using the CUPTI Python APIs without modifying the application code.

This sample exercises the Activity, Callback, PM Sampling, and Profiler Host APIs. By default, only tracing (Activity + Callback) is enabled. PM Sampling is enabled when any PM Sampling option is passed; passing both PM Sampling and tracing options enables both at once. Query Metrics options use the Profiler Host API to inspect chips and metrics, and exit without profiling.

usage: python3 cupyprof.py [OPTIONS] <python_file_path> [args]

Output:

--output, -o OUTPUT_TYPE

Select the profiler output format. OUTPUT_TYPE can be: brief, detailed, or none. Default: brief.

Tracing options (Activity + Callback APIs):

--profile, -p PROFILING_TYPE

Enable profiling for the entire CUDA Python program, or only for the subset between cuProfilerStart and cuProfilerStop. PROFILING_TYPE can be from_start or range. Default: from_start.

--activity, -a <comma-separated list of activities>

Activities to trace. Use --help to view all supported values; the defaults are listed under DEFAULT_ACTIVITY_CHOICES in cupyprof.py.

PM Sampling options (passing any of these enables PM Sampling):

--metrics, -m <comma-separated metric list>

Metrics to collect. Default: gr__cycles_elapsed.max,sm__cycles_elapsed.sum.

--hardware-buffer-size, -b SIZE_BYTES

Hardware buffer size in bytes. Default: 536870912.

--sampling-interval, -i INTERVAL

Sampling interval for the selected trigger mode. Default: 10000.

--trigger-mode, -t MODE

Sampling trigger mode. One of gpu_time_interval, gpu_sysclk_interval. Default: gpu_time_interval.

--num-of-samples, -n COUNT

Number of samples in the PM sampling counter-data image. Default: 10000.

--device-index INDEX

CUDA device index to use. Default: 0.

Query Metrics options (mutually exclusive with Tracing and PM Sampling; cupyprof exits after printing):

--supported-chips

List the supported chips.

--base-metrics

List base metrics available on the selected device.

--metric-properties <metric[,metric…]>

Print metric properties (description, unit, sub-metrics, etc.) for each comma-separated metric name.

--single-pass-metric-sets

List single-pass metric sets available on the selected device.

--metrics-in-single-pass-sets <set>

List metrics in the given single-pass metric set.

--help, -h

Shows the usage.

python_file_path is the path to the CUDA Python application, and args are the arguments for that application.
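One common way for a wrapper to run an unmodified Python script with its own arguments is the standard-library `runpy` module. The sketch below illustrates that general technique only; it is an assumption and not a description of how cupyprof.py is implemented, and `run_target` is a hypothetical name.

```python
import runpy
import sys


def run_target(python_file_path, target_args):
    """Run another Python script in-process as if it were started directly.

    Hypothetical helper; a real profiler would enable collection before the
    call to run_path and flush its results afterwards (not shown here).
    """
    saved_argv = sys.argv
    # The target should see its own path and arguments, not the wrapper's.
    sys.argv = [python_file_path, *target_args]
    try:
        # run_name="__main__" makes the target's `if __name__ == "__main__"`
        # guard fire; the return value is the target's module globals.
        return runpy.run_path(python_file_path, run_name="__main__")
    finally:
        sys.argv = saved_argv
```

With this approach the target script needs no changes, which matches the usage line above: everything after `<python_file_path>` is passed through as `args`.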

Examples of running samples#

Run the sample without profiling:

$ python3 CuptiVectorAddNumba.py

Run the sample with profiling enabled and use default output:

$ python3 CuptiVectorAddNumba.py --profile
profiling_enabled:  True
prof_output:  ProfOutput.BRIEF
vector_length:  1048576
threads_per_block:  128
blocks_per_grid:  8192
Activity Kind                  Start                Duration             correlationId        Name
DRIVER                         1714136661470990409  1834876              1                    cuCtxGetCurrent
DRIVER                         1714136661472854473  213                  2                    cuDeviceGetCount
DRIVER                         1714136661472869777  87                   3                    cuDeviceGet
DRIVER                         1714136661472880942  566                  4                    cuDeviceGetAttribute
DRIVER                         1714136661472883507  69                   5                    cuDeviceGetAttribute
DRIVER                         1714136661472906825  3702                 6                    cuDeviceGetName
DRIVER                         1714136661472969577  87                   7                    cuDeviceGetUuid_v2
DRIVER                         1714136661472991812  140587104            8                    cuDevicePrimaryCtxRetain
.
.
.
DRIVER                         1714136661714686225  218                  88                   cuCtxGetCurrent
DRIVER                         1714136661714688211  55                   89                   cuCtxGetDevice
DRIVER                         1714136661714702981  2080                 90                   cuCtxSynchronize
verify_result: PASS
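The brief output above is whitespace-separated, so it can be post-processed with a few lines of Python. The following sketch sums the Duration column per API name. It assumes the five-column layout shown above (kind, start, duration, correlation ID, name) and that names contain no spaces; `total_duration_by_name` is a hypothetical helper, not part of the samples.

```python
from collections import defaultdict


def total_duration_by_name(brief_output):
    """Sum the Duration column per (kind, name) pair in brief profiler output."""
    totals = defaultdict(int)
    for line in brief_output.splitlines():
        fields = line.split()
        # Data rows have exactly five fields with numeric start and duration;
        # the header, "." filler lines, and status lines are skipped.
        if len(fields) != 5 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        kind, _start, duration, _correlation_id, name = fields
        totals[(kind, name)] += int(duration)
    return dict(totals)


if __name__ == "__main__":
    # A few rows copied from the example output above.
    sample = """\
Activity Kind                  Start                Duration             correlationId        Name
DRIVER                         1714136661470990409  1834876              1                    cuCtxGetCurrent
DRIVER                         1714136661472880942  566                  4                    cuDeviceGetAttribute
DRIVER                         1714136661472883507  69                   5                    cuDeviceGetAttribute
DRIVER                         1714136661714686225  218                  88                   cuCtxGetCurrent
verify_result: PASS
"""
    for key, total in total_duration_by_name(sample).items():
        print(key, total)
```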

Use the cupyprof.py sample to profile a CUDA Python application with a profiling range defined and detailed output:

$ python3 cupyprof.py --profile range --output detailed ./CuptiVectorAddDrv.py --define-profile-range
profiling_enabled: False
prof_output: ProfOutput.BRIEF
profile_range: True
vector_length: 1048576
threads_per_block: 128
blocks_per_grid: 8192
MEMCPY "HTOD" [ 1726060107808115285, 1726060107808868469 ] duration 753184, size 4194304, src_kind 1, dst_kind 3, correlation_id 2
        device_id 0, context_id 1, stream_id 13, graph_id 0, graph_node_id 0, channel_id 10, channel_type ASYNC_MEMCPY
.
.
.
CONCURRENT_KERNEL [ 1737454707775744135, 1737454707775763143 ] duration 19008, "vector_add", correlation_id 5, cache_config_requested 0, cache_config_executed 0
    grid [8192, 1, 1], block [128, 1, 1], cluster [0, 0, 0], shared_memory (0, 0)
    device_id 0, context_id 1, stream_id 13, graph_id 0, graph_node_id 0, channel_id 1, channel_type COMPUTE
.
.
.
MEMCPY "DTOH" [ 1737455038429091384, 1737455038429825494 ] duration 734110, size 4194304, src_kind 3, dst_kind 1, correlation_id 15
    device_id 0, context_id 1, stream_id 13, graph_id 0, graph_node_id 0, channel_id 12, channel_type ASYNC_MEMCPY

verify_result: PASS

Use the CuptiPmSamplingVectorAddNumba.py sample to collect a PM Sampling metric (sm__cycles_elapsed.sum) while a CUDA vector-add workload runs:

$ python3 CuptiPmSamplingVectorAddNumba.py --metrics sm__cycles_elapsed.sum
verify_result: PASS
Number of completed samples: 10000
Sample Index: 0, Start Timestamp: 1779343039448356616, End Timestamp: 1779343039589277282
    sm__cycles_elapsed.sum: 99602762.0

Sample Index: 1, Start Timestamp: 1779343039589277282, End Timestamp: 1779343039589287298
    sm__cycles_elapsed.sum: 1207148.0

.
.
.
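To check how long each sample window lasted, the timestamps in the output above can be subtracted. The sketch below parses the "Sample Index" lines with a regular expression. It assumes the exact line format shown above, and `sample_durations` is a hypothetical helper. The timestamp unit is not stated in the output; it is presumably nanoseconds.

```python
import re

# Matches lines such as:
# Sample Index: 1, Start Timestamp: 1779343039589277282, End Timestamp: 1779343039589287298
SAMPLE_RE = re.compile(
    r"Sample Index: (\d+), Start Timestamp: (\d+), End Timestamp: (\d+)"
)


def sample_durations(output_text):
    """Return (sample_index, end - start) for every sample line in the text."""
    return [
        (int(index), int(end) - int(start))
        for index, start, end in SAMPLE_RE.findall(output_text)
    ]


if __name__ == "__main__":
    text = """\
Sample Index: 0, Start Timestamp: 1779343039448356616, End Timestamp: 1779343039589277282
Sample Index: 1, Start Timestamp: 1779343039589277282, End Timestamp: 1779343039589287298
"""
    for index, duration in sample_durations(text):
        print(index, duration)
```

For the two samples shown, the differences are 140920666 and 10016. The second value is close to the default sampling interval of 10000, while the first sample covers a much longer window.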