Samples#
Once CUPTI Python is installed, the CUPTI samples are located under the site-packages/cupti-python-samples directory. You can determine the location of your site-packages directory by executing the following command:
$ python3 -m site
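The same lookup can be done from Python. The following is a minimal sketch that uses only the standard library; the helper names `candidate_site_packages` and `find_samples_dirs` are hypothetical and are not part of CUPTI Python. It assumes the samples directory is named exactly as stated above.

```python
import site
import sysconfig
from pathlib import Path

# Directory name taken from the text above.
SAMPLES_DIR_NAME = "cupti-python-samples"


def candidate_site_packages():
    """Return the site-packages directories known to this interpreter, without duplicates."""
    candidates = []
    # getsitepackages() can be missing in some older virtual environments, so guard it.
    if hasattr(site, "getsitepackages"):
        candidates.extend(site.getsitepackages())
    candidates.append(site.getusersitepackages())
    candidates.append(sysconfig.get_paths()["purelib"])

    seen, unique = set(), []
    for entry in candidates:
        path = Path(entry)
        if path not in seen:
            seen.add(path)
            unique.append(path)
    return unique


def find_samples_dirs():
    """Return every existing <site-packages>/cupti-python-samples directory."""
    return [
        base / SAMPLES_DIR_NAME
        for base in candidate_site_packages()
        if (base / SAMPLES_DIR_NAME).is_dir()
    ]


if __name__ == "__main__":
    found = find_samples_dirs()
    for directory in found:
        print(directory)
    if not found:
        print("No", SAMPLES_DIR_NAME, "directory found under:")
        for base in candidate_site_packages():
            print("  ", base)
```

If nothing is found, the fallback branch prints the directories that were searched, which is the same information `python3 -m site` reports.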
Setting up Numba CUDA#
The CUPTI Python Numba samples require the numba-cuda package along with the dependencies for CUDA 13.x. You can install numba-cuda using the following command:
$ pip install numba-cuda[cu13]
Samples#
The CuptiVectorAdd* samples contain simple code that performs element-wise vector addition.
CuptiVectorAddNumba.py#
CUPTI Python sample which shows use of CUPTI Activity APIs. This sample uses numba-cuda.
- Command line options:
- --profile, -p
Enable CUPTI based profiling. Default: OFF
- --output, -o OUTPUT_TYPE
Select the profiler output format.
OUTPUT_TYPE can be: brief, detailed, or none. Default: brief
- --help, -h
Shows the usage.
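For readers who want to see how such an option set maps onto Python, here is a minimal `argparse` sketch of the options listed above. It is an illustration only and is not the sample's actual parser; `build_parser` is a hypothetical name, and the attribute names `profile` and `output` are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented options; not the sample's own code."""
    parser = argparse.ArgumentParser(
        description="Vector addition with optional CUPTI based profiling"
    )
    parser.add_argument(
        "--profile", "-p",
        action="store_true",  # OFF unless the flag is given
        help="Enable CUPTI based profiling. Default: OFF",
    )
    parser.add_argument(
        "--output", "-o",
        metavar="OUTPUT_TYPE",
        choices=("brief", "detailed", "none"),
        default="brief",
        help="Select the profiler output format. Default: brief",
    )
    # argparse adds --help/-h automatically.
    return parser


# Parse explicit argument lists so the sketch does not depend on sys.argv.
defaults = build_parser().parse_args([])
print(defaults.profile, defaults.output)

custom = build_parser().parse_args(["--profile", "-o", "detailed"])
print(custom.profile, custom.output)
```

Using `choices` makes `argparse` reject any output type other than the three documented values.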
CuptiVectorAddNumbaCallback.py#
CUPTI Python sample which shows use of CUPTI Callback APIs. This sample uses numba-cuda.
- Command line options:
- --profile, -p
Enable CUPTI based profiling. Default: OFF
- --output, -o OUTPUT_TYPE
Select the profiler output format.
OUTPUT_TYPE can be: brief, detailed, or none. Default: brief
- --help, -h
Shows the usage.
CuptiVectorAddDrv.py#
CUPTI Python sample which shows use of CUPTI Activity APIs. This sample uses CUDA Python Driver APIs from cuda-bindings. It also shows how to use CUDA profiler start and stop APIs to define the range of code to be profiled.
This sample uses NVRTC (NVIDIA Runtime Compilation) to compile CUDA kernel code to PTX at runtime. The sample demonstrates:
- Using cuda.bindings.nvrtc to compile CUDA kernel source code to PTX
- Using cuda.bindings.driver APIs to load the PTX module and launch kernels
- Using CUPTI Activity APIs to profile the CUDA operations
To ensure that cuda-bindings is set up correctly, along with the necessary CUDA Toolkit (CTK) components (including NVRTC), refer to the cuda-bindings runtime requirements documentation.
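The NVRTC step can be sketched as follows. This is a minimal illustration, not the code from CuptiVectorAddDrv.py: the kernel source, the file name `vector_add.cu`, and the helpers `check` and `compile_to_ptx` are assumptions made for the example. It assumes cuda-bindings and an NVRTC library are installed; the import is guarded so that the file can still be loaded on a machine without them.

```python
try:
    from cuda.bindings import nvrtc  # provided by the cuda-bindings package
except ImportError:  # allows loading this file where cuda-bindings is absent
    nvrtc = None

# Illustrative kernel; the sample's own kernel source may differ.
KERNEL_SOURCE = r"""
extern "C" __global__
void vector_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
"""


def check(result):
    """Unpack a cuda.bindings result tuple and raise if the status is not success."""
    status, *values = result
    if status != nvrtc.nvrtcResult.NVRTC_SUCCESS:
        raise RuntimeError(f"NVRTC call failed: {status}")
    return values[0] if values else None


def compile_to_ptx(source, name=b"vector_add.cu", options=()):
    """Compile CUDA C++ source text to PTX bytes at runtime with NVRTC."""
    if nvrtc is None:
        raise RuntimeError("cuda-bindings is not installed")
    # No extra headers are passed: header count 0 with empty header lists.
    prog = check(nvrtc.nvrtcCreateProgram(source.encode(), name, 0, [], []))
    options = list(options)  # each option is a bytes object, e.g. b"--fmad=false"
    check(nvrtc.nvrtcCompileProgram(prog, len(options), options))
    size = check(nvrtc.nvrtcGetPTXSize(prog))
    ptx = b" " * size  # buffer that NVRTC fills in place
    check(nvrtc.nvrtcGetPTX(prog, ptx))
    return ptx
```

The returned PTX bytes are what a driver API call such as `cuModuleLoadData` would consume in the next step. A fuller version would also fetch the compile log when compilation fails.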
- Command line options:
- --profile, -p
Enable CUPTI based profiling. Default: OFF
- --define-profile-range, -r
Include CUDA profiler start and stop APIs to define the range of code to be profiled. Default: OFF
- --output, -o OUTPUT_TYPE
Select the profiler output format.
OUTPUT_TYPE can be: brief, detailed, or none. Default: brief
- --help, -h
Shows the usage.
CuptiPmSamplingVectorAddNumba.py#
CUPTI Python sample which shows how to use the PM Sampling and Profiler Host APIs through the pythonic layer (cupti.pm_sampling and cupti.profiler_host).
This sample uses numba-cuda.
It supports two workflows, selected by which flags are passed: if any Query Metrics flag is given, the Query workflow runs and the sample exits after printing; otherwise the Collection workflow runs.
Query workflow — inspect what the current GPU exposes: supported chips, base metrics, per-metric properties (description, units, scope, sub-metrics), single-pass metric sets, and the metrics in a given single-pass set.
Collection workflow — configure PM Sampling for a metric set, run a CUDA vector-add workload, stop and decode the PM Sampling data, and print sampled metric values with timestamps.
- Query Metrics options (run the Query workflow; sample exits after printing):
- --supported-chips
List the supported chips.
- --base-metrics
List base metrics available on the selected device.
- --metric-properties <metric[,metric…]>
Print metric properties (description, unit, scope, sub-metrics, etc.) for each comma-separated metric name.
- --single-pass-metric-sets
List single-pass metric sets available on the selected device.
- --metrics-in-single-pass-sets <set>
List metrics in the given single-pass metric set.
- PM Sampling options (control the Collection workflow):
- --metrics <comma-separated metric list>
Metrics to collect. Default: gr__cycles_elapsed.max,sm__cycles_elapsed.sum.
- --hardware-buffer-size SIZE_BYTES
Hardware buffer size in bytes. Default: 536870912.
- --sampling-interval INTERVAL
Sampling interval for the selected trigger mode. Default: 10000.
- --trigger-mode MODE
Sampling trigger mode. One of gpu_time_interval, gpu_sysclk_interval. Default: gpu_time_interval.
- --device-index INDEX
CUDA device index to use. Default: 0.
- --help, -h
Shows the usage.
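The rule that any Query Metrics flag selects the Query workflow, and that the Collection workflow runs otherwise, can be sketched with `argparse`. This is a hypothetical reconstruction for illustration: `build_parser`, `select_workflow`, and `QUERY_DESTS` are invented names, and the sample's real implementation may be organized differently. The defaults are the ones documented above.

```python
import argparse

# Attribute names argparse derives from the documented Query Metrics flags.
QUERY_DESTS = (
    "supported_chips",
    "base_metrics",
    "metric_properties",
    "single_pass_metric_sets",
    "metrics_in_single_pass_sets",
)


def build_parser():
    """Hypothetical parser covering the documented Query and PM Sampling options."""
    parser = argparse.ArgumentParser()

    query = parser.add_argument_group("Query Metrics options")
    query.add_argument("--supported-chips", action="store_true")
    query.add_argument("--base-metrics", action="store_true")
    query.add_argument("--metric-properties", metavar="METRICS")
    query.add_argument("--single-pass-metric-sets", action="store_true")
    query.add_argument("--metrics-in-single-pass-sets", metavar="SET")

    sampling = parser.add_argument_group("PM Sampling options")
    sampling.add_argument(
        "--metrics", default="gr__cycles_elapsed.max,sm__cycles_elapsed.sum"
    )
    # 536870912 bytes is 512 MiB.
    sampling.add_argument("--hardware-buffer-size", type=int, default=536870912)
    sampling.add_argument("--sampling-interval", type=int, default=10000)
    sampling.add_argument(
        "--trigger-mode",
        choices=("gpu_time_interval", "gpu_sysclk_interval"),
        default="gpu_time_interval",
    )
    sampling.add_argument("--device-index", type=int, default=0)
    return parser


def select_workflow(args):
    """Return 'query' if any Query Metrics flag was given, otherwise 'collection'."""
    if any(getattr(args, dest) for dest in QUERY_DESTS):
        return "query"
    return "collection"


parser = build_parser()
print(select_workflow(parser.parse_args(["--base-metrics"])))
print(select_workflow(parser.parse_args(["--metrics", "sm__cycles_elapsed.sum"])))
```

Passing a PM Sampling option together with a Query Metrics flag still selects the Query workflow in this sketch, which matches the "any Query Metrics flag" wording above.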
cupyprof.py#
CUPTI Python sample which shows how to profile a CUDA Python application using the CUPTI Python APIs without having to modify the CUDA Python application code.
This sample exercises the Activity, Callback, PM Sampling, and Profiler Host APIs. By default, only tracing (Activity + Callback) is enabled. PM Sampling is enabled when any PM Sampling option is passed; passing both PM Sampling and tracing options enables both at once. Query Metrics options use the Profiler Host API to inspect chips and metrics, and exit without profiling.
usage: python3 cupyprof.py [OPTIONS] <python_file_path> [args]
- Output:
- --output, -o OUTPUT_TYPE
Select the profiler output format.
OUTPUT_TYPE can be: brief, detailed, or none. Default: brief.
- Tracing options (Activity + Callback APIs):
- --profile, -p PROFILING_TYPE
Enable profiling for the entire CUDA Python program, or only for the subset between cuProfilerStart and cuProfilerStop.
PROFILING_TYPE can be from_start or range. Default: from_start.
- --activity, -a <comma-separated list of activities>
Activities to trace. Use --help to view all supported values; the defaults are listed under DEFAULT_ACTIVITY_CHOICES in cupyprof.py.
- PM Sampling options (passing any of these enables PM Sampling):
- --metrics, -m <comma-separated metric list>
Metrics to collect. Default: gr__cycles_elapsed.max,sm__cycles_elapsed.sum.
- --hardware-buffer-size, -b SIZE_BYTES
Hardware buffer size in bytes. Default: 536870912.
- --sampling-interval, -i INTERVAL
Sampling interval for the selected trigger mode. Default: 10000.
- --trigger-mode, -t MODE
Sampling trigger mode. One of gpu_time_interval, gpu_sysclk_interval. Default: gpu_time_interval.
- --num-of-samples, -n COUNT
Number of samples in the PM sampling counter-data image. Default: 10000.
- --device-index INDEX
CUDA device index to use. Default: 0.
- Query Metrics options (mutually exclusive with Tracing and PM Sampling; cupyprof exits after printing):
- --supported-chips
List the supported chips.
- --base-metrics
List base metrics available on the selected device.
- --metric-properties <metric[,metric…]>
Print metric properties (description, unit, sub-metrics, etc.) for each comma-separated metric name.
- --single-pass-metric-sets
List single-pass metric sets available on the selected device.
- --metrics-in-single-pass-sets <set>
List metrics in the given single-pass metric set.
- --help, -h
Shows the usage.
python_file_path is the path to the CUDA Python application, and args are the arguments for that application.
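One common way for a wrapper to run an unmodified Python application inside its own process is the standard library `runpy` module. The sketch below illustrates that general technique only; it is not taken from cupyprof.py, which may launch the target differently. `run_target`, `profile_script`, and the `start_profiling` and `stop_profiling` callables are hypothetical placeholders for whatever setup and teardown the profiler performs.

```python
import runpy
import sys


def run_target(python_file_path, target_args):
    """Run another Python script in this process as if it had been started directly."""
    saved_argv = sys.argv
    # The target should see only its own arguments, not the wrapper's options.
    sys.argv = [python_file_path, *target_args]
    try:
        # run_name="__main__" makes the target's `if __name__ == "__main__":` block run.
        return runpy.run_path(python_file_path, run_name="__main__")
    finally:
        sys.argv = saved_argv


def profile_script(python_file_path, target_args, start_profiling, stop_profiling):
    """Bracket the target run with caller-supplied (hypothetical) profiler hooks."""
    start_profiling()
    try:
        return run_target(python_file_path, target_args)
    finally:
        # Runs even if the target raises, so collected data can still be flushed.
        stop_profiling()
```

Because the target runs in the same process as the wrapper, any profiling enabled before `run_target` is called also covers the CUDA work issued by the target, with no changes to the target's source.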
Examples of running samples#
Run the sample without profiling:
$ python3 CuptiVectorAddNumba.py
Run the sample with profiling enabled and use default output:
$ python3 CuptiVectorAddNumba.py --profile
profiling_enabled: True
prof_output: ProfOutput.BRIEF
vector_length: 1048576
threads_per_block: 128
blocks_per_grid: 8192
Activity Kind Start Duration correlationId Name
DRIVER 1714136661470990409 1834876 1 cuCtxGetCurrent
DRIVER 1714136661472854473 213 2 cuDeviceGetCount
DRIVER 1714136661472869777 87 3 cuDeviceGet
DRIVER 1714136661472880942 566 4 cuDeviceGetAttribute
DRIVER 1714136661472883507 69 5 cuDeviceGetAttribute
DRIVER 1714136661472906825 3702 6 cuDeviceGetName
DRIVER 1714136661472969577 87 7 cuDeviceGetUuid_v2
DRIVER 1714136661472991812 140587104 8 cuDevicePrimaryCtxRetain
.
.
.
DRIVER 1714136661714686225 218 88 cuCtxGetCurrent
DRIVER 1714136661714688211 55 89 cuCtxGetDevice
DRIVER 1714136661714702981 2080 90 cuCtxSynchronize
verify_result: PASS
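The launch configuration printed above is consistent with the usual ceiling-division rule for one-dimensional CUDA grids: use enough blocks that every element gets a thread. The sketch below shows that rule; it is an assumption about how the value is derived, not code taken from the sample, and `blocks_per_grid` is a hypothetical helper.

```python
def blocks_per_grid(vector_length, threads_per_block):
    """Smallest number of blocks whose threads cover every element (ceiling division)."""
    return (vector_length + threads_per_block - 1) // threads_per_block


# Values from the output above: 1048576 elements with 128 threads per block.
print(blocks_per_grid(1048576, 128))  # 8192

# When the length is not a multiple of the block size, the last block is partly idle,
# which is why vector-add kernels normally guard with `if i < n`.
print(blocks_per_grid(1000, 128))  # 8
```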
Using the cupyprof.py sample to profile a CUDA Python application with a profiling range defined and with detailed output:
$ python3 cupyprof.py --profile range --output detailed ./CuptiVectorAddDrv.py --define-profile-range
profiling_enabled: False
prof_output: ProfOutput.BRIEF
profile_range: True
vector_length: 1048576
threads_per_block: 128
blocks_per_grid: 8192
MEMCPY "HTOD" [ 1726060107808115285, 1726060107808868469 ] duration 753184, size 4194304, src_kind 1, dst_kind 3, correlation_id 2
device_id 0, context_id 1, stream_id 13, graph_id 0, graph_node_id 0, channel_id 10, channel_type ASYNC_MEMCPY
.
.
.
CONCURRENT_KERNEL [ 1737454707775744135, 1737454707775763143 ] duration 19008, "vector_add", correlation_id 5, cache_config_requested 0, cache_config_executed 0
grid [8192, 1, 1], block [128, 1, 1], cluster [0, 0, 0], shared_memory (0, 0)
device_id 0, context_id 1, stream_id 13, graph_id 0, graph_node_id 0, channel_id 1, channel_type COMPUTE
.
.
.
MEMCPY "DTOH" [ 1737455038429091384, 1737455038429825494 ] duration 734110, size 4194304, src_kind 3, dst_kind 1, correlation_id 15
device_id 0, context_id 1, stream_id 13, graph_id 0, graph_node_id 0, channel_id 12, channel_type ASYNC_MEMCPY
verify_result: PASS
Using the CuptiPmSamplingVectorAddNumba.py sample to collect a PM Sampling metric (sm__cycles_elapsed.sum) while a CUDA vector-add workload runs:
$ python3 CuptiPmSamplingVectorAddNumba.py --metrics sm__cycles_elapsed.sum
verify_result: PASS
Number of completed samples: 10000
Sample Index: 0, Start Timestamp: 1779343039448356616, End Timestamp: 1779343039589277282
sm__cycles_elapsed.sum: 99602762.0
Sample Index: 1, Start Timestamp: 1779343039589277282, End Timestamp: 1779343039589287298
sm__cycles_elapsed.sum: 1207148.0
.
.
.
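Output in this shape is straightforward to post-process. The following sketch parses the text format shown above and derives a per-sample duration as the end timestamp minus the start timestamp. It assumes the line layout stays exactly as printed here; `parse_samples` and the two regular expressions are hypothetical helpers, not part of the sample.

```python
import re

SAMPLE_RE = re.compile(
    r"Sample Index: (\d+), Start Timestamp: (\d+), End Timestamp: (\d+)"
)
# A metric line looks like "<metric name>: <numeric value>".
METRIC_RE = re.compile(r"^\s*([\w.]+): ([-+0-9.eE]+)\s*$")


def parse_samples(text):
    """Turn the printed PM Sampling output into a list of per-sample dictionaries."""
    samples = []
    for line in text.splitlines():
        header = SAMPLE_RE.search(line)
        if header:
            index, start, end = map(int, header.groups())
            samples.append(
                {"index": index, "start": start, "end": end,
                 "duration": end - start, "metrics": {}}
            )
            continue
        metric = METRIC_RE.match(line)
        if metric and samples:
            # Attach the metric to the most recent sample header.
            samples[-1]["metrics"][metric.group(1)] = float(metric.group(2))
    return samples


# The first two samples from the output above.
OUTPUT = """\
verify_result: PASS
Number of completed samples: 10000
Sample Index: 0, Start Timestamp: 1779343039448356616, End Timestamp: 1779343039589277282
sm__cycles_elapsed.sum: 99602762.0
Sample Index: 1, Start Timestamp: 1779343039589277282, End Timestamp: 1779343039589287298
sm__cycles_elapsed.sum: 1207148.0
"""

for sample in parse_samples(OUTPUT):
    print(sample["index"], sample["duration"], sample["metrics"])
```

For the two samples shown, the derived durations are 140920666 and 10016 timestamp units. The second value is close to the default sampling interval of 10000.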