CUPTI Python Tutorial¶
This tutorial guides the reader through profiling a simple CUDA kernel using CUPTI Python. It starts with a basic vector addition kernel and incrementally adds CUPTI Python API calls to collect and display the kernel's name and duration.
Simple Vector Addition with Numba¶
Let’s begin by writing a simple CUDA kernel using Numba to add two vectors:
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(A, B, C):
    idx = cuda.grid(1)
    if idx < A.size:
        C[idx] = A[idx] + B[idx]

# Initialize data
vector_length = 1024 * 1024
A = np.random.rand(vector_length)
B = np.random.rand(vector_length)
C = np.zeros_like(A)

threads_per_block = 128
blocks_per_grid = (vector_length + (threads_per_block - 1)) // threads_per_block

# Launch kernel
vector_add[blocks_per_grid, threads_per_block](A, B, C)
cuda.synchronize()
This code runs a vector addition on the GPU. At this point, no profiling information is being collected.
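One detail worth noting: the blocks_per_grid expression is integer ceiling division, which guarantees enough threads to cover every element even when vector_length is not a multiple of threads_per_block (the idx < A.size guard then masks off the excess threads). A quick standalone check of that arithmetic:

```python
threads_per_block = 128

for vector_length in (1024 * 1024, 1000, 1):
    blocks_per_grid = (vector_length + (threads_per_block - 1)) // threads_per_block
    # Enough threads to cover every element...
    assert blocks_per_grid * threads_per_block >= vector_length
    # ...but never a whole block more than needed.
    assert (blocks_per_grid - 1) * threads_per_block < vector_length
    print(vector_length, "elements ->", blocks_per_grid, "blocks")
```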
Step 1: Register CUPTI Callbacks¶
Next, import the CUPTI Python API, define the callback functions, and register them. Add the following lines after the imports and before launching the kernel:
from cupti import cupti

def func_buffer_requested():
    buffer_size = 8 * 1024 * 1024  # 8 MB buffer
    max_num_records = 0  # 0 = fill the buffer with as many records as fit
    return buffer_size, max_num_records

def func_buffer_completed(activities: list):
    for activity in activities:
        if activity.kind == cupti.ActivityKind.CONCURRENT_KERNEL:
            print(f"kernel name = {activity.name}")
            print(f"kernel duration (ns) = {activity.end - activity.start}")

cupti.activity_register_callbacks(func_buffer_requested, func_buffer_completed)
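If a script launches many kernels, printing two lines per record quickly gets noisy. The buffer-completed callback can aggregate instead. Below is a hypothetical summarize_kernels helper (not part of the CUPTI Python API) that sums durations per kernel name, using only the record fields shown above (kind, name, start, end); it is exercised here with stand-in record objects so it runs without a GPU:

```python
from types import SimpleNamespace

def summarize_kernels(activities, kernel_kind):
    # Sum duration (ns) per kernel name across all records of the given kind.
    totals = {}
    for activity in activities:
        if activity.kind == kernel_kind:
            totals[activity.name] = totals.get(activity.name, 0) + (activity.end - activity.start)
    return totals

# Stand-ins for cupti.ActivityKind.CONCURRENT_KERNEL and real activity records:
KIND = "CONCURRENT_KERNEL"
records = [
    SimpleNamespace(kind=KIND, name="vector_add", start=1_000, end=1_050),
    SimpleNamespace(kind=KIND, name="vector_add", start=2_000, end=2_030),
]
print(summarize_kernels(records, KIND))  # {'vector_add': 80}
```

In a real run, this helper would be called from func_buffer_completed with the activities list CUPTI delivers.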
Step 2: Enable CUPTI Activity Collection¶
Then, enable kernel activity collection. Add the following line after registering the callbacks and before launching the kernel:
cupti.activity_enable(cupti.ActivityKind.CONCURRENT_KERNEL)
Step 3: Flush and Disable CUPTI Activity¶
After profiling is complete, flush any remaining activity records and disable CUPTI activity collection. Add these lines after the synchronization call:
cupti.activity_flush_all(1)  # 1 = forced flush
cupti.activity_disable(cupti.ActivityKind.CONCURRENT_KERNEL)
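The enable/flush/disable pairing lends itself to a context manager, so collection cannot be left enabled by accident. A minimal sketch (kernel_activity is a hypothetical helper, not part of the CUPTI Python API, and assumes only the activity_enable, activity_flush_all, and activity_disable calls shown above):

```python
from contextlib import contextmanager

@contextmanager
def kernel_activity(cupti_module, kind):
    # Enable collection of one activity kind for the duration of a with-block,
    # then force-flush buffered records and disable the kind on exit.
    cupti_module.activity_enable(kind)
    try:
        yield
    finally:
        cupti_module.activity_flush_all(1)  # 1 = forced flush
        cupti_module.activity_disable(kind)

# Usage (after registering the callbacks from Step 1):
# with kernel_activity(cupti, cupti.ActivityKind.CONCURRENT_KERNEL):
#     vector_add[blocks_per_grid, threads_per_block](A, B, C)
#     cuda.synchronize()
```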
Your final code should look like this:
import numpy as np
from numba import cuda
from cupti import cupti

@cuda.jit
def vector_add(A, B, C):
    idx = cuda.grid(1)
    if idx < A.size:
        C[idx] = A[idx] + B[idx]

def func_buffer_requested():
    buffer_size = 8 * 1024 * 1024  # 8 MB buffer
    max_num_records = 0  # 0 = fill the buffer with as many records as fit
    return buffer_size, max_num_records

def func_buffer_completed(activities: list):
    for activity in activities:
        if activity.kind == cupti.ActivityKind.CONCURRENT_KERNEL:
            print(f"kernel name = {activity.name}")
            print(f"kernel duration (ns) = {activity.end - activity.start}")

# Step 1: Register CUPTI callbacks
cupti.activity_register_callbacks(func_buffer_requested, func_buffer_completed)

# Step 2: Enable CUPTI activity collection
cupti.activity_enable(cupti.ActivityKind.CONCURRENT_KERNEL)

# Initialize data
vector_length = 1024 * 1024
A = np.random.rand(vector_length)
B = np.random.rand(vector_length)
C = np.zeros_like(A)

threads_per_block = 128
blocks_per_grid = (vector_length + (threads_per_block - 1)) // threads_per_block

# Launch kernel
vector_add[blocks_per_grid, threads_per_block](A, B, C)
cuda.synchronize()

# Step 3: Flush and disable CUPTI activity collection
cupti.activity_flush_all(1)  # 1 = forced flush
cupti.activity_disable(cupti.ActivityKind.CONCURRENT_KERNEL)
Expected Output¶
Running the code above should produce output similar to the following:
kernel name = _ZN8__main__10vector_addB2v1B92cw51cXTLSUwv1sDUaKthoaNgqamjgOR3W3CwAkMXLaJtQYkOIgxJU0gCqOkEJoHkbttqdVhoqlspQGNFHSgJ5BnXagIAE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE
kernel duration (ns) = 49472
This indicates that CUPTI has successfully captured and reported the name and duration of the CUDA kernel that was launched. The long kernel name is the mangled symbol that Numba generates for vector_add; the exact string and the duration will vary between systems and runs.
CUPTI Python can be used to collect more detailed profiling information; refer to the CUPTI Python samples section for further examples.