CUPTI Python Tutorial

This tutorial walks through profiling a simple CUDA kernel with CUPTI Python. Starting from a basic vector addition kernel, it adds CUPTI Python API calls one step at a time to collect and display the kernel's name and duration.

Simple Vector Addition with Numba

Let’s begin by writing a simple CUDA kernel using Numba to add two vectors:

import numpy as np
from numba import cuda

@cuda.jit
def vector_add(A, B, C):
    idx = cuda.grid(1)
    if idx < A.size:
        C[idx] = A[idx] + B[idx]

# Initialize data
vector_length = 1024 * 1024
A = np.random.rand(vector_length)
B = np.random.rand(vector_length)
C = np.zeros_like(A)

threads_per_block = 128
blocks_per_grid = (vector_length + (threads_per_block - 1)) // threads_per_block

# Launch kernel
vector_add[blocks_per_grid, threads_per_block](A, B, C)
cuda.synchronize()

This code runs a vector addition on the GPU. At this point, no profiling information is being collected.
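
The launch configuration above uses ceiling division so that the grid covers every element even when the vector length is not a multiple of the block size. The bounds check in the kernel then masks off any surplus threads. A minimal sketch of that arithmetic in plain Python follows; the helper name `blocks_needed` is illustrative and is not part of Numba or CUPTI.

```python
def blocks_needed(num_elements: int, threads_per_block: int) -> int:
    """Return the smallest block count whose threads cover num_elements."""
    return (num_elements + threads_per_block - 1) // threads_per_block

# The tutorial's configuration: 1024 * 1024 elements, 128 threads per block.
exact_fit = blocks_needed(1024 * 1024, 128)

# A length that is not a multiple of the block size needs one extra,
# partially used block.
ragged = blocks_needed(1000, 128)
surplus_threads = ragged * 128 - 1000

print(exact_fit, ragged, surplus_threads)
```

In the second case the surplus threads compute an index past the end of the array, which is why the kernel guards its write with `if idx < A.size`.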

Step 1: Register CUPTI Callbacks

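Next, import the CUPTI Python API, define the two callback functions, and register them. CUPTI calls the first function when it needs a buffer for activity records and the second when a buffer of completed records is ready to be processed. Add the following lines after the imports and before launching the kernel: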

from cupti import cupti

def func_buffer_requested():
    buffer_size = 8 * 1024 * 1024  # 8 MiB buffer
    max_num_records = 0
    return buffer_size, max_num_records

def func_buffer_completed(activities: list):
    for activity in activities:
        if activity.kind == cupti.ActivityKind.CONCURRENT_KERNEL:
            print(f"kernel name = {activity.name}")
            print(f"kernel duration (ns) = {activity.end - activity.start}")

cupti.activity_register_callbacks(func_buffer_requested, func_buffer_completed)
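
Printing inside the completion callback is convenient for a first run, but the same records can also be accumulated for later analysis. The sketch below assumes only what the callback above already relies on, namely that each kernel record exposes `name`, `start`, and `end`, with timestamps in nanoseconds. The `SimpleNamespace` objects are hypothetical stand-ins for CUPTI records so the example runs without a GPU, and `total_duration_by_kernel` is an illustrative helper, not a CUPTI Python function.

```python
from collections import defaultdict
from types import SimpleNamespace

def total_duration_by_kernel(activities):
    """Sum (end - start) in nanoseconds for each kernel name."""
    totals = defaultdict(int)
    for activity in activities:
        totals[activity.name] += activity.end - activity.start
    return dict(totals)

# Hypothetical stand-in records; real ones are delivered by CUPTI.
fake_records = [
    SimpleNamespace(name="vector_add", start=1_000, end=50_472),
    SimpleNamespace(name="vector_add", start=60_000, end=109_000),
    SimpleNamespace(name="other_kernel", start=200_000, end=200_500),
]

totals = total_duration_by_kernel(fake_records)
print(totals)
```

Inside `func_buffer_completed`, the same helper could be applied to the records that pass the `CONCURRENT_KERNEL` check.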

Step 2: Enable CUPTI Activity Collection

Then, enable kernel activity collection. Add the following line after registering the callbacks and before launching the kernel:

cupti.activity_enable(cupti.ActivityKind.CONCURRENT_KERNEL)

Step 3: Flush and Disable CUPTI Activity Collection

After profiling is complete, flush any remaining activity records and disable CUPTI activity collection. Add these lines after the synchronization call:

cupti.activity_flush_all(1)
cupti.activity_disable(cupti.ActivityKind.CONCURRENT_KERNEL)
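
If the flush and disable calls should run even when the profiled code raises an exception, one option is to wrap the enable, flush, and disable sequence in a context manager. This is a sketch, not part of CUPTI Python: `activity_collection` is a hypothetical helper, and `_RecordingStub` is a stand-in that only records calls so the ordering can be checked without a GPU. With the real module, one would pass `cupti` and `cupti.ActivityKind.CONCURRENT_KERNEL` instead.

```python
from contextlib import contextmanager

@contextmanager
def activity_collection(cupti_module, kind):
    """Enable collection for `kind`, then flush and disable on exit."""
    cupti_module.activity_enable(kind)
    try:
        yield
    finally:
        # Same calls, in the same order, as in Step 3 of the tutorial.
        cupti_module.activity_flush_all(1)
        cupti_module.activity_disable(kind)

class _RecordingStub:
    """Hypothetical stand-in for the cupti module; records call order only."""
    def __init__(self):
        self.calls = []

    def activity_enable(self, kind):
        self.calls.append(("enable", kind))

    def activity_flush_all(self, flag):
        self.calls.append(("flush_all", flag))

    def activity_disable(self, kind):
        self.calls.append(("disable", kind))

stub = _RecordingStub()
with activity_collection(stub, "CONCURRENT_KERNEL"):
    # The kernel launch and cuda.synchronize() would go here.
    stub.calls.append(("launch", "vector_add"))

print(stub.calls)
```

Keeping the synchronization call inside the `with` block preserves the tutorial's ordering: the kernel finishes before the records are flushed.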

Your final code should look like this:

import numpy as np
from numba import cuda
from cupti import cupti

@cuda.jit
def vector_add(A, B, C):
    idx = cuda.grid(1)
    if idx < A.size:
        C[idx] = A[idx] + B[idx]

def func_buffer_requested():
    buffer_size = 8 * 1024 * 1024  # 8 MiB buffer
    max_num_records = 0
    return buffer_size, max_num_records

def func_buffer_completed(activities: list):
    for activity in activities:
        if activity.kind == cupti.ActivityKind.CONCURRENT_KERNEL:
            print(f"kernel name = {activity.name}")
            print(f"kernel duration (ns) = {activity.end - activity.start}")

# Step 1: Register CUPTI callbacks
cupti.activity_register_callbacks(func_buffer_requested, func_buffer_completed)

# Step 2: Enable CUPTI activity collection
cupti.activity_enable(cupti.ActivityKind.CONCURRENT_KERNEL)

# Initialize data
vector_length = 1024 * 1024
A = np.random.rand(vector_length)
B = np.random.rand(vector_length)
C = np.zeros_like(A)

threads_per_block = 128
blocks_per_grid = (vector_length + (threads_per_block - 1)) // threads_per_block

# Launch kernel
vector_add[blocks_per_grid, threads_per_block](A, B, C)
cuda.synchronize()

# Step 3: Flush and disable CUPTI activity collection
cupti.activity_flush_all(1)
cupti.activity_disable(cupti.ActivityKind.CONCURRENT_KERNEL)

Expected Output

Running the code above should produce output similar to the following:

kernel name = _ZN8__main__10vector_addB2v1B92cw51cXTLSUwv1sDUaKthoaNgqamjgOR3W3CwAkMXLaJtQYkOIgxJU0gCqOkEJoHkbttqdVhoqlspQGNFHSgJ5BnXagIAE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE
kernel duration (ns) = 49472

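This indicates that CUPTI has successfully captured and reported the name and duration of the launched CUDA kernel. The name appears in the mangled form that Numba generates for the compiled kernel, and the exact duration varies between runs and GPUs.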
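
The reported kernel name begins with an Itanium-style `_ZN` prefix followed by length-prefixed identifiers, so the module and function name can be recovered from its leading part. The sketch below reads only those leading components and stops at the first character that is not a length digit; it is not a full demangler, and `leading_name_parts` is an illustrative helper rather than a Numba or CUPTI function.

```python
def leading_name_parts(mangled: str) -> list[str]:
    """Return the leading <length><identifier> components after '_ZN'."""
    prefix = "_ZN"
    if not mangled.startswith(prefix):
        return []
    parts = []
    pos = len(prefix)
    while pos < len(mangled) and mangled[pos].isdigit():
        # Read the decimal length prefix.
        digits_end = pos
        while digits_end < len(mangled) and mangled[digits_end].isdigit():
            digits_end += 1
        length = int(mangled[pos:digits_end])
        # Take exactly `length` characters as the identifier.
        parts.append(mangled[digits_end:digits_end + length])
        pos = digits_end + length
    return parts

# Leading portion of the kernel name from the sample output.
sample = "_ZN8__main__10vector_addB2v1B92cw51c"
readable = ".".join(leading_name_parts(sample))
print(readable)
```

Applied to `activity.name` in the completion callback, this would give a shorter label for the kernel while leaving the full mangled name available for exact matching.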

CUPTI Python can be used to collect more detailed profiling information. Refer to the CUPTI Python samples section.