Tutorial

This page provides step-by-step tutorials for each supported CUPTI API. Each tutorial builds a complete profiling program from a minimal CUDA workload.

If you are not sure which API to start with, see Overview.

CUDA kernel tracing using Activity API

This tutorial traces a simple CUDA kernel using the CUPTI Activity API, exposed in CUPTI Python through cupti.cupti. It starts with a basic vector addition kernel and adds Activity API calls one step at a time to collect and display the kernel name and duration.

Note

You will need numba-cuda to run this tutorial. Refer to Setting up Numba CUDA.

Simple Vector Addition with Numba

Let’s begin by writing a simple CUDA kernel using Numba to add two vectors:

import numpy as np
from numba import cuda

@cuda.jit
def vector_add(A, B, C):
    idx = cuda.grid(1)
    if idx < A.size:
        C[idx] = A[idx] + B[idx]

device_index = 0
cuda.select_device(device_index)

# Initialize data
vector_length = 1024 * 1024
A = np.random.rand(vector_length)
B = np.random.rand(vector_length)
C = np.zeros_like(A)

threads_per_block = 128
blocks_per_grid = (vector_length + (threads_per_block - 1)) // threads_per_block

# Launch kernel
vector_add[blocks_per_grid, threads_per_block](A, B, C)
cuda.synchronize()

This code runs a vector addition on the GPU. At this point, no profiling information is being collected.
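
The blocks_per_grid expression above is a ceiling division: it rounds up so that every element gets a thread, and the idx < A.size guard in the kernel makes any surplus threads in the last block do nothing. A minimal pure-Python sketch of that arithmetic follows; the helper name blocks_needed is hypothetical and not part of Numba.

```python
def blocks_needed(num_elements: int, threads_per_block: int) -> int:
    """Ceiling division: smallest block count whose threads cover every element."""
    return (num_elements + threads_per_block - 1) // threads_per_block

# The tutorial's vector length divides evenly, so no threads are idle.
even = blocks_needed(1024 * 1024, 128)

# A length that does not divide evenly needs one extra, partially used block.
uneven = blocks_needed(1000, 128)
surplus_threads = uneven * 128 - 1000

print(even, uneven, surplus_threads)
```

The surplus threads are the reason the kernel checks the index before writing to C.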

Step 1: Register CUPTI Callbacks

Next, import the CUPTI Python API, define the callback functions, and register them. Add the following lines after the imports and before launching the kernel:

from cupti import cupti

def func_buffer_requested():
    buffer_size = 8 * 1024 * 1024  # 8 MiB buffer
    max_num_records = 0  # 0 places no limit; the buffer holds as many records as fit
    return buffer_size, max_num_records

def func_buffer_completed(activities: list):
    for activity in activities:
        if activity.kind == cupti.ActivityKind.CONCURRENT_KERNEL:
            print(f"kernel name = {activity.name}")
            print(f"kernel duration (ns) = {activity.end - activity.start}")

cupti.activity_register_callbacks(func_buffer_requested, func_buffer_completed)
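
If you want to post-process the results rather than print them, the completion callback can store what it finds. The sketch below keeps the same filter as func_buffer_completed but appends (name, duration) pairs to a list. So that it runs without a GPU, it uses stand-ins: FakeKind and the SimpleNamespace records are hypothetical and only mimic the kind, name, start, and end attributes used in this tutorial.

```python
from enum import Enum, auto
from types import SimpleNamespace

class FakeKind(Enum):
    """Stand-in for cupti.ActivityKind (hypothetical, for illustration only)."""
    CONCURRENT_KERNEL = auto()
    MEMCPY = auto()

kernel_durations = []  # (name, duration_ns) pairs filled in by the callback

def collect_kernels(activities, kernel_kind=FakeKind.CONCURRENT_KERNEL):
    """Same filter as func_buffer_completed, but stores results for later use."""
    for activity in activities:
        if activity.kind == kernel_kind:
            kernel_durations.append((activity.name, activity.end - activity.start))

# Fake records standing in for what CUPTI would deliver to the callback.
fake_records = [
    SimpleNamespace(kind=FakeKind.CONCURRENT_KERNEL, name="vector_add", start=100, end=350),
    SimpleNamespace(kind=FakeKind.MEMCPY, start=400, end=900),
]
collect_kernels(fake_records)
print(kernel_durations)
```

With real CUPTI records, the comparison would be against cupti.ActivityKind.CONCURRENT_KERNEL, as in the tutorial code.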

Step 2: Enable CUPTI Activity Collection

Then, enable kernel activity collection. Add the following line after registering the callbacks and before launching the kernel:

cupti.activity_enable(cupti.ActivityKind.CONCURRENT_KERNEL)

Step 3: Flushing and Disabling CUPTI Activity

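After profiling is complete, flush any remaining activity records and disable CUPTI activity collection. Flushing delivers buffered records to the func_buffer_completed callback. The argument 1 corresponds to the forced-flush flag (CUPTI_ACTIVITY_FLAG_FLUSH_FORCED in the CUPTI C API). Add these lines after the synchronization call: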

cupti.activity_flush_all(1)
cupti.activity_disable(cupti.ActivityKind.CONCURRENT_KERNEL)

Your final code should look like this:

import numpy as np
from numba import cuda
from cupti import cupti

@cuda.jit
def vector_add(A, B, C):
    idx = cuda.grid(1)
    if idx < A.size:
        C[idx] = A[idx] + B[idx]

def func_buffer_requested():
    buffer_size = 8 * 1024 * 1024  # 8 MiB buffer
    max_num_records = 0  # 0 places no limit; the buffer holds as many records as fit
    return buffer_size, max_num_records

def func_buffer_completed(activities: list):
    for activity in activities:
        if activity.kind == cupti.ActivityKind.CONCURRENT_KERNEL:
            print(f"kernel name = {activity.name}")
            print(f"kernel duration (ns) = {activity.end - activity.start}")

# Step 1: Register CUPTI callbacks
cupti.activity_register_callbacks(func_buffer_requested, func_buffer_completed)

# Step 2: Enable CUPTI Activity Collection
cupti.activity_enable(cupti.ActivityKind.CONCURRENT_KERNEL)

device_index = 0
cuda.select_device(device_index)

# Initialize data
vector_length = 1024 * 1024
A = np.random.rand(vector_length)
B = np.random.rand(vector_length)
C = np.zeros_like(A)

threads_per_block = 128
blocks_per_grid = (vector_length + (threads_per_block - 1)) // threads_per_block

# Launch kernel
vector_add[blocks_per_grid, threads_per_block](A, B, C)
cuda.synchronize()

# Step 3: Flushing and Disabling CUPTI Activity
cupti.activity_flush_all(1)
cupti.activity_disable(cupti.ActivityKind.CONCURRENT_KERNEL)

Expected Output

Running the code above produces output similar to the following:

kernel name = _ZN8__main__10vector_addB2v1B92cw51cXTLSUwv1sDUaKthoaNgqamjgOR3W3CwAkMXLaJtQYkOIgxJU0gCqOkEJoHkbttqdVhoqlspQGNFHSgJ5BnXagIAE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE
kernel duration (ns) = 49472

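This indicates that CUPTI has successfully captured and reported the name and duration of the launched CUDA kernel. The name is the mangled symbol that Numba generates for vector_add, and the duration varies between runs and GPUs.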
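
The reported name follows the Itanium C++ ABI style of mangling, in which _ZN is followed by length-prefixed name components. As a rough sketch, the readable module and function names can be recovered by reading those leading components. The helper leading_name_parts is hypothetical and is not a full demangler; the input string is a shortened form of the name shown above, with the long suffix omitted.

```python
def leading_name_parts(mangled: str) -> list[str]:
    """Read the length-prefixed components that follow the '_ZN' prefix.

    Not a full demangler: it stops at the first position that does not
    start a <decimal length><identifier> component.
    """
    if not mangled.startswith("_ZN"):
        return []
    parts, pos = [], 3
    while pos < len(mangled) and mangled[pos].isdigit():
        end = pos
        while end < len(mangled) and mangled[end].isdigit():
            end += 1
        length = int(mangled[pos:end])
        parts.append(mangled[end:end + length])
        pos = end + length
    return parts

# Shortened form of the kernel name from the expected output.
name = "_ZN8__main__10vector_addB2v1"
readable = ".".join(leading_name_parts(name))

# The duration is reported in nanoseconds; convert to microseconds.
duration_us = 49472 / 1000

print(readable, duration_us)
```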

Periodic metric sampling using PM Sampling API

This tutorial walks through collecting PM (Performance Monitor) samples while a CUDA kernel runs, using the Pythonic PM Sampling layer in cupti.pm_sampling. Each step introduces one method on cupti.pm_sampling.Collector and explains what it does and when to call it.

PM Sampling periodically captures GPU hardware counter values at a configured interval while your workload executes. The result is a time series of metric values that you can iterate over once collection has stopped.

Note

You will need numba-cuda to run this tutorial. See Setting up Numba CUDA.

Starting workload: vector addition with Numba

We use the vector_add workload from the Simple Vector Addition with Numba section of the Activity API tutorial. The PM Sampling steps below add Collector calls around the kernel launch, so that hardware counters are sampled while the kernel is running.

Step 1: Create the Collector

A cupti.pm_sampling.Collector manages the lifecycle of one PM Sampling session bound to a single CUDA device. Construction is cheap and does not touch CUPTI yet; it only records the device index.

from cupti import pm_sampling

collector = pm_sampling.Collector(device_index=device_index)

Step 2: Enable PM Sampling on the device

enable() initializes profiler state for the process, resolves the device chip name, and creates the underlying CUPTI PM Sampling object. You must call this before configure(), start(), stop(), or decode().

CUDA must already be initialized on the target device before enable() is called.

cuda.select_device(device_index)

collector = pm_sampling.Collector(device_index=device_index)
collector.enable()

Common failure modes are surfaced as Python exceptions:

  • ValueError if device_index is invalid.

  • RuntimeError if PM Sampling is already enabled on the device.

  • PermissionError if the process lacks the privileges required to read GPU performance counters. See the NVIDIA developer-tools permissions note.

Always pair enable() with a disable() call (Step 6) so that profiler state is torn down before the process exits.
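
One way to guarantee that pairing, even when the profiled code raises, is a small context manager. The sketch below is not part of CUPTI Python: the enabled helper is hypothetical and only assumes an object with enable() and disable() methods, and FakeCollector is a stand-in so the example runs without a GPU.

```python
from contextlib import contextmanager

@contextmanager
def enabled(collector):
    """Enable the collector on entry and always disable it on exit."""
    collector.enable()
    try:
        yield collector
    finally:
        collector.disable()

class FakeCollector:
    """Stand-in that records calls; not the real pm_sampling.Collector."""
    def __init__(self):
        self.calls = []

    def enable(self):
        self.calls.append("enable")

    def disable(self):
        self.calls.append("disable")

fake = FakeCollector()
try:
    with enabled(fake):
        raise RuntimeError("simulated failure while profiling")
except RuntimeError:
    pass

print(fake.calls)
```

If enable() itself raises, for example with PermissionError, the with body never runs and disable() is not called, which matches the fact that nothing was enabled.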

Step 3: Configure metrics and sampling parameters

configure() selects which metrics to collect and how the GPU should trigger samples. You can call it again later to reconfigure the same enabled collector.

collector.configure(
    metrics=[
        "gr__cycles_elapsed.max",
        "sm__cycles_elapsed.sum",
    ],
    hardware_buffer_size=512 * 1024 * 1024,
    sampling_interval=10000,
    trigger_mode=pm_sampling.TriggerMode.GPU_TIME_INTERVAL,
)

Key arguments:

  • metrics – names of metrics to collect; they must be collectable in a single pass.

  • hardware_buffer_size – size (in bytes) of the device-side hardware buffer that holds raw samples until decoded.

  • sampling_interval – sampling period; its unit depends on trigger_mode.

  • trigger_mode – when the GPU should emit a sample. See cupti.pm_sampling.TriggerMode.

If the configuration cannot be satisfied (for example, the requested metrics cannot fit in a single pass, or the trigger mode is unsupported on the GPU), configure() raises ValueError. To discover what is supported on the current chip, use cupti.profiler_host query helpers such as get_supported_chips(), get_single_pass_sets(), and get_metrics_in_single_pass_set().
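
When choosing sampling_interval, it can help to estimate how many samples a profiled region will produce. The sketch below assumes the interval is expressed in nanoseconds for the time-based trigger mode; that unit is an assumption here, so check the TriggerMode documentation. The region duration is a made-up value for illustration.

```python
def estimated_samples(region_duration_ns: int, sampling_interval_ns: int) -> int:
    """Rough number of samples emitted while the profiled region runs."""
    return region_duration_ns // sampling_interval_ns

interval_ns = 10_000   # value passed to configure() in this tutorial (unit assumed: ns)
region_ns = 420_000    # hypothetical time between start() and stop()
max_samples = 200      # value passed to decode() in Step 5

expected = estimated_samples(region_ns, interval_ns)
fits = expected <= max_samples
print(expected, fits)
```

A shorter interval gives finer time resolution but more samples, which increases pressure on the hardware buffer.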

Step 4: Sample around the workload

start() begins periodic sample collection on the device: the GPU starts writing samples into the hardware buffer based on the configured trigger mode and interval. stop() halts new sample collection; samples already written into the hardware buffer remain available for decoding.

Call start() immediately before the code you want to profile, and stop() immediately after.

collector.start()

vector_add[blocks_per_grid, threads_per_block](A, B, C)
cuda.synchronize()

collector.stop()

Step 5: Decode samples and read metric values

decode() pulls raw records from the hardware buffer into a host-side counter data image and returns a cupti.pm_sampling.CounterData object.

counter_data = collector.decode(max_samples=200)

decode() raises MemoryError if the hardware buffer overflowed during sampling. To address this, increase the hardware_buffer_size passed to configure(), or increase the sampling_interval so fewer samples are emitted per unit time.
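
The second remedy can be automated by retrying with a longer interval. The sketch below is one possible approach, not a CUPTI Python feature: decode_with_backoff is a hypothetical helper, and run_once stands for a user-supplied function that reconfigures the collector with the given interval, reruns the workload between start() and stop(), and returns the result of decode(). A fake run_once is used so the example runs without a GPU.

```python
def decode_with_backoff(run_once, interval, max_attempts=4):
    """Call run_once(interval); on MemoryError, double the interval and retry."""
    for _ in range(max_attempts):
        try:
            return interval, run_once(interval)
        except MemoryError:
            interval *= 2
    raise MemoryError(f"buffer still overflowed after {max_attempts} attempts")

def fake_run(interval):
    """Stand-in that simulates an overflow for short intervals."""
    if interval < 40_000:
        raise MemoryError("simulated hardware buffer overflow")
    return f"decoded at interval {interval}"

used_interval, result = decode_with_backoff(fake_run, 10_000)
print(used_interval, result)
```

Doubling is an arbitrary choice; any growth factor works as long as the attempt count is bounded.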

Iterate over the CounterData object to retrieve each completed Sample. Each sample carries its timestamps and one value per configured metric, in the same order as counter_data.metrics:

print(f"Number of completed samples: {counter_data.num_completed_samples}")
for sample_idx, sample in enumerate(counter_data):
    print(
        f"Sample {sample_idx}: "
        f"start={sample.start_timestamp} end={sample.end_timestamp}"
    )
    for metric_name, value in zip(counter_data.metrics, sample.metric_values):
        print(f"  {metric_name}: {value}")
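
For plotting or further analysis, it is often more convenient to regroup the samples into one time series per metric. The sketch below assumes only the attributes used above (end_timestamp and metric_values); the to_series helper is hypothetical, and SimpleNamespace objects with made-up values stand in for real samples so the example runs without a GPU.

```python
from types import SimpleNamespace

def to_series(metrics, samples):
    """Return {metric_name: [(end_timestamp, value), ...]} in sample order."""
    series = {name: [] for name in metrics}
    for sample in samples:
        for name, value in zip(metrics, sample.metric_values):
            series[name].append((sample.end_timestamp, value))
    return series

metrics = ["gr__cycles_elapsed.max", "sm__cycles_elapsed.sum"]
fake_samples = [
    SimpleNamespace(start_timestamp=0, end_timestamp=10, metric_values=[5.0, 50.0]),
    SimpleNamespace(start_timestamp=10, end_timestamp=20, metric_values=[6.0, 60.0]),
]

series = to_series(metrics, fake_samples)
print(series["gr__cycles_elapsed.max"])
```

With real data, the call would be to_series(counter_data.metrics, counter_data).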

Step 6: Disable the collector

disable() destroys the PM Sampling object for the device and tears down profiler state.

collector.disable()

Complete code

Putting all the steps together with the vector_add workload from the Simple Vector Addition with Numba section:

import numpy as np
from cupti import pm_sampling
from numba import cuda

@cuda.jit
def vector_add(A, B, C):
    idx = cuda.grid(1)
    if idx < A.size:
        C[idx] = A[idx] + B[idx]

device_index = 0
cuda.select_device(device_index)

# Step 1: Create the Collector
collector = pm_sampling.Collector(device_index=device_index)

# Step 2: Enable PM Sampling on the device
collector.enable()

vector_length = 1024 * 1024
A = np.random.rand(vector_length)
B = np.random.rand(vector_length)
C = np.zeros_like(A)

threads_per_block = 128
blocks_per_grid = (vector_length + (threads_per_block - 1)) // threads_per_block

# Step 3: Configure metrics and sampling parameters
collector.configure(
    metrics=[
        "gr__cycles_elapsed.max",
        "sm__cycles_elapsed.sum",
    ],
    hardware_buffer_size=512 * 1024 * 1024,
    sampling_interval=10000,
    trigger_mode=pm_sampling.TriggerMode.GPU_TIME_INTERVAL,
)

# Step 4: Sample around the workload
collector.start()

vector_add[blocks_per_grid, threads_per_block](A, B, C)
cuda.synchronize()

collector.stop()

# Step 5: Decode samples and read metric values
counter_data = collector.decode(max_samples=200)
print(f"Number of completed samples: {counter_data.num_completed_samples}")
for sample_idx, sample in enumerate(counter_data):
    print(
        f"Sample {sample_idx}: "
        f"start={sample.start_timestamp} end={sample.end_timestamp}"
    )
    for metric_name, value in zip(counter_data.metrics, sample.metric_values):
        print(f"  {metric_name}: {value}")

# Step 6: Disable the collector
collector.disable()

Expected output

Running the program produces output similar to the following (exact counts and values vary with GPU and workload):

Number of completed samples: 42
Sample 0: start=1737454707775744135 end=1737454707775763143
  gr__cycles_elapsed.max: 19008.0
  sm__cycles_elapsed.sum: 1824768.0
...

This confirms that PM Sampling captured a time series of hardware counter values while the vector-add kernel was running.
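
As a quick sanity check on output like this, you can compare the values within one sample. The sketch below uses the numbers from Sample 0 above. Reading the ratio as the number of SM instances contributing to the .sum value is an assumption, not something the output itself states.

```python
start_ns = 1737454707775744135
end_ns = 1737454707775763143
gr_cycles_max = 19008.0
sm_cycles_sum = 1824768.0

# Length of the first sampling window, from its timestamps.
window = end_ns - start_ns

# Summed SM cycles relative to the maximum graphics-engine cycle count.
ratio = sm_cycles_sum / gr_cycles_max

print(window, ratio)
```

Your own numbers will differ, but the same two checks apply to any sample.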

Where to go next

CUPTI Python can collect more detailed profiling information than these tutorials show. For further examples, refer to the CUPTI Python samples section.