5. Tutorial#

This section provides detailed instructions on using various CUPTI APIs for tracing and profiling CUDA applications. You will learn how to utilize these APIs to gather performance metrics, analyze execution behavior, and optimize your CUDA applications effectively.

5.1. CUDA kernel tracing using Activity API#

This tutorial provides a guide to profiling a simple CUDA kernel using the CUPTI Activity API. Starting with a basic vector addition kernel, it incrementally introduces CUPTI API calls to collect and display the kernel name and execution duration.

5.1.1. Simple Vector Addition in CUDA C#

First, let’s write a vector addition kernel in CUDA C:

#include <cuda_runtime.h>
#include <stdio.h>

// CUDA kernel for vector addition
__global__ void VectorAdd(const float *A, const float *B, float *C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        C[idx] = A[idx] + B[idx];
}

int main() {
    int vectorLen = 1024 * 1024;
    size_t size = vectorLen * sizeof(float);

    // Host memory allocation
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Initialize vectors
    for (int i = 0; i < vectorLen; ++i) {
        h_A[i] = rand() / (float)RAND_MAX;
        h_B[i] = rand() / (float)RAND_MAX;
    }

    // Device memory allocation
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    int threadsPerBlock = 128;
    int blocksPerGrid = (vectorLen + threadsPerBlock - 1) / threadsPerBlock;

    // Launch the kernel
    VectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, vectorLen);
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);

    return 0;
}

This code runs a vector addition on the GPU. At this point, no profiling information is being collected.

5.1.2. Step 1: Register CUPTI Callbacks#

Next, include the CUPTI header, define callback functions for buffer requests and buffer completion, and register them with CUPTI. This is normally done right after initialization and before launching the kernel.

#include <cupti.h>

// Callback for buffer requests
static void BufferRequested(uint8_t** buffer, size_t* size, size_t* maxNumRecords) {
    *size = 8 * 1024 * 1024; // 8MB buffer
    *maxNumRecords = 0;
    *buffer = (uint8_t*)malloc(*size);
}

// Callback for buffer completed
static void BufferCompleted(CUcontext ctx, uint32_t streamId, uint8_t* buffer, size_t size, size_t validSize) {
    CUpti_Activity *record = NULL;

    if (validSize > 0)
    {
        // Parse CUPTI activity records here, print kernel name and duration
        while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS)
        {
            if (record->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) {
                CUpti_ActivityKernel10 *kernel = (CUpti_ActivityKernel10 *)record;
                printf("kernel name = %s\n", kernel->name);
                printf("kernel duration (ns) = %llu\n", (unsigned long long)(kernel->end - kernel->start));
            }
        }
    }
    free(buffer);
}

cuptiActivityRegisterCallbacks(BufferRequested, BufferCompleted);

5.1.3. Step 2: Enable CUPTI Activity Collection#

Then, enable kernel activity collection. Add the following line after registering the callbacks and before launching the kernel:

cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);

5.1.4. Step 3: Flushing and Disabling CUPTI Activity#

After profiling is complete, flush any remaining activity records and disable CUPTI activity collection. Add these lines after the synchronization call:

cuptiActivityFlushAll(CUPTI_ACTIVITY_FLAG_FLUSH_FORCED);
cuptiActivityDisable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);

Your final code should look like this:

#include <cuda_runtime.h>
#include <stdio.h>
#include <cupti.h>

// Callback for buffer requests
static void BufferRequested(uint8_t** buffer, size_t* size, size_t* maxNumRecords) {
    *size = 8 * 1024 * 1024; // 8MB buffer
    *maxNumRecords = 0;
    *buffer = (uint8_t*)malloc(*size);
}

// Callback for buffer completed
static void BufferCompleted(CUcontext ctx, uint32_t streamId, uint8_t* buffer, size_t size, size_t validSize) {
    CUpti_Activity *record = NULL;

    if (validSize > 0)
    {
        // Parse CUPTI activity records here, print kernel name and duration
        while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS)
        {
            if (record->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) {
                CUpti_ActivityKernel10 *kernel = (CUpti_ActivityKernel10 *)record;
                printf("kernel name = %s\n", kernel->name);
                printf("kernel duration (ns) = %llu\n", (unsigned long long)(kernel->end - kernel->start));
            }
        }
    }
    free(buffer);
}


// CUDA kernel for vector addition
__global__ void VectorAdd(const float *A, const float *B, float *C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        C[idx] = A[idx] + B[idx];
}

int main() {
    int vectorLen = 1024 * 1024;
    size_t size = vectorLen * sizeof(float);

    // Step 1: Register CUPTI callbacks
    cuptiActivityRegisterCallbacks(BufferRequested, BufferCompleted);

    // Step 2: Enable CUPTI Activity Collection
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);

    // Host memory allocation
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Initialize vectors
    for (int i = 0; i < vectorLen; ++i) {
        h_A[i] = rand() / (float)RAND_MAX;
        h_B[i] = rand() / (float)RAND_MAX;
    }

    // Device memory allocation
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    int threadsPerBlock = 128;
    int blocksPerGrid = (vectorLen + threadsPerBlock - 1) / threadsPerBlock;

    // Launch the kernel (profiler will capture this call)
    VectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, vectorLen);
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);

    // Step 3: Flushing and Disabling CUPTI Activity
    cuptiActivityFlushAll(CUPTI_ACTIVITY_FLAG_FLUSH_FORCED);
    cuptiActivityDisable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);

    return 0;
}

5.1.5. Expected Output#

When the above code is run, output similar to the following should be seen:

kernel name = _Z9VectorAddPKfS0_Pfi
kernel duration (ns) = <some number>

This indicates that CUPTI has successfully captured and reported the name and duration of the launched CUDA kernel. Note that the reported name is the mangled C++ symbol for VectorAdd; a tool such as c++filt can demangle it.

5.2. CUDA memcpy API tracing using CUPTI Callback API#

This tutorial demonstrates how to collect information for CUDA memcpy API using the CUPTI Callback API. We start with a basic vector addition example and, step by step, add CUPTI calls to collect and display profiling information such as memcpy size and kind.

5.2.1. Simple Vector Addition in CUDA C#

First, let’s write a vector addition kernel in CUDA C, as demonstrated in the section on simple vector addition in CUDA C.

5.2.2. Step 1: Subscribe to CUPTI callbacks#

Next, include the CUPTI API, define a callback function, and register it with CUPTI. This must be done early, before launching any kernels.

#include <cupti.h>

CUpti_SubscriberHandle subscriber;

// Subscribe to CUPTI callbacks.
cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)CallbackHandler, NULL);

5.2.3. Step 2: Enable callback for specific domains and callback IDs#

Once subscribed, enable callbacks for specific domains and callback IDs. In this example, we trace calls to the CUDA Runtime API function cudaMemcpy.

void CUPTIAPI
CallbackHandler(void *userData, CUpti_CallbackDomain domain, CUpti_CallbackId callbackId, const CUpti_CallbackData *callbackData) {
    switch(domain)
    {
        case CUPTI_CB_DOMAIN_RUNTIME_API:
            if (callbackData->callbackSite == CUPTI_API_ENTER)
            {
                // access parameters passed to cudaMemcpy
                if (callbackId == CUPTI_RUNTIME_TRACE_CBID_cudaMemcpy_v3020)
                {
                    printf("Memcpy size = %zu\n", ((cudaMemcpy_v3020_params *)(callbackData->functionParams))->count);
                    printf("Memcpy kind = %d\n", ((cudaMemcpy_v3020_params *)(callbackData->functionParams))->kind);
                }
            }
            break;
        default:
            break;
    }
}

// Enable callbacks for all CUDA Runtime API functions.
// The callback will be invoked at the entry and exit point of every CUDA Runtime API function.
cuptiEnableDomain(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API);

5.2.4. Step 3: Disable Callbacks and Cleanup#

After profiling, disable the domain and unsubscribe to release resources:

cuptiEnableDomain(0, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API);
cuptiUnsubscribe(subscriber);

Your final code should look like this:

#include <cuda_runtime.h>
#include <stdio.h>
#include <cupti.h>

void CUPTIAPI
CallbackHandler(void *userData, CUpti_CallbackDomain domain, CUpti_CallbackId callbackId, const CUpti_CallbackData *callbackData) {
    switch(domain)
    {
        case CUPTI_CB_DOMAIN_RUNTIME_API:
            if (callbackData->callbackSite == CUPTI_API_ENTER)
            {
                // access parameters passed to cudaMemcpy
                if (callbackId == CUPTI_RUNTIME_TRACE_CBID_cudaMemcpy_v3020)
                {
                    printf("Memcpy size = %zu\n", ((cudaMemcpy_v3020_params *)(callbackData->functionParams))->count);
                    printf("Memcpy kind = %d\n", ((cudaMemcpy_v3020_params *)(callbackData->functionParams))->kind);
                }
            }
            break;
        default:
            break;
    }
}

// CUDA kernel for vector addition
__global__ void VectorAdd(const float *A, const float *B, float *C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        C[idx] = A[idx] + B[idx];
}

int main() {
    int vectorLen = 1024 * 1024;
    size_t size = vectorLen * sizeof(float);

    // Step 1: Subscribe to CUPTI callbacks
    CUpti_SubscriberHandle subscriber;
    cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)CallbackHandler, NULL);

    // Step 2: Enable callback for specific domains and callback IDs
    // Enable callbacks for all CUDA Runtime API functions.
    // The callback will be invoked at the entry and exit point of every CUDA Runtime API function.
    cuptiEnableDomain(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API);

    // Host memory allocation
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Initialize vectors
    for (int i = 0; i < vectorLen; ++i) {
        h_A[i] = rand() / (float)RAND_MAX;
        h_B[i] = rand() / (float)RAND_MAX;
    }

    // Device memory allocation
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    int threadsPerBlock = 128;
    int blocksPerGrid = (vectorLen + threadsPerBlock - 1) / threadsPerBlock;

    // Launch the kernel (profiler will capture this call)
    VectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, vectorLen);
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);

    // Step 3: Disable callback for domains and callback IDs
    cuptiEnableDomain(0, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API);
    cuptiUnsubscribe(subscriber);

    return 0;
}

5.2.5. Expected Output#

When the above code is run, output similar to the following should be seen:

Memcpy size = 4194304
Memcpy kind = 1
Memcpy size = 4194304
Memcpy kind = 1
Memcpy size = 4194304
Memcpy kind = 2

This confirms that CUPTI successfully intercepted and reported the cudaMemcpy calls: the two kind = 1 records are the host-to-device input copies, and the kind = 2 record is the device-to-host result copy.

5.3. Periodic metric sampling using PM Sampling API#

This tutorial provides a guide to periodic collection of performance metrics from a CUDA kernel using the CUPTI PM Sampling API. Starting with a basic vector addition kernel, it incrementally introduces CUPTI PM Sampling API calls to collect hardware performance counters at regular intervals during kernel execution.

5.3.1. Simple Vector Addition in CUDA C#

First, let’s write a vector addition kernel in CUDA C, as demonstrated in the section on simple vector addition in CUDA C.

5.3.2. Step 1: Initialize CUDA and Include Headers#

First, include the CUPTI PM Sampling headers and define global variables:

#include <cupti_target.h>
#include <cupti_pmsampling.h>
#include <cupti_profiler_target.h>
#include <cupti_profiler_host.h>

// Global variables for PM Sampling
CUpti_PmSampling_Object* g_pPmSamplingObject = NULL;

std::string g_chipName;
std::vector<uint8_t> g_configImage;
std::vector<uint8_t> g_counterDataImage;

std::vector<const char*> g_metrics =
{
    "gr__cycles_active.avg",      // GPU Active Cycles
    "gr__cycles_elapsed.max",     // GPU Elapsed Cycles
    "sm__cycles_active.avg"       // SM Active Cycles
};

5.3.3. Step 2: Initialize CUPTI Profiler and Enable PM Sampling#

Initialize the CUPTI profiler, enable PM sampling on the device, and retrieve the chip name required for the configuration image.

// Helper function to initialize PM Sampling
void InitializeAndEnablePmSampling(int deviceIndex)
{
    // Initialize CUPTI Profiler
    CUpti_Profiler_Initialize_Params profilerInitializeParams = { CUpti_Profiler_Initialize_Params_STRUCT_SIZE };
    cuptiProfilerInitialize(&profilerInitializeParams);

    CUpti_Device_GetChipName_Params getChipNameParams = { CUpti_Device_GetChipName_Params_STRUCT_SIZE };
    getChipNameParams.deviceIndex = deviceIndex;
    cuptiDeviceGetChipName(&getChipNameParams);
    g_chipName = getChipNameParams.pChipName;
    printf("Chip Name: %s\n", g_chipName.c_str());

    // Enable PM sampling
    CUpti_PmSampling_Enable_Params enableParams = { CUpti_PmSampling_Enable_Params_STRUCT_SIZE };
    enableParams.deviceIndex = deviceIndex;
    cuptiPmSamplingEnable(&enableParams);
    g_pPmSamplingObject = enableParams.pPmSamplingObject;
}

5.3.4. Step 3: Create Config Image and Configure PM Sampling#

Create the configuration image that encapsulates the metric scheduling, then configure the PM sampling parameters such as the hardware buffer size and the sampling interval.

void ConfigurePmSampling(uint64_t hardwareBufferSize, uint64_t samplingInterval)
{
    // Need to create the config image which will have the scheduling information for the metrics
    CreateConfigImage();

    // Set configuration
    CUpti_PmSampling_SetConfig_Params setConfigParams = { CUpti_PmSampling_SetConfig_Params_STRUCT_SIZE };
    setConfigParams.pPmSamplingObject = g_pPmSamplingObject;
    setConfigParams.configSize = g_configImage.size();
    setConfigParams.pConfig = g_configImage.data();
    setConfigParams.hardwareBufferSize = hardwareBufferSize;
    setConfigParams.samplingInterval = samplingInterval;
    setConfigParams.triggerMode = CUPTI_PM_SAMPLING_TRIGGER_MODE_GPU_SYSCLK_INTERVAL;
    cuptiPmSamplingSetConfig(&setConfigParams);
}

5.3.5. Step 4: Start and Stop PM Sampling before and after Launching Workload#

Start data collection before launching the workload and stop it once the workload has completed (refer to the complete example below for where these helper functions are called):

// Helper function to start PM sampling
void StartPmSampling()
{
    CUpti_PmSampling_Start_Params startParams = { CUpti_PmSampling_Start_Params_STRUCT_SIZE };
    startParams.pPmSamplingObject = g_pPmSamplingObject;
    cuptiPmSamplingStart(&startParams);
}

// Helper function to stop PM sampling
void StopPmSampling()
{
    CUpti_PmSampling_Stop_Params stopParams = { CUpti_PmSampling_Stop_Params_STRUCT_SIZE };
    stopParams.pPmSamplingObject = g_pPmSamplingObject;
    cuptiPmSamplingStop(&stopParams);
}

5.3.6. Step 5: Create Counter Data Image and Decode Sampling Data#

Create the counter data image that will hold the records decoded from the hardware buffer, then decode the sampling data and print the first 10 samples:

// Helper function to create counter data image
void CreateCounterDataImage(uint64_t maxSamplesInCounterDataImage)
{
    CUpti_PmSampling_GetCounterDataSize_Params getCounterDataSizeParams = { CUpti_PmSampling_GetCounterDataSize_Params_STRUCT_SIZE };
    getCounterDataSizeParams.pPmSamplingObject = g_pPmSamplingObject;
    getCounterDataSizeParams.numMetrics = g_metrics.size();
    getCounterDataSizeParams.pMetricNames = g_metrics.data();
    getCounterDataSizeParams.maxSamples = maxSamplesInCounterDataImage;
    cuptiPmSamplingGetCounterDataSize(&getCounterDataSizeParams);

    g_counterDataImage.resize(getCounterDataSizeParams.counterDataSize);
    CUpti_PmSampling_CounterDataImage_Initialize_Params initializeParams = { CUpti_PmSampling_CounterDataImage_Initialize_Params_STRUCT_SIZE };
    initializeParams.pPmSamplingObject = g_pPmSamplingObject;
    initializeParams.counterDataSize = g_counterDataImage.size();
    initializeParams.pCounterData = g_counterDataImage.data();
    cuptiPmSamplingCounterDataImageInitialize(&initializeParams);
}

void DecodeAndPrintSamplingData()
{
    // Create counter data image which will store the decoded data from the hardware buffer
    constexpr uint64_t maxSamplesInCounterDataImage = 10000;
    CreateCounterDataImage(maxSamplesInCounterDataImage);

    // Decode sampling data
    CUpti_PmSampling_DecodeData_Params decodeParams = { CUpti_PmSampling_DecodeData_Params_STRUCT_SIZE };
    decodeParams.pPmSamplingObject = g_pPmSamplingObject;
    decodeParams.pCounterDataImage = g_counterDataImage.data();
    decodeParams.counterDataImageSize = g_counterDataImage.size();
    cuptiPmSamplingDecodeData(&decodeParams);

    // Get information about decoded data
    CUpti_PmSampling_GetCounterDataInfo_Params counterDataInfo = { CUpti_PmSampling_GetCounterDataInfo_Params_STRUCT_SIZE };
    counterDataInfo.pCounterDataImage = g_counterDataImage.data();
    counterDataInfo.counterDataImageSize = g_counterDataImage.size();
    cuptiPmSamplingGetCounterDataInfo(&counterDataInfo);
    printf("Number of completed samples: %zu\n", counterDataInfo.numCompletedSamples);

    // Print sample information (first 10 samples)
    size_t maxSamplesToShow = (counterDataInfo.numCompletedSamples > 10) ? 10 : counterDataInfo.numCompletedSamples;
    EvaluateAndPrintAllSamples(maxSamplesToShow);
}

5.3.7. Step 6: Cleanup PM Sampling#

Disable PM Sampling and release all allocated resources:

void CleanupPmSampling()
{
    // Disable PM sampling
    CUpti_PmSampling_Disable_Params disableParams = { CUpti_PmSampling_Disable_Params_STRUCT_SIZE };
    disableParams.pPmSamplingObject = g_pPmSamplingObject;
    cuptiPmSamplingDisable(&disableParams);

    // Deinitialize profiler
    CUpti_Profiler_DeInitialize_Params profilerDeInitializeParams = { CUpti_Profiler_DeInitialize_Params_STRUCT_SIZE };
    cuptiProfilerDeInitialize(&profilerDeInitializeParams);
}

5.3.8. Complete Example#

Your final code should look like this:

#include <cuda_runtime.h>
#include <cuda.h>
#include <stdio.h>
#include <vector>
#include <cupti_target.h>
#include <cupti_pmsampling.h>
#include <cupti_profiler_target.h>
#include <cupti_profiler_host.h>

// Global variables for PM Sampling
CUpti_PmSampling_Object* g_pPmSamplingObject = NULL;

std::string g_chipName;
std::vector<uint8_t> g_configImage;
std::vector<uint8_t> g_counterDataImage;

std::vector<const char*> g_metrics =
{
    "gr__cycles_active.avg",      // GPU Active Cycles
    "gr__cycles_elapsed.max",     // GPU Elapsed Cycles
    "sm__cycles_active.avg"       // SM Active Cycles
};

// CUDA kernel for vector addition
__global__ void VectorAdd(const int *A, const int *B, int *C, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        C[idx] = A[idx] + B[idx];
}

// Helper function to initialize PM Sampling
void InitializeAndEnablePmSampling(int deviceIndex);
void ConfigurePmSampling(uint64_t hardwareBufferSize, uint64_t samplingInterval);
void StartPmSampling();
void StopPmSampling();
void DecodeAndPrintSamplingData();
void CleanupPmSampling();

int main()
{
    const int vectorLen = 4096 * 4096 * 2;
    size_t size = vectorLen * sizeof(int);

    // Initialize CUDA
    cuInit(0);

    // Setup CUDA workload
    int *h_A = (int*)malloc(size);
    int *h_B = (int*)malloc(size);
    int *h_C = (int*)malloc(size);

    for (int i = 0; i < vectorLen; ++i) {
        h_A[i] = i;
        h_B[i] = i * 2;
    }

    int *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Initialize and Enable PM Sampling
    constexpr int deviceIndex = 0;
    InitializeAndEnablePmSampling(deviceIndex);

    // Configure PM Sampling
    constexpr size_t hardwareBufferSize = 512 * 1024 * 1024; // 512MB buffer
    constexpr uint64_t samplingInterval = 100000; // interval in GPU sysclk cycles (per the trigger mode below)
    ConfigurePmSampling(hardwareBufferSize, samplingInterval);

    // Start PM Sampling
    StartPmSampling();

    // Launch CUDA workload
    int threadsPerBlock = 512;
    int blocksPerGrid = (vectorLen + threadsPerBlock - 1) / threadsPerBlock;
    for (int i = 0; i < 100; ++i) {
        VectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, vectorLen);
    }
    cudaDeviceSynchronize();

    // Stop PM Sampling
    StopPmSampling();

    // Decode and print sampling data
    DecodeAndPrintSamplingData();

    // Cleanup PM Sampling
    CleanupPmSampling();

    // Cleanup
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}

void CreateConfigImage()
{
    CUpti_Profiler_Host_Initialize_Params hostInitializeParams = {CUpti_Profiler_Host_Initialize_Params_STRUCT_SIZE};
    hostInitializeParams.profilerType = CUPTI_PROFILER_TYPE_PM_SAMPLING;
    hostInitializeParams.pChipName = g_chipName.c_str();
    hostInitializeParams.pCounterAvailabilityImage = nullptr;
    cuptiProfilerHostInitialize(&hostInitializeParams);
    CUpti_Profiler_Host_Object* pHostObject = hostInitializeParams.pHostObject;

    CUpti_Profiler_Host_ConfigAddMetrics_Params configAddMetricsParams {CUpti_Profiler_Host_ConfigAddMetrics_Params_STRUCT_SIZE};
    configAddMetricsParams.pHostObject = pHostObject;
    configAddMetricsParams.ppMetricNames = g_metrics.data();
    configAddMetricsParams.numMetrics = g_metrics.size();
    cuptiProfilerHostConfigAddMetrics(&configAddMetricsParams);

    CUpti_Profiler_Host_GetConfigImageSize_Params getConfigImageSizeParams {CUpti_Profiler_Host_GetConfigImageSize_Params_STRUCT_SIZE};
    getConfigImageSizeParams.pHostObject = pHostObject;
    cuptiProfilerHostGetConfigImageSize(&getConfigImageSizeParams);
    g_configImage.resize(getConfigImageSizeParams.configImageSize);

    CUpti_Profiler_Host_GetConfigImage_Params getConfigImageParams = {CUpti_Profiler_Host_GetConfigImage_Params_STRUCT_SIZE};
    getConfigImageParams.pHostObject = pHostObject;
    getConfigImageParams.pConfigImage = g_configImage.data();
    getConfigImageParams.configImageSize = g_configImage.size();
    cuptiProfilerHostGetConfigImage(&getConfigImageParams);

    CUpti_Profiler_Host_GetNumOfPasses_Params getNumOfPassesParam {CUpti_Profiler_Host_GetNumOfPasses_Params_STRUCT_SIZE};
    getNumOfPassesParam.pConfigImage = g_configImage.data();
    getNumOfPassesParam.configImageSize = g_configImage.size();
    cuptiProfilerHostGetNumOfPasses(&getNumOfPassesParam);
    printf("Num of Passes: %zu\n", (size_t)getNumOfPassesParam.numOfPasses);

    CUpti_Profiler_Host_Deinitialize_Params deinitializeParams = {CUpti_Profiler_Host_Deinitialize_Params_STRUCT_SIZE};
    deinitializeParams.pHostObject = pHostObject;
    cuptiProfilerHostDeinitialize(&deinitializeParams);
    pHostObject = nullptr;
}

void EvaluateAndPrintForSample(size_t sampleIndex, CUpti_Profiler_Host_Object* pHostObject, std::vector<uint8_t>& counterDataImage)
{
    CUpti_PmSampling_CounterData_GetSampleInfo_Params getSampleInfoParams = {CUpti_PmSampling_CounterData_GetSampleInfo_Params_STRUCT_SIZE};
    getSampleInfoParams.pPmSamplingObject = g_pPmSamplingObject;
    getSampleInfoParams.pCounterDataImage = counterDataImage.data();
    getSampleInfoParams.counterDataImageSize = counterDataImage.size();
    getSampleInfoParams.sampleIndex = sampleIndex;
    cuptiPmSamplingCounterDataGetSampleInfo(&getSampleInfoParams);
    printf("Sample Index: %zu, Start Timestamp: %llu, End Timestamp: %llu\n", sampleIndex, (unsigned long long)getSampleInfoParams.startTimestamp, (unsigned long long)getSampleInfoParams.endTimestamp);

    std::vector<double> metricValues(g_metrics.size());
    CUpti_Profiler_Host_EvaluateToGpuValues_Params evaluateToGpuValuesParams {CUpti_Profiler_Host_EvaluateToGpuValues_Params_STRUCT_SIZE};
    evaluateToGpuValuesParams.pHostObject = pHostObject;
    evaluateToGpuValuesParams.pCounterDataImage = counterDataImage.data();
    evaluateToGpuValuesParams.counterDataImageSize = counterDataImage.size();
    evaluateToGpuValuesParams.ppMetricNames = g_metrics.data();
    evaluateToGpuValuesParams.numMetrics = g_metrics.size();
    evaluateToGpuValuesParams.rangeIndex = sampleIndex;
    evaluateToGpuValuesParams.pMetricValues = metricValues.data();
    cuptiProfilerHostEvaluateToGpuValues(&evaluateToGpuValuesParams);

    for (size_t i = 0; i < g_metrics.size(); ++i) {
        printf("\t%s: %f\n", g_metrics[i], metricValues[i]);
    }
    printf("\n");
}

void EvaluateAndPrintAllSamples(size_t numOfSamples)
{
    CUpti_Profiler_Host_Initialize_Params hostInitializeParams = {CUpti_Profiler_Host_Initialize_Params_STRUCT_SIZE};
    hostInitializeParams.profilerType = CUPTI_PROFILER_TYPE_PM_SAMPLING;
    hostInitializeParams.pChipName = g_chipName.c_str();
    hostInitializeParams.pCounterAvailabilityImage = nullptr;
    cuptiProfilerHostInitialize(&hostInitializeParams);
    CUpti_Profiler_Host_Object* pHostObject = hostInitializeParams.pHostObject;

    for (size_t sampleIndex = 0; sampleIndex < numOfSamples; ++sampleIndex) {
        EvaluateAndPrintForSample(sampleIndex, pHostObject, g_counterDataImage);
    }

    CUpti_Profiler_Host_Deinitialize_Params deinitializeParams = {CUpti_Profiler_Host_Deinitialize_Params_STRUCT_SIZE};
    deinitializeParams.pHostObject = pHostObject;
    cuptiProfilerHostDeinitialize(&deinitializeParams);
    pHostObject = nullptr;
}

// Helper function to initialize PM Sampling
void InitializeAndEnablePmSampling(int deviceIndex)
{
    // Initialize CUPTI Profiler
    CUpti_Profiler_Initialize_Params profilerInitializeParams = { CUpti_Profiler_Initialize_Params_STRUCT_SIZE };
    cuptiProfilerInitialize(&profilerInitializeParams);

    CUpti_Device_GetChipName_Params getChipNameParams = { CUpti_Device_GetChipName_Params_STRUCT_SIZE };
    getChipNameParams.deviceIndex = deviceIndex;
    cuptiDeviceGetChipName(&getChipNameParams);
    g_chipName = getChipNameParams.pChipName;
    printf("Chip Name: %s\n", g_chipName.c_str());

    // Enable PM sampling
    CUpti_PmSampling_Enable_Params enableParams = { CUpti_PmSampling_Enable_Params_STRUCT_SIZE };
    enableParams.deviceIndex = deviceIndex;
    cuptiPmSamplingEnable(&enableParams);
    g_pPmSamplingObject = enableParams.pPmSamplingObject;
}

void ConfigurePmSampling(uint64_t hardwareBufferSize, uint64_t samplingInterval)
{
    // Need to create the config image which will have the scheduling information for the metrics
    CreateConfigImage();

    // Set configuration
    CUpti_PmSampling_SetConfig_Params setConfigParams = { CUpti_PmSampling_SetConfig_Params_STRUCT_SIZE };
    setConfigParams.pPmSamplingObject = g_pPmSamplingObject;
    setConfigParams.configSize = g_configImage.size();
    setConfigParams.pConfig = g_configImage.data();
    setConfigParams.hardwareBufferSize = hardwareBufferSize;
    setConfigParams.samplingInterval = samplingInterval;
    setConfigParams.triggerMode = CUPTI_PM_SAMPLING_TRIGGER_MODE_GPU_SYSCLK_INTERVAL;
    cuptiPmSamplingSetConfig(&setConfigParams);
}

// Helper function to start PM sampling
void StartPmSampling()
{
    CUpti_PmSampling_Start_Params startParams = { CUpti_PmSampling_Start_Params_STRUCT_SIZE };
    startParams.pPmSamplingObject = g_pPmSamplingObject;
    cuptiPmSamplingStart(&startParams);
}

// Helper function to stop PM sampling
void StopPmSampling()
{
    CUpti_PmSampling_Stop_Params stopParams = { CUpti_PmSampling_Stop_Params_STRUCT_SIZE };
    stopParams.pPmSamplingObject = g_pPmSamplingObject;
    cuptiPmSamplingStop(&stopParams);
}

// Helper function to create counter data image
void CreateCounterDataImage(uint64_t maxSamplesInCounterDataImage)
{
    CUpti_PmSampling_GetCounterDataSize_Params getCounterDataSizeParams = { CUpti_PmSampling_GetCounterDataSize_Params_STRUCT_SIZE };
    getCounterDataSizeParams.pPmSamplingObject = g_pPmSamplingObject;
    getCounterDataSizeParams.numMetrics = g_metrics.size();
    getCounterDataSizeParams.pMetricNames = g_metrics.data();
    getCounterDataSizeParams.maxSamples = maxSamplesInCounterDataImage;
    cuptiPmSamplingGetCounterDataSize(&getCounterDataSizeParams);

    g_counterDataImage.resize(getCounterDataSizeParams.counterDataSize);
    CUpti_PmSampling_CounterDataImage_Initialize_Params initializeParams = { CUpti_PmSampling_CounterDataImage_Initialize_Params_STRUCT_SIZE };
    initializeParams.pPmSamplingObject = g_pPmSamplingObject;
    initializeParams.counterDataSize = g_counterDataImage.size();
    initializeParams.pCounterData = g_counterDataImage.data();
    cuptiPmSamplingCounterDataImageInitialize(&initializeParams);
}

// Helper function to decode and print sampling data
void DecodeAndPrintSamplingData()
{
    // Create counter data image which will store the decoded data from the hardware buffer
    constexpr uint64_t maxSamplesInCounterDataImage = 10000;
    CreateCounterDataImage(maxSamplesInCounterDataImage);

    // Decode sampling data
    CUpti_PmSampling_DecodeData_Params decodeParams = { CUpti_PmSampling_DecodeData_Params_STRUCT_SIZE };
    decodeParams.pPmSamplingObject = g_pPmSamplingObject;
    decodeParams.pCounterDataImage = g_counterDataImage.data();
    decodeParams.counterDataImageSize = g_counterDataImage.size();
    cuptiPmSamplingDecodeData(&decodeParams);

    // Get information about decoded data
    CUpti_PmSampling_GetCounterDataInfo_Params counterDataInfo = { CUpti_PmSampling_GetCounterDataInfo_Params_STRUCT_SIZE };
    counterDataInfo.pCounterDataImage = g_counterDataImage.data();
    counterDataInfo.counterDataImageSize = g_counterDataImage.size();
    cuptiPmSamplingGetCounterDataInfo(&counterDataInfo);
    printf("Number of completed samples: %zu\n", counterDataInfo.numCompletedSamples);

    // Print sample information (first 10 samples)
    size_t maxSamplesToShow = (counterDataInfo.numCompletedSamples > 10) ? 10 : counterDataInfo.numCompletedSamples;
    EvaluateAndPrintAllSamples(maxSamplesToShow);
}

void CleanupPmSampling()
{
    // Disable PM sampling
    CUpti_PmSampling_Disable_Params disableParams = { CUpti_PmSampling_Disable_Params_STRUCT_SIZE };
    disableParams.pPmSamplingObject = g_pPmSamplingObject;
    cuptiPmSamplingDisable(&disableParams);

    // Deinitialize profiler
    CUpti_Profiler_DeInitialize_Params profilerDeInitializeParams = { CUpti_Profiler_DeInitialize_Params_STRUCT_SIZE };
    cuptiProfilerDeInitialize(&profilerDeInitializeParams);
}

5.3.9. Expected Output#

When the above code is run, output similar to the following should be seen:

Chip Name: AD104
Num of Passes: 1
Number of completed samples: 1770
Sample Index: 0, Start Timestamp: 1756793790519908305, End Timestamp: 1756793790687993793
        gr__cycles_active.avg: 0.000000
        gr__cycles_elapsed.max: 1160086.000000
        sm__cycles_active.avg: 0.000000

Sample Index: 1, Start Timestamp: 1756793790687993793, End Timestamp: 1756793790688045473
        gr__cycles_active.avg: 0.000000
        gr__cycles_elapsed.max: 100001.000000
        sm__cycles_active.avg: 0.000000

Sample Index: 2, Start Timestamp: 1756793790688045473, End Timestamp: 1756793790688097153
        gr__cycles_active.avg: 49047.000000
        gr__cycles_elapsed.max: 100001.000000
        sm__cycles_active.avg: 0.000000

Sample Index: 3, Start Timestamp: 1756793790688097153, End Timestamp: 1756793790688149058
        gr__cycles_active.avg: 79180.000000
        gr__cycles_elapsed.max: 100001.000000
        sm__cycles_active.avg: 73766.214286
...

This indicates that CUPTI PM Sampling has successfully collected performance metrics during the execution of your CUDA kernels.

5.3.10. Advanced Usage: Continuous Decode Thread#

For long-running workloads, use a dedicated thread for continuous decoding to avoid hardware buffer overflow:

#include <pthread.h>
#include <unistd.h>   // for usleep
#include <atomic>

struct DecodeThreadArgs
{
    CUpti_PmSampling_Object* pPmSamplingObject;
    uint8_t* counterDataImage;
    size_t counterDataImageSize;
    std::atomic<bool>* stopFlag;
};

void* DecodeThread(void* args)
{
    DecodeThreadArgs* threadArgs = (DecodeThreadArgs*)args;
    while (!threadArgs->stopFlag->load())
    {
        CUpti_PmSampling_DecodeData_Params decodeParams = { CUpti_PmSampling_DecodeData_Params_STRUCT_SIZE };
        decodeParams.pPmSamplingObject = threadArgs->pPmSamplingObject;
        decodeParams.pCounterDataImage = threadArgs->counterDataImage;
        decodeParams.counterDataImageSize = threadArgs->counterDataImageSize;
        cuptiPmSamplingDecodeData(&decodeParams);

        // Process decoded data here

        // Reset counter data image for next batch
        CUpti_PmSampling_CounterDataImage_Initialize_Params resetParams = { CUpti_PmSampling_CounterDataImage_Initialize_Params_STRUCT_SIZE };
        resetParams.pPmSamplingObject = threadArgs->pPmSamplingObject;
        resetParams.counterDataSize = threadArgs->counterDataImageSize;
        resetParams.pCounterData = threadArgs->counterDataImage;
        cuptiPmSamplingCounterDataImageInitialize(&resetParams);

        usleep(10000); // 10ms sleep
    }
    return NULL;
}

Note: For detailed information on creating configuration images with specific metrics and evaluating counter data to obtain metric values, refer to the Host API Tutorial section.

5.4. GPU Performance Profiling using Range Profiler API#

This tutorial provides a guide to collecting performance metrics from a CUDA kernel using the CUPTI Range Profiler API. Starting with a basic vector addition kernel, it incrementally introduces CUPTI Range Profiler API calls to collect hardware performance counters for specific ranges of CUDA kernel execution.

5.4.1. Simple Vector Addition in CUDA C#

Reuse the vector addition kernel shown earlier in the section on simple vector addition in CUDA C.

5.4.2. Step 1: Include Headers and Define Global Variables#

First, include the CUPTI Range Profiler headers and declare these globals:

  • Range profiler object: holds per-context range profiling state.

  • Counter data image: stores decoded hardware profiling results.

  • Config image: built via host APIs; contains metric scheduling for collection.

  • CUDA context: the context where range profiling is enabled and data is collected.

  • Metric list: the set of metrics to capture during range profiling.

#include <cupti_profiler_host.h>
#include <cupti_range_profiler.h>
#include <cupti_target.h>

// Global variables for Range Profiler
CUpti_RangeProfiler_Object* g_pRangeProfilerObject = NULL;

std::vector<uint8_t> g_counterDataImage;
std::vector<uint8_t> g_configImage;

CUcontext g_cuContext;
std::string g_chipName;

std::vector<const char*> g_metrics =
{
    "sm__warps_launched.sum",     // Number of warps launched
    "sm__ctas_launched.sum"       // Number of CTAs launched
};

5.4.3. Step 2: Initialize CUPTI Profiler and Enable Range Profiler#

Initialize the CUPTI profiler, enable Range Profiler on the device, and retrieve the chip name required for the configuration image:

// Helper function to initialize CUDA context and Range Profiler
void InitializeAndEnableRangeProfiler(CUcontext cuContext)
{
    // Initialize CUPTI Profiler
    CUpti_Profiler_Initialize_Params profilerInitializeParams = { CUpti_Profiler_Initialize_Params_STRUCT_SIZE };
    cuptiProfilerInitialize(&profilerInitializeParams);

    CUdevice device;
    cuCtxGetDevice(&device);
    CUpti_Device_GetChipName_Params getChipNameParams = { CUpti_Device_GetChipName_Params_STRUCT_SIZE };
    getChipNameParams.deviceIndex = (size_t)device;
    cuptiDeviceGetChipName(&getChipNameParams);
    g_chipName = std::string(getChipNameParams.pChipName);
    printf("Chip Name: %s\n", g_chipName.c_str());

    // Enable Range profiler
    CUpti_RangeProfiler_Enable_Params enableRange = { CUpti_RangeProfiler_Enable_Params_STRUCT_SIZE };
    enableRange.ctx = cuContext;
    cuptiRangeProfilerEnable(&enableRange);
    g_pRangeProfilerObject = enableRange.pRangeProfilerObject;
}

5.4.4. Step 3: Create Counter Data Image and Configure Range Profiler#

Create the configuration image with metric scheduling, then create the counter data image to hold decoded hardware results, and configure the profiler:

// Helper function to create counter data image
void CreateCounterDataImage(size_t maxNumOfRangesInCounterDataImage)
{
    // Get counter data size
    CUpti_RangeProfiler_GetCounterDataSize_Params ctDataSize = { CUpti_RangeProfiler_GetCounterDataSize_Params_STRUCT_SIZE };
    ctDataSize.pRangeProfilerObject = g_pRangeProfilerObject;
    ctDataSize.pMetricNames = g_metrics.data();
    ctDataSize.numMetrics = g_metrics.size();
    ctDataSize.maxNumOfRanges = maxNumOfRangesInCounterDataImage;
    ctDataSize.maxNumRangeTreeNodes = maxNumOfRangesInCounterDataImage;
    cuptiRangeProfilerGetCounterDataSize(&ctDataSize);

    // Initialize counter data image
    g_counterDataImage.resize(ctDataSize.counterDataSize);
    CUpti_RangeProfiler_CounterDataImage_Initialize_Params initCtImg = { CUpti_RangeProfiler_CounterDataImage_Initialize_Params_STRUCT_SIZE };
    initCtImg.pRangeProfilerObject = g_pRangeProfilerObject;
    initCtImg.pCounterData = g_counterDataImage.data();
    initCtImg.counterDataSize = g_counterDataImage.size();
    cuptiRangeProfilerCounterDataImageInitialize(&initCtImg);
}

void ConfigureRangeProfiler(CUpti_ProfilerRange range, CUpti_ProfilerReplayMode replayMode, size_t numOfRanges)
{
    // Create config image
    CreateConfigImage();

    // Create counter data image
    CreateCounterDataImage(numOfRanges);

    CUpti_RangeProfiler_SetConfig_Params setConfig = { CUpti_RangeProfiler_SetConfig_Params_STRUCT_SIZE };
    setConfig.pRangeProfilerObject = g_pRangeProfilerObject;
    setConfig.configSize = g_configImage.size();
    setConfig.pConfig = g_configImage.data();
    setConfig.counterDataImageSize = g_counterDataImage.size();
    setConfig.pCounterDataImage = g_counterDataImage.data();
    setConfig.range = range;
    setConfig.replayMode = replayMode;
    setConfig.maxRangesPerPass = numOfRanges;
    setConfig.numNestingLevels = 1;
    setConfig.minNestingLevel = 1;
    setConfig.passIndex = 0;
    setConfig.targetNestingLevel = 0;
    cuptiRangeProfilerSetConfig(&setConfig);
}

5.4.5. Step 4: Start and Stop Range Profiling Around Workload#

Begin data collection and stop it after the workload is launched:

// Helper function to start range profiling
void StartRangeProfiler()
{
    CUpti_RangeProfiler_Start_Params startRangeProfiler = { CUpti_RangeProfiler_Start_Params_STRUCT_SIZE };
    startRangeProfiler.pRangeProfilerObject = g_pRangeProfilerObject;
    cuptiRangeProfilerStart(&startRangeProfiler);
}

// Helper function to stop range profiling
void StopRangeProfiler()
{
    CUpti_RangeProfiler_Stop_Params stopRangeProfiler = { CUpti_RangeProfiler_Stop_Params_STRUCT_SIZE };
    stopRangeProfiler.pRangeProfilerObject = g_pRangeProfilerObject;
    cuptiRangeProfilerStop(&stopRangeProfiler);
}

5.4.6. Step 5: Decode and Evaluate Profiling Data#

Decode the collected profiling data and evaluate metrics:

// Helper function to decode and print profiling data
void DecodeAndPrintProfilingData()
{
    // Decode profiling data
    CUpti_RangeProfiler_DecodeData_Params decodeData = { CUpti_RangeProfiler_DecodeData_Params_STRUCT_SIZE };
    decodeData.pRangeProfilerObject = g_pRangeProfilerObject;
    cuptiRangeProfilerDecodeData(&decodeData);

    // Get information about profiled ranges
    CUpti_RangeProfiler_GetCounterDataInfo_Params cdiParams = { CUpti_RangeProfiler_GetCounterDataInfo_Params_STRUCT_SIZE };
    cdiParams.pCounterDataImage = g_counterDataImage.data();
    cdiParams.counterDataImageSize = g_counterDataImage.size();
    cuptiRangeProfilerGetCounterDataInfo(&cdiParams);
    printf("Number of profiled ranges: %zu\n", cdiParams.numTotalRanges);

    // Evaluate and print profiling data
    const size_t numRangesToPrint = cdiParams.numTotalRanges > 10 ? 10 : cdiParams.numTotalRanges;
    EvaluateAndPrintAllRanges(numRangesToPrint);
}

5.4.7. Step 6: Cleanup Range Profiler#

Disable Range Profiler and release all allocated resources:

void CleanupRangeProfiler()
{
    // Disable Range profiler
    CUpti_RangeProfiler_Disable_Params disableRangeProfiler = { CUpti_RangeProfiler_Disable_Params_STRUCT_SIZE };
    disableRangeProfiler.pRangeProfilerObject = g_pRangeProfilerObject;
    cuptiRangeProfilerDisable(&disableRangeProfiler);

    // Deinitialize profiler
    CUpti_Profiler_DeInitialize_Params profilerDeInitializeParams = { CUpti_Profiler_DeInitialize_Params_STRUCT_SIZE };
    cuptiProfilerDeInitialize(&profilerDeInitializeParams);
}

5.4.8. Complete Example#

Your final code should look like this:

#include <cuda_runtime.h>
#include <cuda.h>
#include <stdio.h>
#include <stdint.h>
#include <vector>
#include <cupti_profiler_host.h>
#include <cupti_range_profiler.h>
#include <cupti_target.h>

// Global variables for Range Profiler
CUpti_RangeProfiler_Object* g_pRangeProfilerObject = NULL;

std::vector<uint8_t> g_counterDataImage;
std::vector<uint8_t> g_configImage;

CUcontext g_cuContext;
std::string g_chipName;

std::vector<const char*> g_metrics =
{
    "sm__warps_launched.sum",     // Number of warps launched
    "sm__ctas_launched.sum"
};

// CUDA kernel for vector addition
__global__ void VectorAdd(const float *A, const float *B, float *C, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        C[idx] = A[idx] + B[idx];
}

// Helper function declarations
void InitializeAndEnableRangeProfiler(CUcontext cuContext);
void ConfigureRangeProfiler(CUpti_ProfilerRange range, CUpti_ProfilerReplayMode replayMode, size_t numOfRanges);
void StartRangeProfiler();
void StopRangeProfiler();
void DecodeAndPrintProfilingData();
void CleanupRangeProfiler();

int main()
{
    const int vectorLen = 1024 * 1024;
    const size_t size = vectorLen * sizeof(float);

    // Initialize CUDA and create context
    cuInit(0);
    cuCtxCreate(&g_cuContext, (CUctxCreateParams*)0, 0, 0);

    // Initialize and Enable Range Profiler
    InitializeAndEnableRangeProfiler(g_cuContext);

    // Configure Range Profiler
    constexpr size_t numOfRanges = 10;
    ConfigureRangeProfiler(CUPTI_AutoRange, CUPTI_KernelReplay, numOfRanges);

    // Setup CUDA workload
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    for (int i = 0; i < vectorLen; ++i) {
        h_A[i] = rand() / (float)RAND_MAX;
        h_B[i] = rand() / (float)RAND_MAX;
    }

    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    int threadsPerBlock = 128;
    int blocksPerGrid = (vectorLen + threadsPerBlock - 1) / threadsPerBlock;

    // Start Range Profiling
    StartRangeProfiler();

    // Launch CUDA workload
    VectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, vectorLen);

    // Stop Range Profiling
    StopRangeProfiler();

    // Decode and evaluate profiling data
    DecodeAndPrintProfilingData();

    // Cleanup Range Profiler
    CleanupRangeProfiler();

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();

    // Cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}

void CreateConfigImage()
{
    CUpti_Profiler_Host_Initialize_Params hostInitializeParams = {CUpti_Profiler_Host_Initialize_Params_STRUCT_SIZE};
    hostInitializeParams.profilerType = CUPTI_PROFILER_TYPE_RANGE_PROFILER;
    hostInitializeParams.pChipName = g_chipName.c_str();
    hostInitializeParams.pCounterAvailabilityImage = nullptr;
    cuptiProfilerHostInitialize(&hostInitializeParams);
    CUpti_Profiler_Host_Object* pHostObject = hostInitializeParams.pHostObject;

    CUpti_Profiler_Host_ConfigAddMetrics_Params configAddMetricsParams {CUpti_Profiler_Host_ConfigAddMetrics_Params_STRUCT_SIZE};
    configAddMetricsParams.pHostObject = pHostObject;
    configAddMetricsParams.ppMetricNames = g_metrics.data();
    configAddMetricsParams.numMetrics = g_metrics.size();
    cuptiProfilerHostConfigAddMetrics(&configAddMetricsParams);

    CUpti_Profiler_Host_GetConfigImageSize_Params getConfigImageSizeParams {CUpti_Profiler_Host_GetConfigImageSize_Params_STRUCT_SIZE};
    getConfigImageSizeParams.pHostObject = pHostObject;
    cuptiProfilerHostGetConfigImageSize(&getConfigImageSizeParams);
    g_configImage.resize(getConfigImageSizeParams.configImageSize);

    CUpti_Profiler_Host_GetConfigImage_Params getConfigImageParams = {CUpti_Profiler_Host_GetConfigImage_Params_STRUCT_SIZE};
    getConfigImageParams.pHostObject = pHostObject;
    getConfigImageParams.pConfigImage = g_configImage.data();
    getConfigImageParams.configImageSize = g_configImage.size();
    cuptiProfilerHostGetConfigImage(&getConfigImageParams);

    CUpti_Profiler_Host_GetNumOfPasses_Params getNumOfPassesParam {CUpti_Profiler_Host_GetNumOfPasses_Params_STRUCT_SIZE};
    getNumOfPassesParam.pConfigImage = g_configImage.data();
    getNumOfPassesParam.configImageSize = g_configImage.size();
    cuptiProfilerHostGetNumOfPasses(&getNumOfPassesParam);
    printf("Num of Passes: %d\n", getNumOfPassesParam.numOfPasses);

    CUpti_Profiler_Host_Deinitialize_Params deinitializeParams = {CUpti_Profiler_Host_Deinitialize_Params_STRUCT_SIZE};
    deinitializeParams.pHostObject = pHostObject;
    cuptiProfilerHostDeinitialize(&deinitializeParams);
    pHostObject = nullptr;
}

void EvaluateAndPrintForRange(size_t rangeIndex, CUpti_Profiler_Host_Object* pHostObject)
{
    std::vector<double> metricValues(g_metrics.size());
    CUpti_Profiler_Host_EvaluateToGpuValues_Params evaluateToGpuValuesParams {CUpti_Profiler_Host_EvaluateToGpuValues_Params_STRUCT_SIZE};
    evaluateToGpuValuesParams.pHostObject = pHostObject;
    evaluateToGpuValuesParams.pCounterDataImage = g_counterDataImage.data();
    evaluateToGpuValuesParams.counterDataImageSize = g_counterDataImage.size();
    evaluateToGpuValuesParams.ppMetricNames = g_metrics.data();
    evaluateToGpuValuesParams.numMetrics = g_metrics.size();
    evaluateToGpuValuesParams.rangeIndex = rangeIndex;
    evaluateToGpuValuesParams.pMetricValues = metricValues.data();
    cuptiProfilerHostEvaluateToGpuValues(&evaluateToGpuValuesParams);

    for (size_t i = 0; i < g_metrics.size(); ++i) {
        printf("\t%s: %f\n", g_metrics[i], metricValues[i]);
    }
    printf("\n");
}

void EvaluateAndPrintAllRanges(size_t numOfRanges)
{
    CUpti_Profiler_Host_Initialize_Params hostInitializeParams = {CUpti_Profiler_Host_Initialize_Params_STRUCT_SIZE};
    hostInitializeParams.profilerType = CUPTI_PROFILER_TYPE_RANGE_PROFILER;
    hostInitializeParams.pChipName = g_chipName.c_str();
    hostInitializeParams.pCounterAvailabilityImage = nullptr;
    cuptiProfilerHostInitialize(&hostInitializeParams);
    CUpti_Profiler_Host_Object* pHostObject = hostInitializeParams.pHostObject;

    for (size_t i = 0; i < numOfRanges; ++i)
    {
        CUpti_RangeProfiler_CounterData_GetRangeInfo_Params getRangeInfoParams = {CUpti_RangeProfiler_CounterData_GetRangeInfo_Params_STRUCT_SIZE};
        getRangeInfoParams.counterDataImageSize = g_counterDataImage.size();
        getRangeInfoParams.pCounterDataImage = g_counterDataImage.data();
        getRangeInfoParams.rangeIndex = i;
        getRangeInfoParams.rangeDelimiter = "/";
        cuptiRangeProfilerCounterDataGetRangeInfo(&getRangeInfoParams);

        printf("Range: %s\n", getRangeInfoParams.rangeName);
        printf("Metric Values:\n");
        EvaluateAndPrintForRange(i, pHostObject);
    }

    CUpti_Profiler_Host_Deinitialize_Params deinitializeParams = {CUpti_Profiler_Host_Deinitialize_Params_STRUCT_SIZE};
    deinitializeParams.pHostObject = pHostObject;
    cuptiProfilerHostDeinitialize(&deinitializeParams);
    pHostObject = nullptr;
}

void InitializeAndEnableRangeProfiler(CUcontext cuContext)
{
    // Initialize CUPTI Profiler
    CUpti_Profiler_Initialize_Params profilerInitializeParams = { CUpti_Profiler_Initialize_Params_STRUCT_SIZE };
    cuptiProfilerInitialize(&profilerInitializeParams);

    CUdevice device;
    cuCtxGetDevice(&device);
    CUpti_Device_GetChipName_Params getChipNameParams = { CUpti_Device_GetChipName_Params_STRUCT_SIZE };
    getChipNameParams.deviceIndex = (size_t)device;
    cuptiDeviceGetChipName(&getChipNameParams);
    g_chipName = std::string(getChipNameParams.pChipName);
    printf("Chip Name: %s\n", g_chipName.c_str());

    // Enable Range profiler
    CUpti_RangeProfiler_Enable_Params enableRange = { CUpti_RangeProfiler_Enable_Params_STRUCT_SIZE };
    enableRange.ctx = cuContext;
    cuptiRangeProfilerEnable(&enableRange);
    g_pRangeProfilerObject = enableRange.pRangeProfilerObject;
}

void CreateCounterDataImage(size_t maxNumOfRangesInCounterDataImage)
{
    // Get counter data size
    CUpti_RangeProfiler_GetCounterDataSize_Params ctDataSize = { CUpti_RangeProfiler_GetCounterDataSize_Params_STRUCT_SIZE };
    ctDataSize.pRangeProfilerObject = g_pRangeProfilerObject;
    ctDataSize.pMetricNames = g_metrics.data();
    ctDataSize.numMetrics = g_metrics.size();
    ctDataSize.maxNumOfRanges = maxNumOfRangesInCounterDataImage;
    ctDataSize.maxNumRangeTreeNodes = maxNumOfRangesInCounterDataImage;
    cuptiRangeProfilerGetCounterDataSize(&ctDataSize);

    // Initialize counter data image
    g_counterDataImage.resize(ctDataSize.counterDataSize);
    CUpti_RangeProfiler_CounterDataImage_Initialize_Params initCtImg = { CUpti_RangeProfiler_CounterDataImage_Initialize_Params_STRUCT_SIZE };
    initCtImg.pRangeProfilerObject = g_pRangeProfilerObject;
    initCtImg.pCounterData = g_counterDataImage.data();
    initCtImg.counterDataSize = g_counterDataImage.size();
    cuptiRangeProfilerCounterDataImageInitialize(&initCtImg);
}

void ConfigureRangeProfiler(CUpti_ProfilerRange range, CUpti_ProfilerReplayMode replayMode, size_t numOfRanges)
{
    // Create config image
    CreateConfigImage();

    // Create counter data image
    CreateCounterDataImage(numOfRanges);

    CUpti_RangeProfiler_SetConfig_Params setConfig = { CUpti_RangeProfiler_SetConfig_Params_STRUCT_SIZE };
    setConfig.pRangeProfilerObject = g_pRangeProfilerObject;
    setConfig.configSize = g_configImage.size();
    setConfig.pConfig = g_configImage.data();
    setConfig.counterDataImageSize = g_counterDataImage.size();
    setConfig.pCounterDataImage = g_counterDataImage.data();
    setConfig.range = range;
    setConfig.replayMode = replayMode;
    setConfig.maxRangesPerPass = numOfRanges;
    setConfig.numNestingLevels = 1;
    setConfig.minNestingLevel = 1;
    setConfig.passIndex = 0;
    setConfig.targetNestingLevel = 0;
    cuptiRangeProfilerSetConfig(&setConfig);
}

void StartRangeProfiler()
{
    CUpti_RangeProfiler_Start_Params startRangeProfiler = { CUpti_RangeProfiler_Start_Params_STRUCT_SIZE };
    startRangeProfiler.pRangeProfilerObject = g_pRangeProfilerObject;
    cuptiRangeProfilerStart(&startRangeProfiler);
}

void StopRangeProfiler()
{
    CUpti_RangeProfiler_Stop_Params stopRangeProfiler = { CUpti_RangeProfiler_Stop_Params_STRUCT_SIZE };
    stopRangeProfiler.pRangeProfilerObject = g_pRangeProfilerObject;
    cuptiRangeProfilerStop(&stopRangeProfiler);
}

void DecodeAndPrintProfilingData()
{
    // Decode profiling data
    CUpti_RangeProfiler_DecodeData_Params decodeData = { CUpti_RangeProfiler_DecodeData_Params_STRUCT_SIZE };
    decodeData.pRangeProfilerObject = g_pRangeProfilerObject;
    cuptiRangeProfilerDecodeData(&decodeData);

    // Get information about profiled ranges
    CUpti_RangeProfiler_GetCounterDataInfo_Params cdiParams = { CUpti_RangeProfiler_GetCounterDataInfo_Params_STRUCT_SIZE };
    cdiParams.pCounterDataImage = g_counterDataImage.data();
    cdiParams.counterDataImageSize = g_counterDataImage.size();
    cuptiRangeProfilerGetCounterDataInfo(&cdiParams);
    printf("Number of profiled ranges: %zu\n", cdiParams.numTotalRanges);

    // Evaluate and print profiling data
    const size_t numRangesToPrint = cdiParams.numTotalRanges > 10 ? 10 : cdiParams.numTotalRanges;
    EvaluateAndPrintAllRanges(numRangesToPrint);
}

void CleanupRangeProfiler()
{
    // Disable Range profiler
    CUpti_RangeProfiler_Disable_Params disableRangeProfiler = { CUpti_RangeProfiler_Disable_Params_STRUCT_SIZE };
    disableRangeProfiler.pRangeProfilerObject = g_pRangeProfilerObject;
    cuptiRangeProfilerDisable(&disableRangeProfiler);

    // Deinitialize profiler
    CUpti_Profiler_DeInitialize_Params profilerDeInitializeParams = { CUpti_Profiler_DeInitialize_Params_STRUCT_SIZE };
    cuptiProfilerDeInitialize(&profilerDeInitializeParams);
}

5.4.9. Expected Output#

When the above code is run, output similar to the following should be seen:

Number of profiled ranges: 1
Range: 0
Metric Values:
    sm__warps_launched.sum: 32768.000000
    sm__ctas_launched.sum: 8192.000000

This indicates that CUPTI Range Profiler has successfully collected performance metrics for the vector addition kernel execution.

Note: For detailed information on available metrics and their meanings, refer to the CUPTI documentation and use the Host API to query available metrics for your GPU architecture.

5.5. CUDA tracing with User-Defined Activity Records#

Note

Beta Release: The CUPTI user-defined activity records feature is currently in beta. APIs and behavior may change in future releases.

This tutorial demonstrates how to use CUPTI’s user-defined activity records feature to collect only the fields you need. Starting with the basic Activity API setup, it incrementally shows how to enable user-defined records and parse custom records.

Minimum CUPTI Version: CUPTI_API_VERSION 130200 (CUDA 13.2). The CUPTI version can be queried using the cuptiGetVersion() API.

5.5.1. Why Use User-defined Records?#

Standard CUPTI activity records include all predefined fields, which may waste memory if you only need specific fields. User-defined Records let you:

  • Select only the fields you need

  • Reduce memory usage and improve performance

  • Create custom layouts tailored to your profiling needs

This tutorial will show you how to collect CUDA API and kernel activity records with custom field selection.

For complete API reference, data structures, and best practices, see CUPTI User-Defined Activity Records.

5.5.2. Prerequisites#

This tutorial assumes you’re familiar with basic CUPTI Activity API usage. If not, please refer to CUDA kernel tracing using Activity API first.

5.5.3. Step 1: Subscribe and Enable User-defined Records#

Before configuring any activity kinds, subscribe to CUPTI and enable the user-defined records attribute:

#include <cupti.h>
#include <stdio.h>
#include <stdlib.h>

CUpti_SubscriberHandle subscriber;

// Subscribe to CUPTI callbacks (using v2 is recommended)
CUPTI_API_CALL(cuptiSubscribe_v2(&subscriber, NULL, NULL, NULL));

// Enable user-defined records attribute (mandatory for user-defined records)
size_t valueSize = sizeof(uint8_t);
uint8_t value = 1;
CUPTI_API_CALL(cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_USER_DEFINED_RECORDS, &valueSize, &value));

This attribute must be set before enabling any activity kinds with field selection.

Note

cuptiSubscribe_v2() is used here (recommended), but cuptiSubscribe() can also be used.

5.5.4. Step 2: Register Activity Buffer Callbacks#

Register the v2 buffer callbacks to receive record layout information. The v2 callbacks are mandatory for user-defined records:

// Buffer request callback
void CUPTIAPI BufferRequested(uint8_t **ppBuffer,
                               size_t *pSize,
                               size_t *pMaxNumRecords,
                               CUpti_BufferCallbackRequestInfo *pBufferRequestInfo)
{
    size_t bufferSize = 8 * 1024 * 1024; // 8 MB
    uint8_t *pBuffer = (uint8_t *)malloc(bufferSize);

    *ppBuffer = pBuffer;
    *pSize = bufferSize;
    *pMaxNumRecords = 0;
}

// Buffer completed callback - receives record layouts
void CUPTIAPI BufferCompleted(uint8_t *pBuffer,
                               size_t size,
                               size_t validSize,
                               CUpti_BufferCallbackCompleteInfo *pBufferCompleteInfo)
{
    if (validSize > 0)
    {
        // pBufferCompleteInfo->ppRecordLayouts contains layouts for all enabled kinds
        ParseActivityRecords(pBuffer, validSize, pBufferCompleteInfo);
    }
    free(pBuffer);
}

// Register the v2 callbacks
CUPTI_API_CALL(cuptiActivityRegisterCallbacks_v2(subscriber, BufferRequested, BufferCompleted));

The CUpti_BufferCallbackCompleteInfo structure provides record layouts for all enabled activity kinds, eliminating the need to manually store layouts.

5.5.5. Step 3: Select Fields for CUDA API Tracing#

Select specific fields to collect for CUDA Runtime API tracing. Field IDs are defined in the CUpti_ActivityApiFieldIds enum (see CUPTI User-Defined Activity Records for complete field mapping):

// Select fields for CUDA Runtime API
CUpti_ActivityKind activityKind = CUPTI_ACTIVITY_KIND_RUNTIME;
int selectedFields[] = {
    API_FIELD_KIND,             // Required - must be first
    API_FIELD_CBID,             // Callback ID
    API_FIELD_THREAD_ID,        // Thread ID (64-bit)
    API_FIELD_CORRELATION_ID,   // Correlation ID (64-bit)
    API_FIELD_START,            // Start timestamp
    API_FIELD_END               // End timestamp
};
size_t numFields = sizeof(selectedFields) / sizeof(int);

We’re omitting API_FIELD_PROCESS_ID and API_FIELD_RETURN_VALUE to save memory.

Note

Always include *_FIELD_KIND as the first field in your selection.

5.5.6. Step 4: Enable Activity with Field Selection#

Create the configuration and enable the activity kind with cuptiActivityEnable_v2():

CUpti_ActivityFieldSelection selection;
selection.pFieldIds = selectedFields;
selection.numFields = numFields;

CUpti_ActivityConfig activityConfig;
activityConfig.fieldSelection = selection;

// Enable activity kind with field selection
CUPTI_API_CALL(cuptiActivityEnable_v2(subscriber, activityKind, &activityConfig));

The record layout is provided in the BufferCompleted callback through CUpti_BufferCallbackCompleteInfo.ppRecordLayouts.

5.5.7. Step 5: Enable More Activity Kinds#

You can enable multiple activity kinds with different field selections. Let’s add kernel activity. Field IDs are defined in CUpti_ActivityKernelFieldIds enum:

activityKind = CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL;
int kernelFields[] = {
    KERNEL_FIELD_KIND,          // Required - must be first
    KERNEL_FIELD_DEVICE_ID,     // Device ID
    KERNEL_FIELD_CONTEXT_ID,    // Context ID
    KERNEL_FIELD_STREAM_ID,     // Stream ID
    KERNEL_FIELD_CORRELATION_ID,// Correlation ID
    KERNEL_FIELD_NAME,          // Kernel name
    KERNEL_FIELD_START,         // Start timestamp
    KERNEL_FIELD_END            // End timestamp
};

selection.pFieldIds = kernelFields;
selection.numFields = sizeof(kernelFields) / sizeof(int);
activityConfig.fieldSelection = selection;

CUPTI_API_CALL(cuptiActivityEnable_v2(subscriber, activityKind, &activityConfig));

Note

Explicit Enabling of Dependent Activity Kinds: When using user-defined records, some activity kinds that are implicitly enabled with predefined structures must be explicitly enabled. For example:

  • CUPTI_ACTIVITY_KIND_MEMCPY2 must be explicitly enabled when you need it, even if you enable CUPTI_ACTIVITY_KIND_MEMCPY

  • CUPTI_ACTIVITY_KIND_DEVICE_GRAPH_TRACE must be explicitly enabled when you need it, even if you enable CUPTI_ACTIVITY_KIND_GRAPH_TRACE

Without explicit enabling with field selection, CUPTI will not provide records for these kinds.

5.5.8. Step 6: Parse User-Defined Records#

In the BufferCompleted callback, parse records using the layout information provided in CUpti_BufferCallbackCompleteInfo. There are two approaches:

Approach 1: Field-by-Field Parsing (Recommended)

Use the layout information to extract each field:

void ParseActivityRecords(uint8_t *pBuffer,
                         size_t validSize,
                         CUpti_BufferCallbackCompleteInfo *pBufferCompleteInfo)
{
    CUpti_Activity *pRecord = NULL;
    CUptiResult status;

    do
    {
        status = cuptiActivityGetNextRecord(pBuffer, validSize, &pRecord);
        if (status == CUPTI_SUCCESS)
        {
            // Get the record layout for this activity kind from the complete info structure
            // The ppRecordLayouts array is indexed by activity kind
            CUpti_ActivityRecordLayout *pLayout =
                pBufferCompleteInfo->ppRecordLayouts[pRecord->kind];

            if (pLayout == NULL)
            {
                printf("Warning: No layout for activity kind %d\n", pRecord->kind);
                continue;
            }

            uint8_t *pRecordData = (uint8_t *)pRecord;

            if (pRecord->kind == CUPTI_ACTIVITY_KIND_RUNTIME)
            {
                printf("API Call: ");
                for (size_t i = 0; i < pLayout->numFields; i++)
                {
                    CUpti_ActivityFieldLayoutEntry entry = pLayout->pEntries[i];

                    switch (entry.fieldId)
                    {
                        case API_FIELD_CBID:
                            printf("CBID=%u ",
                                *(uint32_t *)(pRecordData + entry.offset));
                            break;
                        case API_FIELD_START:
                        {
                            uint64_t start = *(uint64_t *)(pRecordData + entry.offset);
                            uint64_t end = 0;
                            // Find end field
                            for (size_t j = 0; j < pLayout->numFields; j++)
                            {
                                if (pLayout->pEntries[j].fieldId == API_FIELD_END)
                                {
                                    end = *(uint64_t *)(pRecordData +
                                        pLayout->pEntries[j].offset);
                                    break;
                                }
                            }
                            printf("duration=%llu ns ", (unsigned long long)(end - start));
                            break;
                        }
                        case API_FIELD_CORRELATION_ID:
                            printf("corrID=%llu ",
                                *(uint64_t *)(pRecordData + entry.offset));
                            break;
                    }
                }
                printf("\n");
            }
            else if (pRecord->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL)
            {
                printf("Kernel: ");
                for (size_t i = 0; i < pLayout->numFields; i++)
                {
                    CUpti_ActivityFieldLayoutEntry entry = pLayout->pEntries[i];

                    switch (entry.fieldId)
                    {
                        case KERNEL_FIELD_NAME:
                            printf("name=%s ",
                                *(const char **)(pRecordData + entry.offset));
                            break;
                        case KERNEL_FIELD_START:
                        {
                            uint64_t start = *(uint64_t *)(pRecordData + entry.offset);
                            uint64_t end = 0;
                            for (size_t j = 0; j < pLayout->numFields; j++)
                            {
                                if (pLayout->pEntries[j].fieldId == KERNEL_FIELD_END)
                                {
                                    end = *(uint64_t *)(pRecordData +
                                        pLayout->pEntries[j].offset);
                                    break;
                                }
                            }
                            printf("duration=%llu ns ", (unsigned long long)(end - start));
                            break;
                        }
                    }
                }
                printf("\n");
            }
        }
        else if (status == CUPTI_ERROR_MAX_LIMIT_REACHED)
        {
            break;  // No more records in this buffer
        }
        else
        {
            // Unexpected error: stop parsing rather than loop forever
            break;
        }
    } while (1);
}

Approach 2: Typecast to a User-Defined Struct

Define a struct matching your field selection and typecast. See CUPTI User-Defined Activity Records for details on structure alignment:

typedef struct
{
    CUpti_ActivityKind kind;
    CUpti_CallbackId cbid;
    uint64_t threadId;
    uint64_t correlationId;
    uint64_t start;
    uint64_t end;
} CudaApiTrace;

// In your parsing function:
if (pRecord->kind == CUPTI_ACTIVITY_KIND_RUNTIME)
{
    CudaApiTrace *pApiRecord = (CudaApiTrace *)pRecord;

    printf("API Call: CBID=%u duration=%llu ns corrID=%llu\n",
           pApiRecord->cbid,
           (unsigned long long)(pApiRecord->end - pApiRecord->start),
           pApiRecord->correlationId);
}

Warning

When using Approach 2, your struct’s field ordering, types, and alignment must exactly match the CUPTI-provided layout.

5.5.9. Step 7: Changing Field Selection#

To change which fields are collected for an already-enabled activity kind, you must follow the disable-flush-enable pattern:

// 1. Disable the activity kind
CUPTI_API_CALL(cuptiActivityDisable_v2(subscriber, CUPTI_ACTIVITY_KIND_RUNTIME, NULL));

// 2. Flush all pending activity records (mandatory)
CUPTI_API_CALL(cuptiActivityFlushAll(1));

// 3. Define new field selection
int newFields[] = { API_FIELD_KIND, API_FIELD_CBID };
CUpti_ActivityFieldSelection selection;
selection.pFieldIds = newFields;
selection.numFields = sizeof(newFields) / sizeof(int);

CUpti_ActivityConfig activityConfig;
activityConfig.fieldSelection = selection;

// 4. Re-enable with new field selection
CUPTI_API_CALL(cuptiActivityEnable_v2(subscriber, CUPTI_ACTIVITY_KIND_RUNTIME, &activityConfig));

Note

The flush operation in step 2 is critical. Without it, CUPTI will return an error when attempting to enable the activity kind with different fields. See CUPTI User-Defined Activity Records for more details.

5.5.10. Complete Code Example#

Here’s the complete code for tracing CUDA operations with user-defined records, including changing field selection at runtime:

#include <cuda_runtime.h>
#include <cupti.h>
#include <stdio.h>
#include <stdlib.h>

#define CUPTI_API_CALL(call)                                              \
do {                                                                      \
    CUptiResult status = call;                                            \
    if (status != CUPTI_SUCCESS) {                                        \
        const char *errstr;                                               \
        cuptiGetResultString(status, &errstr);                            \
        fprintf(stderr, "CUPTI Error: %s at %s:%d\n",                     \
                errstr, __FILE__, __LINE__);                              \
        exit(EXIT_FAILURE);                                               \
    }                                                                     \
} while (0)

CUpti_SubscriberHandle subscriber;

// CUDA kernel
__global__ void VectorAdd(const float *A, const float *B, float *C, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        C[idx] = A[idx] + B[idx];
}

// Buffer callbacks
void CUPTIAPI BufferRequested(uint8_t **ppBuffer,
                               size_t *pSize,
                               size_t *pMaxNumRecords,
                               CUpti_BufferCallbackRequestInfo *pBufferRequestInfo)
{
    *pSize = 8 * 1024 * 1024;
    *ppBuffer = (uint8_t *)malloc(*pSize);
    if (*ppBuffer == NULL)
    {
        fprintf(stderr, "Error: Failed to allocate activity buffer\n");
        exit(EXIT_FAILURE);
    }
    *pMaxNumRecords = 0;  // 0 lets CUPTI fill the buffer with as many records as fit
}

void CUPTIAPI BufferCompleted(uint8_t *pBuffer,
                               size_t size,
                               size_t validSize,
                               CUpti_BufferCallbackCompleteInfo *pBufferCompleteInfo)
{
    if (validSize > 0)
    {
        CUpti_Activity *pRecord = NULL;
        CUptiResult status;

        do
        {
            status = cuptiActivityGetNextRecord(pBuffer, validSize, &pRecord);
            if (status == CUPTI_SUCCESS)
            {
                CUpti_ActivityRecordLayout *pLayout =
                    pBufferCompleteInfo->ppRecordLayouts[pRecord->kind];
                if (pLayout == NULL) continue;

                uint8_t *pRecordData = (uint8_t *)pRecord;

                if (pRecord->kind == CUPTI_ACTIVITY_KIND_RUNTIME)
                {
                    printf("API: ");
                    for (size_t i = 0; i < pLayout->numFields; i++)
                    {
                        CUpti_ActivityFieldLayoutEntry entry = pLayout->pEntries[i];
                        if (entry.fieldId == API_FIELD_CBID)
                            printf("CBID=%u ", *(uint32_t *)(pRecordData + entry.offset));
                    }
                    printf("\n");
                }
                else if (pRecord->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL)
                {
                    printf("Kernel: ");
                    for (size_t i = 0; i < pLayout->numFields; i++)
                    {
                        CUpti_ActivityFieldLayoutEntry entry = pLayout->pEntries[i];
                        if (entry.fieldId == KERNEL_FIELD_NAME)
                            printf("name=%s ", *(const char **)(pRecordData + entry.offset));
                    }
                    printf("\n");
                }
            }
            else if (status == CUPTI_ERROR_MAX_LIMIT_REACHED) break;
            else break;  // Unexpected error: stop parsing rather than loop forever
        } while (1);
    }
    free(pBuffer);
}

void RunCudaWorkload()
{
    int N = 50000;
    size_t size = N * sizeof(float);
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    for (int i = 0; i < N; i++)
    {
        h_A[i] = (float)i;
        h_B[i] = (float)i * 2;
    }

    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);
}

void SetupCupti()
{
    CUPTI_API_CALL(cuptiSubscribe_v2(&subscriber, NULL, NULL, NULL));

    // Enable user-defined records (mandatory)
    size_t valueSize = sizeof(uint8_t);
    uint8_t value = 1;
    CUPTI_API_CALL(cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_USER_DEFINED_RECORDS, &valueSize, &value));

    // Register v2 callbacks (mandatory)
    CUPTI_API_CALL(cuptiActivityRegisterCallbacks_v2(subscriber, BufferRequested, BufferCompleted));

    // Enable CUDA Runtime API
    int apiFields[] = { API_FIELD_KIND, API_FIELD_CBID, API_FIELD_START, API_FIELD_END };
    CUpti_ActivityFieldSelection selection;
    selection.pFieldIds = apiFields;
    selection.numFields = sizeof(apiFields) / sizeof(int);

    CUpti_ActivityConfig activityConfig;
    activityConfig.fieldSelection = selection;
    CUPTI_API_CALL(cuptiActivityEnable_v2(subscriber, CUPTI_ACTIVITY_KIND_RUNTIME, &activityConfig));

    // Enable kernel activity
    int kernelFields[] = { KERNEL_FIELD_KIND, KERNEL_FIELD_NAME, KERNEL_FIELD_START, KERNEL_FIELD_END };
    selection.pFieldIds = kernelFields;
    selection.numFields = sizeof(kernelFields) / sizeof(int);
    activityConfig.fieldSelection = selection;
    CUPTI_API_CALL(cuptiActivityEnable_v2(subscriber, CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL, &activityConfig));
}

void ChangeFieldSelection()
{
    printf("\n=== Changing field selection ===\n");
    CUPTI_API_CALL(cuptiActivityDisable_v2(subscriber, CUPTI_ACTIVITY_KIND_RUNTIME, NULL));
    CUPTI_API_CALL(cuptiActivityFlushAll(1));  // Mandatory flush

    int newFields[] = { API_FIELD_KIND, API_FIELD_CBID };
    CUpti_ActivityFieldSelection selection;
    selection.pFieldIds = newFields;
    selection.numFields = sizeof(newFields) / sizeof(int);

    CUpti_ActivityConfig activityConfig;
    activityConfig.fieldSelection = selection;
    CUPTI_API_CALL(cuptiActivityEnable_v2(subscriber, CUPTI_ACTIVITY_KIND_RUNTIME, &activityConfig));
}

void CleanupCupti()
{
    CUPTI_API_CALL(cuptiActivityFlushAll(1));
    CUPTI_API_CALL(cuptiUnsubscribe(subscriber));
}

int main()
{
    // Check CUPTI version
    uint32_t version;
    CUPTI_API_CALL(cuptiGetVersion(&version));
    printf("Current CUPTI Version: %u\n", version);

    if (version < 130200)
    {
        printf("CUPTI user-defined activity records require CUPTI version 130200 or higher. Skipping the sample.\n");
        exit(EXIT_SUCCESS);
    }

    SetupCupti();

    printf("\n=== Running workload with initial fields ===\n");
    RunCudaWorkload();
    cudaDeviceSynchronize();
    CUPTI_API_CALL(cuptiActivityFlushAll(0));

    ChangeFieldSelection();

    printf("\n=== Running workload with new fields ===\n");
    RunCudaWorkload();
    cudaDeviceSynchronize();
    CUPTI_API_CALL(cuptiActivityFlushAll(0));

    CleanupCupti();
    return 0;
}

5.5.11. Expected Output#

When you run the application, you’ll see activity records with only your selected fields:

=== Running workload with initial fields ===
API: CBID=157
API: CBID=158
API: CBID=159
Kernel: name=VectorAdd(float const*, float const*, float*, int)
API: CBID=160

=== Changing field selection ===

=== Running workload with new fields ===
API: CBID=157
API: CBID=158
API: CBID=159
Kernel: name=VectorAdd(float const*, float const*, float*, int)
API: CBID=160

Notice that:

  • Only the selected fields are collected, reducing memory usage compared to standard predefined records

  • The exact CBID values and number of API calls may vary depending on CUDA runtime version

  • Field selection can be changed at runtime using the disable-flush-enable pattern

5.5.12. Key Takeaways#

  1. Subscribe first: Call cuptiSubscribe() or cuptiSubscribe_v2() before calling any other CUPTI APIs (mandatory). Using cuptiSubscribe_v2() is recommended.

  2. Enable the feature: Set CUPTI_ACTIVITY_ATTR_USER_DEFINED_RECORDS before enabling any activity kinds (mandatory).

  3. Use v2 APIs: cuptiActivityEnable_v2(), cuptiActivityRegisterCallbacks_v2(), and cuptiActivityDisable_v2() are mandatory for user-defined records.

  4. Always include *_FIELD_KIND: This must be the first field in your selection.

  5. Layouts are provided in callbacks: Record layouts are available through the ppRecordLayouts field of the CUpti_BufferCallbackCompleteInfo structure received in the BufferCompleted callback.

  6. Field-by-field parsing is safer: It is more flexible than typecasting to custom structs and handles alignment automatically.

  7. Disable-Flush-Enable pattern: To change field selection, you must disable with cuptiActivityDisable_v2(), flush with cuptiActivityFlushAll(1) (mandatory), then re-enable. The complete example demonstrates this pattern.

  8. Explicit enabling required: Some activity kinds, such as CUPTI_ACTIVITY_KIND_MEMCPY2 and CUPTI_ACTIVITY_KIND_DEVICE_GRAPH_TRACE, must be explicitly enabled with field selection, unlike predefined structures where they are implicitly enabled.

For complete API reference, data structures, limitations, and version compatibility, see CUPTI User-Defined Activity Records.

5.5.13. Next Steps#

  • Complete API Reference: See CUPTI User-Defined Activity Records for data structures, limitations, and version compatibility

  • Sample Code: Review the complete sample in cupti_user_defined_records.cu (Location: $CUPTI_DIR/samples/)

  • Field Enums: Learn about all supported activity kinds and their field ID enums in cupti_activity.h

  • Helper Utilities: Explore helper_cupti_activity.h, helper_cupti_activity_enums.h, and helper_cupti_activity_user_defined_records.h for helper functions (Location: $CUPTI_DIR/samples/common/)