2. Usage

2.1. CUPTI Compatibility and Requirements

CUPTI, the CUDA Profiling Tools Interface, ensures seamless profiling compatibility for CUDA applications across various GPU architectures and CUDA driver versions. As part of the CUDA Toolkit, CUPTI adheres to CUDA Toolkit compatibility requirements with CUDA drivers, which includes support for Backward, Forward and Enhanced compatibilities. For instance, a profiling tool based on an older version of CUPTI can still operate with a more recent CUDA driver.

It’s essential to refer to the CUDA Toolkit and Compatible Driver Versions table to determine the minimum CUDA driver version required for each release of CUPTI corresponding to a CUDA Toolkit release. Attempting to use CUPTI calls with an incompatible CUDA driver version will result in a CUPTI_ERROR_NOT_INITIALIZED error code.

2.2. CUPTI Initialization

CUPTI initialization occurs lazily the first time you invoke any CUPTI function. For the Activity, Event, Metric, and Callback APIs there are no requirements on when this initialization must occur (i.e. you can invoke the first CUPTI function at any point). See the CUPTI Activity API section for more information on CUPTI initialization requirements for the activity API.

It is recommended for CUPTI clients to call the API cuptiSubscribe() before starting the profiling session i.e. API cuptiSubscribe() should be called before calling any other CUPTI API. This API will return the error code CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED when another CUPTI client is already subscribed. CUPTI client should error out and not make further CUPTI calls if cuptiSubscribe() returns an error. This would prevent multiple CUPTI clients to be active at the same time otherwise those might interfere with the profiling state of each other.

2.3. CUPTI Activity API

The CUPTI Activity API allows you to asynchronously collect a trace of an application’s CPU and GPU CUDA activity. The following terminology is used by the activity API.

Activity Record

CPU and GPU activity is reported in C data structures called activity records. There is a different C structure type for each activity kind (e.g. CUpti_ActivityAPI). Records are generically referred to using the CUpti_Activity type. This type contains only a field that indicates the kind of the activity record. Using this kind, the object can be cast from the generic CUpti_Activity type to the specific type representing the activity. See the printActivity function in the activity_trace_async sample for an example.

Activity Buffer

An activity buffer is used to transfer one or more activity records from CUPTI to the client. CUPTI fills activity buffers with activity records as the corresponding activities occur on the CPU and GPU. But CUPTI doesn’t guarantee any ordering of the activities in the activity buffer as activity records for few activity kinds are added lazily. The CUPTI client is responsible for providing empty activity buffers as necessary to ensure that no records are dropped.

An asynchronous buffering API is implemented by cuptiActivityRegisterCallbacks and cuptiActivityFlushAll.

It is not required that the activity API be initialized before CUDA initialization. All related activities occurring after initializing the activity API are collected. You can force initialization of the activity API by enabling one or more activity kinds using cuptiActivityEnable or cuptiActivityEnableContext, as shown in the initTrace function of the activity_trace_async sample. Some activity kinds cannot be directly enabled, see the API documentation for CUpti_ActivityKind for details. The functions cuptiActivityEnable and cuptiActivityEnableContext will return CUPTI_ERROR_NOT_COMPATIBLE if the requested activity kind cannot be enabled.

The activity buffer API uses callbacks to request and return buffers of activity records. To use the asynchronous buffering API, you must first register two callbacks using cuptiActivityRegisterCallbacks. One of these callbacks will be invoked whenever CUPTI needs an empty activity buffer. The other callback is used to deliver a buffer containing one or more activity records to the client. To minimize profiling overhead the client should return as quickly as possible from these callbacks. Client can pre-allocate a pool of activity buffers and return an empty buffer from the pool when requested by CUPTI. Activity buffer size should be chosen carefully, smaller buffers can result in frequent requests by CUPTI and bigger buffers can delay the automatic delivery of completed activity buffers. For typical workloads, it’s suggested to choose a size between 1 and 10 MB. The functions cuptiActivityGetAttribute and cuptiActivitySetAttribute can be used to read and write attributes that control how the buffering API behaves. See the API documentation for more information.

Flushing of the activity buffers

CUPTI is expected to deliver the activity buffer automatically as soon as it gets full and all the activity records in it are completed. For performance reasons, CUPTI calls the underlying methods based on certain heuristics, thus it can cause delay in the delivery of the buffer. However client can make a request to deliver the activity buffer/s at any time, and this can be achieved using the APIs cuptiActivityFlushAll and cuptiActivityFlushPeriod. Behavior of these APIs is as follows:

  • For on-demand flush using the API cuptiActivityFlushAll with the flag set as 0, CUPTI returns all the activity buffers which have all the activity records completed, buffers need not to be full though. It doesn’t return buffers which have one or more incomplete records. This flush can be done at a regular interval in a separate thread.

  • For on-demand forced flush using the API cuptiActivityFlushAll with the flag set as CUPTI_ACTIVITY_FLAG_FLUSH_FORCED, CUPTI returns all the activity buffers including the ones which have one or more incomplete activity records. It’s suggested to do the forced flush before the termination of the profiling session to allow remaining buffers to be delivered.

  • For periodic flush using the API cuptiActivityFlushPeriod, CUPTI returns only those activity buffers which are full and have all the activity records completed. It’s allowed to use the API cuptiActivityFlushAll to flush the buffers on-demand, even when client sets the periodic flush.

Note that activity record is considered as completed if it has all the information filled up including the timestamps (if any).

The activity_trace_async sample shows how to use the activity buffer API to collect a trace of CPU and GPU activity for a simple application.

CUPTI Threads

CUPTI creates a worker thread to minimize the perturbance for the application created threads. CUPTI offloads certain operations from the application threads to the worker thread, this incldues synchronization of profiling resources between host and device, delivery of the activity buffers to the client using the buffer completed callback registered in the API cuptiActivityRegisterCallbacks etc. To minimize the overhead, CUPTI wakes up the worker thread based on certain heuristics. API cuptiActivityFlushPeriod introduced in CUDA 11.1 can be used to control the flush period of the worker thread. This setting overrides the CUPTI heuristics. It’s allowed to use the API cuptiActivityFlushAll to flush the data on-demand, even when client sets the periodic flush.

Further, CUPTI creates separate threads when certain activity kinds are enabled. For example, CUPTI creates one thread each for activity kinds CUPTI_ACTIVITY_KIND_UNIFIED_MEMORY_COUNTER and CUPTI_ACTIVITY_KIND_ENVIRONMENT to collect the information from the backend.

2.3.1. SASS Source Correlation

While high-level languages for GPU programming like CUDA C offer a useful level of abstraction, convenience, and maintainability, they inherently hide some of the details of the execution on the hardware. It is sometimes helpful to analyze performance problems for a kernel at the assembly instruction level. Reading assembly language is tedious and challenging; CUPTI can help you to build the correlation between lines in your high-level source code and the executed assembly instructions.

Building SASS source correlation for a PC can be split into two parts:

  • Correlation of the PC to SASS instruction - subscribe to any one of the CUPTI_CBID_RESOURCE_MODULE_LOADED, CUPTI_CBID_RESOURCE_MODULE_UNLOAD_STARTING, or CUPTI_CBID_RESOURCE_MODULE_PROFILED callbacks. This returns a CUpti_ModuleResourceData structure having the CUDA binary. The binary can be disassembled using the nvdisasm utility that comes with the CUDA toolkit. An application can have multiple functions and modules, to uniquely identify there is a functionId field in all source level activity records. This uniquely corresponds to a CUPTI_ACTIVITY_KIND_FUNCTION, which has the unique module ID and function ID in the module.

  • Correlation of the SASS instruction to CUDA source line - every source level activity has a sourceLocatorId field which uniquely maps to a record of kind CUPTI_ACTIVITY_KIND_SOURCE_LOCATOR, containing the line and file name information. Please note that multiple PCs can correspond to a single source line.

When any source level activity (global access, branch, PC Sampling, etc.) is enabled, a source locator record is generated for the PCs that have the source level results. The record CUpti_ActivityInstructionCorrelation can be used, along with source level activities, to generate SASS assembly instructions to CUDA C source code mapping for all the PCs of the function, and not just the PCs that have the source level results. This can be enabled using the activity kind CUPTI_ACTIVITY_KIND_INSTRUCTION_CORRELATION.

The sass_source_map sample shows how to map SASS assembly instructions to CUDA C source.

2.3.2. PC Sampling

CUPTI supports device-wide sampling of the program counter (PC). The PC Sampling gives the number of samples for each source and assembly line with various stall reasons. Using this information, you can pinpoint portions of your kernel that are introducing latencies and the reason for the latency. Samples are taken in round robin order for all active warps at a fixed number of cycles, regardless of whether the warp is issuing an instruction or not.

Devices with compute capability 6.0 and higher have a new feature that gives latency reasons. The latency samples indicate the reasons for holes in the issue pipeline. While collecting these samples, there is no instruction issued in the respective warp scheduler, hence these give the latency reasons. The latency reasons will be one of the stall reasons listed in the enum CUpti_ActivityPCSamplingStallReason, except stall reason CUPTI_ACTIVITY_PC_SAMPLING_STALL_NOT_SELECTED.

The activity record CUpti_ActivityPCSampling3, enabled using activity kind CUPTI_ACTIVITY_KIND_PC_SAMPLING, outputs the stall reason along with PC and other related information. The enum CUpti_ActivityPCSamplingStallReason lists all the stall reasons. Sampling period is configurable and can be tuned using API cuptiActivityConfigurePCSampling. A wide range of sampling periods, ranging from 2^5 cycles to 2^31 cycles per sample, is supported. This can be controlled through the field samplingPeriod2 in the PC sampling configuration struct CUpti_ActivityPCSamplingConfig. The activity record CUpti_ActivityPCSamplingRecordInfo provides the total and dropped samples for each kernel profiled for PC sampling.

This feature is available on devices with compute capability 5.2 and higher, excluding mobile devices. For Pascal and older chips cuptiActivityConfigurePCSampling api must be called before enabling activity kind CUPTI_ACTIVITY_KIND_PC_SAMPLING, for Volta and newer chips order does not matter. For Volta and newer GPU architectures if cuptiActivityConfigurePCSampling API is called in the middle of execution, PC sampling configuration will be updated for subsequent kernel launches. PC sampling can significantly change the overall performance characteristics of the application because all kernel executions are serialized on the GPU.

The pc_sampling sample shows how to use these APIs to collect PC Sampling profiling information for a kernel.

Note

A new set of PC Sampling APIs was introduced in the CUDA 11.3 release, which supports continuous mode data collection without serializing kernel execution and have a lower runtime overhead. Refer to the section CUPTI PC Sampling API for more details. PC Sampling APIs from the header cupti_activity.h would be referred as PC Sampling Activity APIs and APIs from the header cupti_pcsampling.h would be referred as PC Sampling APIs.

2.3.4. OpenACC

CUPTI supports collecting information for OpenACC applications using the OpenACC tools interface implementation of the PGI runtime. OpenACC profiling is available only on Linux x86_64, IBM POWER and Arm server platform (arm64 SBSA) platforms. This feature also requires PGI runtime version 19.1 or higher.

The activity records CUpti_ActivityOpenAccData, CUpti_ActivityOpenAccLaunch, and CUpti_ActivityOpenAccOther are created, representing the three groups of callback events specified in the OpenACC tools interface. CUPTI_ACTIVITY_KIND_OPENACC_DATA, CUPTI_ACTIVITY_KIND_OPENACC_LAUNCH, and CUPTI_ACTIVITY_KIND_OPENACC_OTHER can be enabled to collect the respective activity records.

Due to the restrictions of the OpenACC tools interface, CUPTI cannot record OpenACC records from within the client application. Instead, a shared library that exports the acc_register_library function defined in the OpenACC tools interface specification must be implemented. Parameters passed into this function from the OpenACC runtime can be used to initialize the CUPTI OpenACC measurement using cuptiOpenACCInitialize. Before starting the client application, the environment variable ACC_PROFLIB must be set to point to this shared library.

cuptiOpenACCInitialize is defined in cupti_openacc.h, which is included by cupti_activity.h. Since the CUPTI OpenACC header is only available on supported platforms, CUPTI clients must define CUPTI_OPENACC_SUPPORT when compiling.

The openacc_trace sample shows how to use CUPTI APIs for OpenACC data collection.

2.3.5. CUDA Graphs

CUPTI can collect trace of CUDA Graphs applications without breaking driver performance optimizations. CUPTI has added fields graphId and graphNodeId in the kernel, memcpy and memset activity records to denote the unique ID of the graph and the graph node respectively of the GPU activity. CUPTI issues callbacks for graph operations like graph and graph node creation/destruction/cloning and also for executable graph creation/destruction. The cuda_graphs_trace sample shows how to collect GPU trace and API trace for CUDA Graphs and how to correlate a graph node launch to the node creation API by using CUPTI callbacks for graph operations.

2.3.6. External Correlation

CUPTI supports correlation of CUDA API activity records with external APIs. Such APIs include OpenACC, OpenMP, and MPI. This associates CUPTI correlation IDs with IDs provided by the external API. Both IDs are stored in a new activity record of type CUpti_ActivityExternalCorrelation.

CUPTI maintains a stack of external correlation IDs per CPU thread and per CUpti_ExternalCorrelationKind. Clients must use cuptiActivityPushExternalCorrelationId to push an external ID of a specific kind to this stack and cuptiActivityPopExternalCorrelationId to remove the latest ID. If a CUDA API activity record is generated while any CUpti_ExternalCorrelationKind-stack on the same CPU thread is non-empty, one CUpti_ActivityExternalCorrelation record per CUpti_ExternalCorrelationKind-stack is inserted into the activity buffer before the respective CUDA API activity record. The CUPTI client is responsible for tracking passed external API correlation IDs, in order to eventually associate external API calls with CUDA API calls. Along with the activity kind CUPTI_ACTIVITY_KIND_EXTERNAL_CORRELATION, it is necessary to enable the CUDA API activity kinds i.e. CUPTI_ACTIVITY_KIND_RUNTIME and CUPTI_ACTIVITY_KIND_DRIVER to generate external correlation activity records.

If both CUPTI_ACTIVITY_KIND_EXTERNAL_CORRELATION and any of CUPTI_ACTIVITY_KIND_OPENACC_* activity kinds are enabled, CUPTI will generate external correlation activity records for OpenACC with externalKindCUPTI_EXTERNAL_CORRELATION_KIND_OPENACC.

The cupti_external_correlation sample shows how to use CUPTI APIs for external correlation.

2.3.7. Dynamic Attach and Detach

CUPTI provides mechanisms for attaching to or detaching from a running process to support on-demand profiling. CUPTI can be attached by calling any CUPTI API as CUPTI supports lazy initialization. To detach CUPTI, call the API cuptiFinalize() which destroys and cleans up all the resources associated with CUPTI in the current process. After CUPTI detaches from the process, the process will keep on running with no CUPTI attached to it. Any subsequent CUPTI API call will reinitialize the CUPTI. You can attach and detach CUPTI any number of times. For safe operation of the API, it is recommended that API cuptiFinalize() is invoked from the exit call site of any of the CUDA Driver or Runtime API. Otherwise, CUPTI client needs to make sure that CUDA synchronization and CUPTI activity buffer flush is done before calling the API cuptiFinalize(). To understand the need for calling the API cuptiFinalize() from specific point/s in the code flow, consider multiple application threads performing various CUDA activities. While one thread is in the middle of the cuptiFinalize(), it is quite possible that other threads continue to call into the CUPTI and try to access the state of various objects (device, context, thread state etc) maintained by CUPTI, which might be rendered invalid as part of the cuptiFinalize(), thus resulting in the crash. We have to block the other threads until CUPTI teardown is completed via cuptiFinalize(). API exit call site is one such location where we can ensure that the work submitted by all the threads has been completed and we can safely teardown CUPTI. cuptiFinalize() is a heavy operation as it does context synchronization for all active CUDA contexts and blocks all the application threads until CUPTI teardown is completed. Sample code showing the usage of the API cuptiFinalize() in the cupti callback handler code:

void CUPTIAPI
cuptiCallbackHandler(void *userdata, CUpti_CallbackDomain domain,
    CUpti_CallbackId cbid, void *cbdata)
{
    const CUpti_CallbackData *cbInfo = (CUpti_CallbackData *)cbdata;

    // Take this code path when CUPTI detach is requested
    if (detachCupti) {
        switch(domain)
        {
        case CUPTI_CB_DOMAIN_RUNTIME_API:
        case CUPTI_CB_DOMAIN_DRIVER_API:
            if (cbInfo->callbackSite == CUPTI_API_EXIT) {
                // call the CUPTI detach API
                cuptiFinalize();
            }
            break;
        default:
            break;
        }
    }
}

Full code can be found in the sample cupti_finalize.

2.3.8. Device Memory Allocation Source Tracking

CUDA applications utilize various shared libraries such as cuBLAS, cuFFT, cuDNN etc, each serving distinct purposes. These libraries can be integrated either statically at compile time or loaded dynamically during runtime. In the case of dynamic loading, CUPTI enables precise attribution of memory allocations to their respective shared libraries. This can be achieved by calling the cuptiActivityEnableAllocationSource() API. The filepath of the responsible shared object is assigned in the source field in the activity record CUpti_ActivityMemory4 which is enabled using the activity kind CUPTI_ACTIVITY_KIND_MEMORY2. This functionality is currently exclusive to Linux x86_64 platform. However, if the library is statically linked, the source is identified as the main application executable rather than the library.

2.4. CUPTI Callback API

The CUPTI Callback API allows you to register a callback into your own code. Your callback will be invoked when the application being profiled calls a CUDA runtime or driver function, or when certain events occur in the CUDA driver. The following terminology is used by the callback API.

Callback Domain

Callbacks are grouped into domains to make it easier to associate your callback functions with groups of related CUDA functions or events. There are currently four callback domains, as defined by CUpti_CallbackDomain: a domain for CUDA runtime functions, a domain for CUDA driver functions, a domain for CUDA resource tracking, and a domain for CUDA synchronization notification.

Callback ID

Each callback is given a unique ID within the corresponding callback domain so that you can identify it within your callback function. The CUDA driver API IDs are defined in cupti_driver_cbid.h and the CUDA runtime API IDs are defined in cupti_runtime_cbid.h. Both of these headers are included for you when you include cupti.h. The CUDA resource callback IDs are defined by CUpti_CallbackIdResource, and the CUDA synchronization callback IDs are defined by CUpti_CallbackIdSync.

Callback Function

Your callback function must be of type CUpti_CallbackFunc. This function type has two arguments that specify the callback domain and ID so that you know why the callback is occurring. The type also has a cbdata argument that is used to pass data specific to the callback.

Subscriber

A subscriber is used to associate each of your callback functions with one or more CUDA API functions. There can be at most one subscriber initialized with cuptiSubscribe() at any time. Before initializing a new subscriber, the existing subscriber must be finalized with cuptiUnsubscribe().

Each callback domain is described in detail below. Unless explicitly stated, it is not supported to call any CUDA runtime or driver API from within a callback function. Doing so may cause the application to hang.

2.4.1. Driver and Runtime API Callbacks

Using the callback API with the CUPTI_CB_DOMAIN_DRIVER_API or CUPTI_CB_DOMAIN_RUNTIME_API domains, you can associate a callback function with one or more CUDA API functions. When those CUDA functions are invoked in the application, your callback function is invoked as well. For these domains, the cbdata argument to your callback function will be of the type CUpti_CallbackData.

It is legal to call cudaThreadSynchronize(), cudaDeviceSynchronize(), cudaStreamSynchronize(), cuCtxSynchronize(), and cuStreamSynchronize() from within a driver or runtime API callback function.

The following code shows a typical sequence used to associate a callback function with one or more CUDA API functions. To simplify the presentation, error checking code has been removed.

CUpti_SubscriberHandle subscriber;
MyDataStruct *my_data = ...;
...
cuptiSubscribe(&subscriber,
               (CUpti_CallbackFunc)my_callback , my_data);
cuptiEnableDomain(1, subscriber,
                  CUPTI_CB_DOMAIN_RUNTIME_API);

First, cuptiSubscribe is used to initialize a subscriber with the my_callback callback function. Next, cuptiEnableDomain is used to associate that callback with all the CUDA runtime API functions. Using this code sequence will cause my_callback to be called twice each time any of the CUDA runtime API functions are invoked, once on entry to the CUDA function and once just before exit from the CUDA function. CUPTI callback API functions cuptiEnableCallback and cuptiEnableAllDomains can also be used to associate CUDA API functions with a callback (see reference below for more information).

The following code shows a typical callback function.

void CUPTIAPI
my_callback(void *userdata, CUpti_CallbackDomain domain,
            CUpti_CallbackId cbid, const void *cbdata)
{
  const CUpti_CallbackData *cbInfo = (CUpti_CallbackData *)cbdata;
  MyDataStruct *my_data = (MyDataStruct *)userdata;

  if ((domain == CUPTI_CB_DOMAIN_RUNTIME_API) &&
      (cbid == CUPTI_RUNTIME_TRACE_CBID_cudaMemcpy_v3020))  {
    if (cbInfo->callbackSite == CUPTI_API_ENTER) {
        cudaMemcpy_v3020_params *funcParams =
             (cudaMemcpy_v3020_params *)(cbInfo->
                 functionParams);

        size_t count = funcParams->count;
        enum cudaMemcpyKind kind = funcParams->kind;
        ...
      }
  ...

In your callback function, you use the CUpti_CallbackDomain and CUpti_CallbackID parameters to determine which CUDA API function invocation is causing this callback. In the example above, we are checking for the CUDA runtime cudaMemcpy function. The cbdata parameter holds a structure of useful information that can be used within the callback. In this case, we use the callbackSite member of the structure to detect that the callback is occurring on entry to cudaMemcpy, and we use the functionParams member to access the parameters that were passed to cudaMemcpy. To access the parameters, we first cast functionParams to a structure type corresponding to the cudaMemcpy function. These parameter structures are contained in generated_cuda_runtime_api_meta.h, generated_cuda_meta.h, and a number of other files. When possible, these files are included for you by cupti.h.

The callback_event and callback_timestamp samples described on the samples page both show how to use the callback API for the driver and runtime API domains.

2.4.2. Resource Callbacks

Using the callback API with the CUPTI_CB_DOMAIN_RESOURCE domain, you can associate a callback function with some CUDA resource creation and destruction events. For example, when a CUDA context is created, your callback function will be invoked with a callback ID equal to CUPTI_CBID_RESOURCE_CONTEXT_CREATED. For this domain, the cbdata argument to your callback function will be of the type CUpti_ResourceData.

Note that APIs cuptiActivityFlush and cuptiActivityFlushAll will result in deadlock when called from stream destroy starting callback identified using callback ID CUPTI_CBID_RESOURCE_STREAM_DESTROY_STARTING.

2.4.3. Synchronization Callbacks

Using the callback API with the CUPTI_CB_DOMAIN_SYNCHRONIZE domain, you can associate a callback function with CUDA context and stream synchronizations. For example, when a CUDA context is synchronized, your callback function will be invoked with a callback ID equal to CUPTI_CBID_SYNCHRONIZE_CONTEXT_SYNCHRONIZED. For this domain, the cbdata argument to your callback function will be of the type CUpti_SynchronizeData.

2.4.4. NVIDIA Tools Extension Callbacks

Using the callback API with the CUPTI_CB_DOMAIN_NVTX domain, you can associate a callback function with NVIDIA Tools Extension (NVTX) API functions. When an NVTX function is invoked in the application, your callback function is invoked as well. For these domains, the cbdata argument to your callback function will be of the type CUpti_NvtxData.

The NVTX library has its own convention for discovering the profiling library that will provide the implementation of the NVTX callbacks. To receive callbacks, you must set the NVTX environment variables appropriately so that when the application calls an NVTX function, your profiling library receives the callbacks. The following code sequence shows a typical initialization sequence to enable NVTX callbacks and activity records.

/* Set env so CUPTI-based profiling library loads on first nvtx call. */
char *inj32_path = "/path/to/32-bit/version/of/cupti/based/profiling/library";
char *inj64_path = "/path/to/64-bit/version/of/cupti/based/profiling/library";
setenv("NVTX_INJECTION32_PATH", inj32_path, 1);
setenv("NVTX_INJECTION64_PATH", inj64_path, 1);

The following code shows a typical sequence used to associate a callback function with one or more NVTX functions. To simplify the presentation, error checking code has been removed.

CUpti_SubscriberHandle subscriber;
MyDataStruct *my_data = ...;
...
cuptiSubscribe(&subscriber,
               (CUpti_CallbackFunc)my_callback , my_data);
cuptiEnableDomain(1, subscriber,
                  CUPTI_CB_DOMAIN_NVTX);

First, cuptiSubscribe is used to initialize a subscriber with the my_callback callback function. Next, cuptiEnableDomain is used to associate that callback with all the NVTX functions. Using this code sequence will cause my_callback to be called once each time any of the NVTX functions are invoked. CUPTI callback API functions cuptiEnableCallback and cuptiEnableAllDomains can also be used to associate NVTX API functions with a callback (see reference below for more information).

The following code shows a typical callback function.

void CUPTIAPI
my_callback(void *userdata, CUpti_CallbackDomain domain,
            CUpti_CallbackId cbid, const void *cbdata)
{
  const CUpti_NvtxData *nvtxInfo = (CUpti_NvtxData *)cbdata;
  MyDataStruct *my_data = (MyDataStruct *)userdata;

  if ((domain == CUPTI_CB_DOMAIN_NVTX) &&
      (cbid == CUPTI_CBID_NVTX_nvtxRangeStartEx))  {
    nvtxRangeStartEx_params *params = (nvtxRangeStartEx_params *)nvtxInfo->
             functionParams;
    nvtxRangeId_t *id = (nvtxRangeId_t *)nvtxInfo->functionReturnValue;
    ...
  }
  ...

In your callback function, you use the CUpti_CallbackDomain and CUpti_CallbackID parameters to determine which NVTX API function invocation is causing this callback. In the example above, we are checking for the nvtxRangeStartEx function. The cbdata parameter holds a structure of useful information that can be used within the callback. In this case, we use the functionParams member to access the parameters that were passed to nvtxRangeStartEx. To access the parameters, we first cast functionParams to a structure type corresponding to the nvtxRangeStartEx function. These parameter structures are contained in generated_nvtx_meta.h. We also use functionReturnValue member to access the value returned by nvtxRangeStartEx. To access the return value, we first cast functionReturnValue to the return type corresponding to the nvtxRangeStartEx function. If there is no return value for the NVTX function, functionReturnValue is NULL.

The sample cupti_nvtx shows the initialization sequence to enable NVTX callbacks and activity records.

If your CUPTI-based profiling library links static CUPTI library, you can define and export your own NvtxInitializeInjection and NvtxInitializeInjection2 functions, which would be called by setting the NVTX environment variables.

If you want CUPTI to handle NVTX calls, these functions should call CUPTI’s corresponding initialization functions, as shown in the example below so that when the application calls a NVTX function, your profiling library receives the callbacks. The following code sequence shows how this can be done to receive callbacks and activity records when linking static CUPTI library.

/* Set env so CUPTI-based profiling library loads on first nvtx call. */
char *inj32_path = "/path/to/32-bit/version/of/cupti/based/profiling/library";
char *inj64_path = "/path/to/64-bit/version/of/cupti/based/profiling/library";
setenv("NVTX_INJECTION32_PATH", inj32_path, 1);
setenv("NVTX_INJECTION64_PATH", inj64_path, 1);

/* Extern the CUPTI NVTX initialization APIs. The APIs are thread-safe */
extern "C" CUptiResult CUPTIAPI cuptiNvtxInitialize(void* pfnGetExportTable);
extern "C" CUptiResult CUPTIAPI cuptiNvtxInitialize2(void* pfnGetExportTable);

extern "C" int InitializeInjectionNvtx(void* p)
{
  CUptiResult res = cuptiNvtxInitialize(p);
  return (res == CUPTI_SUCCESS) ? 1 : 0;
}

extern "C" int InitializeInjectionNvtx2(void* p)
{
  CUptiResult res = cuptiNvtxInitialize2(p);
  return (res == CUPTI_SUCCESS) ? 1 : 0;
}

Alternatively, if you want to handle NVTX calls directly in your profiling library, you can attach your own callbacks to the NVTX client in these functions.

NVTX v1 and v2 both have the initialization code in a single injection library shared by all users of NVTX in the whole process, so the initialization will happen only once per process. NVTX v3 embeds the initialization code into your own binaries, so if NVTX v3 is in multiple dynamic libraries, each one of those sites will initialize the first time a NVTX call is made from that dynamic library. These first calls could be on different threads. So if you are wiring up your own NVTX handlers, you should ensure that code is thread-safe when called from multiple threads at once.

2.4.5. State Callbacks

Any fatal error encountered by an explicit CUPTI API call is returned by the API itself, whereas errors encountered by CUPTI in the background is returned to the user only during the next explicit CUPTI API call. Using the callback API with the CUPTI_CB_DOMAIN_STATE domain, you can associate a callback function with errors in CUPTI, and receive the reported error instantaneously. For example, when a CUPTI runs into a fatal error, your callback function will be invoked with a callback ID equal to CUPTI_CBID_STATE_FATAL_ERROR. For this domain, the cbdata argument to your callback function will be of the type CUpti_StateData.

As part of CUpti_StateData, you can receive the error code of the failure, along with an appropriate error message with possible causes or appropriate links to documentation. The example usage of these callbacks can be found in the CUPTI trace samples.

2.5. CUPTI Event API

The CUPTI Event API allows you to query, configure, start, stop, and read the event counters on a CUDA-enabled device. The following terminology is used by the event API.

Event

An event is a countable activity, action, or occurrence on a device.

Event ID

Each event is assigned a unique identifier. A named event will represent the same activity, action, or occurrence on all device types. But the named event may have different IDs on different device families. Use cuptiEventGetIdFromName to get the ID for a named event on a particular device.

Event Category

Each event is placed in one of the categories defined by CUpti_EventCategory. The category indicates the general type of activity, action, or occurrence measured by the event.

Event Domain

A device exposes one or more event domains. Each event domain represents a group of related events available on that device. A device may have multiple instances of a domain, indicating that the device can simultaneously record multiple instances of each event within that domain.

Event Group

An event group is a collection of events that are managed together. The number and type of events that can be added to an event group are subject to device-specific limits. At any given time, a device may be configured to count events from a limited number of event groups. All events in an event group must belong to the same event domain.

Event Group Set

An event group set is a collection of event groups that can be enabled at the same time. Event group sets are created by cuptiEventGroupSetsCreate and cuptiMetricCreateEventGroupSets.

You can determine the events available on a device using the cuptiDeviceEnumEventDomains and cuptiEventDomainEnumEvents functions. The cupti_query sample described on the samples page shows how to use these functions. You can also enumerate all the CUPTI events available on any device using the cuptiEnumEventDomains function.

Configuring and reading event counts requires the following steps. First, select your event collection mode. If you want to count events that occur during the execution of a kernel, use cuptiSetEventCollectionMode to set mode CUPTI_EVENT_COLLECTION_MODE_KERNEL. If you want to continuously sample the event counts, use mode CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS. Next, determine the names of the events that you want to count, and then use the cuptiEventGroupCreate, cuptiEventGetIdFromName, and cuptiEventGroupAddEvent functions to create and initialize an event group with those events. If you are unable to add all the events to a single event group, then you will need to create multiple event groups. Alternatively, you can use the cuptiEventGroupSetsCreate function to automatically create the event group(s) required for a set of events.

It’s possible that all the requested events can’t be collected in the single pass due to hardware or software limitations, one needs to replay the exact same set of GPU workloads multiple times. Number of passes can be queried using the API cuptiEventGroupSetsCreate. Profiling one event always takes single pass. Multiple passes might be required when we want to profile multiple events together. Code snippet showing how to query number of passes:

CUpti_EventGroupSets *eventGroupSets = NULL;
size_t eventIdArraySize = sizeof(CUpti_EventID) * numEvents;
CUpti_EventID *eventIdArray = (CUpti_EventID *)malloc(sizeof(CUpti_EventID) * numEvents);
// fill in event Ids
cuptiEventGroupSetsCreate(context, eventIdArraySize, eventIdArray, &eventGroupSets);
// number of passes required to collect all the events
passes = eventGroupSets->numSets;

To begin counting a set of events, enable the event group or groups that contain those events by using the cuptiEventGroupEnable function. If your events are contained in multiple event groups, you may be unable to enable all of the event groups at the same time i.e. in the same pass. In this case, you can gather the events across multiple executions of the application or you can enable kernel replay. If you enable kernel replay using cuptiEnableKernelReplayMode, you will be able to enable any number of event groups and all the contained events will be collected.

Use the cuptiEventGroupReadEvent and/or cuptiEventGroupReadAllEvents functions to read the event values. When you are done collecting events, use the cuptiEventGroupDisable function to stop counting the events contained in an event group. The callback_event sample described on the samples page shows how to use these functions to create, enable, and disable event groups, and how to read event counts.

Note

For event collection mode CUPTI_EVENT_COLLECTION_MODE_KERNEL, event or metric collection may significantly change the overall performance characteristics of the application because all kernel executions that occur between the cuptiEventGroupEnable and cuptiEventGroupDisable calls are serialized on the GPU. This can be avoided by using mode CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS, and restricting profiling to events and metrics that can be collected in a single pass.

Note

All the events and metrics except NVLink metrics are collected at the context level, irrespective of the event collection mode. That is, events or metrics can be attributed to the context being profiled and values can be accurately collected, when multiple contexts are executing on the GPU. NVLink metrics are collected at device level for all event collection modes.

In a system with multiple GPUs, events can be collected simultaneously on all the GPUs; in other words, event profiling doesn’t enforce any serialization of work across GPUs. The event_multi_gpu sample shows how to use the CUPTI event and CUDA APIs on such setups.

Note

Event APIs from the header cupti_events.h are not supported for devices with compute capability 7.5 and higher. It is advised to use the CUPTI Profiling API instead. Refer to the section Migration to the Profiling API.

2.5.1. Collecting Kernel Execution Events

A common use of the event API is to count a set of events during the execution of a kernel (as demonstrated by the callback_event sample). The following code shows a typical callback used for this purpose. Assume that the callback was enabled only for a kernel launch using the CUDA runtime (i.e., by cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API, CUPTI_RUNTIME_TRACE_CBID_cudaLaunch_v3020). To simplify the presentation, error checking code has been removed.

static void CUPTIAPI
getEventValueCallback(void *userdata,
                      CUpti_CallbackDomain domain,
                      CUpti_CallbackId cbid,
                      const void *cbdata)
{
  const CUpti_CallbackData *cbData =
                (CUpti_CallbackData *)cbdata;

  if (cbData->callbackSite == CUPTI_API_ENTER) {
    cudaDeviceSynchronize();
    cuptiSetEventCollectionMode(cbInfo->context,
                                CUPTI_EVENT_COLLECTION_MODE_KERNEL);
    cuptiEventGroupEnable(eventGroup);
  }

  if (cbData->callbackSite == CUPTI_API_EXIT) {
    cudaDeviceSynchronize();
    cuptiEventGroupReadEvent(eventGroup,
                             CUPTI_EVENT_READ_FLAG_NONE,
                             eventId,
                             &bytesRead, &eventVal);

    cuptiEventGroupDisable(eventGroup);
  }
}

Two synchronization points are used to ensure that events are counted only for the execution of the kernel. If the application contains other threads that launch kernels, then additional thread-level synchronization must also be introduced to ensure that those threads do not launch kernels while the callback is collecting events. When the cudaLaunch API is entered (that is, before the kernel is actually launched on the device), cudaDeviceSynchronize is used to wait until the GPU is idle. The event collection mode is set to CUPTI_EVENT_COLLECTION_MODE_KERNEL so that the event counters are automatically started and stopped just before and after the kernel executes. Then event collection is enabled with cuptiEventGroupEnable.

When the cudaLaunch API is exited (that is, after the kernel is queued for execution on the GPU) another cudaDeviceSynchronize is used to cause the CPU thread to wait for the kernel to finish execution. Finally, the event counts are read with cuptiEventGroupReadEvent.

2.5.2. Sampling Events

The event API can also be used to sample event values while a kernel or kernels are executing (as demonstrated by the event_sampling sample). The sample shows one possible way to perform the sampling. The event collection mode is set to CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS so that the event counters run continuously. Two threads are used in event_sampling: one thread schedules the kernels and memcpys that perform the computation, while another thread wakes up periodically to sample an event counter. In this sample, there is no correlation of the event samples with what is happening on the GPU.

2.6. CUPTI Metric API

The CUPTI Metric API allows you to collect application metrics calculated from one or more event values. The following terminology is used by the metric API.

Metric

A characteristic of an application that is calculated from one or more event values.

Metric ID

Each metric is assigned a unique identifier. A named metric will represent the same characteristic on all device types. But the named metric may have different IDs on different device families. Use cuptiMetricGetIdFromName to get the ID for a named metric on a particular device.

Metric Category

Each metric is placed in one of the categories defined by CUpti_MetricCategory. The category indicates the general type of the characteristic measured by the metric.

Metric Property

Each metric is calculated from input values. These input values can be events or properties of the device or system. The available properties are defined by CUpti_MetricPropertyID.

Metric Value

Each metric has a value that represents one of the kinds defined by CUpti_MetricValueKind. For each value kind, there is a corresponding member of the CUpti_MetricValue union that is used to hold the metric’s value.

The tables included in this section list the metrics available for each device, as determined by the device’s compute capability. You can also determine the metrics available on a device using the cuptiDeviceEnumMetrics function. The cupti_query sample described on the samples page shows how to use this function. You can also enumerate all the CUPTI metrics available on any device using the cuptiEnumMetrics function.

CUPTI provides two functions for calculating a metric value. cuptiMetricGetValue2 can be used to calculate a metric value when the device is not available. All required event values and metric properties must be provided by the caller. cuptiMetricGetValue can be used to calculate a metric value when the device is available (as a CUdevice object). All required event values must be provided by the caller, but CUPTI will determine the appropriate property values from the CUdevice object.

Configuring and calculating metric values requires the following steps. First, determine the name of the metric that you want to collect, and then use the cuptiMetricGetIdFromName to get the metric ID. Use cuptiMetricEnumEvents to get the events required to calculate the metric, and follow instructions in the CUPTI Event API section to create the event groups for those events. When creating event groups in this manner, it is important to use the result of cuptiMetricGetRequiredEventGroupSets to properly group together events that must be collected in the same pass to ensure proper metric calculation.

Alternatively, you can use the cuptiMetricCreateEventGroupSets function to automatically create the event group(s) required for metrics’ events. When using this function, events will be grouped as required to most accurately calculate the metric; as a result, it is not necessary to use cuptiMetricGetRequiredEventGroupSets.

If you are using cuptiMetricGetValue2, then you must also collect the required metric property values using cuptiMetricEnumProperties.

Collect event counts as described in the CUPTI Event API section, and then use either cuptiMetricGetValue or cuptiMetricGetValue2 to calculate the metric value from the collected event and property values. The callback_metric sample described on the samples page shows how to use the functions to calculate event values and calculate a metric using cuptiMetricGetValue. Note that as shown in the example, you should collect event counts from all domain instances, and normalize the counts to get the most accurate metric values. It is necessary to normalize the event counts because the number of event counter instances varies by device and by the event being counted.

For example, a device might have 8 multiprocessors but only have event counters for 4 of the multiprocessors, and might have 3 memory units and only have events counters for one memory unit. When calculating a metric that requires a multiprocessor event and a memory unit event, the 4 multiprocessor counters should be summed and multiplied by 2 to normalize the event count across the entire device. Similarly, the one memory unit counter should be multiplied by 3 to normalize the event count across the entire device. The normalized values can then be passed to cuptiMetricGetValue or cuptiMetricGetValue2 to calculate the metric value.

As described, the normalization assumes the kernel executes a sufficient number of blocks to completely load the device. If the kernel has only a small number of blocks, normalizing across the entire device may skew the result.

It’s possible that all the requested metrics can’t be collected in the single pass due to hardware or software limitations, one needs to replay the exact same set of GPU workloads multiple times. Number of passes can be queried using the API cuptiMetricCreateEventGroupSets. Profiling a single metric can also take multiple passes depending on the number and type of events it is calculated from. Code snippet showing how to query number of passes:

CUpti_EventGroupSets *eventGroupSets = NULL;
size_t metricIdArraySize = sizeof(CUpti_MetricID) * numMetrics;
CUpti_MetricID metricIdArray = (CUpti_MetricID *)malloc(sizeof(CUpti_MetricID) * numMetrics);
// fill in metric Ids
cuptiMetricCreateEventGroupSets(context, metricIdArraySize, metricIdArray, &eventGroupSets);
// number of passes required to collect all the metrics
passes = eventGroupSets->numSets;

Note

Metric APIs from the header cupti_metrics.h are not supported for devices with compute capability 7.5 and higher. It is advised to use the CUPTI Profiling API instead. Refer to the section Migration to the Profiling API.

2.6.1. Metrics Reference

This section contains detailed descriptions of the metrics that can be collected by the CUPTI. A scope value of “Single-context” indicates that the metric can only be accurately collected when a single context (CUDA or graphics) is executing on the GPU. A scope value of “Multi-context” indicates that the metric can be accurately collected when multiple contexts are executing on the GPU. A scope value of “Device” indicates that the metric will be collected at device level, that is, it will include values for all the contexts executing on the GPU.

2.6.1.1. Metrics for Capability 5.x

Devices with compute capability 5.x implement the metrics shown in the following table. Note that for some metrics, the “Multi-context” scope is supported only for specific devices. Such metrics are marked with “Multi-context*” under the “Scope” column. Refer to the note at the bottom of the table.

Table 1. Capability 5.x Metrics

Metric Name

Description

Scope

achieved_occupancy

Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor

Multi-context

atomic_transactions

Global memory atomic and reduction transactions

Multi-context

atomic_transactions_per_request

Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction

Multi-context

branch_efficiency

Ratio of non-divergent branches to total branches expressed as percentage

Multi-context

cf_executed

Number of executed control-flow instructions

Multi-context

cf_fu_utilization

The utilization level of the multiprocessor function units that execute control-flow instructions on a scale of 0 to 10

Multi-context

cf_issued

Number of issued control-flow instructions

Multi-context

double_precision_fu_utilization

The utilization level of the multiprocessor function units that execute double-precision floating-point instructions on a scale of 0 to 10

Multi-context

dram_read_bytes

Total bytes read from DRAM to L2 cache. This is available for compute capability 5.0 and 5.2.

Multi-context*

dram_read_throughput

Device memory read throughput. This is available for compute capability 5.0 and 5.2.

Multi-context*

dram_read_transactions

Device memory read transactions. This is available for compute capability 5.0 and 5.2.

Multi-context*

dram_utilization

The utilization level of the device memory relative to the peak utilization on a scale of 0 to 10

Multi-context*

dram_write_bytes

Total bytes written from L2 cache to DRAM. This is available for compute capability 5.0 and 5.2.

Multi-context*

dram_write_throughput

Device memory write throughput. This is available for compute capability 5.0 and 5.2.

Multi-context*

dram_write_transactions

Device memory write transactions. This is available for compute capability 5.0 and 5.2.

Multi-context*

ecc_throughput

ECC throughput from L2 to DRAM. This is available for compute capability 5.0 and 5.2.

Multi-context*

ecc_transactions

Number of ECC transactions between L2 and DRAM. This is available for compute capability 5.0 and 5.2.

Multi-context*

eligible_warps_per_cycle

Average number of warps that are eligible to issue per active cycle

Multi-context

flop_count_dp

Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.

Multi-context

flop_count_dp_add

Number of double-precision floating-point add operations executed by non-predicated threads.

Multi-context

flop_count_dp_fma

Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.

Multi-context

flop_count_dp_mul

Number of double-precision floating-point multiply operations executed by non-predicated threads.

Multi-context

flop_count_hp

Number of half-precision floating-point operations executed by non-predicated threads (add, multiply and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. This is available for compute capability 5.3.

Multi-context*

flop_count_hp_add

Number of half-precision floating-point add operations executed by non-predicated threads. This is available for compute capability 5.3.

Multi-context*

flop_count_hp_fma

Number of half-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count. This is available for compute capability 5.3.

Multi-context*

flop_count_hp_mul

Number of half-precision floating-point multiply operations executed by non-predicated threads. This is available for compute capability 5.3.

Multi-context*

flop_count_sp

Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. The count does not include special operations.

Multi-context

flop_count_sp_add

Number of single-precision floating-point add operations executed by non-predicated threads.

Multi-context

flop_count_sp_fma

Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.

Multi-context

flop_count_sp_mul

Number of single-precision floating-point multiply operations executed by non-predicated threads.

Multi-context

flop_count_sp_special

Number of single-precision floating-point special operations executed by non-predicated threads.

Multi-context

flop_dp_efficiency

Ratio of achieved to peak double-precision floating-point operations

Multi-context

flop_hp_efficiency

Ratio of achieved to peak half-precision floating-point operations. This is available for compute capability 5.3.

Multi-context*

flop_sp_efficiency

Ratio of achieved to peak single-precision floating-point operations

Multi-context

gld_efficiency

Ratio of requested global memory load throughput to required global memory load throughput expressed as percentage.

Multi-context*

gld_requested_throughput

Requested global memory load throughput

Multi-context

gld_throughput

Global memory load throughput

Multi-context*

gld_transactions

Number of global memory load transactions

Multi-context*

gld_transactions_per_request

Average number of global memory load transactions performed for each global memory load.

Multi-context*

global_atomic_requests

Total number of global atomic(Atom and Atom CAS) requests from Multiprocessor

Multi-context

global_hit_rate

Hit rate for global loads in unified l1/tex cache. Metric value maybe wrong if malloc is used in kernel.

Multi-context*

global_load_requests

Total number of global load requests from Multiprocessor

Multi-context

global_reduction_requests

Total number of global reduction requests from Multiprocessor

Multi-context

global_store_requests

Total number of global store requests from Multiprocessor. This does not include atomic requests.

Multi-context

gst_efficiency

Ratio of requested global memory store throughput to required global memory store throughput expressed as percentage.

Multi-context*

gst_requested_throughput

Requested global memory store throughput

Multi-context

gst_throughput

Global memory store throughput

Multi-context*

gst_transactions

Number of global memory store transactions

Multi-context*

gst_transactions_per_request

Average number of global memory store transactions performed for each global memory store

Multi-context*

half_precision_fu_utilization

The utilization level of the multiprocessor function units that execute 16 bit floating-point instructions and integer instructions on a scale of 0 to 10. This is available for compute capability 5.3.

Multi-context*

inst_bit_convert

Number of bit-conversion instructions executed by non-predicated threads

Multi-context

inst_compute_ld_st

Number of compute load/store instructions executed by non-predicated threads

Multi-context

inst_control

Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)

Multi-context

inst_executed

The number of instructions executed

Multi-context

inst_executed_global_atomics

Warp level instructions for global atom and atom cas

Multi-context

inst_executed_global_loads

Warp level instructions for global loads

Multi-context

inst_executed_global_reductions

Warp level instructions for global reductions

Multi-context

inst_executed_global_stores

Warp level instructions for global stores

Multi-context

inst_executed_local_loads

Warp level instructions for local loads

Multi-context

inst_executed_local_stores

Warp level instructions for local stores

Multi-context

inst_executed_shared_atomics

Warp level shared instructions for atom and atom CAS

Multi-context

inst_executed_shared_loads

Warp level instructions for shared loads

Multi-context

inst_executed_shared_stores

Warp level instructions for shared stores

Multi-context

inst_executed_surface_atomics

Warp level instructions for surface atom and atom cas

Multi-context

inst_executed_surface_loads

Warp level instructions for surface loads

Multi-context

inst_executed_surface_reductions

Warp level instructions for surface reductions

Multi-context

inst_executed_surface_stores

Warp level instructions for surface stores

Multi-context

inst_executed_tex_ops

Warp level instructions for texture

Multi-context

inst_fp_16

Number of half-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.) This is available for compute capability 5.3.

Multi-context*

inst_fp_32

Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)

Multi-context

inst_fp_64

Number of double-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)

Multi-context

inst_integer

Number of integer instructions executed by non-predicated threads

Multi-context

inst_inter_thread_communication

Number of inter-thread communication instructions executed by non-predicated threads

Multi-context

inst_issued

The number of instructions issued

Multi-context

inst_misc

Number of miscellaneous instructions executed by non-predicated threads

Multi-context

inst_per_warp

Average number of instructions executed by each warp

Multi-context

inst_replay_overhead

Average number of replays for each instruction executed

Multi-context

ipc

Instructions executed per cycle

Multi-context

issue_slot_utilization

Percentage of issue slots that issued at least one instruction, averaged across all cycles

Multi-context

issue_slots

The number of issue slots used

Multi-context

issued_ipc

Instructions issued per cycle

Multi-context

l2_atomic_throughput

Memory read throughput seen at L2 cache for atomic and reduction requests

Multi-context

l2_atomic_transactions

Memory read transactions seen at L2 cache for atomic and reduction requests

Multi-context*

l2_global_atomic_store_bytes

Bytes written to L2 from Unified cache for global atomics (ATOM and ATOM CAS)

Multi-context*

l2_global_load_bytes

Bytes read from L2 for misses in Unified Cache for global loads

Multi-context*

l2_global_reduction_bytes

Bytes written to L2 from Unified cache for global reductions

Multi-context*

l2_local_global_store_bytes

Bytes written to L2 from Unified Cache for local and global stores. This does not include global atomics.

Multi-context*

l2_local_load_bytes

Bytes read from L2 for misses in Unified Cache for local loads

Multi-context*

l2_read_throughput

Memory read throughput seen at L2 cache for all read requests

Multi-context*

l2_read_transactions

Memory read transactions seen at L2 cache for all read requests

Multi-context*

l2_surface_atomic_store_bytes

Bytes transferred between Unified Cache and L2 for surface atomics (ATOM and ATOM CAS)

Multi-context*

l2_surface_load_bytes

Bytes read from L2 for misses in Unified Cache for surface loads

Multi-context*

l2_surface_reduction_bytes

Bytes written to L2 from Unified Cache for surface reductions

Multi-context*

l2_surface_store_bytes

Bytes written to L2 from Unified Cache for surface stores. This does not include surface atomics.

Multi-context*

l2_tex_hit_rate

Hit rate at L2 cache for all requests from texture cache

Multi-context*

l2_tex_read_hit_rate

Hit rate at L2 cache for all read requests from texture cache. This is available for compute capability 5.0 and 5.2.

Multi-context*

l2_tex_read_throughput

Memory read throughput seen at L2 cache for read requests from the texture cache

Multi-context*

l2_tex_read_transactions

Memory read transactions seen at L2 cache for read requests from the texture cache

Multi-context*

l2_tex_write_hit_rate

Hit Rate at L2 cache for all write requests from texture cache. This is available for compute capability 5.0 and 5.2.

Multi-context*

l2_tex_write_throughput

Memory write throughput seen at L2 cache for write requests from the texture cache

Multi-context*

l2_tex_write_transactions

Memory write transactions seen at L2 cache for write requests from the texture cache

Multi-context*

l2_utilization

The utilization level of the L2 cache relative to the peak utilization on a scale of 0 to 10

Multi-context*

l2_write_throughput

Memory write throughput seen at L2 cache for all write requests

Multi-context*

l2_write_transactions

Memory write transactions seen at L2 cache for all write requests

Multi-context*

ldst_executed

Number of executed local, global, shared and texture memory load and store instructions

Multi-context

ldst_fu_utilization

The utilization level of the multiprocessor function units that execute shared load, shared store and constant load instructions on a scale of 0 to 10

Multi-context

ldst_issued

Number of issued local, global, shared and texture memory load and store instructions

Multi-context

local_hit_rate

Hit rate for local loads and stores

Multi-context*

local_load_requests

Total number of local load requests from Multiprocessor

Multi-context*

local_load_throughput

Local memory load throughput

Multi-context*

local_load_transactions

Number of local memory load transactions

Multi-context*

local_load_transactions_per_request

Average number of local memory load transactions performed for each local memory load

Multi-context*

local_memory_overhead

Ratio of local memory traffic to total memory traffic between the L1 and L2 caches expressed as percentage

Multi-context*

local_store_requests

Total number of local store requests from Multiprocessor

Multi-context*

local_store_throughput

Local memory store throughput

Multi-context*

local_store_transactions

Number of local memory store transactions

Multi-context*

local_store_transactions_per_request

Average number of local memory store transactions performed for each local memory store

Multi-context*

pcie_total_data_received

Total data bytes received through PCIe

Device

pcie_total_data_transmitted

Total data bytes transmitted through PCIe

Device

shared_efficiency

Ratio of requested shared memory throughput to required shared memory throughput expressed as percentage

Multi-context*

shared_load_throughput

Shared memory load throughput

Multi-context*

shared_load_transactions

Number of shared memory load transactions

Multi-context*

shared_load_transactions_per_request

Average number of shared memory load transactions performed for each shared memory load

Multi-context*

shared_store_throughput

Shared memory store throughput

Multi-context*

shared_store_transactions

Number of shared memory store transactions

Multi-context*

shared_store_transactions_per_request

Average number of shared memory store transactions performed for each shared memory store

Multi-context*

shared_utilization

The utilization level of the shared memory relative to peak utilization on a scale of 0 to 10

Multi-context*

single_precision_fu_utilization

The utilization level of the multiprocessor function units that execute single-precision floating-point instructions and integer instructions on a scale of 0 to 10

Multi-context

sm_efficiency

The percentage of time at least one warp is active on a specific multiprocessor

Multi-context*

special_fu_utilization

The utilization level of the multiprocessor function units that execute sin, cos, ex2, popc, flo, and similar instructions on a scale of 0 to 10

Multi-context

stall_constant_memory_dependency

Percentage of stalls occurring because of immediate constant cache miss

Multi-context

stall_exec_dependency

Percentage of stalls occurring because an input required by the instruction is not yet available

Multi-context

stall_inst_fetch

Percentage of stalls occurring because the next assembly instruction has not yet been fetched

Multi-context

stall_memory_dependency

Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding

Multi-context

stall_memory_throttle

Percentage of stalls occurring because of memory throttle

Multi-context

stall_not_selected

Percentage of stalls occurring because warp was not selected

Multi-context

stall_other

Percentage of stalls occurring due to miscellaneous reasons

Multi-context

stall_pipe_busy

Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy

Multi-context

stall_sync

Percentage of stalls occurring because the warp is blocked at a __syncthreads() call

Multi-context

stall_texture

Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests

Multi-context

surface_atomic_requests

Total number of surface atomic(Atom and Atom CAS) requests from Multiprocessor

Multi-context

surface_load_requests

Total number of surface load requests from Multiprocessor

Multi-context

surface_reduction_requests

Total number of surface reduction requests from Multiprocessor

Multi-context

surface_store_requests

Total number of surface store requests from Multiprocessor

Multi-context

sysmem_read_bytes

Number of bytes read from system memory

Multi-context*

sysmem_read_throughput

System memory read throughput

Multi-context*

sysmem_read_transactions

Number of system memory read transactions

Multi-context*

sysmem_read_utilization

The read utilization level of the system memory relative to the peak utilization on a scale of 0 to 10. This is available for compute capability 5.0 and 5.2.

Multi-context

sysmem_utilization

The utilization level of the system memory relative to the peak utilization on a scale of 0 to 10. This is available for compute capability 5.0 and 5.2.

Multi-context*

sysmem_write_bytes

Number of bytes written to system memory

Multi-context*

sysmem_write_throughput

System memory write throughput

Multi-context*

sysmem_write_transactions

Number of system memory write transactions

Multi-context*

sysmem_write_utilization

The write utilization level of the system memory relative to the peak utilization on a scale of 0 to 10. This is available for compute capability 5.0 and 5.2.

Multi-context*

tex_cache_hit_rate

Unified cache hit rate

Multi-context*

tex_cache_throughput

Unified cache throughput

Multi-context*

tex_cache_transactions

Unified cache read transactions

Multi-context*

tex_fu_utilization

The utilization level of the multiprocessor function units that execute global, local and texture memory instructions on a scale of 0 to 10

Multi-context

tex_utilization

The utilization level of the unified cache relative to the peak utilization on a scale of 0 to 10

Multi-context*

texture_load_requests

Total number of texture Load requests from Multiprocessor

Multi-context

warp_execution_efficiency

Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor

Multi-context

warp_nonpred_execution_efficiency

Ratio of the average active threads per warp executing non-predicated instructions to the maximum number of threads per warp supported on a multiprocessor

Multi-context

* The “Multi-context” scope for this metric is supported only for devices with compute capability 5.0 and 5.2.

2.6.1.2. Metrics for Capability 6.x

Devices with compute capability 6.x implement the metrics shown in the following table.

Table 2. Capability 6.x Metrics

Metric Name

Description

Scope

achieved_occupancy

Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor

Multi-context

atomic_transactions

Global memory atomic and reduction transactions

Multi-context

atomic_transactions_per_request

Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction

Multi-context

branch_efficiency

Ratio of non-divergent branches to total branches expressed as percentage

Multi-context

cf_executed

Number of executed control-flow instructions

Multi-context

cf_fu_utilization

The utilization level of the multiprocessor function units that execute control-flow instructions on a scale of 0 to 10

Multi-context

cf_issued

Number of issued control-flow instructions

Multi-context

double_precision_fu_utilization

The utilization level of the multiprocessor function units that execute double-precision floating-point instructions on a scale of 0 to 10

Multi-context

dram_read_bytes

Total bytes read from DRAM to L2 cache

Multi-context

dram_read_throughput

Device memory read throughput. This is available for compute capability 6.0 and 6.1.

Multi-context

dram_read_transactions

Device memory read transactions. This is available for compute capability 6.0 and 6.1.

Multi-context

dram_utilization

The utilization level of the device memory relative to the peak utilization on a scale of 0 to 10

Multi-context

dram_write_bytes

Total bytes written from L2 cache to DRAM

Multi-context

dram_write_throughput

Device memory write throughput. This is available for compute capability 6.0 and 6.1.

Multi-context

dram_write_transactions

Device memory write transactions. This is available for compute capability 6.0 and 6.1.

Multi-context

ecc_throughput

ECC throughput from L2 to DRAM. This is available for compute capability 6.1.

Multi-context

ecc_transactions

Number of ECC transactions between L2 and DRAM. This is available for compute capability 6.1.

Multi-context

eligible_warps_per_cycle

Average number of warps that are eligible to issue per active cycle

Multi-context

flop_count_dp

Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.

Multi-context

flop_count_dp_add

Number of double-precision floating-point add operations executed by non-predicated threads.

Multi-context

flop_count_dp_fma

Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.

Multi-context

flop_count_dp_mul

Number of double-precision floating-point multiply operations executed by non-predicated threads.

Multi-context

flop_count_hp

Number of half-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.

Multi-context

flop_count_hp_add

Number of half-precision floating-point add operations executed by non-predicated threads.

Multi-context

flop_count_hp_fma

Number of half-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.

Multi-context

flop_count_hp_mul

Number of half-precision floating-point multiply operations executed by non-predicated threads.

Multi-context

flop_count_sp

Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. The count does not include special operations.

Multi-context

flop_count_sp_add

Number of single-precision floating-point add operations executed by non-predicated threads.

Multi-context

flop_count_sp_fma

Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.

Multi-context

flop_count_sp_mul

Number of single-precision floating-point multiply operations executed by non-predicated threads.

Multi-context

flop_count_sp_special

Number of single-precision floating-point special operations executed by non-predicated threads.

Multi-context

flop_dp_efficiency

Ratio of achieved to peak double-precision floating-point operations

Multi-context

flop_hp_efficiency

Ratio of achieved to peak half-precision floating-point operations

Multi-context

flop_sp_efficiency

Ratio of achieved to peak single-precision floating-point operations

Multi-context

gld_efficiency

Ratio of requested global memory load throughput to required global memory load throughput expressed as percentage.

Multi-context

gld_requested_throughput

Requested global memory load throughput

Multi-context

gld_throughput

Global memory load throughput

Multi-context

gld_transactions

Number of global memory load transactions

Multi-context

gld_transactions_per_request

Average number of global memory load transactions performed for each global memory load.

Multi-context

global_atomic_requests

Total number of global atomic(Atom and Atom CAS) requests from Multiprocessor

Multi-context

global_hit_rate

Hit rate for global loads in unified l1/tex cache. Metric value maybe wrong if malloc is used in kernel.

Multi-context

global_load_requests

Total number of global load requests from Multiprocessor

Multi-context

global_reduction_requests

Total number of global reduction requests from Multiprocessor

Multi-context

global_store_requests

Total number of global store requests from Multiprocessor. This does not include atomic requests.

Multi-context

gst_efficiency

Ratio of requested global memory store throughput to required global memory store throughput expressed as percentage.

Multi-context

gst_requested_throughput

Requested global memory store throughput

Multi-context

gst_throughput

Global memory store throughput

Multi-context

gst_transactions

Number of global memory store transactions

Multi-context

gst_transactions_per_request

Average number of global memory store transactions performed for each global memory store

Multi-context

half_precision_fu_utilization

The utilization level of the multiprocessor function units that execute 16 bit floating-point instructions on a scale of 0 to 10

Multi-context

inst_bit_convert

Number of bit-conversion instructions executed by non-predicated threads

Multi-context

inst_compute_ld_st

Number of compute load/store instructions executed by non-predicated threads

Multi-context

inst_control

Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)

Multi-context

inst_executed

The number of instructions executed

Multi-context

inst_executed_global_atomics

Warp level instructions for global atom and atom cas

Multi-context

inst_executed_global_loads

Warp level instructions for global loads

Multi-context

inst_executed_global_reductions

Warp level instructions for global reductions

Multi-context

inst_executed_global_stores

Warp level instructions for global stores

Multi-context

inst_executed_local_loads

Warp level instructions for local loads

Multi-context

inst_executed_local_stores

Warp level instructions for local stores

Multi-context

inst_executed_shared_atomics

Warp level shared instructions for atom and atom CAS

Multi-context

inst_executed_shared_loads

Warp level instructions for shared loads

Multi-context

inst_executed_shared_stores

Warp level instructions for shared stores

Multi-context

inst_executed_surface_atomics

Warp level instructions for surface atom and atom cas

Multi-context

inst_executed_surface_loads

Warp level instructions for surface loads

Multi-context

inst_executed_surface_reductions

Warp level instructions for surface reductions

Multi-context

inst_executed_surface_stores

Warp level instructions for surface stores

Multi-context

inst_executed_tex_ops

Warp level instructions for texture

Multi-context

inst_fp_16

Number of half-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)

Multi-context

inst_fp_32

Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)

Multi-context

inst_fp_64

Number of double-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)

Multi-context

inst_integer

Number of integer instructions executed by non-predicated threads

Multi-context

inst_inter_thread_communication

Number of inter-thread communication instructions executed by non-predicated threads

Multi-context

inst_issued

The number of instructions issued

Multi-context

inst_misc

Number of miscellaneous instructions executed by non-predicated threads

Multi-context

inst_per_warp

Average number of instructions executed by each warp

Multi-context

inst_replay_overhead

Average number of replays for each instruction executed

Multi-context

ipc

Instructions executed per cycle

Multi-context

issue_slot_utilization

Percentage of issue slots that issued at least one instruction, averaged across all cycles

Multi-context

issue_slots

The number of issue slots used

Multi-context

issued_ipc

Instructions issued per cycle

Multi-context

l2_atomic_throughput

Memory read throughput seen at L2 cache for atomic and reduction requests

Multi-context

l2_atomic_transactions

Memory read transactions seen at L2 cache for atomic and reduction requests

Multi-context

l2_global_atomic_store_bytes

Bytes written to L2 from Unified cache for global atomics (ATOM and ATOM CAS)

Multi-context

l2_global_load_bytes

Bytes read from L2 for misses in Unified Cache for global loads

Multi-context

l2_global_reduction_bytes

Bytes written to L2 from Unified cache for global reductions

Multi-context

l2_local_global_store_bytes

Bytes written to L2 from Unified Cache for local and global stores. This does not include global atomics.

Multi-context

l2_local_load_bytes

Bytes read from L2 for misses in Unified Cache for local loads

Multi-context

l2_read_throughput

Memory read throughput seen at L2 cache for all read requests

Multi-context

l2_read_transactions

Memory read transactions seen at L2 cache for all read requests

Multi-context

l2_surface_atomic_store_bytes

Bytes transferred between Unified Cache and L2 for surface atomics (ATOM and ATOM CAS)

Multi-context

l2_surface_load_bytes

Bytes read from L2 for misses in Unified Cache for surface loads

Multi-context

l2_surface_reduction_bytes

Bytes written to L2 from Unified Cache for surface reductions

Multi-context

l2_surface_store_bytes

Bytes written to L2 from Unified Cache for surface stores. This does not include surface atomics.

Multi-context

l2_tex_hit_rate

Hit rate at L2 cache for all requests from texture cache

Multi-context

l2_tex_read_hit_rate

Hit rate at L2 cache for all read requests from texture cache. This is available for compute capability 6.0 and 6.1.

Multi-context

l2_tex_read_throughput

Memory read throughput seen at L2 cache for read requests from the texture cache

Multi-context

l2_tex_read_transactions

Memory read transactions seen at L2 cache for read requests from the texture cache

Multi-context

l2_tex_write_hit_rate

Hit Rate at L2 cache for all write requests from texture cache. This is available for compute capability 6.0 and 6.1.

Multi-context

l2_tex_write_throughput

Memory write throughput seen at L2 cache for write requests from the texture cache

Multi-context

l2_tex_write_transactions

Memory write transactions seen at L2 cache for write requests from the texture cache

Multi-context

l2_utilization

The utilization level of the L2 cache relative to the peak utilization on a scale of 0 to 10

Multi-context

l2_write_throughput

Memory write throughput seen at L2 cache for all write requests

Multi-context

l2_write_transactions

Memory write transactions seen at L2 cache for all write requests

Multi-context

ldst_executed

Number of executed local, global, shared and texture memory load and store instructions

Multi-context

ldst_fu_utilization

The utilization level of the multiprocessor function units that execute shared load, shared store and constant load instructions on a scale of 0 to 10

Multi-context

ldst_issued

Number of issued local, global, shared and texture memory load and store instructions

Multi-context

local_hit_rate

Hit rate for local loads and stores

Multi-context

local_load_requests

Total number of local load requests from Multiprocessor

Multi-context

local_load_throughput

Local memory load throughput

Multi-context

local_load_transactions

Number of local memory load transactions

Multi-context

local_load_transactions_per_request

Average number of local memory load transactions performed for each local memory load

Multi-context

local_memory_overhead

Ratio of local memory traffic to total memory traffic between the L1 and L2 caches expressed as percentage

Multi-context

local_store_requests

Total number of local store requests from Multiprocessor

Multi-context

local_store_throughput

Local memory store throughput

Multi-context

local_store_transactions

Number of local memory store transactions

Multi-context

local_store_transactions_per_request

Average number of local memory store transactions performed for each local memory store

Multi-context

nvlink_overhead_data_received

Ratio of overhead data to the total data, received through NVLink. This is available for compute capability 6.0.

Device

nvlink_overhead_data_transmitted

Ratio of overhead data to the total data, transmitted through NVLink. This is available for compute capability 6.0.

Device

nvlink_receive_throughput

Number of bytes received per second through NVLinks. This is available for compute capability 6.0.

Device

nvlink_total_data_received

Total data bytes received through NVLinks including headers. This is available for compute capability 6.0.

Device

nvlink_total_data_transmitted

Total data bytes transmitted through NVLinks including headers. This is available for compute capability 6.0.

Device

nvlink_total_nratom_data_transmitted

Total non-reduction atomic data bytes transmitted through NVLinks. This is available for compute capability 6.0.

Device

nvlink_total_ratom_data_transmitted

Total reduction atomic data bytes transmitted through NVLinks This is available for compute capability 6.0.

Device

nvlink_total_response_data_received

Total response data bytes received through NVLink, response data includes data for read requests and result of non-reduction atomic requests. This is available for compute capability 6.0.

Device

nvlink_total_write_data_transmitted

Total write data bytes transmitted through NVLinks. This is available for compute capability 6.0.

Device

nvlink_transmit_throughput

Number of Bytes Transmitted per second through NVLinks. This is available for compute capability 6.0.

Device

nvlink_user_data_received

User data bytes received through NVLinks, doesn’t include headers. This is available for compute capability 6.0.

Device

nvlink_user_data_transmitted

User data bytes transmitted through NVLinks, doesn’t include headers. This is available for compute capability 6.0.

Device

nvlink_user_nratom_data_transmitted

Total non-reduction atomic user data bytes transmitted through NVLinks. This is available for compute capability 6.0.

Device

nvlink_user_ratom_data_transmitted

Total reduction atomic user data bytes transmitted through NVLinks. This is available for compute capability 6.0.

Device

nvlink_user_response_data_received

Total user response data bytes received through NVLink, response data includes data for read requests and result of non-reduction atomic requests. This is available for compute capability 6.0.

Device

nvlink_user_write_data_transmitted

User write data bytes transmitted through NVLinks. This is available for compute capability 6.0.

Device

pcie_total_data_received

Total data bytes received through PCIe

Device

pcie_total_data_transmitted

Total data bytes transmitted through PCIe

Device

shared_efficiency

Ratio of requested shared memory throughput to required shared memory throughput expressed as percentage

Multi-context

shared_load_throughput

Shared memory load throughput

Multi-context

shared_load_transactions

Number of shared memory load transactions

Multi-context

shared_load_transactions_per_request

Average number of shared memory load transactions performed for each shared memory load

Multi-context

shared_store_throughput

Shared memory store throughput

Multi-context

shared_store_transactions

Number of shared memory store transactions

Multi-context

shared_store_transactions_per_request

Average number of shared memory store transactions performed for each shared memory store

Multi-context

shared_utilization

The utilization level of the shared memory relative to peak utilization on a scale of 0 to 10

Multi-context

single_precision_fu_utilization

The utilization level of the multiprocessor function units that execute single-precision floating-point instructions and integer instructions on a scale of 0 to 10

Multi-context

sm_efficiency

The percentage of time at least one warp is active on a specific multiprocessor

Multi-context

special_fu_utilization

The utilization level of the multiprocessor function units that execute sin, cos, ex2, popc, flo, and similar instructions on a scale of 0 to 10

Multi-context

stall_constant_memory_dependency

Percentage of stalls occurring because of immediate constant cache miss

Multi-context

stall_exec_dependency

Percentage of stalls occurring because an input required by the instruction is not yet available

Multi-context

stall_inst_fetch

Percentage of stalls occurring because the next assembly instruction has not yet been fetched

Multi-context

stall_memory_dependency

Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding

Multi-context

stall_memory_throttle

Percentage of stalls occurring because of memory throttle

Multi-context

stall_not_selected

Percentage of stalls occurring because warp was not selected

Multi-context

stall_other

Percentage of stalls occurring due to miscellaneous reasons

Multi-context

stall_pipe_busy

Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy

Multi-context

stall_sync

Percentage of stalls occurring because the warp is blocked at a __syncthreads() call

Multi-context

stall_texture

Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests

Multi-context

surface_atomic_requests

Total number of surface atomic(Atom and Atom CAS) requests from Multiprocessor

Multi-context

surface_load_requests

Total number of surface load requests from Multiprocessor

Multi-context

surface_reduction_requests

Total number of surface reduction requests from Multiprocessor

Multi-context

surface_store_requests

Total number of surface store requests from Multiprocessor

Multi-context

sysmem_read_bytes

Number of bytes read from system memory

Multi-context

sysmem_read_throughput

System memory read throughput

Multi-context

sysmem_read_transactions

Number of system memory read transactions

Multi-context

sysmem_read_utilization

The read utilization level of the system memory relative to the peak utilization on a scale of 0 to 10. This is available for compute capability 6.0 and 6.1.

Multi-context

sysmem_utilization

The utilization level of the system memory relative to the peak utilization on a scale of 0 to 10. This is available for compute capability 6.0 and 6.1.

Multi-context

sysmem_write_bytes

Number of bytes written to system memory

Multi-context

sysmem_write_throughput

System memory write throughput

Multi-context

sysmem_write_transactions

Number of system memory write transactions

Multi-context

sysmem_write_utilization

The write utilization level of the system memory relative to the peak utilization on a scale of 0 to 10. This is available for compute capability 6.0 and 6.1.

Multi-context

tex_cache_hit_rate

Unified cache hit rate

Multi-context

tex_cache_throughput

Unified cache throughput

Multi-context

tex_cache_transactions

Unified cache read transactions

Multi-context

tex_fu_utilization

The utilization level of the multiprocessor function units that execute global, local and texture memory instructions on a scale of 0 to 10

Multi-context

tex_utilization

The utilization level of the unified cache relative to the peak utilization on a scale of 0 to 10

Multi-context

texture_load_requests

Total number of texture Load requests from Multiprocessor

Multi-context

unique_warps_launched

Number of warps launched. Value is unaffected by compute preemption.

Multi-context

warp_execution_efficiency

Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor

Multi-context

warp_nonpred_execution_efficiency

Ratio of the average active threads per warp executing non-predicated instructions to the maximum number of threads per warp supported on a multiprocessor

Multi-context

2.6.1.3. Metrics for Capability 7.0

Devices with compute capability 7.0 implement the metrics shown in the following table.

Table 3. Capability 7.x (7.0 and 7.2) Metrics

Metric Name

Description

Scope

achieved_occupancy

Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor

Multi-context

atomic_transactions

Global memory atomic and reduction transactions

Multi-context

atomic_transactions_per_request

Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction

Multi-context

branch_efficiency

Ratio of branch instruction to sum of branch and divergent branch instruction

Multi-context

cf_executed

Number of executed control-flow instructions

Multi-context

cf_fu_utilization

The utilization level of the multiprocessor function units that execute control-flow instructions on a scale of 0 to 10

Multi-context

cf_issued

Number of issued control-flow instructions

Multi-context

double_precision_fu_utilization

The utilization level of the multiprocessor function units that execute double-precision floating-point instructions on a scale of 0 to 10

Multi-context

dram_read_bytes

Total bytes read from DRAM to L2 cache

Multi-context

dram_read_throughput

Device memory read throughput

Multi-context

dram_read_transactions

Device memory read transactions

Multi-context

dram_utilization

The utilization level of the device memory relative to the peak utilization on a scale of 0 to 10

Multi-context

dram_write_bytes

Total bytes written from L2 cache to DRAM

Multi-context

dram_write_throughput

Device memory write throughput

Multi-context

dram_write_transactions

Device memory write transactions

Multi-context

eligible_warps_per_cycle

Average number of warps that are eligible to issue per active cycle

Multi-context

flop_count_dp

Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.

Multi-context

flop_count_dp_add

Number of double-precision floating-point add operations executed by non-predicated threads.

Multi-context

flop_count_dp_fma

Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.

Multi-context

flop_count_dp_mul

Number of double-precision floating-point multiply operations executed by non-predicated threads.

Multi-context

flop_count_hp

Number of half-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate contributes 2 or 4 to the count based on the number of inputs.

Multi-context

flop_count_hp_add

Number of half-precision floating-point add operations executed by non-predicated threads.

Multi-context

flop_count_hp_fma

Number of half-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate contributes 2 or 4 to the count based on the number of inputs.

Multi-context

flop_count_hp_mul

Number of half-precision floating-point multiply operations executed by non-predicated threads.

Multi-context

flop_count_sp

Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. The count does not include special operations.

Multi-context

flop_count_sp_add

Number of single-precision floating-point add operations executed by non-predicated threads.

Multi-context

flop_count_sp_fma

Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.

Multi-context

flop_count_sp_mul

Number of single-precision floating-point multiply operations executed by non-predicated threads.

Multi-context

flop_count_sp_special

Number of single-precision floating-point special operations executed by non-predicated threads.

Multi-context

flop_dp_efficiency

Ratio of achieved to peak double-precision floating-point operations

Multi-context

flop_hp_efficiency

Ratio of achieved to peak half-precision floating-point operations

Multi-context

flop_sp_efficiency

Ratio of achieved to peak single-precision floating-point operations

Multi-context

gld_efficiency

Ratio of requested global memory load throughput to required global memory load throughput expressed as percentage.

Multi-context

gld_requested_throughput

Requested global memory load throughput

Multi-context

gld_throughput

Global memory load throughput

Multi-context

gld_transactions

Number of global memory load transactions

Multi-context

gld_transactions_per_request

Average number of global memory load transactions performed for each global memory load.

Multi-context

global_atomic_requests

Total number of global atomic(Atom and Atom CAS) requests from Multiprocessor

Multi-context

global_hit_rate

Hit rate for global load and store in unified l1/tex cache

Multi-context

global_load_requests

Total number of global load requests from Multiprocessor

Multi-context

global_reduction_requests

Total number of global reduction requests from Multiprocessor

Multi-context

global_store_requests

Total number of global store requests from Multiprocessor. This does not include atomic requests.

Multi-context

gst_efficiency

Ratio of requested global memory store throughput to required global memory store throughput expressed as percentage.

Multi-context

gst_requested_throughput

Requested global memory store throughput

Multi-context

gst_throughput

Global memory store throughput

Multi-context

gst_transactions

Number of global memory store transactions

Multi-context

gst_transactions_per_request

Average number of global memory store transactions performed for each global memory store

Multi-context

half_precision_fu_utilization

The utilization level of the multiprocessor function units that execute 16 bit floating-point instructions on a scale of 0 to 10. Note that this doesn’t specify the utilization level of tensor core unit

Multi-context

inst_bit_convert

Number of bit-conversion instructions executed by non-predicated threads

Multi-context

inst_compute_ld_st

Number of compute load/store instructions executed by non-predicated threads

Multi-context

inst_control

Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)

Multi-context

inst_executed

The number of instructions executed

Multi-context

inst_executed_global_atomics

Warp level instructions for global atom and atom cas

Multi-context

inst_executed_global_loads

Warp level instructions for global loads

Multi-context

inst_executed_global_reductions

Warp level instructions for global reductions

Multi-context

inst_executed_global_stores

Warp level instructions for global stores

Multi-context

inst_executed_local_loads

Warp level instructions for local loads

Multi-context

inst_executed_local_stores

Warp level instructions for local stores

Multi-context

inst_executed_shared_atomics

Warp level shared instructions for atom and atom CAS

Multi-context

inst_executed_shared_loads

Warp level instructions for shared loads

Multi-context

inst_executed_shared_stores

Warp level instructions for shared stores

Multi-context

inst_executed_surface_atomics

Warp level instructions for surface atom and atom cas

Multi-context

inst_executed_surface_loads

Warp level instructions for surface loads

Multi-context

inst_executed_surface_reductions

Warp level instructions for surface reductions

Multi-context

inst_executed_surface_stores

Warp level instructions for surface stores

Multi-context

inst_executed_tex_ops

Warp level instructions for texture

Multi-context

inst_fp_16

Number of half-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)

Multi-context

inst_fp_32

Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)

Multi-context

inst_fp_64

Number of double-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)

Multi-context

inst_integer

Number of integer instructions executed by non-predicated threads

Multi-context

inst_inter_thread_communication

Number of inter-thread communication instructions executed by non-predicated threads

Multi-context

inst_issued

The number of instructions issued

Multi-context

inst_misc

Number of miscellaneous instructions executed by non-predicated threads

Multi-context

inst_per_warp

Average number of instructions executed by each warp

Multi-context

inst_replay_overhead

Average number of replays for each instruction executed

Multi-context

ipc

Instructions executed per cycle

Multi-context

issue_slot_utilization

Percentage of issue slots that issued at least one instruction, averaged across all cycles

Multi-context

issue_slots

The number of issue slots used

Multi-context

issued_ipc

Instructions issued per cycle

Multi-context

l2_atomic_throughput

Memory read throughput seen at L2 cache for atomic and reduction requests

Multi-context

l2_atomic_transactions

Memory read transactions seen at L2 cache for atomic and reduction requests

Multi-context

l2_global_atomic_store_bytes

Bytes written to L2 from L1 for global atomics (ATOM and ATOM CAS)

Multi-context

l2_global_load_bytes

Bytes read from L2 for misses in L1 for global loads

Multi-context

l2_local_global_store_bytes

Bytes written to L2 from L1 for local and global stores. This does not include global atomics.

Multi-context

l2_local_load_bytes

Bytes read from L2 for misses in L1 for local loads

Multi-context

l2_read_throughput

Memory read throughput seen at L2 cache for all read requests

Multi-context

l2_read_transactions

Memory read transactions seen at L2 cache for all read requests

Multi-context

l2_surface_load_bytes

Bytes read from L2 for misses in L1 for surface loads

Multi-context

l2_surface_store_bytes

Bytes read from L2 for misses in L1 for surface stores

Multi-context

l2_tex_hit_rate

Hit rate at L2 cache for all requests from texture cache

Multi-context

l2_tex_read_hit_rate

Hit rate at L2 cache for all read requests from texture cache

Multi-context

l2_tex_read_throughput

Memory read throughput seen at L2 cache for read requests from the texture cache

Multi-context

l2_tex_read_transactions

Memory read transactions seen at L2 cache for read requests from the texture cache

Multi-context

l2_tex_write_hit_rate

Hit Rate at L2 cache for all write requests from texture cache

Multi-context

l2_tex_write_throughput

Memory write throughput seen at L2 cache for write requests from the texture cache

Multi-context

l2_tex_write_transactions

Memory write transactions seen at L2 cache for write requests from the texture cache

Multi-context

l2_utilization

The utilization level of the L2 cache relative to the peak utilization on a scale of 0 to 10

Multi-context

l2_write_throughput

Memory write throughput seen at L2 cache for all write requests

Multi-context

l2_write_transactions

Memory write transactions seen at L2 cache for all write requests

Multi-context

ldst_executed

Number of executed local, global, shared and texture memory load and store instructions

Multi-context

ldst_fu_utilization

The utilization level of the multiprocessor function units that execute shared load, shared store and constant load instructions on a scale of 0 to 10

Multi-context

ldst_issued

Number of issued local, global, shared and texture memory load and store instructions

Multi-context

local_hit_rate

Hit rate for local loads and stores

Multi-context

local_load_requests

Total number of local load requests from Multiprocessor

Multi-context

local_load_throughput

Local memory load throughput

Multi-context

local_load_transactions

Number of local memory load transactions

Multi-context

local_load_transactions_per_request

Average number of local memory load transactions performed for each local memory load

Multi-context

local_memory_overhead

Ratio of local memory traffic to total memory traffic between the L1 and L2 caches expressed as percentage

Multi-context

local_store_requests

Total number of local store requests from Multiprocessor

Multi-context

local_store_throughput

Local memory store throughput

Multi-context

local_store_transactions

Number of local memory store transactions

Multi-context

local_store_transactions_per_request

Average number of local memory store transactions performed for each local memory store

Multi-context

nvlink_overhead_data_received

Ratio of overhead data to the total data, received through NVLink.

Device

nvlink_overhead_data_transmitted

Ratio of overhead data to the total data, transmitted through NVLink.

Device

nvlink_receive_throughput

Number of bytes received per second through NVLinks.

Device

nvlink_total_data_received

Total data bytes received through NVLinks including headers.

Device

nvlink_total_data_transmitted

Total data bytes transmitted through NVLinks including headers.

Device

nvlink_total_nratom_data_transmitted

Total non-reduction atomic data bytes transmitted through NVLinks.

Device

nvlink_total_ratom_data_transmitted

Total reduction atomic data bytes transmitted through NVLinks.

Device

nvlink_total_response_data_received

Total response data bytes received through NVLink, response data includes data for read requests and result of non-reduction atomic requests.

Device

nvlink_total_write_data_transmitted

Total write data bytes transmitted through NVLinks.

Device

nvlink_transmit_throughput

Number of Bytes Transmitted per second through NVLinks.

Device

nvlink_user_data_received

User data bytes received through NVLinks, doesn’t include headers.

Device

nvlink_user_data_transmitted

User data bytes transmitted through NVLinks, doesn’t include headers.

Device

nvlink_user_nratom_data_transmitted

Total non-reduction atomic user data bytes transmitted through NVLinks.

Device

nvlink_user_ratom_data_transmitted

Total reduction atomic user data bytes transmitted through NVLinks.

Device

nvlink_user_response_data_received

Total user response data bytes received through NVLink, response data includes data for read requests and result of non-reduction atomic requests.

Device

nvlink_user_write_data_transmitted

User write data bytes transmitted through NVLinks.

Device

pcie_total_data_received

Total data bytes received through PCIe

Device

pcie_total_data_transmitted

Total data bytes transmitted through PCIe

Device

shared_efficiency

Ratio of requested shared memory throughput to required shared memory throughput expressed as percentage

Multi-context

shared_load_throughput

Shared memory load throughput

Multi-context

shared_load_transactions

Number of shared memory load transactions

Multi-context

shared_load_transactions_per_request

Average number of shared memory load transactions performed for each shared memory load

Multi-context

shared_store_throughput

Shared memory store throughput

Multi-context

shared_store_transactions

Number of shared memory store transactions

Multi-context

shared_store_transactions_per_request

Average number of shared memory store transactions performed for each shared memory store

Multi-context

shared_utilization

The utilization level of the shared memory relative to peak utilization on a scale of 0 to 10

Multi-context

single_precision_fu_utilization

The utilization level of the multiprocessor function units that execute single-precision floating-point instructions on a scale of 0 to 10

Multi-context

sm_efficiency

The percentage of time at least one warp is active on a specific multiprocessor

Multi-context

special_fu_utilization

The utilization level of the multiprocessor function units that execute sin, cos, ex2, popc, flo, and similar instructions on a scale of 0 to 10

Multi-context

stall_constant_memory_dependency

Percentage of stalls occurring because of immediate constant cache miss

Multi-context

stall_exec_dependency

Percentage of stalls occurring because an input required by the instruction is not yet available

Multi-context

stall_inst_fetch

Percentage of stalls occurring because the next assembly instruction has not yet been fetched

Multi-context

stall_memory_dependency

Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding

Multi-context

stall_memory_throttle

Percentage of stalls occurring because of memory throttle

Multi-context

stall_not_selected

Percentage of stalls occurring because warp was not selected

Multi-context

stall_other

Percentage of stalls occurring due to miscellaneous reasons

Multi-context

stall_pipe_busy

Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy

Multi-context

stall_sleeping

Percentage of stalls occurring because warp was sleeping

Multi-context

stall_sync

Percentage of stalls occurring because the warp is blocked at a __syncthreads() call

Multi-context

stall_texture

Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests

Multi-context

surface_atomic_requests

Total number of surface atomic(Atom and Atom CAS) requests from Multiprocessor

Multi-context

surface_load_requests

Total number of surface load requests from Multiprocessor

Multi-context

surface_reduction_requests

Total number of surface reduction requests from Multiprocessor

Multi-context

surface_store_requests

Total number of surface store requests from Multiprocessor

Multi-context

sysmem_read_bytes

Number of bytes read from system memory

Multi-context

sysmem_read_throughput

System memory read throughput

Multi-context

sysmem_read_transactions

Number of system memory read transactions

Multi-context

sysmem_read_utilization

The read utilization level of the system memory relative to the peak utilization on a scale of 0 to 10

Multi-context

sysmem_utilization

The utilization level of the system memory relative to the peak utilization on a scale of 0 to 10

Multi-context

sysmem_write_bytes

Number of bytes written to system memory

Multi-context

sysmem_write_throughput

System memory write throughput

Multi-context

sysmem_write_transactions

Number of system memory write transactions

Multi-context

sysmem_write_utilization

The write utilization level of the system memory relative to the peak utilization on a scale of 0 to 10

Multi-context

tensor_precision_fu_utilization

The utilization level of the multiprocessor function units that execute tensor core instructions on a scale of 0 to 10

Multi-context

tensor_int_fu_utilization

The utilization level of the multiprocessor function units that execute tensor core int8 instructions on a scale of 0 to 10. This metric is only available for device with compute capability 7.2.

Multi-context

tex_cache_hit_rate

Unified cache hit rate

Multi-context

tex_cache_throughput

Unified cache to Multiprocessor read throughput

Multi-context

tex_cache_transactions

Unified cache to Multiprocessor read transactions

Multi-context

tex_fu_utilization

The utilization level of the multiprocessor function units that execute global, local and texture memory instructions on a scale of 0 to 10

Multi-context

tex_utilization

The utilization level of the unified cache relative to the peak utilization on a scale of 0 to 10

Multi-context

texture_load_requests

Total number of texture Load requests from Multiprocessor

Multi-context

warp_execution_efficiency

Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor

Multi-context

warp_nonpred_execution_efficiency

Ratio of the average active threads per warp executing non-predicated instructions to the maximum number of threads per warp supported on a multiprocessor

Multi-context

2.7. CUPTI Profiling API

Starting with CUDA 10.0, a new set of metric APIs are added for devices with compute capability 7.0 and higher. These APIs provide low and deterministic profiling overhead on the target system. These are supported on all CUDA supported platforms except Android, and are not supported under MPS (Multi-Process Service), Confidential Compute, or SLI configured systems. In order to determine whether a device is compatible with this API, a new function cuptiProfilerDeviceSupported is introduced in CUDA 11.5 which exposes overall Profiling API support and specific requirements for a given device. Profiling API must be initialized by calling cuptiProfilerInitialize before testing device support.

This section covers performance profiling Host and Target APIs for CUDA. Broadly profiling APIs are divided into following four sections:

  • Enumeration (Host)

  • Configuration (Host)

  • Collection (Target)

  • Evaluation (Host)

Host APIs provide a metric interface for enumeration, configuration and evaluation that doesn’t require a compute(GPU) device, and can also run in an offline mode. In the samples section under extensions, profiler host utility covers the usage of host APIs. Target APIs are used for data collection of the metrics and requires a compute (GPU) device. Refer to samples auto_rangeProfiling and userrange_profiling for usage of profiling APIs.

The list of metrics has been overhauled from earlier generation metrics and event APIs, to support a standard naming convention based upon unit__(subunit?)_(pipestage?)_quantity_qualifiers

2.7.1. Multi Pass Collection

NVIDIA GPU hardware has a limited number of counter registers and cannot collect all possible counters concurrently. There are also limitations on which counters can be collected together in a single pass. This is resolved by replaying the exact same set of GPU workloads multiple times, where each replay is termed a pass. On each pass, a different subset of requested counters are collected. Once all passes are collected, the data is available for evaluation. Certain metrics have many counters as inputs; adding a single metric may require many passes to collect. CUPTI APIs support multi pass collection through different collection attributes.

Sample cupti_metric_properties shows how to query number of passes required to collect a set of counters.

2.7.2. Range Profiling

Each profiling session runs a series of replay passes, where each pass contains a sequence of ranges. Every metric enabled in the session’s configuration is collected separately per unique range-stack in the pass. CUPTI supports auto and user defined ranges.

2.7.2.1. Auto Range

In a session with auto range mode, ranges are defined around each kernel automatically with a unique name assigned to each range, while profiling is enabled. This mode is useful for tight metric collection around each kernel. A user can choose one of the supported replay modes, pseudo code for each is described below:

Kernel Replay

The replay logic (multiple pass, if needed) is done by CUPTI implicitly (opaque to the user), and usage of CUPTI replay API’s cuptiProfilerBeginPass and cuptiProfilerEndPass will be a no-op in this mode. This mode is useful for collecting metrics around a kernel in tight control. Each kernel launch is synchronized to segregate its metrics into a separate range, and a CPU-GPU sync is made to ensure the profiled data is collected from GPU. Counter Collection can be enabled and disabled with cuptiProfilerEnableProfiling and cuptiProfilerDisableProfiling. Refer to the sample autorange_profiling

/* Assume Inputs(counterDataImagePrefix and configImage) from configuration phase at host */
void Collection(std::vector<uint8_t>& counterDataImagePrefix, std::vector<uint8_t>& configImage)
{
    CUpti_Profiler_Initialize_Params profilerInitializeParams = { CUpti_Profiler_Initialize_Params_STRUCT_SIZE };
    cuptiProfilerInitialize(&profilerInitializeParams);

    std::vector<uint8_t> counterDataImages;
    std::vector<uint8_t> counterDataScratchBuffer;
    CreateCounterDataImage(counterDataImages, counterDataScratchBuffer, counterDataImagePrefix);

    CUpti_Profiler_BeginSession_Params beginSessionParams = { CUpti_Profiler_BeginSession_Params_STRUCT_SIZE };
    CUpti_ProfilerRange profilerRange = CUPTI_AutoRange;
    CUpti_ProfilerReplayMode profilerReplayMode = CUPTI_KernelReplay;

    beginSessionParams.ctx = NULL;
    beginSessionParams.counterDataImageSize = counterDataImage.size();
    beginSessionParams.pCounterDataImage = &counterDataImage[0];
    beginSessionParams.counterDataScratchBufferSize = counterDataScratchBuffer.size();
    beginSessionParams.pCounterDataScratchBuffer = &counterDataScratchBuffer[0];
    beginSessionParams.range = profilerRange;
    beginSessionParams.replayMode = profilerReplayMode;
    beginSessionParams.maxRangesPerPass = num_ranges;
    beginSessionParams.maxLaunchesPerPass = num_ranges;

    cuptiProfilerBeginSession(&beginSessionParams));

    CUpti_Profiler_SetConfig_Params setConfigParams = { CUpti_Profiler_SetConfig_Params_STRUCT_SIZE };
    setConfigParams.pConfig = &configImage[0];
    setConfigParams.configSize = configImage.size();

    cuptiProfilerSetConfig(&setConfigParams));

    kernelA <<<grid, tids >>>(...);                  // KernelA not profiled

    CUpti_Profiler_EnableProfiling_Params enableProfilingParams = { CUpti_Profiler_EnableProfiling_Params_STRUCT_SIZE };
    cuptiProfilerEnableProfiling(&enableProfilingParams);
    {

        kernelB <<<grid, tids >>>(...);              // KernelB profiled and captured in an unique range.
        kernelC <<<grid, tids >>>(...);              // KernelC profiled and captured in an unique range.
        kernelD <<<grid, tids >>>(...);              // KernelD profiled and captured in an unique range.
    }

    CUpti_Profiler_DisableProfiling_Params disableProfilingParams = { CUpti_Profiler_DisableProfiling_Params_STRUCT_SIZE };
    cuptiProfilerDisableProfiling(&disableProfilingParams);

    kernelE <<<grid, tids >>>(...);                  // KernelE not profiled

    CUpti_Profiler_UnsetConfig_Params unsetConfigParams = { CUpti_Profiler_UnsetConfig_Params_STRUCT_SIZE };
    cuptiProfilerUnsetConfig(&unsetConfigParams);

    CUpti_Profiler_EndSession_Params endSessionParams = { CUpti_Profiler_EndSession_Params_STRUCT_SIZE };
    cuptiProfilerEndSession(&endSessionParams);
}

User Replay

The replay (multiple passes, if needed) is done by the user using the replay API’s cuptiProfilerBeginPass and cuptiProfilerEndPass. It is user responsibility to flush the counter data cuptiProfilerFlushCounterData before ending the session to ensure collection of metric data in CPU. Counter collection can be enabled and disabled with cuptiProfilerEnableProfiling/ cuptiProfilerDisableProfiling. Refer to the sample autorange_profiling

/* Assume Inputs(counterDataImagePrefix and configImage) from configuration phase at host */

void Collection(std::vector<uint8_t>& counterDataImagePrefix, std::vector<uint8_t>& configImage)
{
    CUpti_Profiler_Initialize_Params profilerInitializeParams = {CUpti_Profiler_Initialize_Params_STRUCT_SIZE};
    cuptiProfilerInitialize(&profilerInitializeParams);

    std::vector<uint8_t> counterDataImages;
    std::vector<uint8_t> counterDataScratchBuffer;
    CreateCounterDataImage(counterDataImages, counterDataScratchBuffer, counterDataImagePrefix);

    CUpti_Profiler_BeginSession_Params beginSessionParams = {CUpti_Profiler_BeginSession_Params_STRUCT_SIZE};
    CUpti_ProfilerRange profilerRange = CUPTI_AutoRange;
    CUpti_ProfilerReplayMode profilerReplayMode = CUPTI_UserReplay;

    beginSessionParams.ctx = NULL;
    beginSessionParams.counterDataImageSize = counterDataImage.size();
    beginSessionParams.pCounterDataImage = &counterDataImage[0];
    beginSessionParams.counterDataScratchBufferSize = counterDataScratchBuffer.size();
    beginSessionParams.pCounterDataScratchBuffer = &counterDataScratchBuffer[0];
    beginSessionParams.range = profilerRange;
    beginSessionParams.replayMode = profilerReplayMode;
    beginSessionParams.maxRangesPerPass = num_ranges;
    beginSessionParams.maxLaunchesPerPass = num_ranges;

    cuptiProfilerBeginSession(&beginSessionParams));

    CUpti_Profiler_SetConfig_Params setConfigParams = {CUpti_Profiler_SetConfig_Params_STRUCT_SIZE};
    setConfigParams.pConfig = &configImage[0];
    setConfigParams.configSize = configImage.size();

    cuptiProfilerSetConfig(&setConfigParams));

    CUpti_Profiler_FlushCounterData_Params cuptiFlushCounterDataParams =        {CUpti_Profiler_FlushCounterData_Params_STRUCT_SIZE};

    CUpti_Profiler_EnableProfiling_Params enableProfilingParams =       {CUpti_Profiler_EnableProfiling_Params_STRUCT_SIZE};

    CUpti_Profiler_DisableProfiling_Params disableProfilingParams =         {CUpti_Profiler_DisableProfiling_Params_STRUCT_SIZE};


    kernelA<<<grid, tids>>>(...);                  // KernelA neither profiled, nor replayed

    CUpti_Profiler_BeginPass_Params beginPassParams = {CUpti_Profiler_BeginPass_Params_STRUCT_SIZE};
    CUpti_Profiler_EndPass_Params endPassParams = {CUpti_Profiler_EndPass_Params_STRUCT_SIZE};

    cuptiProfilerBeginPass(&beginPassParams);
    {
        kernelB<<<grid, tids>>>(...);              // KernelB replayed but not profiled

        cuptiProfilerEnableProfiling(&enableProfilingParams);

        kernelC<<<grid, tids>>>(...);              // KernelC profiled and captured in an unique range.
        kernelD<<<grid, tids>>>(...);              // KernelD profiled and captured in an unique range.

        cuptiProfilerDisableProfiling(&disableProfilingParams);
    }
    cuptiProfilerEndPass(&endPassParams);

    cuptiProfilerFlushCounterData(&cuptiFlushCounterDataParams);

    kernelE<<<grid, tids>>>(...);                  // KernelE not profiled

    CUpti_Profiler_UnsetConfig_Params unsetConfigParams = {CUpti_Profiler_UnsetConfig_Params_STRUCT_SIZE};
    cuptiProfilerUnsetConfig(&unsetConfigParams);

    CUpti_Profiler_EndSession_Params endSessionParams = {CUpti_Profiler_EndSession_Params_STRUCT_SIZE};
    cuptiProfilerEndSession(&endSessionParams);
}

Application Replay

This replay mode is same as user replay, instead of in process replay, you can replay the whole process again. You will need to update the pass index while setting the config cuptiProfilerSetConfig and reload the intermediate counterDataImage on each pass.

2.7.2.2. User Range

In a session with user range mode, ranges are defined by you, cuptiProfilerPushRange and cuptiProfilerPopRange. Kernel launches are concurrent within a range. This mode is useful for metric data collection around a specific section of code, instead of per-kernel metric collection. Kernel replay is not supported in user range mode. You own the responsibility of replay using cuptiProfilerBeginPass and cuptiProfilerEndPass.

User Replay

The replay (multiple passes, if needed) is done by the user using the replay API’s cuptiProfilerBeginPass and cuptiProfilerEndPass. It is your responsibility to flush the counter data using cuptiProfilerFlushCounterData before ending the session. Counter collection can be enabled/disabled with cuptiProfilerEnableProfiling and cuptiProfilerDisableProfiling. Refer to the sample userrange_profiling

/* Assume Inputs(counterDataImagePrefix and configImage) from configuration phase at host */

void Collection(std::vector<uint8_t>& counterDataImagePrefix, std::vector<uint8_t>& configImage)
{
    CUpti_Profiler_Initialize_Params profilerInitializeParams = {CUpti_Profiler_Initialize_Params_STRUCT_SIZE};
    cuptiProfilerInitialize(&profilerInitializeParams);

    std::vector<uint8_t> counterDataImages;
    std::vector<uint8_t> counterDataScratchBuffer;
    CreateCounterDataImage(counterDataImages, counterDataScratchBuffer, counterDataImagePrefix);

    CUpti_Profiler_BeginSession_Params beginSessionParams = {CUpti_Profiler_BeginSession_Params_STRUCT_SIZE};
    CUpti_ProfilerRange profilerRange = CUPTI_UserRange;
    CUpti_ProfilerReplayMode profilerReplayMode = CUPTI_UserReplay;

    beginSessionParams.ctx = NULL;
    beginSessionParams.counterDataImageSize = counterDataImage.size();
    beginSessionParams.pCounterDataImage = &counterDataImage[0];
    beginSessionParams.counterDataScratchBufferSize = counterDataScratchBuffer.size();
    beginSessionParams.pCounterDataScratchBuffer = &counterDataScratchBuffer[0];
    beginSessionParams.range = profilerRange;
    beginSessionParams.replayMode = profilerReplayMode;
    beginSessionParams.maxRangesPerPass = num_ranges;
    beginSessionParams.maxLaunchesPerPass = num_ranges;

    cuptiProfilerBeginSession(&beginSessionParams));

    CUpti_Profiler_SetConfig_Params setConfigParams = {CUpti_Profiler_SetConfig_Params_STRUCT_SIZE};
    setConfigParams.pConfig = &configImage[0];
    setConfigParams.configSize = configImage.size();

    cuptiProfilerSetConfig(&setConfigParams));

    CUpti_Profiler_FlushCounterData_Params cuptiFlushCounterDataParams = {CUpti_Profiler_FlushCounterData_Params_STRUCT_SIZE};

    kernelA<<<grid, tids>>>(...);                  // KernelA neither profiled, nor replayed

    CUpti_Profiler_BeginPass_Params beginPassParams = {CUpti_Profiler_BeginPass_Params_STRUCT_SIZE};
    CUpti_Profiler_EndPass_Params endPassParams = {CUpti_Profiler_EndPass_Params_STRUCT_SIZE};

    cuptiProfilerBeginPass(&beginPassParams);
    {
        kernelB<<<grid, tids>>>(...);              // KernelB replayed but not profiled

        CUpti_Profiler_PushRange_Params enableProfilingParams = {CUpti_Profiler_PushRange_Params_STRUCT_SIZE};
        pushRangeParams.pRangeName = "RangeA";
        cuptiProfilerPushRange(&pushRangeParams);

        kernelC<<<grid, tids>>>(...);
        kernelD<<<grid, tids>>>(...);

        cuptiProfilerPopRange(&popRangeParams);     // Kernel C and Kernel D are captured in rangeA without any serialization introduced by profiler
    }
    cuptiProfilerEndPass(&endPassParams);
    cuptiProfilerFlushCounterData(&cuptiFlushCounterDataParams);

    kernelE<<<grid, tids>>>(...);                  // KernelE not Profiled

    CUpti_Profiler_UnsetConfig_Params unsetConfigParams = {CUpti_Profiler_UnsetConfig_Params_STRUCT_SIZE};
    cuptiProfilerUnsetConfig(&unsetConfigParams);

    CUpti_Profiler_EndSession_Params endSessionParams = {CUpti_Profiler_EndSession_Params_STRUCT_SIZE};
    cuptiProfilerEndSession(&endSessionParams);
}

Application Replay

This replay mode is same as user replay, instead of in process replay, you can replay the whole process again. You will need to update the pass index while setting the config using the cuptiProfilerSetConfig API, and reload the intermediate counterDataImage on each pass.

2.7.3. CUPTI Profiler Definitions

Definitions of glossary used in this section.

Counter:

The number of occurrences of a specific event on the device.

Configuration Image:

A Blob to configure the session for counters to be collected.

CounterData Image:

A Blob which contains the values of collected counters

CounterData Prefix:

A metadata header for CounterData Image

Device:

A physical NVIDIA GPU.

Event:

An event is a countable activity, action, or occurrence on device.

Metric:

A high-level value derived from counter values.

Pass:

A repeatable set of operations, with consistently labeled ranges.

Range:

A labeled region of execution

Replay:

Performing the repeatable set of operation.

Session:

A profiling session where GPU resources needed for profiling are allocated. The profiler is in armed state at session boundaries, and power management may be disabled at session boundaries. Outside of a session, the GPU will return to its normal operating state.

2.7.4. Differences from event and metric APIs

Here is the list of features which are supported by the event and metric APIs but these are not available with the Profiling API:

  • Continuous mode or sampling of the metrics.

  • Profiling API provides closest equivalent metrics for most of the events and metrics supported by the event and metric APIs. However, there are some events and metrics, for example NVLink performance metrics, for which there is no equivalent metrics in the Profiling API. Tables Metrics Mapping Table and Events Mapping Table can be referred to find the equivalent Perfworks metrics for compute capability 7.0.

  • Per-instance metrics i.e. users can’t collect metrics for each instance of the hardware units like SM, FB etc separately. However Profiling API provides sub-metrics which can be used to get the avg/sum/min/max across all instances of a hardware unit.

2.8. Perfworks Metric API

Introduction:

The Perfworks Metric API supports the enumeration, configuration and evaluation of metrics. The binary outputs of the configuration phase are inputs to the CUPTI Range Profiling API. The output of Range Profiling is the CounterData, which is passed to the Derived Metrics Evaluation APIs.

GPU Metrics are generally presented as counts, ratios and percentages. The underlying values collected from hardware are raw counters (analogous to CUPTI events), but those details are hidden behind derived metric formulas.

The Metric APIs are split into two layers: Derived Metrics and Raw Metrics. Derived Metrics contains the list of named metrics and performs evaluation to numeric results, serving a similar purpose as the previous CUPTI Metric API. Most user interaction will be with derived metrics. Raw Metrics contains the list of raw counters and generates configuration file images analogous to the previous CUPTI Event API.

Metric Enumeration

Metric Enumeration is the process of listing available counters and metrics.

Refer to file List.cpp used by the cupti_metric_properties sample.

Metrics are grouped into three types i.e. counters, ratios and throughput. Except ratios metric type each metrics have four type of sub-metrics also known as rollup metrics i.e. sum, avg, min, max.

For enumerating supported metrics for a chip, we need to calculate the scratch buffer needed for host operation and to initialize the Metric Evaluator.

  • Call NVPW_CUDA_MetricsEvaluator_CalculateScratchBufferSize for calculating scratch buffer size required for allocating memory for host operations.

  • Call NVPW_CUDA_MetricsEvaluator_Initialize for initializing the Metrics Evaluator which creates a NVPW_MetricsEvaluator object.

The outline for enumerating supported counter metrics for a chip:

  • Call NVPW_MetricsEvaluator_GetMetricNames for NVPW_METRIC_TYPE_COUNTER metric type for listing all the counter metrics supported.

  • Call NVPW_MetricsEvaluator_GetSupportedSubmetrics to list all the sub-metric supported for NVPW_METRIC_TYPE_COUNTER metric type.

  • Call NVPW_MetricsEvaluator_GetCounterProperties to give description of the counter and the collection hardware unit.

Similarly, for enumerating ratio and throughput metrics we need to pass NVPW_METRIC_TYPE_RATIO and NVPW_METRIC_TYPE_THROUGHPUT as metric types to NVPW_MetricsEvaluator_GetMetricNames and NVPW_MetricsEvaluator_GetSupportedSubmetrics.

For more details about the metric properties call NVPW_MetricsEvaluator_GetRatioMetricProperties and NVPW_MetricsEvaluator_GetThroughputMetricProperties respectively.

Configuration Workflow

Configuration is the process of specifying the metrics that will be collected and how those metrics should be collected. The inputs for this phase are the metric names and metric collection properties. The output for this phase is a ConfigImage and a CounterDataPrefix Image.

Refer to file Metric.cpp used by the userrange_profiling sample.

The outline for configuring metrics:

  • As input, take a list of metric names.

  • Before creating ConfigImage or CounterDataPrefixImage, we need a list of NVPA_RawMetricRequest for the metrics listed for collection.

    • We need to calculate the scratch buffer size required for the host operation and to initialize the Metric Evaluator like in the Enumeration phase.

    • For each metric, Call NVPW_MetricsEvaluator_ConvertMetricNameToMetricEvalRequest for creating a NVPW_MetricEvalRequest.

    • Call NVPW_MetricsEvaluator_GetMetricRawDependencies which takes the NVPW_MetricsEvaluator and NVPW_MetricEvalRequest as input, for getting raw dependencies for given metrics.

  • Create an NVPA_RawMetricRequest with keepInstances=true and isolated=true

  • Pass the NVPA_RawMetricRequest to NVPW_RawMetricsConfig_AddMetrics for the ConfigImage.

  • Pass the NVPA_RawMetricRequest to NVPW_CounterDataBuilder_AddMetrics for the CounterDataPrefix.

  • Generate binary configuration “images” (file format in memory):

    • ConfigImage from NVPW_RawMetricsConfig_GetConfigImage

    • CounterDataPrefix from NVPW_CounterDataBuilder_GetCounterDataPrefix

Metric Evaluation

Metric Evaluation is the process of forming metrics from the counters stored in the CounterData image.

Refer to file Eval.cpp used by the userrange_profiling sample.

The outline for configuring metrics:

  • As input, take the same list of metric names as used during configuration.

  • As input, take a CounterDataImage collected on a target device.

  • We need to calculate the scratch buffer size required for the host operation and to initialize the Metric Evaluator like in the Enumeration phase.

  • Query the number of ranges collected via NVPW_CounterData_GetNumRanges.

  • For each metric:

    • Call NVPW_MetricsEvaluator_ConvertMetricNameToMetricEvalRequest for creating NVPW_MetricEvalRequest

    • For each range:

      • Call NVPW_Profiler_CounterData_GetRangeDescriptions to retrieve the range’s description, originally set by cuptiProfilerPushRange.

      • Call NVPW_MetricsEvaluator_SetDeviceAttributes to set the current range for evaluation on the NVPW_MetricEvalRequest.

      • Call NVPW_MetricsEvaluator_EvaluateToGpuValues to query an array of numeric values corresponding to each input metric.

2.8.1. Derived metrics

Metrics Overview

The PerfWorks API comes with an advanced metrics calculation system, designed to help you determine what happened (counters and metrics), and how close the program reached to peak GPU performance (throughputs as a percentage). Every counter has associated peak rates in the database, to allow computing its throughput as a percentage.

Throughput metrics return the maximum percentage value of their constituent counters. Constituents can be programmatically queried via NVPW_MetricsEvaluator_GetMetricNames with NVPW_METRIC_TYPE_THROUGHPUT as metric types. These constituents have been carefully selected to represent the sections of the GPU pipeline that govern peak performance. While all counters can be converted to a %-of-peak, not all counters are suitable for peak-performance analysis; examples of unsuitable counters include qualified subsets of activity, and workload residency counters. Using throughput metrics ensures meaningful and actionable analysis.

Two types of peak rates are available for every counter: burst and sustained. Burst rate is the maximum rate reportable in a single clock cycle. Sustained rate is the maximum rate achievable over an infinitely long measurement period, for “typical” operations. For many counters, burst == sustained. Since the burst rate cannot be exceeded, percentages of burst rate will always be less than 100%. Percentages of sustained rate can occasionally exceed 100% in edge cases. Burst metrics are only supported with MetricsContext APIs and these will be deprecated in a future CUDA release. These metrics are not supported with NVPW_MetricsEvaluator APIs.

Metrics Entities

The Metrics layer has 3 major types of entities:

  • Metrics : these are calculated quantities, with the following static properties:

    • Description string.

    • Dimensional Units : a list of (‘name’, exponent) in the style of dimensional analysis. Example string representation: pixels / gpc_clk.

    • Raw Metric dependencies : the list of raw metrics that must be collected, in order to evaluate the metric.

    • Every metric has the following sub-metrics built in.

      .peak_sustained

      the peak sustained rate

      .peak_sustained_active

      the peak sustained rate during unit active cycles

      .peak_sustained_active.per_second

      the peak sustained rate during unit active cycles, per second *

      .peak_sustained_elapsed

      the peak sustained rate during unit elapsed cycles

      .peak_sustained_elapsed.per_second

      the peak sustained rate during unit elapsed cycles, per second *

      .peak_sustained_region

      the peak sustained rate over a user-specified “range”

      .peak_sustained_region.per_second

      the peak sustained rate over a user-specified “range”, per second *

      .peak_sustained_frame

      the peak sustained rate over a user-specified “frame”

      .peak_sustained_frame.per_second

      the peak sustained rate over a user-specified “frame”, per second *

      .per_cycle_active

      the number of operations per unit active cycle

      .per_cycle_elapsed

      the number of operations per unit elapsed cycle

      .per_cycle_in_region

      the number of operations per user-specified “range” cycle

      .per_cycle_in_frame

      the number of operations per user-specified “frame” cycle

      .per_second

      the number of operations per second

      .pct_of_peak_sustained_active

      % of peak sustained rate achieved during unit active cycles

      .pct_of_peak_sustained_elapsed

      % of peak sustained rate achieved during unit elapsed cycles

      .pct_of_peak_sustained_region

      % of peak sustained rate achieved over a user-specified “range” time

      .pct_of_peak_sustained_frame

      % of peak sustained rate achieved over a user-specified “frame” time

      * sub-metrics added in CUPTI 11.3.

  • Counters may be either a raw counter from the GPU, or a calculated counter value. Every counter has four sub-metrics under it, which are also called roll-ups:

    .sum

    The sum of counter values across all unit instances.

    .avg

    The average counter value across all unit instances.

    .min

    The minimum counter value across all unit instances.

    .max

    The maximum counter value across all unit instances.

  • Ratios have three sub-metrics under it:

    .pct

    The value expressed as a percentage.

    .ratio

    The value expressed as a ratio.

    .max_rate

    The ratio’s maximum value.

  • Throughputs indicate how close a portion of the GPU reached to peak rate. Every throughput has the following sub-metrics:

    .pct_of_peak_sustained_active

    % of peak sustained rate achieved during unit active cycles

    .pct_of_peak_sustained_elapsed

    % of peak sustained rate achieved during unit elapsed cycles

    .pct_of_peak_sustained_region

    % of peak sustained rate achieved over a user-specified “range” time

    .pct_of_peak_sustained_frame

    % of peak sustained rate achieved over a user-specified “frame” time

At the configuration step, you must specify metric names. Counters, ratios, and throughputs are not directly schedulable.

Note: Burst metrics are only supported with MetricsContext APIs.

From CUPTI 11.3 onwards, due to not being useful for performance optimization following counter sub-metrics are not present in MetricEvaluator APIs and are only supported with MetricsContext APIs:

.peak_burst

the peak burst rate

.pct_of_peak_burst_active

% of peak burst rate achieved during unit active cycles

.pct_of_peak_burst_elapsed

% of peak burst rate achieved during unit elapsed cycles

.pct_of_peak_burst_region

% of peak burst rate achieved over a user-specified “range”

.pct_of_peak_burst_frame

% of peak burst rate achieved over a user-specified “frame”

From CUPTI 11.3 onwards, due to not being useful for performance optimization following throughput sub-metrics are not present in MetricEvaluator APIs and are only supported with MetricsContext APIs:

.pct_of_peak_burst_active

% of peak burst rate achieved during unit active cycles

.pct_of_peak_burst_elapsed

% of peak burst rate achieved during unit elapsed cycles

.pct_of_peak_burst_region

% of peak burst rate achieved over a user-specified “range” time

.pct_of_peak_burst_frame

% of peak burst rate achieved over a user-specified “frame” time

Metrics Examples

## non-metric names -- *not* directly evaluable
sm__inst_executed                   # counter
smsp__average_warp_latency          # ratio
sm__throughput                      # throughput

## a counter's four roll-ups as sub-metrics -- all evaluable
sm__inst_executed.sum               # metric
sm__inst_executed.avg               # metric
sm__inst_executed.min               # metric
sm__inst_executed.max               # metric

## all names below are metrics -- all evaluable
l1tex__data_bank_conflicts_pipe_lsu.sum
l1tex__data_bank_conflicts_pipe_lsu.sum.peak_burst
l1tex__data_bank_conflicts_pipe_lsu.sum.peak_sustained
l1tex__data_bank_conflicts_pipe_lsu.sum.per_cycle_active
l1tex__data_bank_conflicts_pipe_lsu.sum.per_cycle_elapsed
l1tex__data_bank_conflicts_pipe_lsu.sum.per_cycle_in_region
l1tex__data_bank_conflicts_pipe_lsu.sum.per_cycle_in_frame
l1tex__data_bank_conflicts_pipe_lsu.sum.per_second
l1tex__data_bank_conflicts_pipe_lsu.sum.pct_of_peak_sustained_active
l1tex__data_bank_conflicts_pipe_lsu.sum.pct_of_peak_sustained_elapsed
l1tex__data_bank_conflicts_pipe_lsu.sum.pct_of_peak_sustained_region
l1tex__data_bank_conflicts_pipe_lsu.sum.pct_of_peak_sustained_frame

Metrics Naming Conventions

Counters and metrics _generally_ obey the naming scheme:

  • Unit-Level Counter : unit__(subunit?)_(pipestage?)_quantity_(qualifiers?)

  • Interface Counter : unit__(subunit?)_(pipestage?)_(interface)_quantity_(qualifiers?)

  • Unit Metric : (counter_name).(rollup_metric)

  • Sub-Metric : (counter_name).(rollup_metric).(submetric)

where

  • unit: A logical of physical unit of the GPU

  • subunit: The subunit within the unit where the counter was measured. Sometimes this is a pipeline mode instead.

  • pipestage: The pipeline stage within the subunit where the counter was measured.

  • quantity: What is being measured. Generally matches the “dimensional units”.

  • qualifiers: Any additional predicates or filters applied to the counter. Often, an unqualified counter can be broken down into several qualified sub-components.

  • interface: Of the form sender2receiver, where sender is the source-unit and receiver is the destination-unit.

  • rollup_metric: One of sum,avg,min,max.

  • submetric: refer to section Metric Entities

Components are not always present. Most top-level counters have no qualifiers. Subunit and pipestage may be absent where irrelevant, or there may be many subunit specifiers for detailed counters.

Cycle Metrics

Counters using the term cycles in the name report the number of cycles in the unit’s clock domain. Unit-level cycle metrics include:

  • unit__cycles_elapsed : The number of cycles within a range. The cycles’ DimUnits are specific to the unit’s clock domain.

  • unit__cycles_active : The number of cycles where the unit was processing data.

  • unit__cycles_stalled : The number of cycles where the unit was unable to process new data because its output interface was blocked.

  • unit__cycles_idle : The number of cycles where the unit was idle.

Interface-level cycle counters are often (not always) available in the following variations:

  • unit__(interface)_active : Cycles where data was transferred from source-unit to destination-unit.

  • unit__(interface)_stalled : Cycles where the source-unit had data, but the destination-unit was unable to accept data.

2.8.2. Raw Metrics

The raw metrics layer contains a list of low-level GPU counters, and the “scheduling” logic needed to program the hardware. The binary output files (ConfigImage and CounterDataPrefix) can be generated offline, stored on disk, and used on any compatible GPU. They do not need to be generated on a machine where a GPU is available.

Refer to Metrics Configuration to see where Raw Metrics fit into the overall data flow of the profiler.

2.8.3. Metrics Mapping Table

The table below lists the CUPTI metrics for devices with compute capability 7.0. For each CUPTI metric the closest equivalent Perfworks metric or formula is given. If no equivalent Perfworks metric is available the column is left blank. Note that there can be some difference in the metric values between the CUPTI metric and the Perfworks metrics.

Table 4. Metrics Mapping Table from CUPTI to Perfworks for Compute Capability 7.0

CUPTI Metric

Perfworks Metric or Formula

achieved_occupancy

sm__warps_active.avg.pct_of_peak_sustained_active

atomic_transactions

l1tex__t_set_accesses_pipe_lsu_mem_global_op_atom.sum + l1tex__t_set_accesses_pipe_lsu_mem_global_op_red.sum

atomic_transactions_per_request

(l1tex__t_sectors_pipe_lsu_mem_global_op_atom.sum + l1tex__t_sectors_pipe_lsu_mem_global_op_red.sum) / (l1tex__t_requests_pipe_lsu_mem_global_op_atom.sum + l1tex__t_requests_pipe_lsu_mem_global_op_red.sum)

branch_efficiency

smsp__sass_average_branch_targets_threads_uniform.pct

cf_executed

smsp__inst_executed_pipe_cbu.sum + smsp__inst_executed_pipe_adu.sum

cf_fu_utilization

cf_issued

double_precision_fu_utilization

smsp__inst_executed_pipe_fp64.avg.pct_of_peak_sustained_active

dram_read_bytes

dram__bytes_read.sum

dram_read_throughput

dram__bytes_read.sum.per_second

dram_read_transactions

dram__sectors_read.sum

dram_utilization

dram__throughput.avg.pct_of_peak_sustained_elapsed

dram_write_bytes

dram__bytes_write.sum

dram_write_throughput

dram__bytes_write.sum.per_second

dram_write_transactions

dram__sectors_write.sum

eligible_warps_per_cycle

smsp__warps_eligible.sum.per_cycle_active

flop_count_dp

smsp__sass_thread_inst_executed_op_dadd_pred_on.sum + smsp__sass_thread_inst_executed_op_dmul_pred_on.sum + smsp__sass_thread_inst_executed_op_dfma_pred_on.sum * 2

flop_count_dp_add

smsp__sass_thread_inst_executed_op_dadd_pred_on.sum

flop_count_dp_fma

smsp__sass_thread_inst_executed_op_dfma_pred_on.sum

flop_count_dp_mul

smsp__sass_thread_inst_executed_op_dmul_pred_on.sum

flop_count_hp

smsp__sass_thread_inst_executed_op_hadd_pred_on.sum + smsp__sass_thread_inst_executed_op_hmul_pred_on.sum + smsp__sass_thread_inst_executed_op_hfma_pred_on.sum * 2

flop_count_hp_add

smsp__sass_thread_inst_executed_op_hadd_pred_on.sum

flop_count_hp_fma

smsp__sass_thread_inst_executed_op_hfma_pred_on.sum

flop_count_hp_mul

smsp__sass_thread_inst_executed_op_hmul_pred_on.sum

flop_count_sp

smsp__sass_thread_inst_executed_op_fadd_pred_on.sum + smsp__sass_thread_inst_executed_op_fmul_pred_on.sum + smsp__sass_thread_inst_executed_op_ffma_pred_on.sum * 2

flop_count_sp_add

smsp__sass_thread_inst_executed_op_fadd_pred_on.sum

flop_count_sp_fma

smsp__sass_thread_inst_executed_op_ffma_pred_on.sum

flop_count_sp_mul

smsp__sass_thread_inst_executed_op_fmul_pred_on.sum

flop_count_sp_special

flop_dp_efficiency

smsp__sass_thread_inst_executed_ops_dadd_dmul_dfma_pred_on.avg.pct_of_peak_sustained_elapsed

flop_hp_efficiency

smsp__sass_thread_inst_executed_ops_hadd_hmul_hfma_pred_on.avg.pct_of_peak_sustained_elapsed

flop_sp_efficiency

smsp__sass_thread_inst_executed_ops_fadd_fmul_ffma_pred_on.avg.pct_of_peak_sustained_elapsed

gld_efficiency

smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct

gld_requested_throughput

gld_throughput

l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second

gld_transactions

l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum

gld_transactions_per_request

l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio

global_atomic_requests

l1tex__t_requests_pipe_lsu_mem_global_op_atom.sum

global_hit_rate

l1tex__t_sectors_pipe_lsu_mem_global_op_{op}_lookup_hit.sum / l1tex__t_sectors_pipe_lsu_mem_global_op_{op}.sum

global_load_requests

l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum

global_reduction_requests

l1tex__t_requests_pipe_lsu_mem_global_op_red.sum

global_store_requests

l1tex__t_requests_pipe_lsu_mem_global_op_st.sum

gst_efficiency

smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct

gst_requested_throughput

gst_throughput

l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum.per_second

gst_transactions

l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum

gst_transactions_per_request

l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_st.ratio

half_precision_fu_utilization

smsp__inst_executed_pipe_fp16.avg.pct_of_peak_sustained_active

inst_bit_convert

smsp__sass_thread_inst_executed_op_conversion_pred_on.sum

inst_compute_ld_st

smsp__sass_thread_inst_executed_op_memory_pred_on.sum

inst_control

smsp__sass_thread_inst_executed_op_control_pred_on.sum

inst_executed

smsp__inst_executed.sum

inst_executed_global_atomics

smsp__sass_inst_executed_op_global_atom.sum

inst_executed_global_loads

smsp__inst_executed_op_global_ld.sum

inst_executed_global_reductions

smsp__inst_executed_op_global_red.sum

inst_executed_global_stores

smsp__inst_executed_op_global_st.sum

inst_executed_local_loads

smsp__inst_executed_op_local_ld.sum

inst_executed_local_stores

smsp__inst_executed_op_local_st.sum

inst_executed_shared_atomics

smsp__inst_executed_op_shared_atom.sum + smsp__inst_executed_op_shared_atom_dot_alu.sum + smsp__inst_executed_op_shared_atom_dot_cas.sum

inst_executed_shared_loads

smsp__inst_executed_op_shared_ld.sum

inst_executed_shared_stores

smsp__inst_executed_op_shared_st.sum

inst_executed_surface_atomics

smsp__inst_executed_op_surface_atom.sum

inst_executed_surface_loads

smsp__inst_executed_op_surface_ld.sum + smsp__inst_executed_op_shared_atom_dot_alu.sum + smsp__inst_executed_op_shared_atom_dot_cas.sum

inst_executed_surface_reductions

smsp__inst_executed_op_surface_red.sum

inst_executed_surface_stores

smsp__inst_executed_op_surface_st.sum

inst_executed_tex_ops

smsp__inst_executed_op_texture.sum

inst_fp_16

smsp__sass_thread_inst_executed_op_fp16_pred_on.sum

inst_fp_32

smsp__sass_thread_inst_executed_op_fp32_pred_on.sum

inst_fp_64

smsp__sass_thread_inst_executed_op_fp64_pred_on.sum

inst_integer

smsp__sass_thread_inst_executed_op_integer_pred_on.sum

inst_inter_thread_communication

smsp__sass_thread_inst_executed_op_inter_thread_communication_pred_on.sum

inst_issued

smsp__inst_issued.sum

inst_misc

smsp__sass_thread_inst_executed_op_misc_pred_on.sum

inst_per_warp

smsp__average_inst_executed_per_warp.ratio

inst_replay_overhead

ipc

smsp__inst_executed.avg.per_cycle_active

issue_slot_utilization

smsp__issue_active.avg.pct_of_peak_sustained_active

issue_slots

smsp__inst_issued.sum

issued_ipc

smsp__inst_issued.avg.per_cycle_active

l2_atomic_throughput

lts__t_sectors_srcunit_l1_op_atom.sum.per_second

l2_atomic_transactions

lts__t_sectors_srcunit_l1_op_atom.sum

l2_global_atomic_store_bytes

lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_atom.sum

l2_global_load_bytes

lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_ld.sum

l2_local_global_store_bytes

lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_local_op_st.sum + lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_st.sum

l2_local_load_bytes

lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_local_op_ld.sum

l2_read_throughput

lts__t_sectors_op_read.sum.per_second

l2_read_transactions

lts__t_sectors_op_read.sum

l2_surface_load_bytes

lts__t_bytes_equiv_l1sectormiss_pipe_tex_mem_surface_op_ld.sum

l2_surface_store_bytes

lts__t_bytes_equiv_l1sectormiss_pipe_tex_mem_surface_op_st.sum

l2_tex_hit_rate

lts__t_sector_hit_rate.pct

l2_tex_read_hit_rate

lts__t_sector_op_read_hit_rate.pct

l2_tex_read_throughput

lts__t_sectors_srcunit_tex_op_read.sum.per_second

l2_tex_read_transactions

lts__t_sectors_srcunit_tex_op_read.sum

l2_tex_write_hit_rate

lts__t_sector_op_write_hit_rate.pct

l2_tex_write_throughput

lts__t_sectors_srcunit_tex_op_read.sum.per_second

l2_tex_write_transactions

lts__t_sectors_srcunit_tex_op_read.sum

l2_utilization

lts__t_sectors.avg.pct_of_peak_sustained_elapsed

l2_write_throughput

lts__t_sectors_op_write.sum.per_second

l2_write_transactions

lts__t_sectors_op_write.sum

ldst_executed

ldst_fu_utilization

smsp__inst_executed_pipe_lsu.avg.pct_of_peak_sustained_active

ldst_issued

local_hit_rate

local_load_requests

l1tex__t_requests_pipe_lsu_mem_local_op_ld.sum

local_load_throughput

l1tex__t_bytes_pipe_lsu_mem_local_op_ld.sum.per_second

local_load_transactions

l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum

local_load_transactions_per_request

l1tex__average_t_sectors_per_request_pipe_lsu_mem_local_op_ld.ratio

local_memory_overhead

local_store_requests

l1tex__t_requests_pipe_lsu_mem_local_op_st.sum

local_store_throughput

l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum.per_second

local_store_transactions

l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum

local_store_transactions_per_request

l1tex__average_t_sectors_per_request_pipe_lsu_mem_local_op_st.ratio

nvlink_data_receive_efficiency

nvlink_data_transmission_efficiency

nvlink_overhead_data_received

nvlink_overhead_data_transmitted

nvlink_receive_throughput

nvlink_total_data_received

nvlink_total_data_transmitted

nvlink_total_nratom_data_transmitted

nvlink_total_ratom_data_transmitted

nvlink_total_response_data_received

nvlink_total_write_data_transmitted

nvlink_transmit_throughput

nvlink_user_data_received

nvlink_user_data_transmitted

nvlink_user_nratom_data_transmitted

nvlink_user_ratom_data_transmitted

nvlink_user_response_data_received

nvlink_user_write_data_transmitted

pcie_total_data_received

pcie__read_bytes.sum

pcie_total_data_transmitted

pcie__write_bytes.sum

shared_efficiency

smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct

shared_load_throughput

l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum.per_second

shared_load_transactions

l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum

shared_load_transactions_per_request

shared_store_throughput

l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum.per_second

shared_store_transactions

l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum

shared_store_transactions_per_request

shared_utilization

l1tex__data_pipe_lsu_wavefronts_mem_shared.avg.pct_of_peak_sustained_elapsed

single_precision_fu_utilization

smsp__pipe_fma_cycles_active.avg.pct_of_peak_sustained_active

sm_efficiency

smsp__cycles_active.avg.pct_of_peak_sustained_elapsed

sm_tex_utilization

l1tex__texin_sm2tex_req_cycles_active.avg.pct_of_peak_sustained_elapsed

special_fu_utilization

smsp__inst_executed_pipe_xu.avg.pct_of_peak_sustained_active

stall_constant_memory_dependency

smsp__warp_issue_stalled_imc_miss_per_warp_active.pct

stall_exec_dependency

smsp__warp_issue_stalled_short_scoreboard_per_warp_active.pct + smsp__warp_issue_stalled_wait_per_warp_active.pct

stall_inst_fetch

smsp__warp_issue_stalled_no_instruction_per_warp_active.pct

stall_memory_dependency

smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct

stall_memory_throttle

smsp__warp_issue_stalled_drain_per_warp_active.pct + smsp__warp_issue_stalled_lg_throttle_per_warp_active.pct

stall_not_selected

smsp__warp_issue_stalled_not_selected_per_warp_active.pct

stall_other

smsp__warp_issue_stalled_misc_per_warp_active.pct + smsp__warp_issue_stalled_dispatch_stall_per_warp_active.pct

stall_pipe_busy

smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct + smsp__warp_issue_stalled_math_pipe_throttle_per_warp_active.pct

stall_sleeping

smsp__warp_issue_stalled_sleeping_per_warp_active.pct

stall_sync

smsp__warp_issue_stalled_membar_per_warp_active.pct + smsp__warp_issue_stalled_barrier_per_warp_active.pct

stall_texture

smsp__warp_issue_stalled_tex_throttle_per_warp_active.pct

surface_atomic_requests

l1tex__t_requests_pipe_tex_mem_surface_op_atom.sum

surface_load_requests

l1tex__t_requests_pipe_tex_mem_surface_op_ld.sum

surface_reduction_requests

l1tex__t_requests_pipe_tex_mem_surface_op_red.sum

surface_store_requests

l1tex__t_requests_pipe_tex_mem_surface_op_st.sum

sysmem_read_bytes

lts__t_sectors_aperture_sysmem_op_read* 32

sysmem_read_throughput

lts__t_sectors_aperture_sysmem_op_read.sum.per_second

sysmem_read_transactions

lts__t_sectors_aperture_sysmem_op_read.sum

sysmem_read_utilization

sysmem_utilization

sysmem_write_bytes

lts__t_sectors_aperture_sysmem_op_write * 32

sysmem_write_throughput

lts__t_sectors_aperture_sysmem_op_write.sum.per_second

sysmem_write_transactions

lts__t_sectors_aperture_sysmem_op_write.sum

sysmem_write_utilization

tensor_precision_fu_utilization

sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active

tex_cache_hit_rate

l1tex__t_sector_hit_rate.pct

tex_cache_throughput

tex_cache_transactions

l1tex__lsu_writeback_active.avg.pct_of_peak_sustained_active + l1tex__tex_writeback_active.avg.pct_of_peak_sustained_active

tex_fu_utilization

smsp__inst_executed_pipe_tex.avg.pct_of_peak_sustained_active

tex_sm_tex_utilization

l1tex__f_tex2sm_cycles_active.avg.pct_of_peak_sustained_elapsed

tex_sm_utilization

sm__mio2rf_writeback_active.avg.pct_of_peak_sustained_elapsed

tex_utilization

texture_load_requests

l1tex__t_requests_pipe_tex_mem_texture.sum

warp_execution_efficiency

smsp__thread_inst_executed_per_inst_executed.ratio

warp_nonpred_execution_efficiency

smsp__thread_inst_executed_per_inst_executed.pct

2.8.4. Events Mapping Table

The table below lists the CUPTI events for devices with compute capability 7.0. For each CUPTI event the closest equivalent Perfworks metric or formula is given. If no equivalent Perfworks metric is available the column is left blank. Note that there can be some difference in the values between the CUPTI event and the Perfworks metrics.

Table 5. Events Mapping Table from CUPTI events to Perfworks metrics for Compute Capability 7.0

CUPTI Event

Perfworks Metric or Formula

active_cycles

sm__cycles_active.sum

active_cycles_pm

sm__cycles_active.sum

active_cycles_sys

sys__cycles_active.sum

active_warps

sm__warps_active.sum

active_warps_pm

sm__warps_active.sum

atom_count

smsp__inst_executed_op_generic_atom_dot_alu.sum

elapsed_cycles_pm

sm__cycles_elapsed.sum

elapsed_cycles_sm

sm__cycles_elapsed.sum

elapsed_cycles_sys

sys__cycles_elapsed.sum

fb_subp0_read_sectors

dram__sectors_read.sum

fb_subp1_read_sectors

dram__sectors_read.sum

fb_subp0_write_sectors

dram__sectors_write.sum

fb_subp1_write_sectors

dram__sectors_write.sum

global_atom_cas

smsp__inst_executed_op_generic_atom_dot_cas.sum

gred_count

smsp__inst_executed_op_global_red.sum

inst_executed

sm__inst_executed.sum

inst_executed_fma_pipe_s0

smsp__inst_executed_pipe_fma.sum

inst_executed_fma_pipe_s1

smsp__inst_executed_pipe_fma.sum

inst_executed_fma_pipe_s2

smsp__inst_executed_pipe_fma.sum

inst_executed_fma_pipe_s3

smsp__inst_executed_pipe_fma.sum

inst_executed_fp16_pipe_s0

smsp__inst_executed_pipe_fp16.sum

inst_executed_fp16_pipe_s1

smsp__inst_executed_pipe_fp16.sum

inst_executed_fp16_pipe_s2

smsp__inst_executed_pipe_fp16.sum

inst_executed_fp16_pipe_s3

smsp__inst_executed_pipe_fp16.sum

inst_executed_fp64_pipe_s0

smsp__inst_executed_pipe_fp64.sum

inst_executed_fp64_pipe_s1

smsp__inst_executed_pipe_fp64.sum

inst_executed_fp64_pipe_s2

smsp__inst_executed_pipe_fp64.sum

inst_executed_fp64_pipe_s3

smsp__inst_executed_pipe_fp64.sum

inst_issued1

sm__inst_issued.sum

l2_subp0_read_sector_misses

lts__t_sectors_op_read_lookup_miss.sum

l2_subp1_read_sector_misses

lts__t_sectors_op_read_lookup_miss.sum

l2_subp0_read_sysmem_sector_queries

lts__t_sectors_aperture_sysmem_op_read.sum

l2_subp1_read_sysmem_sector_queries

lts__t_sectors_aperture_sysmem_op_read.sum

l2_subp0_read_tex_hit_sectors

lts__t_sectors_srcunit_tex_op_read_lookup_hit.sum

l2_subp1_read_tex_hit_sectors

lts__t_sectors_srcunit_tex_op_read_lookup_hit.sum

l2_subp0_read_tex_sector_queries

lts__t_sectors_srcunit_tex_op_read.sum

l2_subp1_read_tex_sector_queries

lts__t_sectors_srcunit_tex_op_read.sum

l2_subp0_total_read_sector_queries

lts__t_sectors_op_read.sum + lts__t_sectors_op_atom.sum + lts__t_sectors_op_red.sum

l2_subp1_total_read_sector_queries

lts__t_sectors_op_read.sum + lts__t_sectors_op_atom.sum + lts__t_sectors_op_red.sum

l2_subp0_total_write_sector_queries

lts__t_sectors_op_write.sum + lts__t_sectors_op_atom.sum + lts__t_sectors_op_red.sum

l2_subp1_total_write_sector_queries

lts__t_sectors_op_write.sum + lts__t_sectors_op_atom.sum + lts__t_sectors_op_red.sum

l2_subp0_write_sector_misses

lts__t_sectors_op_write_lookup_miss.sum

l2_subp1_write_sector_misses

lts__t_sectors_op_write_lookup_miss.sum

l2_subp0_write_sysmem_sector_queries

lts__t_sectors_aperture_sysmem_op_write.sum

l2_subp1_write_sysmem_sector_queries

lts__t_sectors_aperture_sysmem_op_write.sum

l2_subp0_write_tex_hit_sectors

lts__t_sectors_srcunit_tex_op_write_lookup_hit.sum

l2_subp1_write_tex_hit_sectors

lts__t_sectors_srcunit_tex_op_write_lookup_hit.sum

l2_subp0_write_tex_sector_queries

lts__t_sectors_srcunit_tex_op_write.sum

l2_subp1_write_tex_sector_queries

lts__t_sectors_srcunit_tex_op_write.sum

not_predicated_off_thread_inst_executed

smsp__thread_inst_executed_pred_on.sum

pcie_rx_active_pulse

pcie_tx_active_pulse

prof_trigger_00

prof_trigger_01

prof_trigger_02

prof_trigger_03

prof_trigger_04

prof_trigger_05

prof_trigger_06

prof_trigger_07

inst_issued0

smsp__issue_inst0.sum

sm_cta_launched

sm__ctas_launched.sum

shared_load

smsp__inst_executed_op_shared_ld.sum

shared_store

smsp__inst_executed_op_shared_st.sum

generic_load

smsp__inst_executed_op_generic_ld.sum

generic_store

smsp__inst_executed_op_generic_st.sum

global_load

smsp__inst_executed_op_global_ld.sum

global_store

smsp__inst_executed_op_global_st.sum

local_load

smsp__inst_executed_op_local_ld.sum

local_store

smsp__inst_executed_op_local_st.sum

shared_atom

smsp__inst_executed_op_shared_atom.sum

shared_atom_cas

smsp__inst_executed_op_shared_atom_dot_cas.sum

shared_ld_bank_conflict

l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum

shared_st_bank_conflict

l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum

shared_ld_transactions

l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum

shared_st_transactions

l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum

tensor_pipe_active_cycles_s0

smsp__pipe_tensor_cycles_active.sum

tensor_pipe_active_cycles_s1

smsp__pipe_tensor_cycles_active.sum

tensor_pipe_active_cycles_s2

smsp__pipe_tensor_cycles_active.sum

tensor_pipe_active_cycles_s3

smsp__pipe_tensor_cycles_active.sum

thread_inst_executed

smsp__thread_inst_executed.sum

warps_launched

smsp__warps_launched.sum

2.9. Migration to the Profiling API

The CUPTI event APIs from the header cupti_events.h and metric APIs from the header cupti_metrics.h will be deprecated in a future CUDA release. The NVIDIA Volta platform is the last architecture on which these APIs are supported. These are being replaced by the Profiling API in the header cupti_profiler_target.h and Perfworks Metric API in the headers nvperf_host.h and nvperf_target.h. These provide low and deterministic profiling overhead on the target system. These APIs also have other significant enhancements such as:

GPU architectures supported by different CUPTI APIs are listed at the table. Both the event and metric APIs and the profiling APIs are supported for Volta. This is to enable transition of code to the profiling APIs. But one cannot mix the usage of the event and metric APIs and the profiling APIs.

The Profiling APIs are supported on all CUDA supported platforms except Android.

It is important to note that for support of future GPU architectures and feature improvements (such as performance overhead reduction and additional performance metrics), users should use the Profiling APIs. There are few features which are not supported by Profiling APIs, refer to the section for differences from event and metric APIs.

However note that there are no changes to the CUPTI Activity and Callback APIs and these will continue to be supported for the current and future GPU architectures.

2.10. CUPTI PC Sampling API

A new set of CUPTI APIs for PC sampling data collection are provided in the header file cupti_pcsampling.h which support continuous mode data collection without serializing kernel execution and have a lower runtime overhead. Along with these a utility library is provided in the header file cupti_pcsampling_util.h which has APIs for GPU assembly to CUDA-C source correlation and for reading and writing the PC sampling data from/to files.

The PC Sampling APIs are supported on all CUDA supported platforms. These are supported on Volta and later GPU architectures, i.e. devices with compute capability 7.0 and higher.

Overview of Features:

  • Two sampling modes – Continuous (concurrent kernels) or Serialized (one kernel at a time)​.

  • Option to select stall reasons to collect.​

  • Ability to collect GPU PC sampling data for entire application duration or for specific CPU code ranges (defined by start and stop APIs).​

  • API to flush GPU PC sampling data.​

  • APIs to support Offline and Runtime correlation of GPU PC samples to CUDA C source lines and GPU assembly instructions​.

Samples are provided to demonstrate how to write the injection library to collect the PC sampling information, and how to parse the generated files using the utility APIs to print the stall reasons counter values and associate those with the GPU assembly instructions and CUDA-C source code. Refer to the samples pc_sampling_continuous, pc_sampling_utility and pc_sampling_start_stop.

Note

PC Sampling APIs from the header cupti_activity.h would be referred as PC Sampling Activity APIs and APIs from the header cupti_pcsampling.h would be referred as PC Sampling APIs.

2.10.1. Configuration Attributes

The following table lists the PC sampling configuration attributes which can be set using the cuptiPCSamplingSetConfigurationAttribute() API.

Table 6. PC Sampling Configuration Attributes

Configuration Attribute

Description

Default Value

Comparison of PC Sampling APIs with CUPTI PC Sampling Activity APIs

Guideline to Tune Configuration Option

Collection mode

PC Sampling collection mode - Continuous or Kernel Serialized

Continuous

Continuous mode is new.

Kernel Serialized mode is equivalent to the kernel level functionality provided by the CUPTI PC sampling Activity APIs.

Sampling period

Sampling period for PC Sampling. Valid values for the sampling periods are between 5 to 31 both inclusive. This will set the sampling period to (2^samplingPeriod) cycles.

e.g. for sampling period = 5 to 31, cycles = 32, 64, 128,…, 2^31

CUPTI defined value is based on number of SMs

Dropped current support for 5 levels(MIN, LOW, MID, HIGH, MAX) for sampling period.

The new “sampling period” is equivalent to the “samplingPeriod2” field in CUpti_ActivityPCSamplingConfig.

Low sampling period means a high sampling frequency which can result in dropping of samples. Very high sampling period can cause low sampling frequency and no sample generation.

Stall reason

Stall reasons to collect

Input is a pointer to an array of the stall reason indexes to collect.

All stall reasons will be collected

With the CUPTI PC sampling Activity APIs there is no option to select which stall reasons to collect. Also the list of supported stall reasons has changed.

Scratch buffer size

Size of SW buffer for raw PC counter data downloaded from HW buffer.

Approximately it takes 16 Bytes (and some fixed size memory) to accommodate one PC with one stall reason

e.g. 1 PC with 1 stall reason = 32 Bytes

1 PC with 2 stall reason = 48 Bytes

1 PC with 4 stall reason = 96 Bytes

1 MB

(which can accommodate approximately 5500 PCs with all stall reasons)

New

Clients can choose scratch buffer size as per memory budget. Very small scratch buffer size can cause runtime overhead as more iterations would be required to accommodate and process more PC samples

Hardware buffer size

Size of HW buffer in bytes.

If sampling period is too less, HW buffer can overflow and drop PC data

512 MB

New

Device accessible buffer for samples. Less hardware buffer size with low sampling periods, can cause overflow and dropping of PC data. High hardware buffer size can impact application execution due to lower amount of device memory being available

Enable start/stop control

Control over PC Sampling data collection range.

1 - Allows user to start and stop PC Sampling using APIs

0 (disabled)

New

2.10.2. Stall Reasons Mapping Table

The table below lists the stall reasons mapping from PC Sampling Activity APIs to PC Sampling APIs. Note: Stall reasons with suffix _not_issued represents latency samples. These samples indicate that no instruction was issued in that cycle from the warp scheduler from where the warp was sampled.

Table 7. Stall Reasons Mapping Table from PC Sampling Activity APIs to PC Sampling APIs

PC Sampling Activity API Stall Reasons

(common prefix: CUPTI_ACTIVITY_PC_SAMPLING_STALL_)

PC Sampling API Stall Reasons

(common prefix: smsp__pcsamp_warps_issue_stalled_)

NONE

selected

selected_not_issued

INST_FETCH

branch_resolving

branch_resolving_not_issued

no_instructions

no_instructions_not_issued

EXEC_DEPENDENCY

short_scoreboard

short_scoreboard_not_issued

wait

wait_not_issued

MEMORY_DEPENDENCY

long_scoreboard

long_scoreboard_not_issued

TEXTURE

tex_throttle

tex_throttle_not_issued

SYNC

barrier

barrier_not_issued

membar

membar_not_issued

CONSTANT_MEMORY_DEPENDENCY

imc_miss

imc_miss_not_issued

PIPE_BUSY

mio_throttle

mio_throttle_not_issued

math_pipe_throttle

math_pipe_throttle_not_issued

MEMORY_THROTTLE

drain

drain_not_issued

lg_throttle

lg_throttle_not_issued

NOT_SELECTED

not_selected

not_selected_not_issued

OTHER

misc

misc_not_issued

dispatch_stall

dispatch_stall_not_issued

SLEEPING

sleeping

sleeping_not_issued

For PC Sampling APIs, total (smsp__pcsamp_sample_count) and dropped (smsp__pcsamp_samples_data_dropped) sample counts are collected by default.

2.10.3. Data Structure Mapping Table

The table below lists the data structure mapping from PC Sampling Activity APIs to PC Sampling APIs.

Table 8. Data structure Mapping Table from PC Sampling Activity APIs to PC Sampling APIs

PC Sampling Activity API structures

PC Sampling API structures

CUpti_ActivityPCSamplingConfig

CUpti_PCSamplingConfigurationInfo

CUpti_ActivityPCSamplingStallReason

CUpti_PCSamplingStallReason

Refer Stall Reasons Mapping Table

CUpti_ActivityPCSampling3

CUpti_PCSamplingPCData

CUpti_ActivityPCSamplingRecordInfo

CUpti_PCSamplingData

2.10.4. Data flushing

CUPTI clients can periodically flush GPU PC sampling data using the API cuptiPCSamplingGetData(). Besides periodic flushing of GPU PC sampling data, CUPTI clients need to also flush the GPU PC sampling data at the following points to maintain the uniqueness of PCs:

  • For continuous collection mode CUPTI_PC_SAMPLING_COLLECTION_MODE_CONTINUOUS - after each module load-unload-load sequence.

  • For serialized collection mode CUPTI_PC_SAMPLING_COLLECTION_MODE_KERNEL_SERIALIZED - after completion of each kernel.

  • For range profiling using the configuration option CUPTI_PC_SAMPLING_CONFIGURATION_ATTR_TYPE_ENABLE_START_STOP_CONTROL - at the end of the range i.e. after cuptiPCSamplingStop() API.

If application is profiled in the continuous collection mode with range profiling disabled, and there is no module unload, CUPTI clients can collect data in two ways:

  • By using cuptiPCSamplingGetData() API periodically.

  • By using cuptiPCSamplingDisable() on application exit and reading GPU PC sampling data from sampling data buffer passed during configuration.

Note

In case, cuptiPCSamplingGetData() API is not called periodically, the sampling data buffer passed during configuration should be big enough to hold the data for all the PCs.

Note

Field remainingNumPcs of the struct CUpti_PCSamplingData helps in identifying the number of PC records available with CUPTI. User can adjust the periodic flush interval based on it. Further user need to ensure that all remaining records can be accommodated in the sampling data buffer passed during configuration before disabling the PC sampling.

2.10.5. SASS Source Correlation

Building SASS source correlation for a PC can be split into two parts:

  • Correlation of a PC to a SASS instruction - PC to SASS correlation is done during PC sampling at run time and the SASS data is available in the PC record. Fields cubinCrc, pcOffset and functionName in the PC record help in correlation of a PC with a SASS instruction. You can extract cubins from the application executable or library using the cuobjdump utility by executing the command cuobjdump -xelf all exe/lib. The cuobjump utility version should match with the CUDA Toolkit version used to build the CUDA application executable or library files. You can find the cubinCrc for extracted cubins using the cuptiGetCubinCrc() API. With the help of cubinCrc you can find out the cubin to which a PC belongs. The cubin can be disassembled using the nvdisasm utility that comes with the CUDA toolkit.

  • Correlation of a SASS instruction to a CUDA source line - Correlation of GPU PC samples to CUDA C source lines can be done offline as well as at runtime with the help of the cuptiGetSassToSourceCorrelation() API.

JIT compiled cubins - In case of JIT compiled cubins, it is not possible to extract the cubin from the executable or library. For this case one can subscribe to one of the CUPTI_CBID_RESOURCE_MODULE_LOADED or CUPTI_CBID_RESOURCE_MODULE_UNLOAD_STARTING or CUPTI_CBID_RESOURCE_MODULE_PROFILED callbacks. It returns a CUpti_ModuleResourceData structure having the CUDA binary. This binary can be stored in a file and can be used for offline CUDA C source correlation.

2.10.6. API Usage

Here is a pseudo code which shows how to collect the PC sampling data for specific CPU code ranges:

void Collection()
{
    // Select collection mode
    CUpti_PCSamplingConfigurationInfoParams pcSamplingConfigurationInfoParams = {};

    CUpti_PCSamplingConfigurationInfo collectionMode = {};
    collectionMode.attributeData.collectionModeData.collectionMode = CUPTI_PC_SAMPLING_COLLECTION_MODE_CONTINUOUS;

    pcSamplingConfigurationInfoParams.numAttributes = 1;
    pcSamplingConfigurationInfoParams.pPCSamplingConfigurationInfo = &collectionMode;
    cuptiPCSamplingSetConfigurationAttribute(&pcSamplingConfigurationInfoParams);

    // Select stall reasons to collect
    {
        // Get number of supported stall reasons
        cuptiPCSamplingGetNumStallReasons();
        // Get number of supported stall reason names and corresponding indexes
        cuptiPCSamplingGetStallReasons();
        // Set selected stall reasons
        cuptiPCSamplingSetConfigurationAttribute();
    }

    // Select code range using start/stop APIs
    // Opt-in for start and stop PC Sampling using APIs cuptiPCSamplingStart and cuptiPCSamplingStop
    CUpti_PCSamplingConfigurationInfo enableStartStop = {};
    enableStartStop.attributeType = CUPTI_PC_SAMPLING_CONFIGURATION_ATTR_TYPE_ENABLE_START_STOP_CONTROL;
    enableStartStop.attributeData.enableStartStopControlData.enableStartStopControl = true;

    pcSamplingConfigurationInfoParams.numAttributes = 1;
    pcSamplingConfigurationInfoParams.pPCSamplingConfigurationInfo = &enableStartStop;
    cuptiPCSamplingSetConfigurationAttribute(&pcSamplingConfigurationInfoParams);

    // Enable PC Sampling
    cuptiPCSamplingEnable();

    kernelA <<<blocks, threads, 0, s0>>>(...);                  // KernelA is not sampled

    // Start PC sampling collection
    cuptiPCSamplingStart();
    {
        // KernelB and KernelC might run concurrently since 'continuous' sampling collection mode is selected
        kernelB <<<blocks, threads, 0, s0>>>(...);              // KernelB is sampled
        kernelC <<<blocks, threads, 0, s1>>>(...);              // KernelC is sampled
    }
    // Stop PC sampling collection
    cuptiPCSamplingStop();
    // Flush PC sampling data
    cuptiPCSamplingGetData();

    kernelD <<<blocks, threads, 0, s0>>>(...);                  // KernelD is not sampled

    // Start PC sampling collection
    cuptiPCSamplingStart();
    {
        kernelE <<<blocks, threads, 0, s0>>>(...);              // KernelE is sampled
    }
    // Stop PC sampling collection
    cuptiPCSamplingStop();
    // Flush PC sampling data
    cuptiPCSamplingGetData();

    // Disable PC Sampling
    cuptiPCSamplingDisable();
}

2.10.7. Limitations

Known limitations and issues:

  • PC Sampling APIs don’t support simultaneous sampling of multiple CUDA contexts on a GPU. However, simultaneous sampling of single CUDA context per GPU is supported. Before enabling and configuring the PC sampling on a different CUDA context on the same GPU, PC sampling needs to be disabled on the other context.

2.11. CUPTI SASS Metric API

The SASS metric APIs support collecting metric data at SASS assembly instruction level. These support a larger set of SASS instruction level metrics compared to the CUPTI Activity APIs. The set of sass metrics supported for each GPU architecture can be queried. These APIs are supported on Volta and later GPU architectures, i.e. devices with compute capability 7.0 and higher.

These APIs support SASS instruction to CUDA C source line correlation in offline mode. Hence the runtime overhead during data collection is lower.

2.11.1. API usage

  • Enumerate metrics: Use the API cuptiSassMetricsGetNumOfMetrics() for the number of metrics supported by the chip. Then allocate the buffer of type CUpti_SassMetrics_MetricDetails and pass it to the API cuptiSassMetricsGetMetrics() where CUPTI will list out all the SASS metrics and put it in the user-allocated buffer.

  • Create config image: For all the selected SASS metrics, create a list of CUpti_SassMetrics_Config structures. For creating the config buffer for a metric we need the metric id and the output granularity for the metric. The metric id can be queried by using the API cuptiSassMetricsGetProperties(). The output granularity tells at what level data will be collected. CUPTI supports collection at three levels -

    • CUPTI_SASS_METRICS_OUTPUT_GRANULARITY_GPU (at GPU level),

    • CUPTI_SASS_METRICS_OUTPUT_GRANULARITY_SM (at Streaming Multiprocessor level, the metric instance count will be the number of SMs present in the chip),

    • CUPTI_SASS_METRICS_OUTPUT_GRANULARITY_SMSP (SM sub-partition level, the number of instances will be the sum of all the SMSP present in the chip i.e num of SMs * num of sub-partitions in each SM)

  • Set config for the CUDA device: API cuptiSassMetricsSetConfig() should be used for setting the config on the device for SASS metrics collection. This API takes the device index and list of CUpti_SassMetrics_Config structs as input parameters. Then set the config for the device on which the kernel is running else CUPTI will report a CUPTI_ERROR_INVALID_OPERATION error.

  • Enable SASS metric profiling: After setting the config for the CUDA device one needs to enable SASS patching for the context on which the kernel will be launched using the API cuptSassMetricsEnable(). CUPTI provides control over when the kernel will be patched. For Lazy patching mode, CUPTI will only patch the kernel at the first launch instance and then unpatch the kernel when the API cuptiSassMetricsDisable is called. Otherwise, CUPTI will patch all the kernels in the module for the context, regardless of whether kernels would be launched in the enable/disable range. Set the enableLazyPatching flag to enable the lazy patching mode for profiling. Lazy patching is suitable for applications that have a large number of kernels in the module and a small set of kernels are launched.

  • Flush SASS metric profiling data: Once kernel execution is completed, metric data is stored in an internal format. One needs to query the size of the buffer to store the metrics data. API cuptiSassMetricsGetDataProperties() can be used to query the number of patched instructions and the number of hardware instances. Then allocate the buffer based on retrieved data, where CUPTI will flush the profiled metric data. For flushing the data, call the API cuptiSassMetricsFlushData().

  • Disable SASS metric profiling: Once the profiling of the kernel is done, call the API cuptiSassMetricsDisable() for resetting the patched kernel and remove all the profiled metric data which has been collected for the kernels. One thing to note is that CUPTI will remove all the metric data which has been collected for kernels launched since the API cuptiSassMetricsFlushData() call. So it is the user’s responsibility to call flush data API for retrieving all the metric data. Calling API cuptiSassMetricsFlushData() after cuptiSassMetricsDisable() will report the error CUPTI_ERROR_INVALID_OPERATION.

  • Unset configuration for the CUDA device: CUPTI maintains internal state for each CUDA device for which SASS metric collection is enabled. API cuptiSassMetricsUnsetConfig() should be called to clean-up the state. This API should be called for each device for which SASS metric collection has been configured.

2.11.2. Sample code

CUPTI sample sass_metric has two core functions – function ListSupportedMetrics() shows how to enumerate all metrics supported by the chip and function CollectSassMetrics() show how to collect SASS metrics. Code snippet for enumerating SASS metrics (refer the ListSupportedMetrics() function in the CUPTI sass_metric sample):

CUpti_Device_GetChipName_Params getChipParams{ CUpti_Device_GetChipName_Params_STRUCT_SIZE };
cuptiDeviceGetChipName(&getChipParams);

CUpti_SassMetrics_GetNumOfMetrics_Params getNumOfMetricParams;
getNumOfMetricParams.pChipName = getChipParams.pChipName;
cuptiSassMetricsGetNumOfMetrics(&getNumOfMetricParams);

std::vector<CUpti_SassMetrics_MetricDetails> supportedMetrics(getNumOfMetricParams.numOfMetrics);
CUpti_SassMetrics_GetMetrics_Params getMetricsParams {CUpti_SassMetrics_GetMetrics_Params_STRUCT_SIZE};
getMetricsParams.pChipName = getChipParams.pChipName;
getMetricsParams.pMetricsList = supportedMetrics.data();
getMetricsParams.numOfMetrics = supportedMetrics.size();
cuptiSassMetricsGetMetrics(&getMetricsParams);
for (size_t i = 0; i < supportedMetrics.size(); ++i)
{
    std::cout << "Metric Name: " << supportedMetrics[i].pMetricName
            << ", MetricID: " << supportedMetrics[i].metricId
            << ", Metric Description: " << supportedMetrics[i].pMetricDescription << "\n";
}

Code snippet for collecting SASS metrics (refer the CollectSassMetrics() function in the CUPTI sass_metric sample):

cuptiSassMetricsSetConfig();

// Enable SASS Patching
sassMetricsEnableParams.enableLazyPatching = 1;
cuptiSassMetricsEnable();

// As lazy patching has been enabled, VectorAdd will be patched here at the first launch instance
VectorAdd<<<gridSize, blockSize>>>();

cuptiSassMetricsGetDataProperties();

if (getDataPropParams.numOfInstances != 0 && getDataPropParams.numOfPatchedInstructionRecords != 0)
{
    // allocate memory for getting patched data.

    flushDataParams.numOfInstances = getDataPropParams.numOfInstances;
    flushDataParams.numOfPatchedInstructionRecords = getDataPropParams.numOfPatchedInstructionRecords;
    flushDataParams.pMetricsData =
            (CUpti_SassMetrics_Data*)malloc(getDataPropParams.numOfPatchedInstructionRecords * sizeof(CUpti_SassMetrics_Data));

    for (size_t recordIndex = 0;
         recordIndex < getDataPropParams.numOfPatchedInstructionRecords;
         ++recordIndex)
    {
        flushDataParams.pMetricsData[recordIndex].pInstanceValues =
            (CUpti_SassMetrics_InstanceValue*) malloc(getDataPropParams.numOfInstances * sizeof(CUpti_SassMetrics_InstanceValue));
    }

    cuptiSassMetricsFlushData();
    // Store the data for post-processing the data (e.g. SASS to source correlation)
    // Cleanup memory
}

// As this is the first VectorSub launch, the patching will be done here.
VectorSub<<<gridSize, blockSize>>>();

// As cuptiSassMetricsFlushData() API is not called, VectorSub SASS metric data will be discarded.
// All the kernels which were patched earlier will be reset to its original state.
cuptiSassMetricsDisable();

// VectorMultiply function will not get patched as it is called outside the enable/disable range.
VectorMultiply<<<gridSize, blockSize>>>();

cuptiSassMetricsUnsetConfig();

2.12. CUPTI PM Sampling API

In the CUDA 12.6 release, CUPTI introduced new PM sampling APIs which are included in the header file cupti_pmsampling.h for collecting a set of metrics by sampling the GPU’s performance monitors (PM) periodically at fixed intervals. Each sample is composed of metric values and the GPU timestamp when it was collected in nanoseconds.

These APIs are supported on Turing and later GPU architectures, i.e. devices with compute capability 7.5 and higher.

PM sampling follows a similar approach to range profiling, where the process is divided into 2 types of operations i.e. host (enumeration, configuration, evaluation) and target(collection).

2.12.1. API usage

  • Enumerate metrics (enumeration):

    CUPTI released a new set of host APIs with cuptiProfilerHost prefix, where users need to create a profiler host object for all the host operations. For PM sampling specific host operation, users need to set the profilerType to CUPTI_PROFILER_TYPE_PM_SAMPLING in the CUpti_Profiler_Host_Initialize_Params object.

    CUPTI has following host APIs for enumerating metrics and it properties:

    • cuptiProfilerHostGetBaseMetrics() for listing base metrics for a metric type (counter, throughput and ratio).

    • cuptiProfilerHostGetSubMetrics() for listing the submetric for a metric.

    • cuptiProfilerHostGetMetricProperties() for querying the details about a metric like associated hardware unit, metric type and a short description about the metric.

      CUPTI lists some of the useful metrics in the PM Sampling Metric Table which can be used for initial metric selection which list out various GPU and its component attributes like SM active cycles, GPC and SYS clock frequency and many more.

  • Create config image (configuration):

    Similar to range profiler, for collecting PM sampling data users need to create a config image blob which will have scheduling information for metrics which were selected for collection. As configuration is a host operation similar to enumeration, users need to initialize the profiler host object before calling any of the configuration APIs.

    For creating a config image, CUPTI exposes new profiler host APIs like cuptiProfilerHostConfigAddMetrics() API where users will pass the list of metrics as input and then call cuptiProfilerHostGetConfigImageSize() API for getting the size of config image which user need to allocate and finally call the cuptiProfilerHostGetConfigImage() API where users can pass the allocated buffer for storing the scheduling information in the config image.

    CUPTI also adds another optional API i.e. cuptiProfilerHostGetNumOfPasses() for checking the number of passes required for collecting the sampling data for a given config image.

    Note

    Config images which need more than one pass for collecting sampling data are not supported.

  • Collecting Sampling Data (collection):

    This operation instructs CUPTI to begin collecting sampling data on a CUDA device at specific intervals or cycles, determined by the trigger type specified in the cuptiPmSamplingSetConfig() API.

    Collection phase can be divided into 6 subparts:

    • Enable PM Sampling:

      This is the entry point of the PM sampling process where the user passes the device index on which sampling data will be collected. Use the cuptiPmSamplingEnable() API to create a CUpti_PmSampling_Object object. This stores all the intermediate data and act as an identifier for other target APIs.

    • Set configuration:

      CUPTI has cuptiPmSamplingSetConfig() API for customizing configuration to the PM sampling process like hardware buffer size where the raw sampling data will be stored, sampling interval specifies the frequency at which sampling triggers will collect the sampling data. This will vary depending on the trigger mode set in the config API. Along with these parameters users need to pass the config image which has the scheduling information, which has been created earlier in the configuration phase.

      The maximum sampling frequency without buffer overflow events depends on GPU (SM count), GPU load intensity, and overall system load. The bigger the chip and the higher the load, the lower the maximum frequency. If you need higher frequency, you can increase it until you get the overflow event which can be queried while decoding the pm sampling data using cuptiPmSamplingDecodeData() API.

      CUPTI supports two trigger modes, GPU_SYSCLK_INTERVAL which is based on sys clock frequency and the sample intervals are in terms of clock cycles. And the 2nd one is GPU_TIME_INTERVAL which has fixed frequency and the intervals are in terms of nanoseconds.

      Note

      The GPU_TIME_INTERVAL trigger is not supported in Turing and GA100 chips.

    • Start PM Sampling:

      After enabling and setting up the configuration for the PM sampling, users need to call the cuptiPmSamplingStart() API which signals CUPTI to start the collection, the raw sampling data will be stored in the hardware buffer.

    • Stop PM Sampling:

      Users need to call the cuptiPmSamplingStop() API for stopping the collection of sampling data.

    • Decode PM Sampling data:

      While collection phase all the raw sampling data will be stored in the hardware buffer. CUPTI exposes cuptiPmSamplingDecodeData() API which decodes the raw data and stores it in a counter data image which users need to pass into the API as input. For creating the counter data image refer to this.

      It is users responsibility to call this decode API for freeing up the hardware buffer for allowing new raw data to get stored in the hardware buffer. This API also outputs some attributes like hardware buffer overflow status, decode stop reasons like end of all the raw data or if the counter data image passed is full. So for long running workload users can call this decode API between the Start and Stop API. The ideal way would be calling it in a separate thread. Refer pm_sampling public sample which shows the decode operation running in parallel with the collection.

    • Disable PM Sampling:

      For destroying all the resources allocated for PM sampling and ending the PM sampling users can call the cuptiPmSamplingDisable() API.

    • Create Counter data image:

      For storing the decoded data and using it in the evaluation phase users need to allocate a buffer which CUPTI refer as counter data image. Creating a counter data image is a target operation and should be done after enabling the PM sampling and before calling the decode API call. For creating the counter buffer image, first users need to call cuptiPmSamplingGetCounterDataSize() API for getting the size of buffer needed for allocation. Once users allocate the buffer the buffer needs to be in counter data format where the samples will be stored so to initialize the buffer users have to call cuptiPmSamplingCounterDataImageInitialize() API. This same API can also be used to reset the counter buffer image.

  • Evaluating Counter Data (evaluation):

    Once the raw data is decoded to counter buffer image, users need to use profiler host APIs for evaluating the counter data for getting sample data in readable format. Users can query the number of completed samples in the counter data using the cuptiPmSamplingGetCounterDataInfo() API. For PM sampling each sample is defined by its start and end time stamps. For getting sample info like start and end timestamps CUPTI has cuptiPmSamplingCounterDataGetSampleInfo() API. The timestamps reported are CPU based time stamps. Then to get the collected metrics values for the sample, cuptiProfilerHostEvaluateToGpuValues() API is used.

2.12.2. Sample code

CUPTI sample pm_sampling has two core functions – function PmSamplingQueryMetrics() shows how to enumerate all metrics supported by the chip and function PmSamplingCollection() show how to collect PM sampling data for a list of metrics while launching CUDA workloads. Code snippet for enumerating supported PM sampling metrics (refer the PmSamplingQueryMetrics() function in the CUPTI pm_sampling sample):

CUpti_Device_GetChipName_Params getChipParams{ CUpti_Device_GetChipName_Params_STRUCT_SIZE };
cuptiDeviceGetChipName(&getChipParams);

CUpti_Profiler_Host_Initialize_Params hostInitializeParams = {CUpti_Profiler_Host_Initialize_Params_STRUCT_SIZE};
hostInitializeParams.profilerType = CUPTI_PROFILER_TYPE_PM_SAMPLING;
hostInitializeParams.pChipName = m_chipName.c_str();
hostInitializeParams.pCounterAvailabilityImage = counterAvailibilityImage.data();
cuptiProfilerHostInitialize(&hostInitializeParams);
m_pHostObject = hostInitializeParams.pHostObject;

for (size_t metricTypeIndex = 0; metricTypeIndex < CUPTI_METRIC_TYPE__COUNT; ++metricTypeIndex)
{
    CUpti_Profiler_Host_GetBaseMetrics_Params getBaseMetricsParams {CUpti_Profiler_Host_GetBaseMetrics_Params_STRUCT_SIZE};
    getBaseMetricsParams.pHostObject = m_pHostObject;
    getBaseMetricsParams.metricType = (CUpti_MetricType)metricTypeIndex;
    cuptiProfilerHostGetBaseMetrics(&getBaseMetricsParams);

    for (size_t metricIndex = 0; metricIndex < getBaseMetricsParams.numMetrics; ++metricIndex) {
        metricsList.push_back(getBaseMetricsParams.ppMetricNames[metricIndex]);
    }
}

CUpti_Profiler_Host_Deinitialize_Params deinitializeParams = {CUpti_Profiler_Host_Deinitialize_Params_STRUCT_SIZE};
deinitializeParams.pHostObject = m_pHostObject;
cuptiProfilerHostDeinitialize(&deinitializeParams);

Code snippet for collecting PM sampling data (refer the PmSamplingCollection() function in the CUPTI pm_sampling sample):

void PmSamplingCollection()
{
    // 1. Create config image
    std::vector<uint8_t> configImage;
    CreateConfigImage(configImage, metrics);

    // 2. Enable PM sampling and set config for the PM sampling data collection.
    EnablePmSampling(deviceIndex);
    SetConfig(configImage, hardwareBufferSize, samplingInterval);

    // 3. Create counter data image
    std::vector<uint8_t> counterDataImage;
    CreateCounterDataImage(maxSamples, metrics, counterDataImage);

    VectorLaunchWorkLoad vectorWorkLoad;
    vectorWorkLoad.SetUp();

    // 4. Start the PM sampling and launch the CUDA workload
    StartPmSampling();

    // 5. Launch the kernel NUM_OF_ITERATIONS times
    const size_t NUM_OF_ITERATIONS = 100000;
    for (size_t ii = 0; ii < NUM_OF_ITERATIONS; ++ii)
    {
        vectorWorkLoad.LaunchKernel();
    }
    cudaDeviceSynchronize();

    // 6. Stop the PM sampling and join the decode thread
    StopPmSampling();

    // 7. Decode PM Sampling Data
    DecodeCounterData(counterDataImage);

    // 8. Print the sample ranges for the collected metrics
    PrintSampleRanges(counterDataImage);

    // 9. Disable PM Sampling
    DisablePmSampling();
}

// PrintSampleRanges function
void PrintSampleRanges(std::vector<uint8_t> counterDataImage)
{
    CUpti_PmSampling_GetCounterDataInfo_Params counterDataInfo {CUpti_PmSampling_GetCounterDataInfo_Params_STRUCT_SIZE};
    counterDataInfo.pCounterDataImage = counterDataImage.data();
    counterDataInfo.counterDataImageSize = counterDataImage.size();
    cuptiPmSamplingGetCounterDataInfo(&counterDataInfo);

    for (size_t sampleIndex = 0; sampleIndex < counterDataInfo.numCompletedSamples; ++sampleIndex)
    {
        pmSamplingHost.EvaluateCounterData(sampleIndex, metricsList, counterDataImage);
    }

    // For reusing the counter data image, reset the counter data image
    ResetCounterDataImage(counterDataImage);
}

2.12.3. Metrics Table

PM sampling supports the collection of a wide variety of metrics. The table below lists some useful metrics that provide insights into the utilization of different units in the GPU.

Table 9. PM Sampling Metrics Table

Metric Name

Metric details

gpc__cycles_elapsed.avg.per_second

GPC Clock Frequency:

The average GPC clock frequency in hertz.

sys__cycles_elapsed.avg.per_second

SYS Clock Frequency:

The average SYS clock frequency in hertz. The GPU front end (command processor), copy engines, and the performance monitor run at the SYS clock. On Turing and NVIDIA GA100 GPUs the sampling frequency is based upon a period of SYS clocks (not time) so samples per second will vary with SYS clock. On NVIDIA GA10x GPUs the sampling frequency is based upon a fixed frequency clock. The maximum frequency scales linearly with the SYS clock.

gr__cycles_active.sum.pct_of_peak_sustained_elapsed

GR Active:

The percentage of cycles the compute engine is active. The compute engine is active if there is any work in the compute pipe.

gr__dispatch_count.avg.pct_of_peak_sustained_elapsed

Dispatch Started:

The ratio of compute grid launches (dispatches) to the compute pipe to the maximum sustained rate of the compute pipe.

tpc__warps_inactive_sm_active_realtime.avg.pct_of_peak_sustained_elapsed

Active SM Unused Warp Slots:

The ratio of inactive warp slots on the SMs to the maximum number of warps per SM as a percentage. This is an indication of how many more warps may fit on the SMs if occupancy is not limited by a resource such as max warps of a shader type, shared memory, registers per thread, or thread blocks per SM.

tpc__warps_inactive_sm_idle_realtime.avg.pct_of_peak_sustained_elapsed

Idle SM Unused Warp Slots:

The ratio of inactive warps slots due to idle SMs to the the maximum number of warps per SM as a percentage.

This is an indicator that the current workload on the SM is not sufficient to put work on all SMs. This can be due to either CPU starving the GPU, current work is too small to saturate the GPU or current work is trailing off but blocking next work.

sm__cycles_active.avg.pct_of_peak_sustained_elapsed

SMs Active:

The ratio of cycles SMs had at least 1 warp in flight (allocated on SM) to the number of cycles as a percentage. A value of 0 indicates all SMs were idle (no warps in flight). A value of 50% can indicate some gradient between all SMs active 50% of the sample period or 50% of SMs active 100% of the sample period.

sm__inst_executed_realtime.avg.pct_of_peak_sustained_elapsed

SM Issue:

The ratio of cycles that SM sub-partitions (warp schedulers) issued an instruction to the number of cycles in the sample period as a percentage.

sm__pipe_tensor_cycles_active_realtime.avg.pct_of_peak_sustained_elapsed

Tensor Active:

The ratio of cycles the SM tensor pipes were active issuing tensor instructions to the number of cycles in the sample period as a percentage.

This metric is not available on Turing GPUs for periodic sampling.

sm__pipe_shared_cycles_active_realtime.avg.pct_of_peak_sustained_elapsed

Tensor Active / FP16 Active:

The ratio of cycles the SM tensor pipes or FP16x2 pipes were active issuing tensor instructions to the number of cycles in the sample period as a percentage.

This metric is only available for Turing GPUs for periodic sampling.

dramc__read_throughput.avg.pct_of_peak_sustained_elapsed

DRAM Read Bandwidth:

The ratio of cycles the DRAM interface was active reading data to the elapsed cycles in the same period as a percentage.

dramc__write_throughput.avg.pct_of_peak_sustained_elapsed

DRAM Write Bandwidth:

The ratio of cycles the DRAM interface was active writing data to the elapsed cycles in the same period as a percentage.

pcie__read_bytes.avg.pct_of_peak_sustained_elapsed

PCIe Read Throughput:

The ratio of bytes received on the PCIe interface to the maximum number of bytes receivable in the sample period as a percentage. The theoretical value is calculated based upon the PCIe generation and number of lanes.

pcie__write_bytes.avg.pct_of_peak_sustained_elapsed

PCIe Write Throughput:

The ratio of bytes transmitted on the PCIe interface to the maximum number of bytes receivable in the sample period as a percentage. The theoretical value is calculated based upon the PCIe generation and number of lanes.

nvlrx__bytes.avg.pct_of_peak_sustained_elapsed

NVLink bytes received:

The ratio of bytes received on the NVLink interface to the maximum number of bytes receivable in the sample period as a percentage.

nvltx__bytes.avg.pct_of_peak_sustained_elapsed

NVLink bytes transmitted:

The ratio of bytes transmitted on the NVLink interface to the maximum number of bytes transmittable in the sample period as a percentage.

pcie__rx_requests_aperture_bar1_op_read.sum pcie__rx_requests_aperture_bar1_op_write.sum

PCIe Read/Write Requests to BAR1:

BAR1 is a PCI Express (PCIe) interface used to allow the CPU or other devices to directly access GPU memory. The GPU normally transfers memory with its copy engines, which would not show up as BAR1 activity. The GPU drivers on the CPU do a small amount of BAR1 accesses, but heavier traffic is typically coming from other technologies.

2.13. CUPTI Checkpoint API

Starting with CUDA 11.5, CUPTI ships with a new library to assist tool developers who wish to replay kernels under direct control, such as tools using the Profiling API User Replay mode. This new Checkpoint library provides support for automatically saving and restoring device state for many common uses.

A device checkpoint is a managed copy of device functional state - including values in memory, along with some (but not all) other user visible state of the device. When a checkpoint is saved, this state is saved to internal buffers, preferentially using free device, then host, and finally filesystem space to save the data. The user tool maintains a handle to a checkpoint, and is able to restore the checkpoint with a single call, restoring the state so a kernel may be re-executed and expect to have the same device state as when the checkpoint was saved.

Once saved, a checkpoint may be restored any time including after multiple kernels have been launched, though currently there are limitations on which user calls (CUDA or driver API calls) have been validated to work between a Save and Restore. It currently is known safe to launch multiple kernels on a context and to do memcpy calls before restoring a checkpoint. Future versions of CUPTI will extend this to support additional API calls between a Save and Restore.

Checkpoints may be saved during injected kernel launch callbacks or directly coded into a target application.

Certain APIs are known to not work with the version of the Checkpoint API shipped with CUPTI 11.5, including Stream Capture mode.

2.13.1. Usage

There is one header for the library, cupti_checkpoint.h, which needs to be included, and libcheckpoint needs to be linked in to the application or injection library. Though the checkpoint library doesn’t depend on cupti, the error codes returned by the API are shared with cupti, so linking libcupti in is needed in order to translate the return codes to string representations.

The Checkpoint API follows a similar design to other CUPTI APIs. API behavior is controlled through a structure, CUpti_Checkpoint, which is initialized by a tool or application, then passed to cuptiCheckpointSave. If the call is successful, the structure saves a handle to a checkpoint. At this point, the application may make a series of calls which modify device state (kernels which update memory, memcopies, etc), and when the device state should be restored, the tool can use the same structure in calls to cuptiCheckpointRestore, and finally a call to cuptiCheckpointFree to release the resources used by the checkpoint object.

Multiple checkpoints may be saved at the same time. If multiple checkpoints exist, they operate entirely independently - each checkpoint consumes the full resources needed to restore the device state at the point it was saved. Order of operations between multiple checkpoints is not enforced by the API - while a common use for multiple checkpoints may be a nested pattern, it is also possible to interleave checkpoint operations.

Between a cuptiCheckpointSave and cuptiCheckpointRestore, any number of standard kernel launches (or equivalent API calls such as cuLaunchKernel) or memcpy calls may be made. Additionally, any host (cpu) side calls may be made that do not affect device state. It is possible that other CUDA or driver API calls may be made, but have not been validated with the 11.5 release.

Several options exist in the CUpti_Checkpoint structure. They must be set prior to the initial cuptiCheckpointSave using that structure. Any further changes to the structure are ignored until after a call to cuptiCheckpointFree, at which point the structure can be re-configured and re-used.

Important per-checkpoint options:

  • structSize - must be set to the value of CUpti_Checkpoint_STRUCT_SIZE

  • ctx - if NULL, the checkpoint will be of the default CUDA context, otherwise, specifies which context

  • reserveDeviceMB - Restrict a checkpoint save from using at least this much device memory

  • reserveHostMB - Restrict a checkpoint save from using at least this much host memory

  • allowOverwrite - It is normally an error to call Save using an existing checkpoint handle (one which has not been Freed). When set, this option allows the Save operation to be called multiple times on a handle. Note that when using this option, the CUpti_Checkpoint options are not re-read on any subsequent Save. To read new options, the handle must be passed to cuptiCheckpointFree prior to the cuptiCheckpointSave call.

  • optimizations - Bitmask of options for checkpoint behavior

    • CUPTI_CHECKPOINT_OPT_TRANSFER - Normally when restoring a checkpoint, all existing device memory at the time of the save is restored. This optimization adds a test to see whether a block of memory has changed before restoring it and caches the results for subsequent calls to Restore. Use of this option requires that all Restore calls be done at the same point in an application for a given checkpoint. As the optimization may be computationally expensive, it is most useful when there is a significant amount of data that can be skipped and there will be several calls to Restore the checkpoint.

2.13.2. Restrictions

Checkpoints API calls may not be made during a stream capture. They also may not be inserted into a graph. Beyond kernel launches (cuLaunchKernel, standard kernel<<<>>> launches, etc) and memcpy calls, the remaining CUDA and driver API calls have not been validated within a CheckpointSave and Restore region. Any other CUDA or driver API calls (example - device malloc or free) may work, or may cause undetermined behavior. Additional APIs will be validated to work with the Checkpoint API in future releases.

The Checkpoint API does not have visibility into which API calls have been made between cuptiCheckpointSave and cuptiCheckpointRestore calls, and may not be able to correctly detect error cases if unsupported calls have been made. In this case it is possible that device state may only be partially restored by cuptiCheckpointRestore, which may cause functionally incorrect behavior in subsequent device calls.

The Checkpoint API only restores functionally visible device state, not performance critical state. Some performance characteristics, such as state of the caches, will not be saved by a checkpoint, and saving or restoring a checkpoint may change the occupancy and alter performance for subsequent device calls.

The Checkpoint API makes no attempt to restore host (non-device) state, beyond freeing the resources it internally uses during a call to cuptiCheckpointFree.

The Checkpoint API by default uses device memory, host memory, and finally the filesystem to back up the device state. It is possible that addition of a cuptiCheckpointSave causes a later device allocation to fail due to the increased device memory usage. (Similarly, host memory is also used, and may be affected by a checkpoint). To allow the user to guarantee a certain amount of device or host memory remains available for later use, reserveDeviceMB and reserveHostMB fields in the CUpti_Checkpoint struct may be set. Use of these fields will guarantee that the device or host memory will leave that much memory free during a cuptiCheckpointSave call, but may cause the Checkpoint API call performance to degrade due to increased use of slower storage spaces.

2.13.3. Examples

The Checkpoint API does not require any other CUPTI calls. A simple use case could be to compare the output of three different implementations of a kernel. Pseudocode for this could look like:

CUpti_Checkpoint cp = { CUpti_Checkpoint_STRUCT_SIZE };

int kernel = 0;
do
{
  if (kernel == 0)
    cuptiCheckpointSave(&cp);
  else
    cuptiCheckpointRestore(&cp);

  if (kernel == 0)
    kernel_1<<<>>>(...);
  else if (kernel == 1)
    kernel_2<<<>>>(...);
  else if (kernel == 2)
    kernel_3<<<>>>(...);
} while (kernel++ < 3);

cuptiCheckpointFree(&cp);

In this example, even if any of the kernels modify their own input data, the subsequent passes through the loop will still run correctly - the modified input data would be restored by each call to cuptiCheckpointRestore before the next kernel runs. This is particularly useful when a programmer does not know the exact state of the device prior to a kernel call - the Checkpoint API ensures that all needed data is saved and restored, which would not otherwise be practical or perhaps even possible in some complex cases.

Another possible use case could be for fuzzing - randomly modifying input to a kernel, and ensuring it performs as expected. Instead of manually restoring device state to a known good point, the Checkpoint API and initialize a good state, and the fuzzer can modify only what is needed.

CUpti_Checkpoint cp = { CUpti_Checkpoint_STRUCT_SIZE };

int i = 0;
do
{
  if (i == 0)
    cuptiCheckpointSave(&cp);
  else
    cuptiCheckpointRestore(&cp);

  setup_test<<<>>>(i, ...);

  kernel<<<>>>(...);

  validate_result<<<>>>(i, ...);
} while (i++ < num_tests);

cuptiCheckpointFree(&cp);

Finally, the Checkpoint API is very useful for the User Replay mode of the CUPTI Profiling API. The User Replay mode can be very desireable as it allows kernels to run concurrently, which Kernel Replay mode does not, and only replays parts of the application which are within a performance region, unlike Applicatin Replay mode. However, in this mode, a kernel potentially needs to be launched multiple times in order to gather all requested metrics. This is complicated when the kernel may modify some of its own input data, and without the Checkpoint API, would require the tool developer to handle restoring any modified input data manually. It is difficult for a tool to automatically know whether any data needs to be restored before each iteration, or even what the existing state of the device is. Using the Checkpoint API, the tool can guarantee that input data will be restored each pass.

CUpti_Checkpoint cp = { CUpti_Checkpoint_STRUCT_SIZE };

// Pseudocode - assume all Profiling API structures are already initialized correctly
cuptiProfilerBeginSession(&beginSessionParams);
cuptiProfilerSetConfig(&setConfigParams);
int numPasses = 0;
bool lastPass = false;
do
{
  if (numPasses == 0)
    cuptiCheckpointSave(&cp);
  else
    cuptiCheckpointRestore(&cp);

  cuptiProfilerBeginPass(&beginPassParams);
  cuptiProfilerEnableProfiling(&enableProfilingParams);
  cuptiProfilerPushRange(&pushRangeParams);

  // Kernel launch on N separate streams - will be profiled while running concurrently
  kernel<<<..., stream0>>>(...);
  kernel<<<..., stream1>>>(...);
  ...
  kernel<<<..., streamN>>>(...);

  cudaStreamSynchronize(stream0);
  cudaStreamSynchronize(stream1);
  ...
  cudaStreamSynchronize(streamN);

  cuptiProfilerPopRange(&popRangeParams);
  cuptiProfilerDisableProfiling(&disableProfilingParams);
  lastPass = cuptiProfilerEndPass(&endPassParams);
} while (lastPass == false);
cuptiProfilerFlushCounterData(&flushCounterDataParams);
cuptiProfilerUnsetConfig(&unsetConfigParams);
cuptiProfilerEndSession(&endSessionParams);

In this example, the Profiler range will span all concurrently running kernels, which may modify their own input data - each pass through the loop will restore the initial values.

2.14. CUPTI overhead

CUPTI incurs overhead when used for tracing or profiling of the CUDA application. Overhead can vary significantly from one application to another. It largely depends on the density of the CUDA activities in the application; lesser the CUDA activities, less the CUPTI overhead. In general overhead of tracing i.e. activity APIs is much lesser than the profiling i.e. event and metric APIs.

2.14.1. Tracing Overhead

One of the goal of the tracing APIs is to provide a non-invasive collection of the timing information of the CUDA activities. Tracing is a low-overhead mechanism for collecting fine-grained runtime information.

2.14.1.1. Execution overhead

Consider below points which can affect the execution overhead of the application:

  • Enable only the activities and callbacks which are of interest.

  • Return from the callbacks as early as possible. Callbacks are issued from the host, these can block the work submission on the GPU if not returned early since CUPTI and thus the CUDA driver can’t make the forward progress on the host thread which issues the callback.

  • APIs cuptiActivityEnableDriverApi and cuptiActivityEnableRuntimeApi can be used to limit the tracing of CUDA APIs that are of interest.

  • For CUDA Graphs, if node level visibility is not desired, switching to the graph-level tracing from node-level tracing can help in reducing the collection overhead significantly. Use activity kind CUPTI_ACTIVITY_KIND_GRAPH_TRACE to enable graph-level tracing.

  • For activity buffer requested callback, the client should return the buffer as quickly as possible as this callback is issued from the application thread. Client can pre-allocate a pool of activity buffers and return an empty buffer from the pool when requested by CUPTI.

  • CUPTI initializes the new activity buffer with zero values using the memset call before using it. This operation can be skipped if user provides the zero value buffer and sets the attribute CUPTI_ACTIVITY_ATTR_ZEROED_OUT_ACTIVITY_BUFFER of the enum CUpti_ActivityAttribute.

  • Client can request CUPTI to maintain the activity buffers at the thread level instead of global buffers. This can be achieved by setting the option CUPTI_ACTIVITY_ATTR_PER_THREAD_ACTIVITY_BUFFER of the enum CUpti_ActivityAttribute. This can help in reducing the collection overhead for applications which launch CUDA activities from multiple host threads.

  • Reduce the frequency of buffer flushing as it can be an expensive operation. This can be achieved by setting a high flush period using the API cuptiActivityFlushPeriod to avoid internal flushing done by CUPTI and by reducing the frequency of the API cuptiActivityFlushAll. This approach might result in increased memory footprint on host and device.

  • For device buffers, CUPTI allocates a new buffer when it runs out of the buffers from the pool, and this happens in the main application thread, which might result in stalls in the critical path. This can be avoided by either pre-allocating more device buffers or increasing the size of the device buffer using the attributes CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_POOL_LIMIT and CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE respectively from the enum CUpti_ActivityAttribute.

  • Serial kernel trace enabled using the activity kind CUPTI_ACTIVITY_KIND_KERNEL can significantly change the overall performance characteristics of the application because all kernel executions are serialized on the GPU. For applications which use only a single CUDA stream and therefore cannot have concurrent kernel execution, this mode can be useful as it usually (not always) incurs less profiling overhead compared to the concurrent kernel mode.

  • Concurrent kernel trace enabled using the activity kind CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL doesn’t affect the concurrency of the kernels in the application. In this mode, CUPTI instruments the kernel code to collect the timing information. A single instrumentation code is generated at the time of loading the CUDA module and applied to each kernel during the kernel execution. Instrumentation code generation overhead is attributed as CUPTI_ACTIVITY_OVERHEAD_CUPTI_INSTRUMENTATION in the activity record CUpti_ActivityOverhead2.

  • Due to the code instrumentation, concurrent kernel mode can add significant runtime overhead if used on kernels that execute a large number of blocks and that have short execution duration.

2.14.1.2. Memory overhead

CUPTI allocates device and pinned system memory for storing the tracing information:

  • Static memory allocation: CUPTI allocates 3 buffers of 3 MB each in the pinned system memory for each CUDA context by default during the context creation phase. This is used for storing the concurrent kernel, serial kernel, memcopy and memset tracing information and these buffers are sufficient for storing information for about 300K such activities. The number of buffers is controlled using the attribute CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_PRE_ALLOCATE_VALUE and the size of the buffer is determined by the attribute CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE. User can change the buffer size at any time during the profiling session, but this setting takes effect only for new buffer allocations. It is recommended to adjust the buffer size before the creation of any CUDA context to make sure that all the pre-allocated buffers are of the adjusted size.

  • Dynamic memory allocation: Once profiling buffers to store the tracing information are exhausted, CUPTI allocates another buffer of the same size. Note that memory footprint will not always scale with the kernel, memcopy, memset count because CUPTI reuses the buffer after processing all the records in the buffer. For applications with a high density of these activities CUPTI may allocate more buffers.

All of the CUPTI allocated memory associated with a context is freed when the context is destroyed. Memory allocation overhead is attributed as CUPTI_ACTIVITY_OVERHEAD_CUPTI_RESOURCE in the activity record CUpti_ActivityOverhead2. If there are no CUDA contexts created then CUPTI will not allocate corresponding buffers.

CUPTI allocates memory to store unique kernel names, NVTX ranges, CUDA module cubin:

  • Kernel trace: For kernel tracing enabled using the activity kind CUPTI_ACTIVITY_KIND_KERNEL or CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL CUPTI allocates memory to store the kernel name in the records. It is recommended to not free the memory allocated for the kernel name in the kernel activity record as the kernel name memory space might be common across all kernel records having the same kernel name.

  • NVTX ranges: For NVTX enabled using the activity kind CUPTI_ACTIVITY_KIND_MARKER CUPTI allocates memory to store the range name in the records. It is recommended to not free the memory allocated for the NVTX range name in the marker activity record as the NVTX range name memory space will be common across all NVTX range records having the same name.

  • CUDA module cubin: CUPTI caches copies of cubin images at the time of loading CUDA modules. This is done only for the profiling features that need it are enabled. All of the CUPTI allocated memory associated with the cubin image of the module is freed when the module is unloaded.

2.14.2. Profiling Overhead

Events and metrics collection using CUPTI incurs runtime overhead. This overhead depends on the number and type of events and metrics selected. Since each metric is computed from one or more events, metric overhead depends on the number and type of underlying events. The overhead includes time spent in configuration of hardware events and reading of hardware event values.

Factors affecting the execution overhead under profiling are:

  • Overhead is less for hardware provided events and metrics.

    • For event and metric APIs, events which are collected using the collection method CUPTI_EVENT_COLLECTION_METHOD_PM or CUPTI_EVENT_COLLECTION_METHOD_SM fall in this category.

    • For Profiling APIs, metrics which don’t have string “sass” in the name fall in this category.

  • Software instrumented events and metrics are expensive as CUPTI needs to instrument the kernel to collect these. Further these events and metrics cannot be combined with any other event or metric in the same pass as otherwise instrumented code will also contribute to the event value.

    • For event and metric APIs, the collection method CUPTI_EVENT_COLLECTION_METHOD_INSTRUMENTED fall in this category.

    • For Profiling APIs, metrics which have string “sass” in the name fall in this category.

  • In the serial mode, profiling may significantly change the overall performance characteristics of the application because all kernel executions are serialized on the GPU. This is done to enable tight event or metric collection around each kernel.

    • For event and metric APIs, the collection mode CUPTI_EVENT_COLLECTION_MODE_KERNEL, serializes all kernel executions on the GPU that occur between the APIs cuptiEventGroupEnable and cuptiEventGroupDisable. On the other hand, kernel concurrency can be maintained by using the collection mode CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS and restricting profiling to events and metrics that can be collected in a single pass.

    • For Profiling APIs, auto range mode serializes all kernel executions on the GPU. On the other hand, kernel concurrency can be maintained by using the user range mode.

  • When all the requested events or metrics cannot be collected in the single pass due to hardware or software limitations, one needs to replay the exact same set of GPU workloads multiple times. This can be achieved at the kernel granularity by replaying kernel multiple times or by launching the entire application multiple times. CUPTI provides support for kernel replay only. Application replay can be done by the CUPTI client.

  • When kernel replay is used the overhead to save and restore kernel state for each replay pass depends on the amount of device memory used by the kernel. Application replay is expected to perform better than kernel replay for the case when the size of device memory used by the kernel is high.

2.15. Reproducibility

Some CUPTI APIs are not guaranteed to return perfectly reproducible results between runs. Numerous factors introduce measurable run-to-run variation in software and hardware performance. There are several suggestions for users who want more reproducible results.

2.15.1. Fixed Clock Rate

Many metrics are directly affected by GPU SM and memory clock frequencies. By default, the GPU keeps clock rates low until work is launched, but clock rates do not boost to full speed immediately, so initial work launched after an idle period may run at low clock speed. Additionally, the target clock rates may vary based on power, thermal, and other factors. Complex interactions between different part of the system mean that these dynamic clock rates may not be reproducible between runs.

To reduce the effect of dynamic clock rates, it is possible to set a fixed clock rate. The GPU will no longer opportunistically boost clock rates above this rate, but it will eliminate the variability after GPU idle and effects of power and thermal variation. Several different methods exist to fix the SM or memory clock rates. The simplest may be nvidia-smi, but see this NVIDIA blog entry for more suggestions.

2.15.2. Serialization

Work may be submitted to the GPU which can run asynchronously and concurrently. This improves performance by using more of the GPU resources at once, but complicates profiling in two ways - first, kernels running concurrently can impact each other through contention for shared resources. Measurements of these shared resources will include the impact of any concurrently kernels, and it may not be possible to determine the particular impact of any given kernel. Second, by contending for resources with other kernels that are running without precisely guaranteed timing, the timing for a given kernel may be impacted in irreproducible ways.

When CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL is used to measure kernel timing, kernels are allowed to run concurrently on device. CUPTI_ACTIVITY_KIND_KERNEL may be used instead to measure serialized kernel timing. This will eliminate GPU concurrency within this process, and should provide better run-to-run reproducibility, but the timing may not be as realistic in this mode - kernels will not have to contend for shared resources, which can impact their performance.

2.15.3. Other Issues

Beyond variable clock rates and concurrent kernel execution, several other factors can affect application and kernel performance.

The driver normally does not stay loaded when not in use. It takes some time to load and initialize the driver, which may affect performance in noticeable and somewhat irreproducible ways. It is possible to keep the driver persistently loaded which will eliminate this initialization overhead. nvidia-persistenced is one tool to configure this; it can also be configured through nvidia-smi.

2.16. Samples

The CUPTI installation includes several samples that demonstrate the use of the CUPTI APIs. These samples can be referred to for the usage of different APIs supported by CUPTI. A sample might not be supported on all GPU architectures, please refer to the section GPU Support for the GPU architectures supported by different CUPTI APIs used in the sample. The samples are:

Activity API

activity_trace_async

This sample shows how to collect a trace of CPU and GPU activity using the new asynchronous activity buffer APIs.

callback_timestamp

This sample shows how to use the callback API to record a trace of API start and stop times.

cuda_graphs_trace

This sample shows how to collect the trace of CUDA graphs and correlate the graph node launch to the node creation API using CUPTI callbacks.

cuda_memory_trace

This sample shows how to collect the trace of CUDA memory operations. The sample also traces CUDA memory operations done via default memory pool.

cupti_correlation

This sample shows how to do the correlation between CUDA APIs and corresponding GPU activities.

cupti_external_correlation

This sample shows how to do the correlation of CUDA API activity records with external APIs.

cupti_finalize

This sample shows how to use API cuptiFinalize() to dynamically detach and attach CUPTI.

cupti_nvtx

This sample shows how to receive NVTX callbacks and collect NVTX records in CUPTI.

cupti_trace_injection

This sample shows how to build an injection library using the CUPTI activity and callback APIs. It can be used to trace CUDA APIs and GPU activities for any CUDA application. It does not require the CUDA application to be modified.

openacc_trace

This sample shows how to use CUPTI APIs for OpenACC data collection.

pc_sampling

This sample shows how to collect PC Sampling profiling information for a kernel using the PC Sampling Activity APIs.

sass_source_map

This sample shows how to generate CUpti_ActivityInstructionExecution records and how to map SASS assembly instructions to CUDA C source.

unified_memory

This sample shows how to collect information about page transfers for unified memory.

Event and Metric APIs

callback_event

This sample shows how to use both the callback and event APIs to record the events that occur during the execution of a simple kernel. The sample shows the required ordering for synchronization, and for event group enabling, disabling, and reading.

callback_metric

This sample shows how to use both the callback and metric APIs to record the metric’s events during the execution of a simple kernel, and then use those events to calculate the metric value.

cupti_query

This sample shows how to query CUDA-enabled devices for their event domains, events, and metrics.

event_multi_gpu

This sample shows how to use the CUPTI event and CUDA APIs to sample events on a setup with multiple GPUs. The sample shows the required ordering for synchronization, and for event group enabling, disabling, and reading.

event_sampling

This sample shows how to use the event APIs to sample events using a separate host thread.

Profiling API

extensions

This includes utilities used in some of the samples.

autorange_profiling

This sample shows how to use profiling APIs to collect metrics in autorange mode.

callback_profiling

This sample shows how to use callback and profiling APIs to collect the metrics during the execution of a kernel. It shows how to use different phases of profiling i.e. enumeration, configuration, collection and evaluation in the appropriate callbacks.

concurrent_profiling

This sample shows how to use the profiling API to record metrics from concurrent kernels launched in two different ways - using multiple streams on a single device, and using multiple threads with multiple devices.

cupti_metric_properties

This sample shows how to query various properties of metrics using the Profiling APIs. The sample shows collection method (hardware or software) and number of passes required to collect a list of metrics.

nested_range_profiling

This sample shows how to profile nested ranges using the Profiling APIs.

profiling_injection

This sample for Linux systems shows how to build an injection library which can automatically enable CUPTI’s Profiling API using Auto Ranges with Kernel Replay mode. It can attach to an application which was not instrumented using CUPTI and profile any kernel launches.

userrange_profiling

This sample shows how to use profiling APIs to collect metrics in user specified range mode.

PC Sampling API

pc_sampling_continuous

This injection sample shows how to collect PC Sampling profiling information using the PC Sampling APIs. A perl script libpc_sampling_continuous.pl is provided to run the CUDA application with different PC sampling options. Use the command ./libpc_sampling_continuous.pl –help to list all the options. The CUDA application code does not need to be modified. Refer the README.txt file shipped with the sample for instructions to build and use the injection library.

pc_sampling_start_stop

This sample shows how to collect PC Sampling profiling information for kernels within a range using the PC Sampling start/stop APIs.

pc_sampling_utility

This utility takes the pc sampling data file generated by the pc_sampling_continuous injection library as input. It prints the stall reason counter values at the GPU assembly instruction level. It also does GPU assembly to CUDA-C source correlation and shows the CUDA-C source file name and line number. Refer the README.txt file shipped with the sample for instructions to build and run the utility.

PM Sampling API

pm_sampling

This sample shows the usage of the PM sampling APIs for collecting sampling data for a list of metrics for kernels launched within a range using the PM sampling start/stop APIs.

SASS Metric API

sass_metric

This sample shows how to use the SASS metric API to enumerate metrics supported by a device and how to collect metrics at the source level using SASS patching.

Checkpoint API

checkpoint_kernels

This sample shows how to use the Checkpoint API to restore device memory, allowing a kernel to be replayed, even if it modifies its input data.