1. Release Notes

CUPTI Release Notes.

Release notes, including new features and important bug fixes. Supported platforms and GPUs.

1.1. Release Notes

1.1.1. Updates in CUDA 12.4 Update 1

Resolved Issues

Fixed a crash when API cuptiFinalize is used for applications using CUDA Graph.
Fixed a crash which can occur while tracing memcpy and memset nodes in a graph when using graph level tracing.
Skip delivering worker thread buffers on internal flush if the worker thread buffer is not full.

1.1.2. Updates in CUDA 12.4

New Features

Added tracing support for applications using Green contexts. Added two new fields isGreenContext and parentContextId in the context activity record. The activity record CUpti_ActivityContext is deprecated and it is replaced by a new activity record CUpti_ActivityContext2.
CUDA API calls are completed asynchronously from the perspective of the host CPU. This is accomplished by queuing the work slated for the GPU into a structure known as a command buffer. If there is insufficient space available in the command buffer when attempting to call a CUDA API, the host call will block until space becomes available. The user should be able to identify when this situation occurs. This is indicated using the new attribute CUPTI_ACTIVITY_OVERHEAD_COMMAND_BUFFER_FULL added in the activity overhead enum CUpti_ActivityOverheadKind. To provide additional details about the overhead, a new field overheadData is added in the overhead activity record. Activity record CUpti_ActivityOverhead2 is deprecated and it is replaced by the new activity record CUpti_ActivityOverhead3.
Added process ID and thread ID in the JIT activity record. To accomodate this change, activity record CUpti_ActivityJit is deprecated and it is replaced by a new activity record CUpti_ActivityJit2.
To correlate the sampling data for a kernel with the launch API in the serial mode of the PC Sampling APIs, a new field correlationId is added in the struct CUpti_PCSamplingPCData.
For PC Sampling APIs, total (smsp__pcsamp_sample_count) and dropped (smsp__pcsamp_samples_data_dropped) sample counts are collected by default.

Resolved Issues

Fixed the issue for overhead records showing the default thread ID than the one requested using the API cuptiSetThreadIdType().
Fixed instruction level SASS metrics profiling for CUDA Graph applications.
When a device graph is first launched from the device and it is not launched from the host earlier, end timestamp could be 0 for graph-level tracing on Ampere and later GPU architectures. This issue is fixed.

1.1.3. Updates in CUDA 12.3 Update 1

Resolved Issues

To provide normalized timestamps for all activities, CUPTI uses linear interpolation for conversion from GPU timestamps to CPU timestamps. This was broken with CUDA 12.3 causing spurious gaps or overlap on Tegra platforms. Fixed the issue.

1.1.4. Updates in CUDA 12.3

New Features

New attributes CUPTI_ACTIVITY_OVERHEAD_RUNTIME_TRIGGERED_MODULE_LOADING and CUPTI_ACTIVITY_OVERHEAD_LAZY_FUNCTION_LOADING are added in the activity overhead enum CUpti_ActivityOverheadKind to provide the overhead information for CUDA runtime triggered module loading and lazy function loading respectively.
New API cuptiGetGraphExecId provides the unique ID of the executable graph.
Added support for collecting graph level trace for device launched graphs. A new API cuptiActivityEnableDeviceGraph is added to enable the collection of records for device launched graphs.
CUDA Graphs can be executed on multiple devices i.e. the root node could be launched on one device and the leaf node could be launched on the another device. New fields endDeviceId and endContextId are added to identify the ids of device and context respectively which are used to execute the last node of the graph. To accomodate this change, activity record CUpti_ActivityGraphTrace is deprecated and it is replaced by a new activity record CUpti_ActivityGraphTrace2.
Added WSL profiling support on Windows 10 WSL with OS build version 19044 and greater. WSL profiling is not supported on Windows 10 WSL for systems that exceed 1 TB of system memory.
Several performance improvements are done in the tracing path. One of the key improvements is to allow clients to request CUPTI to maintain the activity buffers at the thread level instead of global buffers. This can be achieved by setting the option CUPTI_ACTIVITY_ATTR_PER_THREAD_ACTIVITY_BUFFER of the enum CUpti_ActivityAttribute. This can help in reducing the collection overhead for applications which launch CUDA activities from multiple host threads.
Frame pointers are enabled for Linux x86_64 libraries.
The deprecated Activity APIs and structures have been moved to a new header cupti_activity_deprecated.h, which is included in the header cupti_activity.h. Header cupti_activity.h contains only the latest version of APIs and structures.
CUPTI no longer uses profiling semaphore pool to store the profiling data. Coresponding attributes CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_SIZE, CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_LIMIT and CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_PRE_ALLOCATE_VALUE have been deprecated.

Resolved Issues

Fixed SASS metric profiling for cuda graph.
Fixed race condition in the API cuptiSetThreadIdType for late subscription.

1.1.5. Updates in CUDA 12.2 Update 2

New Features

SASS Metric APIs introduced in the CUDA 12.2 GA release are transitioning from the beta to the production release.
- Added support for collecting SASS metrics for CUDA Graphs which are launched from host.
- Added a new field numOfDroppedRecords in the struct CUpti_SassMetricsDisable_Params to indicate the number of dropped records when SASS data is flushed prior to calling the disable API.
Added a new field api in the struct CUpti_Profiler_DeviceSupported_Params which can be used to check the configuration support level for profiler APIs like Profiling, PC Sampling and SASS Metric APIs.

Resolved Issues

Fixed the tracing and profiling support for the GA103 GPU.
Fixed a hang which can occur when activity buffer gets full while collecting the sampling data using the PC Sampling Activity API.
Fixed the issue of incorrect timestamps for graph level trace when a graph node is disabled using the APIs cuGraphNodeSetEnabled or cudaGraphNodeSetEnabled.

1.1.6. Updates in CUDA 12.2 Update 1

Support for Confidential Computing

CUPTI supports some APIs while running in CC-devtools mode:

Callback API
Activity API

The profiling APIs are not supported in CC-devtools mode with this release. Using these APIs should return an error indicating the configuration is not supported:

Profiling API
PC Sampling API
Checkpoint API
SASS Metrics API

Additionally, CUPTI is not supported at all in full CC mode. CC-devtools mode must be used for tools support. Some CUDA APIs are not supported or behave differently when running in CC or CC-devtools mode; notably, host pinned memory requests will be traced as managed memory requests, and any CUDA memcopies on these converted pointers are traced as Device to Device copies irrespective of the locality of the source or destination pointers. For details on how to configure CC or CC-devtools mode, system and software requirements, as well as documentation on CUDA API changes, please see the confidential compute release documentation at https://docs.nvidia.com/confidential-computing/.

Resolved Issues

Fixed timestamps for graph-level tracing for CUDA graphs running across multiple GPUs.
Fixed a potential hang when CUPTI is unable to fetch attributes for an activity.

1.1.7. Updates in CUDA 12.2

New Features

A new set of CUPTI APIs for collection of SASS metric data at the source level are provided in the header file cupti_sass_metrics.h. These support a larger set of metrics compared to the CUPTI Activity APIs for source-level analysis. SASS to source correlation can be done in the offline mode, similar to the PC sampling APIs. Hence the runtime overhead during data collection is lower. Refer to the section CUPTI SASS Metrics API for more details. Please note that this is a Beta feature, interface and functionality are subject to change in a future release.
CUPTI now reports fatal errors, non-fatal errors and warnings instantaneously through callbacks. A new callback domain CUPTI_CB_DOMAIN_STATE is added for subscribing to the instantaneous error reporting. Corresponding callback ids are provided in the struct CUpti_CallbackIdState.
Added support for profiling of device graphs and host graphs that launch device graphs. There are some known limitations, please refer to the Known Issues section for details.
Change in the stream attribute value is communicated by issuing the resource callback. Refer to the struct CUpti_StreamAttrData and callback id CUPTI_CBID_RESOURCE_STREAM_ATTRIBUTE_CHANGED added in the enum CUpti_CallbackIdResource.
New API cuptiGetErrorMessage provides descriptive message for CUPTI error codes.
Removed the deprecated API cuptiDeviceGetTimestamp from the header cupti_events.h.
Added metrics for Tensor core operations to count different types of tensor instructions. These metrics are named as sm[sp]__ops_path_tensor_src_{src}[_dst_{dst}[_sparsity_{on,off}]]. These are available for devices with compute capability 7.0 and higher, except for Turing TU11x GPUs.

Resolved Issues

Fixed crash for the graph-level trace for device graphs which are launched from the host.

1.1.8. Updates in CUDA 12.1 Update 1

Resolved Issues

Fixed CUPTI tracing failure when just-in-time compilation of embedded PTX code is disabled using the environment variable CUDA_DISABLE_PTX_JIT.
Fixed a crash in the API cuptiFinalize.

1.1.9. Updates in CUDA 12.1

New Features

Field wsl is added in the struct CUpti_Profiler_DeviceSupported_Params to indicate whether Profiling API is supported on Windows Subsystem for Linux (WSL) system or not.

1.1.10. Updates in CUDA 12.0 Update 1

Resolved Issues

Reduced the host memory overhead by avoiding caching copies of cubin images at the time of loading CUDA modules. Copies of cubin images are now created only when profiling features that need it are enabled.
By default CUPTI switches back to the device memory, instead of the pinned host memory, for allocation of the profiling buffer for concurrent kernel tracing. This might help in improving the performance of the tracing run. Memory location can be controlled using the attribute CUPTI_ACTIVITY_ATTR_MEM_ALLOCATION_TYPE_HOST_PINNED of the activity attribute enum CUpti_ActivityAttribute.
CUPTI now captures the cudaGraphLaunch API and its kernels when CUPTI is attached after the graph is instantiated using the API cudaGraphInstantiate but it is attached before the graph is launched using the API cudaGraphLaunch. Some data in the kernel record would be missing i.e. cacheConfig, sharedMemoryExecuted, partitionedGlobalCacheRequested, partitionedGlobalCacheExecuted, sharedMemoryCarveoutRequested etc. This fix requires the matching CUDA driver which ships with the CUDA 12.0 Update 1 release.

1.1.11. Updates in CUDA 12.0

New Features

Added new fields maxPotentialClusterSize and maxActiveClusters to help in calculating the cluster occupancy correctly. These fields are valid for devices with compute capability 9.0 and higher. To accomodate this change, activity record CUpti_ActivityKernel8 is deprecated and replaced by a new activity record CUpti_ActivityKernel9.
Enhancements for PC Sampling APIs:
- CUPTI creates few worker threads to offload certain operations like decoding of the hardware data to the CUPTI PC sampling data and correlation of the PC data to the SASS instructions. CUPTI wakes up these threads periodically. To control the sleep time of the worker threads, a new attribute CUPTI_PC_SAMPLING_CONFIGURATION_ATTR_TYPE_WORKER_THREAD_PERIODIC_SLEEP_SPAN is added in the enum CUpti_PCSamplingConfigurationAttributeType.
- Improved error reporting for hardware buffer overflow. When hardware buffer overflows, CUPTI returns the out of memory error code. And a new field hardwareBufferFull added in the struct CUpti_PCSamplingData is set to differentiate it from other out of memory cases. User can either increase the hardware buffer size or flush the hardware buffer at a higher frequency to avoid overflow.
Profiling APIs are supported on Windows Subsystem for Linux (WSL) with WSL version 2, NVIDIA display driver version 525 or higher and Windows 11.
CUPTI support for Kepler GPUs is dropped in CUDA Toolkit 12.0.

Resolved Issues

Removed minor CUDA version from the SONAME of the CUPTI shared library for compatibility reasons. For example, SONAME of CUPTI library is libcupti.so.12 instead of libcupti.so.12.0 in CUDA 12.0 release.
Activity kinds CUPTI_ACTIVITY_KIND_MARKER and CUPTI_ACTIVITY_KIND_MARKER_DATA can be enabled together.

1.1.12. Updates in CUDA 11.8

New Features

CUPTI adds tracing and profiling support for Hopper and Ada Lovelace GPU families.
Added new fields clusterX, clusterY, clusterZ and clusterSchedulingPolicy to output the Thread Block Cluster dimensions and scheduling policy. These fields are valid for devices with compute capability 9.0 and higher. To accomodate this change, activity record CUpti_ActivityKernel7 is deprecated and replaced by a new activity record CUpti_ActivityKernel8.
A new activity kind CUPTI_ACTIVITY_KIND_JIT and corresponding activity record CUpti_ActivityJit are introduced to capture the overhead involved in the JIT (just-in-time) compilation and caching of the PTX or NVVM IR code to the binary code. New record also provides the information about the size and path of the compute cache where the binary code is stored.
PC Sampling API is supported on Tegra platforms - QNX, Linux (aarch64) and Linux (x86_64) (Drive SDK).

Resolved Issues

Resolved an issue that might cause crash when the size of the device buffer is changed, using the attribute CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE, after creation of the CUDA context.

1.1.13. Updates in CUDA 11.7 Update 1

Resolved Issues

Resolved an issue for PC Sampling API cuptiPCSamplingGetData which might not always return all the samples when called after the PC sampling range defined by using the APIs cuptiPCSamplingStart and cuptiPCSamplingStop. Remaining samples were delivered in the successive call of the API cuptiPCSamplingGetData after the next range.
Disabled tracing of nodes in the CUDA Graph when user enables tracing at the Graph level using the activity kind CUPTI_ACTIVITY_KIND_GRAPH_TRACE.
Fixed missing channelID and channelType information for kernel records. Earlier these fields were populated for CUDA Graph launches only.

1.1.14. Updates in CUDA 11.7

New Features

A new activity kind CUPTI_ACTIVITY_KIND_GRAPH_TRACE and activity record CUpti_ActivityGraphTrace are introduced to represent the execution for a graph without giving visibility about the execution of its nodes. This is intended to reduce overheads involved in tracing each node separately. This activity can only be enabled for drivers of version 515 and above.
A new API cuptiActivityEnableAndDump is added to provide snapshot of certain activities like device, context, stream, NVLink and PCIe at any point during the profiling session.
Added sample cupti_correlation to show correlation between CUDA APIs and corresponding GPU activities.
Added sample cupti_trace_injection to show how to build an injection library using the activity and callback APIs which can be used to trace any CUDA application.

Resolved Issues

Fixed corruption in the function name for PC Sampling API records.
Fixed incorrect timestamps for GPU activities when user calls the API cuptiActivityRegisterTimestampCallback in the late CUPTI attach scenario.
Fixed incomplete records for device to device memcopies in the late CUPTI attach scenario. This issue manifests mainly when application has a mix of CUDA graph and normal kernel launches.

1.1.15. Updates in CUDA 11.6 Update 1

Resolved Issues

Fixed hang for the PC Sampling API cuptiPCSamplingStop. This issue is seen for the PC sampling start and stop resulting in generation of large number of sampling records.
Fixed timing issue for specific device to device memcpy operations.

1.1.16. Updates in CUDA 11.6

New Features

Two new fields channelID and channelType are added in the activity records for kernel, memcpy, peer-to-peer memcpy and memset to output the ID and type of the hardware channel on which these activities happen. Activity records CUpti_ActivityKernel6, CUpti_ActivityMemcpy4, CUpti_ActivityMemcpyPtoP3 and CUpti_ActivityMemset3 are deprecated and replaced by new activity records CUpti_ActivityKernel7, CUpti_ActivityMemcpy5, CUpti_ActivityMemcpyPtoP4 and CUpti_ActivityMemset4.
New fields isMigEnabled, gpuInstanceId, computeInstanceId and migUuid are added in the device activity record to provide MIG information for the MIG enabled GPU. Activity record CUpti_ActivityDevice3 is deprecated and replaced by a new activity record CUpti_ActivityDevice4.
A new field utilizedSize is added in the memory pool and memory activity record to provide the utilized size of the memory pool. Activity record CUpti_ActivityMemoryPool and CUpti_ActivityMemory2 are deprecated and replaced by a new activity record CUpti_ActivityMemoryPool2 and CUpti_ActivityMemory3 respectively.
API cuptiActivityRegisterTimestampCallback and callback function CUpti_TimestampCallbackFunc are added to register a callback function to obtain timestamp of user’s choice instead of using CUPTI provided timestamp in activity records.
Profiling API supports profiling OptiX application.

Resolved Issues

Fixed multi-pass metric collection using the Profiling API in the auto range and kernel replay mode for Cuda Graph.
Fixed the performance issue for the PC sampling API cuptiPCSamplingStop.
Fixed corruption in variable names for OpenACC activity records.
Fixed corruption in the fields of the struct memoryPoolConfig in the activity record CUpti_ActivityMemory3.
Filled the fields of the struct memoryPoolConfig in the activity record CUpti_ActivityMemory3 when a memory pointer allocated via memory pool is released using cudaFree CUDA API.

1.1.17. Updates in CUDA 11.5 Update 1

Resolved Issues

Resolved an issue that causes incorrect range name for NVTX event attributes. The issue was introduced in CUDA 11.4.
Made NVTX initialization APIs InitializeInjectionNvtx and InitializeInjectionNvtx2 thread-safe.

1.1.18. Updates in CUDA 11.5

New Features

A new API cuptiProfilerDeviceSupported is introduced to expose overall Profiling API support and specific requirements for a given device. Profiling API must be initialized by calling cuptiProfilerInitialize before testing device support.
PC Sampling struct CUpti_PCSamplingData introduces a new field nonUsrKernelsTotalSamples to provide information about the number of samples collected for all non-user kernels.
Activity record CUpti_ActivityDevice2 for device information has been deprecated and replaced by a new activity record CUpti_ActivityDevice3. New record adds a flag isCudaVisible to indicate whether device is visible to CUDA.
Activity record CUpti_ActivityNvLink3 for NVLink information has been deprecated and replaced by a new activity record CUpti_ActivityNvLink4. New record can accommodate NVLink port information upto a maximum of 32 ports.
A new CUPTI Checkpoint API is introduced, enabling automatic saving and restoring of device state, and facilitating development of kernel replay tools. This is helpful for User Replay mode of the CUPTI Profiling API, but is not limited to use with CUPTI.
Tracing is supported on the Windows Subsystem for Linux version 2 (WSL 2).
CUPTI is not supported on NVIDIA Crypto Mining Processors (CMP). A new error code CUPTI_ERROR_CMP_DEVICE_NOT_SUPPORTED is introduced to indicate it.

Resolved Issues

Resolved an issue that causes crash for tracing of device to device memcopy operations.
Resolved an issue that causes crash for OpenACC activity when it is enabled before other activities.

1.1.19. Updates in CUDA 11.4 Update 1

Resolved Issues

Resolved serialization of CUDA Graph launches for applications which use multiple threads to launch work.
Previously, for applications that use CUDA Dynamic Parallelism (CDP), CUPTI detects the presence of the CDP kernels in the CUDA module. Even if CDP kernels are not called, it fails to trace the application. There is a change in the behavior, CUPTI now traces all the host launched kernels until it encounters a host launched kernel which launches child kernels. Subsequent kernels are not traced.

1.1.20. Updates in CUDA 11.4

New Features

Profiling APIs support profiling of the CUDA kernel nodes launched by a CUDA Graph. Auto range profiling with kernel replay mode and user range profiling with user replay and application replay modes are supported. Other combinations of range profiling and replay modes are not supported.
Added support for tracing and profiling on NVIDIA virtual GPUs (vGPUs) on an upcoming GRID/vGPU release.
Added sample profiling_injection to show how to build injection library using the Profiling API.
Added sample concurrent_profiling to show how to retain the kernel concurrency across streams and devices using the Profiling API.

Resolved Issues

Resolved the issue of not tracing the device to device memcopy nodes in a CUDA Graph.
Fixed the issue of reporting zero size for local memory pool for mempool creation record.
Resolved the issue of non-collection of samples for the default CUDA context for PC Sampling API.
Enabled tracking of all domains and registered strings in NVTX irrespective of whether the NVTX activity kind or callbacks are enabled. This state tracking is needed for proper working of the tool which creates these NVTX objects before enabling the NVTX activity kind or callback.

1.1.21. Updates in CUDA 11.3

New Features

A new set of CUPTI APIs for PC sampling data collection are provided in the header file cupti_pcsampling.h which support continuous mode data collection without serializing kernel execution and have a lower runtime overhead. Along with these a utility library is provided in the header file cupti_pcsampling_util.h which has APIs for GPU assembly to CUDA-C source correlation and for reading and writing the PC sampling data from/to files. Refer to the section CUPTI PC Sampling API for more details.
Enum CUpti_PcieGen is extended to include PCIe Gen 5.
The following functions are deprecated and will be removed in a future release:
- Struct NVPA_MetricsContext and related APIs NVPW_MetricsContext_* from the header nvperf_host.h. It is recommended to use the struct NVPW_MetricsEvaluator and related APIs NVPW_MetricsEvaluator_* instead. Profiling API samples have been updated to show how to use these APIs.
- cuptiDeviceGetTimestamp from the header cupti_events.h.

Resolved Issues

Overhead reduction for tracing of CUDA memcopies.
To provide normalized timestamps for all activities, CUPTI uses linear interpolation for conversion from GPU timestamps to CPU timestamps. This method can cause spurious gaps or overlap on the timeline. CUPTI improves the conversion function to provide more precise timestamps.
Generate overhead activity record for semaphore pool allocation.

1.1.22. Updates in CUDA 11.2

New Features

A new activity kind CUPTI_ACTIVITY_KIND_MEMORY_POOL and activity record CUpti_ActivityMemoryPool are introduced to represent the creation, destruction and trimming of a memory pool. Enum CUpti_ActivityMemoryPoolType lists types of memory pool.
A new activity kind CUPTI_ACTIVITY_KIND_MEMORY2 and activity record CUpti_ActivityMemory2 are introduced to provide separate records for memory allocation and release operations. This helps in correlation of records of these operations to the corresponding CUDA APIs, which otherwise is not possible using the existing activity record CUpti_ActivityMemory which provides a single record for both the memory operations.
Added a new pointer field of type CUaccessPolicyWindow in the kernel activity record to provide the access policy window which specifies a contiguous region of global memory and a persistence property in the L2 cache for accesses within that region. To accomodate this change, activity record CUpti_ActivityKernel5 is deprecated and replaced by a new activity record CUpti_ActivityKernel6. This attribute is not collected by default. To control the collection of launch attributes, a new API cuptiActivityEnableLaunchAttributes is introdcued.
New attributes CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_PRE_ALLOCATE_VALUE and CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_PRE_ALLOCATE_VALUE are added in the activity attribute enum CUpti_ActivityAttribute to set and get the number of device buffers and profiling semaphore pools which are preallocated for the context.
CUPTI now allocates profiling buffer for concurrent kernel tracing in the pinned host memory in place of device memory. This might help in improving the performance of the tracing run. Memory location can be controlled using the attribute CUPTI_ACTIVITY_ATTR_MEM_ALLOCATION_TYPE_HOST_PINNED of the activity attribute enum CUpti_ActivityAttribute.
The compiler generated line information for inlined functions is improved due to which CUPTI can associate inlined functions with the line information of the function call site that has been inlined.
Removed support for NVLink performance metrics (nvlrx__* and nvltx__*) from the Profiling API due to a potential application hang during data collection. The metrics will be added back in a future CUDA release.

Resolved Issues

Execution overheads introduced by CUPTI in the tracing path is reduced.
For the concurrent kernel activity kind CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL, CUPTI instruments the kernel code to collect the timing information. Previously, every kernel in the CUDA module was instrumented, thus the overhead is proportional to the number of different kernels in the module. This is a static overhead which happens at the time of loading the CUDA module. To reduce this overhead, kernels are not instrumented at the module load time, instead a single instrumentation code is generated at the time of loading the CUDA module and it is applied to each kernel during the kernel execution, thus avoiding most of the static overhead at the CUDA module load time.

1.1.23. Updates in CUDA 11.1

New Features

CUPTI adds tracing and profiling support for the NVIDIA Ampere GPUs with compute capability 8.6.
Added a new field graphId in the activity records for kernel, memcpy, peer-to-peer memcpy and memset to output the unique ID of the CUDA graph that launches the activity through CUDA graph APIs. To accomodate this change, activity records CUpti_ActivityMemcpy3, CUpti_ActivityMemcpyPtoP2 and CUpti_ActivityMemset2 are deprecated and replaced by new activity records CUpti_ActivityMemcpy4, CUpti_ActivityMemcpyPtoP3 and CUpti_ActivityMemset3. And kernel activity record CUpti_ActivityKernel5 replaces the padding field with graphId. Added a new API cuptiGetGraphId to query the unique ID of the CUDA graph.
Added a new API cuptiActivityFlushPeriod to set the flush period for the worker thread.
Added support for profiling cooperative kernels using Profiling APIs.
Added NVLink performance metrics (nvlrx__* and nvltx__*) using the Profiling APIs. These metrics are available on devices with compute capability 7.0, 7.5 and 8.0, and these can be collected at the context level. Refer to the table Metrics Mapping Table for mapping between earlier CUPTI metrics and the Perfworks NVLink metrics for devices with compute capability 7.0.

Resolved Issues

Resolved an issue that causes CUPTI to not return full and completed activity buffers for a long time, CUPTI now attempts to return buffers early.
To reduce the runtime overhead, CUPTI wakes up the worker thread based on certain heuristics instead of waking it up at a regular interval. New API cuptiActivityFlushPeriod can be used to control the flush period of the worker thread. This setting overrides the CUPTI heurtistics.

1.1.24. Updates in CUDA 11.0

New Features

CUPTI adds tracing and profiling support for devices with compute capability 8.0 i.e. NVIDIA A100 GPUs and systems that are based on A100.
Enhancements for CUDA Graph:
- Support to correlate the CUDA Graph node with the GPU activities: kernel, memcpy, memset.
  - Added a new field graphNodeId for Node Id in the activity records for kernel, memcpy, memset and P2P transfers. Activity records CUpti_ActivityKernel4, CUpti_ActivityMemcpy2, CUpti_ActivityMemset and CUpti_ActivityMemcpyPtoP are deprecated and replaced by new activity records CUpti_ActivityKernel5, CUpti_ActivityMemcpy3, CUpti_ActivityMemset2 and CUpti_ActivityMemcpyPtoP2.
  - graphNodeId is the unique ID for the graph node.
  - graphNodeId can be queried using the new CUPTI API cuptiGetGraphNodeId().
  - Callback CUPTI_CBID_RESOURCE_GRAPHNODE_CREATED is issued between a pair of the API enter and exit callbacks.
- Introduced new callback CUPTI_CBID_RESOURCE_GRAPHNODE_CLONED to indicate the cloning of the CUDA Graph node.
- Retain CUDA driver performance optimization in case memset node is sandwiched between kernel nodes. CUPTI no longer disables the conversion of memset nodes into kernel nodes for CUDA graphs.
- Added support for cooperative kernels in CUDA graphs.
Added support to trace Optix applications. Refer the Optix Profiling section.
CUPTI overhead is associated with the thread rather than process. Object kind of the overhead record CUpti_ActivityOverhead is switched to CUPTI_ACTIVITY_OBJECT_THREAD.
Added error code CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED to indicate the presense of another CUPTI subscriber. API cuptiSubscribe() returns the new error code than CUPTI_ERROR_MAX_LIMIT_REACHED.
Added a new enum CUpti_FuncShmemLimitConfig to indicate whether user has opted in for maximun dynamic shared memory size on devices with compute capability 7.x by using function attributes CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES or cudaFuncAttributeMaxDynamicSharedMemorySize with CUDA driver and runtime respectively. Field shmemLimitConfig in the kernel activity record CUpti_ActivityKernel5 shows the user choice. This helps in correct occupancy calulation. Value FUNC_SHMEM_LIMIT_OPTIN in the enum cudaOccFuncShmemConfig is the corresponding option in the CUDA occupancy calculator.

Resolved Issues

Resolved an issue that causes incorrect or stale timing for memcopy and serial kernel activities.
Overhead for PC Sampling Activity APIs is reduced by avoiding the reconfiguration of the GPU when PC sampling period doesn’t change between successive kernels. This is applicable for devices with compute capability 7.0 and higher.
Fixed issues in the API cuptiFinalize() including the issue which may cause the application to crash. This API provides ability for safe and full detach of CUPTI during the execution of the application. More details in the section Dynamic Detach.

1.1.25. Updates in CUDA 10.2

New Features

CUPTI allows tracing features for non-root and non-admin users on desktop platforms. Note that events and metrics profiling is still restricted for non-root and non-admin users. More details about the issue and the solutions can be found on this web page.
CUPTI no longer turns off the performance characteristics of CUDA Graph when tracing the application.
CUPTI now shows memset nodes in the CUDA graph.
Fixed the incorrect timing issue for the asynchronous cuMemset/cudaMemset activity.
Several performance improvements are done in the tracing path.

1.1.26. Updates in CUDA 10.1 Update 2

New Features

This release is focused on bug fixes and stability of the CUPTI.
A security vulnerability issue required profiling tools to disable all the features for non-root or non-admin users. As a result, CUPTI cannot profile the application when using a Windows 419.17 or Linux 418.43 or later driver. More details about the issue and the solutions can be found on this web page.

1.1.27. Updates in CUDA 10.1 Update 1

New Features

Support for the IBM POWER platform is added for the
- Profiling APIs in the header cupti_profiler_target.h
- Perfworks metric APIs in the headers nvperf_host.h and nvperf_target.h

1.1.28. Updates in CUDA 10.1

New Features

This release is focused on bug fixes and performance improvements.
The new set of profiling APIs and Perfworks metric APIs which were introduced in the CUDA Toolkit 10.0 are now integrated into the CUPTI library distributed in the CUDA Toolkit. Refer to the sections CUPTI Profiling API and Perfworks Metric APIs for documentation of the new APIs.
Event collection mode CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS is now supported on all device classes including Geforce and Quadro.
Support for the NVTX string registration API nvtxDomainRegisterStringA().
Added enum CUpti_PcieGen to list PCIe generations.

1.1.29. Updates in CUDA 10.0

New Features

Added tracing support for devices with compute capability 7.5.
A new set of metric APIs are added for devices with compute capability 7.0 and higher. These provide low and deterministic profiling overhead on the target system. These APIs are currently supported only on Linux x86 64-bit and Windows 64-bit platforms. Refer to the CUPTI web page for documentation and details to download the package with support for these new APIs. Note that both the old and new metric APIs are supported for compute capability 7.0. This is to enable transition of code to the new metric APIs. But one cannot mix the usage of the old and new metric APIs.
CUPTI supports profiling of OpenMP applications. OpenMP profiling information is provided in the form of new activity records CUpti_ActivityOpenMp. New API cuptiOpenMpInitialize is used to initialize profiling for supported OpenMP runtimes.
Activity record for kernel CUpti_ActivityKernel4 provides shared memory size set by the CUDA driver.
Tracing support for CUDA kernels, memcpy and memset nodes launched by a CUDA Graph.
Added support for resource callbacks for resources associated with the CUDA Graph. Refer enum CUpti_CallbackIdResource for new callback IDs.

1.1.30. Updates in CUDA 9.2

New Features

Added support to query PCI devices information which can be used to construct the PCIe topology. See activity kind CUPTI_ACTIVITY_KIND_PCIE and related activity record CUpti_ActivityPcie.
To view and analyze bandwidth of memory transfers over PCIe topologies, new set of metrics to collect total data bytes transmitted and recieved through PCIe are added. Those give accumulated count for all devices in the system. These metrics are collected at the device level for the entire application. And those are made available for devices with compute capability 5.2 and higher.
CUPTI added support for new metrics:
- Instruction executed for different types of load and store
- Total number of cached global/local load requests from SM to texture cache
- Global atomic/non-atomic/reduction bytes written to L2 cache from texture cache
- Surface atomic/non-atomic/reduction bytes written to L2 cache from texture cache
- Hit rate at L2 cache for all requests from texture cache
- Device memory (DRAM) read and write bytes
- The utilization level of the multiprocessor function units that execute tensor core instructions for devices with compute capability 7.0
A new attribute CUPTI_EVENT_ATTR_PROFILING_SCOPE is added under enum CUpti_EventAttribute to query the profiling scope of a event. Profiling scope indicates if the event can be collected at the context level or device level or both. See Enum CUpti_EventProfilingScope for avaiable profiling scopes.
A new error code CUPTI_ERROR_VIRTUALIZED_DEVICE_NOT_SUPPORTED is added to indicate that tracing and profiling on virtualized GPU is not supported.

1.1.31. Updates in CUDA 9.1

New Features

Added a field for correlation ID in the activity record CUpti_ActivityStream.

1.1.32. Updates in CUDA 9.0

New Features

CUPTI extends tracing and profiling support for devices with compute capability 7.0.
Usage of compute device memory can be tracked through CUPTI. A new activity record CUpti_ActivityMemory and activity kind CUPTI_ACTIVITY_KIND_MEMORY are added to track the allocation and freeing of memory. This activity record includes fields like virtual base address, size, PC (program counter), timestamps for memory allocation and free calls.
Unified memory profiling adds new events for thrashing, throttling, remote map and device-to-device migration on 64 bit Linux platforms. New events are added under enum CUpti_ActivityUnifiedMemoryCounterKind. Enum CUpti_ActivityUnifiedMemoryRemoteMapCause lists possible causes for remote map events.
PC sampling supports wide range of sampling periods ranging from 2^5 cycles to 2^31 cycles per sample. This can be controlled through new field samplingPeriod2 in the PC sampling configuration struct CUpti_ActivityPCSamplingConfig.
Added API cuptiDeviceSupported() to check support for a compute device.
Activity record CUpti_ActivityKernel3 for kernel execution has been deprecated and replaced by new activity record CUpti_ActivityKernel4. New record gives information about queued and submit timestamps which can help to determine software and hardware latencies associated with the kernel launch. These timestamps are not collected by default. Use API cuptiActivityEnableLatencyTimestamps() to enable collection. New field launchType of type CUpti_ActivityLaunchType can be used to determine if it is a cooperative CUDA kernel launch.
Activity record CUpti_ActivityPCSampling2 for PC sampling has been deprecated and replaced by new activity record CUpti_ActivityPCSampling3. New record accomodates 64-bit PC Offset supported on devices of compute capability 7.0 and higher.
Activity record CUpti_ActivityNvLink for NVLink attributes has been deprecated and replaced by new activity record CUpti_ActivityNvLink2. New record accomodates increased port numbers between two compute devices.
Activity record CUpti_ActivityGlobalAccess2 for source level global accesses has been deprecated and replaced by new activity record CUpti_ActivityGlobalAccess3. New record accomodates 64-bit PC Offset supported on devices of compute capability 7.0 and higher.
New attributes CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_SIZE and CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_LIMIT are added in the activity attribute enum CUpti_ActivityAttribute to set and get the profiling semaphore pool size and the pool limit.

1.1.33. Updates in CUDA 8.0

New Features

Sampling of the program counter (PC) is enhanced to point out the true latency issues, it indicates if the stall reasons for warps are actually causing stalls in the issue pipeline. Field latencySamples of new activity record CUpti_ActivityPCSampling2 provides true latency samples. This field is valid for devices with compute capability 6.0 and higher. See section PC Sampling for more details.
Support for NVLink topology information such as the pair of devices connected via NVLink, peak bandwidth, memory access permissions etc is provided through new activity record CUpti_ActivityNvLink. NVLink performance metrics for data transmitted/received, transmit/receive throughput and respective header overhead for each physical link. See section NVLink for more details.
CUPTI supports profiling of OpenACC applications. OpenACC profiling information is provided in the form of new activity records CUpti_ActivityOpenAccData, CUpti_ActivityOpenAccLaunch and CUpti_ActivityOpenAccOther. This aids in correlating OpenACC constructs on the CPU with the corresponding activity taking place on the GPU, and mapping it back to the source code. New API cuptiOpenACCInitialize is used to initialize profiling for supported OpenACC runtimes. See section OpenACC for more details.
Unified memory profiling provides GPU page fault events on devices with compute capability 6.0 and 64 bit Linux platforms. Enum CUpti_ActivityUnifiedMemoryAccessType lists memory access types for GPU page fault events and enum CUpti_ActivityUnifiedMemoryMigrationCause lists migration causes for data transfer events.
Unified Memory profiling support is extended to Mac platform.
Support for 16-bit floating point (FP16) data format profiling. New metrics inst_fp_16, flop_count_hp_add, flop_count_hp_mul, flop_count_hp_fma, flop_count_hp, flop_hp_efficiency, half_precision_fu_utilization are supported. Peak FP16 flops per cycle for device can be queried using the enum CUPTI_DEVICE_ATTR_FLOP_HP_PER_CYCLE added to CUpti_DeviceAttribute.
Added new activity kinds CUPTI_ACTIVITY_KIND_SYNCHRONIZATION, CUPTI_ACTIVITY_KIND_STREAM and CUPTI_ACTIVITY_KIND_CUDA_EVENT, to support the tracing of CUDA synchronization constructs such as context, stream and CUDA event synchronization. Synchronization details are provided in the form of new activity record CUpti_ActivitySynchronization. Enum CUpti_ActivitySynchronizationType lists different types of CUDA synchronization constructs.
APIs cuptiSetThreadIdType()/cuptiGetThreadIdType() to set/get the mechanism used to fetch the thread-id used in CUPTI records. Enum CUpti_ActivityThreadIdType lists all supported mechanisms.
Added API cuptiComputeCapabilitySupported() to check the support for a specific compute capability by the CUPTI.
Added support to establish correlation between an external API (such as OpenACC, OpenMP) and CUPTI API activity records. APIs cuptiActivityPushExternalCorrelationId() and cuptiActivityPopExternalCorrelationId() should be used to push and pop external correlation ids for the calling thread. Generated records of type CUpti_ActivityExternalCorrelation contain both external and CUPTI assigned correlation ids.
Added containers to store the information of events and metrics in the form of activity records CUpti_ActivityInstantaneousEvent, CUpti_ActivityInstantaneousEventInstance, CUpti_ActivityInstantaneousMetric and CUpti_ActivityInstantaneousMetricInstance. These activity records are not produced by the CUPTI, these are included for completeness and ease-of-use. Profilers built on top of CUPTI that sample events may choose to use these records to store the collected event data.
Support for domains and annotation of synchronization objects added in NVTX v2. New activity record CUpti_ActivityMarker2 and enums to indicate various stages of synchronization object i.e. CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE, CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE_SUCCESS, CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE_FAILED and CUPTI_ACTIVITY_FLAG_MARKER_SYNC_RELEASE are added.
Unused field runtimeCorrelationId of the activity record CUpti_ActivityMemset is broken into two fields flags and memoryKind to indicate the asynchronous behaviour and the kind of the memory used for the memset operation. It is supported by the new flag CUPTI_ACTIVITY_FLAG_MEMSET_ASYNC added in the enum CUpti_ActivityFlag.
Added flag CUPTI_ACTIVITY_MEMORY_KIND_MANAGED in the enum CUpti_ActivityMemoryKind to indicate managed memory.
API cuptiGetStreamId has been deprecated. A new API cuptiGetStreamIdEx is introduced to provide the stream id based on the legacy or per-thread default stream flag.

1.1.34. Updates in CUDA 7.5

New Features

Device-wide sampling of the program counter (PC) is enabled by default. This was a preview feature in the CUDA Toolkit 7.0 release and it was not enabled by default.
Ability to collect all events and metrics accurately in presence of multiple contexts on the GPU is extended for devices with compute capability 5.x.
API cuptiGetLastError is introduced to return the last error that has been produced by any of the CUPTI API calls or the callbacks in the same host thread.
Unified memory profiling is supported with MPS (Multi-Process Service)
Callback is provided to collect replay information after every kernel run during kernel replay. See API cuptiKernelReplaySubscribeUpdate and callback type CUpti_KernelReplayUpdateFunc.
Added new attributes in enum CUpti_DeviceAttribute to query maximum shared memory size for different cache preferences for a device function.

1.1.35. Updates in CUDA 7.0

New Features

CUPTI supports device-wide sampling of the program counter (PC). Program counters along with the stall reasons from all active warps are sampled at a fixed frequency in the round robin order. Activity record CUpti_ActivityPCSampling enabled using activity kind CUPTI_ACTIVITY_KIND_PC_SAMPLING outputs stall reason along with PC and other related information. Enum CUpti_ActivityPCSamplingStallReason lists all the stall reasons. Sampling period is configurable and can be tuned using API cuptiActivityConfigurePCSampling. This feature is available on devices with compute capability 5.2.
Added new activity record CUpti_ActivityInstructionCorrelation which can be used to dump source locator records for all the PCs of the function.
All events and metrics for devices with compute capability 3.x and 5.0 can be collected accurately in presence of multiple contexts on the GPU. In previous releases only some events and metrics could be collected accurately when multiple contexts were executing on the GPU.
Unified memory profiling is enhanced by providing fine grain data transfers to and from the GPU, coupled with more accurate timestamps with each transfer. This information is provided through new activity record CUpti_ActivityUnifiedMemoryCounter2, deprecating old record CUpti_ActivityUnifiedMemoryCounter.
MPS tracing and profiling support is extended on multi-gpu setups.
Activity record CUpti_ActivityDevice for device information has been deprecated and replaced by new activity record CUpti_ActivityDevice2. New record adds device UUID which can be used to uniquely identify the device across profiler runs.
Activity record CUpti_ActivityKernel2 for kernel execution has been deprecated and replaced by new activity record CUpti_ActivityKernel3. New record gives information about Global Partitioned Cache Configuration requested and executed. Partitioned global caching has an impact on occupancy calculation. If it is ON, then a CTA can only use a half SM, and thus a half of the registers available per SM. The new fields apply for devices with compute capability 5.2 and higher. Note that this change was done in CUDA 6.5 release with support for compute capabilty 5.2.

1.1.36. Updates in CUDA 6.5

New Features

Instruction classification is done for source-correlated Instruction Execution activity CUpti_ActivityInstructionExecution. See CUpti_ActivityInstructionClass for instruction classes.
Two new device attributes are added to the activity CUpti_DeviceAttribute:
- CUPTI_DEVICE_ATTR_FLOP_SP_PER_CYCLE gives peak single precision flop per cycle for the GPU.
- CUPTI_DEVICE_ATTR_FLOP_DP_PER_CYCLE gives peak double precision flop per cycle for the GPU.
Two new metric properties are added:
- CUPTI_METRIC_PROPERTY_FLOP_SP_PER_CYCLE gives peak single precision flop per cycle for the GPU.
- CUPTI_METRIC_PROPERTY_FLOP_DP_PER_CYCLE gives peak double precision flop per cycle for the GPU.
Activity record CUpti_ActivityGlobalAccess for source level global access information has been deprecated and replaced by new activity record CUpti_ActivityGlobalAccess2. New record additionally gives information needed to map SASS assembly instructions to CUDA C source code. And it also provides ideal L2 transactions count based on the access pattern.
Activity record CUpti_ActivityBranch for source level branch information has been deprecated and replaced by new activity record CUpti_ActivityBranch2. New record additionally gives information needed to map SASS assembly instructions to CUDA C source code.
Sample sass_source_map is added to demonstrate the mapping of SASS assembly instructions to CUDA C source code.
Default event collection mode is changed to Kernel (CUPTI_EVENT_COLLECTION_MODE_KERNEL) from Continuous (CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS). Also Continuous mode is supported only on Tesla devices.
Profiling results might be inconsistent when auto boost is enabled. Profiler tries to disable auto boost by default, it might fail to do so in some conditions, but profiling will continue. A new API cuptiGetAutoBoostState is added to query the auto boost state of the device. This API returns error CUPTI_ERROR_NOT_SUPPORTED on devices that don’t support auto boost. Note that auto boost is supported only on certain Tesla devices from the Kepler+ family.
Activity record CUpti_ActivityKernel2 for kernel execution has been deprecated and replaced by new activity record CUpti_ActivityKernel3. New record additionally gives information about Global Partitioned Cache Configuration requested and executed. The new fields apply for devices with 5.2 Compute Capability.

1.1.37. Updates in CUDA 6.0

New Features

Two new CUPTI activity kinds have been introduced to enable two new types of source-correlated data collection. The Instruction Execution kind collects SASS-level instruction execution counts, divergence data, and predication data. The Shared Access kind collects source correlated data indication inefficient shared memory accesses.
CUPTI provides support for CUDA applications using Unified Memory. A new activity record reports Unified Memory activity such as transfers to and from a GPU and the number of Unified Memory related page faults.
CUPTI recognized and reports the special MPS context that is used by CUDA applications running on a system with MPS enabled.
The CUpti_ActivityContext activity record CUpti_ActivityContext has been updated to introduce a new field into the structure in a backwards compatible manner. The 32-bit computeApiKind field was replaced with two 16 bit fields, computeApiKind and defaultStreamId. Because all valid computeApiKind values fit within 16 bits, and because all supported CUDA platforms are little-endian, persisted context record data read with the new structure will have the correct value for computeApiKind and have a value of zero for defaultStreamId. The CUPTI client is responsible for versioning the persisted context data to recognize when the defaultStreamId field is valid.
To ensure that metric values are calculated as accurately as possible, a new metric API is introduced. Function cuptiMetricGetRequiredEventGroupSets can be used to get the groups of events that should be collected at the same time.
Execution overheads introduced by CUPTI have been dramatically decreased.
The new activity buffer API introduced in CUDA Toolkit 5.5 is required. The legacy cuptiActivityEnqueueBuffer and cuptiActivityDequeueBuffer functions have been removed.

1.1.38. Updates in CUDA 5.5

New Features

Applications that use CUDA Dynamic Parallelism can be profiled using CUPTI. Device-side kernel launches are reported using a new activity kind.
Device attributes such as power usage, clocks, thermals, etc. are reported via a new activity kind.
A new activity buffer API uses callbacks to request and return buffers of activity records. The existing cuptiActivityEnqueueBuffer and cuptiActivityDequeueBuffer functions are still supported but are deprecated and will be removed in a future release.
The Event API supports kernel replay so that any number of events can be collected during a single run of the application.
A new metric API cuptiMetricGetValue2 allows metric values to be calculated for any device, even if that device is not available on the system.
CUDA peer-to-peer memory copies are reported explicitly via the activity API. In previous releases these memory copies were only partially reported.

1.2. Known Issues

The following are known issues with the current release.

A security vulnerability issue required profiling tools to disable features using GPU performance counters for non-root or non-admin users when using a Windows 419.17 or Linux 418.43 or later driver. By default, NVIDIA drivers require elevated permissions to access GPU performance counters. On Tegra platforms, profile as root or using sudo. On other platforms, you can either start profiling as root or using sudo, or by enabling non-admin profiling. More details about the issue and the solutions can be found on the ERR_NVGPUCTRPERM web page.

Note

CUPTI allows tracing features for non-root and non-admin users on desktop platforms only, Tegra platforms require root or sudo access.
Profiling results might be inconsistent when auto boost is enabled. Profiler tries to disable auto boost by default. But it might fail to do so in some conditions and profiling will continue and results will be inconsistent. API cuptiGetAutoBoostState() can be used to query the auto boost state of the device. This API returns error CUPTI_ERROR_NOT_SUPPORTED on devices that don’t support auto boost. Note that auto boost is supported only on certain Tesla devices with compute capability 3.0 and higher.
CUPTI doesn’t populate the activity structures which are deprecated, instead the newer version of the activity structure is filled with the information.
Because of the low resolution of the timer on Windows, the start and end timestamps can be same for activities having short execution duration on Windows.
The application which calls CUPTI APIs cannot be used with Nvidia tools like nvprof, Nvidia Visual Profiler, Nsight Compute, Nsight Systems, Nvidia Nsight Visual Studio Edition, cuda-gdb and cuda-memcheck.
PCIe and NVLink records, when enabled using the API cuptiActivityEnable, are not captured when CUPTI is initialized lazily after the CUDA initialization. API cuptiActivityEnableAndDump can be used to dump the records for these activities at any point during the profiling session.
CUPTI fails to profile the OpenACC application when the OpenACC library linked with the application has missing definition of the OpenACC API routine/s. This is indicated by the error code CUPTI_ERROR_OPENACC_UNDEFINED_ROUTINE.
OpenACC profiling might fail when OpenACC library is linked statically in the user application. This happens due to the missing definition of the OpenACC API routines needed for the OpenACC profiling, as compiler might ignore definitions for the functions not used in the application. This issue can be mitigated by linking the OpenACC library dynamically.
Unified memory profiling is not supported on the ARM architecture.
Profiling a C++ application which overloads the new operator at the global scope and uses any CUDA APIs like cudaMalloc() or cudaMallocManaged() inside the overloaded new operator will result in a hang.
Devices with compute capability 6.0 and higher introduce a new feature, compute preemption, to give fair chance for all compute contexts while running long tasks. With compute preemption feature-
- If multiple contexts are running in parallel it is possible that long kernels will get preempted.
- Some kernels may get preempted occasionally due to timeslice expiry for the context.
If kernel has been preempted, the time the kernel spends preempted is still counted towards kernel duration.

Compute preemption can affect events and metrics collection. The following are known issues with the current release:

Events and metrics collection for a MPS client can result in higher counts than expected on devices with compute capability 7.0 and higher, since MPS client may get preempted due to termination of another MPS client.
Events warps_launched and sm_cta_launched and metric inst_per_warp might provide higher counts than expected on devices with compute capability 6.0 and higher. Metric unique_warps_launched can be used in place of warps_launched to get correct count of actual warps launched as it is not affected by compute preemption.

To avoid compute preemption affecting profiler results try to isolate the context being profiled:
- Run the application on secondary GPU where display is not connected.
- On Linux if the application is running on the primary GPU where the display driver is connected then unload the display driver.
- Run only one process that uses GPU at one time.
Devices with compute capability 6.0 and higher support demand paging. When the kernel is scheduled for the first time, all the pages allocated using cudaMallocManaged and that are required for execution of the kernel are fetched in the global memory when GPU faults are generated. Profiler requires multiple passes to collect all the metrics required for kernel analysis. The kernel state needs to be saved and restored for each kernel replay pass. For devices with compute capability 6.0 and higher and platforms supporting Unified memory, in the first kernel iteration the GPU faults will be generated and all pages will be fetched in the global memory. Second iteration onwards GPU page faults will not occur. This will significantly affect the memory related events and timing. The time taken from trace will include the time required to fetch the pages but most of the metrics profiled in multiple iterations will not include time/cycles required to fetch the pages. This causes inconsistency in the profiler results.
When profiling an application that uses CUDA Dynamic Parallelism (CDP) there are several limitations to the profiling tools. CUDA 12.0 adds support for revamped CUDA Dynamic Parallelism APIs (referred to as CDP2), offering substantial performance improvements vs. the legacy CUDA Dynamic Parallelism APIs (referred to as CDP1).
- For Legacy CUDA Dynamic Parallelism (CDP1), CUPTI supports tracing of all host and device kernels for devices with compute capability 5.x and 6.x. For devices with compute capability 7.0 and higher, CUPTI traces all the host launched kernels until it encounters a host launched kernel which launches child kernels; subsequent kernels are not traced.
- For CUDA Dynamic Parallelism (CDP2), CUPTI supports tracing of host launched kernels only, it can’t trace device launched kernels.
- CUPTI doesn’t report CUDA API calls for device launched kernels.
- CUPTI doesn’t support profiling of device launched kernels i.e. it doesn’t report detailed event, metric, and source-level results for device launched kernels. Event, metric, and source-level results collected for CPU-launched kernels will include event, metric, and source-level results for the entire call-tree of kernels launched from within that kernel.
When profiling an application that uses CUDA Device Graphs, there are some limitations to the profiling tools.
- CUPTI traces the device graph when it is launched from the host. When the graph is launched from the device, graph level tracing is supported, but node level tracing is not.
Compilation of samples autorange_profiling and userrange_profiling requires a host compiler which supports C++11 features. For some g++ compilers, it is required to use the flag -std=c++11 to turn on C++11 features.
PC Sampling Activity API is not supported on Tegra platforms, while PC Sampling API is supported on Tegra platforms.
As of CUDA 11.4 and R470 TRD1 driver release, CUPTI is supported in a vGPU environment which requires a vGPU license. If the license is not obtained after 20 minutes, the reported performance data including metrics from the GPU will be inaccurate. This is because of a feature in vGPU environment which reduces performance but retains functionality as specified here.
CUPTI is not supported on NVIDIA Crypto Mining Processors (CMP). This is reported using the error code CUPTI_ERROR_CMP_DEVICE_NOT_SUPPORTED. For more information, please visit the web page.
CUPTI versions shipped in the CUDA Toolkit 11.7 and CUDA Toolkit 11.8 don’t support Kepler (sm_35 and sm_37) devices. Refer to the webpages CUPTI 11.7 and CUPTI 11.8 for location of the CUPTI packages having the support for these Kepler devices.
Support for the GA103 GPU was added in the CUDA 11.6 release but it was broken for releases from CUDA 11.8 to CUDA 12.2 Update 1.
Unified memory profiling is broken for Maxwell devices on Windows platform.
For confidential computing devices, allocation of pinned (page-locked) host memory for profiling buffer for concurrent kernel tracing is not supported. Setting attribute CUPTI_ACTIVITY_ATTR_MEM_ALLOCATION_TYPE_HOST_PINNED of the activity attribute enum CUpti_ActivityAttribute will return the error code CUPTI_ERROR_NOT_SUPPORTED.
There is no tracing and profiling support for chip-to-chip (C2C) interconnect.
With the new PC Sampling APIs CUPTI doesn’t report any pc sampling data for cuda graph launches in serialized mode.

1.2.1. Profiling

The following are common known issues for both the event and metric APIs and the profiling APIs:

Profiling may significantly change the overall performance characteristics of the application. Refer to the section CUPTI Overhead for more details.
Profiling a kernel while other contexts are active on the same device (e.g. X server, or secondary CUDA or graphics application) can result in varying metric values for L2/FB (Device Memory) related metrics. Specifically, L2/FB traffic from non-profiled contexts cannot be excluded from the metric results. To completely avoid this issue, profile the application on a GPU without secondary contexts accessing the same device (e.g. no X server on Linux).
Profiling is not supported for multidevice cooperative kernels, that is, kernels launched by using the API functions cudaLaunchCooperativeKernelMultiDevice or cuLaunchCooperativeKernelMultiDevice.
Enabling certain events can cause GPU kernels to run longer than the driver’s watchdog time-out limit. In these cases the driver will terminate the GPU kernel resulting in an application error and profiling data will not be available. Please disable the driver watchdog time out before profiling such long running CUDA kernels
- On Linux, setting the X Config option Interactive to false is recommended.
- For Windows, detailed information about TDR (Timeout Detection and Recovery) and how to disable it is available at https://docs.microsoft.com/en-us/windows-hardware/drivers/display/timeout-detection-and-recovery

1.2.1.1. Event and Metric API

The following are known issues related to Event and Metric API:

The CUPTI event APIs from the header cupti_events.h and metric APIs from the header cupti_metrics.h are not supported for the devices with compute capability 7.5 and higher. These are replaced by Profiling API and Perfworks metric API. Refer to the section Migration to the Profiling API.
While collecting events in continuous mode, event reporting may be delayed i.e. event values may be returned by a later call to readEvent(s) API and the event values for the last readEvent(s) API may get lost.
When profiling events, it is possible that the domain instance that gets profiled gives event value 0 due to absence of workload on the domain instance since CUPTI profiles one instance of the domain by default. To profile all instances of the domain, user can set event group attribute CUPTI_EVENT_GROUP_ATTR_PROFILE_ALL_DOMAIN_INSTANCES through API cuptiEventGroupSetAttribute().
Profiling results might be incorrect for CUDA applications compiled with nvcc version older than 9.0 for devices with compute capability 6.0 and 6.1. Profiling session will continue and CUPTI will notify it using error code CUPTI_ERROR_CUDA_COMPILER_NOT_COMPATIBLE. It is advised to recompile the application code with nvcc version 9.0 or later. Ignore this warning if code is already compiled with the recommended nvcc version.
For some metrics, the required events can only be collected for a single CUDA context. For an application that uses multiple CUDA contexts, these metrics will only be collected for one of the contexts. The metrics that can be collected only for a single CUDA context are indicated in the metric reference tables.
Some metric values are calculated assuming a kernel is large enough to occupy all device multiprocessors with approximately the same amount of work. If a kernel launch does not have this characteristic, then those metric values may not be accurate.
Some events and metrics are not available on all devices. For list of metrics, you can refer to the metric reference tables.
CUPTI can give out of memory error for event and metrics profiling, it could be due to large number of instructions in the kernel.
Profiling is not supported for CUDA kernel nodes launched by a CUDA Graph.
These APIs are not supported on below system configurations:
- 64-bit ARM Server CPU architecture (arm64 SBSA).
- Virtual GPUs (vGPU).
- Windows Subsystem for Linux (WSL).

1.2.1.2. Profiling and Perfworks Metric API

The following are known issues related to the Profiling and Perfworks Metric API:

Profiling a kernel while any other GPU work is executing on the same MIG compute instance can result in varying metric values for all units. Care should be taken to serialize, or otherwise prevent concurrent CUDA launches within the target application to ensure those kernels do not influence each other. Be aware that GPU work issued through other APIs in the target process or workloads created by non-target processes running simultaneously in the same MIG compute instance will influence the collected metrics. Note that it is acceptable to run CUDA processes in other MIG compute instances as they will not influence the profiled MIG compute instance.
For devices with compute capability 8.0, the NVLink topology information is available but NVLink performance metrics (nvlrx__* and nvltx__*) are not supported due to a potential application hang during data collection.
Profiling is not supported under MPS (Multi-Process Service).
For profiling the CUDA kernel nodes launched by a CUDA Graph, not all combinations of range profiling and replay modes are supported. Here are some limitations:
- User replay and application replay modes with auto range are not supported.
- In the user range mode, entire graph is profiled as one workload i.e. all the kernel nodes launched by the CUDA Graph will be profiled and single result will be provided, user can’t do the profiling for a range of kernels.
- For Device Graph profiling in the auto range and kernel replay mode, each kernel node will be profiled except for the nodes which launch device graphs.
Profiling kernels executed on a device that is part of an SLI group is not supported.
Refer to the section for differences from event and metric APIs.
Profiling on Windows Subsystem for Linux (WSL) is only supported with WSL version 2, NVIDIA display driver version 525 or higher and Windows 11.
Profiling is not supported for applications using Green Contexts.
Profiling is not supported for device graphs which have been updated after instantiation.

1.3. Support

Information on supported platforms and GPUs.

1.3.1. Platform Support

Table 1. Platforms supported by CUPTI
Platform	Support
Windows	Yes
Windows Subsystem for Linux version 2 (WSL 2)	Yes*
Linux (x86_64)	Yes
Linux (ppc64le)	Yes
Linux (aarch64 sbsa)	Yes*
Linux (x86_64) (Drive SDK)	Yes*
Linux (aarch64)	Yes*
QNX	Yes*
Mac OSX	No
Android	No

Tracing and profiling of 32-bit processes is not supported.

Event and Metric APIs are not supported on Linux (aarch64 sbsa) and WSL 2 platforms.

1.3.2. GPU Support

Table 2. GPU architectures supported by different CUPTI APIs
CUPTI API	Supported GPU architectures	Notes
Activity	Maxwell and later GPU architectures, i.e. devices with compute capability 5.0 and higher
Callback	Maxwell and later GPU architectures, i.e. devices with compute capability 5.0 and higher
Event	Maxwell, Pascal, Volta	Not supported on Turing and later GPU architectures, i.e. devices with compute capability 7.5 and higher
Metric	Maxwell, Pascal, Volta	Not supported on Turing and later GPU architectures, i.e. devices with compute capability 7.5 and higher
Profiling	Volta and later GPU architectures, i.e. devices with compute capability 7.0 and higher	Not supported on Maxwell and Pascal GPUs
PC Sampling	Volta and later GPU architectures, i.e. devices with compute capability 7.0 and higher	Not supported on Maxwell and Pascal GPUs
SASS Metric	Volta and later GPU architectures, i.e. devices with compute capability 7.0 and higher	Not supported on Maxwell and Pascal GPUs
Checkpoint	Volta and later GPU architectures, i.e. devices with compute capability 7.0 and higher	Not supported on Maxwell and Pascal GPUs