CUPTI :: CUDA Toolkit Documentation

6. Changelog

CUPTI changes in CUDA 7.5

List of changes done as part of the CUDA Toolkit 7.5 release.

Device-wide sampling of the program counter (PC) is enabled by default. This was a preview feature in the CUDA Toolkit 7.0 release and it was not enabled by default.
Ability to collect all events and metrics accurately in presence of multiple contexts on the GPU is extended for devices with compute capability 5.x.
API cuptiGetLastError is introduced to return the last error that has been produced by any of the CUPTI API calls or the callbacks in the same host thread.
Unified memory profiling is supported with MPS (Multi-Process Service)
Callback is provided to collect replay information after every kernel run during kernel replay. See API cuptiKernelReplaySubscribeUpdate and callback type CUpti_KernelReplayUpdateFunc.
Added new attributes in enum CUpti_DeviceAttribute to query maximum shared memory size for different cache preferences for a device function.

CUPTI changes in CUDA 7.0

List of changes done as part of the CUDA Toolkit 7.0 release.

CUPTI supports device-wide sampling of the program counter (PC). Program counters along with the stall reasons from all active warps are sampled at a fixed frequency in the round robin order. Activity record CUpti_ActivityPCSampling enabled using activity kind CUPTI_ACTIVITY_KIND_PC_SAMPLING outputs stall reason along with PC and other related information. Enum CUpti_ActivityPCSamplingStallReason lists all the stall reasons. Sampling period is configurable and can be tuned using API cuptiActivityConfigurePCSampling. This feature is available on devices with compute capability 5.2.
Added new activity record CUpti_ActivityInstructionCorrelation which can be used to dump source locator records for all the PCs of the function.
All events and metrics for devices with compute capability 3.x and 5.0 can be collected accurately in presence of multiple contexts on the GPU. In previous releases only some events and metrics could be collected accurately when multiple contexts were executing on the GPU.
Unified memory profiling is enhanced by providing fine grain data transfers to and from the GPU, coupled with more accurate timestamps with each transfer. This information is provided through new activity record CUpti_ActivityUnifiedMemoryCounter2, deprecating old record CUpti_ActivityUnifiedMemoryCounter.
MPS tracing and profiling support is extended on multi-gpu setups.
Activity record CUpti_ActivityDevice for device information has been deprecated and replaced by new activity record CUpti_ActivityDevice2. New record adds device UUID which can be used to uniquely identify the device across profiler runs.
Activity record CUpti_ActivityKernel2 for kernel execution has been deprecated and replaced by new activity record CUpti_ActivityKernel3. New record gives information about Global Partitioned Cache Configuration requested and executed. Partitioned global caching has an impact on occupancy calculation. If it is ON, then a CTA can only use a half SM, and thus a half of the registers available per SM. The new fields apply for devices with compute capability 5.2 and higher. Note that this change was done in CUDA 6.5 release with support for compute capabilty 5.2.

CUPTI changes in CUDA 6.5

List of changes done as part of the CUDA Toolkit 6.5 release.

Instruction classification is done for source-correlated Instruction Execution activity CUpti_ActivityInstructionExecution. See CUpti_ActivityInstructionClass for instruction classes.
Two new device attributes are added to the activity CUpti_DeviceAttribute:
- CUPTI_DEVICE_ATTR_FLOP_SP_PER_CYCLE gives peak single precision flop per cycle for the GPU.
- CUPTI_DEVICE_ATTR_FLOP_DP_PER_CYCLE gives peak double precision flop per cycle for the GPU.
Two new metric properties are added:
- CUPTI_METRIC_PROPERTY_FLOP_SP_PER_CYCLE gives peak single precision flop per cycle for the GPU.
- CUPTI_METRIC_PROPERTY_FLOP_DP_PER_CYCLE gives peak double precision flop per cycle for the GPU.
Activity record CUpti_ActivityGlobalAccess for source level global access information has been deprecated and replaced by new activity record CUpti_ActivityGlobalAccess2. New record additionally gives information needed to map SASS assembly instructions to CUDA C source code. And it also provides ideal L2 transactions count based on the access pattern.
Activity record CUpti_ActivityBranch for source level branch information has been deprecated and replaced by new activity record CUpti_ActivityBranch2. New record additionally gives information needed to map SASS assembly instructions to CUDA C source code.
Sample sass_source_map is added to demonstrate the mapping of SASS assembly instructions to CUDA C source code.
Default event collection mode is changed to Kernel (CUPTI_EVENT_COLLECTION_MODE_KERNEL) from Continuous (CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS). Also Continuous mode is now supported only on Tesla devices.
Profiling results might be inconsistent when auto boost is enabled. Profiler tries to disable auto boost by default, it might fail to do so in some conditions, but profiling will continue. A new API cuptiGetAutoBoostState is added to query the auto boost state of the device. This API returns error CUPTI_ERROR_NOT_SUPPORTED on devices that don't support auto boost. Note that auto boost is supported only on certain Tesla devices from the Kepler+ family.
Activity record CUpti_ActivityKernel2 for kernel execution has been deprecated and replaced by new activity record CUpti_ActivityKernel3. New record additionally gives information about Global Partitioned Cache Configuration requested and executed. The new fields apply for devices with 5.2 Compute Capability.

CUPTI changes in CUDA 6.0

List of changes done as part of the CUDA Toolkit 6.0 release.

Two new CUPTI activity kinds have been introduced to enable two new types of source-correlated data collection. The Instruction Execution kind collects SASS-level instruction execution counts, divergence data, and predication data. The Shared Access kind collects source correlated data indication inefficient shared memory accesses.
CUPTI now provides support for CUDA applications using Unified Memory. A new activity record reports Unified Memory activity such as transfers to and from a GPU and the number of Unified Memory related page faults.
CUPTI now recognized and reports the special MPS context that is used by CUDA applications running on a system with MPS enabled.
The CUpti_ActivityContext activity record CUpti_ActivityContext has been updated to introduce a new field into the structure in a backwards compatible manner. The 32-bit computeApiKind field was replaced with two 16 bit fields, computeApiKind and defaultStreamId. Because all valid computeApiKind values fit within 16 bits, and because all supported CUDA platforms are little-endian, persisted context record data read with the new structure will have the correct value for computeApiKind and have a value of zero for defaultStreamId. The CUPTI client is responsible for versioning the persisted context data to recognize when the defaultStreamId field is valid.
To ensure that metric values are calculated as accurately as possible, a new metric API is introduced. Function cuptiMetricGetRequiredEventGroupSets can be used to get the groups of events that should be collected at the same time.
Execution overheads introduced by CUPTI have been dramatically decreased.
The new activity buffer API introduced in CUDA Toolkit 5.5 is now required. The legacy cuptiActivityEnqueueBuffer and cuptiActivityDequeueBuffer functions have been removed.

CUPTI changes in CUDA 5.5

List of changes done as part of CUDA Toolkit 5.5 release.

Applications that use CUDA Dynamic Parallelism can now be profiled using CUPTI. Device-side kernel launches are reported using a new activity kind.
Device attributes such as power usage, clocks, thermals, etc. are now reported via a new activity kind.
A new activity buffer API uses callbacks to request and return buffers of activity records. The existing cuptiActivityEnqueueBuffer and cuptiActivityDequeueBuffer functions are still supported but are deprecated and will be removed in a future release.
The Event API supports kernel replay so that any number of events can be collected during a single run of the application.
A new metric API cuptiMetricGetValue2 allows metric values to be calculated for any device, even if that device is not available on the system.
CUDA peer-to-peer memory copies are reported explicitly via the activity API. In previous releases these memory copies were only partially reported.