CUPTI :: CUDA Toolkit Documentation

6. Changelog

CUPTI changes in CUDA 9.1

List of changes done as part of the CUDA Toolkit 9.1 release.

Added a field for correlation ID in the activity record CUpti_ActivityStream.

CUPTI changes in CUDA 9.0

List of changes done as part of the CUDA Toolkit 9.0 release.

CUPTI extends tracing and profiling support for devices with compute capability 7.0.
Usage of compute device memory can be tracked through CUPTI. A new activity record CUpti_ActivityMemory and activity kind CUPTI_ACTIVITY_KIND_MEMORY are added to track the allocation and freeing of memory. This activity record includes fields like virtual base address, size, PC (program counter), timestamps for memory allocation and free calls.
Unified memory profiling adds new events for thrashing, throttling, remote map and device-to-device migration on 64 bit Linux platforms. New events are added under enum CUpti_ActivityUnifiedMemoryCounterKind. Enum CUpti_ActivityUnifiedMemoryRemoteMapCause lists possible causes for remote map events.
PC sampling supports wide range of sampling periods ranging from 2^5 cycles to 2^31 cycles per sample. This can be controlled through new field samplingPeriod2 in the PC sampling configuration struct CUpti_ActivityPCSamplingConfig.
Added API cuptiDeviceSupported() to check support for a compute device.
Activity record CUpti_ActivityKernel3 for kernel execution has been deprecated and replaced by new activity record CUpti_ActivityKernel4. New record gives information about queued and submit timestamps which can help to determine software and hardware latencies associated with the kernel launch. These timestamps are not collected by default. Use API cuptiActivityEnableLatencyTimestamps() to enable collection. New field launchType of type CUpti_ActivityLaunchType can be used to determine if it is a cooperative CUDA kernel launch.
Activity record CUpti_ActivityPCSampling2 for PC sampling has been deprecated and replaced by new activity record CUpti_ActivityPCSampling3. New record accomodates 64-bit PC Offset supported on devices of compute capability 7.0 and higher.
Activity record CUpti_ActivityNvLink for NVLink attributes has been deprecated and replaced by new activity record CUpti_ActivityNvLink2. New record accomodates increased port numbers between two compute devices.
Activity record CUpti_ActivityGlobalAccess2 for source level global accesses has been deprecated and replaced by new activity record CUpti_ActivityGlobalAccess3. New record accomodates 64-bit PC Offset supported on devices of compute capability 7.0 and higher.
New attributes CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_SIZE and CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_LIMIT are added in the activity attribute enum CUpti_ActivityAttribute to set and get the profiling semaphore pool size and the pool limit.

CUPTI changes in CUDA 8.0

List of changes done as part of the CUDA Toolkit 8.0 release.

Sampling of the program counter (PC) is enhanced to point out the true latency issues, it indicates if the stall reasons for warps are actually causing stalls in the issue pipeline. Field latencySamples of new activity record CUpti_ActivityPCSampling2 provides true latency samples. This field is valid for devices with compute capability 6.0 and higher. See section PC Sampling for more details.
Support for NVLink topology information such as the pair of devices connected via NVLink, peak bandwidth, memory access permissions etc is provided through new activity record CUpti_ActivityNvLink. NVLink performance metrics for data transmitted/received, transmit/receive throughput and respective header overhead for each physical link. See section NVLink for more details.
CUPTI supports profiling of OpenACC applications. OpenACC profiling information is provided in the form of new activity records CUpti_ActivityOpenAccData, CUpti_ActivityOpenAccLaunch and CUpti_ActivityOpenAccOther. This aids in correlating OpenACC constructs on the CPU with the corresponding activity taking place on the GPU, and mapping it back to the source code. New API cuptiOpenACCInitialize is used to initialize profiling for supported OpenACC runtimes. See section OpenACC for more details.
Unified memory profiling provides GPU page fault events on devices with compute capability 6.0 and 64 bit Linux platforms. Enum CUpti_ActivityUnifiedMemoryAccessType lists memory access types for GPU page fault events and enum CUpti_ActivityUnifiedMemoryMigrationCause lists migration causes for data transfer events.
Unified Memory profiling support is extended to Mac platform.
Support for 16-bit floating point (FP16) data format profiling. New metrics inst_fp_16, flop_count_hp_add, flop_count_hp_mul, flop_count_hp_fma, flop_count_hp, flop_hp_efficiency, half_precision_fu_utilization are supported. Peak FP16 flops per cycle for device can be queried using the enum CUPTI_DEVICE_ATTR_FLOP_HP_PER_CYCLE added to CUpti_DeviceAttribute.
Added new activity kinds CUPTI_ACTIVITY_KIND_SYNCHRONIZATION, CUPTI_ACTIVITY_KIND_STREAM and CUPTI_ACTIVITY_KIND_CUDA_EVENT, to support the tracing of CUDA synchronization constructs such as context, stream and CUDA event synchronization. Synchronization details are provided in the form of new activity record CUpti_ActivitySynchronization. Enum CUpti_ActivitySynchronizationType lists different types of CUDA synchronization constructs.
APIs cuptiSetThreadIdType()/cuptiGetThreadIdType() to set/get the mechanism used to fetch the thread-id used in CUPTI records. Enum CUpti_ActivityThreadIdType lists all supported mechanisms.
Added API cuptiComputeCapabilitySupported() to check the support for a specific compute capability by the CUPTI.
Added support to establish correlation between an external API (such as OpenACC, OpenMP) and CUPTI API activity records. APIs cuptiActivityPushExternalCorrelationId() and cuptiActivityPopExternalCorrelationId() should be used to push and pop external correlation ids for the calling thread. Generated records of type CUpti_ActivityExternalCorrelation contain both external and CUPTI assigned correlation ids.
Added containers to store the information of events and metrics in the form of activity records CUpti_ActivityInstantaneousEvent, CUpti_ActivityInstantaneousEventInstance, CUpti_ActivityInstantaneousMetric and CUpti_ActivityInstantaneousMetricInstance. These activity records are not produced by the CUPTI, these are included for completeness and ease-of-use. Profilers built on top of CUPTI that sample events may choose to use these records to store the collected event data.
Support for domains and annotation of synchronization objects added in NVTX v2. New activity record CUpti_ActivityMarker2 and enums to indicate various stages of synchronization object i.e. CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE, CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE_SUCCESS, CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE_FAILED and CUPTI_ACTIVITY_FLAG_MARKER_SYNC_RELEASE are added.
Unused field runtimeCorrelationId of the activity record CUpti_ActivityMemset is broken into two fields flags and memoryKind to indicate the asynchronous behaviour and the kind of the memory used for the memset operation. It is supported by the new flag CUPTI_ACTIVITY_FLAG_MEMSET_ASYNC added in the enum CUpti_ActivityFlag.
Added flag CUPTI_ACTIVITY_MEMORY_KIND_MANAGED in the enum CUpti_ActivityMemoryKind to indicate managed memory.
API cuptiGetStreamId has been deprecated. A new API cuptiGetStreamIdEx is introduced to provide the stream id based on the legacy or per-thread default stream flag.

CUPTI changes in CUDA 7.5

List of changes done as part of the CUDA Toolkit 7.5 release.

Device-wide sampling of the program counter (PC) is enabled by default. This was a preview feature in the CUDA Toolkit 7.0 release and it was not enabled by default.
Ability to collect all events and metrics accurately in presence of multiple contexts on the GPU is extended for devices with compute capability 5.x.
API cuptiGetLastError is introduced to return the last error that has been produced by any of the CUPTI API calls or the callbacks in the same host thread.
Unified memory profiling is supported with MPS (Multi-Process Service)
Callback is provided to collect replay information after every kernel run during kernel replay. See API cuptiKernelReplaySubscribeUpdate and callback type CUpti_KernelReplayUpdateFunc.
Added new attributes in enum CUpti_DeviceAttribute to query maximum shared memory size for different cache preferences for a device function.

CUPTI changes in CUDA 7.0

List of changes done as part of the CUDA Toolkit 7.0 release.

CUPTI supports device-wide sampling of the program counter (PC). Program counters along with the stall reasons from all active warps are sampled at a fixed frequency in the round robin order. Activity record CUpti_ActivityPCSampling enabled using activity kind CUPTI_ACTIVITY_KIND_PC_SAMPLING outputs stall reason along with PC and other related information. Enum CUpti_ActivityPCSamplingStallReason lists all the stall reasons. Sampling period is configurable and can be tuned using API cuptiActivityConfigurePCSampling. This feature is available on devices with compute capability 5.2.
Added new activity record CUpti_ActivityInstructionCorrelation which can be used to dump source locator records for all the PCs of the function.
All events and metrics for devices with compute capability 3.x and 5.0 can be collected accurately in presence of multiple contexts on the GPU. In previous releases only some events and metrics could be collected accurately when multiple contexts were executing on the GPU.
Unified memory profiling is enhanced by providing fine grain data transfers to and from the GPU, coupled with more accurate timestamps with each transfer. This information is provided through new activity record CUpti_ActivityUnifiedMemoryCounter2, deprecating old record CUpti_ActivityUnifiedMemoryCounter.
MPS tracing and profiling support is extended on multi-gpu setups.
Activity record CUpti_ActivityDevice for device information has been deprecated and replaced by new activity record CUpti_ActivityDevice2. New record adds device UUID which can be used to uniquely identify the device across profiler runs.
Activity record CUpti_ActivityKernel2 for kernel execution has been deprecated and replaced by new activity record CUpti_ActivityKernel3. New record gives information about Global Partitioned Cache Configuration requested and executed. Partitioned global caching has an impact on occupancy calculation. If it is ON, then a CTA can only use a half SM, and thus a half of the registers available per SM. The new fields apply for devices with compute capability 5.2 and higher. Note that this change was done in CUDA 6.5 release with support for compute capabilty 5.2.

CUPTI changes in CUDA 6.5

List of changes done as part of the CUDA Toolkit 6.5 release.

Instruction classification is done for source-correlated Instruction Execution activity CUpti_ActivityInstructionExecution. See CUpti_ActivityInstructionClass for instruction classes.
Two new device attributes are added to the activity CUpti_DeviceAttribute:
- CUPTI_DEVICE_ATTR_FLOP_SP_PER_CYCLE gives peak single precision flop per cycle for the GPU.
- CUPTI_DEVICE_ATTR_FLOP_DP_PER_CYCLE gives peak double precision flop per cycle for the GPU.
Two new metric properties are added:
- CUPTI_METRIC_PROPERTY_FLOP_SP_PER_CYCLE gives peak single precision flop per cycle for the GPU.
- CUPTI_METRIC_PROPERTY_FLOP_DP_PER_CYCLE gives peak double precision flop per cycle for the GPU.
Activity record CUpti_ActivityGlobalAccess for source level global access information has been deprecated and replaced by new activity record CUpti_ActivityGlobalAccess2. New record additionally gives information needed to map SASS assembly instructions to CUDA C source code. And it also provides ideal L2 transactions count based on the access pattern.
Activity record CUpti_ActivityBranch for source level branch information has been deprecated and replaced by new activity record CUpti_ActivityBranch2. New record additionally gives information needed to map SASS assembly instructions to CUDA C source code.
Sample sass_source_map is added to demonstrate the mapping of SASS assembly instructions to CUDA C source code.
Default event collection mode is changed to Kernel (CUPTI_EVENT_COLLECTION_MODE_KERNEL) from Continuous (CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS). Also Continuous mode is supported only on Tesla devices.
Profiling results might be inconsistent when auto boost is enabled. Profiler tries to disable auto boost by default, it might fail to do so in some conditions, but profiling will continue. A new API cuptiGetAutoBoostState is added to query the auto boost state of the device. This API returns error CUPTI_ERROR_NOT_SUPPORTED on devices that don't support auto boost. Note that auto boost is supported only on certain Tesla devices from the Kepler+ family.
Activity record CUpti_ActivityKernel2 for kernel execution has been deprecated and replaced by new activity record CUpti_ActivityKernel3. New record additionally gives information about Global Partitioned Cache Configuration requested and executed. The new fields apply for devices with 5.2 Compute Capability.

CUPTI changes in CUDA 6.0

List of changes done as part of the CUDA Toolkit 6.0 release.

Two new CUPTI activity kinds have been introduced to enable two new types of source-correlated data collection. The Instruction Execution kind collects SASS-level instruction execution counts, divergence data, and predication data. The Shared Access kind collects source correlated data indication inefficient shared memory accesses.
CUPTI provides support for CUDA applications using Unified Memory. A new activity record reports Unified Memory activity such as transfers to and from a GPU and the number of Unified Memory related page faults.
CUPTI recognized and reports the special MPS context that is used by CUDA applications running on a system with MPS enabled.
The CUpti_ActivityContext activity record CUpti_ActivityContext has been updated to introduce a new field into the structure in a backwards compatible manner. The 32-bit computeApiKind field was replaced with two 16 bit fields, computeApiKind and defaultStreamId. Because all valid computeApiKind values fit within 16 bits, and because all supported CUDA platforms are little-endian, persisted context record data read with the new structure will have the correct value for computeApiKind and have a value of zero for defaultStreamId. The CUPTI client is responsible for versioning the persisted context data to recognize when the defaultStreamId field is valid.
To ensure that metric values are calculated as accurately as possible, a new metric API is introduced. Function cuptiMetricGetRequiredEventGroupSets can be used to get the groups of events that should be collected at the same time.
Execution overheads introduced by CUPTI have been dramatically decreased.
The new activity buffer API introduced in CUDA Toolkit 5.5 is required. The legacy cuptiActivityEnqueueBuffer and cuptiActivityDequeueBuffer functions have been removed.

CUPTI changes in CUDA 5.5

List of changes done as part of CUDA Toolkit 5.5 release.

Applications that use CUDA Dynamic Parallelism can be profiled using CUPTI. Device-side kernel launches are reported using a new activity kind.
Device attributes such as power usage, clocks, thermals, etc. are reported via a new activity kind.
A new activity buffer API uses callbacks to request and return buffers of activity records. The existing cuptiActivityEnqueueBuffer and cuptiActivityDequeueBuffer functions are still supported but are deprecated and will be removed in a future release.
The Event API supports kernel replay so that any number of events can be collected during a single run of the application.
A new metric API cuptiMetricGetValue2 allows metric values to be calculated for any device, even if that device is not available on the system.
CUDA peer-to-peer memory copies are reported explicitly via the activity API. In previous releases these memory copies were only partially reported.