What's New

CUPTI contains a number of new features and changes as part of the CUDA Toolkit 7.0 release.
  • CUPTI now supports device-wide sampling of the program counter (PC). Program counters along with the stall reasons from all active warps are sampled at a fixed frequency in the round robin order. Activity record CUpti_ActivityPCSampling enabled using activity kind CUPTI_ACTIVITY_KIND_PC_SAMPLING outputs stall reason along with PC and other related information. Enum CUpti_ActivityPCSamplingStallReason lists all the stall reasons. Sampling period is configurable and can be tuned using API cuptiActivityConfigurePCSampling. This feature is available on devices with compute capability 5.2. This is a preview feature in the CUDA Toolkit 7.0 release and it is not enabled by default. Users can contact NVIDIA by email ( to get additional information on how this feature can be enabled.
  • Added new activity record CUpti_ActivityInstructionCorrelation which can be used to dump source locator records for all the PCs of the function.
  • All events and metrics for devices with compute capability 3.x and 5.0 can now be collected accurately in presence of multiple contexts on the GPU. In previous releases only some events and metrics could be collected accurately when multiple contexts were executing on the GPU.
  • Unified memory profiling is enhanced by providing fine grain data transfers to and from the GPU, coupled with more accurate timestamps with each transfer. This information is provided through new activity record CUpti_ActivityUnifiedMemoryCounter2, deprecating old record CUpti_ActivityUnifiedMemoryCounter.
  • MPS tracing and profiling support is extended on multi-gpu setups.
  • Activity record CUpti_ActivityDevice for device information has been deprecated and replaced by new activity record CUpti_ActivityDevice2. New record adds device UUID which can be used to uniquely identify the device across profiler runs.
  • Activity record CUpti_ActivityKernel2 for kernel execution has been deprecated and replaced by new activity record CUpti_ActivityKernel3. New record gives information about Global Partitioned Cache Configuration requested and executed. Partitioned global caching has an impact on occupancy calculation. If it is ON, then a CTA can only use a half SM, and thus a half of the registers available per SM. The new fields apply for devices with compute capability 5.2 and higher. Note that this change was done in CUDA 6.5 release with support for compute capabilty 5.2.

Table of Contents