CUPTI

What's New

CUPTI contains below changes as part of the CUDA Toolkit 8.0 release.
  • Sampling of the program counter (PC) is enhanced to point out the true latency issues, it indicates if the stall reasons for warps are actually causing stalls in the issue pipeline. Field latencySamples of new activity record CUpti_ActivityPCSampling2 provides true latency samples. This field is valid for devices with compute capability 6.0 and higher. See section PC Sampling for more details.
  • Support for NVLink topology information such as the pair of devices connected via NVLink, peak bandwidth, memory access permissions etc is provided through new activity record CUpti_ActivityNvLink. NVLink performance metrics for data transmitted/received, transmit/receive throughput and respective header overhead for each physical link. See section NVLink for more details.
  • CUPTI now supports profiling of OpenACC applications. OpenACC profiling information is provided in the form of new activity records CUpti_ActivityOpenAccData, CUpti_ActivityOpenAccLaunch and CUpti_ActivityOpenAccOther. This aids in correlating OpenACC constructs on the CPU with the corresponding activity taking place on the GPU, and mapping it back to the source code. New API cuptiOpenACCInitialize is used to initialize profiling for supported OpenACC runtimes. See section OpenACC for more details.
  • Unified memory profiling now provides GPU page fault events on devices with compute capability 6.0 and 64 bit Linux platforms. Enum CUpti_ActivityUnifiedMemoryAccessType lists memory access types for GPU page fault events and enum CUpti_ActivityUnifiedMemoryMigrationCause lists migration causes for data transfer events.
  • Unified Memory profiling support is extended to Mac platform.
  • Support for 16-bit floating point (FP16) data format profiling. New metrics inst_fp_16, flop_count_hp_add, flop_count_hp_mul, flop_count_hp_fma, flop_count_hp, flop_hp_efficiency, half_precision_fu_utilization are supported. Peak FP16 flops per cycle for device can be queried using the enum CUPTI_DEVICE_ATTR_FLOP_HP_PER_CYCLE added to CUpti_DeviceAttribute.
  • Added new activity kinds CUPTI_ACTIVITY_KIND_SYNCHRONIZATION, CUPTI_ACTIVITY_KIND_STREAM and CUPTI_ACTIVITY_KIND_CUDA_EVENT, to support the tracing of CUDA synchronization constructs such as context, stream and CUDA event synchronization. Synchronization details are provided in the form of new activity record CUpti_ActivitySynchronization. Enum CUpti_ActivitySynchronizationType lists different types of CUDA synchronization constructs.
  • APIs cuptiSetThreadIdType()/cuptiGetThreadIdType() to set/get the mechanism used to fetch the thread-id used in CUPTI records. Enum CUpti_ActivityThreadIdType lists all supported mechanisms.
  • Added API cuptiComputeCapabilitySupported() to check the support for a specific compute capability by the CUPTI.
  • Added support to establish correlation between an external API (such as OpenACC, OpenMP) and CUPTI API activity records. APIs cuptiActivityPushExternalCorrelationId() and cuptiActivityPopExternalCorrelationId() should be used to push and pop external correlation ids for the calling thread. Generated records of type CUpti_ActivityExternalCorrelation contain both external and CUPTI assigned correlation ids.
  • Added containers to store the information of events and metrics in the form of activity records CUpti_ActivityInstantaneousEvent, CUpti_ActivityInstantaneousEventInstance, CUpti_ActivityInstantaneousMetric and CUpti_ActivityInstantaneousMetricInstance. These activity records are not produced by the CUPTI, these are included for completeness and ease-of-use. Profilers built on top of CUPTI that sample events may choose to use these records to store the collected event data.
  • Support for domains and annotation of synchronization objects added in NVTX v2. New activity record CUpti_ActivityMarker2 and enums to indicate various stages of synchronization object i.e. CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE, CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE_SUCCESS, CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE_FAILED and CUPTI_ACTIVITY_FLAG_MARKER_SYNC_RELEASE are added.
  • Unused field runtimeCorrelationId of the activity record CUpti_ActivityMemset is broken into two fields flags and memoryKind to indicate the asynchronous behaviour and the kind of the memory used for the memset operation. It is supported by the new flag CUPTI_ACTIVITY_FLAG_MEMSET_ASYNC added in the enum CUpti_ActivityFlag.
  • Added flag CUPTI_ACTIVITY_MEMORY_KIND_MANAGED in the enum CUpti_ActivityMemoryKind to indicate managed memory.
  • API cuptiGetStreamId has been deprecated. A new API cuptiGetStreamIdEx is introduced to provide the stream id based on the legacy or per-thread default stream flag.

Table of Contents