7. Changelog

CUPTI changes in CUDA 11.1

CUPTI contains below change as part of the CUDA Toolkit 11.1 release.
  • CUPTI adds tracing and profiling support for the NVIDIA Ampere GPUs with compute capability 8.6.
  • Added a new field graphId in the activity records for kernel, memcpy, peer-to-peer memcpy and memset to output the unique ID of the CUDA graph that launches the activity through CUDA graph APIs. To accomodate this change, activity records CUpti_ActivityMemcpy3, CUpti_ActivityMemcpyPtoP2 and CUpti_ActivityMemset2 are deprecated and replaced by new activity records CUpti_ActivityMemcpy4, CUpti_ActivityMemcpyPtoP3 and CUpti_ActivityMemset3. And kernel activity record CUpti_ActivityKernel5 replaces the padding field with graphId. Added a new API cuptiGetGraphId to query the unique ID of the CUDA graph.
  • Added a new API cuptiActivityFlushPeriod to set the flush period for the worker thread.
  • Added support for profiling cooperative kernels using Profiling APIs.
  • Added NVLink performance metrics (nvlrx__* and nvltx__*) using the Profiling APIs. These metrics are available on devices with compute capability 7.0, 7.5 and 8.0, and these can be collected at the context level. Refer to the table Metrics Mapping Table for mapping between earlier CUPTI metrics and the Perfworks NVLink metrics for devices with compute capability 7.0.

CUPTI changes in CUDA 11.0

CUPTI contains below change as part of the CUDA Toolkit 11.0 release.
  • CUPTI adds tracing and profiling support for devices with compute capability 8.0 i.e. NVIDIA A100 GPUs and systems that are based on A100.
  • Enhancements for CUDA Graph:
    • Support to correlate the CUDA Graph node with the GPU activities: kernel, memcpy, memset.
      • Added a new field graphNodeId for Node Id in the activity records for kernel, memcpy, memset and P2P transfers. Activity records CUpti_ActivityKernel4, CUpti_ActivityMemcpy2, CUpti_ActivityMemset and CUpti_ActivityMemcpyPtoP are deprecated and replaced by new activity records CUpti_ActivityKernel5, CUpti_ActivityMemcpy3, CUpti_ActivityMemset2 and CUpti_ActivityMemcpyPtoP2.
      • graphNodeId is the unique ID for the graph node.
      • graphNodeId can be queried using the new CUPTI API cuptiGetGraphNodeId().
      • Callback CUPTI_CBID_RESOURCE_GRAPHNODE_CREATED is issued between a pair of the API enter and exit callbacks.
    • Introduced new callback CUPTI_CBID_RESOURCE_GRAPHNODE_CLONED to indicate the cloning of the CUDA Graph node.
    • Retain CUDA driver performance optimization in case memset node is sandwiched between kernel nodes. CUPTI no longer disables the conversion of memset nodes into kernel nodes for CUDA graphs.
    • Added support for cooperative kernels in CUDA graphs.
  • Fixed issues in the API cuptiFinalize() including the issue which may cause the application to crash. This API provides ability for safe and full detach of CUPTI during the execution of the application. More details in the section Dynamic Detach.
  • Added support to trace Optix applications. Refer the Optix Profiling section.
  • PC sampling overhead is reduced by avoiding the reconfiguration of the GPU when PC sampling period doesn't change between successive kernels. This is applicable for devices with compute capability 7.0 and higher.
  • CUPTI overhead is associated with the thread rather than process. Object kind of the overhead record CUpti_ActivityOverhead is switched to CUPTI_ACTIVITY_OBJECT_THREAD.
  • Added error code CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED to indicate the presense of another CUPTI subscriber. API cuptiSubscribe() returns the new error code than CUPTI_ERROR_MAX_LIMIT_REACHED.
  • Added a new enum CUpti_FuncShmemLimitConfig to indicate whether user has opted in for maximun dynamic shared memory size on devices with compute capability 7.x by using function attributes CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES or cudaFuncAttributeMaxDynamicSharedMemorySize with CUDA driver and runtime respectively. Field shmemLimitConfig in the kernel activity record CUpti_ActivityKernel5 shows the user choice. This helps in correct occupancy calulation. Value FUNC_SHMEM_LIMIT_OPTIN in the enum cudaOccFuncShmemConfig is the corresponding option in the CUDA occupancy calculator.

CUPTI changes in CUDA 10.2

CUPTI contains below changes as part of the CUDA Toolkit 10.2 release.
  • CUPTI allows tracing features for non-root and non-admin users on desktop platforms. Note that events and metrics profiling is still restricted for non-root and non-admin users. More details about the issue and the solutions can be found on this web page.
  • Added support for tracing features on the virtual GPUs (vGPU).
  • CUPTI no longer turns off the performance characteristics of CUDA Graph when tracing the application.
  • CUPTI now shows memset nodes in the CUDA graph.
  • Fixed the incorrect timing issue for the asynchronous cuMemset/cudaMemset activity.
  • Several performance improvements are done in the tracing path.

CUPTI changes in CUDA 10.1 Update 2

CUPTI contains below changes as part of the CUDA Toolkit 10.1 Update 2 release.
  • This release is focused on bug fixes and stability of the CUPTI.
  • A security vulnerability issue required profiling tools to disable all the features for non-root or non-admin users. As a result, CUPTI cannot profile the application when using a Windows 419.17 or Linux 418.43 or later driver. More details about the issue and the solutions can be found on this web page.

CUPTI changes in CUDA 10.1 Update 1

CUPTI contains below change as part of the CUDA Toolkit 10.1 Update 1 release.
  • Support for the IBM POWER platform is added for the
    • Profiling APIs in the header cupti_profiler_target.h
    • Perfworks metric APIs in the headers nvperf_host.h and nvperf_target.h

CUPTI changes in CUDA 10.1

CUPTI contains below changes as part of the CUDA Toolkit 10.1 release.
  • This release is focused on bug fixes and performance improvements.
  • The new set of profiling APIs and Perfworks metric APIs which were introduced in the CUDA Toolkit 10.0 are now integrated into the CUPTI library distributed in the CUDA Toolkit. Refer to the sections CUPTI Profiling API and Perfworks Metric APIs for documentation of the new APIs.
  • Event collection mode CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS is now supported on all device classes including Geforce and Quadro.
  • Support for the NVTX string registration API nvtxDomainRegisterStringA().
  • Added enum CUpti_PcieGen to list PCIE generations.

CUPTI changes in CUDA 10.0

CUPTI contains below changes as part of the CUDA Toolkit 10.0 release.
  • Added tracing support for devices with compute capability 7.5.
  • A new set of metric APIs are added for devices with compute capability 7.0 and higher. These provide low and deterministic profiling overhead on the target system. These APIs are currently supported only on Linux x86 64-bit and Windows 64-bit platforms. Refer to the CUPTI web page for documentation and details to download the package with support for these new APIs. Note that both the old and new metric APIs are supported for compute capability 7.0. This is to enable transition of code to the new metric APIs. But one cannot mix the usage of the old and new metric APIs.
  • CUPTI supports profiling of OpenMP applications. OpenMP profiling information is provided in the form of new activity records CUpti_ActivityOpenMp. New API cuptiOpenMpInitialize is used to initialize profiling for supported OpenMP runtimes.
  • Activity record for kernel CUpti_ActivityKernel4 provides shared memory size set by the CUDA driver.
  • Tracing support for CUDA kernels, memcpy and memset nodes launched by a CUDA Graph.
  • Added support for resource callbacks for resources associated with the CUDA Graph. Refer enum CUpti_CallbackIdResource for new callback IDs.

CUPTI changes in CUDA 9.2

CUPTI contains below changes as part of the CUDA Toolkit 9.2 release.
  • Added support to query PCI devices information which can be used to construct the PCIE topology. See activity kind CUPTI_ACTIVITY_KIND_PCIE and related activity record CUpti_ActivityPcie.
  • To view and analyze bandwidth of memory transfers over PCIe topologies, new set of metrics to collect total data bytes transmitted and recieved through PCIe are added. Those give accumulated count for all devices in the system. These metrics are collected at the device level for the entire application. And those are made available for devices with compute capability 5.2 and higher.
  • CUPTI added support for new metrics:
    • Instruction executed for different types of load and store
    • Total number of cached global/local load requests from SM to texture cache
    • Global atomic/non-atomic/reduction bytes written to L2 cache from texture cache
    • Surface atomic/non-atomic/reduction bytes written to L2 cache from texture cache
    • Hit rate at L2 cache for all requests from texture cache
    • Device memory (DRAM) read and write bytes
    • The utilization level of the multiprocessor function units that execute tensor core instructions for devices with compute capability 7.0
  • A new attribute CUPTI_EVENT_ATTR_PROFILING_SCOPE is added under enum CUpti_EventAttribute to query the profiling scope of a event. Profiling scope indicates if the event can be collected at the context level or device level or both. See Enum CUpti_EventProfilingScope for avaiable profiling scopes.
  • A new error code CUPTI_ERROR_VIRTUALIZED_DEVICE_NOT_SUPPORTED is added to indicate that tracing and profiling on virtualized GPU is not supported.

CUPTI changes in CUDA 9.1

List of changes done as part of the CUDA Toolkit 9.1 release.
  • Added a field for correlation ID in the activity record CUpti_ActivityStream.

CUPTI changes in CUDA 9.0

List of changes done as part of the CUDA Toolkit 9.0 release.
  • CUPTI extends tracing and profiling support for devices with compute capability 7.0.
  • Usage of compute device memory can be tracked through CUPTI. A new activity record CUpti_ActivityMemory and activity kind CUPTI_ACTIVITY_KIND_MEMORY are added to track the allocation and freeing of memory. This activity record includes fields like virtual base address, size, PC (program counter), timestamps for memory allocation and free calls.
  • Unified memory profiling adds new events for thrashing, throttling, remote map and device-to-device migration on 64 bit Linux platforms. New events are added under enum CUpti_ActivityUnifiedMemoryCounterKind. Enum CUpti_ActivityUnifiedMemoryRemoteMapCause lists possible causes for remote map events.
  • PC sampling supports wide range of sampling periods ranging from 2^5 cycles to 2^31 cycles per sample. This can be controlled through new field samplingPeriod2 in the PC sampling configuration struct CUpti_ActivityPCSamplingConfig.
  • Added API cuptiDeviceSupported() to check support for a compute device.
  • Activity record CUpti_ActivityKernel3 for kernel execution has been deprecated and replaced by new activity record CUpti_ActivityKernel4. New record gives information about queued and submit timestamps which can help to determine software and hardware latencies associated with the kernel launch. These timestamps are not collected by default. Use API cuptiActivityEnableLatencyTimestamps() to enable collection. New field launchType of type CUpti_ActivityLaunchType can be used to determine if it is a cooperative CUDA kernel launch.
  • Activity record CUpti_ActivityPCSampling2 for PC sampling has been deprecated and replaced by new activity record CUpti_ActivityPCSampling3. New record accomodates 64-bit PC Offset supported on devices of compute capability 7.0 and higher.
  • Activity record CUpti_ActivityNvLink for NVLink attributes has been deprecated and replaced by new activity record CUpti_ActivityNvLink2. New record accomodates increased port numbers between two compute devices.
  • Activity record CUpti_ActivityGlobalAccess2 for source level global accesses has been deprecated and replaced by new activity record CUpti_ActivityGlobalAccess3. New record accomodates 64-bit PC Offset supported on devices of compute capability 7.0 and higher.
  • New attributes CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_SIZE and CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_LIMIT are added in the activity attribute enum CUpti_ActivityAttribute to set and get the profiling semaphore pool size and the pool limit.

CUPTI changes in CUDA 8.0

List of changes done as part of the CUDA Toolkit 8.0 release.
  • Sampling of the program counter (PC) is enhanced to point out the true latency issues, it indicates if the stall reasons for warps are actually causing stalls in the issue pipeline. Field latencySamples of new activity record CUpti_ActivityPCSampling2 provides true latency samples. This field is valid for devices with compute capability 6.0 and higher. See section PC Sampling for more details.
  • Support for NVLink topology information such as the pair of devices connected via NVLink, peak bandwidth, memory access permissions etc is provided through new activity record CUpti_ActivityNvLink. NVLink performance metrics for data transmitted/received, transmit/receive throughput and respective header overhead for each physical link. See section NVLink for more details.
  • CUPTI supports profiling of OpenACC applications. OpenACC profiling information is provided in the form of new activity records CUpti_ActivityOpenAccData, CUpti_ActivityOpenAccLaunch and CUpti_ActivityOpenAccOther. This aids in correlating OpenACC constructs on the CPU with the corresponding activity taking place on the GPU, and mapping it back to the source code. New API cuptiOpenACCInitialize is used to initialize profiling for supported OpenACC runtimes. See section OpenACC for more details.
  • Unified memory profiling provides GPU page fault events on devices with compute capability 6.0 and 64 bit Linux platforms. Enum CUpti_ActivityUnifiedMemoryAccessType lists memory access types for GPU page fault events and enum CUpti_ActivityUnifiedMemoryMigrationCause lists migration causes for data transfer events.
  • Unified Memory profiling support is extended to Mac platform.
  • Support for 16-bit floating point (FP16) data format profiling. New metrics inst_fp_16, flop_count_hp_add, flop_count_hp_mul, flop_count_hp_fma, flop_count_hp, flop_hp_efficiency, half_precision_fu_utilization are supported. Peak FP16 flops per cycle for device can be queried using the enum CUPTI_DEVICE_ATTR_FLOP_HP_PER_CYCLE added to CUpti_DeviceAttribute.
  • Added new activity kinds CUPTI_ACTIVITY_KIND_SYNCHRONIZATION, CUPTI_ACTIVITY_KIND_STREAM and CUPTI_ACTIVITY_KIND_CUDA_EVENT, to support the tracing of CUDA synchronization constructs such as context, stream and CUDA event synchronization. Synchronization details are provided in the form of new activity record CUpti_ActivitySynchronization. Enum CUpti_ActivitySynchronizationType lists different types of CUDA synchronization constructs.
  • APIs cuptiSetThreadIdType()/cuptiGetThreadIdType() to set/get the mechanism used to fetch the thread-id used in CUPTI records. Enum CUpti_ActivityThreadIdType lists all supported mechanisms.
  • Added API cuptiComputeCapabilitySupported() to check the support for a specific compute capability by the CUPTI.
  • Added support to establish correlation between an external API (such as OpenACC, OpenMP) and CUPTI API activity records. APIs cuptiActivityPushExternalCorrelationId() and cuptiActivityPopExternalCorrelationId() should be used to push and pop external correlation ids for the calling thread. Generated records of type CUpti_ActivityExternalCorrelation contain both external and CUPTI assigned correlation ids.
  • Added containers to store the information of events and metrics in the form of activity records CUpti_ActivityInstantaneousEvent, CUpti_ActivityInstantaneousEventInstance, CUpti_ActivityInstantaneousMetric and CUpti_ActivityInstantaneousMetricInstance. These activity records are not produced by the CUPTI, these are included for completeness and ease-of-use. Profilers built on top of CUPTI that sample events may choose to use these records to store the collected event data.
  • Support for domains and annotation of synchronization objects added in NVTX v2. New activity record CUpti_ActivityMarker2 and enums to indicate various stages of synchronization object i.e. CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE, CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE_SUCCESS, CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE_FAILED and CUPTI_ACTIVITY_FLAG_MARKER_SYNC_RELEASE are added.
  • Unused field runtimeCorrelationId of the activity record CUpti_ActivityMemset is broken into two fields flags and memoryKind to indicate the asynchronous behaviour and the kind of the memory used for the memset operation. It is supported by the new flag CUPTI_ACTIVITY_FLAG_MEMSET_ASYNC added in the enum CUpti_ActivityFlag.
  • Added flag CUPTI_ACTIVITY_MEMORY_KIND_MANAGED in the enum CUpti_ActivityMemoryKind to indicate managed memory.
  • API cuptiGetStreamId has been deprecated. A new API cuptiGetStreamIdEx is introduced to provide the stream id based on the legacy or per-thread default stream flag.

CUPTI changes in CUDA 7.5

List of changes done as part of the CUDA Toolkit 7.5 release.
  • Device-wide sampling of the program counter (PC) is enabled by default. This was a preview feature in the CUDA Toolkit 7.0 release and it was not enabled by default.
  • Ability to collect all events and metrics accurately in presence of multiple contexts on the GPU is extended for devices with compute capability 5.x.
  • API cuptiGetLastError is introduced to return the last error that has been produced by any of the CUPTI API calls or the callbacks in the same host thread.
  • Unified memory profiling is supported with MPS (Multi-Process Service)
  • Callback is provided to collect replay information after every kernel run during kernel replay. See API cuptiKernelReplaySubscribeUpdate and callback type CUpti_KernelReplayUpdateFunc.
  • Added new attributes in enum CUpti_DeviceAttribute to query maximum shared memory size for different cache preferences for a device function.

CUPTI changes in CUDA 7.0

List of changes done as part of the CUDA Toolkit 7.0 release.
  • CUPTI supports device-wide sampling of the program counter (PC). Program counters along with the stall reasons from all active warps are sampled at a fixed frequency in the round robin order. Activity record CUpti_ActivityPCSampling enabled using activity kind CUPTI_ACTIVITY_KIND_PC_SAMPLING outputs stall reason along with PC and other related information. Enum CUpti_ActivityPCSamplingStallReason lists all the stall reasons. Sampling period is configurable and can be tuned using API cuptiActivityConfigurePCSampling. This feature is available on devices with compute capability 5.2.
  • Added new activity record CUpti_ActivityInstructionCorrelation which can be used to dump source locator records for all the PCs of the function.
  • All events and metrics for devices with compute capability 3.x and 5.0 can be collected accurately in presence of multiple contexts on the GPU. In previous releases only some events and metrics could be collected accurately when multiple contexts were executing on the GPU.
  • Unified memory profiling is enhanced by providing fine grain data transfers to and from the GPU, coupled with more accurate timestamps with each transfer. This information is provided through new activity record CUpti_ActivityUnifiedMemoryCounter2, deprecating old record CUpti_ActivityUnifiedMemoryCounter.
  • MPS tracing and profiling support is extended on multi-gpu setups.
  • Activity record CUpti_ActivityDevice for device information has been deprecated and replaced by new activity record CUpti_ActivityDevice2. New record adds device UUID which can be used to uniquely identify the device across profiler runs.
  • Activity record CUpti_ActivityKernel2 for kernel execution has been deprecated and replaced by new activity record CUpti_ActivityKernel3. New record gives information about Global Partitioned Cache Configuration requested and executed. Partitioned global caching has an impact on occupancy calculation. If it is ON, then a CTA can only use a half SM, and thus a half of the registers available per SM. The new fields apply for devices with compute capability 5.2 and higher. Note that this change was done in CUDA 6.5 release with support for compute capabilty 5.2.

CUPTI changes in CUDA 6.5

List of changes done as part of the CUDA Toolkit 6.5 release.
  • Instruction classification is done for source-correlated Instruction Execution activity CUpti_ActivityInstructionExecution. See CUpti_ActivityInstructionClass for instruction classes.
  • Two new device attributes are added to the activity CUpti_DeviceAttribute:
    • CUPTI_DEVICE_ATTR_FLOP_SP_PER_CYCLE gives peak single precision flop per cycle for the GPU.
    • CUPTI_DEVICE_ATTR_FLOP_DP_PER_CYCLE gives peak double precision flop per cycle for the GPU.
  • Two new metric properties are added:
    • CUPTI_METRIC_PROPERTY_FLOP_SP_PER_CYCLE gives peak single precision flop per cycle for the GPU.
    • CUPTI_METRIC_PROPERTY_FLOP_DP_PER_CYCLE gives peak double precision flop per cycle for the GPU.
  • Activity record CUpti_ActivityGlobalAccess for source level global access information has been deprecated and replaced by new activity record CUpti_ActivityGlobalAccess2. New record additionally gives information needed to map SASS assembly instructions to CUDA C source code. And it also provides ideal L2 transactions count based on the access pattern.
  • Activity record CUpti_ActivityBranch for source level branch information has been deprecated and replaced by new activity record CUpti_ActivityBranch2. New record additionally gives information needed to map SASS assembly instructions to CUDA C source code.
  • Sample sass_source_map is added to demonstrate the mapping of SASS assembly instructions to CUDA C source code.
  • Default event collection mode is changed to Kernel (CUPTI_EVENT_COLLECTION_MODE_KERNEL) from Continuous (CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS). Also Continuous mode is supported only on Tesla devices.
  • Profiling results might be inconsistent when auto boost is enabled. Profiler tries to disable auto boost by default, it might fail to do so in some conditions, but profiling will continue. A new API cuptiGetAutoBoostState is added to query the auto boost state of the device. This API returns error CUPTI_ERROR_NOT_SUPPORTED on devices that don't support auto boost. Note that auto boost is supported only on certain Tesla devices from the Kepler+ family.
  • Activity record CUpti_ActivityKernel2 for kernel execution has been deprecated and replaced by new activity record CUpti_ActivityKernel3. New record additionally gives information about Global Partitioned Cache Configuration requested and executed. The new fields apply for devices with 5.2 Compute Capability.

CUPTI changes in CUDA 6.0

List of changes done as part of the CUDA Toolkit 6.0 release.
  • Two new CUPTI activity kinds have been introduced to enable two new types of source-correlated data collection. The Instruction Execution kind collects SASS-level instruction execution counts, divergence data, and predication data. The Shared Access kind collects source correlated data indication inefficient shared memory accesses.
  • CUPTI provides support for CUDA applications using Unified Memory. A new activity record reports Unified Memory activity such as transfers to and from a GPU and the number of Unified Memory related page faults.
  • CUPTI recognized and reports the special MPS context that is used by CUDA applications running on a system with MPS enabled.
  • The CUpti_ActivityContext activity record CUpti_ActivityContext has been updated to introduce a new field into the structure in a backwards compatible manner. The 32-bit computeApiKind field was replaced with two 16 bit fields, computeApiKind and defaultStreamId. Because all valid computeApiKind values fit within 16 bits, and because all supported CUDA platforms are little-endian, persisted context record data read with the new structure will have the correct value for computeApiKind and have a value of zero for defaultStreamId. The CUPTI client is responsible for versioning the persisted context data to recognize when the defaultStreamId field is valid.
  • To ensure that metric values are calculated as accurately as possible, a new metric API is introduced. Function cuptiMetricGetRequiredEventGroupSets can be used to get the groups of events that should be collected at the same time.
  • Execution overheads introduced by CUPTI have been dramatically decreased.
  • The new activity buffer API introduced in CUDA Toolkit 5.5 is required. The legacy cuptiActivityEnqueueBuffer and cuptiActivityDequeueBuffer functions have been removed.

CUPTI changes in CUDA 5.5

List of changes done as part of CUDA Toolkit 5.5 release.
  • Applications that use CUDA Dynamic Parallelism can be profiled using CUPTI. Device-side kernel launches are reported using a new activity kind.
  • Device attributes such as power usage, clocks, thermals, etc. are reported via a new activity kind.
  • A new activity buffer API uses callbacks to request and return buffers of activity records. The existing cuptiActivityEnqueueBuffer and cuptiActivityDequeueBuffer functions are still supported but are deprecated and will be removed in a future release.
  • The Event API supports kernel replay so that any number of events can be collected during a single run of the application.
  • A new metric API cuptiMetricGetValue2 allows metric values to be calculated for any device, even if that device is not available on the system.
  • CUDA peer-to-peer memory copies are reported explicitly via the activity API. In previous releases these memory copies were only partially reported.