1. Release Notes
CUPTI Release Notes.
Release notes, including new features and important bug fixes. Supported platforms and GPUs.
1.1. Release Notes
1.1.1. Updates in CUDA 12.6 Update 1
Resolved Issues
Disabling all callbacks in a domain using the
cuptiEnableDomain
API could unintentionally disable other activities and callbacks. This issue has been addressed and resolved.Fixed issues with Enhanced compatibility (aka minor version compatibility) of a few CUPTI activities with the driver versions shipped with the CUDA 12.2 or prior releases.
1.1.2. Updates in CUDA 12.6
New Features
CUPTI introduces PM Sampling APIs for collecting many hardware metrics by sampling the GPU’s performance monitors (PM) periodically at fixed intervals for a CUDA workload. These APIs are provided in the header file
cupti_pmsampling.h
. These are supported on Turing and later GPU architectures, i.e. devices with compute capability 7.5 and higher. Refer to the section PM Sampling API for more details.The Profiling is split into four main phases - Enumeration, Configuration, Collection and Evaluation. APIs for three phases enumeration, configuration and evaluation are provided directly by the low level Perfworks APIs. CUPTI is adding a new set of APIs around these Perfworks host APIs to simplify usage, shield users from low-level concepts, and ease adaptation to changes in Perfworks APIs. These APIs promote consistency, reduce confusion, and enhance integration with CUPTI APIs. New host APIs are provided in the header
cupti_profiler_host.h
. Additionally, new target APIs will be introduced in a future release to simplify profiling for new users and align the call structure with other profiling APIs for faster learning and better adaptability.Added support for identifying the source of the memory allocation. This is useful when memory allocation request is made from a library or shared object which is called from the application. A new field
source
is added in the memory activity record. The activity recordCUpti_ActivityMemory3
is deprecated and it is replaced by a new activity recordCUpti_ActivityMemory4
. Refer to the section Device Memory Allocation Source Tracking for more details.Added a new overhead kind
CUPTI_ACTIVITY_OVERHEAD_ACTIVITY_BUFFER_REQUEST
to report the overhead incurred while requesting activity buffers from the CUPTI client.Added static library
libcupti_static.a
for ARM Server (arm64 SBSA).
Deprecated and dropped features
Source/SASS level metrics from the header
cupti_activity.h
are deprecated on Volta and later GPU architectures and these will be removed in a future release. It includes metricsCUPTI_ACTIVITY_KIND_GLOBAL_ACCESS
,CUPTI_ACTIVITY_KIND_SHARED_ACCESS
,CUPTI_ACTIVITY_KIND_BRANCH
andCUPTI_ACTIVITY_KIND_INSTRUCTION_EXECUTION
and related correlation kindsCUPTI_ACTIVITY_KIND_SOURCE_LOCATOR
andCUPTI_ACTIVITY_KIND_INSTRUCTION_CORRELATION
. It is recommended to move to the SASS Metric API from the headercupti_sass_metrics.h
which is supported on Volta and later GPU architectures.
Resolved Issues
Exported the profiling API
cuptiProfilerIsPassCollected
.Reduced the collection overhead for memcpy tracing.
For some activities, duration for the activity buffer request was getting included in the activity duration. This issue is resolved.
1.1.3. Updates in CUDA 12.5 Update 1
Resolved Issues
Fixed a crash for the graph level tracing for CUDA graphs.
Fixed an issue due to which resource callback
CUPTI_CBID_RESOURCE_MODULE_PROFILED
might not be issued when no other activity is enabled. This issue was introduced in the CUDA 12.4 Update 1 release.
1.1.4. Updates in CUDA 12.5
New Features
Added APIs
cuptiActivityEnableDriverApi
andcuptiActivityEnableRuntimeApi
to limit the tracing of CUDA APIs that are of interest. This can help in reducing the CUDA API tracing overhead.Added new field
cigMode
to communicate the CUDA in Graphics (CIG) mode associated with the CUDA context. The activity record CUpti_ActivityContext2 is deprecated and it is replaced by a new activity record CUpti_ActivityContext3. EnumCUpti_ContextCigMode
describes the supported CIG modes.Added new field
numMultiprocessors
in the context activity record to communicate the number of multiprocessors assigned to the green context.Tracing is supported for MPS (Multi-Process Service) on Tegra platforms.
Obfuscated symbols are provided for Linux x86_64 platform, it helps us in speeding up the debug process for issues like crash, hang etc. See https://developer.nvidia.com/blog/cuda-toolkit-symbol-server/ for information on how to use obfuscated symbols. Symbol server address is: https://cudatoolkit-symbols.nvidia.com/.
Deprecated and dropped features
PC Sampling Activity API from the header
cupti_activity.h
is deprecated for Volta and later GPU architectures and this will be removed in a future release. It is recommended to move to the PC Sampling API from the headercupti_pcsampling.h
which is supported on Volta and later GPU architectures.Removed support for the PowerPC (ppc64le) architecture.
Resolved Issues
Fixed Unified memory profiling on Heterogeneous Memory Management (HMM) and Address Translation Service (ATS) systems.
1.1.5. Updates in CUDA 12.4 Update 1
Resolved Issues
Fixed a crash when API
cuptiFinalize
is used for applications using CUDA Graph.Fixed a crash which can occur while tracing memcpy and memset nodes in a graph when using graph level tracing.
Skip delivering worker thread buffers on internal flush if the worker thread buffer is not full.
1.1.6. Updates in CUDA 12.4
New Features
Added tracing support for applications using Green contexts. Added two new fields
isGreenContext
andparentContextId
in the context activity record. The activity recordCUpti_ActivityContext
is deprecated and it is replaced by a new activity recordCUpti_ActivityContext2
.CUDA API calls are completed asynchronously from the perspective of the host CPU. This is accomplished by queuing the work slated for the GPU into a structure known as a command buffer. If there is insufficient space available in the command buffer when attempting to call a CUDA API, the host call will block until space becomes available. The user should be able to identify when this situation occurs. This is indicated using the new attribute
CUPTI_ACTIVITY_OVERHEAD_COMMAND_BUFFER_FULL
added in the activity overhead enumCUpti_ActivityOverheadKind
. To provide additional details about the overhead, a new fieldoverheadData
is added in the overhead activity record. Activity recordCUpti_ActivityOverhead2
is deprecated and it is replaced by the new activity recordCUpti_ActivityOverhead3
.Added process ID and thread ID in the JIT activity record. To accommodate this change, activity record
CUpti_ActivityJit
is deprecated and it is replaced by a new activity recordCUpti_ActivityJit2
.To correlate the sampling data for a kernel with the launch API in the serial mode of the PC Sampling APIs, a new field
correlationId
is added in the structCUpti_PCSamplingPCData
.For PC Sampling APIs, total (smsp__pcsamp_sample_count) and dropped (smsp__pcsamp_samples_data_dropped) sample counts are collected by default.
Resolved Issues
Fixed the issue for overhead records showing the default thread ID than the one requested using the API
cuptiSetThreadIdType()
.Fixed instruction level SASS metrics profiling for CUDA Graph applications.
When a device graph is first launched from the device and it is not launched from the host earlier, end timestamp could be 0 for graph-level tracing on Ampere and later GPU architectures. This issue is fixed.
1.1.7. Updates in CUDA 12.3 Update 1
Resolved Issues
To provide normalized timestamps for all activities, CUPTI uses linear interpolation for conversion from GPU timestamps to CPU timestamps. This was broken with CUDA 12.3 causing spurious gaps or overlap on Tegra platforms. Fixed the issue.
1.1.8. Updates in CUDA 12.3
New Features
New attributes
CUPTI_ACTIVITY_OVERHEAD_RUNTIME_TRIGGERED_MODULE_LOADING
andCUPTI_ACTIVITY_OVERHEAD_LAZY_FUNCTION_LOADING
are added in the activity overhead enumCUpti_ActivityOverheadKind
to provide the overhead information for CUDA runtime triggered module loading and lazy function loading respectively.New API
cuptiGetGraphExecId
provides the unique ID of the executable graph.Added support for collecting graph level trace for device launched graphs. A new API
cuptiActivityEnableDeviceGraph
is added to enable the collection of records for device launched graphs.CUDA Graphs can be executed on multiple devices i.e. the root node could be launched on one device and the leaf node could be launched on the another device. New fields
endDeviceId
andendContextId
are added to identify the ids of device and context respectively which are used to execute the last node of the graph. To accommodate this change, activity recordCUpti_ActivityGraphTrace
is deprecated and it is replaced by a new activity recordCUpti_ActivityGraphTrace2
.Added WSL profiling support on Windows 10 WSL with OS build version 19044 and greater. WSL profiling is not supported on Windows 10 WSL for systems that exceed 1 TB of system memory.
Several performance improvements are done in the tracing path. One of the key improvements is to allow clients to request CUPTI to maintain the activity buffers at the thread level instead of global buffers. This can be achieved by setting the option
CUPTI_ACTIVITY_ATTR_PER_THREAD_ACTIVITY_BUFFER
of the enumCUpti_ActivityAttribute
. This can help in reducing the collection overhead for applications which launch CUDA activities from multiple host threads.Frame pointers are enabled for Linux x86_64 libraries.
The deprecated Activity APIs and structures have been moved to a new header cupti_activity_deprecated.h, which is included in the header cupti_activity.h. Header cupti_activity.h contains only the latest version of APIs and structures.
CUPTI no longer uses profiling semaphore pool to store the profiling data. Corresponding attributes
CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_SIZE
,CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_LIMIT
andCUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_PRE_ALLOCATE_VALUE
have been deprecated.
Resolved Issues
Fixed SASS metric profiling for cuda graph.
Fixed race condition in the API
cuptiSetThreadIdType
for late subscription.
1.1.9. Updates in CUDA 12.2 Update 2
New Features
SASS Metric APIs introduced in the CUDA 12.2 GA release are transitioning from the beta to the production release.
Added support for collecting SASS metrics for CUDA Graphs which are launched from host.
Added a new field
numOfDroppedRecords
in the structCUpti_SassMetricsDisable_Params
to indicate the number of dropped records when SASS data is flushed prior to calling the disable API.
Added a new field
api
in the structCUpti_Profiler_DeviceSupported_Params
which can be used to check the configuration support level for profiler APIs like Profiling, PC Sampling and SASS Metric APIs.
Resolved Issues
Fixed the tracing and profiling support for the GA103 GPU.
Fixed a hang which can occur when activity buffer gets full while collecting the sampling data using the PC Sampling Activity API.
Fixed the issue of incorrect timestamps for graph level trace when a graph node is disabled using the APIs
cuGraphNodeSetEnabled
orcudaGraphNodeSetEnabled
.
1.1.10. Updates in CUDA 12.2 Update 1
Support for Confidential Computing
CUPTI supports some APIs while running in CC-devtools mode:
Callback API
Activity API
The profiling APIs are not supported in CC-devtools mode with this release. Using these APIs should return an error indicating the configuration is not supported:
Profiling API
PC Sampling API
Checkpoint API
SASS Metrics API
Additionally, CUPTI is not supported at all in full CC mode. CC-devtools mode must be used for tools support. Some CUDA APIs are not supported or behave differently when running in CC or CC-devtools mode; notably, host pinned memory requests will be traced as managed memory requests, and any CUDA memcopies on these converted pointers are traced as Device to Device copies irrespective of the locality of the source or destination pointers. For details on how to configure CC or CC-devtools mode, system and software requirements, as well as documentation on CUDA API changes, please see the confidential compute release documentation at https://docs.nvidia.com/confidential-computing/.
Resolved Issues
Fixed timestamps for graph-level tracing for CUDA graphs running across multiple GPUs.
Fixed a potential hang when CUPTI is unable to fetch attributes for an activity.
1.1.11. Updates in CUDA 12.2
New Features
A new set of CUPTI APIs for collection of SASS metric data at the source level are provided in the header file
cupti_sass_metrics.h
. These support a larger set of metrics compared to the CUPTI Activity APIs for source-level analysis. SASS to source correlation can be done in the offline mode, similar to the PC sampling APIs. Hence the runtime overhead during data collection is lower. Refer to the section CUPTI SASS Metrics API for more details. Please note that this is a Beta feature, interface and functionality are subject to change in a future release.CUPTI now reports fatal errors, non-fatal errors and warnings instantaneously through callbacks. A new callback domain
CUPTI_CB_DOMAIN_STATE
is added for subscribing to the instantaneous error reporting. Corresponding callback ids are provided in the structCUpti_CallbackIdState
.Added support for profiling of device graphs and host graphs that launch device graphs. There are some known limitations, please refer to the Known Issues section for details.
Change in the stream attribute value is communicated by issuing the resource callback. Refer to the struct
CUpti_StreamAttrData
and callback idCUPTI_CBID_RESOURCE_STREAM_ATTRIBUTE_CHANGED
added in the enumCUpti_CallbackIdResource
.New API
cuptiGetErrorMessage
provides descriptive message for CUPTI error codes.Removed the deprecated API
cuptiDeviceGetTimestamp
from the headercupti_events.h
.Added metrics for Tensor core operations to count different types of tensor instructions. These metrics are named as
sm[sp]__ops_path_tensor_src_{src}[_dst_{dst}[_sparsity_{on,off}]].
These are available for devices with compute capability 7.0 and higher, except for Turing TU11x GPUs.
Resolved Issues
Fixed crash for the graph-level trace for device graphs which are launched from the host.
1.1.12. Updates in CUDA 12.1 Update 1
Resolved Issues
Fixed CUPTI tracing failure when just-in-time compilation of embedded PTX code is disabled using the environment variable
CUDA_DISABLE_PTX_JIT
.Fixed a crash in the API
cuptiFinalize
.
1.1.13. Updates in CUDA 12.1
New Features
Field
wsl
is added in the structCUpti_Profiler_DeviceSupported_Params
to indicate whether Profiling API is supported on Windows Subsystem for Linux (WSL) system or not.
1.1.14. Updates in CUDA 12.0 Update 1
Resolved Issues
Reduced the host memory overhead by avoiding caching copies of cubin images at the time of loading CUDA modules. Copies of cubin images are now created only when profiling features that need it are enabled.
By default CUPTI switches back to the device memory, instead of the pinned host memory, for allocation of the profiling buffer for concurrent kernel tracing. This might help in improving the performance of the tracing run. Memory location can be controlled using the attribute
CUPTI_ACTIVITY_ATTR_MEM_ALLOCATION_TYPE_HOST_PINNED
of the activity attribute enumCUpti_ActivityAttribute
.CUPTI now captures the
cudaGraphLaunch
API and its kernels when CUPTI is attached after the graph is instantiated using the APIcudaGraphInstantiate
but it is attached before the graph is launched using the APIcudaGraphLaunch
. Some data in the kernel record would be missing i.e.cacheConfig
,sharedMemoryExecuted
,partitionedGlobalCacheRequested
,partitionedGlobalCacheExecuted
,sharedMemoryCarveoutRequested
etc. This fix requires the matching CUDA driver which ships with the CUDA 12.0 Update 1 release.
1.1.15. Updates in CUDA 12.0
New Features
Added new fields
maxPotentialClusterSize
andmaxActiveClusters
to help in calculating the cluster occupancy correctly. These fields are valid for devices with compute capability 9.0 and higher. To accommodate this change, activity recordCUpti_ActivityKernel8
is deprecated and replaced by a new activity recordCUpti_ActivityKernel9
.Enhancements for PC Sampling APIs:
CUPTI creates few worker threads to offload certain operations like decoding of the hardware data to the CUPTI PC sampling data and correlation of the PC data to the SASS instructions. CUPTI wakes up these threads periodically. To control the sleep time of the worker threads, a new attribute
CUPTI_PC_SAMPLING_CONFIGURATION_ATTR_TYPE_WORKER_THREAD_PERIODIC_SLEEP_SPAN
is added in the enumCUpti_PCSamplingConfigurationAttributeType
.Improved error reporting for hardware buffer overflow. When hardware buffer overflows, CUPTI returns the out of memory error code. And a new field
hardwareBufferFull
added in the structCUpti_PCSamplingData
is set to differentiate it from other out of memory cases. User can either increase the hardware buffer size or flush the hardware buffer at a higher frequency to avoid overflow.
Profiling APIs are supported on Windows Subsystem for Linux (WSL) with WSL version 2, NVIDIA display driver version 525 or higher and Windows 11.
CUPTI support for Kepler GPUs is dropped in CUDA Toolkit 12.0.
Resolved Issues
Removed minor CUDA version from the SONAME of the CUPTI shared library for compatibility reasons. For example, SONAME of CUPTI library is libcupti.so.12 instead of libcupti.so.12.0 in CUDA 12.0 release.
Activity kinds
CUPTI_ACTIVITY_KIND_MARKER
andCUPTI_ACTIVITY_KIND_MARKER_DATA
can be enabled together.
1.1.16. Older Versions
1.1.16.1. Updates in CUDA 11.8
New Features
CUPTI adds tracing and profiling support for Hopper and Ada Lovelace GPU families.
Added new fields
clusterX
,clusterY
,clusterZ
andclusterSchedulingPolicy
to output the Thread Block Cluster dimensions and scheduling policy. These fields are valid for devices with compute capability 9.0 and higher. To accommodate this change, activity recordCUpti_ActivityKernel7
is deprecated and replaced by a new activity recordCUpti_ActivityKernel8
.A new activity kind
CUPTI_ACTIVITY_KIND_JIT
and corresponding activity recordCUpti_ActivityJit
are introduced to capture the overhead involved in the JIT (just-in-time) compilation and caching of the PTX or NVVM IR code to the binary code. New record also provides the information about the size and path of the compute cache where the binary code is stored.PC Sampling API is supported on Tegra platforms - QNX, Linux (aarch64) and Linux (x86_64) (Drive SDK).
Resolved Issues
Resolved an issue that might cause crash when the size of the device buffer is changed, using the attribute
CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE
, after creation of the CUDA context.
1.1.16.2. Updates in CUDA 11.7 Update 1
Resolved Issues
Resolved an issue for PC Sampling API
cuptiPCSamplingGetData
which might not always return all the samples when called after the PC sampling range defined by using the APIscuptiPCSamplingStart
andcuptiPCSamplingStop
. Remaining samples were delivered in the successive call of the APIcuptiPCSamplingGetData
after the next range.Disabled tracing of nodes in the CUDA Graph when user enables tracing at the Graph level using the activity kind
CUPTI_ACTIVITY_KIND_GRAPH_TRACE
.Fixed missing
channelID
andchannelType
information for kernel records. Earlier these fields were populated for CUDA Graph launches only.
1.1.16.3. Updates in CUDA 11.7
New Features
A new activity kind
CUPTI_ACTIVITY_KIND_GRAPH_TRACE
and activity recordCUpti_ActivityGraphTrace
are introduced to represent the execution for a graph without giving visibility about the execution of its nodes. This is intended to reduce overheads involved in tracing each node separately. This activity can only be enabled for drivers of version 515 and above.A new API
cuptiActivityEnableAndDump
is added to provide snapshot of certain activities like device, context, stream, NVLink and PCIe at any point during the profiling session.Added sample cupti_correlation to show correlation between CUDA APIs and corresponding GPU activities.
Added sample cupti_trace_injection to show how to build an injection library using the activity and callback APIs which can be used to trace any CUDA application.
Resolved Issues
Fixed corruption in the function name for PC Sampling API records.
Fixed incorrect timestamps for GPU activities when user calls the API
cuptiActivityRegisterTimestampCallback
in the late CUPTI attach scenario.Fixed incomplete records for device to device memcopies in the late CUPTI attach scenario. This issue manifests mainly when application has a mix of CUDA graph and normal kernel launches.
1.1.16.4. Updates in CUDA 11.6 Update 1
Resolved Issues
Fixed hang for the PC Sampling API
cuptiPCSamplingStop
. This issue is seen for the PC sampling start and stop resulting in generation of large number of sampling records.Fixed timing issue for specific device to device memcpy operations.
1.1.16.5. Updates in CUDA 11.6
New Features
Two new fields
channelID
andchannelType
are added in the activity records for kernel, memcpy, peer-to-peer memcpy and memset to output the ID and type of the hardware channel on which these activities happen. Activity recordsCUpti_ActivityKernel6
,CUpti_ActivityMemcpy4
,CUpti_ActivityMemcpyPtoP3
andCUpti_ActivityMemset3
are deprecated and replaced by new activity recordsCUpti_ActivityKernel7
,CUpti_ActivityMemcpy5
,CUpti_ActivityMemcpyPtoP4
andCUpti_ActivityMemset4
.New fields
isMigEnabled
,gpuInstanceId
,computeInstanceId
andmigUuid
are added in the device activity record to provide MIG information for the MIG enabled GPU. Activity recordCUpti_ActivityDevice3
is deprecated and replaced by a new activity recordCUpti_ActivityDevice4
.A new field
utilizedSize
is added in the memory pool and memory activity record to provide the utilized size of the memory pool. Activity recordCUpti_ActivityMemoryPool
andCUpti_ActivityMemory2
are deprecated and replaced by a new activity recordCUpti_ActivityMemoryPool2
andCUpti_ActivityMemory3
respectively.API
cuptiActivityRegisterTimestampCallback
and callback functionCUpti_TimestampCallbackFunc
are added to register a callback function to obtain timestamp of user’s choice instead of using CUPTI provided timestamp in activity records.Profiling API supports profiling OptiX application.
Resolved Issues
Fixed multi-pass metric collection using the Profiling API in the auto range and kernel replay mode for Cuda Graph.
Fixed the performance issue for the PC sampling API
cuptiPCSamplingStop
.Fixed corruption in variable names for OpenACC activity records.
Fixed corruption in the fields of the struct
memoryPoolConfig
in the activity recordCUpti_ActivityMemory3
.Filled the fields of the struct
memoryPoolConfig
in the activity recordCUpti_ActivityMemory3
when a memory pointer allocated via memory pool is released usingcudaFree
CUDA API.
1.1.16.6. Updates in CUDA 11.5 Update 1
Resolved Issues
Resolved an issue that causes incorrect range name for NVTX event attributes. The issue was introduced in CUDA 11.4.
Made NVTX initialization APIs
InitializeInjectionNvtx
andInitializeInjectionNvtx2
thread-safe.
1.1.16.7. Updates in CUDA 11.5
New Features
A new API
cuptiProfilerDeviceSupported
is introduced to expose overall Profiling API support and specific requirements for a given device. Profiling API must be initialized by callingcuptiProfilerInitialize
before testing device support.PC Sampling struct
CUpti_PCSamplingData
introduces a new fieldnonUsrKernelsTotalSamples
to provide information about the number of samples collected for all non-user kernels.Activity record
CUpti_ActivityDevice2
for device information has been deprecated and replaced by a new activity recordCUpti_ActivityDevice3
. New record adds a flagisCudaVisible
to indicate whether device is visible to CUDA.Activity record
CUpti_ActivityNvLink3
for NVLink information has been deprecated and replaced by a new activity recordCUpti_ActivityNvLink4
. New record can accommodate NVLink port information up to a maximum of 32 ports.A new CUPTI Checkpoint API is introduced, enabling automatic saving and restoring of device state, and facilitating development of kernel replay tools. This is helpful for User Replay mode of the CUPTI Profiling API, but is not limited to use with CUPTI.
Tracing is supported on the Windows Subsystem for Linux version 2 (WSL 2).
CUPTI is not supported on NVIDIA Crypto Mining Processors (CMP). A new error code
CUPTI_ERROR_CMP_DEVICE_NOT_SUPPORTED
is introduced to indicate it.
Resolved Issues
Resolved an issue that causes crash for tracing of device to device memcopy operations.
Resolved an issue that causes crash for OpenACC activity when it is enabled before other activities.
1.1.16.8. Updates in CUDA 11.4 Update 1
Resolved Issues
Resolved serialization of CUDA Graph launches for applications which use multiple threads to launch work.
Previously, for applications that use CUDA Dynamic Parallelism (CDP), CUPTI detects the presence of the CDP kernels in the CUDA module. Even if CDP kernels are not called, it fails to trace the application. There is a change in the behavior, CUPTI now traces all the host launched kernels until it encounters a host launched kernel which launches child kernels. Subsequent kernels are not traced.
1.1.16.9. Updates in CUDA 11.4
New Features
Profiling APIs support profiling of the CUDA kernel nodes launched by a CUDA Graph. Auto range profiling with kernel replay mode and user range profiling with user replay and application replay modes are supported. Other combinations of range profiling and replay modes are not supported.
Added support for tracing and profiling on NVIDIA virtual GPUs (vGPUs) on an upcoming GRID/vGPU release.
Added sample profiling_injection to show how to build injection library using the Profiling API.
Added sample concurrent_profiling to show how to retain the kernel concurrency across streams and devices using the Profiling API.
Resolved Issues
Resolved the issue of not tracing the device to device memcopy nodes in a CUDA Graph.
Fixed the issue of reporting zero size for local memory pool for mempool creation record.
Resolved the issue of non-collection of samples for the default CUDA context for PC Sampling API.
Enabled tracking of all domains and registered strings in NVTX irrespective of whether the NVTX activity kind or callbacks are enabled. This state tracking is needed for proper working of the tool which creates these NVTX objects before enabling the NVTX activity kind or callback.
1.1.16.10. Updates in CUDA 11.3
New Features
A new set of CUPTI APIs for PC sampling data collection are provided in the header file
cupti_pcsampling.h
which support continuous mode data collection without serializing kernel execution and have a lower runtime overhead. Along with these a utility library is provided in the header filecupti_pcsampling_util.h
which has APIs for GPU assembly to CUDA-C source correlation and for reading and writing the PC sampling data from/to files. Refer to the section CUPTI PC Sampling API for more details.Enum
CUpti_PcieGen
is extended to include PCIe Gen 5.The following functions are deprecated and will be removed in a future release:
Struct
NVPA_MetricsContext
and related APIsNVPW_MetricsContext_*
from the headernvperf_host.h
. It is recommended to use the structNVPW_MetricsEvaluator
and related APIsNVPW_MetricsEvaluator_*
instead. Profiling API samples have been updated to show how to use these APIs.cuptiDeviceGetTimestamp
from the headercupti_events.h
.
Resolved Issues
Overhead reduction for tracing of CUDA memcopies.
To provide normalized timestamps for all activities, CUPTI uses linear interpolation for conversion from GPU timestamps to CPU timestamps. This method can cause spurious gaps or overlap on the timeline. CUPTI improves the conversion function to provide more precise timestamps.
Generate overhead activity record for semaphore pool allocation.
1.1.16.11. Updates in CUDA 11.2
New Features
A new activity kind
CUPTI_ACTIVITY_KIND_MEMORY_POOL
and activity recordCUpti_ActivityMemoryPool
are introduced to represent the creation, destruction and trimming of a memory pool. EnumCUpti_ActivityMemoryPoolType
lists types of memory pool.A new activity kind
CUPTI_ACTIVITY_KIND_MEMORY2
and activity recordCUpti_ActivityMemory2
are introduced to provide separate records for memory allocation and release operations. This helps in correlation of records of these operations to the corresponding CUDA APIs, which otherwise is not possible using the existing activity recordCUpti_ActivityMemory
which provides a single record for both the memory operations.Added a new pointer field of type
CUaccessPolicyWindow
in the kernel activity record to provide the access policy window which specifies a contiguous region of global memory and a persistence property in the L2 cache for accesses within that region. To accommodate this change, activity recordCUpti_ActivityKernel5
is deprecated and replaced by a new activity recordCUpti_ActivityKernel6
. This attribute is not collected by default. To control the collection of launch attributes, a new APIcuptiActivityEnableLaunchAttributes
is introdcued.New attributes
CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_PRE_ALLOCATE_VALUE
andCUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_PRE_ALLOCATE_VALUE
are added in the activity attribute enumCUpti_ActivityAttribute
to set and get the number of device buffers and profiling semaphore pools which are preallocated for the context.CUPTI now allocates profiling buffer for concurrent kernel tracing in the pinned host memory in place of device memory. This might help in improving the performance of the tracing run. Memory location can be controlled using the attribute
CUPTI_ACTIVITY_ATTR_MEM_ALLOCATION_TYPE_HOST_PINNED
of the activity attribute enumCUpti_ActivityAttribute
.The compiler generated line information for inlined functions is improved due to which CUPTI can associate inlined functions with the line information of the function call site that has been inlined.
Removed support for NVLink performance metrics (
nvlrx__*
andnvltx__*
) from the Profiling API due to a potential application hang during data collection. The metrics will be added back in a future CUDA release.
Resolved Issues
Execution overheads introduced by CUPTI in the tracing path is reduced.
For the concurrent kernel activity kind
CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL
, CUPTI instruments the kernel code to collect the timing information. Previously, every kernel in the CUDA module was instrumented, thus the overhead is proportional to the number of different kernels in the module. This is a static overhead which happens at the time of loading the CUDA module. To reduce this overhead, kernels are not instrumented at the module load time, instead a single instrumentation code is generated at the time of loading the CUDA module and it is applied to each kernel during the kernel execution, thus avoiding most of the static overhead at the CUDA module load time.
1.1.16.12. Updates in CUDA 11.1
New Features
CUPTI adds tracing and profiling support for the NVIDIA Ampere GPUs with compute capability 8.6.
Added a new field
graphId
in the activity records for kernel, memcpy, peer-to-peer memcpy and memset to output the unique ID of the CUDA graph that launches the activity through CUDA graph APIs. To accommodate this change, activity recordsCUpti_ActivityMemcpy3
,CUpti_ActivityMemcpyPtoP2
andCUpti_ActivityMemset2
are deprecated and replaced by new activity recordsCUpti_ActivityMemcpy4
,CUpti_ActivityMemcpyPtoP3
andCUpti_ActivityMemset3
. And kernel activity recordCUpti_ActivityKernel5
replaces the padding field withgraphId
. Added a new APIcuptiGetGraphId
to query the unique ID of the CUDA graph.Added a new API
cuptiActivityFlushPeriod
to set the flush period for the worker thread.Added support for profiling cooperative kernels using Profiling APIs.
Added NVLink performance metrics (nvlrx__* and nvltx__*) using the Profiling APIs. These metrics are available on devices with compute capability 7.0, 7.5 and 8.0, and these can be collected at the context level. Refer to the table Metrics Mapping Table for mapping between earlier CUPTI metrics and the Perfworks NVLink metrics for devices with compute capability 7.0.
Resolved Issues
Resolved an issue that causes CUPTI to not return full and completed activity buffers for a long time, CUPTI now attempts to return buffers early.
To reduce the runtime overhead, CUPTI wakes up the worker thread based on certain heuristics instead of waking it up at a regular interval. New API
cuptiActivityFlushPeriod
can be used to control the flush period of the worker thread. This setting overrides the CUPTI heurtistics.
1.1.16.13. Updates in CUDA 11.0
New Features
CUPTI adds tracing and profiling support for devices with compute capability 8.0 i.e. NVIDIA A100 GPUs and systems that are based on A100.
Enhancements for CUDA Graph:
Support to correlate the CUDA Graph node with the GPU activities: kernel, memcpy, memset.
Added a new field
graphNodeId
for Node Id in the activity records for kernel, memcpy, memset and P2P transfers. Activity recordsCUpti_ActivityKernel4
,CUpti_ActivityMemcpy2
,CUpti_ActivityMemset
andCUpti_ActivityMemcpyPtoP
are deprecated and replaced by new activity recordsCUpti_ActivityKernel5
,CUpti_ActivityMemcpy3
,CUpti_ActivityMemset2
andCUpti_ActivityMemcpyPtoP2
.graphNodeId
is the unique ID for the graph node.graphNodeId
can be queried using the new CUPTI APIcuptiGetGraphNodeId()
.Callback
CUPTI_CBID_RESOURCE_GRAPHNODE_CREATED
is issued between a pair of the API enter and exit callbacks.
Introduced new callback
CUPTI_CBID_RESOURCE_GRAPHNODE_CLONED
to indicate the cloning of the CUDA Graph node.Retain CUDA driver performance optimization in case memset node is sandwiched between kernel nodes. CUPTI no longer disables the conversion of memset nodes into kernel nodes for CUDA graphs.
Added support for cooperative kernels in CUDA graphs.
Added support to trace Optix applications. Refer the Optix Profiling section.
CUPTI overhead is associated with the thread rather than process. Object kind of the overhead record
CUpti_ActivityOverhead
is switched toCUPTI_ACTIVITY_OBJECT_THREAD
.Added error code
CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED
to indicate the presence of another CUPTI subscriber. APIcuptiSubscribe()
returns the new error code thanCUPTI_ERROR_MAX_LIMIT_REACHED
.Added a new enum
CUpti_FuncShmemLimitConfig
to indicate whether user has opted in for maximum dynamic shared memory size on devices with compute capability 7.x by using function attributesCU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES
orcudaFuncAttributeMaxDynamicSharedMemorySize
with CUDA driver and runtime respectively. FieldshmemLimitConfig
in the kernel activity recordCUpti_ActivityKernel5
shows the user choice. This helps in correct occupancy calculation. ValueFUNC_SHMEM_LIMIT_OPTIN
in the enumcudaOccFuncShmemConfig
is the corresponding option in the CUDA occupancy calculator.
Resolved Issues
Resolved an issue that causes incorrect or stale timing for memcopy and serial kernel activities.
Overhead for PC Sampling Activity APIs is reduced by avoiding the reconfiguration of the GPU when PC sampling period doesn’t change between successive kernels. This is applicable for devices with compute capability 7.0 and higher.
Fixed issues in the API
cuptiFinalize()
including the issue which may cause the application to crash. This API provides ability for safe and full detach of CUPTI during the execution of the application. More details in the section Dynamic Detach.
1.1.16.14. Updates in CUDA 10.2
New Features
CUPTI allows tracing features for non-root and non-admin users on desktop platforms. Note that events and metrics profiling is still restricted for non-root and non-admin users. More details about the issue and the solutions can be found on this web page.
CUPTI no longer turns off the performance characteristics of CUDA Graph when tracing the application.
CUPTI now shows memset nodes in the CUDA graph.
Fixed the incorrect timing issue for the asynchronous cuMemset/cudaMemset activity.
Several performance improvements are done in the tracing path.
1.1.16.15. Updates in CUDA 10.1 Update 2
New Features
This release is focused on bug fixes and stability of the CUPTI.
A security vulnerability issue required profiling tools to disable all the features for non-root or non-admin users. As a result, CUPTI cannot profile the application when using a Windows 419.17 or Linux 418.43 or later driver. More details about the issue and the solutions can be found on this web page.
1.1.16.16. Updates in CUDA 10.1 Update 1
New Features
Support for the IBM POWER platform is added for the
Profiling APIs in the header
cupti_profiler_target.h
Perfworks metric APIs in the headers
nvperf_host.h
andnvperf_target.h
1.1.16.17. Updates in CUDA 10.1
New Features
This release is focused on bug fixes and performance improvements.
The new set of profiling APIs and Perfworks metric APIs which were introduced in the CUDA Toolkit 10.0 are now integrated into the CUPTI library distributed in the CUDA Toolkit. Refer to the sections CUPTI Profiling API and Perfworks Metric APIs for documentation of the new APIs.
Event collection mode
CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS
is now supported on all device classes including Geforce and Quadro.Support for the NVTX string registration API
nvtxDomainRegisterStringA().
Added enum
CUpti_PcieGen
to list PCIe generations.
1.1.16.18. Updates in CUDA 10.0
New Features
Added tracing support for devices with compute capability 7.5.
A new set of metric APIs are added for devices with compute capability 7.0 and higher. These provide low and deterministic profiling overhead on the target system. These APIs are currently supported only on Linux x86 64-bit and Windows 64-bit platforms. Refer to the CUPTI web page for documentation and details to download the package with support for these new APIs. Note that both the old and new metric APIs are supported for compute capability 7.0. This is to enable transition of code to the new metric APIs. But one cannot mix the usage of the old and new metric APIs.
CUPTI supports profiling of OpenMP applications. OpenMP profiling information is provided in the form of new activity records
CUpti_ActivityOpenMp
. New APIcuptiOpenMpInitialize
is used to initialize profiling for supported OpenMP runtimes.Activity record for kernel
CUpti_ActivityKernel4
provides shared memory size set by the CUDA driver.Tracing support for CUDA kernels, memcpy and memset nodes launched by a CUDA Graph.
Added support for resource callbacks for resources associated with the CUDA Graph. Refer enum
CUpti_CallbackIdResource
for new callback IDs.
1.1.16.19. Updates in CUDA 9.2
New Features
Added support to query PCI devices information which can be used to construct the PCIe topology. See activity kind
CUPTI_ACTIVITY_KIND_PCIE
and related activity recordCUpti_ActivityPcie
.To view and analyze bandwidth of memory transfers over PCIe topologies, new set of metrics to collect total data bytes transmitted and received through PCIe are added. Those give accumulated count for all devices in the system. These metrics are collected at the device level for the entire application. And those are made available for devices with compute capability 5.2 and higher.
CUPTI added support for new metrics:
Instruction executed for different types of load and store
Total number of cached global/local load requests from SM to texture cache
Global atomic/non-atomic/reduction bytes written to L2 cache from texture cache
Surface atomic/non-atomic/reduction bytes written to L2 cache from texture cache
Hit rate at L2 cache for all requests from texture cache
Device memory (DRAM) read and write bytes
The utilization level of the multiprocessor function units that execute tensor core instructions for devices with compute capability 7.0
A new attribute
CUPTI_EVENT_ATTR_PROFILING_SCOPE
is added under enumCUpti_EventAttribute
to query the profiling scope of a event. Profiling scope indicates if the event can be collected at the context level or device level or both. See EnumCUpti_EventProfilingScope
for available profiling scopes.A new error code
CUPTI_ERROR_VIRTUALIZED_DEVICE_NOT_SUPPORTED
is added to indicate that tracing and profiling on virtualized GPU is not supported.
1.1.16.20. Updates in CUDA 9.1
New Features
Added a field for correlation ID in the activity record
CUpti_ActivityStream
.
1.1.16.21. Updates in CUDA 9.0
New Features
CUPTI extends tracing and profiling support for devices with compute capability 7.0.
Usage of compute device memory can be tracked through CUPTI. A new activity record
CUpti_ActivityMemory
and activity kindCUPTI_ACTIVITY_KIND_MEMORY
are added to track the allocation and freeing of memory. This activity record includes fields like virtual base address, size, PC (program counter), timestamps for memory allocation and free calls.Unified memory profiling adds new events for thrashing, throttling, remote map and device-to-device migration on 64 bit Linux platforms. New events are added under enum
CUpti_ActivityUnifiedMemoryCounterKind
. EnumCUpti_ActivityUnifiedMemoryRemoteMapCause
lists possible causes for remote map events.PC sampling supports wide range of sampling periods ranging from 2^5 cycles to 2^31 cycles per sample. This can be controlled through new field
samplingPeriod2
in the PC sampling configuration structCUpti_ActivityPCSamplingConfig
.Added API
cuptiDeviceSupported()
to check support for a compute device.Activity record
CUpti_ActivityKernel3
for kernel execution has been deprecated and replaced by new activity recordCUpti_ActivityKernel4
. New record gives information about queued and submit timestamps which can help to determine software and hardware latencies associated with the kernel launch. These timestamps are not collected by default. Use APIcuptiActivityEnableLatencyTimestamps()
to enable collection. New fieldlaunchType
of typeCUpti_ActivityLaunchType
can be used to determine if it is a cooperative CUDA kernel launch.Activity record
CUpti_ActivityPCSampling2
for PC sampling has been deprecated and replaced by new activity recordCUpti_ActivityPCSampling3
. New record accommodates 64-bit PC Offset supported on devices of compute capability 7.0 and higher.Activity record
CUpti_ActivityNvLink
for NVLink attributes has been deprecated and replaced by new activity recordCUpti_ActivityNvLink2
. New record accommodates increased port numbers between two compute devices.Activity record
CUpti_ActivityGlobalAccess2
for source level global accesses has been deprecated and replaced by new activity recordCUpti_ActivityGlobalAccess3
. New record accommodates 64-bit PC Offset supported on devices of compute capability 7.0 and higher.New attributes
CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_SIZE
andCUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_LIMIT
are added in the activity attribute enumCUpti_ActivityAttribute
to set and get the profiling semaphore pool size and the pool limit.
1.1.16.22. Updates in CUDA 8.0
New Features
Sampling of the program counter (PC) is enhanced to point out the true latency issues, it indicates if the stall reasons for warps are actually causing stalls in the issue pipeline. Field
latencySamples
of new activity recordCUpti_ActivityPCSampling2
provides true latency samples. This field is valid for devices with compute capability 6.0 and higher. See section PC Sampling for more details.Support for NVLink topology information such as the pair of devices connected via NVLink, peak bandwidth, memory access permissions etc is provided through new activity record
CUpti_ActivityNvLink
. NVLink performance metrics for data transmitted/received, transmit/receive throughput and respective header overhead for each physical link. See section NVLink for more details.CUPTI supports profiling of OpenACC applications. OpenACC profiling information is provided in the form of new activity records
CUpti_ActivityOpenAccData
,CUpti_ActivityOpenAccLaunch
andCUpti_ActivityOpenAccOther
. This aids in correlating OpenACC constructs on the CPU with the corresponding activity taking place on the GPU, and mapping it back to the source code. New APIcuptiOpenACCInitialize
is used to initialize profiling for supported OpenACC runtimes. See section OpenACC for more details.Unified memory profiling provides GPU page fault events on devices with compute capability 6.0 and 64 bit Linux platforms. Enum
CUpti_ActivityUnifiedMemoryAccessType
lists memory access types for GPU page fault events and enumCUpti_ActivityUnifiedMemoryMigrationCause
lists migration causes for data transfer events.Unified Memory profiling support is extended to Mac platform.
Support for 16-bit floating point (FP16) data format profiling. New metrics inst_fp_16, flop_count_hp_add, flop_count_hp_mul, flop_count_hp_fma, flop_count_hp, flop_hp_efficiency, half_precision_fu_utilization are supported. Peak FP16 flops per cycle for device can be queried using the enum
CUPTI_DEVICE_ATTR_FLOP_HP_PER_CYCLE
added toCUpti_DeviceAttribute
.Added new activity kinds
CUPTI_ACTIVITY_KIND_SYNCHRONIZATION
,CUPTI_ACTIVITY_KIND_STREAM
andCUPTI_ACTIVITY_KIND_CUDA_EVENT
, to support the tracing of CUDA synchronization constructs such as context, stream and CUDA event synchronization. Synchronization details are provided in the form of new activity recordCUpti_ActivitySynchronization
. EnumCUpti_ActivitySynchronizationType
lists different types of CUDA synchronization constructs.APIs
cuptiSetThreadIdType()
/cuptiGetThreadIdType()
to set/get the mechanism used to fetch the thread-id used in CUPTI records. EnumCUpti_ActivityThreadIdType
lists all supported mechanisms.Added API
cuptiComputeCapabilitySupported()
to check the support for a specific compute capability by the CUPTI.Added support to establish correlation between an external API (such as OpenACC, OpenMP) and CUPTI API activity records. APIs
cuptiActivityPushExternalCorrelationId()
andcuptiActivityPopExternalCorrelationId()
should be used to push and pop external correlation ids for the calling thread. Generated records of typeCUpti_ActivityExternalCorrelation
contain both external and CUPTI assigned correlation ids.Added containers to store the information of events and metrics in the form of activity records
CUpti_ActivityInstantaneousEvent
,CUpti_ActivityInstantaneousEventInstance
,CUpti_ActivityInstantaneousMetric
andCUpti_ActivityInstantaneousMetricInstance
. These activity records are not produced by the CUPTI, these are included for completeness and ease-of-use. Profilers built on top of CUPTI that sample events may choose to use these records to store the collected event data.Support for domains and annotation of synchronization objects added in NVTX v2. New activity record
CUpti_ActivityMarker2
and enums to indicate various stages of synchronization object i.e.CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE
,CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE_SUCCESS
,CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE_FAILED
andCUPTI_ACTIVITY_FLAG_MARKER_SYNC_RELEASE
are added.Unused field
runtimeCorrelationId
of the activity recordCUpti_ActivityMemset
is broken into two fieldsflags
andmemoryKind
to indicate the asynchronous behavior and the kind of the memory used for the memset operation. It is supported by the new flagCUPTI_ACTIVITY_FLAG_MEMSET_ASYNC
added in the enumCUpti_ActivityFlag
.Added flag
CUPTI_ACTIVITY_MEMORY_KIND_MANAGED
in the enumCUpti_ActivityMemoryKind
to indicate managed memory.API
cuptiGetStreamId
has been deprecated. A new APIcuptiGetStreamIdEx
is introduced to provide the stream id based on the legacy or per-thread default stream flag.
1.1.16.23. Updates in CUDA 7.5
New Features
Device-wide sampling of the program counter (PC) is enabled by default. This was a preview feature in the CUDA Toolkit 7.0 release and it was not enabled by default.
Ability to collect all events and metrics accurately in presence of multiple contexts on the GPU is extended for devices with compute capability 5.x.
API
cuptiGetLastError
is introduced to return the last error that has been produced by any of the CUPTI API calls or the callbacks in the same host thread.Unified memory profiling is supported with MPS (Multi-Process Service)
Callback is provided to collect replay information after every kernel run during kernel replay. See API
cuptiKernelReplaySubscribeUpdate
and callback typeCUpti_KernelReplayUpdateFunc
.Added new attributes in enum
CUpti_DeviceAttribute
to query maximum shared memory size for different cache preferences for a device function.
1.1.16.24. Updates in CUDA 7.0
New Features
CUPTI supports device-wide sampling of the program counter (PC). Program counters along with the stall reasons from all active warps are sampled at a fixed frequency in the round robin order. Activity record
CUpti_ActivityPCSampling
enabled using activity kindCUPTI_ACTIVITY_KIND_PC_SAMPLING
outputs stall reason along with PC and other related information. EnumCUpti_ActivityPCSamplingStallReason
lists all the stall reasons. Sampling period is configurable and can be tuned using APIcuptiActivityConfigurePCSampling
. This feature is available on devices with compute capability 5.2.Added new activity record
CUpti_ActivityInstructionCorrelation
which can be used to dump source locator records for all the PCs of the function.All events and metrics for devices with compute capability 3.x and 5.0 can be collected accurately in presence of multiple contexts on the GPU. In previous releases only some events and metrics could be collected accurately when multiple contexts were executing on the GPU.
Unified memory profiling is enhanced by providing fine grain data transfers to and from the GPU, coupled with more accurate timestamps with each transfer. This information is provided through new activity record
CUpti_ActivityUnifiedMemoryCounter2
, deprecating old recordCUpti_ActivityUnifiedMemoryCounter
.MPS tracing and profiling support is extended on multi-gpu setups.
Activity record
CUpti_ActivityDevice
for device information has been deprecated and replaced by new activity recordCUpti_ActivityDevice2
. New record adds device UUID which can be used to uniquely identify the device across profiler runs.Activity record
CUpti_ActivityKernel2
for kernel execution has been deprecated and replaced by new activity recordCUpti_ActivityKernel3
. New record gives information about Global Partitioned Cache Configuration requested and executed. Partitioned global caching has an impact on occupancy calculation. If it is ON, then a CTA can only use a half SM, and thus a half of the registers available per SM. The new fields apply for devices with compute capability 5.2 and higher. Note that this change was done in CUDA 6.5 release with support for compute capability 5.2.
1.1.16.25. Updates in CUDA 6.5
New Features
Instruction classification is done for source-correlated Instruction Execution activity
CUpti_ActivityInstructionExecution
. SeeCUpti_ActivityInstructionClass
for instruction classes.Two new device attributes are added to the activity
CUpti_DeviceAttribute
:CUPTI_DEVICE_ATTR_FLOP_SP_PER_CYCLE
gives peak single precision flop per cycle for the GPU.CUPTI_DEVICE_ATTR_FLOP_DP_PER_CYCLE
gives peak double precision flop per cycle for the GPU.
Two new metric properties are added:
CUPTI_METRIC_PROPERTY_FLOP_SP_PER_CYCLE
gives peak single precision flop per cycle for the GPU.CUPTI_METRIC_PROPERTY_FLOP_DP_PER_CYCLE
gives peak double precision flop per cycle for the GPU.
Activity record
CUpti_ActivityGlobalAccess
for source level global access information has been deprecated and replaced by new activity recordCUpti_ActivityGlobalAccess2
. New record additionally gives information needed to map SASS assembly instructions to CUDA C source code. And it also provides ideal L2 transactions count based on the access pattern.Activity record
CUpti_ActivityBranch
for source level branch information has been deprecated and replaced by new activity recordCUpti_ActivityBranch2
. New record additionally gives information needed to map SASS assembly instructions to CUDA C source code.Sample
sass_source_map
is added to demonstrate the mapping of SASS assembly instructions to CUDA C source code.Default event collection mode is changed to Kernel (
CUPTI_EVENT_COLLECTION_MODE_KERNEL
) from Continuous (CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS
). Also Continuous mode is supported only on Tesla devices.Profiling results might be inconsistent when auto boost is enabled. Profiler tries to disable auto boost by default, it might fail to do so in some conditions, but profiling will continue. A new API
cuptiGetAutoBoostState
is added to query the auto boost state of the device. This API returns errorCUPTI_ERROR_NOT_SUPPORTED
on devices that don’t support auto boost. Note that auto boost is supported only on certain Tesla devices from the Kepler+ family.Activity record
CUpti_ActivityKernel2
for kernel execution has been deprecated and replaced by new activity recordCUpti_ActivityKernel3
. New record additionally gives information about Global Partitioned Cache Configuration requested and executed. The new fields apply for devices with 5.2 Compute Capability.
1.1.16.26. Updates in CUDA 6.0
New Features
Two new CUPTI activity kinds have been introduced to enable two new types of source-correlated data collection. The
Instruction Execution
kind collects SASS-level instruction execution counts, divergence data, and predication data. TheShared Access
kind collects source correlated data indication inefficient shared memory accesses.CUPTI provides support for CUDA applications using Unified Memory. A new activity record reports Unified Memory activity such as transfers to and from a GPU and the number of Unified Memory related page faults.
CUPTI recognized and reports the special MPS context that is used by CUDA applications running on a system with MPS enabled.
The CUpti_ActivityContext activity record
CUpti_ActivityContext
has been updated to introduce a new field into the structure in a backwards compatible manner. The 32-bitcomputeApiKind
field was replaced with two 16 bit fields,computeApiKind
anddefaultStreamId
. Because all validcomputeApiKind
values fit within 16 bits, and because all supported CUDA platforms are little-endian, persisted context record data read with the new structure will have the correct value forcomputeApiKind
and have a value of zero fordefaultStreamId
. The CUPTI client is responsible for versioning the persisted context data to recognize when thedefaultStreamId
field is valid.To ensure that metric values are calculated as accurately as possible, a new metric API is introduced. Function
cuptiMetricGetRequiredEventGroupSets
can be used to get the groups of events that should be collected at the same time.Execution overheads introduced by CUPTI have been dramatically decreased.
The new activity buffer API introduced in CUDA Toolkit 5.5 is required. The legacy
cuptiActivityEnqueueBuffer
andcuptiActivityDequeueBuffer
functions have been removed.
1.1.16.27. Updates in CUDA 5.5
New Features
Applications that use CUDA Dynamic Parallelism can be profiled using CUPTI. Device-side kernel launches are reported using a new activity kind.
Device attributes such as power usage, clocks, thermals, etc. are reported via a new activity kind.
A new activity buffer API uses callbacks to request and return buffers of activity records. The existing
cuptiActivityEnqueueBuffer
andcuptiActivityDequeueBuffer
functions are still supported but are deprecated and will be removed in a future release.The Event API supports kernel replay so that any number of events can be collected during a single run of the application.
A new metric API
cuptiMetricGetValue2
allows metric values to be calculated for any device, even if that device is not available on the system.CUDA peer-to-peer memory copies are reported explicitly via the activity API. In previous releases these memory copies were only partially reported.
1.2. Known Issues
The following are known issues with the current release.
A security vulnerability issue required profiling tools to disable features using GPU performance counters for non-root or non-admin users when using a Windows 419.17 or Linux 418.43 or later driver. By default, NVIDIA drivers require elevated permissions to access GPU performance counters. On Tegra platforms, profile as root or using sudo. On other platforms, you can either start profiling as root or using sudo, or by enabling non-admin profiling. More details about the issue and the solutions can be found on the ERR_NVGPUCTRPERM web page.
Note
CUPTI allows tracing features for non-root and non-admin users on desktop platforms only, Tegra platforms require root or sudo access.
Profiling results might be inconsistent when auto boost is enabled. Profiler tries to disable auto boost by default. But it might fail to do so in some conditions and profiling will continue and results will be inconsistent. API
cuptiGetAutoBoostState()
can be used to query the auto boost state of the device. This API returns errorCUPTI_ERROR_NOT_SUPPORTED
on devices that don’t support auto boost. Note that auto boost is supported only on certain Tesla devices with compute capability 3.0 and higher.CUPTI doesn’t populate the activity structures which are deprecated, instead the newer version of the activity structure is filled with the information.
Because of the low resolution of the timer on Windows, the start and end timestamps can be same for activities having short execution duration on Windows.
The application which calls CUPTI APIs cannot be used with Nvidia tools like
nvprof
,Nvidia Visual Profiler
,Nsight Compute
,Nsight Systems
,Nvidia Nsight Visual Studio Edition
,cuda-gdb
andcuda-memcheck
.PCIe and NVLink records, when enabled using the API
cuptiActivityEnable
, are not captured when CUPTI is initialized lazily after the CUDA initialization. APIcuptiActivityEnableAndDump
can be used to dump the records for these activities at any point during the profiling session.CUPTI fails to profile the OpenACC application when the OpenACC library linked with the application has missing definition of the OpenACC API routine/s. This is indicated by the error code
CUPTI_ERROR_OPENACC_UNDEFINED_ROUTINE
.OpenACC profiling might fail when OpenACC library is linked statically in the user application. This happens due to the missing definition of the OpenACC API routines needed for the OpenACC profiling, as compiler might ignore definitions for the functions not used in the application. This issue can be mitigated by linking the OpenACC library dynamically.
Unified memory profiling is not supported on the ARM architecture.
Profiling a C++ application which overloads the new operator at the global scope and uses any CUDA APIs like cudaMalloc() or cudaMallocManaged() inside the overloaded new operator will result in a hang.
Devices with compute capability 6.0 and higher introduce a new feature, compute preemption, to give fair chance for all compute contexts while running long tasks. With compute preemption feature-
If multiple contexts are running in parallel it is possible that long kernels will get preempted.
Some kernels may get preempted occasionally due to timeslice expiry for the context.
If kernel has been preempted, the time the kernel spends preempted is still counted towards kernel duration.
Compute preemption can affect events and metrics collection. The following are known issues with the current release:
Events and metrics collection for a MPS client can result in higher counts than expected on devices with compute capability 7.0 and higher, since MPS client may get preempted due to termination of another MPS client.
Events warps_launched and sm_cta_launched and metric inst_per_warp might provide higher counts than expected on devices with compute capability 6.0 and higher. Metric unique_warps_launched can be used in place of warps_launched to get correct count of actual warps launched as it is not affected by compute preemption.
To avoid compute preemption affecting profiler results try to isolate the context being profiled:
Run the application on secondary GPU where display is not connected.
On Linux if the application is running on the primary GPU where the display driver is connected then unload the display driver.
Run only one process that uses GPU at one time.
Devices with compute capability 6.0 and higher support demand paging. When the kernel is scheduled for the first time, all the pages allocated using cudaMallocManaged and that are required for execution of the kernel are fetched in the global memory when GPU faults are generated. Profiler requires multiple passes to collect all the metrics required for kernel analysis. The kernel state needs to be saved and restored for each kernel replay pass. For devices with compute capability 6.0 and higher and platforms supporting Unified memory, in the first kernel iteration the GPU faults will be generated and all pages will be fetched in the global memory. Second iteration onwards GPU page faults will not occur. This will significantly affect the memory related events and timing. The time taken from trace will include the time required to fetch the pages but most of the metrics profiled in multiple iterations will not include time/cycles required to fetch the pages. This causes inconsistency in the profiler results.
When profiling an application that uses CUDA Dynamic Parallelism (CDP) there are several limitations to the profiling tools. CUDA 12.0 adds support for revamped CUDA Dynamic Parallelism APIs (referred to as CDP2), offering substantial performance improvements vs. the legacy CUDA Dynamic Parallelism APIs (referred to as CDP1).
For Legacy CUDA Dynamic Parallelism (CDP1), CUPTI supports tracing of all host and device kernels for devices with compute capability 5.x and 6.x. For devices with compute capability 7.0 and higher, CUPTI traces all the host launched kernels until it encounters a host launched kernel which launches child kernels; subsequent kernels are not traced.
For CUDA Dynamic Parallelism (CDP2), CUPTI supports tracing of host launched kernels only, it can’t trace device launched kernels.
CUPTI doesn’t report CUDA API calls for device launched kernels.
CUPTI doesn’t support profiling of device launched kernels i.e. it doesn’t report detailed event, metric, and source-level results for device launched kernels. Event, metric, and source-level results collected for CPU-launched kernels will include event, metric, and source-level results for the entire call-tree of kernels launched from within that kernel.
When profiling an application that uses CUDA Device Graphs, there are some limitations to the profiling tools.
CUPTI traces the device graph when it is launched from the host. When the graph is launched from the device, graph level tracing is supported, but node level tracing is not.
Compilation of samples autorange_profiling and userrange_profiling requires a host compiler which supports C++11 features. For some g++ compilers, it is required to use the flag -std=c++11 to turn on C++11 features.
PC Sampling Activity API is not supported on Tegra platforms, while PC Sampling API is supported on Tegra platforms.
As of CUDA 11.4 and R470 TRD1 driver release, CUPTI is supported in a vGPU environment which requires a vGPU license. If the license is not obtained after 20 minutes, the reported performance data including metrics from the GPU will be inaccurate. This is because of a feature in vGPU environment which reduces performance but retains functionality as specified here.
CUPTI is not supported on NVIDIA Crypto Mining Processors (CMP). This is reported using the error code
CUPTI_ERROR_CMP_DEVICE_NOT_SUPPORTED
. For more information, please visit the web page.CUPTI versions shipped in the CUDA Toolkit 11.7 and CUDA Toolkit 11.8 don’t support Kepler (sm_35 and sm_37) devices. Refer to the webpages CUPTI 11.7 and CUPTI 11.8 for location of the CUPTI packages having the support for these Kepler devices.
Support for the GA103 GPU was added in the CUDA 11.6 release but it was broken for releases from CUDA 11.8 to CUDA 12.2 Update 1.
Unified memory profiling is broken for Maxwell devices on Windows platform.
For confidential computing devices, allocation of pinned (page-locked) host memory for profiling buffer for concurrent kernel tracing is not supported. Setting attribute
CUPTI_ACTIVITY_ATTR_MEM_ALLOCATION_TYPE_HOST_PINNED
of the activity attribute enumCUpti_ActivityAttribute
will return the error codeCUPTI_ERROR_NOT_SUPPORTED
.There is no tracing and profiling support for chip-to-chip (C2C) interconnect.
With the new PC Sampling APIs CUPTI doesn’t report any pc sampling data for cuda graph launches in serialized mode.
There can be a long delay in attaching CUPTI to a running process. It is possible that the thread might starve if it is unable to access the resources it needs to perform the attach operation, as other threads have higher priority or hold locks on these resources.
Concurrent kernel tracing is not supported for cooperative kernel launches under MPS, and it can cause hang.
1.2.1. Profiling
The following are common known issues for both the event and metric APIs and the profiling APIs:
Profiling may significantly change the overall performance characteristics of the application. Refer to the section CUPTI Overhead for more details.
Profiling a kernel while other contexts are active on the same device (e.g. X server, or secondary CUDA or graphics application) can result in varying metric values for L2/FB (Device Memory) related metrics. Specifically, L2/FB traffic from non-profiled contexts cannot be excluded from the metric results. To completely avoid this issue, profile the application on a GPU without secondary contexts accessing the same device (e.g. no X server on Linux).
Profiling is not supported for multidevice cooperative kernels, that is, kernels launched by using the API functions
cudaLaunchCooperativeKernelMultiDevice
orcuLaunchCooperativeKernelMultiDevice
.Enabling certain events can cause GPU kernels to run longer than the driver’s watchdog time-out limit. In these cases the driver will terminate the GPU kernel resulting in an application error and profiling data will not be available. Please disable the driver watchdog time out before profiling such long running CUDA kernels
On Linux, setting the X Config option Interactive to false is recommended.
For Windows, detailed information about TDR (Timeout Detection and Recovery) and how to disable it is available at https://docs.microsoft.com/en-us/windows-hardware/drivers/display/timeout-detection-and-recovery
The values of operations/second calculated from this NVIDIA developer tools site and generated by using the data center monitoring tools are not calculated in the same way as the operations/second used for export control purposes and should not be relied upon to assess performance against the export control limits.
1.2.1.1. Event and Metric API
The following are known issues related to Event and Metric API:
The CUPTI event APIs from the header
cupti_events.h
and metric APIs from the headercupti_metrics.h
are not supported for the devices with compute capability 7.5 and higher. These are replaced by Profiling API and Perfworks metric API. Refer to the section Migration to the Profiling API.While collecting events in continuous mode, event reporting may be delayed i.e. event values may be returned by a later call to readEvent(s) API and the event values for the last readEvent(s) API may get lost.
When profiling events, it is possible that the domain instance that gets profiled gives event value 0 due to absence of workload on the domain instance since CUPTI profiles one instance of the domain by default. To profile all instances of the domain, user can set event group attribute
CUPTI_EVENT_GROUP_ATTR_PROFILE_ALL_DOMAIN_INSTANCES
through APIcuptiEventGroupSetAttribute()
.Profiling results might be incorrect for CUDA applications compiled with nvcc version older than 9.0 for devices with compute capability 6.0 and 6.1. Profiling session will continue and CUPTI will notify it using error code
CUPTI_ERROR_CUDA_COMPILER_NOT_COMPATIBLE
. It is advised to recompile the application code with nvcc version 9.0 or later. Ignore this warning if code is already compiled with the recommended nvcc version.For some metrics, the required events can only be collected for a single CUDA context. For an application that uses multiple CUDA contexts, these metrics will only be collected for one of the contexts. The metrics that can be collected only for a single CUDA context are indicated in the metric reference tables.
Some metric values are calculated assuming a kernel is large enough to occupy all device multiprocessors with approximately the same amount of work. If a kernel launch does not have this characteristic, then those metric values may not be accurate.
Some events and metrics are not available on all devices. For list of metrics, you can refer to the metric reference tables.
CUPTI can give out of memory error for event and metrics profiling, it could be due to large number of instructions in the kernel.
Profiling is not supported for CUDA kernel nodes launched by a CUDA Graph.
These APIs are not supported on below system configurations:
64-bit ARM Server CPU architecture (arm64 SBSA).
Virtual GPUs (vGPU).
Windows Subsystem for Linux (WSL).
1.2.1.2. Profiling and Perfworks Metric API
The following are known issues related to the Profiling and Perfworks Metric API:
Profiling a kernel while any other GPU work is executing on the same MIG compute instance can result in varying metric values for all units. Care should be taken to serialize, or otherwise prevent concurrent CUDA launches within the target application to ensure those kernels do not influence each other. Be aware that GPU work issued through other APIs in the target process or workloads created by non-target processes running simultaneously in the same MIG compute instance will influence the collected metrics. Note that it is acceptable to run CUDA processes in other MIG compute instances as they will not influence the profiled MIG compute instance.
For devices with compute capability 8.0, the NVLink topology information is available but NVLink performance metrics (
nvlrx__*
andnvltx__*
) are not supported due to a potential application hang during data collection.Profiling is not supported under MPS (Multi-Process Service).
For profiling the CUDA kernel nodes launched by a CUDA Graph, not all combinations of range profiling and replay modes are supported. Here are some limitations:
User replay and application replay modes with auto range are not supported.
In the user range mode, entire graph is profiled as one workload i.e. all the kernel nodes launched by the CUDA Graph will be profiled and single result will be provided, user can’t do the profiling for a range of kernels.
For Device Graph profiling in the auto range and kernel replay mode, each kernel node will be profiled except for the nodes which launch device graphs.
Profiling kernels executed on a device that is part of an SLI group is not supported.
Refer to the section for differences from event and metric APIs.
Profiling on Windows Subsystem for Linux (WSL) is only supported with WSL version 2, NVIDIA display driver version 525 or higher and Windows 11.
Profiling is not supported for applications using Green Contexts.
Profiling is not supported for device graphs which have been updated after instantiation.
1.3. Support
Information on supported platforms and GPUs.
1.3.1. Platform Support
Platform |
Support |
---|---|
Windows |
Yes |
Windows Subsystem for Linux version 2 (WSL 2) |
Yes* |
Linux (x86_64) |
Yes |
Linux (aarch64 sbsa) |
Yes* |
Linux (x86_64) (Drive SDK) |
Yes* |
Linux (aarch64) |
Yes* |
Linux (ppc64le) |
No |
QNX |
Yes* |
Mac OSX |
No |
Android |
No |
Tracing and profiling of 32-bit processes is not supported.
Event and Metric APIs are not supported on Linux (aarch64 sbsa) and WSL 2 platforms.
1.3.2. GPU Support
CUPTI API |
Supported GPU architectures |
Notes |
---|---|---|
Activity |
Maxwell and later GPU architectures, i.e. devices with compute capability 5.0 and higher |
|
Callback |
Maxwell and later GPU architectures, i.e. devices with compute capability 5.0 and higher |
|
Event |
Maxwell, Pascal, Volta |
Not supported on Turing and later GPU architectures, i.e. devices with compute capability 7.5 and higher |
Metric |
Maxwell, Pascal, Volta |
Not supported on Turing and later GPU architectures, i.e. devices with compute capability 7.5 and higher |
Profiling |
Volta and later GPU architectures, i.e. devices with compute capability 7.0 and higher |
Not supported on Maxwell and Pascal GPUs |
PC Sampling |
Volta and later GPU architectures, i.e. devices with compute capability 7.0 and higher |
Not supported on Maxwell and Pascal GPUs |
PM Sampling |
Turing and later GPU architectures, i.e. devices with compute capability 7.5 and higher |
Not supported on Maxwell, Pascal and Volta GPUs |
SASS Metric |
Volta and later GPU architectures, i.e. devices with compute capability 7.0 and higher |
Not supported on Maxwell and Pascal GPUs |
Checkpoint |
Volta and later GPU architectures, i.e. devices with compute capability 7.0 and higher |
Not supported on Maxwell and Pascal GPUs |