Changelog

Profiler changes in CUDA 12.3

List of changes done as part of the CUDA Toolkit 12.3 release.

  • General bug fixes. No new feature is added in this release.

Profiler changes in CUDA 12.2

List of changes done as part of the CUDA Toolkit 12.2 release.

  • General bug fixes. No new feature is added in this release.

Profiler changes in CUDA 12.1

List of changes done as part of the CUDA Toolkit 12.1 release.

  • General bug fixes. No new feature is added in this release.

Profiler changes in CUDA 12.0

List of changes done as part of the CUDA Toolkit 12.0 release.

  • General bug fixes. No new feature is added in this release.

Profiler changes in CUDA 11.8

List of changes done as part of the CUDA Toolkit 11.8 release.

  • General bug fixes. No new feature is added in this release.

Profiler changes in CUDA 11.7

List of changes done as part of the CUDA Toolkit 11.7 release.

  • General bug fixes. No new feature is added in this release.

Profiler changes in CUDA 11.6

List of changes done as part of the CUDA Toolkit 11.6 release.

  • General bug fixes. No new feature is added in this release.

Profiler changes in CUDA 11.5

List of changes done as part of the CUDA Toolkit 11.5 release.

  • General bug fixes. No new feature is added in this release.

Profiler changes in CUDA 11.4

List of changes done as part of the CUDA Toolkit 11.4 release.

  • General bug fixes. No new feature is added in this release.

Profiler changes in CUDA 11.3

List of changes done as part of the CUDA Toolkit 11.3 release.

  • Visual Profiler extends remote profiling support to macOS host running version 11 (Big Sur) on Intel x86_64 architecture.

  • General bug fixes.

Profiler changes in CUDA 11.2

List of changes done as part of the CUDA Toolkit 11.2 release.

  • General bug fixes. No new feature is added in this release.

Profiler changes in CUDA 11.1

List of changes done as part of the CUDA Toolkit 11.1 release.

  • General bug fixes. No new feature is added in this release.

Profiler changes in CUDA 11.0

List of changes done as part of the CUDA Toolkit 11.0 release.

  • Visual Profiler and nvprof don’t support devices with compute capability 8.0 and higher. Next-gen tools NVIDIA Nsight Compute and NVIDIA Nsight Systems should be used instead.

  • Starting with CUDA 11.0, Visual Profiler and nvprof won’t support Mac as the target platform. However, Visual Profiler will continue to support remote profiling from the Mac host. Visual Profiler will be provided in a separate installer package to maintain the remote profiling workflow for CUDA developers on Mac.

  • Added support for tracing OptiX applications.

  • Fixed the nvprof option --annotate-mpi, which had been broken since CUDA 10.0.

Profiler changes in CUDA 10.2

List of changes done as part of the CUDA Toolkit 10.2 release.

  • Visual Profiler and nvprof allow tracing features for non-root and non-admin users on desktop platforms. Note that events and metrics profiling is still restricted for non-root and non-admin users. More details about the issue and the solutions can be found on this web page.

  • Starting with CUDA 10.2, Visual Profiler and nvprof use dynamic/shared CUPTI library. Thus it’s required to set the path to the CUPTI library before launching Visual Profiler and nvprof. CUPTI library can be found at /usr/local/<cuda-toolkit>/extras/CUPTI/lib64 or /usr/local/<cuda-toolkit>/targets/<arch>/lib for POSIX platforms and "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\<cuda-toolkit>\extras\CUPTI\lib64" for Windows.

  • Profilers no longer turn off the performance characteristics of CUDA Graph when tracing the application.

  • Added an option to enable/disable the OpenMP profiling in Visual Profiler.

  • Fixed the incorrect timing issue for the asynchronous cuMemset/cudaMemset activity.
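The CUPTI library path requirement in the list above can be sketched in Python as follows. This is a minimal sketch, not an official setup script: the helper names cupti_lib_dir and env_with_cupti are hypothetical, the toolkit directory names "cuda-10.2" and "v10.2" are only example values, and the use of LD_LIBRARY_PATH (POSIX) or PATH (Windows) as the loader search-path variable is an assumption about the host system.

```python
import ntpath
import os
import posixpath


def cupti_lib_dir(cuda_toolkit, windows=False):
    """Return the CUPTI library directory named in the changelog entry.

    `cuda_toolkit` is the toolkit directory name, for example "cuda-10.2"
    on POSIX or "v10.2" on Windows (example values, not from the changelog).
    """
    if windows:
        root = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA"
        return ntpath.join(root, cuda_toolkit, "extras", "CUPTI", "lib64")
    return posixpath.join("/usr/local", cuda_toolkit, "extras", "CUPTI", "lib64")


def env_with_cupti(env, lib_dir, windows=False):
    """Return a copy of `env` with `lib_dir` prepended to the loader search path."""
    var = "PATH" if windows else "LD_LIBRARY_PATH"  # assumed variable names
    sep = ";" if windows else ":"
    new_env = dict(env)
    current = new_env.get(var, "")
    new_env[var] = lib_dir + sep + current if current else lib_dir
    return new_env


if __name__ == "__main__":
    lib_dir = cupti_lib_dir("cuda-10.2")
    env = env_with_cupti(os.environ, lib_dir)
    print(env["LD_LIBRARY_PATH"])
```

The returned mapping could then be passed as the env argument of subprocess.run when launching nvprof or Visual Profiler, so the current shell environment stays untouched.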

Profiler changes in CUDA 10.1 Update 2

List of changes done as part of the CUDA Toolkit 10.1 Update 2 release.

  • This release is focused on bug fixes and stability of the profiling tools.

  • A security vulnerability issue required profiling tools to disable all the features for non-root or non-admin users. As a result, Visual Profiler and nvprof cannot profile the application when using a Windows 419.17 or Linux 418.43 or later driver. More details about the issue and the solutions can be found on this web page.

  • Visual Profiler requires Java Runtime Environment (JRE) 1.8 to be available on the local system. However, starting with CUDA Toolkit version 10.1 Update 2, the JRE is no longer included in the CUDA Toolkit due to Oracle upgrade licensing changes. The user must install the required version of JRE 1.8 in order to use Visual Profiler. Refer to the section Setting up Java Runtime Environment for more information.

Profiler changes in CUDA 10.1

List of changes done as part of the CUDA Toolkit 10.1 release.

  • This release is focused on bug fixes and stability of the profiling tools.

  • Support for NVTX string registration API nvtxDomainRegisterStringA().

Profiler changes in CUDA 10.0

List of changes done as part of the CUDA Toolkit 10.0 release.

  • Added tracing support for devices with compute capability 7.5.

  • Profiling features for devices with compute capability 7.5 and higher are supported in NVIDIA Nsight Compute. Visual Profiler does not support Guided Analysis, some stages under Unguided Analysis, or events and metrics collection for devices with compute capability 7.5 and higher. The NVIDIA Nsight Compute UI can be launched from Visual Profiler for these devices. Likewise, nvprof does not support query and collection of events and metrics, source level analysis, or other profiling options on devices with compute capability 7.5 and higher. The NVIDIA Nsight Compute command line interface can be used for these features.

  • Visual Profiler and nvprof now support OpenMP profiling where available. See OpenMP for more information.

  • Tracing support for CUDA kernels, memcpy and memset nodes launched by a CUDA Graph.

  • Profiler supports version 3 of NVIDIA Tools Extension API (NVTX). This is a header-only implementation of NVTX version 2.

Profiler changes in CUDA 9.2

List of changes done as part of the CUDA Toolkit 9.2 release.

  • The Visual Profiler allows switching multiple segments to non-segment mode for Unified Memory profiling on the timeline. Earlier it was restricted to a single segment.

  • The Visual Profiler shows a summary view of the memory hierarchy of the CUDA programming model. This is available for devices with compute capability 5.0 and higher. Refer to Memory Statistics for more information.

  • The Visual Profiler can correctly import profiler data generated by nvprof when the option --kernels kernel-filter is used.

  • nvprof supports display of basic PCIe topology, including PCI bridges between NVIDIA GPUs and the Host Bridge.

  • To view and analyze the bandwidth of memory transfers over PCIe topologies, a new set of metrics has been added to collect the total data bytes transmitted and received through PCIe. These metrics give the accumulated count for all devices in the system and are collected at the device level for the entire application. They are available for devices with compute capability 5.2 and higher.

  • The Visual Profiler and nvprof added support for new metrics:

    • Instructions executed for different types of load and store

    • Total number of cached global/local load requests from SM to texture cache

    • Global atomic/non-atomic/reduction bytes written to L2 cache from texture cache

    • Surface atomic/non-atomic/reduction bytes written to L2 cache from texture cache

    • Hit rate at L2 cache for all requests from texture cache

    • Device memory (DRAM) read and write bytes

    • The utilization level of the multiprocessor function units that execute tensor core instructions for devices with compute capability 7.0

  • nvprof can collect tracing information along with the profiling information in the same pass. Use the new option --trace <api|gpu> to enable tracing along with the collection of events/metrics.
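The combined trace-and-metrics collection described in the last item can be illustrated with a small Python sketch that only assembles the command line and does not run it. The helper name nvprof_trace_cmd, the application path ./my_app, and the choice of the achieved_occupancy metric are illustrative assumptions; the --trace values come from the changelog entry itself.

```python
def nvprof_trace_cmd(app, metrics, trace="gpu"):
    """Build an nvprof argument list that collects metrics and a trace in one pass."""
    if trace not in ("api", "gpu"):
        raise ValueError("trace must be 'api' or 'gpu'")
    if not metrics:
        raise ValueError("at least one metric is required")
    # nvprof accepts several metric names as one comma-separated argument.
    return ["nvprof", "--trace", trace, "--metrics", ",".join(metrics), app]


if __name__ == "__main__":
    cmd = nvprof_trace_cmd("./my_app", ["achieved_occupancy"])
    print(" ".join(cmd))
```

Returning an argument list rather than a single string keeps the sketch safe to hand to subprocess.run without shell quoting concerns.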

Profiler changes in CUDA 9.1

List of changes done as part of the CUDA Toolkit 9.1 release.

  • The Visual Profiler shows the breakdown of the time spent on the CPU for each thread in the CPU Details View.

  • The Visual Profiler supports a new option to select the PC sampling frequency.

  • The Visual Profiler shows NVLink version in the NVLink topology.

  • nvprof provides the correlation ID when profiling data is generated in CSV format.

Profiler changes in CUDA 9.0

List of changes done as part of the CUDA Toolkit 9.0 release.

  • Visual Profiler and nvprof now support profiling on devices with compute capability 7.0.

  • Tools and extensions for profiling are hosted on GitHub at https://github.com/NVIDIA/cuda-profiler

  • There are several enhancements to Unified Memory profiling:

    • The Visual Profiler now associates unified memory events with the source code at which the memory is allocated.

    • The Visual Profiler now correlates a CPU page fault to the source code resulting in the page fault.

    • New Unified Memory profiling events for page thrashing, throttling and remote map are added.

    • The Visual Profiler provides an option to switch between segment and non-segment mode on the timeline.

    • The Visual Profiler supports filtering of Unified Memory profiling events based on the virtual address, migration reason or the page fault access type.

    • CPU page fault support is extended to Mac platforms.

  • Tracing and profiling of cooperative kernel launches is supported.

  • The Visual Profiler shows NVLink events on the timeline.

  • The Visual Profiler color codes links in the NVLink topology diagram based on throughput.

  • The Visual Profiler supports new options to make it easier to do multi-hop remote profiling.

  • nvprof supports a new option to select the PC sampling frequency.

  • The Visual Profiler supports remote profiling to systems supporting ssh key exchange algorithms with a key length of 2048 bits.

  • OpenACC profiling is now also supported on non-NVIDIA systems.

  • nvprof flushes all profiling data when a SIGINT or SIGKILL signal is encountered.

Profiler changes in CUDA 8.0

List of changes done as part of the CUDA Toolkit 8.0 release.

  • Visual Profiler and nvprof now support NVLink analysis for devices with compute capability 6.0. See NVLink view for more information.

  • Visual Profiler and nvprof now support dependency analysis, which enables optimization of the program runtime and concurrency of applications utilizing multiple CPU threads and CUDA streams. It allows computing the critical path of a specific execution, detecting waiting time, and inspecting dependencies between functions executing in different threads or streams. See Dependency Analysis for more information.

  • Visual Profiler and nvprof now support OpenACC profiling. See OpenACC for more information.

  • Visual Profiler now supports CPU profiling. Refer to CPU Details View and CPU Source View for more information.

  • Unified Memory profiling now provides GPU page fault information on devices with compute capability 6.0 and 64-bit Linux platforms.

  • Unified Memory profiling now provides CPU page fault information on 64-bit Linux platforms.

  • Unified Memory profiling support is extended to the Mac platform.

  • The Visual Profiler source-disassembly view has several enhancements. There is now a single integrated view for the different source level analysis results collected for a kernel instance. Results of different analysis steps can be viewed together. See Source-Disassembly View for more information.

  • The PC sampling feature is enhanced to point out the true latency issues for devices with compute capability 6.0 and higher.

  • Support for 16-bit floating point (FP16) data format profiling.

  • If the new domains feature of the NVIDIA Tools Extension API (NVTX) is used, Visual Profiler and nvprof will show the NVTX markers and ranges grouped by domain.

  • The Visual Profiler now adds a default file extension .nvvp if an extension is not specified when saving or opening a session file.

  • The Visual Profiler now supports timeline filtering options in the create new session and import dialogs. Refer to the “Timeline Options” section under Creating a Session for more details.

Profiler changes in CUDA 7.5

List of changes done as part of the CUDA Toolkit 7.5 release.

  • Visual Profiler now supports PC sampling for devices with compute capability 5.2. Warp state, including stall reasons, is shown at source level for kernel latency analysis. See PC Sampling View for more information.

  • Visual Profiler now supports profiling child processes and profiling all processes launched on the same system. See Creating a Session for more information on the new multi-process profiling options. For profiling CUDA applications using Multi-Process Service (MPS), see MPS profiling with Visual Profiler.

  • Visual Profiler import now supports browsing and selecting files on a remote system.

  • nvprof now supports CPU profiling. See CPU Sampling for more information.

  • All events and metrics for devices with compute capability 5.2 can now be collected accurately in presence of multiple contexts on the GPU.

Profiler changes in CUDA 7.0

The profiling tools contain a number of changes and new features as part of the CUDA Toolkit 7.0 release.

  • The Visual Profiler has been updated with several enhancements:

    • Performance is improved when loading large data files. Memory usage is also reduced.

    • Visual Profiler timeline is improved to view multi-GPU MPS profile data.

    • Unified memory profiling is enhanced by providing fine-grained data transfers to and from the GPU, coupled with more accurate timestamps for each transfer.

  • nvprof has been updated with several enhancements:

    • All events and metrics for devices with compute capability 3.x and 5.0 can now be collected accurately in presence of multiple contexts on the GPU.

Profiler changes in CUDA 6.5

List of changes done as part of the CUDA Toolkit 6.5 release.

  • The Visual Profiler kernel memory analysis has been updated with several enhancements:

    • ECC overhead is added which provides a count of memory transactions required for ECC

    • Under L2 cache, a breakdown of transactions for L1 Reads, L1 Writes, Texture Reads, Atomic and Noncoherent reads is shown

    • Under L1 cache a count of Atomic transactions is shown

  • The Visual Profiler kernel profile analysis view has been updated with several enhancements:

    • Initially the instruction with maximum execution count is highlighted

    • A bar is shown in the background of the counter value for the “Exec Count” column to make it easier to identify instructions with high execution counts

    • The current assembly instruction block is highlighted using two horizontal lines around the block. Also “next” and “previous” buttons are added to move to the next or previous block of assembly instructions.

    • Syntax highlighting is added for the CUDA C source.

    • Support is added for showing or hiding columns.

    • A tooltip describing each column is added.

  • nvprof now supports a new application replay mode for collecting multiple events and metrics. In this mode the application is run multiple times instead of using kernel replay. This is useful when the kernel uses a large amount of device memory and kernel replay is slow because of the high overhead of saving and restoring device memory for each kernel replay run. See Event/metric Summary Mode for more information. Visual Profiler also supports this new application replay mode, and it can be enabled in the Visual Profiler “New Session” dialog.

  • Visual Profiler now displays peak single precision flops and double precision flops for a GPU under device properties.

  • Improved source-to-assembly code correlation for CUDA Fortran applications compiled by the PGI CUDA Fortran compiler.
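The application replay mode mentioned in the list above can be sketched as a command-line builder in Python. This is a hedged illustration rather than the documented workflow: the helper name nvprof_replay_cmd and the application path ./my_app are hypothetical, and the sketch assumes the --replay-mode option with the values disabled, kernel and application as found in nvprof's option list, together with example event and metric names.

```python
REPLAY_MODES = ("disabled", "kernel", "application")  # assumed nvprof values


def nvprof_replay_cmd(app, events=(), metrics=(), replay_mode="application"):
    """Build an nvprof argument list that selects a replay mode for collection."""
    if replay_mode not in REPLAY_MODES:
        raise ValueError("unknown replay mode: " + replay_mode)
    if not events and not metrics:
        raise ValueError("specify at least one event or metric")
    cmd = ["nvprof", "--replay-mode", replay_mode]
    if events:
        cmd += ["--events", ",".join(events)]
    if metrics:
        cmd += ["--metrics", ",".join(metrics)]
    cmd.append(app)
    return cmd


if __name__ == "__main__":
    # Re-run the whole application per pass instead of replaying each kernel.
    print(" ".join(nvprof_replay_cmd("./my_app", metrics=["ipc"])))
```

Application replay only gives consistent results if the application behaves the same way on every run, so it fits deterministic workloads best.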

Profiler changes in CUDA 6.0

List of changes done as part of the CUDA Toolkit 6.0 release.

  • Unified Memory is fully supported by both the Visual Profiler and nvprof. Both profilers allow you to see the Unified Memory related memory traffic to and from each GPU on your system.

  • The standalone Visual Profiler, nvvp, now provides a multi-process timeline view. You can import multiple timeline data sets collected with nvprof into nvvp and view them on the same timeline to see how they are sharing the GPU(s). This multi-process import capability also includes support for CUDA applications using MPS. See MPS Profiling for more information.

  • The Visual Profiler now supports a remote profiling mode that allows you to collect a profile on a remote Linux system and view the timeline, analysis results, and detailed results on your local Linux, Mac, or Windows system. See Remote Profiling for more information.

  • The Visual Profiler analysis system now includes a side-by-side source and disassembly view annotated with instruction execution counts, inactive thread counts, and predicated instruction counts. This new view enables you to find hotspots and inefficient code sequences within your kernels.

  • The Visual Profiler analysis system has been updated with several new analysis passes: 1) kernel instructions are categorized into classes so that you can see if the instruction mix matches your expectations, 2) inefficient shared memory access patterns are detected and reported, and 3) per-SM activity level is presented to help you detect load-balancing issues across the blocks of your kernel.

  • The Visual Profiler guided analysis system can now generate a kernel analysis report. The report is a PDF version of the per-kernel information presented by the guided analysis system.

  • Both nvvp and nvprof can now operate on a system that does not have an NVIDIA GPU. You can import profile data collected from another system and view and analyze it on your GPU-less system.

  • Profiling overheads for both nvvp and nvprof have been significantly reduced.