NVIDIA® Nsight™ Application Development Environment for Heterogeneous Platforms, Visual Studio Edition 5.3 User Guide
The NVIDIA Nsight Analysis tools contain a CUDA Profiler Activity that allows you to gather detailed performance information, in addition to timing and launch configuration details.
A CUDA Profiler activity consists of a kernel filter and a set of profiler experiments. Profile experiments are directed analysis tests targeted at collecting in-depth performance information for an isolated instance of a kernel launch.
The CUDA Profiler allows the user to collect an arbitrary number of experiments per kernel launch. Each profile experiment may require the target kernel to be executed one or more times in order to collect all of the required data. In the following examples, the individual iterations of a profile experiment are referred to as Experiment Passes. Executing all passes of all experiments for a target kernel launch is handled transparently to the analyzed application.
NVIDIA Nsight employs a replay mechanism for executing experiments. Before the experiment passes are executed, a full snapshot of the mutable state of the target CUDA context is captured. For each experiment pass, the target kernel is executed once, and the saved mutable state is then restored. Effectively, this rewinds the CUDA context to the exact state before the kernel launch, so subsequent experiment passes are guaranteed to operate on the same, unchanged input data and CUDA state.
Note that the NVIDIA Nsight CUDA Profiler attempts to execute the CUDA kernel transparently to the application. Problems may arise if either (1) the application has a hard time limit for the kernel or for other interaction on the thread that launches the kernel, or (2) the kernel uses an IPC mechanism with the CPU that results in state changes on the CPU side.
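The replay mechanism described above can be illustrated with a small host-side sketch. This is not how Nsight is implemented: a plain Python dictionary stands in for the mutable state of the CUDA context, and `run_experiment` and `scale_in_place` are hypothetical names introduced only for this example. As in the description, the state is restored after every pass; how the final result is made visible to the application is not modeled.

```python
import copy

def run_experiment(state, kernel, num_passes):
    """Execute `kernel` once per pass, rewinding `state` after each pass."""
    # Capture the mutable state once, before any pass runs.
    snapshot = copy.deepcopy(state)
    results = []
    for pass_index in range(num_passes):
        # The kernel is free to overwrite its inputs.
        results.append(kernel(state, pass_index))
        # Rewind to the captured state so the next pass sees identical input.
        state.clear()
        state.update(copy.deepcopy(snapshot))
    return results

def scale_in_place(state, pass_index):
    # Hypothetical "kernel" that overwrites its own input buffer.
    state["data"] = [2 * x for x in state["data"]]
    return sum(state["data"])

state = {"data": [1, 2, 3]}
sums = run_experiment(state, scale_in_place, num_passes=3)
```

Because the state is rewound between passes, every pass computes the same sum even though the stand-in kernel modifies its input in place. Without the restore step, each pass would double the data again and the passes would no longer measure the same work.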
As a result, the introduced performance overhead for profiling a single kernel launch is highly dependent on the following factors:
Given these factors, profiling may incur a fairly large performance overhead to the target process. For that reason, it is recommended that the user limit kernel profiling to the specific kernels of interest, and also limit the enabled profiling experiments to only those that are actionable (with respect to the current code optimization efforts). This will help to maintain fast turnaround times.
If no filter is set, or if the specified filter is invalid, all kernel launches for all kernels will be profiled. Because this can incur a high performance overhead, a warning icon will be displayed.
After skipping 5 kernels, profile 80 kernels.
In this scenario, the first 5 kernels will be skipped, then kernels 6 through 85 will be profiled.
Note that the counters for skipping kernels and for limiting the profile session are applied after the Kernel RegEx Filter. In other words, kernel names that do not match the Kernel RegEx filter (if set) will not be counted toward the total number of kernels in the two fields (N kernels and X kernels) described above.
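The ordering of the regular expression filter and the two counters can be sketched as follows. This is an illustration only, not Nsight code: `select_launches` is a hypothetical helper, and the use of `re.search` (match anywhere in the kernel name) is an assumption about the matching semantics.

```python
import re

def select_launches(kernel_names, pattern=None, skip=0, limit=None):
    """Return the indices of the launches that would be profiled."""
    regex = re.compile(pattern) if pattern else None
    selected = []
    matched = 0  # counts only launches that pass the regex filter
    for index, name in enumerate(kernel_names):
        if regex is not None and not regex.search(name):
            continue  # non-matching launches do not advance the counters
        matched += 1
        if matched <= skip:
            continue  # still inside the "skip N kernels" window
        if limit is not None and matched > skip + limit:
            break     # the "profile X kernels" budget is exhausted
        selected.append(index)
    return selected

launches = ["init", "matmul", "reduce", "matmul", "matmul", "reduce", "matmul"]
filtered = select_launches(launches, pattern="matmul", skip=1, limit=2)
unfiltered = select_launches(["k"] * 100, skip=5, limit=80)
```

In the first call, only the `matmul` launches are counted, so the first matching launch (index 1) is skipped and the next two (indices 3 and 4) are selected. The second call reproduces the scenario above: with no regular expression set, launches 6 through 85 (zero-based indices 5 through 84) are selected.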
Note that if the option Non-Overlapping Input/Output Buffers is enabled mistakenly (that is, the profiled kernels do overwrite their input buffers or use unmatched device malloc/free), the behavior of the profiled application is completely undefined. As a consequence, the application might terminate abnormally, or the collected profile data may be invalid.
On the left is a list of the available experiment templates and all available individual experiments. Selecting an item from that list adds the experiment to the active experiments in the middle column.
Note that some experiments can be added multiple times to the middle list, while others are only allowed to be added once. For example, it is possible to collect an arbitrary number of Achieved FLOPS and Achieved IOPS experiments, all with different weight values.
All experiments in the middle section will be executed for each profiled kernel launch. Based on the experiment selection in the middle table, the right document area will show a brief summary. In addition, a few experiments expose further configuration options through the panel on the right. For example, the Achieved FLOPS experiment exposes various floating point operations with modifiable operation filtering selection and hit-count weighting.
The Achieved FLOPS and Achieved IOPS experiments are based on the Disassembly Regex experiment. Expand the Experiment Definition section in their configuration to see the underlying script used. This script can be pasted into the configuration for a Disassembly Regex experiment and used directly or modified for different behavior.
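The general idea of a regex-based, weighted instruction count can be sketched as follows. This is not the script format used by the Disassembly Regex experiment, which is not reproduced here. The function `weighted_count`, the rule list, the sample instructions, and the hit counts are all hypothetical and chosen only to illustrate matching disassembly lines against patterns and weighting the hits.

```python
import re

def weighted_count(instructions, rules):
    """Sum weight * hit_count for every instruction matching a rule.

    instructions: list of (disassembly_text, hit_count) pairs
    rules: list of (regex_pattern, weight) pairs; the first match wins
    """
    total = 0
    for text, hits in instructions:
        for pattern, weight in rules:
            if re.search(pattern, text):
                total += weight * hits
                break  # count each instruction under one rule only
    return total

# Hypothetical disassembly lines with made-up execution counts.
instructions = [
    ("FADD R0, R1, R2;", 128),
    ("FFMA R3, R0, R1, R2;", 256),
    ("IADD R4, R5, R6;", 128),   # integer add, matches no rule below
    ("FMUL R7, R0, R3;", 64),
    ("FFMA R8, R7, R1, R2;", 64),
]

# Assumed weighting: a fused multiply-add counts as two operations.
flop_rules = [(r"^FFMA\b", 2), (r"^(FADD|FMUL)\b", 1)]
flops = weighted_count(instructions, flop_rules)
```

Changing the weights or the patterns changes which operations contribute to the total, which mirrors the role of the operation filtering and hit-count weighting options described above.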
Profiler experiment results are displayed per CUDA launch.
Changing the selection in the CUDA Launches table will update both the correlation pane and the details pane.
NVIDIA® Nsight™ Application Development Environment for Heterogeneous Platforms, Visual Studio Edition User Guide Rev. 5.3.170616 ©2009-2017. NVIDIA Corporation. All Rights Reserved.