CUDA Reports

If you mark the CUDA checkbox in the Trace Settings area of the Activity document, NVIDIA Nsight produces reports that include CUDA-specific trace data.

CUDA Summary

The CUDA Summary report is a top-level report that contains a summary of the CUDA-related information collected in the session. The summary displays the information in three sections:

CUDA Devices

ID: The ID of the CUDA device, as returned by the CUDA Driver API calls cuDeviceGet and cuDeviceGetCount (see the sketch after this table).
Name: The display name of the CUDA device, as returned by cuDeviceGetName.
Contexts: The number of CUcontext instances created on the device.
GPU%: The percentage of the total capture duration during which kernels were executing on the device. This does not include the overhead of launching kernels, so the actual percentage is slightly higher.
H to D (bytes): The number of bytes transferred from the host to the device.
D to H (bytes): The number of bytes transferred from the device to the host.
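
As an illustration only (this is a minimal sketch, not Nsight code), the program below enumerates devices with the Driver API calls named above; any application making these calls during a capture contributes the corresponding ID and Name values to this table:

    #include <cuda.h>
    #include <stdio.h>

    int main(void)
    {
        // Initialize the CUDA Driver API before any other driver call.
        cuInit(0);

        // cuDeviceGetCount and cuDeviceGet supply the device IDs shown
        // in the ID column of the CUDA Devices table.
        int deviceCount = 0;
        cuDeviceGetCount(&deviceCount);

        for (int i = 0; i < deviceCount; ++i) {
            CUdevice device;
            cuDeviceGet(&device, i);

            // cuDeviceGetName supplies the Name column.
            char name[256];
            cuDeviceGetName(name, (int)sizeof(name), device);

            printf("Device %d: %s\n", i, name);
        }
        return 0;
    }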

CUDA Contexts

The first column lists the different attributes that were measured, such as the number of API calls or errors. The second column lists the total count for the analysis session (capture), such as the total number of API calls or the total number of errors for the session. The remaining columns list the counts for each context. The first context is Context 0, which refers to any CUDA Driver API calls made when a CUcontext instance is not active on the current thread.

Device ID: Context 0 refers to any CUDA Driver API calls made when a CUcontext instance is not active on the current thread (see the sketch after this table).

Driver API Calls: In the Total column, click the link to open the CUDA Driver API Call Summary page or the list of CUDA Driver API calls.
  # Calls – the number of API calls.
  # Errors – the number of API calls that did not return CUDA_SUCCESS. To see the individual instances, open the CUDA Driver API Calls page and set the filter to: CUresult Not Equal To 0.
  % Time – the percentage of the capture time during which any CUDA API call was active on a thread.
  NOTE: This is not the percentage of time that API calls were on a CPU core (which could be normalized to 100%), nor is it the sum of all time spent in CUDA API calls divided by the capture time (which could be significantly greater than 100%). The valid range for this value is [0%, N*100%], where N is the number of threads sharing access to the context.

Launches: Links to the CUDA Launches and CUDA Launch Summary reports.
  # Launches – the number of kernels launched on the context.
  % GPU Time – the percentage of the capture time during which a kernel was executing on the device.

Memory Copies:
  # Copies – the number of actual memory copy operations.
  # Bytes – the number of bytes transferred by CUDA Driver API cuMemcpy calls.
  % Time – the percentage of the capture time during which memory copies were occurring.
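
To illustrate the Context 0 convention, the following minimal sketch (assuming the standard Driver API header; it is not Nsight code) makes one driver call before any context is current and further calls after cuCtxCreate. Based on the description above, the early call would be attributed to Context 0 and the later ones to the newly created context:

    #include <cuda.h>

    int main(void)
    {
        cuInit(0);

        CUdevice device;
        cuDeviceGet(&device, 0);

        // No CUcontext is current on this thread yet, so a call such as
        // cuDeviceGetName at this point is attributed to Context 0
        // in the CUDA Contexts table.
        char name[256];
        cuDeviceGetName(name, (int)sizeof(name), device);

        // After cuCtxCreate, the new context is current on this thread,
        // and subsequent driver calls are counted against it.
        CUcontext context;
        cuCtxCreate(&context, 0, device);

        size_t freeMem, totalMem;
        cuMemGetInfo(&freeMem, &totalMem);   // attributed to the new context

        cuCtxDestroy(context);
        return 0;
    }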

Top Device Functions by Total Time

This overview table lists the top 10 kernels by time spent executing on the GPU.

CUDA Driver API Call Summary

The CUDA Driver API Call Summary lists statistics for calls made to the CUDA Driver API, including the number of times each API call was made, the percentage of the overall capture time (per thread) spent in the call, the number of errors returned, and statistics on the elapsed time of each call.

The color gradient in some cells is used to display the values of column entries relative to one another. For most columns, the gradient represents the percentage of the current cell's value relative to the maximum value in that column. However, columns containing percentage values use a fixed maximum of 100%, even if no entry reaches 100%.

This color gradient allows you to quickly identify outliers in the data without having to scan through all the numbers and mentally compare them to one another. It is also very helpful when sorting the grid by different columns: seeing where the large gradients (the outliers in other columns) move can be very informative.

CUDA Launches

The CUDA Launches report shows every CUDA kernel that was launched during your program's execution. Each row shows the execution time and work size data for a single launch.
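
As an illustration only, the sketch below launches a trivial kernel (the kernel name and sizes are hypothetical, not part of the product). The grid and block dimensions passed at launch are the kind of work size data each row of this report records, alongside the launch's start time and duration:

    #include <cuda_runtime.h>

    // A trivial kernel used only to illustrate what a launch row records.
    __global__ void scaleKernel(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *d_data = NULL;
        cudaMalloc(&d_data, n * sizeof(float));

        // The grid and block dimensions below are the work size values a
        // CUDA Launches row would show for this kernel.
        dim3 block(256);
        dim3 grid((n + block.x - 1) / block.x);
        scaleKernel<<<grid, block>>>(d_data, 2.0f, n);

        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }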

CUDA Memory Copies

The CUDA Memory Copies report shows every copy command executed in your program, including the start time, duration, number of bytes copied, and transfer rate of each copy. Memory copy commands can be performance-limiting, especially when a copy transfers data across the PCI-e bus from your CPU to your GPU or from your GPU to your CPU.
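
For illustration, the minimal Driver API sketch below (not Nsight code) performs one host-to-device and one device-to-host copy; each call would appear as a separate row in this report, with its start time, duration, byte count, and rate:

    #include <cuda.h>
    #include <stdlib.h>

    int main(void)
    {
        cuInit(0);

        CUdevice device;
        CUcontext context;
        cuDeviceGet(&device, 0);
        cuCtxCreate(&context, 0, device);

        const size_t bytes = 4 * 1024 * 1024;   // 4 MB per direction
        float *hostBuffer = (float *)malloc(bytes);

        CUdeviceptr deviceBuffer;
        cuMemAlloc(&deviceBuffer, bytes);

        // Each of these copies crosses the PCI-e bus and would appear as a
        // separate row in the CUDA Memory Copies report.
        cuMemcpyHtoD(deviceBuffer, hostBuffer, bytes);   // host-to-device row
        cuMemcpyDtoH(hostBuffer, deviceBuffer, bytes);   // device-to-host row

        cuMemFree(deviceBuffer);
        free(hostBuffer);
        cuCtxDestroy(context);
        return 0;
    }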

