
CUDA Concurrent Kernel Trace Mode

NVIDIA® Nsight™ Development Platform, Visual Studio Edition 4.6 User Guide

Some devices of compute capability 2.x can execute multiple kernels concurrently. Applications may query this capability by checking the concurrentKernels device property, which is equal to 1 for devices that support it. (As an alternative, developers may refer to the NVIDIA Nsight CUDA Devices analysis report to verify the capabilities of the graphics cards available on the target system.)
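The property check described above can be sketched with the CUDA runtime API as follows. This is a minimal illustration, not code from NVIDIA Nsight: the helper name supportsConcurrentKernels and its return convention are assumptions made for this example, while cudaGetDeviceCount, cudaGetDeviceProperties, and the concurrentKernels field of cudaDeviceProp are part of the CUDA runtime.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA runtime or Nsight).
// Returns 1 if the device reports concurrent kernel support, 0 if it
// does not, and -1 if the device properties could not be read.
int supportsConcurrentKernels(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return prop.concurrentKernels ? 1 : 0;
}

int main()
{
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
        std::printf("No CUDA devices found.\n");
        return 0;
    }

    // Report the capability for every device visible to the runtime.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            continue;
        }
        std::printf("Device %d (%s, compute capability %d.%d): concurrentKernels = %d\n",
                    dev, prop.name, prop.major, prop.minor,
                    supportsConcurrentKernels(dev));
    }
    return 0;
}
```

Compiled with nvcc, the program prints one line per device. A value of 1 indicates that the device can execute multiple kernels concurrently.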

NVIDIA Nsight 4.6 includes support for tracing concurrent kernel execution on NVIDIA graphics cards built on the Fermi architecture. In older versions of NVIDIA Nsight, analysis captures always serialized all kernel launches, forcing them to be executed one at a time. With the new concurrent kernel trace mode, the runtime behavior of the target application with respect to the concurrent kernel execution is maintained, and all kernel start and end times are captured without forcing the kernels to be executed one at a time.

On NVIDIA graphics cards built on the Fermi architecture, the new concurrent trace mode is enabled by default. A user may override that default behavior by changing the analysis setting of the NVIDIA Nsight options. On NVIDIA GPUs built on the Tesla architecture, the serialized capture mode is always used, regardless of the configuration specified in the NVIDIA Nsight options.

A few notes on concurrent kernel trace mode:

The maximum number of kernel launches that a device can execute concurrently is sixteen.
A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context.
Kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels.
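To make these notes concrete, the following sketch launches several small kernels from a single CUDA context, each in its own stream, which is the usual way to give the device an opportunity to overlap them. It is an illustration under stated assumptions rather than the SDK sample itself: spinKernel, launchInStreams, the choice of eight kernels, and the cycle count are all hypothetical names and values picked for this example. Whether the kernels actually overlap depends on the device and on the resources each kernel uses.

```cuda
#include <cstdio>
#include <ctime>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical busy-wait kernel: spins until roughly `cycles` clock ticks pass.
__global__ void spinKernel(clock_t cycles, clock_t *elapsed)
{
    clock_t start = clock();
    clock_t delta = 0;
    while (delta < cycles) {
        delta = clock() - start;
    }
    *elapsed = delta;  // store a result so the loop has an observable effect
}

// Hypothetical helper: launches `numKernels` kernels, one per stream, and
// reports the total elapsed time in milliseconds. Returns 0 on success and
// -1 on any failure (including numKernels <= 0).
int launchInStreams(int numKernels, clock_t cycles, float *elapsedMs)
{
    if (numKernels <= 0) {
        return -1;
    }

    clock_t *dElapsed = NULL;
    if (cudaMalloc((void **)&dElapsed, numKernels * sizeof(clock_t)) != cudaSuccess) {
        return -1;
    }

    // All streams belong to the same context, so their kernels may overlap.
    std::vector<cudaStream_t> streams(numKernels);
    for (int i = 0; i < numKernels; ++i) {
        cudaStreamCreate(&streams[i]);
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < numKernels; ++i) {
        spinKernel<<<1, 1, 0, streams[i]>>>(cycles, dElapsed + i);
    }
    cudaError_t err = cudaDeviceSynchronize();  // wait for every stream
    cudaEventRecord(stop, 0);
    if (err == cudaSuccess) err = cudaEventSynchronize(stop);
    if (err == cudaSuccess) err = cudaEventElapsedTime(elapsedMs, start, stop);

    // Release all resources regardless of the outcome.
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    for (int i = 0; i < numKernels; ++i) {
        cudaStreamDestroy(streams[i]);
    }
    cudaFree(dElapsed);

    return err == cudaSuccess ? 0 : -1;
}

int main()
{
    float ms = 0.0f;
    // Eight kernels stays below the limit of sixteen concurrent launches.
    if (launchInStreams(8, 1000000, &ms) != 0) {
        std::printf("Launch failed or no CUDA device available.\n");
        return 0;
    }
    std::printf("8 kernels in 8 streams took %.3f ms in total\n", ms);
    return 0;
}
```

As a rough guide, if the kernels overlap, the total time should be close to the duration of a single kernel. If they are serialized, it should be closer to eight times that duration.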

Note that CUDA concurrent kernel trace mode is available only for cards built on the Fermi architecture.

Concurrent versus Serialized Mode

Here's an example of how concurrent kernel execution appears in the timeline report for the concurrentKernels SDK sample that ships with the NVIDIA GPU Computing SDK. All eight kernel launches are executed in parallel on the GPU.

By contrast, in serialized mode, it's easy to see that all kernel launches are forced to be processed one at a time, causing a significantly different runtime behavior.