Instruction Count

NVIDIA® Nsight™ Development Platform, Visual Studio Edition 4.7 User Guide

Overview

The Instruction Count source-level experiment identifies instructions that do not execute for all threads in the warp. If the histogram options are enabled, the experiment also provides histograms of the active mask and of predication.

Background

CUDA devices operate most efficiently when all threads in a warp are enabled. There are two reasons threads within a warp can be disabled: being inactive, and being predicated off. If the block size is not a multiple of the warp size, the last warp in the block will have inactive threads. When some threads within a warp exit the kernel while others continue, the exiting threads become inactive. Threads become predicated off when divergent branches occur, because the separate paths taken by the threads must be serialized, and threads are disabled for paths they do not take. For each instruction executed, the ratio of threads enabled in the warp to full warp size is called control flow efficiency, and the goal should be to achieve 100%. The Efficiency chart in the kernel-level Branch Statistics experiment shows control flow efficiency averaged over the duration of the kernel.
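The effect of block size on inactive threads can be worked through with a little arithmetic. The following is a minimal sketch, not part of the tool: the function names are hypothetical, WARP_SIZE is assumed to be 32, and the block is assumed to be one-dimensional so that threads fill warps in order.

```python
WARP_SIZE = 32  # assumed warp size; the guide refers to it as WARP SIZE


def warps_per_block(block_size: int) -> int:
    """Number of warps needed to hold block_size threads (rounded up)."""
    return (block_size + WARP_SIZE - 1) // WARP_SIZE


def inactive_threads_in_last_warp(block_size: int) -> int:
    """Thread slots left inactive in the last warp of the block."""
    return (-block_size) % WARP_SIZE


def control_flow_efficiency(enabled_threads: int) -> float:
    """Ratio of enabled threads to full warp size for one executed instruction."""
    return enabled_threads / WARP_SIZE


def best_case_block_efficiency(block_size: int) -> float:
    """Upper bound on efficiency for one block when no thread exits early
    and no branch diverges: only the partially filled last warp costs anything."""
    return block_size / (warps_per_block(block_size) * WARP_SIZE)
```

Under these assumptions, a block of 100 threads occupies 4 warps and leaves 28 slots inactive in the last one, so it cannot exceed about 78% efficiency, while a block of 128 threads can reach 100%.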

Histograms

This experiment includes options to capture histograms for the number of active threads and for the number of threads not predicated off. The options must be enabled in the activity page's configuration, which is only accessible when using the Custom options for configuring experiments. Histogram bins with a non-zero count have gray bars, and the highest bin has a red bar.

Data Table

Columns

Instructions Executed

Total executed instructions (any semantics per warp), regardless of predicate or condition code.

Thread Instructions Executed

Total executed instructions (per thread), regardless of predicate or condition code.

Thread Instructions Executed Not Predicated Off

Total executed instructions (per thread), with predicate and condition code evaluating to true.

Thread Execution Efficiency

Percentage of threads in the warp that executed the instruction. 100% means all threads in a warp executed the instruction. Less than 100% means threads were inactive due to a suboptimal launch configuration or early return, or were predicated off due to control flow divergence.
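The guide does not state the formula behind this column. The sketch below shows one plausible way the percentage relates to the counters above, on the assumption that it is the number of thread instructions not predicated off divided by the number of thread slots available (Instructions Executed times WARP SIZE). The function name and the default warp size of 32 are illustrative only.

```python
def thread_execution_efficiency(instructions_executed: int,
                                thread_instructions_not_predicated_off: int,
                                warp_size: int = 32) -> float:
    """Assumed formula: enabled thread executions as a percentage of all
    thread slots offered by the warp-level executions of the instruction."""
    if instructions_executed == 0:
        return 0.0  # instruction never executed; nothing to measure
    slots = instructions_executed * warp_size
    return 100.0 * thread_instructions_not_predicated_off / slots
```

For example, an instruction executed 10 times by a warp, with 160 thread executions not predicated off, would score 50% under this assumption.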

Active Mask

Distribution of active threads per instruction executed. There are bins for 1 to 32, since at minimum one thread must be active for the warp to remain active, and the maximum is the full WARP SIZE of 32. An optimal histogram has counts only in the bins at or close to WARP SIZE.

Predicates

Distribution of active threads that are not predicated off per instruction executed. There are bins for 0 to 32, since at minimum all threads can be predicated off, and at maximum all 32 threads in the warp are enabled (not predicated off). The histogram is only displayed for instructions with predicates, which can be seen by enabling the SASS Source column. An optimal histogram has counts only in the bins at or close to WARP SIZE, although it is common to see single instructions or short sequences of instructions fully predicated off (a count in the zero bin). This is due to how the compiler implements control flow, using branches (for long divergences) or predication (for short divergences, where avoiding the overhead of a branch is preferable).
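To make the binning of the two histograms concrete, the sketch below builds them from a list of per-execution masks. It is an illustration only: the tool collects this data itself, and the sample format (a pair of 32-bit integers, one bit per thread) and the function names are assumptions.

```python
WARP_SIZE = 32  # assumed warp size


def popcount(mask: int) -> int:
    """Number of set bits in the low 32 bits of mask."""
    return bin(mask & 0xFFFFFFFF).count("1")


def build_histograms(samples):
    """samples: iterable of (active_mask, predicate_mask) pairs, one pair per
    warp-level execution of a single instruction.

    Returns (active_hist, predicate_hist), each a list indexed by thread count.
    Index 0 of active_hist is never used, because a warp with no active
    threads does not execute instructions."""
    active_hist = [0] * (WARP_SIZE + 1)
    predicate_hist = [0] * (WARP_SIZE + 1)
    for active_mask, predicate_mask in samples:
        active = popcount(active_mask)
        if active == 0:
            raise ValueError("a warp needs at least one active thread")
        active_hist[active] += 1
        # Only active threads can count as "not predicated off".
        predicate_hist[popcount(active_mask & predicate_mask)] += 1
    return active_hist, predicate_hist
```

A fully active, fully enabled execution adds one count to bin 32 of both histograms; an execution whose active threads are all predicated off adds one count to bin 0 of the Predicates histogram.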

Analysis

If an instruction has a low percentage for Thread Execution Efficiency, determine whether this is because not all threads are active, because some threads are predicated off, or both. If all threads are active each time the instruction executes, then "Thread Instructions Executed" should be WARP SIZE times "Instructions Executed". If "Thread Instructions Executed" is less, then efficiency is low due to inactive threads. If an instruction has a lower value for "Thread Instructions Executed Not Predicated Off" than "Thread Instructions Executed", then efficiency is low due to divergent branches.

If efficiency is low due to inactive threads, ensure the number of threads per block is a multiple of WARP SIZE, and avoid letting some threads in a warp exit the kernel before other threads in the same warp.

If efficiency is low due to divergent branches, try to avoid divergent branches by ensuring decisions for whether or not to branch only differ at warp boundaries. For example, if WARP SIZE is 32 and a block has 64 threads, there is no divergence if threads 0-31 (warp 0) take a branch, and threads 32-63 (warp 1) do not take the branch. But there is divergence if some of the threads in warp 0 take the branch and other threads in warp 0 do not. Use the Divergent Branches experiment to find divergent branches and eliminate them.
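The comparisons described in this section can be written down as a small host-side check. The sketch below is illustrative only. The function names are hypothetical, the counter values are assumed to be read by hand from the data table, and the block is assumed to be one-dimensional so that consecutive thread indices fill warps in order.

```python
def diagnose(instructions_executed: int,
             thread_instructions_executed: int,
             thread_instructions_not_predicated_off: int,
             warp_size: int = 32) -> list:
    """Return the reasons, if any, why an instruction ran below full efficiency."""
    causes = []
    # With every thread active, each warp-level execution contributes warp_size.
    if thread_instructions_executed < warp_size * instructions_executed:
        causes.append("inactive threads")
    if thread_instructions_not_predicated_off < thread_instructions_executed:
        causes.append("divergent branches (threads predicated off)")
    return causes


def divergent_warps(block_size: int, takes_branch, warp_size: int = 32) -> list:
    """Indices of warps whose threads disagree on a branch.

    takes_branch is a function from thread index to True/False that models
    the branch condition for one block."""
    result = []
    for start in range(0, block_size, warp_size):
        end = min(start + warp_size, block_size)
        decisions = {bool(takes_branch(t)) for t in range(start, end)}
        if len(decisions) > 1:  # both outcomes occur inside one warp
            result.append(start // warp_size)
    return result
```

With a 64-thread block, a condition such as `lambda t: t < 32` differs only at the warp boundary and yields no divergent warps, whereas `lambda t: t % 2 == 0` splits every warp and reports warps 0 and 1.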