Issue Efficiency

NVIDIA® Nsight™ Development Platform, Visual Studio Edition 4.0 User Guide

Overview

The Issue Efficiency experiment provides information about the device's ability to issue instructions. The key question it answers is whether the device was able to issue instructions on every cycle. Failing to do so inevitably lowers the attainable peak performance of the kernel. Understanding the root cause of not being able to issue instructions back-to-back is a good high-level triage step when analyzing a kernel's performance.

Background

On devices of compute capability 2.0 and higher, each Streaming Multiprocessor (SM) features multiple warp schedulers. Each warp scheduler manages a fixed, hardware-defined maximum number of warps. This defines the Device Limit of warps per SM, the upper bound on how many warps can be resident at once on each SM. The execution configuration of a kernel launch can further limit the number of resident warps due to resource constraints (see ... for more details). The resulting Theoretical Occupancy is the upper limit of warp parallelism that the kernel can reach on an SM. Up to this point, everything discussed is statically defined by the target device, the kernel's source code, and the launch parameters. At runtime, the number of warps allocated to a multiprocessor in a given cycle is referred to as Active Warps. In the best case, the average number of active warps across the kernel execution is equal to or very close to the theoretical occupancy. However, if the scheduler fails to balance the workload evenly across the warp schedulers, or if there is simply no work left to fill up the machine, the actual number of active warps can be significantly lower than the theoretical occupancy. The discussed metrics relate to each other by:

Device Limit >= Theoretical Occupancy >= Active Warps

The set of active warps can be thought of as the pool of candidates for making forward progress at a given moment at runtime. However, not all active warps are necessarily able to issue their next instruction. For example, a warp may be sitting at a barrier waiting for other warps to catch up, it might not have fetched its next instruction yet, or it might be waiting for results from previous instructions to become available. Separating active warps by their ability to issue their next instruction leads to the subset of Stalled Warps, which are not able to make forward progress, and the subset of Eligible Warps, which are ready to issue their next instruction. In more detail, an active warp is considered eligible if its next instruction has been fetched, the execution unit required by the instruction is available, and the instruction has no unmet dependencies. By definition, these metrics relate to each other by:

Active Warps == Stalled Warps + Eligible Warps
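The two relations above can be checked mechanically. The following Python sketch uses made-up per-SM numbers; the class `WarpSnapshot` and its field names are illustrative only and are not part of any Nsight API.

```python
from dataclasses import dataclass


@dataclass
class WarpSnapshot:
    """Hypothetical per-SM warp counts for a single cycle (illustrative only)."""
    device_limit: int           # hardware maximum of resident warps per SM
    theoretical_occupancy: int  # limit imposed by the launch configuration
    stalled: int                # active warps that cannot issue this cycle
    eligible: int               # active warps ready to issue this cycle

    @property
    def active(self) -> int:
        # Active Warps == Stalled Warps + Eligible Warps
        return self.stalled + self.eligible

    def is_consistent(self) -> bool:
        # Device Limit >= Theoretical Occupancy >= Active Warps
        return self.device_limit >= self.theoretical_occupancy >= self.active


# Example values are assumptions, not measurements.
snapshot = WarpSnapshot(device_limit=64, theoretical_occupancy=48, stalled=37, eligible=5)
print(snapshot.active, snapshot.is_consistent())
```

A snapshot whose stalled and eligible counts add up to more than the theoretical occupancy would violate the first relation and indicates inconsistent input data.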

At every cycle each warp scheduler will pick an eligible warp and issue one or more instructions. If no eligible warp is available to a scheduler in a given cycle, the opportunity to make forward progress is lost and processing units might go unutilized. As each scheduler picks either no warp or exactly one warp, the Selected Warps per multiprocessor per cycle is bound to the range of [0 .. # Warp Schedulers per SM]. To guarantee that each scheduler can issue an instruction every cycle, at least one eligible warp must be available per warp scheduler per cycle. In the context of the SM, this means having at least as many eligible warps as warp schedulers, distributed so that no scheduler is left without one.
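As a minimal sketch of the selection rule described above, the following Python function counts the selected warps for one cycle. The function name and the example counts are assumptions made for illustration.

```python
def selected_warps(eligible_per_scheduler):
    """Count warps selected in one cycle.

    Each scheduler picks exactly one warp if it has any eligible warp,
    and none otherwise, so the result lies in [0, number of schedulers].
    """
    return sum(1 for eligible in eligible_per_scheduler if eligible > 0)


# Hypothetical SM with four warp schedulers; the SM-wide total of
# eligible warps is 4, yet two schedulers have nothing to issue.
print(selected_warps([3, 0, 1, 0]))
print(selected_warps([1, 1, 1, 1]))
```

The first call shows why the per-scheduler view matters: the SM has as many eligible warps as schedulers, but because they are unevenly distributed only two schedulers can issue.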

Charts

Warps Per SM

Provides an overview of the warps per SM metrics discussed in the background section of this document. The metrics are reported as average values across the complete kernel execution for each individual SM of the target device. The y-axis is scaled to the device limit.

Even if a sufficient number of eligible warps is reported on average, this does not guarantee that every warp scheduler was able to issue an instruction on every cycle.
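The caveat about averages can be illustrated with two invented per-cycle traces for a single warp scheduler. This is a sketch with made-up numbers; `summarize` is a hypothetical helper, not a tool function.

```python
def summarize(eligible_per_cycle):
    """Return (average eligible warps, fraction of cycles with none eligible)."""
    cycles = len(eligible_per_cycle)
    average = sum(eligible_per_cycle) / cycles
    idle = sum(1 for eligible in eligible_per_cycle if eligible == 0)
    return average, idle / cycles


bursty = [4, 0, 4, 0, 4, 0, 4, 0]  # eligible warps arrive in bursts
steady = [2, 2, 2, 2, 2, 2, 2, 2]  # same average, evenly spread

print(summarize(bursty))
print(summarize(steady))
```

Both traces report the same average of two eligible warps, but the bursty one leaves the scheduler without work on half of the cycles.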

Metrics

Active Warps: A warp is active from the time it is scheduled on a multiprocessor until it completes its last instruction. Each warp scheduler maintains its own list of assigned active warps. This assignment of warps to schedulers is done once, at the time a warp becomes active, and is valid for the lifetime of the warp. As a rough guideline, aim for at least eight active warps per warp scheduler. More active warps might allow hiding warp latencies more efficiently.

Eligible Warps: An active warp is considered eligible if it is able to issue its next instruction. Each warp scheduler selects the next warp to issue an instruction from the pool of eligible warps. Warps that are not eligible report an Issue Stall Reason. The target is to have at least one eligible warp per scheduler per cycle.

Theoretical Occupancy: The theoretical occupancy acts as the upper limit to active warps, and consequently also to eligible warps, per SM. It is defined by the execution configuration of the kernel launch. More detailed information is discussed in the context of the Achieved Occupancy experiment.
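To make the dependence on the execution configuration concrete, here is a simplified Python sketch of how a theoretical warp count could be derived. All device limits used as defaults (`max_warps`, `max_blocks`, `regs_per_sm`, `smem_per_sm`) are assumed example values, and the model deliberately ignores allocation granularity, so treat the results as approximate and use the Occupancy experiment for actual numbers.

```python
import math


def theoretical_warps(threads_per_block, regs_per_thread, smem_per_block, *,
                      max_warps=64, max_blocks=16, regs_per_sm=65536,
                      smem_per_sm=49152, warp_size=32):
    """Estimate resident warps per SM; default limits are assumptions."""
    warps_per_block = math.ceil(threads_per_block / warp_size)
    # Each resource yields its own bound on resident blocks per SM.
    limits = [max_blocks, max_warps // warps_per_block]
    if regs_per_thread > 0:
        limits.append(regs_per_sm // (regs_per_thread * warps_per_block * warp_size))
    if smem_per_block > 0:
        limits.append(smem_per_sm // smem_per_block)
    # The tightest bound decides how many blocks, and thus warps, fit.
    return min(limits) * warps_per_block


print(theoretical_warps(256, 64, 0))      # register limited
print(theoretical_warps(256, 32, 0))      # reaches the assumed device limit
print(theoretical_warps(256, 32, 16384))  # shared memory limited
```

Under these assumptions, halving the register count per thread doubles the estimate, which is the effect that experimenting with launch bounds aims for.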

Warp Issue Efficiency

Distribution of the availability of eligible warps per cycle across the GPU. The values are reported as the sum across all warp schedulers for the duration of the kernel execution.

On devices with compute capability 2.x, the chart exposes three metrics. No eligible warps means that neither warp scheduler was able to issue. One eligible denotes that only one warp scheduler had an eligible warp available. Two or more guarantees that both warp schedulers issued an instruction.

Metrics

No Eligible: The number of cycles in which a warp scheduler had no eligible warps to select from and therefore did not issue an instruction. The lower the percentage of cycles with no eligible warp, the more efficiently the code runs on the target device. If this value is high, investigate the Issue Stall Reasons to understand what keeps warps from becoming eligible.

One or More Eligible: The number of cycles in which a warp scheduler had at least one eligible warp to select from. This metric is equal to the total number of cycles in which an instruction was issued, summed across all warp schedulers. Higher values are better, with a target of getting close to 100%.
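The percentages shown in the chart follow directly from the two counters. The sketch below assumes that the two counters together cover every scheduler cycle of the kernel; the function name and the sample counts are hypothetical.

```python
def issue_efficiency(no_eligible_cycles, one_or_more_eligible_cycles):
    """Return (percent No Eligible, percent One or More Eligible).

    Assumes both counters are summed across all warp schedulers and
    together account for all scheduler cycles.
    """
    total = no_eligible_cycles + one_or_more_eligible_cycles
    if total == 0:
        raise ValueError("no cycles recorded")
    return (100.0 * no_eligible_cycles / total,
            100.0 * one_or_more_eligible_cycles / total)


print(issue_efficiency(250, 750))
```

In this invented example a quarter of all scheduler cycles passed without an instruction being issued, which would justify a look at the issue stall reasons.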

Issue Stall Reasons

The issue stall reasons capture why an active warp is not eligible. On devices of compute capability 3.0 and higher, every stalled warp increments its most critical stall reason by one on every cycle. The sum of the stall reasons hence increments per multiprocessor per cycle by a value between zero (if all warps are eligible) and the number of active warps (if all warps are stalled). The stall reason counters are updated for all stalled warps, regardless of whether the warp scheduler was able to issue an instruction in that cycle. This is in contrast to devices of compute capability 2.x, which only update the issue stall reasons on cycles in which the warp scheduler was unable to issue an instruction.
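The difference between the two counting rules can be sketched with a toy model of a single warp scheduler. The trace below is invented, `count_stalls` is a hypothetical helper, and the model assumes that on compute capability 2.x every stalled warp is counted on a cycle without an issue, a detail the description above does not spell out.

```python
from collections import Counter


def count_stalls(trace, only_when_no_issue):
    """Accumulate stall reasons over a trace of cycles.

    Each cycle is a list with one entry per active warp: None if the
    warp is eligible, otherwise the name of its stall reason.
    """
    counters = Counter()
    for cycle in trace:
        stalled = [reason for reason in cycle if reason is not None]
        issued = len(stalled) < len(cycle)  # some warp was eligible
        if only_when_no_issue and issued:
            continue  # 2.x-style rule: skip cycles with an issue
        counters.update(stalled)
    return counters


trace = [
    ["Execution Dependency", None, "Synchronization"],
    ["Execution Dependency", "Execution Dependency", "Synchronization"],
    [None, None, "Texture"],
]

print(count_stalls(trace, only_when_no_issue=False))  # 3.0+ style
print(count_stalls(trace, only_when_no_issue=True))   # 2.x style
```

Under the first rule the Texture stall is recorded even though the scheduler issued in that cycle, which is why stall reasons on newer devices should be read together with the Warp Issue Efficiency chart.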

Investigating issue stall reasons is actionable only if the Warp Issue Efficiency shows cycles with no eligible warps. Reducing issue stalls for a kernel that never fails to issue a new instruction on all warp schedulers will not increase overall execution performance.

Metrics

Instruction Fetch: A warp reports this reason if the fetch unit has not returned the next instruction for the warp. This stall reason is often high when launching a kernel with a high number of active warps, for kernels with highly divergent flow control per warp, or when executing warps that frequently jump to distant code blocks.

Execution Dependency: A warp reports this reason if an input dependency is not yet available. This includes waiting for results from any global/local/shared memory access or any low-latency operation. If this value is high, check the efficiency of the memory accesses with the memory experiments and try to reduce the number of transactions per request and the number of L1/L2 cache misses wherever possible. This stall reason can also be reduced by increasing the number of independent instructions. Common strategies for the latter include loop unrolling, function inlining, or manually interleaving independent calculations.

Data Requests: A warp reports this reason if a data request cannot be made at the time because the required resources are not available or are fully utilized, or because too many operations of that type are already outstanding. If data requests make up a large portion of the stall reasons, run the memory experiments to determine whether you can optimize existing transactions per request or whether you need to revisit your algorithm.

Texture: A warp reports this reason if the texture sub-system is currently fully utilized or is not able to accept any further requests. If this value is high, check the Memory Statistics - Texture experiment and try to optimize the kernel's texture accesses.

Synchronization: A warp reports this reason if the warp is blocked at a __syncthreads() or a memory barrier. If this reason is large and the kernel execution configuration is limited to a small number of blocks, consider dividing the kernel grid into more thread blocks. This may also indicate highly imbalanced workloads for the warps within a block before hitting a barrier synchronization point. If this applies, try to balance the work more evenly across the warps of each block.

Analysis

If the percentage of cycles with no eligible warp in the Warp Issue Efficiency chart is high …
- … try to increase the number of active warps if possible. In many cases increasing the number of active warps will result in a larger pool of eligible warps. If …
  - … the theoretical occupancy is far from the device limit (maximum y-axis value) in the Warps per SM chart, try to optimize the execution configuration of the kernel launch and also check the Occupancy experiment for further details. If you are register limited, do not rule out experimenting with launch bounds to increase occupancy, even if this results in some register spilling.
  - … the average active warps in the Warps per SM chart is well below the theoretical occupancy, check the Instruction Statistics experiment for highly imbalanced workloads or tail effects. Potential strategies include splitting the kernel grid at a finer granularity, distributing work across the blocks in a more balanced way, and avoiding or parallelizing costly reduce operations that prepare or gather the final result on a single block, warp, or thread.
  - … check the Pipe Utilization experiment to see if a particular pipeline is already fully utilized. In this case increasing active warps is unlikely to result in more eligible warps, as any newly added active warp will stall when trying to access the oversubscribed pipeline. Instead, try to reduce the load on this pipeline or investigate whether the expected peak performance for the target hardware is already reached.
- … try to resolve the most frequent stall reasons. If the highest percentage of the stall reasons are …
  - … execution dependencies, run the memory experiments to look for potential optimizations of slow memory accesses. Also check your instruction mix for a high share of low-throughput instructions, such as double-precision instructions, and verify that you actually need these low-throughput operations. Whenever acceptable, use the faster math intrinsics or switch to single-precision floating point. Use the compiler option --use_fast_math to quickly check the potential effect of these optimization approaches.
  - … texture stalls, verify your texture accesses with the Memory Statistics - Texture experiment and look for potential further optimizations. Also check the Pipe Utilization experiment to verify that you did not already hit the maximum throughput of the texture pipeline.
  - … barrier stalls, try to balance the execution workload of the warps within each block evenly before a barrier, or consider whether the algorithm can be implemented using fewer barriers.
If the percentage of cycles with no eligible warp in the Warp Issue Efficiency chart is close to zero …
- … verify whether the kernel performance is close to the expected peak performance for the instruction throughput or the memory throughput.
- … and the average eligible warps in the Warps per SM chart is well above the number of warp schedulers per multiprocessor, a potential advanced optimization strategy is to merge the workloads of two or more threads into one thread. If the remaining grid is still large enough to fill the whole device, this approach can result in more favorable execution times, even if overall occupancy, active warps, and eligible warps are reduced.
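The decision steps above can be condensed into a small triage helper. The following Python sketch is only one possible encoding: the thresholds `high_pct`, `low_pct`, and `gap`, as well as the reading of "well above" as twice the scheduler count, are arbitrary assumptions, and the function is not part of the tool.

```python
def triage(no_eligible_pct, device_limit, theoretical, active, eligible,
           schedulers, top_stall=None, high_pct=20.0, low_pct=2.0, gap=0.75):
    """Return follow-up suggestions; all thresholds are illustrative."""
    hints = []
    if no_eligible_pct >= high_pct:
        if theoretical < gap * device_limit:
            hints.append("tune the execution configuration (Occupancy experiment)")
        if active < gap * theoretical:
            hints.append("look for imbalanced workloads or tail effects")
        hints.append("check Pipe Utilization for a saturated pipeline")
        follow_up = {
            "Execution Dependency": "run the memory experiments, review the instruction mix",
            "Texture": "review texture accesses (Memory Statistics - Texture)",
            "Synchronization": "balance work before barriers or use fewer barriers",
        }
        if top_stall in follow_up:
            hints.append(follow_up[top_stall])
    elif no_eligible_pct <= low_pct:
        hints.append("compare kernel performance with the expected peak throughput")
        if eligible >= 2 * schedulers:  # "well above" is an assumed factor of 2
            hints.append("consider merging the work of several threads into one")
    return hints


for hint in triage(35.0, 64, 32, 30, 1.2, 4, top_stall="Synchronization"):
    print(hint)
```

Values between the two percentage thresholds return no suggestion, which reflects that the guidance above only addresses the clearly high and the near-zero cases.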
