Issue Efficiency

NVIDIA® Nsight™ Development Platform, Visual Studio Edition 4.7 User Guide

Overview

The Issue Efficiency experiment provides information about the device's ability to issue instructions. The key takeaway is whether the device was able to issue instructions every cycle. Not being able to do so inevitably lowers the achievable peak performance of the kernel. Understanding why instructions could not be issued back-to-back is a good high-level triage step when analyzing a kernel's performance.

Background

For devices of compute capability 2.0 and higher, the Streaming Multiprocessors (SMs) feature multiple warp schedulers. Each warp scheduler manages a fixed, hardware-defined maximum number of warps. This defines the Device Limit of warps per SM, the upper bound on how many warps can be resident at once on each SM. The execution configuration of a kernel launch can further limit the number of resident warps due to resource constraints (see ... for more details). The resulting Theoretical Occupancy is the upper limit of warp parallelism on an SM for that launch. Everything discussed so far is statically defined by the target device, the kernel's source code, and the launch parameters. At runtime, the number of warps allocated to a multiprocessor in a given cycle is referred to as Active Warps. In the best case, the average number of active warps across the kernel execution is equal or very close to the theoretical occupancy. However, if the scheduler fails to balance the workload evenly across the warp schedulers, or there is simply no work left to issue to fill up the machine, the actual number of active warps can be significantly lower than the theoretical occupancy. The metrics relate to each other as follows:

Device Limit >= Theoretical Occupancy >= Active Warps
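The relation above can be illustrated with a small host-side calculation. This is a minimal sketch, not the tool's own computation: the function name, the device limit of 64 warps per SM, the resource-imposed block limit, and the measured average are all assumed values chosen for illustration.

```python
import math

def theoretical_warps_per_sm(threads_per_block, blocks_per_sm_by_resources,
                             device_warp_limit, warp_size=32):
    """Upper bound of resident warps per SM for one launch configuration.

    blocks_per_sm_by_resources is the number of blocks that registers and
    shared memory would allow per SM (an input here, not computed).
    """
    warps_per_block = math.ceil(threads_per_block / warp_size)
    # Whole blocks only: a block is resident entirely or not at all.
    blocks_by_warp_limit = device_warp_limit // warps_per_block
    resident_blocks = min(blocks_per_sm_by_resources, blocks_by_warp_limit)
    return resident_blocks * warps_per_block

# Assumed example values (hypothetical device and kernel).
device_limit = 64          # maximum resident warps per SM
theoretical = theoretical_warps_per_sm(256, 5, device_limit)
measured_active = 31.5     # hypothetical average active warps from a profile

print(device_limit, theoretical, measured_active)
print(device_limit >= theoretical >= measured_active)
```

With these assumed inputs, the launch configuration caps the SM below the device limit, and the measured average sits below that cap.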

The set of active warps can be thought of as the pool of candidates for making forward progress at a given moment at runtime. However, not all active warps are necessarily able to issue their next instruction. For example, a warp may be sitting at a barrier waiting for other warps to catch up, it may not have fetched its next instruction yet, or it may be waiting for results from previous instructions to become available. Separating active warps by their ability to issue their next instruction leads to the subset of Stalled Warps, which are not able to make forward progress, and the subset of Eligible Warps, which are ready to issue their next instruction. In more detail, an active warp is considered eligible if its next instruction has been fetched, the execution unit required by the instruction is available, and the instruction has no unmet dependencies. By definition, the relation between these metrics is:

Active Warps == Stalled Warps + Eligible Warps
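The eligibility rule and the resulting partition can be sketched as follows. The `Warp` class and its field names are hypothetical and only mirror the three conditions named in the text; real hardware tracks this state internally.

```python
from dataclasses import dataclass

@dataclass
class Warp:
    instruction_fetched: bool
    execution_unit_available: bool
    dependencies_met: bool

    def is_eligible(self) -> bool:
        # Eligible only if all three conditions from the text hold.
        return (self.instruction_fetched
                and self.execution_unit_available
                and self.dependencies_met)

def split_active_warps(active):
    """Partition active warps into (eligible, stalled)."""
    eligible = [w for w in active if w.is_eligible()]
    stalled = [w for w in active if not w.is_eligible()]
    return eligible, stalled

active_warps = [
    Warp(True, True, True),    # ready to issue
    Warp(False, True, True),   # next instruction not fetched yet
    Warp(True, True, False),   # waiting on an earlier result
]
eligible_warps, stalled_warps = split_active_warps(active_warps)
print(len(active_warps), len(stalled_warps), len(eligible_warps))
```

Every active warp lands in exactly one of the two subsets, which is why the two counts always add up to the number of active warps.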

At every cycle each warp scheduler picks an eligible warp and issues one or more instructions. If no eligible warp is available to a scheduler in a given cycle, the opportunity to make forward progress is lost and processing units might sit idle. As each scheduler picks either no warp or exactly one warp, the Selected Warps per multiprocessor per cycle is bound to the range [0 .. # Warp Schedulers per SM]. To guarantee that each scheduler can issue an instruction every cycle, at least one eligible warp must be available per warp scheduler per cycle. Consequently, the target is at least one eligible warp per scheduler per cycle; in the context of the SM, this means having at least as many eligible warps as warp schedulers.
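A toy model of the selection step makes the bound on Selected Warps concrete. The trace below is invented for illustration and assumes four warp schedulers per SM; each inner list holds the number of eligible warps each scheduler sees in one cycle.

```python
def selected_warps_per_cycle(eligible_per_scheduler):
    """Each scheduler picks either no warp or exactly one warp."""
    return sum(1 for n in eligible_per_scheduler if n > 0)

def lost_issue_slots(trace):
    """Count scheduler-cycles in which no eligible warp was available."""
    return sum(len(cycle) - selected_warps_per_cycle(cycle) for cycle in trace)

# Hypothetical trace: 3 cycles on an SM with 4 warp schedulers.
trace = [
    [2, 1, 0, 3],   # one scheduler has nothing to pick
    [0, 0, 1, 1],   # two schedulers have nothing to pick
    [1, 1, 1, 1],   # every scheduler can issue
]
selected = [selected_warps_per_cycle(cycle) for cycle in trace]
print(selected, lost_issue_slots(trace))
```

Note that a scheduler with three eligible warps still selects only one, so surplus eligible warps on one scheduler do not compensate for an empty pool on another.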

Charts

Warps Per SM

Provides an overview of the various warps-per-SM metrics discussed in the Background section of this document. The metrics are reported as average values across the complete kernel execution for each individual SM of the target device. The y-axis is scaled to the device limit.

Even if a sufficient number of eligible warps is reported on average, this does not guarantee that every warp scheduler was able to issue an instruction in every cycle.
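Why an average can hide missed issue slots is easiest to see with a bursty pattern. The per-cycle counts below are made up and describe a single warp scheduler.

```python
def average_eligible(eligible_per_cycle):
    return sum(eligible_per_cycle) / len(eligible_per_cycle)

def fraction_cycles_without_eligible(eligible_per_cycle):
    empty = sum(1 for n in eligible_per_cycle if n == 0)
    return empty / len(eligible_per_cycle)

# Hypothetical eligible-warp counts for one scheduler over 8 cycles.
bursty = [4, 0, 4, 0, 4, 0, 4, 0]
print(average_eligible(bursty), fraction_cycles_without_eligible(bursty))
```

The average is well above the target of one eligible warp per scheduler, yet the scheduler could not issue in half of the cycles.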

Metrics

Active Warps: A warp is active from the time it is scheduled on a multiprocessor until it completes its last instruction. Each warp scheduler maintains its own list of assigned active warps. This assignment of warps to schedulers is made once, at the time a warp becomes active, and remains valid for the lifetime of the warp. As a rough guideline, aim for at least eight active warps per warp scheduler. More active warps might allow warp latencies to be hidden more efficiently.

Eligible Warps: An active warp is considered eligible if it is able to issue its next instruction. Each warp scheduler selects the next warp to issue an instruction from the pool of eligible warps. Warps that are not eligible report an Issue Stall Reason. The target is to have at least one eligible warp per scheduler per cycle.

Theoretical Occupancy: The theoretical occupancy acts as an upper limit on active warps and consequently also on eligible warps per SM. It is defined by the execution configuration of the kernel launch. More detailed information is provided in the context of the Achieved Occupancy experiment.

Warp Issue Efficiency

Shows the distribution of the availability of eligible warps per cycle across the GPU. The values are reported as a sum across all warp schedulers for the duration of the kernel execution.

On devices with compute capability 2.x, the chart exposes three metrics. No eligible warps means that both warp schedulers failed to issue. One eligible denotes that only one warp scheduler had an eligible warp available. Two or more guarantees that both warp schedulers issued an instruction.

Metrics

No Eligible: The number of cycles in which a warp scheduler had no eligible warps to select from and therefore did not issue an instruction. The lower the percentage of cycles with no eligible warp, the more efficiently the code runs on the target device. If this value is high, investigate the Issue Stall Reasons to understand what keeps warps from becoming eligible.

One or More Eligible: The number of cycles in which a warp scheduler had at least one eligible warp to select from. This metric is equal to the total number of cycles in which an instruction was issued, summed across all warp schedulers. Higher is better, with a target of getting close to 100%.
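The two metrics are complementary shares of the same total, so the percentages shown in the chart can be reproduced from the raw cycle counts. This is a minimal sketch; the function name and the counter values are assumptions, not output from the tool.

```python
def warp_issue_efficiency(no_eligible_cycles, one_or_more_eligible_cycles):
    """Convert scheduler-cycle counts (summed over all schedulers) to percent."""
    total = no_eligible_cycles + one_or_more_eligible_cycles
    if total == 0:
        raise ValueError("no cycles recorded")
    return {
        "no_eligible_pct": 100.0 * no_eligible_cycles / total,
        "one_or_more_eligible_pct": 100.0 * one_or_more_eligible_cycles / total,
    }

# Hypothetical counters for one kernel launch.
result = warp_issue_efficiency(820_000, 180_000)
print(result)
```

A kernel with these assumed counters would be a clear candidate for a closer look at the issue stall reasons.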

Issue Stall Reasons

The issue stall reasons capture why an active warp is not eligible. On devices of compute capability 3.0 and higher, every stalled warp increments its most critical stall reason by one on every cycle. Hence the sum of the stall reason counters increases, per multiprocessor per cycle, by a value between zero (if all warps are eligible) and the number of active warps (if all warps are stalled). The stall reason counters are updated for all stalled warps, regardless of whether the warp scheduler was able to issue an instruction in that cycle. This is in contrast to devices of compute capability 2.x, which only update the issue stall reasons on cycles for which the warp scheduler was unable to issue an instruction.
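The difference between the two counting schemes can be sketched with a simplified model of a single warp scheduler. The trace is invented, and the model for compute capability 2.x assumes that all stalled warps are counted on a non-issuing cycle, which the text does not specify.

```python
from collections import Counter

def accumulate_stall_reasons(cycles, count_on_issuing_cycles=True):
    """cycles: list of (issued, [most critical stall reason per stalled warp]).

    count_on_issuing_cycles=True models compute capability 3.0 and higher;
    False models 2.x, where only non-issuing cycles update the counters.
    """
    counters = Counter()
    for issued, stall_reasons in cycles:
        if count_on_issuing_cycles or not issued:
            counters.update(stall_reasons)
    return counters

# Hypothetical 3-cycle trace for one scheduler.
cycles = [
    (True,  ["memory_dependency", "execution_dependency"]),
    (False, ["memory_dependency", "memory_dependency", "synchronization"]),
    (True,  []),
]
cc3x = accumulate_stall_reasons(cycles, count_on_issuing_cycles=True)
cc2x = accumulate_stall_reasons(cycles, count_on_issuing_cycles=False)
print(dict(cc3x))
print(dict(cc2x))
```

In this model the newer scheme also records the stalls that occurred while another warp was issuing, so the two distributions are not directly comparable across device generations.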

Pipeline Busy — The compute resources required by the instruction are not yet available.
Texture — The texture subsystem is fully utilized, or has too many outstanding requests.
Constant — A constant load is blocked due to a miss in the constants cache.
Instruction Fetch — The next assembly instruction has not yet been fetched.
Memory Throttle — A large number of pending memory operations prevent further forward progress. These can be reduced by combining several memory transactions into one.
Memory Dependency — A load/store cannot be made because the required resources are not available or are fully utilized, or too many requests of a given type are outstanding. Memory dependency stalls can potentially be reduced by optimizing memory alignment and access patterns.
Synchronization — The warp is blocked at a __syncthreads() call.
Execution Dependency — An input required by the instruction is not yet available. Execution dependency stalls can potentially be reduced by increasing instruction-level parallelism.

Analyzing Warp Issue Efficiency Versus Issue Stall Reasons

The Warp Issue Efficiency chart’s "No Eligible" segment shows how frequently issue stalls occur. The Issue Stall Reasons chart shows a distribution of the reasons warps are stalled. If the "No Eligible" segment is small, issue stalls are occurring infrequently, so there is diminishing value in studying the reasons for the stalls. However, if the "No Eligible" segment is large, addressing the most prevalent problems shown in the Issue Stall Reasons chart will yield the most improvement in issue efficiency.

In the illustrated example, first notice that the Warp Issue Efficiency chart shows no warps eligible to issue their next instruction for more than 80% of the clock cycles during the kernel's execution.

A high percentage of issue stalls indicates the Issue Stall Reasons chart should be consulted to determine the causes for the stalls. In this example, the most prominent reason is Memory Dependency, accounting for over half of the stalls. This is a dependency on data being loaded from memory which had not yet arrived, and indicates that performance is being limited due to memory latency.

In this example, the kernel could better utilize the GPU by doing more computation, during which the memory latency could be hidden, or by employing instruction-level parallelism to allow independent instructions to issue between loading and using data.
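The two-step reading of the charts described in this section can be written down as a small triage helper. The 20% threshold, the function name, and the counter values are assumptions made for illustration; the text itself only distinguishes a "small" from a "large" No Eligible segment.

```python
def triage(no_eligible_pct, stall_counts, threshold_pct=20.0, top=2):
    """Return the most prevalent stall reasons as (reason, percent) pairs.

    Returns an empty list when stalls are too infrequent to be worth studying.
    """
    total = sum(stall_counts.values())
    if no_eligible_pct < threshold_pct or total == 0:
        return []
    ranked = sorted(stall_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(reason, 100.0 * count / total) for reason, count in ranked[:top]]

# Hypothetical counters loosely following the example in the text.
stall_counts = {
    "memory_dependency": 550,
    "execution_dependency": 250,
    "synchronization": 120,
    "other": 80,
}
findings = triage(82.0, stall_counts)
print(findings)
```

With these assumed numbers, memory dependency is reported first, which points toward hiding memory latency as the first thing to try.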

Analysis

If the percentage of cycles with no eligible warp in the Warp Issue Efficiency chart is high …
- … try to increase the number of active warps if possible. In many cases increasing the number of active warps will result in a larger pool of eligible warps. If …
  - … the theoretical occupancy is far from the device limit (maximum y-axis value) in the Warps per SM chart, try to optimize the execution configuration of the kernel launch and also check the Occupancy experiment for further details. If you are register limited, do not rule out experimenting with launch bounds to increase occupancy, even if this results in some register spilling.
  - … the average active warps in the Warps per SM chart is well below the theoretical occupancy, check the Instruction Statistics experiment for highly imbalanced workloads or tail effects. Potential strategies include splitting the kernel grid at a finer granularity, distributing work across the blocks in a more balanced way, and avoiding or parallelizing costly reduce operations that prepare or gather the final result on a single block, warp, or thread.
  - … check the Pipe Utilization experiment to see if a particular pipeline is already fully utilized. In this case increasing active warps is unlikely to result in more eligible warps, as any newly added active warp will be stalled trying to access the oversubscribed pipeline. Instead, try to reduce the load on this pipeline or investigate whether the expected peak performance for the target hardware has already been reached.
- … try to resolve the most frequent stall reasons. If the highest percentage of the stall reasons are …
  - … execution dependencies, run the memory experiments to look for opportunities to optimize slow memory accesses. Also check your instruction mix for a high share of low-throughput instructions, such as double-precision instructions, and verify that you actually need the low-throughput operations. Whenever acceptable, use the faster math intrinsics or switch to single-precision floating point. Use the compiler option --use_fast_math to quickly check the potential effect of these optimizations.
  - … texture stalls, verify your texture accesses with the Memory Statistics - Texture experiment and look for opportunities for further optimization. Also check the Pipe Utilization experiment to verify that you have not already hit the maximum throughput of the texture pipeline.
  - … barrier stalls, try to balance the execution workload of the warps within each block evenly before a barrier, or consider whether the algorithm can be implemented using fewer barriers.
If the percentage of cycles with no eligible warp in the Warp Issue Efficiency chart is close to zero …
- … verify if the kernel performance is close to the expected peak performance for the instruction throughput or the memory throughput.
- … and the average eligible warps in the Warps per SM chart is well above the number of warp schedulers per multiprocessor, a potential advanced optimization strategy is to merge the workloads of two or more threads into one thread. If the remaining grid is still large enough to fill the whole device, this approach can result in more favorable execution times, even if overall occupancy, active warps, and eligible warps are reduced.
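For the last strategy, a quick host-side check can estimate whether the grid still fills the device after merging the work of several threads into one. This is a rough sketch: the function names, the SM count, and the number of resident blocks per SM are assumed values, and "fills the device" is interpreted here as at least one full wave of resident blocks.

```python
import math

def blocks_after_merging(total_work_items, threads_per_block, items_per_thread):
    """Grid size when each thread handles items_per_thread work items."""
    threads = math.ceil(total_work_items / items_per_thread)
    return math.ceil(threads / threads_per_block)

def still_fills_device(blocks, sm_count, resident_blocks_per_sm):
    """True if the grid provides at least one full wave of resident blocks."""
    return blocks >= sm_count * resident_blocks_per_sm

# Hypothetical device: 14 SMs, 8 resident blocks per SM for this kernel.
SM_COUNT, RESIDENT_BLOCKS = 14, 8

large = blocks_after_merging(1 << 20, 256, 4)
small = blocks_after_merging(1 << 14, 256, 4)
print(large, still_fills_device(large, SM_COUNT, RESIDENT_BLOCKS))
print(small, still_fills_device(small, SM_COUNT, RESIDENT_BLOCKS))
```

Under these assumptions the large problem tolerates merging four work items per thread, while the small one would leave most of the device idle, so merging is only worth trying on grids with plenty of surplus blocks.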