NVIDIA® Nsight™ Development Platform, Visual Studio Edition 4.0 User Guide
The Issue Efficiency experiment provides information about the device's ability to issue instructions. The key takeaway is whether the device was able to issue instructions every cycle. Not being able to do so inevitably lowers the achievable peak performance of the kernel. Understanding why instructions could not be issued back-to-back is a good high-level triage for analyzing a kernel's performance.
For devices of compute capability 2.0 and higher, the Streaming Multiprocessors (SMs) feature multiple warp schedulers. Each warp scheduler manages a fixed, hardware-defined maximum number of warps. This defines the Device Limit of warps per SM: the upper bound of how many warps can be resident at once on each SM. The execution configuration of a kernel launch can further limit the number of resident warps due to resource constraints (see ... for more details). The resulting Theoretical Occupancy defines the uppermost limit of warp parallelism on an SM. Everything discussed so far is statically defined by the target device, the kernel's source code, and the launch parameters. At runtime, the number of warps allocated to a multiprocessor in a given cycle is referred to as Active Warps. In the best case, the average number of active warps across the kernel execution is equal or very close to the theoretical occupancy. However, if the scheduler fails to balance the workload evenly across the warp schedulers, or if no work remains to fill up the machine, the actual number of active warps can be significantly lower than the theoretical occupancy. The discussed metrics relate to each other by:
Device Limit >= Theoretical Occupancy >= Active Warps
The set of active warps can be thought of as the pool of candidates for making forward progress at a given moment at runtime. However, not all active warps are necessarily able to issue their next instruction. For example, a warp may be sitting at a barrier waiting for other warps to catch up, it might not have fetched its next instruction yet, or it might be waiting for results from previous instructions to become available. Separating active warps by their ability to issue their next instruction leads to the subset of Stalled Warps, which are not able to make forward progress, and the subset of Eligible Warps, which are ready to issue their next instruction. In more detail, an active warp is considered eligible if the instruction has been fetched, the execution unit required by the instruction is available, and the instruction has no unmet dependencies. By definition, the relation between those metrics is:
Active Warps == Stalled Warps + Eligible Warps
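The two relations above can be sketched as a small consistency check. This is only an illustration in host-side C++: the `WarpMetrics` struct, its field names, and both helper functions are hypothetical and are not Nsight identifiers.

```cpp
#include <cassert>

// Hypothetical container for the per-SM warp metrics discussed above.
// The field names are illustrative; they are not Nsight identifiers.
struct WarpMetrics {
    double device_limit;           // hardware maximum of resident warps per SM
    double theoretical_occupancy;  // limit imposed by the launch configuration
    double active_warps;           // average active warps per cycle
    double eligible_warps;         // average eligible warps per cycle
};

// Stalled warps follow from: Active Warps == Stalled Warps + Eligible Warps.
double stalled_warps(const WarpMetrics& m) {
    assert(m.active_warps >= m.eligible_warps);
    return m.active_warps - m.eligible_warps;
}

// Returns true if the values respect
// Device Limit >= Theoretical Occupancy >= Active Warps >= Eligible Warps >= 0.
bool is_consistent(const WarpMetrics& m) {
    return m.device_limit >= m.theoretical_occupancy &&
           m.theoretical_occupancy >= m.active_warps &&
           m.active_warps >= m.eligible_warps &&
           m.eligible_warps >= 0.0;
}
```

With made-up values such as a device limit of 64, a theoretical occupancy of 48, 40 active warps, and 2.5 eligible warps, `is_consistent` returns true and `stalled_warps` returns 37.5.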
At every cycle each warp scheduler picks an eligible warp and issues one or more instructions. If no eligible warp is available to a scheduler for a cycle, the opportunity to make forward progress is lost and processing units might go unutilized. As each scheduler picks either no warp or exactly one warp, the Selected Warps per multiprocessor per cycle is bound to the range [0 .. # Warp Schedulers per SM]. To guarantee that each scheduler can issue an instruction every cycle, at least one eligible warp must be available per warp scheduler per cycle; this is the target value for eligible warps per scheduler. In the context of the SM, this means having at least as many eligible warps as warp schedulers.
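The selection rule described above can be illustrated with a toy model. The sketch below is host-side C++ and assumes a hypothetical trace format (`EligibleTrace`) in which each cycle lists the number of eligible warps seen by each scheduler; it does not describe how the hardware or Nsight represents this data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: trace[c][s] is the number of eligible warps that
// scheduler s sees in cycle c. This format is an assumption made
// for illustration only.
using EligibleTrace = std::vector<std::vector<int>>;

// Each scheduler selects exactly one warp if it has any eligible warp,
// so the result lies in [0, number of schedulers].
int selected_warps(const std::vector<int>& eligible_per_scheduler) {
    int selected = 0;
    for (int eligible : eligible_per_scheduler) {
        assert(eligible >= 0);
        if (eligible > 0) ++selected;
    }
    return selected;
}

// Fraction of scheduler issue slots that were used across the whole trace.
double issue_slot_utilization(const EligibleTrace& trace) {
    std::size_t used = 0;
    std::size_t total = 0;
    for (const auto& cycle : trace) {
        used += static_cast<std::size_t>(selected_warps(cycle));
        total += cycle.size();
    }
    return total == 0 ? 0.0
                      : static_cast<double>(used) / static_cast<double>(total);
}
```

For a two-scheduler trace of `{{3, 0}, {1, 1}, {0, 0}, {0, 5}}`, the average is 1.25 eligible warps per scheduler per cycle, yet only half of the issue slots are used. An average above the target does not by itself mean every scheduler issued on every cycle.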
Warps Per SM
Provides an overview of the warps per SM metrics discussed in the background section of this document. The metrics are reported as average values across the complete kernel execution for each individual SM of the target device. The y-axis is scaled to the device limit. Even if a sufficient number of eligible warps is reported on average, this does not guarantee that every warp scheduler issued an instruction on every cycle.
Metrics
Active Warps: A warp is active from the time it is scheduled on a multiprocessor until it completes its last instruction. Each warp scheduler maintains its own list of assigned active warps. This assignment of warps to schedulers is done once, at the time a warp becomes active, and is valid for the lifetime of the warp. As a rough guideline, aim for at least eight active warps per warp scheduler. More active warps might allow hiding warp latencies more efficiently.
Eligible Warps: An active warp is considered eligible if it is able to issue its next instruction. Each warp scheduler selects the next warp to issue an instruction from the pool of eligible warps. Warps that are not eligible report an Issue Stall Reason. The target is to have at least one eligible warp per scheduler per cycle.
Theoretical Occupancy: The theoretical occupancy acts as the upper limit to active warps, and consequently also to eligible warps, per SM. It is defined by the execution configuration of the kernel launch. More detailed information is discussed in the context of the Achieved Occupancy experiment.
Warp Issue Efficiency
Distribution of the availability of eligible warps per cycle across the GPU. The values are reported as the sum across all warp schedulers for the duration of the kernel execution. On devices with compute capability 2.x the chart exposes three metrics. No eligible warps means both warp schedulers failed to issue. One eligible denotes that only one warp scheduler had an eligible warp available. Two or more guarantees that both warp schedulers issued an instruction.
Metrics
No Eligible: The number of cycles in which a warp scheduler had no eligible warps to select from and therefore did not issue an instruction. The lower the percentage of cycles with no eligible warp, the more efficiently the code runs on the target device. If this value is high, investigate the Issue Stall Reasons to understand what keeps warps from becoming eligible.
One or More Eligible: The number of cycles in which a warp scheduler had at least one eligible warp to select from. This metric is equal to the total number of cycles in which an instruction was issued, summed across all warp schedulers. Higher values are better, with a target of getting close to 100%.
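As a rough sketch of how the two metrics turn into percentages, the following host-side C++ treats them as a pair of cycle counts. The `IssueCounters` struct and both function names are hypothetical; the tool reports these values directly, so this only illustrates the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counter pair, each summed across all warp schedulers.
struct IssueCounters {
    std::uint64_t no_eligible;           // cycles with no eligible warp
    std::uint64_t one_or_more_eligible;  // cycles with at least one eligible warp
};

// Percentage of scheduler cycles in which no instruction could be issued.
double no_eligible_percent(const IssueCounters& c) {
    const std::uint64_t total = c.no_eligible + c.one_or_more_eligible;
    assert(total > 0);
    return 100.0 * static_cast<double>(c.no_eligible) /
           static_cast<double>(total);
}

// Percentage of scheduler cycles in which an instruction was issued.
double issue_efficiency_percent(const IssueCounters& c) {
    return 100.0 - no_eligible_percent(c);
}
```

For example, 250 cycles with no eligible warp and 750 cycles with at least one give 25% lost issue cycles and an issue efficiency of 75%.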
Issue Stall Reasons
The issue stall reasons capture why an active warp is not eligible. On devices of compute capability 3.0 and higher, every stalled warp increments its most critical stall reason by one on every cycle. The sum of the stall reasons hence increments per multiprocessor per cycle by a value between zero (if all warps are eligible) and the number of active warps (if all warps are stalled). The stall reason counters are updated for all stalled warps, regardless of whether the scheduler was able to issue an instruction that cycle. This is in contrast to devices of compute capability 2.x, which only update the issue stall reasons on cycles for which the warp scheduler was unable to issue an instruction. Investigating issue stall reasons is actionable only if the Warp Issue Efficiency shows cycles with no eligible warps. Reducing issue stalls for a kernel that never fails to issue a new instruction on all warp schedulers will not increase overall execution performance.
Metrics
Instruction Fetch: A warp reports this reason if the fetch unit has not returned the next instruction for the warp. This stall reason is often high when launching a kernel with a high number of active warps, for kernels with highly divergent flow control per warp, or when executing warps that frequently jump to distant code blocks.
Execution Dependency: A warp reports this reason if an input dependency is not yet available. This includes waiting for results from any global/local/shared memory access or any low-latency operation. If this value is high, check the efficiency of the memory accesses with the memory experiments and try to reduce the number of transactions per request and the number of L1/L2 cache misses wherever possible. This stall reason can also be reduced by increasing the number of independent instructions. Common strategies for the latter include loop unrolling, function inlining, or manually interleaving independent calculations.
Data Requests: A warp reports this reason if a data request cannot be made at the time because the required resources are not available or are fully utilized, or because too many operations of that type are already outstanding. If data requests make up a large portion of the stall reasons, run the memory experiments to determine whether you can optimize existing transactions per request or whether you need to revisit your algorithm.
Texture: A warp reports this reason if the texture sub-system is currently fully utilized or is not able to accept any further requests. If this value is high, check the Memory Statistics - Texture experiment and try to optimize the kernel's texture accesses.
Synchronization: A warp reports this reason if the warp is blocked at a __syncthreads() or a memory barrier. If this value is high and the kernel execution configuration is limited to a small number of blocks, consider dividing the kernel grid into more thread blocks. A high value may also indicate highly imbalanced workloads for the warps within a block before they hit a barrier synchronization point. If this applies, try to rebalance the work more evenly across the warps of each block.
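When triaging, it can help to identify which stall reason dominates before deciding which experiment to run next. The sketch below is host-side C++ that picks the largest entry from a list of (reason, count) pairs. The `StallSample` alias and `dominant_stall` function are hypothetical, and the counts would have to be copied from the report by hand.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical (reason name, counter value) pair.
using StallSample = std::pair<std::string, std::uint64_t>;

// Returns the reason with the highest count and its share of all
// recorded stalls in percent. Ties resolve to the earliest entry.
std::pair<std::string, double> dominant_stall(
        const std::vector<StallSample>& samples) {
    assert(!samples.empty());
    std::uint64_t total = 0;
    const StallSample* top = &samples.front();
    for (const auto& s : samples) {
        total += s.second;
        if (s.second > top->second) top = &s;
    }
    const double share =
        total == 0 ? 0.0
                   : 100.0 * static_cast<double>(top->second) /
                         static_cast<double>(total);
    return {top->first, share};
}
```

As noted above, the result is only worth acting on if the Warp Issue Efficiency chart shows cycles with no eligible warps.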
--use_fast_math to quickly check the potential effects of these optimization approaches.
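The Execution Dependency entry names manually interleaving independent calculations as one way to increase the number of independent instructions. The following is a minimal sketch of that idea in plain host-side C++ rather than kernel code; the function names are hypothetical. The first version forms a single dependency chain, while the second unrolls the loop by four and keeps four independent accumulators.

```cpp
#include <cstddef>
#include <vector>

// Single accumulator: every addition depends on the result of the
// previous one, forming one long dependency chain.
float sum_dependent(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) acc += x;
    return acc;
}

// Loop unrolled by four with four accumulators: the four additions
// inside one iteration do not depend on each other.
float sum_interleaved(const std::vector<float>& v) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    // Handle the remaining zero to three elements.
    for (; i < n; ++i) a0 += v[i];
    return (a0 + a1) + (a2 + a3);
}
```

Because floating-point addition is not associative, the two versions can differ in rounding for general inputs; they agree exactly when every intermediate sum is representable, for example for small integer values.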
NVIDIA® Nsight™ Development Platform, Visual Studio Edition User Guide Rev. 4.0.140501 ©2009-2014. NVIDIA Corporation. All Rights Reserved.