NVIDIA® Nsight™ Development Platform, Visual Studio Edition 4.7 User Guide
Each Streaming Multiprocessor (SM) of a CUDA device features numerous hardware units that are specialized in performing specific tasks. At the chip level, those units provide execution pipelines to which the warp schedulers dispatch instructions. For example, texture units provide the ability to execute texture fetches and perform texture filtering, while load/store units fetch and save data to memory. Understanding the utilization of those pipelines, and knowing how close they are to the peak performance of the target device, is key for analyzing the efficiency of a kernel's execution; it also allows identifying performance bottlenecks caused by oversubscribing a certain type of pipeline.
The Kepler GK110 whitepaper and the NVIDIA GeForce GTX 680 whitepaper both describe and compare the architecture details of the most recent CUDA compute devices. The following table gives a brief summary of the unit counts per SM and their corresponding pipeline throughputs per cycle:
(Table: per-SM unit counts and corresponding pipeline throughputs per cycle, with Unit Count and Throughput columns for Fermi (cc 2.0) and Kepler (cc 3.x); see the referenced whitepapers for the full figures.)
With respect to those architecture throughput numbers, the Pipeline Utilization metrics report the observed utilization for each pipeline at runtime. High pipeline utilization indicates that the corresponding compute resources were used heavily and kept busy often during the execution of the kernel. Low values indicate that the pipeline was not frequently used and resources were idle. The results for individual pipelines are independent of each other; summing up two or more pipeline utilization percentages does not result in a meaningful value. As the pipeline metrics are reported as an average over the duration of the kernel launch, a low value does not necessarily rule out that the pipeline was a bottleneck at some point in time during the kernel execution.
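The caveat about averaged metrics can be illustrated with a small sketch. The interval samples below are invented, not Nsight output; they only demonstrate how an average over the whole launch can mask a short burst of full saturation.

```python
# Hypothetical per-interval utilization of one pipeline over a kernel's
# runtime, in percent (sample values assumed for illustration).
samples = [10, 100, 15, 5, 10]

average = sum(samples) / len(samples)  # what an averaged metric reports
peak = max(samples)                    # what actually happened transiently

print(f"average utilization: {average:.0f}%")  # looks harmless
print(f"peak utilization:    {peak}%")         # momentary bottleneck
```

Here the averaged metric reports 28% even though the pipeline was fully saturated in one interval, which is exactly why a low reported value does not rule out a transient bottleneck.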
Arithmetic instructions are executed by a multitude of pipelines, as hinted in the previous table for the two examples of 32-bit floating-point instructions and special function operations. Those pipelines offer widely different throughputs, in turn leading to the varying Throughputs of Native Arithmetic Instructions documented in the CUDA C Programming Guide. For the CUDA devices supported by this experiment, those arithmetic throughputs (operations per clock cycle per SM) are:
| Instructions | cc 2.0 | cc 2.1 | cc 3.0 | cc 3.5 |
| --- | --- | --- | --- | --- |
| 32-bit floating-point add, multiply, multiply-add | 32 | 48 | 192 | 192 |
| 64-bit floating-point add, multiply, multiply-add | ≤16 | 4 | 8 | 64 |
| 32-bit floating-point reciprocal, reciprocal square root, base-2 logarithm, base-2 exponential, sine, cosine | 4 | 8 | 32 | 32 |
| 32-bit integer add, extended-precision add, subtract, extended-precision subtract, minimum, maximum | 32 | 48 | 160 | 160 |
| 32-bit integer multiply, multiply-add, extended-precision multiply-add, sum of absolute difference, population count, count of leading zeros, most significant non-sign bit | 16 | 16 | 32 | 32 |
| 32-bit integer shift | 16 | 16 | 32 | ≤64 |
| 32-bit integer bit reverse, bit field extract/insert | 16 | 16 | 32 | 64 |
| Type conversions from 8-bit and 16-bit integer to 32-bit types | 16 | 16 | 128 | 128 |
| Type conversions from and to 64-bit types | ≤16 | 4 | 8 | 32 |
| All other type conversions | 16 | 16 | 32 | 32 |
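These per-cycle throughputs translate into absolute peak rates once multiplied by the SM count and clock frequency. The sketch below works through that arithmetic for FP32 on a cc 3.5 device; the SM count and clock are illustrative assumptions for one GK110 part, not values stated in this document.

```python
# Turning a per-cycle, per-SM throughput into an absolute peak FLOP rate.
# SM count and clock are assumed example values, not from this document.

ops_per_sm_per_cycle = 192   # FP32 add/mul/fma throughput for cc 3.5
flops_per_fma = 2            # one fused multiply-add counts as two FLOPs
num_sms = 14                 # assumed SM count of the example part
clock_ghz = 0.732            # assumed core clock in GHz

peak_gflops = ops_per_sm_per_cycle * flops_per_fma * num_sms * clock_ghz
print(f"theoretical peak FP32: {peak_gflops:.0f} GFLOP/s")
```

A kernel's observed arithmetic pipeline utilization is reported relative to such architecture limits.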
By definition, the instruction throughput states how many operations each SM can process per cycle. In other words, an operation with a high throughput poses less cost to issue than an operation with a lower instruction throughput. The Arithmetic Workload uses this simplistic cost model by weighting issued instruction counts with the corresponding reciprocal instruction throughput. Doing so makes it possible to evaluate how much each type of arithmetic instruction contributes to the overall processing cost. This is especially useful for kernels with a high arithmetic pipeline utilization to identify which class of arithmetic operations might have hit the available throughput limit.
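The reciprocal-throughput weighting can be sketched in a few lines. The issue counts below are invented for illustration; the throughputs are taken from the cc 3.5 column of the table above.

```python
# Minimal sketch of the reciprocal-throughput cost model described above.
# Issue counts are made-up example values; throughputs are for cc 3.5.

issued = {
    "fp32 add/mul/fma": 6_000_000,
    "fp64 add/mul/fma": 100_000,
    "special functions": 50_000,
    "int add/sub/min/max": 2_000_000,
}

throughput = {  # operations per clock cycle per SM
    "fp32 add/mul/fma": 192,
    "fp64 add/mul/fma": 64,
    "special functions": 32,
    "int add/sub/min/max": 160,
}

# Cost of a class = issue count weighted by the reciprocal throughput:
# an instruction with half the throughput costs twice as much to issue.
cost = {k: issued[k] / throughput[k] for k in issued}
total = sum(cost.values())

for name, c in sorted(cost.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {100 * c / total:5.1f}% of arithmetic workload")
```

With these example counts the FP32 class dominates the estimated workload even though other classes were issued too, which is the kind of ranking the Arithmetic Workload chart presents.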
Shows the average utilization of the four major logical pipelines of the SMs during the execution of the kernel. Useful for investigating if a pipeline is oversubscribed and therefore is limiting the kernel's performance. Also helpful to estimate if adding more work will scale well or if a pipeline limit will be hit. In this context adding more work may refer to adding more arithmetic workload (for example by increasing the accuracy of some calculations), increasing the number of memory operations (including introducing register spilling), or increasing the number of Active Warps per SM with the goal of improving instruction latency hiding.
The reported values are averages across the duration of the kernel execution. A low utilization percentage does not guarantee that the pipeline was never oversubscribed at some point during the kernel execution.
Load / Store
Covers all issued instructions that trigger a request to the memory system of the target device, excluding texture operations. Accounts for load and store operations to global, local, and shared memory, as well as any atomic operation. Also includes register spills. Devices of compute capability 3.5 and higher support loading global memory through the Read-Only Data Cache; those loads are counted towards the Texture pipeline instead.
Texture
Covers all issued instructions that perform a texture fetch and, for devices of compute capability 3.5 and higher, global memory loads via the Read-Only Data Cache (LDG).
Control Flow
Covers all issued instructions that can have an effect on the control flow, such as branch instructions (BRA).
Arithmetic
Covers all issued floating point instructions, integer instructions, conversion operations, and movement instructions. See the Instruction Set Reference for a detailed list of assembly instructions for each of these groups. If the arithmetic pipeline utilization is high, check the Arithmetic Workload chart to identify the type of instruction with the highest costs.
Provides the distribution of estimated costs for numerous classes of arithmetic instructions. The cost model is based on the issue count weighted by the reciprocal of the corresponding instruction throughput. The instruction classes match the rows of the arithmetic throughput table given in the background section of this document.
Arithmetic instructions are executed by a multitude of pipelines on the chip, which may operate in parallel. As a consequence, the Arithmetic Workload distribution is not a true partition of the Arithmetic pipeline percentage shown in the Pipeline Utilization chart; however, the instruction type with the highest cost estimate is likely causing the highest on-chip pipeline utilization.
All references to individual assembly instructions in the following metric descriptions refer to the native instruction set architecture (ISA) of CUDA devices as further described in the Instruction Set Reference.
Estimated workload for all 32-bit floating-point add (FADD), multiply (FMUL), and multiply-add (FFMA) instructions.
Estimated workload for all 64-bit floating-point add (DADD), multiply (DMUL), and multiply-add (DFMA) instructions.
Estimated workload for all 32-bit floating-point reciprocal, reciprocal square root, base-2 logarithm, base-2 exponential, sine, and cosine instructions (MUFU).
Estimated workload for all 32-bit integer add, extended-precision add, subtract, extended-precision subtract, minimum, and maximum instructions (such as IADD, IMNMX).
Estimated workload for all 32-bit integer multiply, multiply-add, extended-precision multiply-add, sum of absolute difference, population count, count of leading zeros, and most significant non-sign bit instructions (such as IMUL, IMAD, POPC).
Estimated workload for all 32-bit integer shift left (SHL) and shift right (SHR) instructions.
Estimated workload for all 32-bit integer bit reverse and bit field extract/insert instructions (BFE, BFI).
Estimated workload for all logical operations (LOP).
Estimated workload for all warp shuffle (SHFL) instructions.
Conv (From I8/I16 to I32)
Estimated workload for all type conversions from 8-bit and 16-bit integer to 32-bit types (a subset of all type conversion instructions).
Conv (To/From FP64)
Estimated workload for all type conversions from and to 64-bit types (a subset of all type conversion instructions).
Conv (All Other)
Estimated workload for all other type conversions (the remaining subset of all type conversion instructions).
Compile with verbose PTX assembler output (--ptxas-options=-v) to obtain the number of spilled bytes; if high, try to reduce spilling by changing the execution configuration or the launch bounds of the kernel launch.
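For batch workflows, the spill figures can be pulled out of the ptxas log automatically. The helper below is a hypothetical sketch; the sample log line is illustrative of the format ptxas prints, not output from a specific build.

```python
# Hypothetical helper: extract spill byte counts from ptxas verbose output
# (nvcc --ptxas-options=-v). The sample log line below is illustrative.
import re

def spill_bytes(ptxas_log: str):
    """Return (spill_store_bytes, spill_load_bytes) from a ptxas -v log."""
    stores = re.search(r"(\d+) bytes spill stores", ptxas_log)
    loads = re.search(r"(\d+) bytes spill loads", ptxas_log)
    return (int(stores.group(1)) if stores else 0,
            int(loads.group(1)) if loads else 0)

log = ("ptxas info    : Used 63 registers, "
       "40 bytes spill stores, 36 bytes spill loads")
print(spill_bytes(log))  # (40, 36)
```

Nonzero spill counts mean some loads and stores reported under the Load/Store pipeline are register spills rather than explicit memory accesses in the source.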
NVIDIA GameWorks Documentation Rev. 1.0.150630 ©2015. NVIDIA Corporation. All Rights Reserved.