NVIDIA® Nsight™ Development Platform, Visual Studio Edition 4.7 User Guide
Knowing the number of executed integer operations and the instruction mix is a useful input for coming up with sound estimates of performance expectations for a kernel. It also serves as a basis for better understanding many other metrics and experiments. Similar to the Achieved FLOPS experiment, its primary benefit is tracking and evaluating differences in performance across code changes, rather than deriving the actual cause of a performance limitation.
On devices of compute capability 1.x, 32-bit integer multiplication is not natively supported and is implemented using multiple instructions. 24-bit integer multiplication, however, is natively supported via the __[u]mul24 intrinsic. Using __[u]mul24 instead of the 32-bit multiplication operator whenever possible usually improves performance for instruction-bound kernels. It can have the opposite effect, however, in cases where the use of __[u]mul24 inhibits compiler optimizations.
On devices of compute capability 2.x and beyond, 32-bit integer multiplication is natively supported, but 24-bit integer multiplication is not. __[u]mul24 is therefore implemented using multiple instructions and should not be used.
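Whether __umul24 can stand in for the 32-bit multiplication operator depends on the operand range. The following is a minimal host-side sketch in plain C++, not device code. It assumes the documented behavior of __umul24 (multiply the least significant 24 bits of each operand and return the least significant 32 bits of the product). The names umul24_emulated, mul32, and fits_in_24_bits are hypothetical helpers introduced only for illustration and are not part of any NVIDIA API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side emulation of the arithmetic performed by __umul24:
// only the low 24 bits of each operand take part in the multiplication, and
// the low 32 bits of the product are returned.
inline std::uint32_t umul24_emulated(std::uint32_t x, std::uint32_t y) {
    const std::uint64_t lo_x = x & 0x00FFFFFFu;
    const std::uint64_t lo_y = y & 0x00FFFFFFu;
    return static_cast<std::uint32_t>(lo_x * lo_y);
}

// Ordinary 32-bit multiplication (wraps modulo 2^32).
inline std::uint32_t mul32(std::uint32_t x, std::uint32_t y) {
    return x * y;
}

// True if the value can be represented in 24 bits, that is, if a 24-bit
// multiply sees the whole operand.
inline bool fits_in_24_bits(std::uint32_t v) {
    return (v >> 24) == 0u;
}
```

When both operands fit in 24 bits, the two functions return the same value. Once an operand exceeds 24 bits, the results can differ, which is one reason the substitution is only valid "whenever possible".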
Integer division and modulo operations are costly: tens of instructions on devices of compute capability 1.x, below 20 instructions on devices of compute capability 2.x and higher. They can be replaced with bitwise operations in some cases: if n is a power of 2, (i/n) is equivalent to (i>>log2(n)) and (i%n) is equivalent to (i&(n-1)); the compiler will perform these conversions if n is a literal.
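When n is only known at run time, the power-of-2 replacement has to be written by hand. The sketch below shows the idea for unsigned operands in plain C++, which is also valid inside CUDA C source. The helper names is_pow2, log2_pow2, div_pow2, and mod_pow2 are hypothetical and exist only for this illustration. Signed operands are deliberately left out, because division rounds toward zero while a right shift does not, so the identities need extra care for negative values.

```cpp
#include <cassert>
#include <cstdint>

// True if n is a non-zero power of 2.
inline bool is_pow2(std::uint32_t n) {
    return n != 0u && (n & (n - 1u)) == 0u;
}

// log2 of a power-of-2 value, computed by counting right shifts.
// Precondition (assumed, not checked): is_pow2(n).
inline unsigned log2_pow2(std::uint32_t n) {
    unsigned shift = 0;
    while (n > 1u) {
        n >>= 1;
        ++shift;
    }
    return shift;
}

// i / n for unsigned i and power-of-2 n, using a shift instead of a division.
inline std::uint32_t div_pow2(std::uint32_t i, std::uint32_t n) {
    return i >> log2_pow2(n);
}

// i % n for unsigned i and power-of-2 n, using a mask instead of a modulo.
inline std::uint32_t mod_pow2(std::uint32_t i, std::uint32_t n) {
    return i & (n - 1u);
}
```

In a real kernel the shift amount would normally be computed once and passed in rather than recomputed per thread. The loop in log2_pow2 is kept here only to make the sketch self-contained.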
References to specific assembly instructions in this document are made in regard to the actual instruction set architecture (ISA) of the hardware, called SASS. For a description and a full list of SASS assembly instructions for the different CUDA compute architectures see the documentation of the NVIDIA CUDA tool cuobjdump.
Reports the weighted sum of hit counts for executed integer instructions grouped by instruction class. The applied weights can be customized prior to data collection using the Experiment Configuration on the Activity Page.
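As a rough illustration of what "weighted sum of hit counts grouped by instruction class" means, the following C++ sketch accumulates weight times hit count per class. It is not the tool's implementation. The InstrSample record, the class labels, and the reading of a weight as a simple per-instruction multiplier are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One executed instruction type as it might appear in a profile.
// All fields are hypothetical and only serve this illustration.
struct InstrSample {
    std::string instr_class;  // class label, for example "ADD" or "Shift"
    std::uint64_t hit_count;  // how many times the instruction was executed
    double weight;            // assumed per-instruction multiplier
};

// Sums weight * hit_count over all samples, grouped by instruction class.
inline std::map<std::string, double>
weighted_sum_by_class(const std::vector<InstrSample>& samples) {
    std::map<std::string, double> totals;
    for (const auto& s : samples) {
        totals[s.instr_class] += s.weight * static_cast<double>(s.hit_count);
    }
    return totals;
}
```

For example, two shift instructions with 100 hits at weight 1.0 and 50 hits at weight 2.0 would contribute a combined value of 200 to a single "Shift" class under these assumptions.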
Depending on the actual operation and the used compiler flags, a single integer instruction written in CUDA C may result in multiple instructions in assembly code. The reported values refer to the executed assembly instructions; therefore the numbers may differ from expectations derived exclusively from the CUDA C code. Use the Source View page to investigate the mapping between high-level code and assembly code.
Weighted sum of all executed integer additions (
Weighted sum of all executed integer multiplications (
Weighted sum of all executed integer multiply-add (
Weighted sum of all executed sum-of-absolute-differences (
Weighted sum of all executed shift-and-add (
Weighted sum of all executed shift instructions, covering shift-right (
Weighted sum of all executed integer bit operations. Specifically accounting for bit-field-extract (
Reports the weighted sum of hit counts for executed integer instructions per second. The chart is a stacked bar graph using the very same instruction classes as the Integer Operations chart.
Math Weighted sum of all executed integer arithmetic instructions per second. Combines the individual contributions of ADD, MUL, MAD, SAD, and SCADD as reported in the Integer Operations chart.
Other Weighted sum of shift instructions and bit operations per second.
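The relationship between the two charts can be sketched as follows: Math combines the ADD, MUL, MAD, SAD, and SCADD values, Other combines the shift and bit-operation values, and each sum is divided by a duration. This is a minimal C++ sketch, not the tool's implementation. The struct and function names are hypothetical, and using the kernel's execution time in seconds as the divisor is an assumption.

```cpp
#include <cassert>

// Per-class weighted sums as reported in the Integer Operations chart.
// Field names are hypothetical labels for this illustration.
struct IntegerOpTotals {
    double add;
    double mul;
    double mad;
    double sad;
    double scadd;
    double shift;
    double bitops;
};

// The two stacked values of the per-second chart.
struct IntegerOpRates {
    double math_per_s;
    double other_per_s;
};

// Combines the class totals into Math and Other and divides by the duration.
// Precondition (assumed, not checked): duration_seconds > 0.
inline IntegerOpRates per_second(const IntegerOpTotals& t, double duration_seconds) {
    const double math = t.add + t.mul + t.mad + t.sad + t.scadd;
    const double other = t.shift + t.bitops;
    return IntegerOpRates{math / duration_seconds, other / duration_seconds};
}
```

With totals of 200 for the arithmetic classes and 50 for shifts and bit operations over 0.5 seconds, this yields 400 Math and 100 Other operations per second.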
General hints for optimizing integer Arithmetic Instructions are given in the CUDA C Best Practices Guide:
Also try minimizing the number of executed arithmetic instructions with low throughput. The CUDA C Programming Guide provides a list of the expected throughputs for all native Arithmetic Instructions per compute capability.
NVIDIA GameWorks Documentation Rev. 1.0.150630 ©2015. NVIDIA Corporation. All Rights Reserved.