NVIDIA® Nsight™ Development Platform, Visual Studio Edition 4.7 User Guide
Measuring floating point operations per second (FLOPS) is a common metric for comparing different algorithms, implementation variants, or changes in the compute device. When optimizing kernel code, its primary value is to provide an estimate of how close an implementation comes to the theoretical arithmetic peak performance of the target device. As such, it can be used to track the progress of optimizing a kernel's performance, even though the metric itself provides limited insight into what might cause low performance.
All CUDA compute devices follow the IEEE 754-2008 standard for binary floating-point representation, with some small exceptions. These exceptions are detailed in the Floating-Point Standard section of the CUDA C Programming Guide. One of the key differences is the fused multiply-add (FMA) instruction, which combines a multiply and an add into a single instruction execution. Its result will often differ slightly from the result obtained by performing the two operations separately.
References to specific assembly instructions in this document are made with regard to the actual instruction set architecture (ISA) of the hardware, called SASS. For a description and a full list of SASS assembly instructions for the different CUDA compute architectures, see the documentation of the NVIDIA CUDA tool cuobjdump.
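Assuming a binary already built with nvcc (the name app below is a placeholder), the embedded SASS can be inspected with cuobjdump like this:

```shell
# Dump the SASS assembly embedded in a CUDA binary ("app" is a placeholder).
cuobjdump --dump-sass app

# Equivalent short form of the same option.
cuobjdump -sass app
```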
Reports the weighted sum of hit counts for executed floating point instructions grouped by instruction class. The applied weights can be customized prior to data collection using the Experiment Configuration on the Activity Page.
Depending on the actual operation and the compiler flags used, a single floating point operation written in CUDA C may result in multiple instructions in assembly code. The reported values refer to the executed assembly instructions; therefore, the numbers may differ from expectations derived exclusively from the CUDA C code. Use the Source View page to investigate the mapping between high-level code and assembly code.
Weighted sum of all executed single precision floating point additions (FADD).
Weighted sum of all executed single precision floating point multiplications (FMUL).
Weighted sum of all executed single precision floating point fused multiply-add operations (FFMA).
Weighted sum of all executed single precision special operations, including the reciprocal (MUFU.RCP).
Weighted sum of all executed double precision floating point additions (DADD).
Weighted sum of all executed double precision floating point multiplications (DMUL).
Weighted sum of all executed double precision fused multiply-add operations (DFMA).
Weighted sum of all executed double precision special operations, including the reciprocal (MUFU.RCP64H).
Reports the weighted sum of hit counts for executed floating point instructions per second, for both single precision and double precision accuracy. The chart is a stacked bar graph using the same instruction classes as the Floating Point Operations chart.
Single: Weighted sum of all executed single precision operations per second. The peak single precision floating point performance of a CUDA device is defined as the number of CUDA cores times the graphics clock frequency, multiplied by two. The factor of two stems from the ability to execute two operations at once using fused multiply-add (FFMA).
Double: Weighted sum of all executed double precision operations per second.
General hints for optimizing floating point arithmetic instructions are given in the Arithmetic Instructions section of the CUDA C Best Practices Guide. Also try to minimize the number of executed arithmetic instructions with low throughput. The CUDA C Programming Guide provides a list of the expected throughputs of all native arithmetic instructions per compute capability.
NVIDIA GameWorks Documentation Rev. 1.0.150630 ©2015. NVIDIA Corporation. All Rights Reserved.