Achieved FLOPs

Overview

Measuring floating point operations per second is a common metric for comparing different algorithms, implementation variants, or changes in the compute device. While optimizing kernel code, its primary value is to provide an estimate of how close an implementation comes to the theoretical arithmetic peak performance of the target device. As such, it can be used to track the progress of optimizing a kernel's performance, even though the metric itself provides limited insight into what might cause low performance.
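Put differently (this is a paraphrase of the metric definitions below, not an additional measurement reported by the tool):

    achieved FLOP/s  =  weighted floating point operation count / kernel execution time
    efficiency       =  achieved FLOP/s / theoretical peak FLOP/s

The efficiency ratio indicates how much arithmetic headroom remains on the target device.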

Background

All CUDA compute devices follow the IEEE 754-2008 standard for binary floating-point representation, with some small exceptions. These exceptions are detailed in the Floating-Point Standard section of the CUDA C Programming Guide. One of the key differences is the fused multiply-add (FMA) instruction, which executes a multiply and a dependent add as a single instruction. Its result will often differ slightly from the result obtained by performing the two operations separately.
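The following minimal CUDA C sketch illustrates this rounding difference (it is not taken from this documentation; the kernel name, input values, and launch configuration are illustrative). The __fmul_rn and __fadd_rn intrinsics keep the compiler from contracting the separate operations into an FFMA, while fmaf() forces the fused form.

    #include <cstdio>

    // Compare a separate multiply+add against a fused multiply-add on the device.
    // The fused form rounds only once, so the two results can differ.
    __global__ void compareFma(float a, float b, float c, float *out)
    {
        // Separate operations: the product is rounded to float before the addition.
        out[0] = __fadd_rn(__fmul_rn(a, b), c);
        // Fused multiply-add: the full-precision product feeds the addition directly.
        out[1] = fmaf(a, b, c);
    }

    int main()
    {
        float h_out[2];
        float *d_out;
        cudaMalloc((void **)&d_out, 2 * sizeof(float));
        compareFma<<<1, 1>>>(1.0f + 1e-7f, 1.0f - 1e-7f, -1.0f, d_out);
        cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
        printf("separate: %.10e  fused: %.10e\n", h_out[0], h_out[1]);
        cudaFree(d_out);
        return 0;
    }

For these inputs the separate path yields 0.0 while the fused path preserves the tiny nonzero residual of the product, demonstrating why FMA contraction affects numerical results.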

References to specific assembly instructions in this document are made with respect to the actual instruction set architecture (ISA) of the hardware, called SASS. For a description and a full list of SASS assembly instructions for the different CUDA compute architectures, see the documentation of the NVIDIA CUDA tool cuobjdump.
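For example, the SASS disassembly of a compiled binary can typically be dumped with cuobjdump's SASS option (the exact option spelling may vary between CUDA toolkit versions):

    cuobjdump -sass my_cuda_app

This is where the FADD, FMUL, FFMA, and special function instructions counted by this experiment can be inspected directly.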

Charts

Floating Point Operations

Reports the weighted sum of hit counts for executed floating point instructions grouped by instruction class. The applied weights can be customized prior to data collection using the Experiment Configuration on the Activity Page.

Depending on the operation and the compiler flags used, a single floating point operation written in CUDA C may result in multiple instructions in assembly code. The reported values refer to the executed assembly instructions; the numbers may therefore differ from expectations derived exclusively from the CUDA C code. Use the Source View page to investigate the mapping between high-level code and assembly code.
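For illustration, consider the following minimal sketch (kernel and variable names are illustrative, not taken from this documentation). With nvcc's default --fmad=true, the multiply and add in the inner expression are typically contracted into a single FFMA instruction; with --fmad=false the same line compiles to a separate FMUL and FADD.

    // a*x + y: one FFMA with --fmad=true, FMUL followed by FADD with --fmad=false.
    __global__ void axpy(float a, const float *x, const float *y, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = a * x[i] + y[i];
    }

The same CUDA C source therefore produces different instruction counts, and different weighted sums, depending on the compiler settings.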

Metrics

Single ADD Weighted sum of all executed single precision floating point additions (FADD). The default weight is 1.

Single MUL Weighted sum of all executed single precision floating point multiplications (FMUL). The default weight is 1.

Single FMA Weighted sum of all executed single precision floating point fused multiply-add (FFMA) instructions. The contraction of floating-point multiplies and adds/subtracts into floating-point multiply-add operations can be controlled by the --fmad compiler flag. The default weight is 2.

Single Special Weighted sum of all executed single precision special operations, including the reciprocal (RCP), reciprocal of the square root (RSQ), sine or cosine (SIN/COS), base-2 exponential (EX2), and base-2 logarithm (LG2). Each special function can be assigned a separate weight. The default weight is 1, except for RSQ, which has a default weight of 2.

Double ADD Weighted sum of all executed double precision floating point additions (DADD). The default weight is 1.

Double MUL Weighted sum of all executed double precision floating point multiplications (DMUL). The default weight is 1.

Double FMA Weighted sum of all executed double precision fused multiply-add (DFMA) instructions. The contraction of floating-point multiplies and adds/subtracts into floating-point multiply-add operations can be controlled by the --fmad compiler flag. The default weight is 2.

Double Special Weighted sum of all executed double precision special operations, including the reciprocal (RCP) and reciprocal of the square root (RSQ). The two special functions can be assigned separate weights. The default weight for RCP is 1; the default weight for RSQ is 2.
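For illustration (the counts are made up, not measured data): with the default weights, a kernel that executes 1,000,000 single precision FFMA instructions and 250,000 single precision FADD instructions is reported as 1,000,000 × 2 + 250,000 × 1 = 2,250,000 single precision floating point operations.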

Floating Point Operations per Second

Reports the weighted sum of hit counts for executed floating point instructions per second, for both single precision and double precision. The chart is a stacked bar graph using the same instruction classes as the Floating Point Operations chart.

Metrics

Single Weighted sum of all executed single precision operations per second. The peak single precision floating point performance of a CUDA device is defined as the number of CUDA Cores times the graphics clock frequency, multiplied by two. The factor of two stems from the ability to execute two operations at once using fused multiply-add (FFMA) instructions. This theoretical peak acts as a valid upper limit only if the experiment is configured to use the default weights; custom weights will result in different theoretical peak values. See the worked example following these metrics.

Double Weighted sum of all executed double precision operations per second.
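As a worked example (the device figures are illustrative, not taken from this documentation): a GPU with 2048 CUDA Cores and a graphics clock of 1.0 GHz has a theoretical single precision peak of 2048 × 1.0 GHz × 2 = 4096 GFLOP/s. A kernel reporting 1024 GFLOP/s in the Single metric would therefore achieve 25% of that peak, assuming default weights.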

Analysis

General hints for optimizing floating point Arithmetic Instructions are given in the CUDA C Best Practices Guide. Also try to minimize the number of executed arithmetic instructions with low throughput; the CUDA C Programming Guide provides a list of the expected throughputs of all native Arithmetic Instructions per compute capability.
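As a sketch of that advice (kernel and variable names are illustrative; the intrinsic and compiler option are part of the standard CUDA math API and nvcc, respectively): replacing a standard math function with its faster, lower-accuracy intrinsic, or compiling with --use_fast_math, shifts work toward higher-throughput instructions, which this experiment reports under the Special categories.

    // sinf() expands to a multi-instruction software sequence; the __sinf()
    // intrinsic typically maps to a single special function instruction
    // (counted under "Single Special") at reduced accuracy.
    __global__ void phase(const float *x, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __sinf(x[i]);   // or keep sinf() and compile with --use_fast_math
    }

Whether the accuracy trade-off is acceptable depends on the application; the experiment only reflects the resulting change in executed instruction counts.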


