5.5. Floating-Point Computation#

5.5.1. Floating-Point Introduction#

Since the adoption of the IEEE-754 Standard for Binary Floating-Point Arithmetic in 1985, virtually all mainstream computing systems, including NVIDIA’s CUDA architectures, have implemented the standard. The IEEE-754 standard specifies how the results of floating-point arithmetic should be approximated.

To get accurate results and achieve the highest performance with the required precision, it is important to consider many aspects of floating-point behavior. This is particularly important in a heterogeneous computing environment where operations are performed on different types of hardware.

The following sections review the basic properties of floating-point computation and cover Fused Multiply-Add (FMA) operations and the dot product. These examples illustrate how different implementation choices affect accuracy.

5.5.1.1. Floating-Point Format#

Floating-point format and functionality are defined in the IEEE-754 Standard.

The standard mandates that binary floating-point data be encoded on three fields:

  • Sign: one bit to indicate a positive or negative number.

  • Exponent: encodes the exponent offset by a numeric bias in base 2.

  • Significand (also called mantissa or fraction): encodes the fractional value of the number.

Floating-Point Encoding

The latest IEEE-754 standard defines the encodings and properties of the following binary formats:

  • 16-bit, also known as half-precision, corresponding to the __half data type in CUDA.

  • 32-bit, also known as single-precision, corresponding to the float data type in C, C++, and CUDA.

  • 64-bit, also known as double-precision, corresponding to the double data type in C, C++, and CUDA.

  • 128-bit, also known as quad-precision, corresponding to the __float128 or _Float128 data types in CUDA.

These types have the following bit lengths:

IEEE-754 Floating-Point Encodings

The numeric value associated with floating-point encoding for normal values is computed as follows:

\[(-1)^\mathrm{sign} \times 1.\mathrm{mantissa} \times 2^{\mathrm{exponent} - \mathrm{bias}}\]

for subnormal values, the leading 1 in the formula is absent.

The exponents are biased by \(127\) and \(1023\) for single- and double-precision, respectively. The leading \(1.\) of the significand is implicit and is not stored in the fraction field.
For example, the value \(-192 = (-1)^1 \times 2^7 \times 1.5\) is encoded with a negative sign, an exponent of \(7\), and a fractional part of \(0.5\). The biased exponent \(7\) is therefore represented by the bit string 7 + 127 = 134 = 10000110 for float and 7 + 1023 = 1030 = 10000000110 for double. The fraction \(0.5 = 2^{-1}\) is represented by a bit string with a 1 in the first (most significant) position. The binary encoding of \(-192\) in single-precision and double-precision is shown in the following figure:
Floating-Point Representation for -192

Since the fraction field uses a limited number of bits, not all real numbers can be represented exactly. For instance, the binary representation of the mathematical value of the fraction \(2 / 3\) is 0.10101010..., which has an infinite number of bits after the binary point. Therefore, \(2 / 3\) must be rounded before it can be represented as a floating-point number with limited precision. The rounding rules and modes are specified in IEEE-754. The most frequently used mode is round-to-nearest-even (ties to even), often abbreviated as round-to-nearest.
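
To make the encoding concrete, the following host-side sketch extracts the three fields of an IEEE-754 binary32 value; the helper name print_fields is illustrative and not part of any CUDA API:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Illustrative helper: prints the sign, biased exponent, and fraction fields
// of an IEEE-754 single-precision value.
void print_fields(float value) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));    // reinterpret the 32-bit encoding
    uint32_t sign     = bits >> 31;              // 1 sign bit
    uint32_t exponent = (bits >> 23) & 0xFFu;    // 8 exponent bits, biased by 127
    uint32_t fraction = bits & 0x7FFFFFu;        // 23 fraction bits
    std::printf("%g -> sign=%u, biased exponent=%u (unbiased %d), fraction=0x%06X\n",
                value, sign, exponent, static_cast<int>(exponent) - 127, fraction);
}

int main() {
    print_fields(-192.0f);      // sign=1, biased exponent=134, fraction=0x400000 (0.5)
    print_fields(2.0f / 3.0f);  // 2/3 is rounded before being stored
    return 0;
}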

5.5.1.2. Normal and Subnormal Values#

Any floating-point value whose exponent field has at least one bit set and is not all ones is called normal.

An important aspect of floating-point normal values is the wide gap between zero and the smallest positive normal number, FLT_MIN. This gap is much wider than the gap between FLT_MIN and the next larger representable number.

Floating-point subnormal numbers, also called denormals, were introduced to address this issue. A subnormal floating-point value is represented with all bits in the exponent set to zero and at least one bit set in the significand. Subnormals are a required part of the IEEE-754 floating-point standard.

Subnormal numbers allow for a gradual loss of precision as an alternative to sudden rounding toward zero. However, subnormal numbers are computationally more expensive. Therefore, applications that don’t require strict accuracy may choose to avoid them to improve performance. The nvcc compiler allows disabling subnormal numbers by setting the -ftz=true option (flush-to-zero), which is also included in --use_fast_math.

A simplified visualization of the encoding of the smallest normal value and subnormal values in single-precision is shown in the following figure:

minimum normal value and subnormal values representations

where X represents either 0 or 1.

5.5.1.3. Special Values#

The IEEE-754 standard defines three special values for floating-point numbers:

Zero:

  • Mathematical zero.

  • Note that there are two possible representations of floating-point zero: +0 and -0. This differs from the representation of integer zero.

  • +0 == -0 evaluates to true.

  • 0 is encoded with all bits set to 0 in the exponent and significand.

Infinity:

  • Floating-point numbers behave according to saturation arithmetic, in which operations that overflow the representable range result in +Infinity or -Infinity.

  • Infinity is encoded with all bits in the exponent set to 1 and all bits in the significand set to 0. There are exactly two encodings for infinity values.

  • Arithmetic operations that combine infinity with a finite value generally result in infinity. Exceptions include multiplying infinity by zero, which results in NaN, and dividing a finite value by infinity, which results in zero.

Not-a-Number (NaN):

  • NaN is a special symbol that represents an undefined or non-representable value. Common examples are 0.0 / 0.0, sqrt(-1.0), or +Inf - Inf.

  • NaN is encoded with all bits in the exponent set to 1 and any bit pattern in the significand, except for all bits set to 0. There are \(2^{\mathrm{mantissa} + 1} - 2\) possible encodings.

  • Any arithmetic operation involving a NaN will result in NaN.

  • Any ordered comparison involving a NaN evaluates to false, including NaN == NaN (equality is non-reflexive). The only exception is the inequality operator !=, which evaluates to true when either operand is NaN.

  • NaNs are provided in two forms:

    • Quiet NaNs (qNaN) are used to propagate errors resulting from invalid operations or values. Invalid arithmetic operations generally produce a quiet NaN. They are encoded with the most significant bit of the significand set to 1.

    • Signaling NaNs (sNaN) are designed to raise an invalid-operation exception. Signaling NaNs are generally created explicitly. They are encoded with the most significant bit of the significand set to 0.

    • The exact bit patterns of quiet and signaling NaNs are implementation-defined. CUDA provides cuda::std::numeric_limits<T>::quiet_NaN() and cuda::std::numeric_limits<T>::signaling_NaN() to obtain these special values.

A simplified visualization of the encodings of special values is shown in the following figure:

Floating-Point Representation for Infinity and NaN

where X represents either 0 or 1.
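
The behavior described above can be observed directly in device code; a minimal sketch follows (the kernel name special_values_demo is illustrative; cuda::std::numeric_limits comes from the <cuda/std/limits> header):

#include <cuda/std/limits>
#include <cstdio>

__global__ void special_values_demo() {
    float zero    = 0.0f;
    float pos_inf = cuda::std::numeric_limits<float>::infinity();
    float qnan    = cuda::std::numeric_limits<float>::quiet_NaN();

    printf("+0 == -0       : %d\n", 0.0f == -0.0f);         // 1: signed zeros compare equal
    printf("1/0 is +inf    : %d\n", isinf(1.0f / zero));    // 1: overflow to infinity, no trap
    printf("inf * 0 is NaN : %d\n", isnan(pos_inf * zero)); // 1: invalid operation produces NaN
    printf("NaN == NaN     : %d\n", qnan == qnan);          // 0: NaN compares unequal to itself
}

int main() {
    special_values_demo<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}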

5.5.1.4. Associativity#

It is important to note that the rules and properties of mathematical arithmetic do not directly apply to floating-point arithmetic due to its limited precision. The example below shows single-precision values A, B, and C and the exact mathematical value of their sum computed using different associativity.

\[\begin{split}\begin{aligned} A &= 2^{1} \times 1.00000000000000000000001 \\ B &= 2^{0} \times 1.00000000000000000000001 \\ C &= 2^{3} \times 1.00000000000000000000001 \\ (A + B) + C &= 2^{3} \times 1.01100000000000000000001011 \\ A + (B + C) &= 2^{3} \times 1.01100000000000000000001011 \end{aligned}\end{split}\]

Mathematically, \((A + B) + C\) is equal to \(A + (B + C)\).

Let \(\mathrm{rn}(x)\) denote one rounding step on \(x\). Performing the same computations in single-precision floating-point arithmetic in round-to-nearest mode according to IEEE-754, we obtain:

\[\begin{split}\begin{aligned} A + B &= 2^{1} \times 1.1000000000000000000000110000\ldots \\ \mathrm{rn}(A+B) &= 2^{1} \times 1.10000000000000000000010 \\ B + C &= 2^{3} \times 1.0010000000000000000000100100\ldots \\ \mathrm{rn}(B+C) &= 2^{3} \times 1.00100000000000000000001 \\ A + B + C &= 2^{3} \times 1.0110000000000000000000101100\ldots \\ \mathrm{rn}\big(\mathrm{rn}(A+B) + C\big) &= 2^{3} \times 1.01100000000000000000010 \\ \mathrm{rn}\big(A + \mathrm{rn}(B+C)\big) &= 2^{3} \times 1.01100000000000000000001 \end{aligned}\end{split}\]

For reference, the exact mathematical results are also computed above. The results computed according to IEEE-754 differ from the exact mathematical results. Additionally, the results corresponding to the sums \(\mathrm{rn}(\mathrm{rn}(A + B) + C)\) and \(\mathrm{rn}(A + \mathrm{rn}(B + C))\) differ from each other. In this case, \(\mathrm{rn}(A + \mathrm{rn}(B + C))\) is closer to the correct mathematical result than \(\mathrm{rn}(\mathrm{rn}(A + B) + C)\).

This example shows that seemingly identical computations can produce different results, even when all basic operations comply with IEEE-754.
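
The same effect can be reproduced in a few lines of host code, a minimal sketch using the values of the example (default compilation is assumed, without reassociation flags such as --use_fast_math or -ffast-math):

#include <cmath>
#include <cstdio>

int main() {
    // The values from the example above: (1 + 2^-23) scaled by different powers of two.
    const float m = 1.0f + ldexpf(1.0f, -23);
    const float A = ldexpf(m, 1);
    const float B = ldexpf(m, 0);
    const float C = ldexpf(m, 3);

    float left  = (A + B) + C;   // rn(rn(A + B) + C)
    float right = A + (B + C);   // rn(A + rn(B + C))

    printf("(A + B) + C = %.10e\n", left);
    printf("A + (B + C) = %.10e\n", right);
    printf("equal: %d\n", left == right);  // 0: the two groupings differ in the last place
    return 0;
}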

5.5.1.5. Fused Multiply-Add (FMA)#

The Fused Multiply-Add (FMA) operation computes \(a \times b + c\) with only one rounding step. Without FMA, the result would require two rounding steps: one for the multiplication and one for the addition. Because FMA uses only one rounding step, it produces a more accurate result.

The Fused Multiply-Add operation can affect the propagation of NaNs differently than two separate operations, and FMA NaN handling is not identical across all targets. When more than one operand is NaN, an implementation may prefer a canonical quiet NaN or propagate the payload of one of the operands; IEEE-754 does not strictly mandate a deterministic payload selection order. NaNs may also occur in intermediate computations, for example \(\infty \times 0 + 1\) or \(1 \times \infty - \infty\), resulting in an implementation-defined NaN payload.


For clarity, first consider an example using decimal arithmetic to illustrate how the FMA operation works. We will compute \(x^2 - 1\) using five total digits of precision, with four digits after the decimal point.

  • For \(x = 1.0008\), the correct mathematical result is \(x^2 - 1 = 1.60064 \times 10^{-4}\). The closest number using only four digits after the decimal point is \(1.6006 \times 10^{-4}\).

  • The Fused Multiply-Add operation achieves the correct result using only one rounding step \(\mathrm{rn}(x \times x - 1) = 1.6006 \times 10^{-4}\).

  • The alternative is to compute the multiply and add steps separately. \(x^2 = 1.00160064\) translates to \(\mathrm{rn}(x \times x) = 1.0016\). The final result is \(\mathrm{rn}(\mathrm{rn}(x \times x) -1) = 1.6000 \times 10^{-4}\).

Rounding the multiply and add separately yields a result that is off by \(0.00064 \times 10^{-4}\). The corresponding FMA computation is wrong by only \(0.00004 \times 10^{-4}\), and its result is closest to the correct mathematical answer. The results are summarized below:

\[\begin{split}\begin{aligned} x &= 1.0008 \\ x^{2} &= 1.00160064 \\ x^{2} - 1 &= 1.60064 \times 10^{-4} && \text{true value} \\ \mathrm{rn}\big(x^{2} - 1\big) &= 1.6006 \times 10^{-4} && \text{fused multiply-add} \\ \mathrm{rn}\big(x^{2}\big) &= 1.0016 \\ \mathrm{rn}\big(\mathrm{rn}(x^{2}) - 1\big) &= 1.6000 \times 10^{-4} && \text{multiply, then add} \end{aligned}\end{split}\]

Below is another example, using binary single precision values:

\[\begin{split}\begin{aligned} A &= 2^{0} \times 1.00000000000000000000001 \\ B &= -2^{0} \times 1.00000000000000000000010 \\ \mathrm{rn}\big(A \times A + B\big) &= 2^{-46} \times 1.00000000000000000000000 && \text{fused multiply-add} \\ \mathrm{rn}\big(\mathrm{rn}(A \times A) + B\big) &= 0 && \text{multiply, then add} \end{aligned}\end{split}\]

  • Computing multiplication and addition separately results in the loss of all bits of precision, yielding \(0\).

  • Computing the FMA, on the other hand, provides a result equal to the mathematical value.

Fused multiply-add helps prevent loss of precision during subtractive cancellation. Subtractive cancellation occurs when quantities of similar magnitude with opposite signs are added. In this case, many of the leading bits cancel out, resulting in fewer meaningful bits. The fused multiply-add computes a double-width product during multiplication. Thus, even if subtractive cancellation occurs during addition, there are enough valid bits remaining in the product to yield a precise result.


Fused Multiply-Add Support in CUDA:

CUDA provides the Fused Multiply-Add operation in several ways for both float and double data types:
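
As a hedged sketch of these options (the kernel name fma_variants and the output layout are illustrative), FMA can be requested explicitly through the fmaf()/fma() functions or rounding-mode intrinsics such as __fmaf_rn(), generated implicitly by compiler contraction, or suppressed by using __fmul_rn():

__global__ void fma_variants(const float* a, const float* b, const float* c, float* out) {
    int i = threadIdx.x;
    out[4 * i + 0] = fmaf(a[i], b[i], c[i]);        // explicit FMA: one rounding step
    out[4 * i + 1] = __fmaf_rn(a[i], b[i], c[i]);   // FMA intrinsic with round-to-nearest-even
    out[4 * i + 2] = a[i] * b[i] + c[i];            // may be contracted into FFMA when -fmad=true (default)
    out[4 * i + 3] = __fmul_rn(a[i], b[i]) + c[i];  // never contracted: separate multiply and add
}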


Fused Multiply-Add Support on Host Platforms:

Whether to use the fused operation depends on the availability of the operation on the platform and how the code is compiled. It is important to understand the host platform’s support for Fused Multiply-Add when comparing CPU and GPU results.

5.5.1.6. Dot Product Example#

Consider the problem of finding the dot product of two short vectors \(\overrightarrow{a}\) and \(\overrightarrow{b}\) both with four elements.

\[\begin{split}\overrightarrow{a} = \begin{bmatrix} a_{1} \\ a_{2} \\ a_{3} \\ a_{4} \end{bmatrix} \qquad \overrightarrow{b} = \begin{bmatrix} b_{1} \\ b_{2} \\ b_{3} \\ b_{4} \end{bmatrix} \qquad \overrightarrow{a} \cdot \overrightarrow{b} = a_{1}b_{1} + a_{2}b_{2} + a_{3}b_{3} + a_{4}b_{4}\end{split}\]

Although this operation is easy to write down mathematically, implementing it in software involves several alternatives that could lead to slightly different results. All of the strategies presented here use operations that are fully compliant with IEEE-754.

Example Algorithm 1: The simplest way to compute the dot product is to use a sequential sum of products, keeping the multiplications and additions separate.

The final result can be represented as \(((((a_1 \times b_1) + (a_2 \times b_2)) + (a_3 \times b_3)) + (a_4 \times b_4))\).

Example Algorithm 2: Compute the dot product sequentially using fused multiply-add.

The final result can be represented as \((a_4 \times b_4) + ((a_3 \times b_3) + ((a_2 \times b_2) + (a_1 \times b_1 + 0)))\).

Example Algorithm 3: Compute the dot product using a divide-and-conquer strategy. First, we find the dot products of the first and second halves of the vectors. Then, we combine these results using addition. This algorithm is called the “parallel algorithm” because the two subproblems can be computed in parallel since they are independent of each other. However, the algorithm does not require a parallel implementation; it can be implemented with a single thread.

The final result can be represented as \(((a_1 \times b_1) + (a_2 \times b_2)) + ((a_3 \times b_3) + (a_4 \times b_4))\).
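
A single-threaded sketch of the three orderings follows (the function names are illustrative; fmaf() implements Algorithm 2; on host compilers, -ffp-contract=off keeps the multiplies and adds of Algorithms 1 and 3 from being contracted into FMAs):

#include <cmath>

// Algorithm 1: sequential sum of products, multiplies and adds kept separate.
float dot_separate(const float a[4], const float b[4]) {
    float sum = a[0] * b[0];
    for (int i = 1; i < 4; ++i)
        sum = sum + a[i] * b[i];
    return sum;
}

// Algorithm 2: sequential fused multiply-add, starting from zero.
float dot_fma(const float a[4], const float b[4]) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        sum = fmaf(a[i], b[i], sum);
    return sum;
}

// Algorithm 3: divide-and-conquer ("parallel") ordering, here on a single thread.
float dot_parallel(const float a[4], const float b[4]) {
    float first  = a[0] * b[0] + a[1] * b[1];
    float second = a[2] * b[2] + a[3] * b[3];
    return first + second;
}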

5.5.1.7. Rounding#

The IEEE-754 standard requires support for several operations. These include arithmetic operations such as addition, subtraction, multiplication, division, square root, fused multiply-add, finding the remainder, conversion, scaling, sign, and comparison operations. The results of these operations are guaranteed to be consistent across all implementations of the standard for a given format and rounding mode.


Rounding Modes

The IEEE-754 standard defines four rounding modes: round-to-nearest, round towards positive, round towards negative, and round towards zero. CUDA supports all four modes. By default, operations use round-to-nearest. Intrinsic mathematical functions can be used to select other rounding modes for individual operations.

  • rn: Round to nearest, ties to even

  • rz: Round towards zero

  • ru: Round towards \(+\infty\)

  • rd: Round towards \(-\infty\)
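
For example (a device-only sketch; the kernel name rounding_mode_demo is illustrative), the same addition can be performed under each rounding mode with the __fadd_* intrinsics:

__global__ void rounding_mode_demo(float a, float b, float* out) {
    out[0] = __fadd_rn(a, b);  // rn: round to nearest, ties to even (the default)
    out[1] = __fadd_rz(a, b);  // rz: round towards zero
    out[2] = __fadd_ru(a, b);  // ru: round towards +infinity
    out[3] = __fadd_rd(a, b);  // rd: round towards -infinity
}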

5.5.1.8. Notes on Host/Device Computation Accuracy#

The accuracy of a floating-point computation result is affected by several factors. This section summarizes important considerations for achieving reliable results in floating-point computations. Some of these aspects have been described in greater detail in previous sections.

These aspects are also important when comparing the results between CPU and GPU. Differences between host and device execution must be interpreted carefully. The presence of differences does not necessarily mean the GPU’s result is incorrect or that there is a problem with the GPU.

Associativity:

Floating-point addition and multiplication in finite precision are not associative because they often result in mathematical values that cannot be directly represented in the target format, requiring rounding. The order in which these operations are evaluated affects how rounding errors accumulate and can significantly alter the final result.

Fused Multiply-Add:

Fused Multiply-Add computes \(a \times b + c\) in a single operation, resulting in greater accuracy and a faster execution time. The accuracy of the final result can be affected by its use. Fused Multiply-Add relies on hardware support and can be enabled either explicitly by calling the related function or implicitly through compiler optimization flags.

Precision:

Increasing the floating-point precision can potentially improve the accuracy of the results. Higher precision reduces loss of significance and enables the representation of a wider range of values. However, higher-precision types have lower throughput and consume more registers. Additionally, using them to explicitly store inputs and outputs increases memory usage and data movement.

Compiler Flags and Optimizations:

All major compilers provide a variety of optimization flags to control the behavior of floating-point operations.

  • The highest optimization level for GCC (-O3), Clang (-O3), nvcc (-O3), and Microsoft Visual Studio (/O2) does not affect floating-point semantics. However, inlining, loop unrolling, vectorization, and common subexpression elimination could affect the results. The NVC++ compiler also requires the flags -Kieee -Mnofma for IEEE-754-compliant semantics.

  • Refer to the GCC, Clang, Microsoft Visual Studio Compiler, nvc++, and Arm C/C++ compiler documentation for detailed information about options that affect floating-point behavior.

Library Implementations:

Functions defined outside the IEEE-754 standard are not guaranteed to be correctly rounded and depend on implementation-defined behavior. Therefore, the results may differ across different platforms, including between host, device, and different device architectures.

Deterministic Results:

A deterministic result refers to computing bit-wise identical numerical outputs every time the computation is run with the same inputs under the same specified conditions. Such conditions include:

  • Hardware dependencies, such as execution on the same CPU processor or GPU device.

  • Compiler aspects, such as the version of the compiler and the compiler flags.

  • Run-time conditions that affect the computation, such as rounding mode or environment variables.

  • Identical inputs to the computation.

  • Thread configuration, including the number of threads involved in the computation and their organization, for example block and grid size.

  • The ordering of arithmetic atomic operations, which depends on hardware scheduling and can vary between runs.

Taking Advantage of the CUDA Libraries:

The CUDA Math Libraries, C Standard Library Mathematical functions, and C++ Standard Library Mathematical functions are designed to boost developer productivity for common functionalities, particularly for floating-point math and numerics-intensive routines. These functionalities provide a consistent high-level interface, are optimized, and are widely tested across platforms and edge cases. Users are encouraged to take full advantage of these libraries and avoid tedious manual reimplementations.

5.5.2. Floating-Point Data Types#

CUDA supports the Bfloat16, half-, single-, double-, and quad-precision floating-point data types. The following table summarizes the supported floating-point data types in CUDA and their requirements.

Table 43 Supported Floating-Point Types#

  • Bfloat16: __nv_bfloat16 (not an IEEE-754 format), provided by the <cuda_bf16.h> header. Requires Compute Capability 8.0 or higher.

  • Half Precision: __half (IEEE-754 binary16), provided by the <cuda_fp16.h> header.

  • Single Precision: float (IEEE-754 binary32), built-in.

  • Double Precision: double (IEEE-754 binary64), built-in.

  • Quad Precision: __float128/_Float128 (IEEE-754 binary128), built-in, with <crt/device_fp128_functions.h> providing the mathematical functions. Requires host compiler support and Compute Capability 10.0 or higher. The C or C++ spelling, _Float128 or __float128 respectively, also depends on the host compiler support.

CUDA also supports TensorFloat-32 (TF32), microscaling (MX) floating-point types, and other lower precision numerical formats that are not intended for general-purpose computation, but rather for specialized purposes involving tensor cores. These include 4-, 6-, and 8-bit floating-point types. See the CUDA Math API for more details.

The following figure reports the mantissa and exponent sizes of the supported floating-point data types.

Floating-Point Types: Mantissa and Exponent sizes

The following table reports the ranges of the supported floating-point data types.

Table 44 Supported Floating-Point Types Properties#

  • Bfloat16: largest value \(\approx 2^{128} \approx 3.39 \cdot 10^{38}\); smallest positive value \(2^{-126} \approx 1.18 \cdot 10^{-38}\); smallest positive denormal \(2^{-133}\); epsilon \(2^{-7}\).

  • Half Precision: largest value \(\approx 2^{16}\) (\(65504\)); smallest positive value \(2^{-14} \approx 6.1 \cdot 10^{-5}\); smallest positive denormal \(2^{-24}\); epsilon \(2^{-10}\).

  • Single Precision: largest value \(\approx 2^{128} \approx 3.39 \cdot 10^{38}\); smallest positive value \(2^{-126} \approx 1.18 \cdot 10^{-38}\); smallest positive denormal \(2^{-149}\); epsilon \(2^{-23}\).

  • Double Precision: largest value \(\approx 2^{1024} \approx 1.8 \cdot 10^{308}\); smallest positive value \(2^{-1022} \approx 2.22 \cdot 10^{-308}\); smallest positive denormal \(2^{-1074}\); epsilon \(2^{-52}\).

  • Quad Precision: largest value \(\approx 2^{16384} \approx 1.19 \cdot 10^{4932}\); smallest positive value \(2^{-16382} \approx 3.36 \cdot 10^{-4932}\); smallest positive denormal \(2^{-16494}\); epsilon \(2^{-112}\).

Hint

The CUDA C++ Standard Library provides cuda::std::numeric_limits in the <cuda/std/limits> header to query the properties and the ranges of the supported floating-point types, including microscaling formats (MX). See the C++ reference for the list of queryable properties.
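
For instance, a small host-side sketch of the query interface (the same calls also compile in device code):

#include <cuda/std/limits>
#include <cstdio>

int main() {
    using limits = cuda::std::numeric_limits<float>;
    printf("max        = %g\n", limits::max());         // largest finite value
    printf("min        = %g\n", limits::min());         // smallest positive normal value
    printf("denorm_min = %g\n", limits::denorm_min());  // smallest positive subnormal value
    printf("epsilon    = %g\n", limits::epsilon());     // 2^-23 for float
    printf("digits     = %d\n", limits::digits);        // significand bits, including the implicit bit
    return 0;
}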

Complex numbers support:
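
The CUDA C++ Standard Library provides cuda::std::complex through the <cuda/std/complex> header, usable in both host and device code; a minimal sketch (the kernel name complex_demo is illustrative):

#include <cuda/std/complex>
#include <cstdio>

__global__ void complex_demo(cuda::std::complex<float>* out) {
    cuda::std::complex<float> a(1.0f, 2.0f);
    cuda::std::complex<float> b(3.0f, -1.0f);
    *out = a * b;  // complex arithmetic in device code; expected result: 5 + 5i
}

int main() {
    cuda::std::complex<float>* d_out;
    cudaMalloc(&d_out, sizeof(*d_out));
    complex_demo<<<1, 1>>>(d_out);
    cuda::std::complex<float> h_out;
    cudaMemcpy(&h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("(1+2i) * (3-1i) = %g + %gi\n", h_out.real(), h_out.imag());
    cudaFree(d_out);
    return 0;
}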


5.5.3. CUDA and IEEE-754 Compliance#

All GPU devices follow the IEEE 754-2019 standard for binary floating-point arithmetic with the following limitations:

  • There is no dynamically configurable rounding mode; however, most operations support multiple constant IEEE rounding modes, selectable via specifically named device intrinsic functions.

  • There is no mechanism to detect floating-point exceptions, so all operations behave as if IEEE-754 exceptions are always masked. If an exceptional event occurs, the default masked response defined by IEEE-754 is delivered. For this reason, although signaling NaN (sNaN) encodings are supported, they do not signal and are handled as quiet NaNs.

  • Floating-point operations may alter the bit patterns of input NaN payloads. Operations such as absolute value and negation may also not comply with the IEEE 754 requirement, which could result in the sign of a NaN being updated in an implementation-defined manner.

To maximize the portability of results, users are recommended to use the default settings of the nvcc compiler’s floating-point options: -ftz=false, -prec-div=true, and -prec-sqrt=true, and not use the --use_fast_math option. Note that floating-point expression re-associations and contractions are allowed by default, similarly to the --fmad=true option. See also the nvcc User Manual for a detailed description of these compilation flags.

The IEEE-754 and C/C++ language standards do not explicitly address the conversion of a floating-point value to an integer value when the rounded-to-integer value falls outside the range of the target integer format. On GPU devices, the conversion instructions clamp the result to the range of the target type, as delineated in the PTX ISA conversion instructions section. However, compiler optimizations may exploit this unspecified behavior when an out-of-range conversion is not invoked directly via a PTX instruction, resulting in undefined behavior and an invalid CUDA program. The CUDA Math documentation issues warnings on a per-function/intrinsic basis; for instance, see __double2int_rz(). This behavior may differ from that of host compilers and library implementations.

Atomic Functions Denormals Behavior:

Atomic operations have the following behavior regarding floating-point denormals, regardless of the setting of the compiler flag -ftz:

  • Atomic single-precision floating-point adds on global memory always operate in flush-to-zero mode; that is, they behave equivalently to the PTX add.rn.ftz.f32 semantics.

  • Atomic single-precision floating-point adds on shared memory always operate with denormal support; that is, they behave equivalently to the PTX add.rn.f32 semantics.
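
A device-side sketch of the two cases (the kernel name atomic_denormal_demo and the buffer names are illustrative):

__global__ void atomic_denormal_demo(float* global_sum, const float* values, int n) {
    __shared__ float block_sum;
    if (threadIdx.x == 0) block_sum = 0.0f;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Global-memory atomic add: always flush-to-zero (PTX add.rn.ftz.f32).
        atomicAdd(global_sum, values[i]);
        // Shared-memory atomic add: always keeps denormals (PTX add.rn.f32).
        atomicAdd(&block_sum, values[i]);
    }
    // block_sum would normally be combined into a global result after a __syncthreads().
}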

5.5.4. CUDA and C/C++ Compliance#

Floating-Point Exceptions:

Unlike the host implementation, the mathematical operators and functions supported in device code do not set the global errno variable nor report floating-point exceptions to indicate errors. Thus, if error diagnostic mechanisms are required, users should implement additional input and output screening for the functions.

Undefined Behavior with Floating-Point Operations:

Common conditions of undefined behavior for mathematical operations include:

  • Invalid arguments to mathematical operators and functions:

    • Using an uninitialized floating-point variable.

    • Using a floating-point variable outside its lifetime.

    • Signed integer overflow.

    • Dereferencing an invalid pointer.

  • Floating-point specific undefined behavior:

    • Converting a floating-point value to an integer type for which the result is not representable is undefined behavior. This also includes NaN and infinity.

Users are responsible for ensuring the validity of a CUDA program. Invalid arguments may result in undefined behavior and be subject to compiler optimizations.

Contrary to integer division by zero, floating-point division by zero is not undefined behavior and not subject to compiler optimizations; rather, it is implementation-specific behavior. C++ implementations that conform to IEC-60559 (IEEE-754), including CUDA, produce infinity. Note that invalid floating-point operations produce NaN and should not be misinterpreted as undefined behavior. Examples include zero divided by zero and infinity divided by infinity.

Floating-Point Literals Portability:

Both C and C++ allow for the representation of floating-point values in either decimal or hexadecimal notation. Hexadecimal floating-point literals, which are supported in C99 and C++17, denote a real value in scientific notation that can be precisely expressed in base-2. However, this does not guarantee that the literal will map to an actual value stored in a target variable (see the next paragraph). Conversely, a decimal floating-point literal may represent a numeric value that cannot be expressed in base-2.

According to the C++ standard rules, hexadecimal and decimal floating-point literals are rounded to the nearest representable value, larger or smaller, chosen in an implementation-defined manner. This rounding behavior may differ between the host and the device.

float f1 = 0.5f;    // 0.5, '0.5f' is a decimal floating-point literal
float f2 = 0x1p-1f; // 0.5, '0x1p-1f' is a hexadecimal floating-point literal
float f3 = 0.1f;
// f1, f2 are represented as 0 01111110 00000000000000000000000
// f3     is represented as  0 01111011 10011001100110011001101

The run-time and compile-time evaluations of the same floating-point expression are subject to the following portability issues:

  • The run-time evaluation of a floating-point expression may be affected by the selected rounding mode, floating-point contraction (FMA) and reassociation compiler settings, as well as floating-point exceptions. Note that CUDA does not support floating-point exceptions and the rounding mode is always round-to-nearest-ties-to-even.

  • The compiler may use a higher-precision internal representation for constant expressions.

  • The compiler may perform optimizations, such as constant folding, constant propagation, and common subexpression elimination, which can lead to a different final value or comparison result.

C Standard Math Library Notes:

The host implementations of common mathematical functions are mapped to C Standard Math Library functions in a platform-specific way. These functions are provided by the host compiler and the respective host libm, if available.

  • Functions not available from the host compilers are implemented in the crt/math_functions.h header file. For example, erfinv() is implemented there.

  • Less common functions, such as rhypot() and cyl_bessel_i0(), are only available in the device code.

As previously mentioned, the host and device implementations of mathematical functions are independent. For more details on the behavior of these functions, please refer to the host implementation’s documentation.


5.5.5. Floating-Point Functionality Exposure#

The mathematical functions supported by CUDA are exposed through the following methods:

Built-in C/C++ language arithmetic operators:

  • x + y, x - y, x * y, x / y, x++, x--, +=, -=, *=, /=.

  • Support single-, double-, and quad-precision types, float, double, and __float128/_Float128 respectively.

    • __half and __nv_bfloat16 types are also supported by including the <cuda_fp16.h> and <cuda_bf16.h> headers, respectively.

    • __float128/_Float128 type support relies on the host compiler and device compute capability, see the Supported Floating-Point Types table.

  • They are available in both host and device code.

  • Their behavior is affected by the nvcc optimization flags.

CUDA C++ Standard Library Mathematical functions:

  • Expose the full set of C++ <cmath> header functions through the <cuda/std/cmath> header and the cuda::std:: namespace.

  • Support IEEE-754 standard floating-point types, __half, float, double, __float128, as well as Bfloat16 __nv_bfloat16.

  • They are available in both host and device code.

  • They often rely on the CUDA Math API functions. Therefore, there could be different levels of accuracy between the host and device code.

  • Their behavior is affected by the nvcc optimization flags.

  • A subset of functionalities is also supported on constant expressions, such as constexpr functions, in accordance with the C++23 and C++26 standard specifications.
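
For example (a minimal sketch; the function names are illustrative), the same cuda::std:: call can be shared between host and device code:

#include <cuda/std/cmath>
#include <cstdio>

// The same cuda::std:: call compiles for both the host and the device.
__host__ __device__ float hypotenuse(float x, float y) {
    return cuda::std::sqrt(x * x + y * y);
}

__global__ void hypotenuse_kernel(float* out) {
    *out = hypotenuse(3.0f, 4.0f);  // device path
}

int main() {
    printf("host result: %g\n", hypotenuse(3.0f, 4.0f));  // host path, prints 5
    return 0;
}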

CUDA C Standard Library Mathematical functions (CUDA Math API):

  • Expose a subset of the C <math.h> header functions.

  • Support single and double-precision types, float and double respectively.

    • They are available in both host and device code.

    • They don’t require additional headers.

    • Their behavior is affected by the nvcc optimization flags.

  • A subset of the <math.h> header functionalities is also available for __half, __nv_bfloat16, and __float128/_Float128 types. These functions have names that resemble those of the C Standard Library.

    • __half and __nv_bfloat16 types require the <cuda_fp16.h> and <cuda_bf16.h> headers, respectively. Their host and device code availability is defined on a per-function basis.

    • __float128/_Float128 type support relies on the host compiler and device compute capability, see the Supported Floating-Point Types table. The related functions require the crt/device_fp128_functions.h header and they are only available in device code.

  • They can have a different accuracy between host and device code.

Non-standard CUDA Mathematical functions (CUDA Math API):

  • Expose mathematical functionalities that are not part of the C/C++ Standard Library.

  • Mainly support single- and double-precision types, float and double respectively.

    • Their host and device code availability is defined on a per-function basis.

    • They don’t require additional headers.

    • They can have a different accuracy between host and device code.

  • __nv_bfloat16, __half, __float128/_Float128 are supported for a limited set of functions.

    • __half and __nv_bfloat16 types require the <cuda_fp16.h> and <cuda_bf16.h> headers, respectively.

    • __float128/_Float128 type support relies on the host compiler and device compute capability, see the Supported Floating-Point Types table. The related functions require the crt/device_fp128_functions.h header.

    • They are only available in device code.

  • Their behavior is affected by the nvcc optimization flags.

Intrinsic Mathematical functions (CUDA Math API):

  • Support single- and double-precision types, float and double respectively.

  • They are only available in device code.

  • They are faster but less accurate than the respective CUDA Math API functions.

  • Their behavior is not affected by the nvcc floating-point optimization flags -prec-div=false, -prec-sqrt=false, and -fmad=true. The only exception is -ftz=true, which is also included in --use_fast_math.

Table 45 Summary of Math Functionality Features#

Functionality

Supported Types

Host

Device

Affected by Floating-Point Optimization Flags
(only for float and double)

Built-in C/C++ language arithmetic operators

float, double, __half, __nv_bfloat16, __float128/_Float128, cuda::std::complex

CUDA C++ Standard Library Mathematical functions

float, double, __half, __nv_bfloat16, __float128, cuda::std::complex

__nv_fp8_e4m3, __nv_fp8_e5m2, __nv_fp8_e8m0, __nv_fp6_e2m3, __nv_fp6_e3m2, __nv_fp4_e2m1 *

CUDA C Standard Library Mathematical functions

float, double

__nv_bfloat16, __half with limited support and similar names

On a per-function basis

__float128/_Float128 with limited support and similar names

Non-standard CUDA Mathematical functions

float, double

On a per-function basis

__nv_bfloat16, __half, __float128/_Float128 with limited support

Intrinsic functions

float, double

Only with -ftz=true, also included in --use_fast_math

* The CUDA C++ Standard Library functions support queries for small floating-point types, such as numeric_limits<T>, fpclassify(), isfinite(), isnormal(), isinf(), and isnan().

The following sections provide accuracy information for some of these functions, when applicable, quantified in ULP. For more information on the definition of the Unit in the Last Place (ULP), please see Jean-Michel Muller’s paper On the definition of ulp(x).


5.5.6. Built-In Arithmetic Operators#

The built-in C/C++ language operators, such as x + y, x - y, x * y, x / y, x++, x--, and reciprocal 1 / x, for single-, double-, and quad-precision types comply with the IEEE-754 standard. They guarantee a maximum ULP error of zero using a round-to-nearest-even rounding mode. They are available in both host and device code.

The nvcc compilation flag -fmad=true, also included in --use_fast_math, enables contraction of floating-point multiplies and adds/subtracts into floating-point multiply-add operations and has the following effect on the maximum ULP error for the single-precision type float:

The nvcc compilation flag -prec-div=false, also included in --use_fast_math, has the following effect on the maximum ULP error for the division operator / for the single-precision type float:


5.5.7. CUDA C++ Mathematical Standard Library Functions#

CUDA provides comprehensive support for C++ Standard Library mathematical functions through the cuda::std:: namespace. These functionalities are part of the <cuda/std/cmath> header and are available in both host and device code.

The following sections specify the mapping with the CUDA Math APIs and the error bounds of each function when executed on the device.

  • The maximum ULP error is defined as the absolute value of the difference, in ULPs, between the result returned by the function and a correctly rounded result of the corresponding precision, obtained according to the round-to-nearest ties-to-even rounding mode.

  • The error bounds are derived from extensive, though not exhaustive, testing. Therefore, they are not guaranteed.

5.5.7.1. Basic Operations#

The CUDA Math API functions for basic operations are available in both host and device code, except for the __float128 variants.

All the following functions have a maximum ULP error of zero.

Table 46 C++ Mathematical Standard Library Functions
C Math API Mapping
Basic Operations
#

cuda::std Function

Meaning

__nv_bfloat16

__half

float

double

__float128

fabs(x)

\(|x|\)

__habs(x)

__habs(x)

fabsf(x)

fabs(x)

__nv_fp128_fabs(x)

fmod(x, y)

Remainder of \(\dfrac{x}{y}\), computed as \(x - \mathrm{trunc}\left(\dfrac{x}{y}\right) \cdot y\)

N/A

N/A

fmodf(x, y)

fmod(x, y)

__nv_fp128_fmod(x, y)

remainder(x, y)

Remainder of \(\dfrac{x}{y}\), computed as \(x - \mathrm{rint}\left(\dfrac{x}{y}\right) \cdot y\)

N/A

N/A

remainderf(x, y)

remainder(x, y)

__nv_fp128_remainder(x, y)

remquo(x, y, iptr)

Remainder and quotient of \(\dfrac{x}{y}\)

N/A

N/A

remquof(x, y, iptr)

remquo(x, y, iptr)

N/A

fma(x, y, z)

\(x \cdot y + z\)

__hfma(x, y, z), device-only

__hfma(x, y, z), device-only

fmaf(x, y, z)

fma(x, y, z)

__nv_fp128_fma(x, y, z)

fmax(x, y)

\(\max(x, y)\)

__hmax(x, y)

__hmax(x, y)

fmaxf(x, y)

fmax(x, y)

__nv_fp128_fmax(x, y)

fmin(x, y)

\(\min(x, y)\)

__hmin(x, y)

__hmin(x, y)

fminf(x, y)

fmin(x, y)

__nv_fp128_fmin(x, y)

fdim(x, y)

\(\max(x-y, 0)\)

N/A

N/A

fdimf(x, y)

fdim(x, y)

__nv_fp128_fdim(x, y)

nan(str)

NaN value from string representation

N/A

N/A

nanf(str)

nan(str)

N/A

* Mathematical functions marked with “N/A” are not natively available for CUDA-extended floating-point types, such as __half and __nv_bfloat16. In these cases, the functions are emulated by converting to a float type and then converting the result back.

5.5.7.2. Exponential Functions#

The CUDA Math API exponential functions are available in both host and device code only for the float and double types.

Table 47 C++ Mathematical Standard Library Functions
C Math API Mapping and Accuracy (Maximal ULP)
Exponential Functions
#

cuda::std Function

Meaning

__nv_bfloat16

__half

float

double

__float128


exp(x)


\(e^x\)

hexp(x)

0 ULP

hexp(x)

0 ULP

expf(x)

2 ULP

exp(x)

1 ULP

__nv_fp128_exp(x)

1 ULP


exp2(x)


\(2^x\)

hexp2(x)

0 ULP

hexp2(x)

0 ULP

exp2f(x)

2 ULP

exp2(x)

1 ULP

__nv_fp128_exp2(x)

1 ULP


expm1(x)


\(e^x - 1\)


N/A


N/A

expm1f(x)

1 ULP

expm1(x)

1 ULP

__nv_fp128_expm1(x)

1 ULP


log(x)


\(\ln(x)\)

hlog(x)

0 ULP

hlog(x)

0 ULP

logf(x)

1 ULP

log(x)

1 ULP

__nv_fp128_log(x)

1 ULP


log10(x)


\(\log_{10}(x)\)

hlog10(x)

0 ULP

hlog10(x)

0 ULP

log10f(x)

2 ULP

log10(x)

1 ULP

__nv_fp128_log10(x)

1 ULP


log2(x)


\(\log_2(x)\)

hlog2(x)

0 ULP

hlog2(x)

0 ULP

log2f(x)

1 ULP

log2(x)

1 ULP

__nv_fp128_log2(x)

1 ULP


log1p(x)


\(\ln(1+x)\)


N/A


N/A

log1pf(x)

1 ULP

log1p(x)

1 ULP

__nv_fp128_log1p(x)

1 ULP

* Mathematical functions marked with “N/A” are not natively available for CUDA-extended floating-point types, such as __half and __nv_bfloat16. In these cases, the functions are emulated by converting to a float type and then converting the result back.

5.5.7.3. Power Functions#

The CUDA Math API power functions are available in both host and device code only for the float and double types.

Table 48 C++ Mathematical Standard Library Functions
C Math API Mapping and Accuracy (Maximal ULP)
Power Functions
#

cuda::std Function

Meaning

__nv_bfloat16

__half

float

double

__float128


pow(x, y)


\(x^y\)


N/A


N/A

powf(x, y)

4 ULP

pow(x, y)

2 ULP

__nv_fp128_pow(x, y)

1 ULP


sqrt(x)


\(\sqrt{x}\)

hsqrt(x)

0 ULP

hsqrt(x)

0 ULP

sqrtf(x)

▪ 0 ULP
▪ 1 ULP with --use_fast_math

sqrt(x)

0 ULP

__nv_fp128_sqrt(x)

0 ULP


cbrt(x)


\(\sqrt[3]{x}\)


N/A


N/A

cbrtf(x)

1 ULP

cbrt(x)

1 ULP


N/A


hypot(x, y)


\(\sqrt{x^2 + y^2}\)


N/A


N/A

hypotf(x, y)

3 ULP

hypot(x, y)

2 ULP

__nv_fp128_hypot(x, y)

1 ULP

* Mathematical functions marked with “N/A” are not natively available for CUDA-extended floating-point types, such as __half and __nv_bfloat16. In these cases, the functions are emulated by converting to a float type and then converting the result back.

5.5.7.4. Trigonometric Functions#

The CUDA Math API trigonometric functions are available in both host and device code only for the float and double types.

Table 49 C++ Mathematical Standard Library Functions
C Math API Mapping and Accuracy (Maximal ULP)
Trigonometric Functions
#

cuda::std Function

Meaning

__nv_bfloat16

__half

float

double

__float128


sin(x)


\(\sin(x)\)

hsin(x)

0 ULP

hsin(x)

0 ULP

sinf(x)

2 ULP

sin(x)

2 ULP

__nv_fp128_sin(x)

1 ULP


cos(x)


\(\cos(x)\)

hcos(x)

0 ULP

hcos(x)

0 ULP

cosf(x)

2 ULP

cos(x)

2 ULP

__nv_fp128_cos(x)

1 ULP


tan(x)


\(\tan(x)\)


N/A


N/A

tanf(x)

4 ULP

tan(x)

2 ULP

__nv_fp128_tan(x)

1 ULP


asin(x)


\(\sin^{-1}(x)\)


N/A


N/A

asinf(x)

2 ULP

asin(x)

2 ULP

__nv_fp128_asin(x)

1 ULP


acos(x)


\(\cos^{-1}(x)\)


N/A


N/A

acosf(x)

2 ULP

acos(x)

2 ULP

__nv_fp128_acos(x)

1 ULP


atan(x)


\(\tan^{-1}(x)\)


N/A


N/A

atanf(x)

2 ULP

atan(x)

2 ULP

__nv_fp128_atan(x)

1 ULP


atan2(y, x)


\(\tan^{-1}\left(\dfrac{y}{x}\right)\)


N/A


N/A

atan2f(y, x)

3 ULP

atan2(y, x)

2 ULP


N/A

* Mathematical functions marked with “N/A” are not natively available for CUDA-extended floating-point types, such as __half and __nv_bfloat16. In these cases, the functions are emulated by converting to a float type and then converting the result back.

5.5.7.5. Hyperbolic Functions#

The CUDA Math API hyperbolic functions are available in both host and device code only for the float and double types.

Table 50 C++ Mathematical Standard Library Functions
C Math API Mapping and Accuracy (Maximum ULP)
Hyperbolic Functions
#

cuda::std Function

Meaning

__nv_bfloat16

__half

float

double

__float128


sinh(x)


\(\sinh(x)\)


N/A


N/A

sinhf(x)

3 ULP

sinh(x)

2 ULP

__nv_fp128_sinh(x)

1 ULP


cosh(x)


\(\cosh(x)\)


N/A


N/A

coshf(x)

2 ULP

cosh(x)

1 ULP

__nv_fp128_cosh(x)

1 ULP


tanh(x)


\(\tanh(x)\)

htanh(x)

0 ULP

htanh(x)

0 ULP

tanhf(x)

2 ULP

tanh(x)

1 ULP

__nv_fp128_tanh(x)

1 ULP


asinh(x)


\(\operatorname{sinh}^{-1}(x)\)


N/A


N/A

asinhf(x)

3 ULP

asinh(x)

3 ULP

__nv_fp128_asinh(x)

1 ULP


acosh(x)


\(\operatorname{cosh}^{-1}(x)\)


N/A


N/A

acoshf(x)

4 ULP

acosh(x)

3 ULP

__nv_fp128_acosh(x)

1 ULP


atanh(x)


\(\operatorname{tanh}^{-1}(x)\)


N/A


N/A

atanhf(x)

3 ULP

atanh(x)

2 ULP

__nv_fp128_atanh(x)

1 ULP

* Mathematical functions marked with “N/A” are not natively available for CUDA-extended floating-point types, such as __half and __nv_bfloat16. In these cases, the functions are emulated by converting to a float type and then converting the result back.

5.5.7.6. Error and Gamma Functions#

The CUDA Math API error and gamma functions are available in both host and device code for the float and double types.

Error and Gamma functions are not natively available for CUDA-extended floating-point types, such as __half and __nv_bfloat16. In these cases, the functions are emulated by converting to a float type and then converting the result back.

Table 51 C++ Mathematical Standard Library Functions
C Math API Mapping and Accuracy (Maximum ULP)
Error and Gamma Functions
#

cuda::std Function

Meaning

float

double


erf(x)


\(\dfrac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} dt\)

erff(x)

2 ULP

erf(x)

2 ULP


erfc(x)


\(1 - \mathrm{erf}(x)\)

erfcf

4 ULP

erfc

5 ULP


tgamma(x)


\(\Gamma(x)\)

tgammaf(x)

5 ULP

tgamma(x)

10 ULP


lgamma(x)


\(\ln |\Gamma(x)|\)

lgammaf(x)

▪ 6 ULP for \(x \notin [-10.001, -2.264]\)
▪ larger otherwise

lgamma(x)

▪ 4 ULP for \(x \notin [-23.0001, -2.2637]\)
▪ larger otherwise

5.5.7.7. Nearest Integer Floating-Point Operations#

The CUDA Math API functions for nearest integer floating-point operations are available in both host and device code only for the float and double types.

All the following functions have a maximum ULP error of zero.

Table 52 C++ Mathematical Standard Library Functions
C Math API Mapping
Nearest Integer Floating-Point Operations
#

cuda::std Function

Meaning

__nv_bfloat16

__half

float

double

__float128

ceil(x)

\(\lceil x \rceil\)

hceil(x)

hceil(x)

ceilf(x)

ceil(x)

__nv_fp128_ceil(x)

floor(x)

\(\lfloor x \rfloor\)

hfloor(x)

hfloor(x)

floorf(x)

floor(x)

__nv_fp128_floor(x)

trunc(x)

Truncate to integer

htrunc(x)

htrunc(x)

truncf(x)

trunc(x)

__nv_fp128_trunc(x)

round(x)

Round to nearest integer, ties away from zero

N/A

N/A

roundf(x)

round(x)

__nv_fp128_round(x)

nearbyint(x)

Round to integer, ties to even

N/A

N/A

nearbyintf(x)

nearbyint(x)

N/A

rint(x)

Round to integer, ties to even

hrint(x)

hrint(x)

rintf(x)

rint(x)

__nv_fp128_rint(x)

lrint(x)

Round to nearest integer, ties to even (returns long int)

N/A

N/A

lrintf(x)

lrint(x)

N/A

llrint(x)

Round to nearest integer, ties to even (returns long long int)

N/A

N/A

llrintf(x)

llrint(x)

N/A

lround(x)

Round to nearest integer, ties away from zero (returns long int)

N/A

N/A

lroundf(x)

lround(x)

N/A

llround(x)

Round to nearest integer, ties away from zero (returns long long int)

N/A

N/A

llroundf(x)

llround(x)

N/A

* Mathematical functions marked with “N/A” are not natively available for CUDA-extended floating-point types, such as __half and __nv_bfloat16. In these cases, the functions are emulated by converting to a float type and then converting the result back.

Performance Considerations

The recommended way to round a single- or double-precision floating-point operand to an integer is to use the functions rintf() and rint(), not roundf() and round(). This is because roundf() and round() map to multiple instructions in device code, whereas rintf() and rint() map to a single instruction. truncf(), trunc(), ceilf(), ceil(), floorf(), and floor() each map to a single instruction as well.

5.5.7.8. Floating-Point Manipulation Functions#

The CUDA Math API floating-point manipulation functions are available in both host and device code, except for the __float128 variants.

Floating-point manipulation functions are not natively available for CUDA-extended floating-point types, such as __half and __nv_bfloat16. In these cases, the functions are emulated by converting to a float type and then converting the result back.

All the following functions have a maximum ULP error of zero.

Table 53 C++ Mathematical Standard Library Functions
C Math API Mapping
Floating-Point Manipulation Functions
#

cuda::std Function

Meaning

float

double

__float128

frexp(x, exp)

Extract mantissa and exponent

frexpf(x, exp)

frexp(x, exp)

__nv_fp128_frexp(x, nptr)

ldexp(x, n)

\(x \cdot 2^{\mathrm{n}}\)

ldexpf(x, n)

ldexp(x, n)

__nv_fp128_ldexp(x, n)

modf(x, iptr)

Extract integer and fractional parts

modff(x, iptr)

modf(x, iptr)

__nv_fp128_modf(x, iptr)

scalbn(x, n)

\(x \cdot 2^n\)

scalbnf(x, n)

scalbn(x, n)

N/A

scalbln(x, n)

\(x \cdot 2^n\)

scalblnf(x, n)

scalbln(x, n)

N/A

ilogb(x)

\(\lfloor \log_2(|x|) \rfloor\)

ilogbf(x)

ilogb(x)

__nv_fp128_ilogb(x)

logb(x)

\(\lfloor \log_2(|x|) \rfloor\)

logbf(x)

logb(x)

N/A

nextafter(x, y)

Next representable value toward \(y\)

nextafterf(x, y)

nextafter(x, y)

N/A

copysign(x, y)

Copy sign of \(y\) to \(x\)

copysignf(x, y)

copysign(x, y)

__nv_fp128_copysign(x, y)

5.5.7.9. Classification and Comparison#

The CUDA Math API classification and comparison functions are available in both host and device code, except for the __float128 variants.

All the following functions have a maximum ULP error of zero.

Table 54 C++ Mathematical Standard Library Functions
C Math API Mapping
Classification and Comparison Functions
#

cuda::std Function

Meaning

__nv_bfloat16

__half

float

double

__float128

fpclassify(x)

Classify \(x\)

N/A

N/A

N/A

N/A

N/A

isfinite(x)

Check if \(x\) is finite

N/A

N/A

isfinite(x)

isfinite(x)

N/A

isinf(x)

Check if \(x\) is infinite

__hisinf(x)

__hisinf(x)

isinf(x)

isinf(x)

N/A

isnan(x)

Check if \(x\) is NaN

__hisnan(x)

__hisnan(x)

isnan(x)

isnan(x)

__nv_fp128_isnan(x)

isnormal(x)

Check if \(x\) is normal

N/A

N/A

N/A

N/A

N/A

signbit(x)

Check if sign bit is set

N/A

N/A

signbit(x)

signbit(x)

N/A

isgreater(x, y)

Check if \(x > y\)

__hgt(x, y)

__hgt(x, y)

N/A

N/A

N/A

isgreaterequal(x, y)

Check if \(x \geq y\)

__hge(x, y)

__hge(x, y)

N/A

N/A

N/A

isless(x, y)

Check if \(x < y\)

__hlt(x, y)

__hlt(x, y)

N/A

N/A

N/A

islessequal(x, y)

Check if \(x \leq y\)

__hle(x, y)

__hle(x, y)

N/A

N/A

N/A

islessgreater(x, y)

Check if \(x < y\) or \(x > y\)

__hge(x, y)

__hge(x, y)

N/A

N/A

N/A

isunordered(x, y)

Check if \(x\), \(y\), or both are NaN

N/A

N/A

N/A

N/A

__nv_fp128_isunordered(x, y)

* Mathematical functions marked with “N/A” are not natively available for CUDA-extended floating-point types, such as __half and __nv_bfloat16.

5.5.8. Non-Standard CUDA Mathematical Functions#

CUDA provides mathematical functions that are not part of the C/C++ Standard Library and are instead offered as extensions. For single- and double-precision functions, host and device code availability is defined on a per-function basis.

This section specifies the error bounds of each function when executed on the device.

  • The maximum ULP error is defined as the absolute value of the difference, in ULPs, between the result returned by the function and a correctly rounded result of the corresponding precision, obtained according to the round-to-nearest ties-to-even rounding mode.

  • The error bounds are derived from extensive, though not exhaustive, testing. Therefore, they are not guaranteed.

Table 55 Non-standard CUDA Mathematical functions
float and double
Mapping and Accuracy (Maximum ULP)
#

Meaning

float

double

\(\dfrac{x}{y}\)

fdividef(x, y), device-only

0 ULP, same as x / y


N/A


\(10^x\)

exp10f(x)

2 ULP

exp10(x)

1 ULP


\(\sqrt{x^2 + y^2 + z^2}\)

norm3df(x, y, z), device-only

3 ULP

norm3d(x, y, z), device-only

2 ULP


\(\sqrt{x^2 + y^2 + z^2 + t^2}\)

norm4df(x, y, z, t), device-only

3 ULP

norm4d(x, y, z, t), device-only

2 ULP


\(\sqrt{\sum_{i=0}^{\mathrm{dim}-1} p_i^{2}}\)

normf(dim, p), device-only

An error bound cannot be provided because a fast algorithm is used with accuracy loss due to round-off

norm(dim, p), device-only

An error bound cannot be provided because a fast algorithm is used with accuracy loss due to round-off

\(\dfrac{1}{\sqrt{x}}\)

rsqrtf(x)

2 ULP

rsqrt(x)

1 ULP

\(\dfrac{1}{\sqrt[3]{x}}\)

rcbrtf(x)

1 ULP

rcbrt(x)

1 ULP

\(\dfrac{1}{\sqrt{x^2 + y^2}}\)

rhypotf(x, y), device-only

2 ULP

rhypot(x, y), device-only

1 ULP

\(\dfrac{1}{\sqrt{x^2 + y^2 + z^2}}\)

rnorm3df(x, y, z), device-only

2 ULP

rnorm3d(x, y, z), device-only

1 ULP

\(\dfrac{1}{\sqrt{x^2 + y^2 + z^2 + t^2}}\)

rnorm4df(x, y, z, t), device-only

2 ULP

rnorm4d(x, y, z, t), device-only

1 ULP


\(\dfrac{1}{\sqrt{\sum_{i=0}^{\mathrm{dim}-1} p_i^{2}}}\)

rnormf(dim, p), device-only

An error bound cannot be provided because a fast algorithm is used with accuracy loss due to round-off

rnorm(dim, p), device-only

An error bound cannot be provided because a fast algorithm is used with accuracy loss due to round-off


\(\cos(\pi x)\)

cospif(x)

1 ULP

cospi(x)

2 ULP


\(\sin(\pi x)\)

sinpif(x)

1 ULP

sinpi(x)

2 ULP


\(\sin(\pi x), \cos(\pi x)\)

sincospif(x, sptr, cptr)

1 ULP

sincospi(x, sptr, cptr)

2 ULP


\(\Phi(x)\)

normcdff(x)

5 ULP

normcdf(x)

5 ULP


\(\Phi^{-1}(x)\)

normcdfinvf(x)

5 ULP

normcdfinv(x)

8 ULP


\(\mathrm{erfc}^{-1}(x)\)

erfcinvf(x)

4 ULP

erfcinv(x)

6 ULP


\(e^{x^2}\mathrm{erfc}(x)\)

erfcxf(x)

4 ULP

erfcx(x)

4 ULP


\(\mathrm{erf}^{-1}(x)\)

erfinvf(x)

2 ULP

erfinv(x)

5 ULP


\(I_0(x)\)

cyl_bessel_i0f(x), device-only

6 ULP

cyl_bessel_i0(x), device-only

6 ULP


\(I_1(x)\)

cyl_bessel_i1f(x), device-only

6 ULP

cyl_bessel_i1(x), device-only

6 ULP


\(J_0(x)\)

j0f(x)

▪ 9 ULP for \(|x| < 8\)
▪ the maximum absolute error \(= 2.2 \cdot 10^{-6}\), otherwise

j0(x)

▪ 7 ULP for \(|x| < 8\)
▪ the maximum absolute error \(= 5 \cdot 10^{-12}\), otherwise


\(J_1(x)\)

j1f(x)

▪ 9 ULP for \(|x| < 8\)
▪ the maximum absolute error \(= 2.2 \cdot 10^{-6}\), otherwise

j1(x)

▪ 7 ULP for \(|x| < 8\)
▪ the maximum absolute error \(= 5 \cdot 10^{-12}\), otherwise


\(J_n(x)\)

jnf(n, x)

For \(n = 128\), the maximum absolute error \(= 2.2 \cdot 10^{-6}\)

jn(n, x)

For \(n = 128\), the maximum absolute error \(= 5 \cdot 10^{-12}\)


\(Y_0(x)\)

y0f(x)

▪ 9 ULP for \(|x| < 8\)
▪ the maximum absolute error \(= 2.2 \cdot 10^{-6}\), otherwise

y0(x)

▪ 7 ULP for \(|x| < 8\)
▪ the maximum absolute error \(= 5 \cdot 10^{-12}\), otherwise


\(Y_1(x)\)

y1f(x)

▪ 9 ULP for \(|x| < 8\)
▪ the maximum absolute error \(= 2.2 \cdot 10^{-6}\), otherwise

y1(x)

▪ 7 ULP for \(|x| < 8\)
▪ the maximum absolute error \(= 5 \cdot 10^{-12}\), otherwise


\(Y_n(x)\)

ynf(n, x)

▪ \(\lceil 2 + 2.5n \rceil\) for \(|x| < n\)
▪ the maximum absolute error \(= 2.2 \cdot 10^{-6}\), otherwise

yn(n, x)

For \(|x| > 1.5n\), the maximum absolute error \(= 5 \cdot 10^{-12}\)

Non-standard CUDA Mathematical functions for __half, __nv_bfloat16, and __float128/_Float128 are only available in device code.

Table 56 Non-standard CUDA Mathematical functions
__nv_bfloat16, __half, __float128/_Float128
Mapping and Accuracy (Maximum ULP)
#

Meaning

__nv_bfloat16

__half

__float128/_Float128

\(\dfrac{1}{x}\)

hrcp(x)

0 ULP

hrcp(x)

0 ULP


N/A


\(10^x\)

hexp10(x)

0 ULP

hexp10(x)

0 ULP

__nv_fp128_exp10(x)

0 ULP

\(\dfrac{1}{\sqrt{x}}\)

hrsqrt(x)

0 ULP

hrsqrt(x)

0 ULP


N/A


\(\tanh(x)\) (approximate)

htanh_approx(x)

1 ULP

htanh_approx(x)

1 ULP


N/A

5.5.9. Intrinsic Functions#

Intrinsic mathematical functions are faster and less accurate versions of their corresponding CUDA C Standard Library Mathematical functions.

  • They have the same name prefixed with __, such as __sinf(x).

  • They are only available in device code.

  • They are faster because they map to fewer native instructions.

  • The flag --use_fast_math automatically translates the corresponding CUDA Math API functions into intrinsic functions. See the --use_fast_math Effect section for the full list of affected functions.

5.5.9.1. Basic Intrinsic Functions#

A subset of the mathematical intrinsic functions allows the rounding mode to be specified:

  • Functions suffixed with _rn operate using the round to nearest even rounding mode.

  • Functions suffixed with _rz operate using the round towards zero rounding mode.

  • Functions suffixed with _ru operate using the round up (toward positive infinity) rounding mode.

  • Functions suffixed with _rd operate using the round down (toward negative infinity) rounding mode.

The __fadd_[rn,rz,ru,rd](), __dadd_rn(), __fmul_[rn,rz,ru,rd](), and __dmul_rn() functions map to addition and multiplication operations that the compiler never merges into the FFMA or DFMA instructions. In contrast, additions and multiplications generated from the * and + operators are often combined into FFMA or DFMA.

The following table lists the single-, double-, and quad-precision floating-point intrinsic functions. All of them have a maximum ULP error of 0 and are IEEE-compliant.

5.5.9.2. Single-Precision-Only Intrinsic Functions#

The following table lists the single-precision floating-point intrinsic functions with their maximum ULP error.

  • The maximum ULP error is stated as the absolute value of the difference in ULPs between the result returned by the CUDA library function and a correctly rounded single-precision result obtained according to the round-to-nearest ties-to-even rounding mode.

  • The error bounds are derived from extensive, though not exhaustive, testing. Therefore, they are not guaranteed.

Table 58 Single-Precision Only Floating-Point Intrinsic Functions
Mapping and Accuracy (Maximum ULP)
#

Function

Meaning

Maximum ULP Error

__fdividef(x, y)

\(\dfrac{x}{y}\)

\(2\) for \(|y| \in [2^{-126}, 2^{126}]\)

__expf(x)

\(e^x\)

\(2 + \lfloor |1.173 \cdot x| \rfloor\)

__exp10f(x)

\(10^x\)

\(2 + \lfloor |2.97 \cdot x| \rfloor\)

__powf(x, y)

\(x^y\)

Derived from exp2f(y * __log2f(x))

__logf(x)

\(\ln(x)\)

\(2^{-21.41}\) abs error for \(x \in [0.5, 2]\)
▪ 3 ULP, otherwise

__log2f(x)

\(\log_2(x)\)

\(2^{-22}\) abs error for \(x \in [0.5, 2]\)
▪ 2 ULP, otherwise

__log10f(x)

\(\log_{10}(x)\)

\(2^{-24}\) abs error for \(x \in [0.5, 2]\)
▪ 3 ULP, otherwise

__sinf(x)

\(\sin(x)\)

\(2^{-21.41}\) abs error for \(x \in [-\pi, \pi]\)
▪ larger otherwise

__cosf(x)

\(\cos(x)\)

\(2^{-21.41}\) abs error for \(x \in [-\pi, \pi]\)
▪ larger otherwise

__sincosf(x, sptr, cptr)

\(\sin(x), \cos(x)\)

Component-wise, the same as __sinf(x) and __cosf(x)

__tanf(x)

\(\tan(x)\)

Derived from __sinf(x) * (1 / __cosf(x))

__tanhf(x)

\(\tanh(x)\)

▪ Max relative error: \(2^{-11}\)
▪ Subnormal results are not flushed to zero even under -ftz=true compiler flag.

5.5.9.3. --use_fast_math Effect#

The nvcc compiler flag --use_fast_math translates a subset of the CUDA Math API functions called in device code into their intrinsic counterparts. Note that the CUDA C++ Standard Library functions are also affected by this flag. See the Intrinsic Functions section for more details on the implications of using intrinsic functions instead of CUDA Math API functions.

A more robust approach is to selectively replace mathematical function calls with intrinsic versions only where the performance gains justify it and where the changed properties, such as reduced accuracy and different special-case handling, are acceptable.
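
For instance (an illustrative sketch; the kernel attenuate is hypothetical), the trade-off can be made per call site rather than for the whole compilation unit:

__global__ void attenuate(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Accuracy matters for the gain: keep the standard expf() (2 ULP bound).
        float gain = expf(-x[i]);
        // The reduced accuracy of the __cosf() intrinsic is acceptable here,
        // so it is called explicitly instead of compiling the whole file with
        // --use_fast_math.
        y[i] = gain * __cosf(x[i]);
    }
}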

5.5.10. References#

  1. IEEE 754-2019 Standard for Floating-Point Arithmetic.

  2. Jean-Michel Muller. On the definition of ulp(x). INRIA/LIP research report, 2005.

  3. Nathan Whitehead, Alex Fit-Florea. Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs. Nvidia Report, 2011.

  4. David Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, March 1991.

  5. David Monniaux. The pitfalls of verifying floating-point computations. ACM Transactions on Programming Languages and Systems, May 2008.

  6. Peter Dinda, Conor Hetland. Do Developers Understand IEEE Floating Point?. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2018.