1.3.2. Bfloat162 Arithmetic Functions

[Bfloat16 Precision Intrinsics]

To use these functions, include the header file cuda_bf16.h in your program.

Functions

__device__ __nv_bfloat162 __habs2 ( const __nv_bfloat162 a )
Calculates the absolute value of both halves of the input nv_bfloat162 number and returns the result.
Parameters
a
- nv_bfloat162. Read-only input.
Returns

nv_bfloat162

  • The absolute value of each half of a.

Description

Calculates the absolute value of both halves of the input nv_bfloat162 number and returns the result.

__device__ __nv_bfloat162 __hadd2 ( const __nv_bfloat162 a, const __nv_bfloat162 b )
Performs nv_bfloat162 vector addition in round-to-nearest-even mode.
Description

Performs nv_bfloat162 vector addition of inputs a and b, in round-to-nearest-even mode.
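As an illustration of the elementwise behavior, the following is a minimal sketch rather than a reference implementation. The kernel name add_pairs and the host helper add_one_pair are hypothetical and not part of cuda_bf16.h. The sketch assumes a device of compute capability 8.0 or higher (for example, compiled with -arch=sm_80) and omits error checking for brevity.

```cuda
#include <cuda_bf16.h>
#include <cuda_runtime.h>

// Each thread adds one packed pair:
// out[i] = (a[i].x + b[i].x, a[i].y + b[i].y).
__global__ void add_pairs(const __nv_bfloat162* a, const __nv_bfloat162* b,
                          __nv_bfloat162* out, int n_pairs) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pairs) out[i] = __hadd2(a[i], b[i]);
}

// Hypothetical host helper: adds a single pair on the GPU and
// returns both halves widened to float.
float2 add_one_pair(float2 a, float2 b) {
    __nv_bfloat162* buf;  // buf[0] = a, buf[1] = b, buf[2] = result
    cudaMallocManaged(&buf, 3 * sizeof(__nv_bfloat162));
    buf[0] = __floats2bfloat162_rn(a.x, a.y);
    buf[1] = __floats2bfloat162_rn(b.x, b.y);
    add_pairs<<<1, 1>>>(buf, buf + 1, buf + 2, 1);
    cudaDeviceSynchronize();
    float2 result = __bfloat1622float2(buf[2]);
    cudaFree(buf);
    return result;
}
```

Calling add_one_pair(make_float2(1.0f, 2.0f), make_float2(0.5f, 0.25f)) should return (1.5, 2.25), since both sums are exactly representable in bfloat16.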

__device__ __nv_bfloat162 __hadd2_rn ( const __nv_bfloat162 a, const __nv_bfloat162 b )
Performs nv_bfloat162 vector addition in round-to-nearest-even mode.
Description

Performs nv_bfloat162 vector addition of inputs a and b, in round-to-nearest-even mode. Prevents floating-point contraction of mul+add into fma.
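The difference between a contracted and an uncontracted multiply-add is the number of roundings. The sketch below compares __hadd2_rn(__hmul2_rn(a, b), c), which rounds the product before adding, with __hfma2(a, b, c), which rounds once. The names fused_vs_separate and mul_add_two_ways are hypothetical. The sketch assumes compute capability 8.0 or higher and omits error checking.

```cuda
#include <cuda_bf16.h>
#include <cuda_runtime.h>

// in[0] = a, in[1] = b, in[2] = c; out[0] = separate ops, out[1] = fused op.
__global__ void fused_vs_separate(const __nv_bfloat162* in, __nv_bfloat162* out) {
    // Two roundings: the product is rounded to bfloat16 before the add.
    out[0] = __hadd2_rn(__hmul2_rn(in[0], in[1]), in[2]);
    // One rounding: the exact product feeds the add.
    out[1] = __hfma2(in[0], in[1], in[2]);
}

// Hypothetical host helper: broadcasts a, b, c to both halves and returns
// (separate result, fused result) taken from the low halves.
float2 mul_add_two_ways(float a, float b, float c) {
    __nv_bfloat162* buf;  // buf[0..2] inputs, buf[3..4] outputs
    cudaMallocManaged(&buf, 5 * sizeof(__nv_bfloat162));
    buf[0] = __float2bfloat162_rn(a);
    buf[1] = __float2bfloat162_rn(b);
    buf[2] = __float2bfloat162_rn(c);
    fused_vs_separate<<<1, 1>>>(buf, buf + 3);
    cudaDeviceSynchronize();
    float2 result = make_float2(__low2float(buf[3]), __low2float(buf[4]));
    cudaFree(buf);
    return result;
}
```

With a = b = 1.0078125 (1 + 2^-7) and c = -1.015625, the exact product is 1 + 2^-6 + 2^-14. Rounding the product to bfloat16 drops the 2^-14 term, so the separate path should return 0, while the fused path should return 2^-14 (about 6.1e-05).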

__device__ __nv_bfloat162 __hadd2_sat ( const __nv_bfloat162 a, const __nv_bfloat162 b )
Performs nv_bfloat162 vector addition in round-to-nearest-even mode, with saturation to [0.0, 1.0].
Parameters
a
- nv_bfloat162. Read-only input.
b
- nv_bfloat162. Read-only input.
Returns

nv_bfloat162

  • The sum of a and b, with each half clamped to [0.0, 1.0].

Description

Performs nv_bfloat162 vector addition of inputs a and b, in round-to-nearest-even mode, and clamps the results to the range [0.0, 1.0]. NaN results are flushed to +0.0.
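The clamping applies to each half independently. The following minimal sketch shows one half saturating at the upper bound and the other at the lower bound. The names add_sat_kernel and add_sat_pair are hypothetical. The sketch assumes compute capability 8.0 or higher and omits error checking. The other _sat functions in this section can be exercised in the same way.

```cuda
#include <cuda_bf16.h>
#include <cuda_runtime.h>

// Saturating add of one packed pair; each half is clamped to [0.0, 1.0].
__global__ void add_sat_kernel(const __nv_bfloat162* in, __nv_bfloat162* out) {
    *out = __hadd2_sat(in[0], in[1]);
}

// Hypothetical host helper: runs the saturating add on a single pair.
float2 add_sat_pair(float2 a, float2 b) {
    __nv_bfloat162* buf;  // buf[0] = a, buf[1] = b, buf[2] = result
    cudaMallocManaged(&buf, 3 * sizeof(__nv_bfloat162));
    buf[0] = __floats2bfloat162_rn(a.x, a.y);
    buf[1] = __floats2bfloat162_rn(b.x, b.y);
    add_sat_kernel<<<1, 1>>>(buf, buf + 2);
    cudaDeviceSynchronize();
    float2 result = __bfloat1622float2(buf[2]);
    cudaFree(buf);
    return result;
}
```

For a = (0.75, -0.5) and b = (0.5, 0.25), the unclamped sums are (1.25, -0.25), so the call should return (1.0, 0.0).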

__device__ __nv_bfloat162 __hcmadd ( const __nv_bfloat162 a, const __nv_bfloat162 b, const __nv_bfloat162 c )
Performs fast complex multiply-accumulate.
Parameters
a
- nv_bfloat162. Read-only input.
b
- nv_bfloat162. Read-only input.
c
- nv_bfloat162. Read-only input.
Returns

nv_bfloat162

  • The result of the complex multiply-accumulate operation on complex numbers a, b, and c.

Description

Interprets the vector nv_bfloat162 inputs a, b, and c as complex numbers in nv_bfloat16 precision and performs the complex multiply-accumulate operation a*b + c.
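The sketch below works through one complex multiply-accumulate. It assumes that the low half (.x) holds the real part and the high half (.y) the imaginary part, so that in exact arithmetic the real part is a.x*b.x - a.y*b.y + c.x and the imaginary part is a.x*b.y + a.y*b.x + c.y. The names cmadd_kernel and complex_mul_add are hypothetical. The sketch assumes compute capability 8.0 or higher and omits error checking.

```cuda
#include <cuda_bf16.h>
#include <cuda_runtime.h>

// in[0] = a, in[1] = b, in[2] = c, each read as (real, imaginary).
__global__ void cmadd_kernel(const __nv_bfloat162* in, __nv_bfloat162* out) {
    *out = __hcmadd(in[0], in[1], in[2]);
}

// Hypothetical host helper: computes a*b + c for one complex triple.
float2 complex_mul_add(float2 a, float2 b, float2 c) {
    __nv_bfloat162* buf;  // buf[0..2] inputs, buf[3] result
    cudaMallocManaged(&buf, 4 * sizeof(__nv_bfloat162));
    buf[0] = __floats2bfloat162_rn(a.x, a.y);
    buf[1] = __floats2bfloat162_rn(b.x, b.y);
    buf[2] = __floats2bfloat162_rn(c.x, c.y);
    cmadd_kernel<<<1, 1>>>(buf, buf + 3);
    cudaDeviceSynchronize();
    float2 result = __bfloat1622float2(buf[3]);
    cudaFree(buf);
    return result;
}
```

For a = 1 + 2i, b = 3 + 4i, and c = 0.5 - i, the product a*b is -5 + 10i, so the call should return (-4.5, 9.0). All intermediate values here are exactly representable in bfloat16; with other inputs, intermediate rounding may make the result differ slightly from the exact formula.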

__device__ __nv_bfloat162 __hfma2 ( const __nv_bfloat162 a, const __nv_bfloat162 b, const __nv_bfloat162 c )
Performs nv_bfloat162 vector fused multiply-add in round-to-nearest-even mode.
Description

Performs nv_bfloat162 vector multiply on inputs a and b, then performs a nv_bfloat162 vector add of the result with c, rounding the result once in round-to-nearest-even mode.

__device__ __nv_bfloat162 __hfma2_relu ( const __nv_bfloat162 a, const __nv_bfloat162 b, const __nv_bfloat162 c )
Performs nv_bfloat162 vector fused multiply-add in round-to-nearest-even mode with relu saturation.
Parameters
a
- nv_bfloat162. Read-only input.
b
- nv_bfloat162. Read-only input.
c
- nv_bfloat162. Read-only input.
Returns

nv_bfloat162

  • The result of the elementwise fused multiply-add operation on vectors a, b, and c with relu saturation.

Description

Performs nv_bfloat162 vector multiply on inputs a and b, then performs a nv_bfloat162 vector add of the result with c, rounding the result once in round-to-nearest-even mode. Negative results are then clamped to 0. NaN results are converted to canonical NaN.
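Unlike the _sat variants, relu saturation clamps only from below, so positive results pass through unchanged. A minimal sketch follows. The names fma_relu_kernel and fma_relu_pair are hypothetical. The sketch assumes compute capability 8.0 or higher and omits error checking.

```cuda
#include <cuda_bf16.h>
#include <cuda_runtime.h>

// Computes max(a*b + c, 0) for each half; in[0] = a, in[1] = b, in[2] = c.
__global__ void fma_relu_kernel(const __nv_bfloat162* in, __nv_bfloat162* out) {
    *out = __hfma2_relu(in[0], in[1], in[2]);
}

// Hypothetical host helper: runs the fused multiply-add with relu on one pair.
float2 fma_relu_pair(float2 a, float2 b, float2 c) {
    __nv_bfloat162* buf;  // buf[0..2] inputs, buf[3] result
    cudaMallocManaged(&buf, 4 * sizeof(__nv_bfloat162));
    buf[0] = __floats2bfloat162_rn(a.x, a.y);
    buf[1] = __floats2bfloat162_rn(b.x, b.y);
    buf[2] = __floats2bfloat162_rn(c.x, c.y);
    fma_relu_kernel<<<1, 1>>>(buf, buf + 3);
    cudaDeviceSynchronize();
    float2 result = __bfloat1622float2(buf[3]);
    cudaFree(buf);
    return result;
}
```

For a = (2, 2), b = (3, -3), and c = (1, 1), the unclamped results are (7, -5), so the call should return (7.0, 0.0).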

__device__ __nv_bfloat162 __hfma2_sat ( const __nv_bfloat162 a, const __nv_bfloat162 b, const __nv_bfloat162 c )
Performs nv_bfloat162 vector fused multiply-add in round-to-nearest-even mode, with saturation to [0.0, 1.0].
Parameters
a
- nv_bfloat162. Read-only input.
b
- nv_bfloat162. Read-only input.
c
- nv_bfloat162. Read-only input.
Returns

nv_bfloat162

  • The result of the elementwise fused multiply-add operation on vectors a, b, and c, with each half clamped to [0.0, 1.0].

Description

Performs nv_bfloat162 vector multiply on inputs a and b, then performs a nv_bfloat162 vector add of the result with c, rounding the result once in round-to-nearest-even mode, and clamps the results to range [0.0, 1.0]. NaN results are flushed to +0.0.

__device__ __nv_bfloat162 __hmul2 ( const __nv_bfloat162 a, const __nv_bfloat162 b )
Performs nv_bfloat162 vector multiplication in round-to-nearest-even mode.
Description

Performs nv_bfloat162 vector multiplication of inputs a and b, in round-to-nearest-even mode.

__device__ __nv_bfloat162 __hmul2_rn ( const __nv_bfloat162 a, const __nv_bfloat162 b )
Performs nv_bfloat162 vector multiplication in round-to-nearest-even mode.
Description

Performs nv_bfloat162 vector multiplication of inputs a and b, in round-to-nearest-even mode. Prevents floating-point contractions of mul+add or sub into fma.

__device__ __nv_bfloat162 __hmul2_sat ( const __nv_bfloat162 a, const __nv_bfloat162 b )
Performs nv_bfloat162 vector multiplication in round-to-nearest-even mode, with saturation to [0.0, 1.0].
Parameters
a
- nv_bfloat162. Read-only input.
b
- nv_bfloat162. Read-only input.
Returns

nv_bfloat162

  • The result of the elementwise multiplication of vectors a and b, with each half clamped to [0.0, 1.0].

Description

Performs nv_bfloat162 vector multiplication of inputs a and b, in round-to-nearest-even mode, and clamps the results to range [0.0, 1.0]. NaN results are flushed to +0.0.

__device__ __nv_bfloat162 __hneg2 ( const __nv_bfloat162 a )
Negates both halves of the input nv_bfloat162 number and returns the result.
Description

Negates both halves of the input nv_bfloat162 number a and returns the result.

__device__ __nv_bfloat162 __hsub2 ( const __nv_bfloat162 a, const __nv_bfloat162 b )
Performs nv_bfloat162 vector subtraction in round-to-nearest-even mode.
Description

Subtracts nv_bfloat162 input vector b from input vector a in round-to-nearest-even mode.

__device__ __nv_bfloat162 __hsub2_rn ( const __nv_bfloat162 a, const __nv_bfloat162 b )
Performs nv_bfloat162 vector subtraction in round-to-nearest-even mode.
Description

Subtracts nv_bfloat162 input vector b from input vector a in round-to-nearest-even mode. Prevents floating-point contractions of mul+sub into fma.

__device__ __nv_bfloat162 __hsub2_sat ( const __nv_bfloat162 a, const __nv_bfloat162 b )
Performs nv_bfloat162 vector subtraction in round-to-nearest-even mode, with saturation to [0.0, 1.0].
Parameters
a
- nv_bfloat162. Read-only input.
b
- nv_bfloat162. Read-only input.
Returns

nv_bfloat162

  • The result of subtracting vector b from a, with each half clamped to [0.0, 1.0].

Description

Subtracts nv_bfloat162 input vector b from input vector a in round-to-nearest-even mode, and clamps the results to range [0.0, 1.0]. NaN results are flushed to +0.0.

__device__ __nv_bfloat162 atomicAdd ( __nv_bfloat162 *const address, const __nv_bfloat162 val )
Adds val to the value stored at address in global or shared memory and writes the result back to address. Atomicity is guaranteed separately for each of the two nv_bfloat16 elements; the __nv_bfloat162 as a whole is not guaranteed to be updated atomically as a single 32-bit access.
Parameters
address
- __nv_bfloat162*. An address in global or shared memory.
val
- __nv_bfloat162. The value to be added.
Returns

__nv_bfloat162

  • The old value read from address.

Description

address must point to global or shared memory; otherwise the behavior is undefined. This operation is supported only on devices of compute capability 8.x and higher.

Note:

For more details on this function, see the Atomic Functions section of the CUDA C++ Programming Guide.
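A minimal sketch of a packed accumulation follows, in which every thread of one block adds the same pair to a single accumulator. The names accumulate and sum_contributions are hypothetical. The accumulator lives in a managed allocation, which the device accesses as global memory. The sketch assumes compute capability 8.0 or higher and omits error checking.

```cuda
#include <cuda_bf16.h>
#include <cuda_runtime.h>

// Every thread contributes (1.0, 2.0) to the same packed accumulator.
__global__ void accumulate(__nv_bfloat162* acc) {
    atomicAdd(acc, __floats2bfloat162_rn(1.0f, 2.0f));
}

// Hypothetical host helper: launches one block of n_threads threads and
// returns the accumulated pair widened to float.
float2 sum_contributions(int n_threads) {
    __nv_bfloat162* acc;
    cudaMallocManaged(&acc, sizeof(__nv_bfloat162));
    *acc = __floats2bfloat162_rn(0.0f, 0.0f);
    accumulate<<<1, n_threads>>>(acc);
    cudaDeviceSynchronize();
    float2 result = __bfloat1622float2(*acc);
    cudaFree(acc);
    return result;
}
```

With 64 threads, the call should return (64.0, 128.0). The totals are exact here only because every partial sum is a small integer that bfloat16 can represent; larger accumulations lose precision quickly and are usually better kept in a wider type.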