Legacy API

The graph API, described in the Graph API section, can be thought of as a declarative API: you declare a graph, then build and run it. Previously, cuDNN had only an imperative API, which is more convenient for basic use cases but has turned out to be overly restrictive as the deep learning field has evolved to require more operations and more complex fusions of operations. Over time, we are deprecating and removing parts of the legacy API as the graph API reaches parity with its support surface. However, a gap remains, and therefore so does the legacy API.

Convolution Functions

Prerequisites

For the supported GPUs, the Tensor Core operations will be triggered for convolution functions only when cudnnSetConvolutionMathType() is called on the appropriate convolution descriptor by setting the mathType to CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION.
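The following is a minimal sketch of setting the math type on a convolution descriptor. The checkCudnnErr helper (a user-defined error check, consistent with the snippet later in this section) and the specific convolution parameters are illustrative assumptions, not part of the cuDNN API:

#include <cudnn.h>
#include <cstdio>
#include <cstdlib>

// User-defined error-check helper (not part of cuDNN).
static void checkCudnnErr(cudnnStatus_t s) {
    if (s != CUDNN_STATUS_SUCCESS) {
        std::fprintf(stderr, "cuDNN error: %s\n", cudnnGetErrorString(s));
        std::exit(EXIT_FAILURE);
    }
}

int main() {
    cudnnConvolutionDescriptor_t convDesc;
    checkCudnnErr(cudnnCreateConvolutionDescriptor(&convDesc));

    // Illustrative 3x3 convolution: pad 1, stride 1, dilation 1, FP32 compute type.
    checkCudnnErr(cudnnSetConvolution2dDescriptor(convDesc,
        /*pad_h=*/1, /*pad_w=*/1, /*u=*/1, /*v=*/1,
        /*dilation_h=*/1, /*dilation_w=*/1,
        CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT));

    // Opt in to Tensor Core operations for this convolution descriptor.
    checkCudnnErr(cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH));

    checkCudnnErr(cudnnDestroyConvolutionDescriptor(convDesc));
    return 0;
}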

Supported Algorithms

When the prerequisite is met, the convolution functions below can be run as Tensor Core operations (a workspace-query sketch for one of these algorithms follows the list):

List of Supported Algorithms

cudnnConvolutionForward:

  • CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM

  • CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED

cudnnConvolutionBackwardData:

  • CUDNN_CONVOLUTION_BWD_DATA_ALGO_1

  • CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD_NONFUSED

cudnnConvolutionBackwardFilter:

  • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1

  • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_WINOGRAD_NONFUSED
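For example, here is a sketch of requesting one of the listed algorithms explicitly and querying the workspace it needs. The function name is hypothetical, and the handle and descriptors are assumed to be created and configured elsewhere:

#include <cudnn.h>

// Query the workspace needed to run cudnnConvolutionForward with one of the
// Tensor Core eligible algorithms listed above.
cudnnStatus_t queryFwdWorkspace(cudnnHandle_t handle,
                                cudnnTensorDescriptor_t xDesc,
                                cudnnFilterDescriptor_t wDesc,
                                cudnnConvolutionDescriptor_t convDesc,
                                cudnnTensorDescriptor_t yDesc,
                                size_t *workspaceBytes) {
    const cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
    // The returned size must be allocated (for example, with cudaMalloc) and passed
    // to cudnnConvolutionForward together with the same algo enum.
    return cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc, convDesc,
                                                   yDesc, algo, workspaceBytes);
}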

Data and Filter Formats

The cuDNN library may use padding, folding, and NCHW-to-NHWC transformations to call the Tensor Core operations. For more information, refer to Tensor Transformations.

For algorithms other than *_ALGO_WINOGRAD_NONFUSED, the cuDNN library will trigger the Tensor Core operations when the following requirements are met (a descriptor-setup sketch follows the list):

  • Input, filter, and output descriptors (xDesc, yDesc, wDesc, dxDesc, dyDesc, and dwDesc as applicable) have dataType = CUDNN_DATA_HALF (that is, FP16). For the FP32 data type, refer to Conversion Between FP32 and FP16.

  • The number of input and output feature maps (that is, channel dimension C) is a multiple of 8. When the channel dimension is not a multiple of 8, refer to Padding.

  • The filter is of type CUDNN_TENSOR_NCHW or CUDNN_TENSOR_NHWC.

  • If using a filter of type CUDNN_TENSOR_NHWC, then the input, filter, and output data pointers (X, Y, W, dX, dY, and dW as applicable) are aligned to 128-bit boundaries.
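As an illustration of these requirements, the following sketch sets up FP16 NHWC input, filter, and output descriptors with channel counts that are multiples of 8. The helper function name and the dimensions are illustrative assumptions, and error checking is omitted for brevity:

#include <cudnn.h>

// Descriptors that satisfy the requirements above for the non-Winograd algorithms.
void setupHalfNhwcDescriptors(cudnnTensorDescriptor_t xDesc,
                              cudnnFilterDescriptor_t wDesc,
                              cudnnTensorDescriptor_t yDesc) {
    const int n = 32, c = 64, h = 56, w = 56;  // C = 64 is a multiple of 8
    const int k = 128, r = 3, s = 3;           // K = 128 is a multiple of 8

    // FP16 data type, NHWC layout for input, filter, and output.
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NHWC, CUDNN_DATA_HALF, n, c, h, w);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_HALF, CUDNN_TENSOR_NHWC, k, c, r, s);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NHWC, CUDNN_DATA_HALF, n, k, h, w);
    // The device pointers bound to these descriptors should be 128-bit aligned;
    // cudaMalloc allocations already satisfy this.
}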

RNN Functions

Prerequisites

Tensor Core operations may be used in RNN functions cudnnRNNForward(), cudnnRNNBackwardData_v8(), and cudnnRNNBackwardWeights_v8() when the mathType argument in cudnnSetRNNDescriptor_v8() is set to CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION.
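For example, here is a minimal sketch of configuring an LSTM with cudnnSetRNNDescriptor_v8() so that Tensor Cores may be used. The function name, sizes, cell type, and other settings are illustrative assumptions; dropoutDesc is assumed to be created elsewhere, and error checking is omitted:

#include <cudnn.h>
#include <cstdint>

// Configure an LSTM descriptor so the v8 RNN API may use Tensor Cores.
void configureRnnForTensorCores(cudnnRNNDescriptor_t rnnDesc,
                                cudnnDropoutDescriptor_t dropoutDesc) {
    const int32_t inputSize = 512, hiddenSize = 512, projSize = 512, numLayers = 2;
    cudnnSetRNNDescriptor_v8(rnnDesc,
                             CUDNN_RNN_ALGO_STANDARD,
                             CUDNN_LSTM,
                             CUDNN_RNN_DOUBLE_BIAS,
                             CUDNN_UNIDIRECTIONAL,
                             CUDNN_LINEAR_INPUT,
                             CUDNN_DATA_HALF,      // dataType of the I/O tensors
                             CUDNN_DATA_FLOAT,     // mathPrec (accumulation precision)
                             CUDNN_TENSOR_OP_MATH, // request Tensor Core operations
                             inputSize, hiddenSize, projSize, numLayers,
                             dropoutDesc,
                             CUDNN_RNN_PADDED_IO_ENABLED); // auxFlags: unpacked (padded) layout
}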

Supported Algorithms

When the above prerequisites are met, the RNN functions below can be run as Tensor Core operations:

List of Supported Algorithms

All RNN functions that support Tensor Core operations support them with the following algorithms:

  • CUDNN_RNN_ALGO_STANDARD

  • CUDNN_RNN_ALGO_PERSIST_STATIC

  • CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H

Data and Filter Formats

When the following requirements are met, the cuDNN library triggers the Tensor Core operations (a size-check sketch follows this subsection):

  • For algo = CUDNN_RNN_ALGO_STANDARD:

    • The hidden state size, input size, and batch size are each a multiple of 8.

    • All user-provided tensors and the RNN weight buffer are aligned to 128-bit boundaries.

    • For FP16 input/output, CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION is selected.

    • For FP32 input/output, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION is selected.

  • For algo = CUDNN_RNN_ALGO_PERSIST_STATIC:

    • The hidden state size and the input size are each a multiple of 32.

    • The batch size is a multiple of 8.

    • If the batch size exceeds 96 (for forward training or inference) or 32 (for backward data), then the batch size constraints may be stricter, and large power-of-two batch sizes may be needed.

    • All user-provided tensors and the RNN weight buffer are aligned to 128-bit boundaries.

    • For FP16 input/output, CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION is selected.

    • For FP32 input/output, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION is selected.

For more information, refer to Features of RNN Functions.
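As a quick illustration, the following hypothetical helper (not a cuDNN API) checks the size constraints above before requesting Tensor Cores; pointer alignment must be verified separately:

#include <cstdint>

// Check the size constraints listed above for CUDNN_RNN_ALGO_STANDARD and
// CUDNN_RNN_ALGO_PERSIST_STATIC before requesting Tensor Core math.
bool rnnSizesAllowTensorCores(bool persistStatic,
                              int64_t hiddenSize, int64_t inputSize, int64_t batchSize) {
    if (persistStatic) {
        // _ALGO_PERSIST_STATIC: hidden and input sizes multiples of 32, batch size multiple of 8.
        return hiddenSize % 32 == 0 && inputSize % 32 == 0 && batchSize % 8 == 0;
    }
    // _ALGO_STANDARD: hidden size, input size, and batch size multiples of 8.
    return hiddenSize % 8 == 0 && inputSize % 8 == 0 && batchSize % 8 == 0;
}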

Features of RNN Functions

Refer to the following table for a list of features supported by each RNN function.

For each of these terms, the short-form versions shown in parentheses are used in the tables below for brevity: CUDNN_RNN_ALGO_STANDARD (_ALGO_STANDARD), CUDNN_RNN_ALGO_PERSIST_STATIC (_ALGO_PERSIST_STATIC), CUDNN_RNN_ALGO_PERSIST_DYNAMIC (_ALGO_PERSIST_DYNAMIC), CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H (_ALGO_SMALL_H) and CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION (_ALLOW_CONVERSION).

RNN Features

cudnnRNNForward()

  • I/O layouts supported: sequence-major unpacked, batch-major unpacked, sequence-major packed.

  • Variable sequence length in batch: supported only with _ALGO_STANDARD. For the unpacked layouts, no input sorting is required. For the packed layout, input sequences must be sorted in descending order of length.

cudnnRNNBackwardData_v8()

  • I/O layouts supported: sequence-major unpacked, batch-major unpacked, sequence-major packed.

  • Variable sequence length in batch: supported only with _ALGO_STANDARD. For the unpacked layouts, no input sorting is required. For the packed layout, input sequences must be sorted in descending order of length.

cudnnRNNBackwardWeights_v8()

  • I/O layouts supported: sequence-major unpacked, batch-major unpacked, sequence-major packed.

  • Variable sequence length in batch: supported only with _ALGO_STANDARD. For the unpacked layouts, no input sorting is required. For the packed layout, input sequences must be sorted in descending order of length.

Note

To use an unpacked layout, set CUDNN_RNN_PADDED_IO_ENABLED through the auxFlags argument of cudnnSetRNNDescriptor_v8().

Commonly Supported

  • Mode (cell type): CUDNN_RNN_RELU, CUDNN_RNN_TANH, CUDNN_LSTM, and CUDNN_GRU.

  • Algo: _ALGO_STANDARD, _ALGO_PERSIST_STATIC, _ALGO_PERSIST_DYNAMIC, _ALGO_SMALL_H. Do not mix different algos in the cudnnRNNForward(), cudnnRNNBackwardData_v8(), and cudnnRNNBackwardWeights_v8() functions.

  • Math mode: CUDNN_DEFAULT_MATH, CUDNN_TENSOR_OP_MATH (will automatically fall back if run on pre-Volta GPUs or if the algo does not support Tensor Cores), _ALLOW_CONVERSION (may perform down-conversion to utilize Tensor Cores).

  • Direction mode: CUDNN_UNIDIRECTIONAL, CUDNN_BIDIRECTIONAL.

  • RNN input mode: CUDNN_LINEAR_INPUT, CUDNN_SKIP_INPUT.

Supported Features By Algo

Half input, single accumulation, half output

  • _ALGO_STANDARD, _ALGO_PERSIST_STATIC, _ALGO_SMALL_H, and _ALGO_PERSIST_DYNAMIC: Supported. Half intermediate storage; single accumulation.

Single input, single accumulation, single output

  • _ALGO_STANDARD, _ALGO_PERSIST_STATIC, _ALGO_SMALL_H, and _ALGO_PERSIST_DYNAMIC: Supported. If running on Volta with CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION, the library will down-convert and use half intermediate storage; otherwise, single intermediate storage. Single accumulation.

Double input, double accumulation, double output

  • _ALGO_STANDARD: Supported. Double intermediate storage; double accumulation.

  • _ALGO_PERSIST_STATIC: Not supported.

  • _ALGO_SMALL_H: Not supported.

  • _ALGO_PERSIST_DYNAMIC: Supported. Double intermediate storage; double accumulation.

LSTM recurrent projection

  • _ALGO_STANDARD: Supported.

  • _ALGO_PERSIST_STATIC, _ALGO_SMALL_H, _ALGO_PERSIST_DYNAMIC: Not supported.

LSTM cell clipping

  • Supported by all four algos.

Variable sequence length in batch

  • _ALGO_STANDARD: Supported.

  • _ALGO_PERSIST_STATIC, _ALGO_SMALL_H, _ALGO_PERSIST_DYNAMIC: Not supported.

Tensor Cores

  • _ALGO_STANDARD, _ALGO_PERSIST_STATIC, and _ALGO_SMALL_H:

    • For half I/O, acceleration requires setting CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION, and requires inputSize and hiddenSize to be a multiple of 8.

    • For single I/O on NVIDIA Volta, NVIDIA Xavier, and NVIDIA Turing, acceleration requires setting CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION, and requires inputSize and hiddenSize to be a multiple of 8.

    • For single I/O on the NVIDIA Ampere architecture, acceleration requires setting CUDNN_DEFAULT_MATH, CUDNN_TENSOR_OP_MATH, or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION, and requires inputSize and hiddenSize to be a multiple of 4.

  • _ALGO_PERSIST_DYNAMIC: Not supported; the call executes normally, ignoring CUDNN_TENSOR_OP_MATH or _ALLOW_CONVERSION.

Other limitations

  • _ALGO_PERSIST_STATIC: Max problem size is limited by GPU specifications.

  • _ALGO_SMALL_H: Forward RNN: hidden_size <= 384 for RELU and TANH RNN, hidden_size <= 192 for LSTM and GRU. BackwardData RNN: hidden_size <= 256 for RELU and TANH RNN, hidden_size <= 128 for LSTM and GRU.

  • _ALGO_PERSIST_DYNAMIC: Requires real-time compilation through NVRTC.

Tensor Transformations

A few functions in the cuDNN library will perform transformations such as folding, padding, and NCHW-to-NHWC conversion while performing the actual function operation.

Conversion Between FP32 and FP16

The cuDNN API allows you to specify that FP32 input data may be copied and converted to FP16 internally in order to use Tensor Core operations for potentially improved performance. This is achieved by selecting the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION enum for cudnnMathType_t. In this mode, the FP32 tensors are internally down-converted to FP16, the Tensor Core math is performed, and the results are finally up-converted to FP32 as outputs.

Figure: Tensor Operation with FP32 Inputs

For Convolutions

For convolutions, the FP32-to-FP16 conversion can be achieved by passing the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION enum value to the cudnnSetConvolutionMathType() call.

// Set the math type to allow cuDNN to use Tensor Cores:
checkCudnnErr(cudnnSetConvolutionMathType(cudnnConvDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION));

For RNNs

For RNNs, the FP32-to-FP16 conversion can be achieved by assigning the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION enum value to the mathType argument of the cudnnSetRNNDescriptor_v8() call to allow FP32 input data to be down-converted for use in RNNs.
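A minimal sketch, mirroring the convolution snippet above; checkCudnnErr is assumed to be the same error-check helper, and algo, cellMode, biasMode, dirMode, inputMode, the size arguments, dropoutDesc, and auxFlags stand for application-specific values set elsewhere:

// Request down-conversion of FP32 I/O so the RNN can use Tensor Cores:
checkCudnnErr(cudnnSetRNNDescriptor_v8(rnnDesc, algo, cellMode, biasMode, dirMode,
                                       inputMode, CUDNN_DATA_FLOAT /*dataType*/,
                                       CUDNN_DATA_FLOAT /*mathPrec*/,
                                       CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION /*mathType*/,
                                       inputSize, hiddenSize, projSize, numLayers,
                                       dropoutDesc, auxFlags));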

Padding

For packed NCHW data, when the channel dimension is not a multiple of 8, then the cuDNN library will pad the tensors as needed to enable Tensor Core operations. This padding is automatic for packed NCHW data in both the CUDNN_TENSOR_OP_MATH and the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION cases.

Folding

In the folding (or channel-folding) operation, the cuDNN library implicitly reformats the input tensors and saves them in an internal workspace, which can accelerate the Tensor Core path of the call. Performing this transformation for the user often allows cuDNN to use kernels with restrictions on convolution stride to support a strided convolution problem.

Conversion Between NCHW And NHWC

Tensor Cores require that the tensors be in the NHWC data layout. Conversion between NCHW and NHWC is performed when the user requests Tensor Op math. However, a request to use Tensor Cores is just that: a request, and Tensor Cores may not be used in some cases. The cuDNN library converts between NCHW and NHWC if and only if Tensor Cores are requested and actually used.

If your input and output are in NCHW, expect a layout change.

Non-Tensor Op convolutions will not perform conversions between NCHW and NHWC.

In very rare and difficult-to-qualify cases that are a complex function of padding and filter sizes, it is possible that Tensor Ops are not enabled. In such cases, users can pre-pad their data to enable the Tensor Ops path.

Mixed Precision Numerical Accuracy

When the computation precision and the output precision are not the same, it is possible that the numerical accuracy will vary from one algorithm to the other.

For example, when the computation is performed in FP32 and the output is in FP16, CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 (ALGO_0) has lower accuracy compared to CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1 (ALGO_1). This is because ALGO_0 does not use extra workspace and is forced to accumulate the intermediate results in FP16 (half-precision float), which reduces the accuracy. ALGO_1, on the other hand, uses additional workspace to accumulate the intermediate values in FP32 (full-precision float).
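For example, here is a sketch of selecting ALGO_1 for the filter gradient and querying the extra workspace it needs. The function name is hypothetical, the handle and descriptors are assumed to be configured elsewhere, and error checking is omitted:

#include <cudnn.h>

// Query the workspace needed by the more accurate ALGO_1 filter-gradient algorithm.
size_t backwardFilterAlgo1Workspace(cudnnHandle_t handle,
                                    cudnnTensorDescriptor_t xDesc,
                                    cudnnTensorDescriptor_t dyDesc,
                                    cudnnConvolutionDescriptor_t convDesc,
                                    cudnnFilterDescriptor_t dwDesc) {
    size_t bytes = 0;
    // ALGO_1 uses this additional workspace to accumulate intermediate values in FP32,
    // which is why it is more accurate than ALGO_0 when the output is FP16.
    cudnnGetConvolutionBackwardFilterWorkspaceSize(handle, xDesc, dyDesc, convDesc,
                                                   dwDesc,
                                                   CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1,
                                                   &bytes);
    return bytes;
}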