Legacy API#

The Graph API section can be thought of as a declarative API, in the sense that you declare a graph, and then build, and run it. Previously, cuDNN only had an imperative API, which is more convenient for basic use cases, but has turned out to be overly-restrictive as the deep learning field has evolved to require more operations and more complex fusions of operations. Over time, we are deprecating and removing parts of the legacy API, as the graph API grows to parity with its support surface. However, there is still a gap, and therefore there is still a legacy API.

Convolution Functions#

Prerequisites#

For the supported GPUs, the Tensor Core operations will be triggered for convolution functions only when cudnnSetConvolutionMathType() is called on the appropriate convolution descriptor by setting the mathType to CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION.

Supported Algorithms#

When the prerequisite is met, the below convolution functions can be run as Tensor Core operations:

cudnnConvolutionForward()

cudnnConvolutionBackwardData()

cudnnConvolutionBackwardFilter()

List of Supported Algorithms#
Supported Convolution Function	Supported Algorithm
`cudnnConvolutionForward`	`CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM` `CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED`
`cudnnConvolutionBackwardData`	`CUDNN_CONVOLUTION_BWD_DATA_ALGO_1` `CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD_NONFUSED`
`cudnnConvolutionBackwardFilter`	`CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1` `CUDNN_CONVOLUTION_BWD_FILTER_ALGO_WINOGRAD_NONFUSED`

Data and Filter Formats#

The cuDNN library may use padding, folding, and NCHW-to-NHWC transformations to call the Tensor Core operations. For more information, refer to Tensor Transformations.

For algorithms other than *_ALGO_WINOGRAD_NONFUSED, when the following requirements are met, the cuDNN library will trigger the Tensor Core operations:

Input, filter, and output descriptors (xDesc, yDesc, wDesc, dxDesc, dyDesc and dwDesc as applicable) are of the dataType = CUDNN_DATA_HALF (that is, FP16). For FP32 data type, refer to FP32-to-FP16 Conversion.

The number of input and output feature maps (that is, channel dimension C) is a multiple of 8. When the channel dimension is not a multiple of 8, refer to Padding.

The filter is of type CUDNN_TENSOR_NCHW or CUDNN_TENSOR_NHWC.

If using a filter of type CUDNN_TENSOR_NHWC, then the input, filter, and output data pointers (X, Y, W, dX, dY, and dW as applicable) are aligned to 128-bit boundaries.

RNN Functions#

Prerequisites#

Tensor Core operations may be used in RNN functions cudnnRNNForward(), cudnnRNNBackwardData_v8(), and cudnnRNNBackwardWeights_v8() when the mathType argument in cudnnSetRNNDescriptor_v8() is set to CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION.

Supported Algorithms#

When the above prerequisites are met, the RNN functions below can be run as Tensor Core operations:

cudnnRNNForward()

cudnnRNNBackwardData_v8()

cudnnRNNBackwardWeights_v8()

List of Supported Algorithms#
Supported RNN Function	Supported Algorithm
All RNN functions that support Tensor Core operations	`CUDNN_RNN_ALGO_STANDARD` `CUDNN_RNN_ALGO_PERSIST_STATIC` `CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H`

Data and Filter Formats#

When the following requirements are met, then the cuDNN library triggers the Tensor Core operations:

For algo = CUDNN_RNN_ALGO_STANDARD:

The hidden state size, input size, and the batch size is a multiple of 8.

All user-provided tensors and the RNN weight buffer are aligned to 128-bit boundaries.

For FP16 input/output, the CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION is selected.

For FP32 input/output, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION is selected.

For algo = CUDNN_RNN_ALGO_PERSIST_STATIC:

The hidden state size and the input size is a multiple of 32.

The batch size is a multiple of 8.

If the batch size exceeds 96 (for forward training or inference) or 32 (for backward data), then the batch size constraints may be stricter, and large power-of-two batch sizes may be needed.

All user-provided tensors and the RNN weight buffer are aligned to 128-bit boundaries.

For FP16 input/output, CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION is selected.

For FP32 input/output, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION is selected.

For more information, refer to Features of RNN Functions.

Features of RNN Functions#

Refer to the following table for a list of features supported by each RNN function.

For each of these terms, the short-form versions shown in parentheses are used in the tables below for brevity: CUDNN_RNN_ALGO_STANDARD (_ALGO_STANDARD), CUDNN_RNN_ALGO_PERSIST_STATIC (_ALGO_PERSIST_STATIC), CUDNN_RNN_ALGO_PERSIST_DYNAMIC (_ALGO_PERSIST_DYNAMIC), CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H (_ALGO_SMALL_H) and CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION (_ALLOW_CONVERSION).

RNN Features#
Function	I/O Layout Supported	Supports Variable Sequence Length in Batch
`cudnnRNNForward()`	Sequence major unpacked. Batch major unpacked. Sequence major packed.	Only with `_ALGO_STANDARD`. For unpacked layout, no input sorting required. For packed layout, require input sequences descending sorted according to length.
`cudnnRNNBackwardData_v8()`	Sequence major unpacked. Batch major unpacked. Sequence major packed.	Only with `_ALGO_STANDARD`. For unpacked layout, no input sorting required. For packed layout, require input sequences descending sorted according to length.
`cudnnRNNBackwardWeights_v8()`	Sequence major unpacked. Batch major unpacked. Sequence major packed.	Only with `_ALGO_STANDARD`. For unpacked layout, no input sorting required. For packed layout, require input sequences descending sorted according to length.

Note

To use an unpacked layout, set CUDNN_RNN_PADDED_IO_ENABLED through the auxFlags argument of cudnnSetRNNDescriptor_v8().

Commonly Supported

Mode (cell type): CUDNN_RNN_RELU, CUDNN_RNN_TANH, CUDNN_LSTM, and CUDNN_GRU.
Algo: _ALGO_STANDARD, _ALGO_PERSIST_STATIC, _ALGO_PERSIST_DYNAMIC, _ALGO_SMALL_H. Do not mix different algos in cudnnRNNForward(), cudnnRNNBackwardData_v8(), and cudnnRNNBackwardWeights_v8() functions.
Math mode: CUDNN_DEFAULT_MATH, CUDNN_TENSOR_OP_MATH (will automatically fall back if run on pre-Volta or if algo doesn’t support Tensor Cores), _ALLOW_CONVERSION (may perform down conversion to utilize Tensor Cores).
Direction mode: CUDNN_UNIDIRECTIONAL, CUDNN_BIDIRECTIONAL
RNN input mode: CUDNN_LINEAR_INPUT, CUDNN_SKIP_INPUT

Supported Features By Algo#
Feature	`_ALGO_STANDARD`	`_ALGO_PERSIST_STATIC`	`_ALGO_SMALL_H`	`_ALGO_PERSIST_DYNAMIC`
Half input Single accumulation Half output	Supported Half intermediate storage Single accumulation	Supported Half intermediate storage Single accumulation	Supported Half intermediate storage Single accumulation	Supported Half intermediate storage Single accumulation
Single input Single accumulation Single output	Supported If running on Volta, with `CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION`, will down-convert and use half intermediate storage. Otherwise, single intermediate storage. Single accumulation	Supported If running on Volta, with `CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION`, will down-convert and use half intermediate storage. Otherwise, single intermediate storage. Single accumulation	Supported If running on Volta, with `CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION`, will down-convert and use half intermediate storage. Otherwise, single intermediate storage. Single accumulation	Supported If running on Volta, with `CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION`, will down-convert and use half intermediate storage. Otherwise, single intermediate storage. Single accumulation
Double input Double accumulation Double output	Supported Double intermediate storage Double accumulation	Not Supported	Not Supported	Supported Double intermediate storage Double accumulation
LSTM recurrent projection	Supported	Not Supported	Not Supported	Not Supported
LSTM cell clipping	Supported	Supported	Supported	Supported
Variable sequence length in batch	Supported	Not Supported	Not Supported	Not Supported
Tensor Cores	For half I/O, acceleration requires setting `CUDNN_TENSOR_OP_MATH` or `CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION`. Acceleration requires `inputSize` and `hiddenSize` to be a multiple of 8. For single I/O on NVIDIA Volta, NVIDIA Xavier, and NVIDIA Turing, acceleration requires setting `CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION`. Acceleration requires `inputSize` and `hiddenSize` to be a multiple of 8. For single I/O on NVIDIA Ampere, architecture acceleration requires setting `CUDNN_DEFAULT_MATH`, `CUDNN_TENSOR_OP_MATH`, or `CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION`.Acceleration requires `inputSize` and `hiddenSize` to be a multiple of 4.	For half I/O, acceleration requires setting `CUDNN_TENSOR_OP_MATH` or `CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION`. Acceleration requires `inputSize` and `hiddenSize` to be a multiple of 8. For single I/O on NVIDIA Volta, NVIDIA Xavier, and NVIDIA Turing, acceleration requires setting `CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION`. Acceleration requires `inputSize` and `hiddenSize` to be a multiple of 8. For single I/O on NVIDIA Ampere, architecture acceleration requires setting `CUDNN_DEFAULT_MATH`, `CUDNN_TENSOR_OP_MATH`, or `CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION`.Acceleration requires `inputSize` and `hiddenSize` to be a multiple of 4.	For half I/O, acceleration requires setting `CUDNN_TENSOR_OP_MATH` or `CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION`. Acceleration requires `inputSize` and `hiddenSize` to be a multiple of 8. For single I/O on NVIDIA Volta, NVIDIA Xavier, and NVIDIA Turing, acceleration requires setting `CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION`. Acceleration requires `inputSize` and `hiddenSize` to be a multiple of 8. For single I/O on NVIDIA Ampere, architecture acceleration requires setting `CUDNN_DEFAULT_MATH`, `CUDNN_TENSOR_OP_MATH`, or `CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION`.Acceleration requires `inputSize` and `hiddenSize` to be a multiple of 4.	Not Supported, will execute normally ignoring `CUDNN_TENSOR_OP_MATH` or `_ALLOW_CONVERSION`.
Other limitations		Max problem size is limited by GPU specifications.	Forward RNN: - RELU and TANH RNN: `hidden_size <= 384` - LSTM and GRU: `hidden_size <= 192` BackwardData RNN: - RELU and TANH RNN: `hidden_size <= 256` - LSTM and GRU: `hidden_size <= 128`	Requires real time compilation through NVRTC.

Tensor Transformations#

A few functions in the cuDNN library will perform transformations such as folding, padding, and NCHW-to-NHWC conversion while performing the actual function operation.

Conversion Between FP32 and FP16#

The cuDNN API Reference allows you to specify that FP32 input data may be copied and converted to FP16 data internally to use Tensor Core operations for potentially improved performance. This can be achieved by selecting CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION enum for cudnnMathType_t. In this mode, the FP32 tensors are internally down-converted to FP16, the Tensor Op math is performed, and finally up-converted to FP32 as outputs.

For Convolutions

For convolutions, the FP32-to-FP16 conversion can be achieved by passing the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION enum value to the cudnnSetConvolutionMathType() call.

// Set the math type to allow cuDNN to use Tensor Cores:
checkCudnnErr(cudnnSetConvolutionMathType(cudnnConvDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION));

For RNNs

For RNNs, the FP32-to-FP16 conversion can be achieved by assigning the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION enum value to the mathType argument of the cudnnSetRNNDescriptor_v8() call to allow FP32 input data to be down-converted for use in RNNs.

Padding#

For packed NCHW data, when the channel dimension is not a multiple of 8, then the cuDNN library will pad the tensors as needed to enable Tensor Core operations. This padding is automatic for packed NCHW data in both the CUDNN_TENSOR_OP_MATH and the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION cases.

Folding#

In the folding operation, the cuDNN library implicitly performs the formatting of input tensors and saves the input tensors in an internal workspace. This can lead to an acceleration of the call to Tensor Cores.

With folding or channel-folding, cuDNN can implicitly format the input tensors within an internal workspace to accelerate the overall calculation. Performing this transformation for the user often allows cuDNN to use kernels with restrictions on convolution stride to support a strided convolution problem.

Conversion Between NCHW And NHWC#

Tensor Cores require that the tensors be in the NHWC data layout. Conversion between NCHW and NHWC is performed when the user requests Tensor Op math. However, a request to use Tensor Cores is just that, a request and Tensor Cores may not be used in some cases. The cuDNN library converts between NCHW and NHWC if and only if Tensor Cores are requested and are actually used.

If your input (and output) are NCHW, then expect a layout change.

Non-Tensor Op convolutions will not perform conversions between NCHW and NHWC.

In very rare and difficult-to-qualify cases that are a complex function of padding and filter sizes, it is possible that Tensor Ops is not enabled. In such cases, users can pre-pad to enable the Tensor Ops path.

Mixed Precision Numerical Accuracy#

When the computation precision and the output precision are not the same, it is possible that the numerical accuracy will vary from one algorithm to the other.

For example, when the computation is performed in FP32 and the output is in FP16, the CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 (ALGO_0) has lower accuracy compared to the CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1 (ALGO_1). This is because ALGO_0 does not use extra workspace, and is forced to accumulate the intermediate results in FP16, that is, half precision float, and this reduces the accuracy. The ALGO_1, on the other hand, uses additional workspace to accumulate the intermediate values in FP32, that is, full precision float.