Legacy API
The Graph API can be thought of as a declarative API: you declare a graph, then build and run it. Previously, cuDNN offered only an imperative API, which is more convenient for basic use cases but has turned out to be overly restrictive as the deep learning field has evolved to require more operations and more complex fusions of operations. Over time, we are deprecating and removing parts of the legacy API as the graph API grows to parity with its support surface. However, a gap remains, and therefore the legacy API remains as well.
Convolution Functions
Prerequisites
For the supported GPUs, the Tensor Core operations will be triggered for convolution functions only when cudnnSetConvolutionMathType() is called on the appropriate convolution descriptor, setting the mathType to CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION.
Supported Algorithms
When the prerequisite is met, the convolution functions below can be run as Tensor Core operations:
Supported Convolution Function | Supported Algorithms
cudnnConvolutionForward() | CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM, CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED
cudnnConvolutionBackwardData() | CUDNN_CONVOLUTION_BWD_DATA_ALGO_1, CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD_NONFUSED
cudnnConvolutionBackwardFilter() | CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1, CUDNN_CONVOLUTION_BWD_FILTER_ALGO_WINOGRAD_NONFUSED
Data and Filter Formats
The cuDNN library may use padding, folding, and NCHW-to-NHWC transformations to call the Tensor Core operations. For more information, refer to Tensor Transformations.
For algorithms other than *_ALGO_WINOGRAD_NONFUSED, the cuDNN library will trigger the Tensor Core operations when the following requirements are met:
- Input, filter, and output descriptors (xDesc, yDesc, wDesc, dxDesc, dyDesc, and dwDesc as applicable) have dataType = CUDNN_DATA_HALF (that is, FP16). For the FP32 data type, refer to Conversion Between FP32 and FP16.
- The number of input and output feature maps (that is, the channel dimension C) is a multiple of 8. When the channel dimension is not a multiple of 8, refer to Padding.
- The filter is of type CUDNN_TENSOR_NCHW or CUDNN_TENSOR_NHWC.
- If using a filter of type CUDNN_TENSOR_NHWC, the input, filter, and output data pointers (X, Y, W, dX, dY, and dW as applicable) are aligned to 128-bit boundaries.
RNN Functions
Prerequisites
Tensor Core operations may be used in the RNN functions cudnnRNNForward(), cudnnRNNBackwardData_v8(), and cudnnRNNBackwardWeights_v8() when the mathType argument in cudnnSetRNNDescriptor_v8() is set to CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION.
Supported Algorithms
When the above prerequisites are met, the RNN functions below can be run as Tensor Core operations:
Supported RNN Function | Supported Algorithms
All RNN functions that support Tensor Core operations | CUDNN_RNN_ALGO_STANDARD, CUDNN_RNN_ALGO_PERSIST_STATIC
Data and Filter Formats
When the following requirements are met, then the cuDNN library triggers the Tensor Core operations:
For algo = CUDNN_RNN_ALGO_STANDARD:
- The hidden state size, input size, and batch size are each a multiple of 8.
- All user-provided tensors and the RNN weight buffer are aligned to 128-bit boundaries.
- For FP16 input/output, CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION is selected. For FP32 input/output, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION is selected.

For algo = CUDNN_RNN_ALGO_PERSIST_STATIC:
- The hidden state size and the input size are each a multiple of 32.
- The batch size is a multiple of 8.
- If the batch size exceeds 96 (for forward training or inference) or 32 (for backward data), the batch-size constraints may be stricter, and large power-of-two batch sizes may be needed.
- All user-provided tensors and the RNN weight buffer are aligned to 128-bit boundaries.
- For FP16 input/output, CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION is selected. For FP32 input/output, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION is selected.
For more information, refer to Features of RNN Functions.
Features of RNN Functions
Refer to the following table for a list of features supported by each RNN function.
For each of these terms, the short-form versions shown in parentheses are used in the tables below for brevity: CUDNN_RNN_ALGO_STANDARD (_ALGO_STANDARD), CUDNN_RNN_ALGO_PERSIST_STATIC (_ALGO_PERSIST_STATIC), CUDNN_RNN_ALGO_PERSIST_DYNAMIC (_ALGO_PERSIST_DYNAMIC), CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H (_ALGO_SMALL_H), and CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION (_ALLOW_CONVERSION).
Function | I/O Layout Supported | Supports Variable Sequence Length in Batch
cudnnRNNForward() | Sequence-major unpacked, batch-major unpacked, sequence-major packed | Only with _ALGO_STANDARD
cudnnRNNBackwardData_v8() | Sequence-major unpacked, batch-major unpacked, sequence-major packed | Only with _ALGO_STANDARD
cudnnRNNBackwardWeights_v8() | Sequence-major unpacked, batch-major unpacked, sequence-major packed | Only with _ALGO_STANDARD
Note: To use an unpacked layout, set CUDNN_RNN_PADDED_IO_ENABLED through the auxFlags argument of cudnnSetRNNDescriptor_v8().
Commonly Supported
- Mode (cell type): CUDNN_RNN_RELU, CUDNN_RNN_TANH, CUDNN_LSTM, and CUDNN_GRU.
- Algo: _ALGO_STANDARD, _ALGO_PERSIST_STATIC, _ALGO_PERSIST_DYNAMIC, _ALGO_SMALL_H. Do not mix different algos in the cudnnRNNForward(), cudnnRNNBackwardData_v8(), and cudnnRNNBackwardWeights_v8() functions.
- Math mode: CUDNN_DEFAULT_MATH; CUDNN_TENSOR_OP_MATH (will automatically fall back if run on pre-Volta GPUs or if the algo does not support Tensor Cores); _ALLOW_CONVERSION (may perform down-conversion to utilize Tensor Cores).
- Direction mode: CUDNN_UNIDIRECTIONAL, CUDNN_BIDIRECTIONAL.
- RNN input mode: CUDNN_LINEAR_INPUT, CUDNN_SKIP_INPUT.
Feature | _ALGO_STANDARD | _ALGO_PERSIST_STATIC | _ALGO_PERSIST_DYNAMIC | _ALGO_SMALL_H
LSTM recurrent projection | Supported | Not Supported | Not Supported | Not Supported
LSTM cell clipping | Supported | Supported | Supported | Supported
Variable sequence length in batch | Supported | Not Supported | Not Supported | Not Supported
Tensor Cores | Supported, with CUDNN_TENSOR_OP_MATH or _ALLOW_CONVERSION | Supported, with CUDNN_TENSOR_OP_MATH or _ALLOW_CONVERSION | Not Supported; will execute normally, ignoring CUDNN_TENSOR_OP_MATH or _ALLOW_CONVERSION | Not Supported; will execute normally, ignoring CUDNN_TENSOR_OP_MATH or _ALLOW_CONVERSION
Other limitations | — | Max problem size is limited by GPU specifications. | Requires real-time compilation through NVRTC. | —
Tensor Transformations
A few functions in the cuDNN library perform transformations such as folding, padding, and NCHW-to-NHWC conversion while performing the actual function operation.
Conversion Between FP32 and FP16
The cuDNN API allows you to specify that FP32 input data may be copied and converted to FP16 internally to use Tensor Core operations for potentially improved performance. This is achieved by selecting the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION enumerated value of cudnnMathType_t. In this mode, the FP32 tensors are internally down-converted to FP16, the Tensor Core math is performed, and the results are up-converted to FP32 as outputs.
For Convolutions
For convolutions, the FP32-to-FP16 conversion can be achieved by passing the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION enum value to the cudnnSetConvolutionMathType() call.
// Set the math type to allow cuDNN to use Tensor Cores:
checkCudnnErr(cudnnSetConvolutionMathType(cudnnConvDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION));
For RNNs
For RNNs, the FP32-to-FP16 conversion can be achieved by assigning the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION enum value to the mathType argument of the cudnnSetRNNDescriptor_v8() call, allowing FP32 input data to be down-converted for use in RNNs.
Padding
For packed NCHW data, when the channel dimension is not a multiple of 8, the cuDNN library pads the tensors as needed to enable Tensor Core operations. This padding is automatic for packed NCHW data in both the CUDNN_TENSOR_OP_MATH and the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION cases.
Folding
In the folding operation, the cuDNN library implicitly reformats the input tensors and saves them in an internal workspace, which can accelerate the call to Tensor Cores. Performing this transformation for the user often allows cuDNN to use kernels with restrictions on convolution stride to support a strided convolution problem.
Conversion Between NCHW And NHWC
Tensor Cores require that the tensors be in the NHWC data layout. Conversion between NCHW and NHWC is performed when the user requests Tensor Op math. However, a request to use Tensor Cores is just that: a request, and Tensor Cores may not be used in some cases. The cuDNN library converts between NCHW and NHWC if and only if Tensor Cores are requested and actually used.
If your input (and output) are NCHW, then expect a layout change.
Non-Tensor Op convolutions will not perform conversions between NCHW and NHWC.
In very rare and difficult-to-qualify cases that are a complex function of padding and filter sizes, it is possible that Tensor Ops are not enabled. In such cases, users can pre-pad to enable the Tensor Ops path.
Mixed Precision Numerical Accuracy
When the computation precision and the output precision are not the same, it is possible that the numerical accuracy will vary from one algorithm to another.
For example, when the computation is performed in FP32 and the output is in FP16, CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 (ALGO_0) has lower accuracy compared to CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1 (ALGO_1). This is because ALGO_0 does not use extra workspace and is forced to accumulate the intermediate results in FP16 (half-precision float), which reduces the accuracy. ALGO_1, on the other hand, uses additional workspace to accumulate the intermediate values in FP32 (full-precision float).