## Abstract

This cuDNN Developer Guide provides an overview of cuDNN 7.3.1, and details about the types, enums, and routines within the cuDNN library API.

For previously released cuDNN developer documentation, see cuDNN Archives.

## 1. Overview

NVIDIA® cuDNN is a GPU-accelerated library of primitives for deep neural networks. It provides highly tuned implementations of routines arising frequently in DNN applications:

• Convolution forward and backward, including cross-correlation
• Pooling forward and backward
• Softmax forward and backward
• Neuron activations forward and backward:
  • Rectified linear (ReLU)
  • Sigmoid
  • Hyperbolic tangent (TANH)
• Tensor transformation functions
• LRN, LCN and batch normalization forward and backward

cuDNN's convolution routines aim for performance competitive with the fastest GEMM (matrix multiply) based implementations of such routines while using significantly less memory.

cuDNN features customizable data layouts, supporting flexible dimension ordering, striding, and subregions for the 4D tensors used as inputs and outputs to all of its routines. This flexibility allows easy integration into any neural network implementation and avoids the input/output transposition steps sometimes necessary with GEMM-based convolutions.

cuDNN offers a context-based API that allows for easy multithreading and (optional) interoperability with CUDA streams.

## 2. General Description

Basic concepts are described in this chapter.

### 2.1. Programming Model

The cuDNN Library exposes a Host API but assumes that for operations using the GPU, the necessary data is directly accessible from the device.

An application using cuDNN must initialize a handle to the library context by calling cudnnCreate(). This handle is explicitly passed to every subsequent library function that operates on GPU data. Once the application finishes using cuDNN, it can release the resources associated with the library handle using cudnnDestroy().

This approach allows the user to explicitly control the library's functioning when using multiple host threads, GPUs and CUDA Streams. For example, an application can use cudaSetDevice() to associate different devices with different host threads and, in each of those host threads, use a unique cuDNN handle which directs library calls to the device associated with it. cuDNN library calls made with different handles will thus automatically run on different devices.

The device associated with a particular cuDNN context is assumed to remain unchanged between the corresponding cudnnCreate() and cudnnDestroy() calls. In order for the cuDNN library to use a different device within the same host thread, the application must set the new device to be used by calling cudaSetDevice() and then create another cuDNN context, which will be associated with the new device, by calling cudnnCreate().

cuDNN API Compatibility

Beginning in cuDNN 7, binary compatibility of patch and minor releases is maintained as follows:

• Any patch release x.y.z is forward- or backward-compatible with applications built against another cuDNN patch release x.y.w (i.e., of the same major and minor version number, but having w!=z).
• cuDNN minor releases beginning with cuDNN 7 are binary backward-compatible with applications built against the same or an earlier minor release (i.e., an app built against cuDNN 7.x is binary compatible with cuDNN library 7.y, where y>=x).
• Applications compiled with a cuDNN version 7.y are not guaranteed to work with a 7.x release when y > x.
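The three rules above can be read as a small decision procedure. The sketch below is only an illustration: `is_binary_compatible` and its `(major, minor, patch)` tuples are hypothetical names, not part of the cuDNN API, and a `False` result means "not guaranteed by the rules above", not "known to fail".

```python
def is_binary_compatible(built, runtime):
    """Return True if the rules above guarantee that an application built
    against cuDNN `built` can run with the cuDNN library `runtime`.

    Both arguments are hypothetical (major, minor, patch) tuples.
    """
    built_major, built_minor, _ = built
    run_major, run_minor, _ = runtime
    # The rules start with cuDNN 7 and say nothing about crossing major versions.
    if built_major < 7 or run_major != built_major:
        return False
    # Same minor version: any patch level is compatible in both directions.
    # Newer minor version at run time: binary backward-compatible.
    # Older minor version at run time: not guaranteed.
    return run_minor >= built_minor
```

For instance, under these assumptions an application built against 7.1.2 is covered when it loads 7.3.1, while one built against 7.3.1 is not covered when it loads 7.1.2.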

### 2.2. Convolution Formulas

This section describes the various convolution formulas implemented in cuDNN convolution functions.

The convolution terms described in the table below apply to all the convolution formulas that follow.

TABLE OF CONVOLUTION TERMS

| Term | Description |
|---|---|
| $x$ | Input (image) Tensor |
| $w$ | Weight Tensor |
| $y$ | Output Tensor |
| $n$ | Current Batch Size |
| $c$ | Current Input Channel |
| $C$ | Total Input Channels |
| $H$ | Input Image Height |
| $W$ | Input Image Width |
| $k$ | Current Output Channel |
| $K$ | Total Output Channels |
| $p$ | Current Output Height Position |
| $q$ | Current Output Width Position |
| $G$ | Group Count |
| $\mathit{pad}$ | Padding Value |
| $u$ | Vertical Subsample Stride (along Height) |
| $v$ | Horizontal Subsample Stride (along Width) |
| $\mathit{dil}_h$ | Vertical Dilation (along Height) |
| $\mathit{dil}_w$ | Horizontal Dilation (along Width) |
| $r$ | Current Filter Height |
| $R$ | Total Filter Height |
| $s$ | Current Filter Width |
| $S$ | Total Filter Width |
| $C_g$ | $\frac{C}{G}$ |
| $K_g$ | $\frac{K}{G}$ |

### Normal Convolution (using cross-correlation mode)

$y_{n,k,p,q} = \sum_{c}^{C} \sum_{r}^{R} \sum_{s}^{S} x_{n,c,p+r,q+s} \times w_{k,c,r,s}$

### Convolution with Padding

$x_{<0,\,<0} = 0$

$x_{>H,\,>W} = 0$

$y_{n,k,p,q} = \sum_{c}^{C} \sum_{r}^{R} \sum_{s}^{S} x_{n,c,p+r-\mathit{pad},q+s-\mathit{pad}} \times w_{k,c,r,s}$

### Convolution with Subsample-Striding

$y_{n,k,p,q} = \sum_{c}^{C} \sum_{r}^{R} \sum_{s}^{S} x_{n,c,(p \cdot u)+r,(q \cdot v)+s} \times w_{k,c,r,s}$

### Convolution with Dilation

$y_{n,k,p,q} = \sum_{c}^{C} \sum_{r}^{R} \sum_{s}^{S} x_{n,c,p+(r \cdot \mathit{dil}_h),q+(s \cdot \mathit{dil}_w)} \times w_{k,c,r,s}$

### Convolution using Convolution Mode

$y_{n,k,p,q} = \sum_{c}^{C} \sum_{r}^{R} \sum_{s}^{S} x_{n,c,p+r,q+s} \times w_{k,c,R-r-1,S-s-1}$

### Convolution using Grouped Convolution

$C_g = \frac{C}{G}$

$K_g = \frac{K}{G}$

$y_{n,k,p,q} = \sum_{c}^{C_g} \sum_{r}^{R} \sum_{s}^{S} x_{n,\,C_g \cdot \lfloor k/K_g \rfloor + c,\,p+r,\,q+s} \times w_{k,c,r,s}$
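To make the index arithmetic concrete, here is a slow pure-Python reference that evaluates the formulas in this section on nested lists. It is a sketch for checking small cases by hand, not a description of how cuDNN computes convolutions. Two things are assumptions of this sketch rather than statements from this section: the variants (padding, striding, dilation, grouping, mode) are combined into one loop, and the output extent is taken as floor((H + 2·pad − dil_h·(R−1) − 1) / u) + 1 (likewise for width). The function name `conv2d_reference` is hypothetical.

```python
def conv2d_reference(x, w, pad=0, u=1, v=1, dil_h=1, dil_w=1,
                     groups=1, cross_correlation=True):
    """Evaluate y[n][k][p][q] directly from the summation formulas.

    x is indexed [n][c][h][w]; w is indexed [k][c][r][s] with c in [0, C/G).
    """
    N, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    K, Cg, R, S = len(w), len(w[0]), len(w[0][0]), len(w[0][0][0])
    assert C == Cg * groups and K % groups == 0
    Kg = K // groups
    # Assumed output extent (not stated in this section).
    P = (H + 2 * pad - (dil_h * (R - 1) + 1)) // u + 1
    Q = (W + 2 * pad - (dil_w * (S - 1) + 1)) // v + 1
    y = [[[[0.0] * Q for _ in range(P)] for _ in range(K)] for _ in range(N)]
    for n in range(N):
        for k in range(K):
            c0 = Cg * (k // Kg)  # first input channel of this output's group
            for p in range(P):
                for q in range(Q):
                    acc = 0.0
                    for c in range(Cg):
                        for r in range(R):
                            for s in range(S):
                                h = p * u + r * dil_h - pad
                                col = q * v + s * dil_w - pad
                                if not (0 <= h < H and 0 <= col < W):
                                    continue  # padded region contributes zero
                                if cross_correlation:
                                    wr, ws = r, s
                                else:  # convolution mode flips the filter
                                    wr, ws = R - r - 1, S - s - 1
                                acc += x[n][c0 + c][h][col] * w[k][c][wr][ws]
                    y[n][k][p][q] = acc
    return y
```

With a 3x3 input holding 1..9 and the 2x2 filter [[1, 2], [3, 4]], the top-left output is 1·1 + 2·2 + 4·3 + 5·4 = 37 in cross-correlation mode and 1·4 + 2·3 + 4·2 + 5·1 = 23 in convolution mode.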

### 2.3. Notation

As of cuDNN v4 we have adopted a mathematically-inspired notation for layer inputs and outputs using x,y,dx,dy,b,w for common layer parameters. This was done to improve readability and ease of understanding of the parameters' meaning. All layers now follow a uniform convention that during inference

y = layerFunction(x, otherParams).

And during backpropagation

(dx, dOtherParams) = layerFunctionGradient(x, y, dy, otherParams).

For convolution the notation is

y = x*w+b

where w is the matrix of filter weights, x is the previous layer's data (during inference), y is the next layer's data, b is the bias and * is the convolution operator. In backpropagation routines the parameters keep their meanings. dx,dy,dw,db always refer to the gradient of the final network error function with respect to a given parameter. So dy in all backpropagation routines always refers to the error gradient backpropagated through the network computation graph so far. Similarly, other parameters in more specialized layers, such as dMeans or dBnBias, refer to gradients of the loss function with respect to those parameters.

Note: w is used in the API for both the width of the x tensor and the convolution filter matrix. To resolve this ambiguity we use w and filter notation interchangeably for the convolution filter weight matrix. The meaning is clear from the context since the layer width is always referenced near its height.

### 2.4. Tensor Descriptor

The cuDNN library describes data such as images, videos and any other content with a generic n-D tensor defined by the following parameters:

• a dimension dim from 3 to 8
• a data type (32-bit floating point, 64-bit floating point, 16-bit floating point...)
• dim integers defining the size of each dimension
• dim integers defining the stride of each dimension (that is, the number of elements to add to reach the next element along the same dimension)

The first two dimensions define, respectively, the batch size n and the number of feature maps c. This tensor definition allows, for example, some dimensions to overlap each other within the same tensor by having the stride of one dimension smaller than the product of the dimension and the stride of the next dimension. In cuDNN, unless specified otherwise, all routines support tensors with overlapping dimensions for forward pass input tensors; however, dimensions of the output tensors cannot overlap. Even though this tensor format supports negative strides (which can be useful for data mirroring), cuDNN routines do not support tensors with negative strides unless specified otherwise.

### 2.4.1. WXYZ Tensor Descriptor

Tensor descriptor formats are identified using acronyms, with each letter referencing a corresponding dimension. In this document, the usage of this terminology implies:

• all the strides are strictly positive
• the dimensions referenced by the letters are sorted in decreasing order of their respective strides

### 2.4.2. 4-D Tensor Descriptor

A 4-D tensor descriptor is used to define the format for batches of 2D images with 4 letters: N, C, H, W for, respectively, the batch size, the number of feature maps, the height and the width. The letters are sorted in decreasing order of the strides. The commonly used 4-D tensor formats are:

• NCHW
• NHWC
• CHWN

### 2.4.3. 5-D Tensor Description

A 5-D tensor descriptor is used to define the format for batches of 3D images with 5 letters: N, C, D, H, W for, respectively, the batch size, the number of feature maps, the depth, the height and the width. The letters are sorted in decreasing order of the strides. The commonly used 5-D tensor formats are:

• NCDHW
• NDHWC
• CDHWN

### 2.4.4. Fully-packed tensors

A tensor is defined as XYZ-fully-packed if and only if:

• the number of tensor dimensions is equal to the number of letters preceding the fully-packed suffix.
• the stride of the i-th dimension is equal to the product of the (i+1)-th dimension by the (i+1)-th stride.
• the stride of the last dimension is 1.
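The three conditions above determine the strides of a fully-packed tensor completely from its dimensions. The following sketch derives them and checks a given layout against them; the helper names are hypothetical, and dimensions are assumed to be listed in the letter order of the format (decreasing stride).

```python
def fully_packed_strides(dims):
    """Strides of a fully-packed tensor whose dimensions are `dims`."""
    strides = [1] * len(dims)          # the last dimension has stride 1
    for i in range(len(dims) - 2, -1, -1):
        # stride of dimension i = (i+1)-th dimension * (i+1)-th stride
        strides[i] = dims[i + 1] * strides[i + 1]
    return strides


def is_fully_packed(dims, strides):
    """True when `strides` equals the fully-packed strides for `dims`."""
    return list(strides) == fully_packed_strides(dims)
```

For an NCHW tensor with N=2, C=3, H=4, W=5 this gives strides 60, 20, 5 and 1.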

### 2.4.5. Partially-packed tensors

The partially 'XYZ-packed' terminology only applies in the context of a tensor format described with a superset of the letters used to define a partially-packed tensor. A WXYZ tensor is defined as XYZ-packed if and only if:

• the strides of all dimensions NOT referenced in the -packed suffix are greater than or equal to the product of the next dimension by the next stride.
• the stride of each dimension referenced in the -packed suffix in position i is equal to the product of the (i+1)-st dimension by the (i+1)-st stride.
• if the last dimension of the tensor is present in the -packed suffix, its stride is 1.

For example, an NHWC tensor that is WC-packed has c_stride equal to 1 and w_stride equal to c_dim x c_stride. In practice, the -packed suffix is usually applied to the fastest-changing dimensions of a tensor, but it is also possible to refer to an NCHW tensor that is only N-packed.
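The partially-packed conditions can be checked mechanically. In this sketch, `is_partially_packed` is a hypothetical helper; `dims` and `strides` follow the letter order of the format, and `packed` is the set of positions named in the -packed suffix. Treating the last dimension as unconstrained when it is not in the suffix is an assumption of the sketch, since it has no "next" dimension to compare against.

```python
def is_partially_packed(dims, strides, packed):
    """Check the partially-packed conditions for the positions in `packed`."""
    last = len(dims) - 1
    for i in range(len(dims)):
        if i == last:
            # Last dimension: stride must be 1 only if it is in the suffix.
            if i in packed and strides[i] != 1:
                return False
            continue
        next_extent = dims[i + 1] * strides[i + 1]
        if i in packed:
            if strides[i] != next_extent:   # packed: exact equality
                return False
        elif strides[i] < next_extent:      # not packed: at least this large
            return False
    return True
```

Take an NHWC tensor with N=2, H=4, W=5, C=3 and strides 80, 20, 3, 1. It is WC-packed, because c_stride is 1 and w_stride is 3 = c_dim x c_stride. It is not HWC-packed, because h_stride is 20 rather than 15.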

### 2.4.6. Spatially packed tensors

Spatially-packed tensors are defined as partially-packed in spatial dimensions.

For example a spatially-packed 4D tensor would mean that the tensor is either NCHW HW-packed or CNHW HW-packed.

### 2.4.7. Overlapping tensors

A tensor is defined to be overlapping if iterating over the full range of its dimensions produces the same address more than once.

In practice an overlapped tensor will have stride[i-1] < stride[i]*dim[i] for some i in the [1,nbDims] interval.
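Both statements above can be tried on small tensors. The sketch below enumerates every element offset to apply the definition directly, and separately evaluates the stride inequality. The function names are hypothetical, offsets are counted in elements, and the index i is taken over positions 1 to nbDims-1 of zero-based arrays.

```python
from itertools import product


def element_offsets(dims, strides):
    """Offset (in elements) of every index tuple of the tensor."""
    return [sum(i * s for i, s in zip(index, strides))
            for index in product(*(range(d) for d in dims))]


def is_overlapping(dims, strides):
    """Definition: some offset is produced more than once."""
    offsets = element_offsets(dims, strides)
    return len(set(offsets)) < len(offsets)


def stride_test(dims, strides):
    """Practical test: stride[i-1] < stride[i]*dim[i] for some i."""
    return any(strides[i - 1] < strides[i] * dims[i]
               for i in range(1, len(dims)))
```

A 2x3 tensor with strides (2, 1) yields the offsets 0, 1, 2, 2, 3, 4, so it overlaps. With strides (3, 1) every offset is distinct.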

### 2.5. Thread Safety

The library is thread safe and its functions can be called from multiple host threads, as long as threads do not share the same cuDNN handle simultaneously.

### 2.6. Reproducibility (determinism)

By design, most of cuDNN's routines from a given version generate the same bit-wise results across runs when executed on GPUs with the same architecture and the same number of SMs. However, bit-wise reproducibility is not guaranteed across versions, as the implementation of a given routine may change. With the current release, the following routines do not guarantee reproducibility because they use atomic operations:

• cudnnConvolutionBackwardFilter when CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 or CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3 is used
• cudnnConvolutionBackwardData when CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 is used
• cudnnPoolingBackward when CUDNN_POOLING_MAX is used
• cudnnSpatialTfSamplerBackward

### 2.7. Scaling parameters alpha and beta

Many cuDNN routines, such as cudnnConvolutionForward, take pointers to scaling factors (in host memory) that are used to blend computed values with initial values in the destination tensor as follows: dstValue = alpha[0]*computedValue + beta[0]*priorDstValue. When beta[0] is zero, the output is not read and may contain any uninitialized data (including NaN). The storage data type for alpha[0] and beta[0] is float for HALF and FLOAT tensors, and double for DOUBLE tensors. These parameters are passed using a host memory pointer.

Note: For improved performance it is advised to use beta[0] = 0.0. Use a non-zero value for beta[0] only when blending with prior values stored in the output tensor is needed.
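The blending rule, including the special handling of beta[0] = 0, can be mimicked on plain Python lists. This is only an illustrative model with a hypothetical `blend` helper; it shows why a NaN already present in the destination does not leak into the result when beta is zero, which would not hold if 0*priorDstValue were actually evaluated.

```python
def blend(computed, prior_dst, alpha, beta):
    """Model of dstValue = alpha*computedValue + beta*priorDstValue."""
    out = []
    for value, prior in zip(computed, prior_dst):
        if beta == 0.0:
            # The prior destination value is never read in this case.
            out.append(alpha * value)
        else:
            out.append(alpha * value + beta * prior)
    return out
```

For example, `blend([1.0, 2.0], [float("nan"), 10.0], 2.0, 0.0)` returns `[2.0, 4.0]`, whereas a naive evaluation of `2.0*1.0 + 0.0*nan` would produce NaN.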

### 2.8. Tensor Core Operations

cuDNN v7 introduces acceleration of compute intensive routines using Tensor Core hardware on supported GPU SM versions. Tensor Core acceleration (using Tensor Core Operations) can be exploited by the library user via the cudnnMathType_t enumerator. This enumerator specifies the available options for Tensor Core enablement and is expected to be applied on a per-routine basis.

Kernels using Tensor Core Operations are available for both Convolutions and RNNs.

The Convolution functions are:

• cudnnConvolutionForward
• cudnnConvolutionBackwardData
• cudnnConvolutionBackwardFilter

Tensor Core Operations kernels will be triggered in these paths only when:

• cudnnSetConvolutionMathType is called on the appropriate convolution descriptor setting mathType to CUDNN_TENSOR_OP_MATH.
• cudnnConvolutionForward is called using algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM or CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED; cudnnConvolutionBackwardData using algo = CUDNN_CONVOLUTION_BWD_DATA_ALGO_1 or CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD_NONFUSED; and cudnnConvolutionBackwardFilter using algo = CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1 or CUDNN_CONVOLUTION_BWD_FILTER_ALGO_WINOGRAD_NONFUSED.

For algorithms other than *_ALGO_WINOGRAD_NONFUSED, the following are some of the requirements to run Tensor Core operations:

• Input, Filter and Output descriptors (xDesc, yDesc, wDesc, dxDesc, dyDesc and dwDesc as applicable) have dataType = CUDNN_DATA_HALF.
• The number of Input and Output feature maps is a multiple of 8.
• The Filter is of type CUDNN_TENSOR_NCHW or CUDNN_TENSOR_NHWC. When using a filter of type CUDNN_TENSOR_NHWC, Input, Filter and Output data pointers (X, Y, W, dX, dY, and dW as applicable) need to be aligned to 128 bit boundaries.
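As a reading aid, the conditions listed above for cudnnConvolutionForward can be collected into one predicate. This sketch never calls cuDNN: enumerator names are plain strings, and `fwd_meets_listed_conditions` and its parameters are hypothetical. Because the text describes these as only some of the requirements, a `True` result means no listed condition is violated, not that Tensor Core kernels will run.

```python
FWD_TENSOR_OP_ALGOS = {
    "CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM",
    "CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED",
}


def fwd_meets_listed_conditions(math_type, algo, data_type, in_maps,
                                out_maps, filter_format,
                                pointers_128bit_aligned):
    """Check the forward-convolution conditions quoted in this section."""
    if math_type != "CUDNN_TENSOR_OP_MATH" or algo not in FWD_TENSOR_OP_ALGOS:
        return False
    if algo.endswith("_ALGO_WINOGRAD_NONFUSED"):
        # The additional requirements are listed only for the other algorithms.
        return True
    if data_type != "CUDNN_DATA_HALF":
        return False
    if in_maps % 8 != 0 or out_maps % 8 != 0:
        return False
    if filter_format == "CUDNN_TENSOR_NHWC":
        return bool(pointers_128bit_aligned)
    return filter_format == "CUDNN_TENSOR_NCHW"
```

Under these assumptions, a half-precision NCHW problem with 64 input and 128 output feature maps passes the check, while the same problem with 12 input feature maps does not.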

The RNN functions are:

• cudnnRNNForwardInference
• cudnnRNNForwardTraining
• cudnnRNNBackwardData
• cudnnRNNBackwardWeights
• cudnnRNNForwardInferenceEx
• cudnnRNNForwardTrainingEx
• cudnnRNNBackwardDataEx
• cudnnRNNBackwardWeightsEx

Tensor Core Operations kernels will be triggered in these paths only when:

• cudnnSetRNNMatrixMathType is called on the appropriate RNN descriptor setting mathType to CUDNN_TENSOR_OP_MATH.
• All routines are called using algo = CUDNN_RNN_ALGO_STANDARD or CUDNN_RNN_ALGO_PERSIST_STATIC. (new for 7.1)
• For algo = CUDNN_RNN_ALGO_STANDARD, Hidden State size, Input size and Batch size are all multiples of 8. (new for 7.1)
• For algo = CUDNN_RNN_ALGO_PERSIST_STATIC, Hidden State size and Input size are multiples of 32, Batch size is a multiple of 8. If Batch size exceeds 96 (forward training or inference) or 32 (backward data), Batch size constraints may be stricter and large power-of-two Batch sizes may be needed. (new for 7.1)

Note: For all cases, the CUDNN_TENSOR_OP_MATH enumerator is an indicator that the use of Tensor Cores is permissible, but not required. cuDNN may prefer not to use Tensor Core Operations (for instance, when the problem size is not suited to Tensor Core acceleration), and instead use an alternative implementation based on regular floating point operations.

### 2.8.1. Tensor Core Operations Notes

Some notes on Tensor Core Operations use in cuDNN v7 on sm_70:

Tensor Core operations are supported on the Volta GPU family. These operations perform parallel floating point accumulation of multiple floating point products. Setting the math mode to CUDNN_TENSOR_OP_MATH indicates that the library will use Tensor Core operations as mentioned previously. The default is CUDNN_DEFAULT_MATH, which indicates that Tensor Core operations will be avoided by the library. Because the default mode is a serialized operation and the Tensor Core operations are parallelized, the two might produce slightly different numerical results due to the different sequencing of operations.

Note: The library falls back to the default math mode when Tensor Core operations are not supported or not permitted.

The result of multiplying two matrices using Tensor Core Operations is very close, but not always identical, to the product achieved using some sequence of legacy scalar floating point operations. So cuDNN requires explicit user opt-in before enabling the use of Tensor Core Operations. However, experiments training common Deep Learning models show negligible difference between using Tensor Core Operations and legacy floating point paths as measured by both final network accuracy and iteration count to convergence. Consequently, the library treats both modes of operation as functionally indistinguishable, and allows for the legacy paths to serve as legitimate fallbacks for cases in which the use of Tensor Core Operations is unsuitable.

### 2.8.2. Tensor Operations Speedup Tips

Some tips on Reducing Computation Time for Tensor Core Operations:

• The computation time for FP32 tensors can be reduced by selecting CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION enum value for cudnnMathType_t. In this mode the FP32 tensors are internally down-converted to FP16, the tensor op math is performed, and finally up-converted to FP32 as outputs.
• When the input channel size c is a multiple of 32, you can use the new data type CUDNN_DATA_INT8x32 to accelerate your convolution computation. If you are already using INT8 in its INT8x4 form, then to use INT8x32 ensure that the input channel size c is a multiple of 32 rather than a multiple of 4. The CUDNN_DATA_INT8x32 data type defines the data as 32-element vectors, each element being an 8-bit signed integer.
  Note: This data type is only supported with the tensor format CUDNN_TENSOR_NCHW_VECT_C. See the description for cudnnDataType_t.
  Note: This data type can only be used with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM. See cudnnConvolutionFwdAlgo_t.
  Note: CUDNN_DATA_INT8x32 is only supported by sm_72.

### 2.9. GPU and driver requirements

cuDNN v7.0 supports NVIDIA GPUs of compute capability 3.0 and higher. For the x86_64 platform, cuDNN v7.0 comes with two deliverables: one requires an NVIDIA Driver compatible with CUDA Toolkit 8.0, the other requires an NVIDIA Driver compatible with CUDA Toolkit 9.0.

If you are using cuDNN with a Volta GPU, version 7 or later is required.

### 2.10. Backward compatibility and deprecation policy

When changing the API of an existing cuDNN function "foo" (usually to support some new functionality), first, a new routine "foo_v<n>" is created where n represents the cuDNN version where the new API is first introduced, leaving "foo" untouched. This ensures backward compatibility with the version n-1 of cuDNN. At this point, "foo" is considered deprecated, and should be treated as such by users of cuDNN. We gradually eliminate deprecated and suffixed API entries over the course of a few releases of the library per the following policy:

• In release n+1, the legacy API entry "foo" is remapped to a new API "foo_v<f>" where f is some cuDNN version anterior to n.
• Also in release n+1, the unsuffixed API entry "foo" is modified to have the same signature as "foo_v<n>". "foo_v<n>" is retained as-is.
• The deprecated former API entry with an anterior suffix _v<f> and new API entry with suffix _v<n> are maintained in this release.
• In release n+2, both suffixed entries of a given entry are removed.

As a rule of thumb, when a routine appears in two forms, one with a suffix and one with no suffix, the non-suffixed entry is to be treated as deprecated. In this case, it is strongly advised that users migrate to the new suffixed API entry to guarantee backward compatibility in the following cuDNN release. When a routine appears with multiple suffixes, the unsuffixed API entry is mapped to the higher numbered suffix. In that case it is strongly advised to use the non-suffixed API entry to guarantee backward compatibility with the following cuDNN release.

### 2.11. Grouped Convolutions

cuDNN supports grouped convolutions by setting groupCount > 1 for the convolution descriptor convDesc, using cudnnSetConvolutionGroupCount().

Note: By default the convolution descriptor convDesc is set to groupCount of 1.

Basic Idea

Conceptually, in grouped convolutions the input channels and the filter channels are split into groupCount number of independent groups, with each group having a reduced number of channels. Convolution operation is then performed separately on these input and filter groups.

For example, consider the following: the number of input channels is 4, and the number of filter channels is 12. For a normal, ungrouped convolution, the number of computation operations performed is 12*4.

If the groupCount is set to 2, then there are now two input channel groups of two input channels each, and two filter channel groups of six filter channels each.

As a result, each grouped convolution will now perform 2*6 computation operations, and two such grouped convolutions are performed. Hence the computation savings are 2x: (12*4)/(2*(2*6))
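The arithmetic of this example generalizes to any group count that divides both channel counts. The helper below is a hypothetical illustration that counts channel pairings in the same simplified sense as the text (input channels times filter channels), ignoring spatial extent.

```python
def channel_pair_ops(input_channels, filter_channels, group_count=1):
    """Count input-channel/filter-channel pairings across all groups."""
    assert input_channels % group_count == 0
    assert filter_channels % group_count == 0
    per_group = ((input_channels // group_count)
                 * (filter_channels // group_count))
    return group_count * per_group
```

With 4 input channels and 12 filter channels this gives 48 for groupCount 1 and 24 for groupCount 2, matching the 2x saving above. In this simplified count the saving factor equals the group count.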

cuDNN Grouped Convolution

• When using groupCount for grouped convolutions, you must still define all tensor descriptors so that they describe the size of the entire convolution, instead of specifying the sizes per group.
• Grouped convolutions are supported for all formats that are currently supported by the functions cudnnConvolutionForward(), cudnnConvolutionBackwardData() and cudnnConvolutionBackwardFilter().
• The tensor stridings that are set for groupCount of 1 are also valid for any group count.

Note: See Convolution Formulas for the math behind the cuDNN Grouped Convolution.

Example

Below is an example showing the dimensions and strides for grouped convolutions for NCHW format, for 2D convolution.

Note: The symbols "*" and "/" are used to indicate multiplication and division.

xDesc or dxDesc:

• Dimensions: [batch_size, input_channels, x_height, x_width]
• Strides: [input_channels*x_height*x_width, x_height*x_width, x_width, 1]

wDesc or dwDesc:

• Dimensions: [output_channels, input_channels/groupCount, w_height, w_width]
• Format: NCHW

convDesc:

• Group Count: groupCount

yDesc or dyDesc:

• Dimensions: [batch_size, output_channels, y_height, y_width]
• Strides: [output_channels*y_height*y_width, y_height*y_width, y_width, 1]
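The dimension and stride expressions in this example can be evaluated for concrete sizes. The sketch below only computes the numbers that would go into the descriptors and does not call cuDNN. `grouped_conv_descriptor_shapes` and its parameter names are hypothetical, and the output height and width are taken as inputs rather than derived.

```python
def nchw_strides(dims):
    """Packed NCHW strides: [C*H*W, H*W, W, 1]."""
    _, c, h, w = dims
    return [c * h * w, h * w, w, 1]


def grouped_conv_descriptor_shapes(batch_size, input_channels,
                                   output_channels, x_hw, w_hw, y_hw,
                                   group_count):
    """Dimensions and strides following the NCHW example above."""
    assert input_channels % group_count == 0
    x_dims = [batch_size, input_channels, x_hw[0], x_hw[1]]
    # Only the filter's channel dimension is divided by the group count.
    w_dims = [output_channels, input_channels // group_count,
              w_hw[0], w_hw[1]]
    y_dims = [batch_size, output_channels, y_hw[0], y_hw[1]]
    return {
        "xDesc": {"dims": x_dims, "strides": nchw_strides(x_dims)},
        "wDesc": {"dims": w_dims, "format": "NCHW"},
        "convDesc": {"groupCount": group_count},
        "yDesc": {"dims": y_dims, "strides": nchw_strides(y_dims)},
    }
```

For a batch of 8 with 4 input channels, 12 output channels, 32x32 inputs, 3x3 filters, 30x30 outputs and groupCount 2, the filter dimensions become [12, 2, 3, 3], while the x and y descriptors still describe the full tensors.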

### 2.12. API Logging (new for 7.1)

cuDNN API logging is a tool that records all input parameters passed into every cuDNN API function call. This functionality is disabled by default, and can be enabled through the methods described below. The log output contains variable names, data types, parameter values, device pointers, and metadata such as the time of the function call in microseconds, process ID, thread ID, cuDNN handle and CUDA stream ID. When logging is enabled, the log output will be handled by the built-in default callback function. However, the user may also write their own callback function, and use cudnnSetCallback to pass in the function pointer of their own callback function. Following is a sample output of the API log.

```
Function cudnnSetActivationDescriptor() called:
mode: type=cudnnActivationMode_t; val=CUDNN_ACTIVATION_RELU (1);
reluNanOpt: type=cudnnNanPropagation_t; val=CUDNN_NOT_PROPAGATE_NAN (0);
coef: type=double; val=1000.000000;
Time: 2017-11-21T14:14:21.366171 (0d+0h+1m+5s since start)
Process: 21264, Thread: 21264, cudnn_handle: NULL, cudnn_stream: NULL.
```

There are two methods to enable API logging.

Method 1: To enable it through environment variables, set “CUDNN_LOGINFO_DBG” to “1”, and set “CUDNN_LOGDEST_DBG” to one of the following: “stdout”, “stderr”, or a user-desired file path, e.g. “/home/userName1/log.txt”. You may include date and time conversion specifiers in the file name, like “log_%Y_%m_%d_%H_%M_%S.txt”. The conversion specifiers will be automatically replaced with the date and time when the program is initiated, like “log_2017_11_21_09_41_00.txt”. The supported conversion specifiers are similar to those of the “strftime” function. If the file already exists, the log will overwrite the existing file. Note that these environment variables are only checked once at initialization, and any later changes to them will not be effective in the current run. Also note that settings made through environment variables can be overridden by method 2 below.

| Environment variables | CUDNN_LOGINFO_DBG=0 | CUDNN_LOGINFO_DBG=1 |
|---|---|---|
| CUDNN_LOGDEST_DBG not set | No logging output; no performance loss | No logging output; no performance loss |
| CUDNN_LOGDEST_DBG=NULL | No logging output; no performance loss | No logging output; no performance loss |
| CUDNN_LOGDEST_DBG=stdout or stderr | No logging output; no performance loss | Logging to stdout or stderr; some performance loss |
| CUDNN_LOGDEST_DBG=filename.txt | No logging output; no performance loss | Logging to filename.txt; some performance loss |
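For Method 1, a launcher script can prepare the environment before starting the application, since the variables are read only once at initialization. The sketch below builds such an environment and previews how a file-name pattern would expand. Both helper names are hypothetical, and using Python's `time.strftime` for the preview is an approximation, because the text says only that the supported specifiers are similar to those of strftime.

```python
import os
import time


def logging_environment(dest, base_env=None):
    """Copy of the environment with cuDNN API logging enabled (Method 1)."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDNN_LOGINFO_DBG"] = "1"
    env["CUDNN_LOGDEST_DBG"] = dest  # "stdout", "stderr", or a file path
    return env


def preview_log_name(pattern, start_time):
    """Approximate the file name produced from a pattern with specifiers."""
    return time.strftime(pattern, start_time)
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the program that uses cuDNN. A start time of 2017-11-21 09:41:00 turns “log_%Y_%m_%d_%H_%M_%S.txt” into “log_2017_11_21_09_41_00.txt”.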

Method 2: To use API function calls to enable API logging, refer to the API description of cudnnSetCallback() and cudnnGetCallback().

### 2.13. Features of RNN Functions

The RNN functions are:

• cudnnRNNForwardInference
• cudnnRNNForwardTraining
• cudnnRNNBackwardData
• cudnnRNNBackwardWeights
• cudnnRNNForwardInferenceEx
• cudnnRNNForwardTrainingEx
• cudnnRNNBackwardDataEx
• cudnnRNNBackwardWeightsEx

See the table below for a list of features supported by each RNN function:

Note: For each of these terms, the short-form versions shown in parentheses are used in the tables below for brevity: CUDNN_RNN_ALGO_STANDARD (_ALGO_STANDARD), CUDNN_RNN_ALGO_PERSIST_STATIC (_ALGO_PERSIST_STATIC), CUDNN_RNN_ALGO_PERSIST_DYNAMIC (_ALGO_PERSIST_DYNAMIC), and CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION (_ALLOW_CONVERSION).

| Functions | Input/output layout supported | Supports variable sequence length in batch |
|---|---|---|
| cudnnRNNForwardInference, cudnnRNNForwardTraining, cudnnRNNBackwardData, cudnnRNNBackwardWeights | Only sequence major, packed (non-padded) | Only with _ALGO_STANDARD. Requires input sequences sorted in descending order of length. |
| cudnnRNNForwardInferenceEx, cudnnRNNForwardTrainingEx, cudnnRNNBackwardDataEx, cudnnRNNBackwardWeightsEx | Sequence major unpacked, batch major unpacked**, sequence major packed** | Only with _ALGO_STANDARD. For unpacked layout**, no input sorting is required. For packed layout, requires input sequences sorted in descending order of length. |

Commonly supported by all of the functions above:

• Mode (cell type) supported: CUDNN_RNN_RELU, CUDNN_RNN_TANH, CUDNN_LSTM, CUDNN_GRU
• Algo supported* (see the table below for an elaboration on these algorithms): _ALGO_STANDARD, _ALGO_PERSIST_STATIC, _ALGO_PERSIST_DYNAMIC
• Math mode supported: CUDNN_DEFAULT_MATH, CUDNN_TENSOR_OP_MATH (will automatically fall back if run on pre-Volta, or if algo doesn’t support HMMA acceleration), _ALLOW_CONVERSION (may do down conversion to utilize HMMA acceleration)
• Direction mode supported: CUDNN_UNIDIRECTIONAL, CUDNN_BIDIRECTIONAL
• RNN input mode: CUDNN_LINEAR_INPUT, CUDNN_SKIP_INPUT

* Do not mix different algos for different steps of training. It is also not recommended to mix the non-extended and extended APIs for different steps of training.

** To use the unpacked layout, the user needs to set CUDNN_RNN_PADDED_IO_ENABLED through cudnnSetRNNPaddingMode.

The following table provides the features supported by the algorithms referred to in the above table: CUDNN_RNN_ALGO_STANDARD, CUDNN_RNN_ALGO_PERSIST_STATIC, and CUDNN_RNN_ALGO_PERSIST_DYNAMIC.

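| Features | _ALGO_STANDARD | _ALGO_PERSIST_STATIC | _ALGO_PERSIST_DYNAMIC |
|---|---|---|---|
| Half input, single accumulation, half output | Supported. Half intermediate storage, single accumulation. | Supported. Half intermediate storage, single accumulation. | Supported. Half intermediate storage, single accumulation. |
| Single input, single accumulation, single output | Supported. If running on Volta with CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION!, will down-convert and use half intermediate storage. Otherwise: single intermediate storage. Single accumulation. | Supported. If running on Volta with CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION!, will down-convert and use half intermediate storage. Otherwise: single intermediate storage. Single accumulation. | Supported. If running on Volta with CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION!, will down-convert and use half intermediate storage. Otherwise: single intermediate storage. Single accumulation. |
| Double input, double accumulation, double output | Supported. Double intermediate storage, double accumulation. | Not Supported | Supported. Double intermediate storage, double accumulation. |
| LSTM recurrent projection | Supported | Not Supported | Not Supported |
| LSTM cell clipping | Supported | Supported | Supported |
| Variable sequence length in batch | Supported | Not Supported | Not Supported |
| HMMA acceleration on Volta/Xavier | Supported. For half input/output, acceleration requires setting CUDNN_TENSOR_OP_MATH! or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION!, and inputSize and hiddenSize must be multiples of 8. For single input/output, acceleration requires setting CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION!, and inputSize and hiddenSize must be multiples of 8. | Supported. For half input/output, acceleration requires setting CUDNN_TENSOR_OP_MATH! or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION!, and inputSize and hiddenSize must be multiples of 8. For single input/output, acceleration requires setting CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION!, and inputSize and hiddenSize must be multiples of 8. | Not Supported; will execute normally, ignoring CUDNN_TENSOR_OP_MATH! or _ALLOW_CONVERSION!. |
| Other limitations | | Max problem size is limited by GPU specifications. | Requires real-time compilation through NVRTC. |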

!CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION can be set through cudnnSetRNNMatrixMathType.

### 2.14. Mixed Precision Numerical Accuracy

When the computation precision and the output precision are not the same, it is possible that the numerical accuracy will vary from one algorithm to the other.

For example, when the computation is performed in FP32 and the output is in FP16, CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 ("ALGO_0") has lower accuracy compared to CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1 ("ALGO_1"). This is because ALGO_0 does not use extra workspace, and is forced to accumulate the intermediate results in FP16, i.e., half precision float, and this reduces the accuracy. ALGO_1, on the other hand, uses additional workspace to accumulate the intermediate values in FP32, i.e., full precision float.
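The effect of the accumulator's precision can be reproduced without a GPU. The sketch below is a generic illustration of rounding a running sum to half or single precision after every addition, using Python's `struct` formats `"e"` (IEEE half) and `"f"` (IEEE single). It does not model the actual cuDNN kernels, and the helper names are hypothetical.

```python
import struct


def round_to(fmt, value):
    """Round a Python float to the precision of struct format `fmt`."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def accumulate(values, fmt):
    """Sum `values`, rounding the running total after every addition."""
    total = 0.0
    for v in values:
        total = round_to(fmt, total + v)
    return total


ones = [1.0] * 4096
half_sum = accumulate(ones, "e")    # accumulator kept in FP16
single_sum = accumulate(ones, "f")  # accumulator kept in FP32
```

Here the FP16 accumulator stops growing at 2048, because adjacent half-precision values are 2 apart from that point on and adding 1 rounds back down. The FP32 accumulator reaches the exact total of 4096.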

## 3. cuDNN Datatypes Reference

This chapter describes all the types and enums of the cuDNN library API.

### 3.1. cudnnActivationDescriptor_t

cudnnActivationDescriptor_t is a pointer to an opaque structure holding the description of an activation operation. cudnnCreateActivationDescriptor() is used to create one instance, and cudnnSetActivationDescriptor() must be used to initialize this instance.

### 3.2. cudnnActivationMode_t

cudnnActivationMode_t is an enumerated type used to select the neuron activation function used in cudnnActivationForward(), cudnnActivationBackward() and cudnnConvolutionBiasActivationForward().

Values

CUDNN_ACTIVATION_SIGMOID

Selects the sigmoid function.

CUDNN_ACTIVATION_RELU

Selects the rectified linear function.

CUDNN_ACTIVATION_TANH

Selects the hyperbolic tangent function.

CUDNN_ACTIVATION_CLIPPED_RELU

Selects the clipped rectified linear function.

CUDNN_ACTIVATION_ELU

Selects the exponential linear function.

CUDNN_ACTIVATION_IDENTITY (new for 7.1)

Selects the identity function, intended for bypassing the activation step in cudnnConvolutionBiasActivationForward(). (The cudnnConvolutionBiasActivationForward() function must use CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM.) Does not work with cudnnActivationForward() or cudnnActivationBackward().
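
As a reference for what each mode computes, the scalar functions can be sketched on the host as follows. This is an illustrative sketch, not cuDNN code: `activation_forward` is a hypothetical helper, the modes are passed as plain strings, and `coef` is assumed to play the role of the clipping ceiling for the clipped ReLU and of the alpha coefficient for the ELU.

```python
import math

def activation_forward(mode, x, coef=0.0):
    """Apply one activation mode to a scalar x (host-side reference sketch)."""
    if mode == "CUDNN_ACTIVATION_SIGMOID":
        return 1.0 / (1.0 + math.exp(-x))
    if mode == "CUDNN_ACTIVATION_RELU":
        return max(x, 0.0)
    if mode == "CUDNN_ACTIVATION_TANH":
        return math.tanh(x)
    if mode == "CUDNN_ACTIVATION_CLIPPED_RELU":
        # Assumption: coef is the upper clipping threshold.
        return min(max(x, 0.0), coef)
    if mode == "CUDNN_ACTIVATION_ELU":
        # Assumption: coef is the ELU alpha coefficient.
        return x if x > 0.0 else coef * (math.exp(x) - 1.0)
    if mode == "CUDNN_ACTIVATION_IDENTITY":
        return x
    raise ValueError("unknown activation mode: " + mode)

# Example: the same input passed through two of the modes.
relu_out = activation_forward("CUDNN_ACTIVATION_RELU", -2.0)
clipped_out = activation_forward("CUDNN_ACTIVATION_CLIPPED_RELU", 10.0, coef=6.0)
print(relu_out, clipped_out)
```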

### 3.3. cudnnBatchNormMode_t

cudnnBatchNormMode_t is an enumerated type used to specify the mode of operation in cudnnBatchNormalizationForwardInference(), cudnnBatchNormalizationForwardTraining(), cudnnBatchNormalizationBackward() and cudnnDeriveBNTensorDescriptor() routines.

Values

CUDNN_BATCHNORM_PER_ACTIVATION

Normalization is performed per-activation. This mode is intended to be used after non-convolutional network layers. In this mode bnBias and bnScale tensor dimensions are 1xCxHxW.

CUDNN_BATCHNORM_SPATIAL

Normalization is performed over N+spatial dimensions. This mode is intended for use after convolutional layers (where spatial invariance is desired). In this mode bnBias, bnScale tensor dimensions are 1xCx1x1.

CUDNN_BATCHNORM_SPATIAL_PERSISTENT

This mode is similar to CUDNN_BATCHNORM_SPATIAL but it can be faster for some tasks. An optimized path may be selected for CUDNN_DATA_FLOAT and CUDNN_DATA_HALF types, compute capability 6.0 or higher, and the following two batch normalization API calls: cudnnBatchNormalizationForwardTraining() and cudnnBatchNormalizationBackward(). In the latter case, the savedMean and savedInvVariance arguments should not be NULL. This mode may use a scaled atomic integer reduction that is deterministic but imposes more restrictions on the input data range. When a numerical overflow occurs, NaN (not-a-number) and/or Inf special floating-point values are written to the output buffers. When Inf or NaN values are present in the input data, the output in this mode is the same as from a pure floating-point implementation. For finite but very large input values, the algorithm may encounter overflows more frequently due to a lower dynamic range and emit Inf or NaN values while CUDNN_BATCHNORM_SPATIAL will produce finite results. The user can invoke cudnnQueryRuntimeError() to check if a numerical overflow occurred in this mode.
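
The difference between the per-activation and spatial modes comes down to which axes the statistics are reduced over. The sketch below, which assumes an NCHW tensor stored as nested Python lists and uses hypothetical helper names, computes only the mean to show the resulting shapes (1xCxHxW versus 1xCx1x1). It is an illustration of the reduction axes, not of the full normalization.

```python
def mean_per_activation(x):
    """Mean over N only; one value per (c, h, w), i.e. shape 1xCxHxW."""
    n, c, h, w = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [[[sum(x[i][j][k][l] for i in range(n)) / n
              for l in range(w)]
             for k in range(h)]
            for j in range(c)]

def mean_spatial(x):
    """Mean over N, H and W; one value per channel, i.e. shape 1xCx1x1."""
    n, c, h, w = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [sum(x[i][j][k][l]
                for i in range(n) for k in range(h) for l in range(w)) / (n * h * w)
            for j in range(c)]

# Illustrative tensor with N=2, C=2, H=1, W=2.
x = [
    [[[1.0, 2.0]], [[3.0, 4.0]]],
    [[[5.0, 6.0]], [[7.0, 8.0]]],
]
print(mean_per_activation(x))  # C x H x W values
print(mean_spatial(x))         # C values
```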

### 3.4. cudnnCTCLossAlgo_t

cudnnCTCLossAlgo_t is an enumerated type that exposes the different algorithms available to execute the CTC loss operation.

Values

CUDNN_CTC_LOSS_ALGO_DETERMINISTIC

Results are guaranteed to be reproducible

CUDNN_CTC_LOSS_ALGO_NON_DETERMINISTIC

Results are not guaranteed to be reproducible

### 3.5. cudnnCTCLossDescriptor_t

cudnnCTCLossDescriptor_t is a pointer to an opaque structure holding the description of a CTC loss operation. cudnnCreateCTCLossDescriptor() is used to create one instance, cudnnSetCTCLossDescriptor() is used to initialize this instance, and cudnnDestroyCTCLossDescriptor() is used to destroy this instance.

### 3.6. cudnnConvolutionBwdDataAlgoPerf_t

cudnnConvolutionBwdDataAlgoPerf_t is a structure containing performance results returned by cudnnFindConvolutionBackwardDataAlgorithm() or heuristic results returned by cudnnGetConvolutionBackwardDataAlgorithm_v7().

Data Members

cudnnConvolutionBwdDataAlgo_t algo

The algorithm run to obtain the associated performance metrics.

cudnnStatus_t status

If any error occurs during the workspace allocation or timing of cudnnConvolutionBackwardData(), this status will represent that error. Otherwise, this status will be the return status of cudnnConvolutionBackwardData().

• CUDNN_STATUS_ALLOC_FAILED if any error occurred during workspace allocation or if the provided workspace is insufficient.
• CUDNN_STATUS_INTERNAL_ERROR if any error occurred during timing calculations or workspace deallocation.
• Otherwise, this will be the return status of cudnnConvolutionBackwardData().

float time

The execution time of cudnnConvolutionBackwardData() (in milliseconds).

size_t memory

The workspace size (in bytes).

cudnnDeterminism_t determinism

The determinism of the algorithm.

cudnnMathType_t mathType

The math type provided to the algorithm.

int reserved[3]

Reserved space for future properties.

### 3.7. cudnnConvolutionBwdDataAlgo_t

cudnnConvolutionBwdDataAlgo_t is an enumerated type that exposes the different algorithms available to execute the backward data convolution operation.

Values

CUDNN_CONVOLUTION_BWD_DATA_ALGO_0

This algorithm expresses the convolution as a sum of matrix products without explicitly forming the matrix that holds the input tensor data. The sum is done using atomic add operations, thus the results are non-deterministic.

CUDNN_CONVOLUTION_BWD_DATA_ALGO_1

This algorithm expresses the convolution as a matrix product without explicitly forming the matrix that holds the input tensor data. The results are deterministic.

CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT

This algorithm uses a Fast-Fourier Transform approach to compute the convolution. A significant memory workspace is needed to store intermediate results. The results are deterministic.

CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING

This algorithm uses the Fast-Fourier Transform approach but splits the inputs into tiles. A significant memory workspace is needed to store intermediate results but less than CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT for large size images. The results are deterministic.

CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD

This algorithm uses the Winograd Transform approach to compute the convolution. A reasonably sized workspace is needed to store intermediate results. The results are deterministic.

CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD_NONFUSED

This algorithm uses the Winograd Transform approach to compute the convolution. Significant workspace may be needed to store intermediate results. The results are deterministic.

### 3.8. cudnnConvolutionBwdDataPreference_t

cudnnConvolutionBwdDataPreference_t is an enumerated type used by cudnnGetConvolutionBackwardDataAlgorithm() to help the choice of the algorithm used for the backward data convolution.

Values

CUDNN_CONVOLUTION_BWD_DATA_NO_WORKSPACE

In this configuration, the routine cudnnGetConvolutionBackwardDataAlgorithm() is guaranteed to return an algorithm that does not require any extra workspace to be provided by the user.

CUDNN_CONVOLUTION_BWD_DATA_PREFER_FASTEST

In this configuration, the routine cudnnGetConvolutionBackwardDataAlgorithm() will return the fastest algorithm regardless of how much workspace is needed to execute it.

CUDNN_CONVOLUTION_BWD_DATA_SPECIFY_WORKSPACE_LIMIT

In this configuration, the routine cudnnGetConvolutionBackwardDataAlgorithm() will return the fastest algorithm that fits within the memory limit that the user provided.

### 3.9. cudnnConvolutionBwdFilterAlgoPerf_t

cudnnConvolutionBwdFilterAlgoPerf_t is a structure containing performance results returned by cudnnFindConvolutionBackwardFilterAlgorithm() or heuristic results returned by cudnnGetConvolutionBackwardFilterAlgorithm_v7().

Data Members

cudnnConvolutionBwdFilterAlgo_t algo

The algorithm run to obtain the associated performance metrics.

cudnnStatus_t status

If any error occurs during the workspace allocation or timing of cudnnConvolutionBackwardFilter(), this status will represent that error. Otherwise, this status will be the return status of cudnnConvolutionBackwardFilter().

• CUDNN_STATUS_ALLOC_FAILED if any error occurred during workspace allocation or if the provided workspace is insufficient.
• CUDNN_STATUS_INTERNAL_ERROR if any error occurred during timing calculations or workspace deallocation.
• Otherwise, this will be the return status of cudnnConvolutionBackwardFilter().

float time

The execution time of cudnnConvolutionBackwardFilter() (in milliseconds).

size_t memory

The workspace size (in bytes).

cudnnDeterminism_t determinism

The determinism of the algorithm.

cudnnMathType_t mathType

The math type provided to the algorithm.

int reserved[3]

Reserved space for future properties.

### 3.10. cudnnConvolutionBwdFilterAlgo_t

cudnnConvolutionBwdFilterAlgo_t is an enumerated type that exposes the different algorithms available to execute the backward filter convolution operation.

Values

CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0

This algorithm expresses the convolution as a sum of matrix products without explicitly forming the matrix that holds the input tensor data. The sum is done using atomic add operations, thus the results are non-deterministic.

CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1

This algorithm expresses the convolution as a matrix product without explicitly forming the matrix that holds the input tensor data. The results are deterministic.

CUDNN_CONVOLUTION_BWD_FILTER_ALGO_FFT

This algorithm uses the Fast-Fourier Transform approach to compute the convolution. Significant workspace is needed to store intermediate results. The results are deterministic.

CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3

This algorithm is similar to CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 but uses a small workspace to precompute some indices. The results are also non-deterministic.

CUDNN_CONVOLUTION_BWD_FILTER_ALGO_WINOGRAD_NONFUSED

This algorithm uses the Winograd Transform approach to compute the convolution. Significant workspace may be needed to store intermediate results. The results are deterministic.

CUDNN_CONVOLUTION_BWD_FILTER_ALGO_FFT_TILING

This algorithm uses the Fast-Fourier Transform approach to compute the convolution but splits the input tensor into tiles. Significant workspace may be needed to store intermediate results. The results are deterministic.

### 3.11. cudnnConvolutionBwdFilterPreference_t

cudnnConvolutionBwdFilterPreference_t is an enumerated type used by cudnnGetConvolutionBackwardFilterAlgorithm() to help the choice of the algorithm used for the backward filter convolution.

Values

CUDNN_CONVOLUTION_BWD_FILTER_NO_WORKSPACE

In this configuration, the routine cudnnGetConvolutionBackwardFilterAlgorithm() is guaranteed to return an algorithm that does not require any extra workspace to be provided by the user.

CUDNN_CONVOLUTION_BWD_FILTER_PREFER_FASTEST

In this configuration, the routine cudnnGetConvolutionBackwardFilterAlgorithm() will return the fastest algorithm regardless of how much workspace is needed to execute it.

CUDNN_CONVOLUTION_BWD_FILTER_SPECIFY_WORKSPACE_LIMIT

In this configuration, the routine cudnnGetConvolutionBackwardFilterAlgorithm() will return the fastest algorithm that fits within the memory limit that the user provided.

### 3.12. cudnnConvolutionDescriptor_t

cudnnConvolutionDescriptor_t is a pointer to an opaque structure holding the description of a convolution operation. cudnnCreateConvolutionDescriptor() is used to create one instance, and cudnnSetConvolutionNdDescriptor() or cudnnSetConvolution2dDescriptor() must be used to initialize this instance.

### 3.13. cudnnConvolutionFwdAlgoPerf_t

cudnnConvolutionFwdAlgoPerf_t is a structure containing performance results returned by cudnnFindConvolutionForwardAlgorithm() or heuristic results returned by cudnnGetConvolutionForwardAlgorithm_v7().

Data Members

cudnnConvolutionFwdAlgo_t algo

The algorithm run to obtain the associated performance metrics.

cudnnStatus_t status

If any error occurs during the workspace allocation or timing of cudnnConvolutionForward(), this status will represent that error. Otherwise, this status will be the return status of cudnnConvolutionForward().

• CUDNN_STATUS_ALLOC_FAILED if any error occurred during workspace allocation or if the provided workspace is insufficient.
• CUDNN_STATUS_INTERNAL_ERROR if any error occurred during timing calculations or workspace deallocation.
• Otherwise, this will be the return status of cudnnConvolutionForward().

float time

The execution time of cudnnConvolutionForward() (in milliseconds).

size_t memory

The workspace size (in bytes).

cudnnDeterminism_t determinism

The determinism of the algorithm.

cudnnMathType_t mathType

The math type provided to the algorithm.

int reserved[3]

Reserved space for future properties.

### 3.14. cudnnConvolutionFwdAlgo_t

cudnnConvolutionFwdAlgo_t is an enumerated type that exposes the different algorithms available to execute the forward convolution operation.

Values

CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM

This algorithm expresses the convolution as a matrix product without explicitly forming the matrix that holds the input tensor data.

CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM

This algorithm expresses the convolution as a matrix product without explicitly forming the matrix that holds the input tensor data, but still needs some memory workspace to precompute some indices in order to facilitate the implicit construction of the matrix that holds the input tensor data.

CUDNN_CONVOLUTION_FWD_ALGO_GEMM

This algorithm expresses the convolution as an explicit matrix product. A significant memory workspace is needed to store the matrix that holds the input tensor data.

CUDNN_CONVOLUTION_FWD_ALGO_DIRECT

This algorithm expresses the convolution as a direct convolution (i.e., without implicitly or explicitly doing a matrix multiplication).

CUDNN_CONVOLUTION_FWD_ALGO_FFT

This algorithm uses the Fast-Fourier Transform approach to compute the convolution. A significant memory workspace is needed to store intermediate results.

CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING

This algorithm uses the Fast-Fourier Transform approach but splits the inputs into tiles. A significant memory workspace is needed to store intermediate results but less than CUDNN_CONVOLUTION_FWD_ALGO_FFT for large size images.

CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD

This algorithm uses the Winograd Transform approach to compute the convolution. A reasonably sized workspace is needed to store intermediate results.

CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED

This algorithm uses the Winograd Transform approach to compute the convolution. Significant workspace may be needed to store intermediate results.

### 3.15. cudnnConvolutionFwdPreference_t

cudnnConvolutionFwdPreference_t is an enumerated type used by cudnnGetConvolutionForwardAlgorithm() to help the choice of the algorithm used for the forward convolution.

Values

CUDNN_CONVOLUTION_FWD_NO_WORKSPACE

In this configuration, the routine cudnnGetConvolutionForwardAlgorithm() is guaranteed to return an algorithm that does not require any extra workspace to be provided by the user.

CUDNN_CONVOLUTION_FWD_PREFER_FASTEST

In this configuration, the routine cudnnGetConvolutionForwardAlgorithm() will return the fastest algorithm regardless of how much workspace is needed to execute it.

CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT

In this configuration, the routine cudnnGetConvolutionForwardAlgorithm() will return the fastest algorithm that fits within the memory limit that the user provided.
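
The three preferences can be read as filters over a list of per-algorithm results. The sketch below is a host-side model of that selection logic, not the heuristic cuDNN actually uses: `AlgoPerf` is a hypothetical stand-in for the `algo`, `time` and `memory` fields of a perf structure, `choose_algo` is an illustrative helper, and the timings and workspace sizes are made-up numbers rather than measurements. The text only guarantees that the no-workspace preference returns an algorithm needing no workspace; picking the fastest such algorithm is an assumption of this sketch.

```python
from collections import namedtuple

# Hypothetical stand-in for the algo/time/memory fields of a perf result.
AlgoPerf = namedtuple("AlgoPerf", ["algo", "time_ms", "memory_bytes"])

def choose_algo(perfs, preference, memory_limit_bytes=0):
    """Pick an algorithm name from perfs according to a preference string."""
    if preference == "NO_WORKSPACE":
        candidates = [p for p in perfs if p.memory_bytes == 0]
    elif preference == "PREFER_FASTEST":
        candidates = list(perfs)
    elif preference == "SPECIFY_WORKSPACE_LIMIT":
        candidates = [p for p in perfs if p.memory_bytes <= memory_limit_bytes]
    else:
        raise ValueError("unknown preference: " + preference)
    if not candidates:
        return None
    # Assumption: among the admissible algorithms, take the fastest one.
    return min(candidates, key=lambda p: p.time_ms).algo

# Made-up numbers for illustration only; these are not measurements.
perfs = [
    AlgoPerf("IMPLICIT_GEMM", 3.0, 0),
    AlgoPerf("IMPLICIT_PRECOMP_GEMM", 2.0, 4096),
    AlgoPerf("FFT", 1.0, 1 << 20),
]

print(choose_algo(perfs, "NO_WORKSPACE"))
print(choose_algo(perfs, "PREFER_FASTEST"))
print(choose_algo(perfs, "SPECIFY_WORKSPACE_LIMIT", memory_limit_bytes=8192))
```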

### 3.16. cudnnConvolutionMode_t

cudnnConvolutionMode_t is an enumerated type used by cudnnSetConvolutionDescriptor() to configure a convolution descriptor. The filter used for the convolution can be applied in two different ways, corresponding mathematically to a convolution or to a cross-correlation. (A cross-correlation is equivalent to a convolution with its filter rotated by 180 degrees.)

Values

CUDNN_CONVOLUTION

In this mode, a convolution operation will be done when applying the filter to the images.

CUDNN_CROSS_CORRELATION

In this mode, a cross-correlation operation will be done when applying the filter to the images.
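
The relationship between the two modes can be checked numerically. The following sketch assumes a single-channel 2D image, unit stride and no padding, with hypothetical helper names. It computes both operations directly and confirms that a cross-correlation equals a convolution with the filter rotated by 180 degrees.

```python
def cross_correlate2d(image, kernel):
    """Slide the kernel over the image without flipping it."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(iw - kw + 1)]
            for y in range(ih - kh + 1)]

def convolve2d(image, kernel):
    """Slide the kernel over the image with both kernel axes reversed."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[y + i][x + j] * kernel[kh - 1 - i][kw - 1 - j]
                 for i in range(kh) for j in range(kw))
             for x in range(iw - kw + 1)]
            for y in range(ih - kh + 1)]

def rotate180(kernel):
    """Rotate a 2D kernel by 180 degrees (reverse rows and columns)."""
    return [row[::-1] for row in kernel[::-1]]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

print(cross_correlate2d(image, kernel))
print(convolve2d(image, kernel))
```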

### 3.17. cudnnDataType_t

cudnnDataType_t is an enumerated type indicating the data type to which a tensor descriptor or filter descriptor refers.

Values

CUDNN_DATA_FLOAT

The data is 32-bit single-precision floating point (float).

CUDNN_DATA_DOUBLE

The data is 64-bit double-precision floating point (double).

CUDNN_DATA_HALF

The data is 16-bit floating point.

CUDNN_DATA_INT8

The data is 8-bit signed integer.

CUDNN_DATA_UINT8 (new for 7.1)

The data is 8-bit unsigned integer.

CUDNN_DATA_INT32

The data is 32-bit signed integer.

CUDNN_DATA_INT8x4

The data is 32-bit elements, each composed of four 8-bit signed integers. This data type is only supported with the tensor format CUDNN_TENSOR_NCHW_VECT_C.

CUDNN_DATA_INT8x32

The data is 32-element vectors, each element being an 8-bit signed integer. This data type is only supported with the tensor format CUDNN_TENSOR_NCHW_VECT_C. Moreover, this data type can only be used with “algo 1,” i.e., CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM. See cudnnConvolutionFwdAlgo_t.

CUDNN_DATA_UINT8x4 (new for 7.1)

The data is 32-bit elements, each composed of four 8-bit unsigned integers. This data type is only supported with the tensor format CUDNN_TENSOR_NCHW_VECT_C.
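
To make the "four 8-bit values in one 32-bit element" idea concrete, the sketch below packs groups of four signed channel values into 32-bit words. The helper names are hypothetical, and the byte order within a word (first channel in the lowest byte, little-endian) is an assumption made for illustration; the text above does not specify it.

```python
import struct

def pack_int8x4(values):
    """Pack four signed 8-bit values into one 32-bit word.

    Assumption: the first value occupies the lowest-order byte.
    """
    if len(values) != 4:
        raise ValueError("expected exactly 4 values")
    return struct.unpack("<I", struct.pack("<4b", *values))[0]

def unpack_int8x4(word):
    """Recover the four signed 8-bit values from a 32-bit word."""
    return list(struct.unpack("<4b", struct.pack("<I", word)))

def vectorize_channels(channel_values):
    """Group one pixel's channel values (count divisible by 4) into words."""
    if len(channel_values) % 4 != 0:
        raise ValueError("channel count must be a multiple of 4")
    return [pack_int8x4(channel_values[i:i + 4])
            for i in range(0, len(channel_values), 4)]

word = pack_int8x4([1, -1, 2, -2])
print(hex(word))
print(unpack_int8x4(word))
```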

### 3.18. cudnnDeterminism_t

cudnnDeterminism_t is an enumerated type used to indicate if the computed results are deterministic (reproducible). See section 2.5 (Reproducibility) for more details on determinism.

Values

CUDNN_NON_DETERMINISTIC

Results are not guaranteed to be reproducible

CUDNN_DETERMINISTIC

Results are guaranteed to be reproducible

### 3.19. cudnnDirectionMode_t

cudnnDirectionMode_t is an enumerated type used to specify the recurrence pattern in the cudnnRNNForwardInference(), cudnnRNNForwardTraining(), cudnnRNNBackwardData() and cudnnRNNBackwardWeights() routines.

Values

CUDNN_UNIDIRECTIONAL
The network iterates recurrently from the first input to the last.
CUDNN_BIDIRECTIONAL
Each layer of the network iterates recurrently from the first input to the last and separately from the last input to the first. The outputs of the two are concatenated at each iteration, giving the output of the layer.

### 3.20. cudnnDivNormMode_t

cudnnDivNormMode_t is an enumerated type used to specify the mode of operation in cudnnDivisiveNormalizationForward() and cudnnDivisiveNormalizationBackward().

Values

CUDNN_DIVNORM_PRECOMPUTED_MEANS

The means tensor data pointer is expected to contain means or other kernel convolution values precomputed by the user. The means pointer can also be NULL, in which case it is considered to be filled with zeroes. This is equivalent to spatial LRN. Note that in the backward pass the means are treated as independent inputs and the gradient over means is computed independently. In this mode, to yield a net gradient over the entire LCN computational graph, the destDiffMeans result should be backpropagated through the user's means layer (which can be implemented using average pooling) and added to the destDiffData tensor produced by cudnnDivisiveNormalizationBackward().

### 3.21. cudnnDropoutDescriptor_t

cudnnDropoutDescriptor_t is a pointer to an opaque structure holding the description of a dropout operation. cudnnCreateDropoutDescriptor() is used to create one instance, cudnnSetDropoutDescriptor() is used to initialize this instance, cudnnDestroyDropoutDescriptor() is used to destroy this instance, cudnnGetDropoutDescriptor() is used to query fields of a previously initialized instance, and cudnnRestoreDropoutDescriptor() is used to restore an instance to a previously saved state.

### 3.22. cudnnErrQueryMode_t

cudnnErrQueryMode_t is an enumerated type passed to cudnnQueryRuntimeError() to select the remote kernel error query mode.

Values

CUDNN_ERRQUERY_RAWCODE
Read the error storage location regardless of the kernel completion status.
CUDNN_ERRQUERY_NONBLOCKING
Report if all tasks in the user stream of the cuDNN handle were completed. If that is the case, report the remote kernel error code.
CUDNN_ERRQUERY_BLOCKING
Wait for all tasks to complete in the user stream before reporting the remote kernel error code.

### 3.23. cudnnFilterDescriptor_t

cudnnFilterDescriptor_t is a pointer to an opaque structure holding the description of a filter dataset. cudnnCreateFilterDescriptor() is used to create one instance, and cudnnSetFilter4dDescriptor() or cudnnSetFilterNdDescriptor() must be used to initialize this instance.

### 3.24. cudnnHandle_t

cudnnHandle_t is a pointer to an opaque structure holding the cuDNN library context. The cuDNN library context must be created using cudnnCreate() and the returned handle must be passed to all subsequent library function calls. The context should be destroyed at the end using cudnnDestroy(). The context is associated with only one GPU device, the current device at the time of the call to cudnnCreate(). However, multiple contexts can be created on the same GPU device.

### 3.25. cudnnIndicesType_t

cudnnIndicesType_t is an enumerated type used to indicate the data type for the indices to be computed by the cudnnReduceTensor() routine. This enumerated type is used as a field for the cudnnReduceTensorDescriptor_t descriptor.

Values

CUDNN_32BIT_INDICES

Compute unsigned int indices

CUDNN_64BIT_INDICES

Compute unsigned long long indices

CUDNN_16BIT_INDICES

Compute unsigned short indices

CUDNN_8BIT_INDICES

Compute unsigned char indices

### 3.26. cudnnLRNMode_t

cudnnLRNMode_t is an enumerated type used to specify the mode of operation in cudnnLRNCrossChannelForward() and cudnnLRNCrossChannelBackward().

Values

CUDNN_LRN_CROSS_CHANNEL_DIM1

LRN computation is performed across tensor's dimension dimA[1].

### 3.27. cudnnMathType_t

cudnnMathType_t is an enumerated type used to indicate whether the use of Tensor Core Operations is permitted in a given library routine.

Values

CUDNN_DEFAULT_MATH

Tensor Core Operations are not used.

CUDNN_TENSOR_OP_MATH

The use of Tensor Core Operations is permitted.

CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION

Enables the use of FP32 tensors for both input and output.

### 3.28. cudnnNanPropagation_t

cudnnNanPropagation_t is an enumerated type used to indicate if a given routine should propagate NaN numbers. This enumerated type is used as a field for the cudnnActivationDescriptor_t descriptor and the cudnnPoolingDescriptor_t descriptor.

Values

CUDNN_NOT_PROPAGATE_NAN

NaN numbers are not propagated.

CUDNN_PROPAGATE_NAN

NaN numbers are propagated.

### 3.29. cudnnOpTensorDescriptor_t

cudnnOpTensorDescriptor_t is a pointer to an opaque structure holding the description of a tensor pointwise math operation, used as a parameter to cudnnOpTensor(). cudnnCreateOpTensorDescriptor() is used to create one instance, and cudnnSetOpTensorDescriptor() must be used to initialize this instance.

### 3.30. cudnnOpTensorOp_t

cudnnOpTensorOp_t is an enumerated type used to indicate the tensor pointwise math operation to be used by the cudnnOpTensor() routine. This enumerated type is used as a field for the cudnnOpTensorDescriptor_t descriptor.

Values

CUDNN_OP_TENSOR_ADD

The operation to be performed is addition

CUDNN_OP_TENSOR_MUL

The operation to be performed is multiplication

CUDNN_OP_TENSOR_MIN

The operation to be performed is a minimum comparison

CUDNN_OP_TENSOR_MAX

The operation to be performed is a maximum comparison

CUDNN_OP_TENSOR_SQRT

The operation to be performed is square root, performed on only the A tensor

CUDNN_OP_TENSOR_NOT

The operation to be performed is negation, performed on only the A tensor

### 3.31. cudnnPersistentRNNPlan_t

cudnnPersistentRNNPlan_t is a pointer to an opaque structure holding a plan to execute a dynamic persistent RNN. cudnnCreatePersistentRNNPlan() is used to create and initialize one instance.

### 3.32. cudnnPoolingDescriptor_t

cudnnPoolingDescriptor_t is a pointer to an opaque structure holding the description of a pooling operation. cudnnCreatePoolingDescriptor() is used to create one instance, and cudnnSetPoolingNdDescriptor() or cudnnSetPooling2dDescriptor() must be used to initialize this instance.

### 3.33. cudnnPoolingMode_t

cudnnPoolingMode_t is an enumerated type passed to cudnnSetPoolingDescriptor() to select the pooling method to be used by cudnnPoolingForward() and cudnnPoolingBackward().

Values

CUDNN_POOLING_MAX

The maximum value inside the pooling window is used.

CUDNN_POOLING_AVERAGE_COUNT_INCLUDE_PADDING

Values inside the pooling window are averaged. The number of elements used to calculate the average includes spatial locations falling in the padding region.

CUDNN_POOLING_AVERAGE_COUNT_EXCLUDE_PADDING

Values inside the pooling window are averaged. The number of elements used to calculate the average excludes spatial locations falling in the padding region.

CUDNN_POOLING_MAX_DETERMINISTIC

The maximum value inside the pooling window is used. The algorithm used is deterministic.
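
The two averaging modes differ only in the divisor used for windows that overlap the padding region. The sketch below shows this for a 1D signal; `avg_pool1d` is a hypothetical helper, and it assumes the padding is smaller than the window so that every window contains at least one real element.

```python
def avg_pool1d(x, window, stride, pad, include_padding):
    """1D average pooling with symmetric padding of width pad.

    include_padding=True divides by the full window size;
    include_padding=False divides by the number of real elements only.
    """
    if pad >= window:
        raise ValueError("this sketch assumes pad < window")
    padded = [None] * pad + list(x) + [None] * pad  # None marks padding
    out = []
    for start in range(0, len(padded) - window + 1, stride):
        win = padded[start:start + window]
        real = [v for v in win if v is not None]
        count = window if include_padding else len(real)
        out.append(sum(real) / count)
    return out

signal = [2.0, 4.0, 6.0, 8.0]
print(avg_pool1d(signal, window=2, stride=2, pad=1, include_padding=True))
print(avg_pool1d(signal, window=2, stride=2, pad=1, include_padding=False))
```

The border windows are where the two modes disagree; interior windows that touch no padding give the same value in both.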

### 3.34. cudnnRNNAlgo_t

cudnnRNNAlgo_t is an enumerated type used to specify the algorithm used in the cudnnRNNForwardInference(), cudnnRNNForwardTraining(), cudnnRNNBackwardData() and cudnnRNNBackwardWeights() routines.

Values

CUDNN_RNN_ALGO_STANDARD

Each RNN layer is executed as a sequence of operations. This algorithm is expected to have robust performance across a wide range of network parameters.

CUDNN_RNN_ALGO_PERSIST_STATIC

The recurrent parts of the network are executed using a persistent kernel approach. This method is expected to be fast when the first dimension of the input tensor is small (i.e., a small minibatch).

CUDNN_RNN_ALGO_PERSIST_STATIC is only supported on devices with compute capability >= 6.0.

CUDNN_RNN_ALGO_PERSIST_DYNAMIC

The recurrent parts of the network are executed using a persistent kernel approach. This method is expected to be fast when the first dimension of the input tensor is small (i.e., a small minibatch). When using CUDNN_RNN_ALGO_PERSIST_DYNAMIC, persistent kernels are prepared at runtime and can be optimized using the specific parameters of the network and active GPU. As such, when using CUDNN_RNN_ALGO_PERSIST_DYNAMIC, a one-time plan preparation stage must be executed. These plans can then be reused in repeated calls with the same model parameters.

The limits on the maximum number of hidden units supported when using CUDNN_RNN_ALGO_PERSIST_DYNAMIC are significantly higher than the limits when using CUDNN_RNN_ALGO_PERSIST_STATIC; however, throughput is likely to drop significantly when exceeding the maximums supported by CUDNN_RNN_ALGO_PERSIST_STATIC. In this regime, this method will still outperform CUDNN_RNN_ALGO_STANDARD in some cases.

CUDNN_RNN_ALGO_PERSIST_DYNAMIC is only supported on devices with compute capability >= 6.0 on Linux machines.

### 3.35. cudnnRNNClipMode_t

cudnnRNNClipMode_t is an enumerated type used to select the LSTM cell clipping mode. It is used with cudnnRNNSetClip(), cudnnRNNGetClip() functions, and internally within LSTM cells.

Values

CUDNN_RNN_CLIP_NONE

Disables LSTM cell clipping.

CUDNN_RNN_CLIP_MINMAX

Enables LSTM cell clipping.

### 3.36. cudnnRNNDescriptor_t

cudnnRNNDescriptor_t is a pointer to an opaque structure holding the description of an RNN operation. cudnnCreateRNNDescriptor() is used to create one instance, and cudnnSetRNNDescriptor() must be used to initialize this instance.

### 3.37. cudnnRNNDataDescriptor_t

cudnnRNNDataDescriptor_t is a pointer to an opaque structure holding the description of an RNN data set. The function cudnnCreateRNNDataDescriptor() is used to create one instance, and cudnnSetRNNDataDescriptor() must be used to initialize this instance.

### 3.38. cudnnRNNInputMode_t

cudnnRNNInputMode_t is an enumerated type used to specify the behavior of the first layer in the cudnnRNNForwardInference(), cudnnRNNForwardTraining(), cudnnRNNBackwardData() and cudnnRNNBackwardWeights() routines.

Values

CUDNN_LINEAR_INPUT
A biased matrix multiplication is performed at the input of the first recurrent layer.
CUDNN_SKIP_INPUT
No operation is performed at the input of the first recurrent layer. If CUDNN_SKIP_INPUT is used the leading dimension of the input tensor must be equal to the hidden state size of the network.

### 3.39. cudnnRNNMode_t

cudnnRNNMode_t is an enumerated type used to specify the type of network used in the cudnnRNNForwardInference(), cudnnRNNForwardTraining(), cudnnRNNBackwardData() and cudnnRNNBackwardWeights() routines.

Values

CUDNN_RNN_RELU

A single-gate recurrent neural network with a ReLU activation function.

In the forward pass, the output $h_t$ for a given iteration can be computed from the recurrent input $h_{t-1}$ and the previous layer input $x_t$, given matrices $W$, $R$ and biases $b_W$, $b_R$, from the following equation:

$$h_t = \mathrm{ReLU}(W_i x_t + R_i h_{t-1} + b_{Wi} + b_{Ri})$$

Where $\mathrm{ReLU}(x) = \max(x, 0)$.

CUDNN_RNN_TANH

A single-gate recurrent neural network with a tanh activation function.

In the forward pass, the output $h_t$ for a given iteration can be computed from the recurrent input $h_{t-1}$ and the previous layer input $x_t$, given matrices $W$, $R$ and biases $b_W$, $b_R$, from the following equation:

$$h_t = \tanh(W_i x_t + R_i h_{t-1} + b_{Wi} + b_{Ri})$$

Where $\tanh$ is the hyperbolic tangent function.

CUDNN_LSTM

A four-gate Long Short-Term Memory network with no peephole connections.

In the forward pass, the output $h_t$ and cell output $c_t$ for a given iteration can be computed from the recurrent input $h_{t-1}$, the cell input $c_{t-1}$ and the previous layer input $x_t$, given matrices $W$, $R$ and biases $b_W$, $b_R$, from the following equations:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + R_i h_{t-1} + b_{Wi} + b_{Ri}) \\
f_t &= \sigma(W_f x_t + R_f h_{t-1} + b_{Wf} + b_{Rf}) \\
o_t &= \sigma(W_o x_t + R_o h_{t-1} + b_{Wo} + b_{Ro}) \\
c'_t &= \tanh(W_c x_t + R_c h_{t-1} + b_{Wc} + b_{Rc}) \\
c_t &= f_t \circ c_{t-1} + i_t \circ c'_t \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
$$

Where $\sigma$ is the sigmoid operator, $\sigma(x) = 1 / (1 + e^{-x})$, $\circ$ represents a point-wise multiplication, and $\tanh$ is the hyperbolic tangent function. $i_t$, $f_t$, $o_t$, $c'_t$ represent the input, forget, output and new gates respectively.

CUDNN_GRU

A three-gate network consisting of Gated Recurrent Units.

In the forward pass, the output $h_t$ for a given iteration can be computed from the recurrent input $h_{t-1}$ and the previous layer input $x_t$, given matrices $W$, $R$ and biases $b_W$, $b_R$, from the following equations:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + R_i h_{t-1} + b_{Wi} + b_{Ri}) \\
r_t &= \sigma(W_r x_t + R_r h_{t-1} + b_{Wr} + b_{Rr}) \\
h'_t &= \tanh(W_h x_t + r_t \circ (R_h h_{t-1} + b_{Rh}) + b_{Wh}) \\
h_t &= (1 - i_t) \circ h'_t + i_t \circ h_{t-1}
\end{aligned}
$$

Where $\sigma$ is the sigmoid operator, $\sigma(x) = 1 / (1 + e^{-x})$, $\circ$ represents a point-wise multiplication, and $\tanh$ is the hyperbolic tangent function. $i_t$, $r_t$, $h'_t$ represent the input, reset and new gates respectively.
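
The LSTM and GRU equations above can be written out as a single time step on the host. The following is a reference sketch in plain Python, not the cuDNN implementation: the parameter dictionary layout (keys such as `"Wi"`, `"Ri"`, `"bWi"`, `"bRi"`) and the helper names are assumptions chosen to mirror the symbols in the equations, and a single layer with dense list-of-lists matrices is assumed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def gate(act, p, name, x, h_prev):
    """act(W x + R h_prev + bW + bR) for the gate called name."""
    wx = matvec(p["W" + name], x)
    rh = matvec(p["R" + name], h_prev)
    return [act(a + b + c + d)
            for a, b, c, d in zip(wx, rh, p["bW" + name], p["bR" + name])]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step; returns (h_t, c_t)."""
    i = gate(sigmoid, p, "i", x, h_prev)
    f = gate(sigmoid, p, "f", x, h_prev)
    o = gate(sigmoid, p, "o", x, h_prev)
    c_new = gate(math.tanh, p, "c", x, h_prev)
    c = [ft * cp + it * cn for ft, cp, it, cn in zip(f, c_prev, i, c_new)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

def gru_step(x, h_prev, p):
    """One GRU time step; returns h_t."""
    i = gate(sigmoid, p, "i", x, h_prev)
    r = gate(sigmoid, p, "r", x, h_prev)
    wx = matvec(p["Wh"], x)
    rh = matvec(p["Rh"], h_prev)
    # The reset gate scales only the recurrent term (R_h h_prev + b_Rh).
    h_new = [math.tanh(a + rt * (b + brh) + bwh)
             for a, rt, b, brh, bwh in zip(wx, r, rh, p["bRh"], p["bWh"])]
    return [(1.0 - it) * hn + it * hp for it, hn, hp in zip(i, h_new, h_prev)]

def zero_params(input_size, hidden_size, gates):
    """Build an all-zero parameter dictionary for the named gates."""
    p = {}
    for g in gates:
        p["W" + g] = [[0.0] * input_size for _ in range(hidden_size)]
        p["R" + g] = [[0.0] * hidden_size for _ in range(hidden_size)]
        p["bW" + g] = [0.0] * hidden_size
        p["bR" + g] = [0.0] * hidden_size
    return p

# With all-zero parameters every sigmoid gate is 0.5 and every tanh gate is 0.
lstm_h, lstm_c = lstm_step([1.0, -1.0], [0.3], [2.0], zero_params(2, 1, "ifoc"))
gru_h = gru_step([1.0, -1.0], [0.8], zero_params(2, 1, "irh"))
print(lstm_h, lstm_c, gru_h)
```

All-zero parameters make a convenient sanity check: the LSTM cell state is halved by the forget gate, and the GRU output is the previous hidden state scaled by one half.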

### 3.40. cudnnRNNPaddingMode_t

cudnnRNNPaddingMode_t is an enumerated type used to enable or disable the padded input/output.

Values

### 3.41. cudnnReduceTensorDescriptor_t

cudnnReduceTensorDescriptor_t is a pointer to an opaque structure holding the description of a tensor reduction operation, used as a parameter to cudnnReduceTensor(). cudnnCreateReduceTensorDescriptor() is used to create one instance, and cudnnSetReduceTensorDescriptor() must be used to initialize this instance.

### 3.42. cudnnReduceTensorIndices_t

cudnnReduceTensorIndices_t is an enumerated type used to indicate whether indices are to be computed by the cudnnReduceTensor() routine. This enumerated type is used as a field for the cudnnReduceTensorDescriptor_t descriptor.

Values

CUDNN_REDUCE_TENSOR_NO_INDICES

Do not compute indices

CUDNN_REDUCE_TENSOR_FLATTENED_INDICES

Compute indices. The resulting indices are relative, and flattened.

### 3.43. cudnnReduceTensorOp_t

cudnnReduceTensorOp_t is an enumerated type used to indicate the tensor reduction operation to be used by the cudnnReduceTensor() routine. This enumerated type is used as a field for the cudnnReduceTensorDescriptor_t descriptor.

Values

CUDNN_REDUCE_TENSOR_ADD

The operation to be performed is addition

CUDNN_REDUCE_TENSOR_MUL

The operation to be performed is multiplication

CUDNN_REDUCE_TENSOR_MIN

The operation to be performed is a minimum comparison

CUDNN_REDUCE_TENSOR_MAX

The operation to be performed is a maximum comparison

CUDNN_REDUCE_TENSOR_AMAX

The operation to be performed is a maximum comparison of absolute values

CUDNN_REDUCE_TENSOR_AVG

The operation to be performed is averaging

CUDNN_REDUCE_TENSOR_NORM1

The operation to be performed is addition of absolute values

CUDNN_REDUCE_TENSOR_NORM2

The operation to be performed is a square root of sum of squares

CUDNN_REDUCE_TENSOR_MUL_NO_ZEROS

The operation to be performed is multiplication, not including elements of value zero
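
For reference, each reduction can be written as a function over the list of elements being reduced. This is a host-side sketch under the assumption that the reduction is applied to a flat list of floats; the dictionary `REDUCE_OPS` and its short keys are illustrative names, not part of the cuDNN API. `math.prod` requires Python 3.8 or later.

```python
import math

# Maps a short operation name to a function over the reduced elements.
REDUCE_OPS = {
    "ADD": lambda xs: sum(xs),
    "MUL": lambda xs: math.prod(xs),
    "MIN": lambda xs: min(xs),
    "MAX": lambda xs: max(xs),
    "AMAX": lambda xs: max(abs(x) for x in xs),
    "AVG": lambda xs: sum(xs) / len(xs),
    "NORM1": lambda xs: sum(abs(x) for x in xs),
    "NORM2": lambda xs: math.sqrt(sum(x * x for x in xs)),
    "MUL_NO_ZEROS": lambda xs: math.prod(x for x in xs if x != 0),
}

def reduce_tensor(op, values):
    """Reduce a non-empty list of floats with the named operation."""
    return REDUCE_OPS[op](values)

values = [3.0, -4.0, 0.0]
for name in REDUCE_OPS:
    print(name, reduce_tensor(name, values))
```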

### 3.44. cudnnSamplerType_t

cudnnSamplerType_t is an enumerated type passed to cudnnSetSpatialTransformerNdDescriptor() to select the sampler type to be used by cudnnSpatialTfSamplerForward() and cudnnSpatialTfSamplerBackward().

Values

CUDNN_SAMPLER_BILINEAR
Selects the bilinear sampler.

### 3.45. cudnnSoftmaxAlgorithm_t

cudnnSoftmaxAlgorithm_t is used to select an implementation of the softmax function used in cudnnSoftmaxForward() and cudnnSoftmaxBackward().

Values

CUDNN_SOFTMAX_FAST

This implementation applies the straightforward softmax operation.

CUDNN_SOFTMAX_ACCURATE

This implementation scales each point of the softmax input domain by its maximum value to avoid potential floating point overflows in the softmax evaluation.

CUDNN_SOFTMAX_LOG

This entry performs the Log softmax operation, avoiding overflows by scaling each point in the input domain as in CUDNN_SOFTMAX_ACCURATE.

### 3.46. cudnnSoftmaxMode_t

cudnnSoftmaxMode_t is used to select the data over which cudnnSoftmaxForward() and cudnnSoftmaxBackward() compute their results.

Values

CUDNN_SOFTMAX_MODE_INSTANCE

The softmax operation is computed per image (N) across the dimensions C,H,W.

CUDNN_SOFTMAX_MODE_CHANNEL

The softmax operation is computed per spatial location (H,W) per image (N) across the dimension C.

### 3.47. cudnnSpatialTransformerDescriptor_t

cudnnSpatialTransformerDescriptor_t is a pointer to an opaque structure holding the description of a spatial transformation operation. cudnnCreateSpatialTransformerDescriptor() is used to create one instance, cudnnSetSpatialTransformerNdDescriptor() is used to initialize this instance, and cudnnDestroySpatialTransformerDescriptor() is used to destroy this instance.

### 3.48. cudnnStatus_t

cudnnStatus_t is an enumerated type used for function status returns. All cuDNN library functions return their status, which can be one of the following values:

Values

CUDNN_STATUS_SUCCESS

The operation completed successfully.

CUDNN_STATUS_NOT_INITIALIZED

The cuDNN library was not initialized properly. This error is usually returned when a call to cudnnCreate() fails or when cudnnCreate() has not been called prior to calling another cuDNN routine. In the former case, it is usually due to an error in the CUDA Runtime API called by cudnnCreate() or by an error in the hardware setup.

CUDNN_STATUS_ALLOC_FAILED

Resource allocation failed inside the cuDNN library. This is usually caused by an internal cudaMalloc() failure.

To correct: prior to the function call, deallocate previously allocated memory as much as possible.

CUDNN_STATUS_BAD_PARAM

An incorrect value or parameter was passed to the function.

To correct: ensure that all the parameters being passed have valid values.

CUDNN_STATUS_ARCH_MISMATCH

The function requires a feature absent from the current GPU device. Note that cuDNN only supports devices with compute capabilities greater than or equal to 3.0.

To correct: compile and run the application on a device with appropriate compute capability.

CUDNN_STATUS_MAPPING_ERROR

An access to GPU memory space failed, which is usually caused by a failure to bind a texture.

To correct: prior to the function call, unbind any previously bound textures.

Otherwise, this may indicate an internal error/bug in the library.

CUDNN_STATUS_EXECUTION_FAILED

The GPU program failed to execute. This is usually caused by a failure to launch some cuDNN kernel on the GPU, which can occur for multiple reasons.

To correct: check that the hardware, an appropriate version of the driver, and the cuDNN library are correctly installed.

Otherwise, this may indicate an internal error/bug in the library.

CUDNN_STATUS_INTERNAL_ERROR

An internal cuDNN operation failed.

CUDNN_STATUS_NOT_SUPPORTED

The functionality requested is not presently supported by cuDNN.

CUDNN_STATUS_LICENSE_ERROR

The functionality requested requires some license and an error was detected when trying to check the current licensing. This error can happen if the license is not present or is expired or if the environment variable NVIDIA_LICENSE_FILE is not set properly.

CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING

Runtime library required by RNN calls (libcuda.so or nvcuda.dll) cannot be found in predefined search paths.

CUDNN_STATUS_RUNTIME_IN_PROGRESS

Some tasks in the user stream are not completed.

CUDNN_STATUS_RUNTIME_FP_OVERFLOW

Numerical overflow occurred during the GPU kernel execution.

### 3.49. cudnnTensorDescriptor_t

cudnnTensorDescriptor_t is a pointer to an opaque structure holding the description of a generic n-D dataset. cudnnCreateTensorDescriptor() is used to create one instance, and one of the routines cudnnSetTensorNdDescriptor(), cudnnSetTensor4dDescriptor() or cudnnSetTensor4dDescriptorEx() must be used to initialize this instance.

### 3.50. cudnnTensorFormat_t

cudnnTensorFormat_t is an enumerated type used by cudnnSetTensor4dDescriptor() to create a tensor with a pre-defined layout.

Values

CUDNN_TENSOR_NCHW

This tensor format specifies that the data is laid out in the following order: batch size, feature maps, rows, columns. The strides are implicitly defined in such a way that the data are contiguous in memory with no padding between images, feature maps, rows, and columns; the columns are the inner dimension and the images are the outermost dimension.

CUDNN_TENSOR_NHWC

This tensor format specifies that the data is laid out in the following order: batch size, rows, columns, feature maps. The strides are implicitly defined in such a way that the data are contiguous in memory with no padding between images, rows, columns, and feature maps; the feature maps are the inner dimension and the images are the outermost dimension.

CUDNN_TENSOR_NCHW_VECT_C

This tensor format specifies that the data is laid out in the following order: batch size, feature maps, rows, columns. However, each element of the tensor is a vector of multiple feature maps. The length of the vector is carried by the data type of the tensor. The strides are implicitly defined in such a way that the data are contiguous in memory with no padding between images, feature maps, rows, and columns; the columns are the inner dimension and the images are the outermost dimension. This format is only supported with tensor data types CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32, and CUDNN_DATA_UINT8x4.
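To make the two scalar layouts concrete, the following sketch computes the linear element offset of the coordinate (n, c, h, w) in a fully packed tensor. The helper names are hypothetical; the formulas follow directly from the dimension orderings described above (columns innermost for NCHW, feature maps innermost for NHWC). The vectorized CUDNN_TENSOR_NCHW_VECT_C layout is not covered.

```c
#include <stddef.h>

/* Offset in a packed NCHW tensor: w varies fastest, then h, then c, then n. */
size_t ref_offset_nchw(size_t n, size_t c, size_t h, size_t w,
                       size_t C, size_t H, size_t W)
{
    return ((n * C + c) * H + h) * W + w;
}

/* Offset in a packed NHWC tensor: c varies fastest, then w, then h, then n. */
size_t ref_offset_nhwc(size_t n, size_t c, size_t h, size_t w,
                       size_t C, size_t H, size_t W)
{
    return ((n * H + h) * W + w) * C + c;
}
```

For C=3, H=4, W=5, the element (0, 1, 2, 3) sits at offset 33 in the NCHW layout and at offset 40 in the NHWC layout.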

## 4. cuDNN API Reference

This chapter describes the API of all the routines of the cuDNN library.

### 4.1. cudnnActivationBackward

```
cudnnStatus_t cudnnActivationBackward(
    cudnnHandle_t                    handle,
    cudnnActivationDescriptor_t      activationDesc,
    const void                      *alpha,
    const cudnnTensorDescriptor_t    yDesc,
    const void                      *y,
    const cudnnTensorDescriptor_t    dyDesc,
    const void                      *dy,
    const cudnnTensorDescriptor_t    xDesc,
    const void                      *x,
    const void                      *beta,
    const cudnnTensorDescriptor_t    dxDesc,
    void                            *dx)
```

This routine computes the gradient of a neuron activation function.

Note: In-place operation is allowed for this routine; i.e. dy and dx pointers may be equal. However, this requires the corresponding tensor descriptors to be identical (particularly, the strides of the input and output must match for in-place operation to be allowed).
Note: All tensor formats are supported for 4 and 5 dimensions; however, best performance is obtained when the strides of yDesc and xDesc are equal and HW-packed. For more than 5 dimensions, the tensors must have their spatial dimensions packed.

Parameters

handle

Input. Handle to a previously created cuDNN context.

activationDesc

Input. Activation descriptor.

alpha, beta

Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows: dstValue = alpha[0]*result + beta[0]*priorDstValue. Please refer to this section for additional details.

yDesc

Input. Handle to the previously initialized input tensor descriptor.

y

Input. Data pointer to GPU memory associated with the tensor descriptor yDesc.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

dy

Input. Data pointer to GPU memory associated with the tensor descriptor dyDesc.

xDesc

Input. Handle to the previously initialized output tensor descriptor.

x

Input. Data pointer to GPU memory associated with the output tensor descriptor xDesc.

dxDesc

Input. Handle to the previously initialized output differential tensor descriptor.

dx

Output. Data pointer to GPU memory associated with the output tensor descriptor dxDesc.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• The strides nStride, cStride, hStride, wStride of the input differential tensor and output differential tensors differ and in-place operation is used.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration. See the following for some examples of non-supported configurations:

• The dimensions n,c,h,w of the input tensor and output tensors differ.
• The datatype of the input tensor and output tensors differs.
• The strides nStride, cStride, hStride, wStride of the input tensor and the input differential tensor differ.
• The strides nStride, cStride, hStride, wStride of the output tensor and the output differential tensor differ.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.
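As an illustration of how the gradient and the alpha/beta blending described in the parameter list combine, here is a CPU sketch for the ReLU case only. It is an assumption-laden reference, not cuDNN's kernel: the function name is hypothetical, the tensors are treated as flat packed arrays, and the ReLU gradient is taken to pass dy through wherever the forward input x was positive.

```c
#include <stddef.h>

/* ReLU backward with blending:
 *   dx[i] = alpha * (x[i] > 0 ? dy[i] : 0) + beta * dx[i]
 * Each element is read before it is written, so dy and dx may alias. */
void ref_relu_backward(float alpha, const float *dy, const float *x,
                       float beta, float *dx, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        float grad = (x[i] > 0.0f) ? dy[i] : 0.0f;
        dx[i] = alpha * grad + beta * dx[i];
    }
}
```

Because the computation is purely element-wise, using the same buffer for dy and dx is safe in this sketch, which mirrors the in-place note above.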

### 4.2. cudnnActivationForward

```
cudnnStatus_t cudnnActivationForward(
    cudnnHandle_t                   handle,
    cudnnActivationDescriptor_t     activationDesc,
    const void                     *alpha,
    const cudnnTensorDescriptor_t   xDesc,
    const void                     *x,
    const void                     *beta,
    const cudnnTensorDescriptor_t   yDesc,
    void                           *y)
```

This routine applies a specified neuron activation function element-wise over each input value.

Note: In-place operation is allowed for this routine; i.e., the x and y pointers may be equal. However, this requires xDesc and yDesc descriptors to be identical (particularly, the strides of the input and output must match for in-place operation to be allowed).
Note: All tensor formats are supported for 4 and 5 dimensions; however, best performance is obtained when the strides of xDesc and yDesc are equal and HW-packed. For more than 5 dimensions, the tensors must have their spatial dimensions packed.

Parameters

handle

Input. Handle to a previously created cuDNN context.

activationDesc

Input. Activation descriptor.

alpha, beta

Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows: dstValue = alpha[0]*result + beta[0]*priorDstValue. Please refer to this section for additional details.

xDesc

Input. Handle to the previously initialized input tensor descriptor.

x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc.

yDesc

Input. Handle to the previously initialized output tensor descriptor.

y

Output. Data pointer to GPU memory associated with the output tensor descriptor yDesc.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• The parameter mode has an invalid enumerant value.
• The dimensions n,c,h,w of the input tensor and output tensors differ.
• The datatype of the input tensor and output tensors differs.
• The strides nStride,cStride,hStride,wStride of the input tensor and output tensors differ and in-place operation is used (i.e., x and y pointers are equal).

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.
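The element-wise application and the dstValue = alpha[0]*result + beta[0]*priorDstValue blending from the parameter list can be sketched on the CPU as follows. Only ReLU is shown, the function name is hypothetical, and the tensors are assumed to be flat packed float arrays; this is a reading aid rather than the library's implementation.

```c
#include <stddef.h>

/* ReLU forward with blending:
 *   y[i] = alpha * max(x[i], 0) + beta * y[i]
 * x and y may point to the same buffer (in-place operation). */
void ref_relu_forward(float alpha, const float *x,
                      float beta, float *y, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        float result = (x[i] > 0.0f) ? x[i] : 0.0f;
        y[i] = alpha * result + beta * y[i];
    }
}
```

With alpha = 1 and beta = 0 the output is simply the activation of the input; a non-zero beta accumulates the result into whatever y already holds.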

### 4.3. cudnnAddTensor

```
cudnnStatus_t cudnnAddTensor(
    cudnnHandle_t                     handle,
    const void                       *alpha,
    const cudnnTensorDescriptor_t     aDesc,
    const void                       *A,
    const void                       *beta,
    const cudnnTensorDescriptor_t     cDesc,
    void                             *C)
```

This function adds the scaled values of a bias tensor to another tensor. Each dimension of the bias tensor A must match the corresponding dimension of the destination tensor C or must be equal to 1. In the latter case, the same value from the bias tensor for those dimensions will be used to blend into the C tensor.

Note: Up to dimension 5, all tensor formats are supported. Beyond those dimensions, this routine is not supported.

Parameters

handle

Input. Handle to a previously created cuDNN context.

alpha, beta

Input. Pointers to scaling factors (in host memory) used to blend the source value with prior value in the destination tensor as follows: dstValue = alpha[0]*srcValue + beta[0]*priorDstValue. Please refer to this section for additional details.

aDesc

Input. Handle to a previously initialized tensor descriptor.

A

Input. Pointer to data of the tensor described by the aDesc descriptor.

cDesc

Input. Handle to a previously initialized tensor descriptor.

C

Input/Output. Pointer to data of the tensor described by the cDesc descriptor.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The function executed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

The dimensions of the bias tensor refer to an amount of data that is incompatible with the output tensor dimensions, or the dataType of the two tensor descriptors is different.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.
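The broadcasting rule described for this routine (each dimension of A either matches C or equals 1) can be illustrated with a CPU reference for packed 4D NCHW tensors. The function name and the array-of-dimensions interface are hypothetical conveniences for this sketch; the real routine takes tensor descriptors.

```c
#include <stddef.h>

/* C = alpha * A + beta * C for packed NCHW tensors.
 * aDim[k] must equal cDim[k] or be 1; a size-1 dimension of A is
 * broadcast by always reading index 0 along that dimension. */
void ref_add_tensor(float alpha, const float *A, const size_t aDim[4],
                    float beta, float *C, const size_t cDim[4])
{
    size_t n, c, h, w;
    for (n = 0; n < cDim[0]; ++n)
    for (c = 0; c < cDim[1]; ++c)
    for (h = 0; h < cDim[2]; ++h)
    for (w = 0; w < cDim[3]; ++w) {
        size_t an = (aDim[0] == 1) ? 0 : n;
        size_t ac = (aDim[1] == 1) ? 0 : c;
        size_t ah = (aDim[2] == 1) ? 0 : h;
        size_t aw = (aDim[3] == 1) ? 0 : w;
        size_t ai = ((an * aDim[1] + ac) * aDim[2] + ah) * aDim[3] + aw;
        size_t ci = ((n * cDim[1] + c) * cDim[2] + h) * cDim[3] + w;
        C[ci] = alpha * A[ai] + beta * C[ci];
    }
}
```

A typical use is a per-channel bias: A has dimensions 1xCx1x1 and every spatial location of channel c in C receives the same A value.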

### 4.4. cudnnBatchNormalizationBackward

```
cudnnStatus_t cudnnBatchNormalizationBackward(
    cudnnHandle_t                    handle,
    cudnnBatchNormMode_t             mode,
    const void                      *alphaDataDiff,
    const void                      *betaDataDiff,
    const void                      *alphaParamDiff,
    const void                      *betaParamDiff,
    const cudnnTensorDescriptor_t    xDesc,
    const void                      *x,
    const cudnnTensorDescriptor_t    dyDesc,
    const void                      *dy,
    const cudnnTensorDescriptor_t    dxDesc,
    void                            *dx,
    const cudnnTensorDescriptor_t    bnScaleBiasDiffDesc,
    const void                      *bnScale,
    void                            *resultBnScaleDiff,
    void                            *resultBnBiasDiff,
    double                           epsilon,
    const void                      *savedMean,
    const void                      *savedInvVariance)
```

This function performs the backward BatchNormalization layer computation.

Note: Only 4D and 5D tensors are supported.
Note: The epsilon value has to be the same during training, backpropagation and inference.
Note: Performance is much higher when HW-packed tensors are used for all of x, dy, dx.

Parameters

handle

Handle to a previously created cuDNN library descriptor.

mode

Mode of operation (spatial or per-activation). cudnnBatchNormMode_t

alphaDataDiff, betaDataDiff

Inputs. Pointers to scaling factors (in host memory) used to blend the gradient output dx with a prior value in the destination tensor as follows: dstValue = alpha[0]*resultValue + beta[0]*priorDstValue. Please refer to this section for additional details.

alphaParamDiff, betaParamDiff

Inputs. Pointers to scaling factors (in host memory) used to blend the gradient outputs resultBnScaleDiff and resultBnBiasDiff with prior values in the destination tensor as follows: dstValue = alpha[0]*resultValue + beta[0]*priorDstValue. Please refer to this section for additional details.

xDesc, x, dyDesc, dy, dxDesc, dx

Tensor descriptors and pointers in device memory for the layer's x data, backpropagated differential dy (inputs) and resulting differential with respect to x, dx (output).

bnScaleBiasDiffDesc

Shared tensor descriptor for all the 5 tensors below in the argument list (bnScale, resultBnScaleDiff, resultBnBiasDiff, savedMean, savedInvVariance). The dimensions for this tensor descriptor are dependent on normalization mode. Note: The data type of this tensor descriptor must be 'float' for FP16 and FP32 input tensors, and 'double' for FP64 input tensors.

bnScale

Input. Pointer in device memory for the batch normalization scale parameter (in the original paper scale is referred to as gamma). Note that the bnBias parameter is not needed for this layer's computation.

resultBnScaleDiff, resultBnBiasDiff

Outputs. Pointers in device memory for the resulting scale and bias differentials computed by this routine. Note that scale and bias gradients are not backpropagated below this layer (since they are dead-end computation DAG nodes).

epsilon

Epsilon value used in batch normalization formula. Minimum allowed value is CUDNN_BN_MIN_EPSILON defined in cudnn.h. Same epsilon value should be used in forward and backward functions.

savedMean, savedInvVariance

Inputs. Optional cache parameters containing saved intermediate results computed during the forward pass. For this to work correctly, the layer's x, bnScale and bnBias data must remain unchanged until the backward function is called. Note that both of these parameters can be NULL but only at the same time. It is recommended to use this cache since the memory overhead is relatively small.

Possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• Any of the pointers alpha, beta, x, dy, dx, bnScale, resultBnScaleDiff, resultBnBiasDiff is NULL.
• Number of xDesc or dyDesc or dxDesc tensor descriptor dimensions is not within the [4,5] range.
• bnScaleBiasDiffDesc dimensions are not 1xC(x1)x1x1 for spatial or 1xC(xD)xHxW for per-activation mode (parentheses for 5D).
• Exactly one of savedMean, savedInvVariance pointers is NULL.
• epsilon value is less than CUDNN_BN_MIN_EPSILON
• Dimensions or data types mismatch for any pair of xDesc, dyDesc, dxDesc

### 4.5. cudnnBatchNormalizationForwardInference

```
cudnnStatus_t cudnnBatchNormalizationForwardInference(
    cudnnHandle_t                    handle,
    cudnnBatchNormMode_t             mode,
    const void                      *alpha,
    const void                      *beta,
    const cudnnTensorDescriptor_t    xDesc,
    const void                      *x,
    const cudnnTensorDescriptor_t    yDesc,
    void                            *y,
    const cudnnTensorDescriptor_t    bnScaleBiasMeanVarDesc,
    const void                      *bnScale,
    const void                      *bnBias,
    const void                      *estimatedMean,
    const void                      *estimatedVariance,
    double                           epsilon)
```

This function performs the forward BatchNormalization layer computation for inference phase. This layer is based on the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", S. Ioffe, C. Szegedy, 2015.

Note: Only 4D and 5D tensors are supported.
Note: The input transformation performed by this function is defined as: y := alpha*(bnScale*(x - estimatedMean)/sqrt(epsilon + estimatedVariance) + bnBias) + beta*y
Note: The epsilon value has to be the same during training, backpropagation and inference.
Note: For training phase use cudnnBatchNormalizationForwardTraining.
Note: Performance is much higher when HW-packed tensors are used for both x and y.

Parameters

handle

Input. Handle to a previously created cuDNN library descriptor.

mode

Input. Mode of operation (spatial or per-activation). cudnnBatchNormMode_t

alpha, beta

Inputs. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows: dstValue = alpha[0]*resultValue + beta[0]*priorDstValue. Please refer to this section for additional details.

xDesc, yDesc, x, y

Tensor descriptors and pointers in device memory for the layer's x and y data.

bnScaleBiasMeanVarDesc, bnScale, bnBias

Inputs. Tensor descriptor and pointers in device memory for the batch normalization scale and bias parameters (in the original paper bias is referred to as beta and scale as gamma).

estimatedMean, estimatedVariance

Inputs. Mean and variance tensors (these have the same descriptor as the bias and scale). It is suggested that resultRunningMean, resultRunningVariance from the cudnnBatchNormalizationForwardTraining call accumulated during the training phase are passed as inputs here.

epsilon

Input. Epsilon value used in the batch normalization formula. Minimum allowed value is CUDNN_BN_MIN_EPSILON defined in cudnn.h.

Possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• One of the pointers alpha, beta, x, y, bnScale, bnBias, estimatedMean, estimatedVariance is NULL.
• Number of xDesc or yDesc tensor descriptor dimensions is not within the [4,5] range.
• bnScaleBiasMeanVarDesc dimensions are not 1xC(x1)x1x1 for spatial or 1xC(xD)xHxW for per-activation mode (parentheses for 5D).
• epsilon value is less than CUDNN_BN_MIN_EPSILON
• Dimensions or data types mismatch for xDesc, yDesc

### 4.6. cudnnBatchNormalizationForwardTraining

```
cudnnStatus_t cudnnBatchNormalizationForwardTraining(
    cudnnHandle_t                    handle,
    cudnnBatchNormMode_t             mode,
    const void                      *alpha,
    const void                      *beta,
    const cudnnTensorDescriptor_t    xDesc,
    const void                      *x,
    const cudnnTensorDescriptor_t    yDesc,
    void                            *y,
    const cudnnTensorDescriptor_t    bnScaleBiasMeanVarDesc,
    const void                      *bnScale,
    const void                      *bnBias,
    double                           exponentialAverageFactor,
    void                            *resultRunningMean,
    void                            *resultRunningVariance,
    double                           epsilon,
    void                            *resultSaveMean,
    void                            *resultSaveInvVariance)
```

This function performs the forward BatchNormalization layer computation for training phase.

Note: Only 4D and 5D tensors are supported.
Note: The epsilon value has to be the same during training, backpropagation and inference.
Note: For inference phase use cudnnBatchNormalizationForwardInference.
Note: Performance is much higher when HW-packed tensors are used for both x and y.

Parameters

handle

Handle to a previously created cuDNN library descriptor.

mode

Mode of operation (spatial or per-activation). cudnnBatchNormMode_t

alpha, beta

Inputs. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows: dstValue = alpha[0]*resultValue + beta[0]*priorDstValue. Please refer to this section for additional details.

xDesc, yDesc, x, y

Tensor descriptors and pointers in device memory for the layer's x and y data.

bnScaleBiasMeanVarDesc

Shared tensor descriptor for all the 6 tensors below in the argument list. The dimensions for this tensor descriptor are dependent on the normalization mode.

bnScale, bnBias

Inputs. Pointers in device memory for the batch normalization scale and bias parameters (in the original paper bias is referred to as beta and scale as gamma). Note that the bnBias parameter can replace the previous layer's bias parameter for improved efficiency.

exponentialAverageFactor

Input. Factor used in the moving average computation runningMean = newMean*factor + runningMean*(1-factor). Use factor = 1/(1+n) at the n-th call to the function (counting from n = 0) to get Cumulative Moving Average (CMA) behavior CMA[n] = (x[1]+...+x[n])/n. This follows because CMA[n+1] = (n*CMA[n]+x[n+1])/(n+1) = ((n+1)*CMA[n]-CMA[n])/(n+1) + x[n+1]/(n+1) = CMA[n]*(1-1/(n+1)) + x[n+1]*1/(n+1).

resultRunningMean, resultRunningVariance

Inputs/Outputs. Running mean and variance tensors (these have the same descriptor as the bias and scale). Both of these pointers can be NULL but only at the same time. The value stored in resultRunningVariance (or passed as an input in inference mode) is the sample variance, and is the moving average of variance[x] where variance is computed either over batch or spatial+batch dimensions depending on the mode. If these pointers are not NULL, the tensors should be initialized to some reasonable values or to 0.

epsilon

Epsilon value used in the batch normalization formula. Minimum allowed value is CUDNN_BN_MIN_EPSILON defined in cudnn.h. Same epsilon value should be used in forward and backward functions.

resultSaveMean, resultSaveInvVariance
Outputs. Optional cache to save intermediate results computed during the forward pass. These buffers can be used to speed up the backward pass when supplied to the cudnnBatchNormalizationBackward() function. The intermediate results stored in resultSaveMean and resultSaveInvVariance buffers should not be used directly by the user. Depending on the batch normalization mode, the results stored in resultSaveInvVariance may vary. For the cache to work correctly, the input layer data must remain unchanged until the backward function is called. Note that both parameters can be NULL but only at the same time. In such a case intermediate statistics will not be saved, and cudnnBatchNormalizationBackward() will have to re-compute them. It is recommended to use this cache as the memory overhead is relatively small because these tensors have a much lower product of dimensions than the data tensors.

Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• One of the pointers alpha, beta, x, y, bnScale, bnBias is NULL.
• Number of xDesc or yDesc tensor descriptor dimensions is not within the [4,5] range.
• bnScaleBiasMeanVarDesc dimensions are not 1xC(x1)x1x1 for spatial or 1xC(xD)xHxW for per-activation mode (parentheses for 5D).
• Exactly one of resultSaveMean, resultSaveInvVariance pointers is NULL.
• Exactly one of resultRunningMean, resultRunningVariance pointers is NULL.
• epsilon value is less than CUDNN_BN_MIN_EPSILON
• Dimensions or data types mismatch for xDesc, yDesc
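The claim in the exponentialAverageFactor description, that a factor of 1/(1+n) turns the moving-average update into a cumulative moving average, can be checked with a few lines of plain C. The helper names are hypothetical, and n is taken to count the calls already made (0 for the first call), which is the reading the derivation in that description supports.

```c
#include <stddef.h>

/* One update step: running = newValue*factor + running*(1 - factor). */
double ref_running_update(double running, double newValue, double factor)
{
    return newValue * factor + running * (1.0 - factor);
}

/* Feed `count` samples, using factor = 1/(1+n) on call n (n = 0, 1, ...).
 * With this schedule the result equals the arithmetic mean of the samples,
 * regardless of the initial value of `running`. */
double ref_cumulative_average(const double *samples, size_t count)
{
    double running = 0.0;
    size_t n;
    for (n = 0; n < count; ++n) {
        double factor = 1.0 / (1.0 + (double)n);
        running = ref_running_update(running, samples[n], factor);
    }
    return running;
}
```

Note that the first call uses factor = 1, so any initial content of the running tensors is overwritten; a constant factor would instead give an exponentially weighted average.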

### 4.7. cudnnCTCLoss

```
cudnnStatus_t cudnnCTCLoss(
    cudnnHandle_t                        handle,
    const   cudnnTensorDescriptor_t      probsDesc,
    const   void                        *probs,
    const   int                         *labels,
    const   int                         *labelLengths,
    const   int                         *inputLengths,
    void                                *costs,
    const   cudnnTensorDescriptor_t      gradientsDesc,
    const   void                        *gradients,
    cudnnCTCLossAlgo_t                   algo,
    const   cudnnCTCLossDescriptor_t     ctcLossDesc,
    void                                *workspace,
    size_t                              *workSpaceSizeInBytes)
```

This function returns the CTC costs and gradients, given the probabilities and labels.

Parameters

handle

Input. Handle to a previously created cuDNN context.

probsDesc

Input. Handle to the previously initialized probabilities tensor descriptor.

probs

Input. Pointer to a previously initialized probabilities tensor.

labels

Input. Pointer to a previously initialized labels list.

labelLengths

Input. Pointer to a previously initialized lengths list, to walk the above labels list.

inputLengths

Input. Pointer to a previously initialized list of the lengths of the timing steps in each batch.

costs

Output. Pointer to the computed costs of CTC.

gradientsDesc

Input. Handle to a previously initialized gradients tensor descriptor.

gradients

Output. Pointer to the computed gradients of CTC.

algo

Input. Enumerant that specifies the chosen CTC loss algorithm.

ctcLossDesc

Input. Handle to the previously initialized CTC loss descriptor.

workspace

Input. Pointer to GPU memory of a workspace needed to be able to execute the specified algorithm.

workSpaceSizeInBytes

Input. Amount of GPU memory needed as workspace to be able to execute the CTC loss computation with the specified algo.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• The dimensions of probsDesc do not match the dimensions of gradientsDesc.
• The inputLengths do not agree with the first dimension of probsDesc.
• The workSpaceSizeInBytes is not sufficient.
• The labelLengths is greater than 256.

CUDNN_STATUS_NOT_SUPPORTED

A compute or data type other than FLOAT was chosen, or an unknown algorithm type was chosen.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 4.8. cudnnConvolutionBackwardBias

```
cudnnStatus_t cudnnConvolutionBackwardBias(
    cudnnHandle_t                    handle,
    const void                      *alpha,
    const cudnnTensorDescriptor_t    dyDesc,
    const void                      *dy,
    const void                      *beta,
    const cudnnTensorDescriptor_t    dbDesc,
    void                            *db)
```

This function computes the convolution function gradient with respect to the bias, which is the sum of every element belonging to the same feature map across all of the images of the input tensor. Therefore, the number of elements produced is equal to the number of feature maps of the input tensor.

Parameters

handle

Input. Handle to a previously created cuDNN context.

alpha, beta

Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows: dstValue = alpha[0]*result + beta[0]*priorDstValue. Please refer to this section for additional details.

dyDesc

Input. Handle to the previously initialized input tensor descriptor.

dy

Input. Data pointer to GPU memory associated with the tensor descriptor dyDesc.

dbDesc

Input. Handle to the previously initialized output tensor descriptor.

db

Output. Data pointer to GPU memory associated with the output tensor descriptor dbDesc.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The operation was launched successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• One of the parameters n,height,width of the output tensor is not 1.
• The numbers of feature maps of the input tensor and output tensor differ.
• The dataType of the two tensor descriptors is different.
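The bias gradient described for this routine is a per-feature-map sum, which the following CPU sketch spells out for a packed NCHW float tensor. The function name is hypothetical, and the alpha/beta blending follows the dstValue formula given in the parameter list.

```c
#include <stddef.h>

/* db[c] = alpha * (sum over n and all spatial positions of dy[n][c][...])
 *         + beta * db[c]
 * dy is packed NCHW with HW = height * width; db holds C entries. */
void ref_conv_backward_bias(float alpha, const float *dy,
                            size_t N, size_t C, size_t HW,
                            float beta, float *db)
{
    size_t n, c, s;
    for (c = 0; c < C; ++c) {
        float sum = 0.0f;
        for (n = 0; n < N; ++n)
            for (s = 0; s < HW; ++s)
                sum += dy[(n * C + c) * HW + s];
        db[c] = alpha * sum + beta * db[c];
    }
}
```

The output therefore has one value per feature map, matching the 1xCx1x1 shape that the error conditions above require of dbDesc.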

### 4.9. cudnnConvolutionBackwardData

```
cudnnStatus_t cudnnConvolutionBackwardData(
    cudnnHandle_t                       handle,
    const void                         *alpha,
    const cudnnFilterDescriptor_t       wDesc,
    const void                         *w,
    const cudnnTensorDescriptor_t       dyDesc,
    const void                         *dy,
    const cudnnConvolutionDescriptor_t  convDesc,
    cudnnConvolutionBwdDataAlgo_t       algo,
    void                               *workSpace,
    size_t                              workSpaceSizeInBytes,
    const void                         *beta,
    const cudnnTensorDescriptor_t       dxDesc,
    void                               *dx)
```

This function computes the convolution data gradient of the tensor dy, where y is the output of the forward convolution in cudnnConvolutionForward(). It uses the specified algo, and returns the results in the output tensor dx. Scaling factors alpha and beta can be used to scale the computed result or accumulate with the current dx.

Parameters

handle

Input. Handle to a previously created cuDNN context.

alpha, beta

Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows: dstValue = alpha[0]*result + beta[0]*priorDstValue. Refer to this section for additional details.

wDesc

Input. Handle to a previously initialized filter descriptor.

w

Input. Data pointer to GPU memory associated with the filter descriptor wDesc.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

dy

Input. Data pointer to GPU memory associated with the input differential tensor descriptor dyDesc.

convDesc

Input. Previously initialized convolution descriptor.

algo

Input. Enumerant that specifies which backward data convolution algorithm should be used to compute the results.

workSpace

Input. Data pointer to GPU memory to a workspace needed to be able to execute the specified algorithm. If no workspace is needed for a particular algorithm, that pointer can be NULL.

workSpaceSizeInBytes

Input. Specifies the size in bytes of the provided workSpace.

dxDesc

Input. Handle to the previously initialized output tensor descriptor.

dx

Input/Output. Data pointer to GPU memory associated with the output tensor descriptor dxDesc that carries the result.

TABLE OF THE SUPPORTED CONFIGURATIONS

This function supports the following combinations of data types for wDesc, dyDesc, convDesc, and dxDesc.

| Data Type Configurations | wDesc's, dyDesc's and dxDesc's Data Type | convDesc's Data Type |
| --- | --- | --- |
| TRUE_HALF_CONFIG (only supported on architectures with true fp16 support, i.e., compute capability 5.3 and later) | CUDNN_DATA_HALF | CUDNN_DATA_HALF |
| PSEUDO_HALF_CONFIG | CUDNN_DATA_HALF | CUDNN_DATA_FLOAT |
| FLOAT_CONFIG | CUDNN_DATA_FLOAT | CUDNN_DATA_FLOAT |
| DOUBLE_CONFIG | CUDNN_DATA_DOUBLE | CUDNN_DATA_DOUBLE |

Note: Specifying a separate algorithm can cause changes in performance, support, and computation determinism. See the tables below for a list of algorithm options, their respective supported parameters, and their deterministic behavior.

TABLE OF THE SUPPORTED ALGORITHMS

The table below shows the list of the supported 2D and 3D convolutions. The 2D convolutions are described first, followed by the 3D convolutions.

For brevity, the tables below use the short-form versions shown in parentheses for the following terms:

• CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 (_ALGO_0)
• CUDNN_CONVOLUTION_BWD_DATA_ALGO_1 (_ALGO_1)
• CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT (_FFT)
• CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING (_FFT_TILING)
• CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD (_WINOGRAD)
• CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD_NONFUSED (_WINOGRAD_NONFUSED)
• CUDNN_TENSOR_NCHW (_NCHW)
• CUDNN_TENSOR_NHWC (_NHWC)
• CUDNN_TENSOR_NCHW_VECT_C (_NCHW_VECT_C)

FOR 2D CONVOLUTIONS.

Filter descriptor wDesc: _NHWC. See cudnnTensorFormat_t.

| Algo Name (see below for 3D Convolutions) | Deterministic (Yes or No) | Tensor Formats Supported for dyDesc | Tensor Formats Supported for dxDesc | Data Type Configurations Supported | Important |
| --- | --- | --- | --- | --- | --- |
| _ALGO_1 | | NHWC HWC-packed | NHWC HWC-packed | TRUE_HALF_CONFIG, PSEUDO_HALF_CONFIG, and FLOAT_CONFIG | |

Filter descriptor wDesc: _NCHW.

| Algo Name | Deterministic (Yes or No) | Tensor Formats Supported for dyDesc | Tensor Formats Supported for dxDesc | Data Type Configurations Supported | Important |
| --- | --- | --- | --- | --- | --- |
| _ALGO_0 | No | NCHW CHW-packed | All except _NCHW_VECT_C | PSEUDO_HALF_CONFIG, FLOAT_CONFIG, and DOUBLE_CONFIG | Dilation: greater than 0 for all dimensions. convDesc Group Count Support: Greater than 0. |
| _ALGO_1 | Yes | NCHW CHW-packed | All except _NCHW_VECT_C | TRUE_HALF_CONFIG, PSEUDO_HALF_CONFIG, FLOAT_CONFIG, and DOUBLE_CONFIG | Dilation: 1 for all dimensions. convDesc Group Count Support: Greater than 0. |
| _FFT | Yes | NCHW CHW-packed | NCHW HW-packed | PSEUDO_HALF_CONFIG and FLOAT_CONFIG | Dilation: 1 for all dimensions. convDesc Group Count Support: Greater than 0. dxDesc's feature map height + 2 * convDesc's zero-padding height must equal 256 or less. dxDesc's feature map width + 2 * convDesc's zero-padding width must equal 256 or less. convDesc's vertical and horizontal filter stride must equal 1. wDesc's filter height must be greater than convDesc's zero-padding height. wDesc's filter width must be greater than convDesc's zero-padding width. |
| _FFT_TILING | Yes | NCHW CHW-packed | NCHW HW-packed | PSEUDO_HALF_CONFIG and FLOAT_CONFIG. DOUBLE_CONFIG is also supported when the task can be handled by 1D FFT, i.e., one of the filter dimensions, width or height, is 1. | Dilation: 1 for all dimensions. convDesc Group Count Support: Greater than 0. When neither of wDesc's filter dimensions is 1, the filter width and height must not be larger than 32. When either of wDesc's filter dimensions is 1, the largest filter dimension should not exceed 256. convDesc's vertical and horizontal filter stride must equal 1 when either the filter width or filter height is 1; otherwise the stride can be 1 or 2. wDesc's filter height must be greater than convDesc's zero-padding height. wDesc's filter width must be greater than convDesc's zero-padding width. |
| _WINOGRAD | Yes | NCHW CHW-packed | All except _NCHW_VECT_C | PSEUDO_HALF_CONFIG and FLOAT_CONFIG | Dilation: 1 for all dimensions. convDesc Group Count Support: Greater than 0. convDesc's vertical and horizontal filter stride must equal 1. wDesc's filter height must be 3. wDesc's filter width must be 3. |
| _WINOGRAD_NONFUSED | Yes | NCHW CHW-packed | All except _NCHW_VECT_C | TRUE_HALF_CONFIG, PSEUDO_HALF_CONFIG, and FLOAT_CONFIG | Dilation: 1 for all dimensions. convDesc Group Count Support: Greater than 0. convDesc's vertical and horizontal filter stride must equal 1. wDesc's filter (height, width) must be (3,3) or (5,5). If wDesc's filter (height, width) is (5,5), then the data type config TRUE_HALF_CONFIG is not supported. |
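
The shape conditions listed for _FFT in the 2D table above can be checked on the host before choosing the algorithm. The following is a minimal sketch: the struct `BwdDataShape2d` and the function `bwd_data_fft_shape_ok` are hypothetical names, not cuDNN API, and the check covers only the dilation, size, stride, and padding conditions (not tensor formats, data type configurations, or group count).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical host-side description of a 2D backward-data problem. */
typedef struct {
    int dx_h, dx_w;         /* dxDesc feature map height and width */
    int filt_h, filt_w;     /* wDesc filter height and width */
    int pad_h, pad_w;       /* convDesc zero-padding height and width */
    int stride_h, stride_w; /* convDesc vertical and horizontal stride */
    int dil_h, dil_w;       /* convDesc dilation per dimension */
} BwdDataShape2d;

/* Returns true when the shape conditions of the _FFT row are all met. */
bool bwd_data_fft_shape_ok(const BwdDataShape2d *s)
{
    assert(s != NULL);
    return s->dil_h == 1 && s->dil_w == 1
        && s->dx_h + 2 * s->pad_h <= 256
        && s->dx_w + 2 * s->pad_w <= 256
        && s->stride_h == 1 && s->stride_w == 1
        && s->filt_h > s->pad_h
        && s->filt_w > s->pad_w;
}
```

A check like this only mirrors the documented table; the library itself remains the authority and reports an unsupported combination through CUDNN_STATUS_NOT_SUPPORTED.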

FOR 3D CONVOLUTIONS.

Filter descriptor wDesc: _NCHW

| Algo Name (3D Convolutions) | Deterministic (Yes or No) | Tensor Formats Supported for dyDesc | Tensor Formats Supported for dxDesc | Data Type Configurations Support | Important |
| --- | --- | --- | --- | --- | --- |
| _ALGO_0 | Yes | NCDHW CDHW-packed | All except _NCDHW_VECT_C | PSEUDO_HALF_CONFIG, FLOAT_CONFIG, and DOUBLE_CONFIG | Dilation: greater than 0 for all dimensions. convDesc Group Count Support: Greater than 0. |
| _ALGO_1 | Yes | NCDHW-fully-packed | NCDHW-fully-packed | TRUE_HALF_CONFIG, PSEUDO_HALF_CONFIG, FLOAT_CONFIG, and DOUBLE_CONFIG | Dilation: 1 for all dimensions. convDesc Group Count Support: Greater than 0. |
| _FFT_TILING | Yes | NCDHW CDHW-packed | NCDHW DHW-packed | PSEUDO_HALF_CONFIG, FLOAT_CONFIG, and DOUBLE_CONFIG | Dilation: 1 for all dimensions. convDesc Group Count Support: Greater than 0. wDesc's filter height must equal 16 or less. wDesc's filter width must equal 16 or less. wDesc's filter depth must equal 16 or less. convDesc must have all filter strides equal to 1. wDesc's filter height must be greater than convDesc's zero-padding height. wDesc's filter width must be greater than convDesc's zero-padding width. wDesc's filter depth must be greater than convDesc's zero-padding width. |

Returns

CUDNN_STATUS_SUCCESS

The operation was launched successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• At least one of the following is NULL: handle, dyDesc, wDesc, convDesc, dxDesc, dy, w, dx, alpha, beta
• wDesc and dyDesc have a non-matching number of dimensions
• wDesc and dxDesc have a non-matching number of dimensions
• wDesc has fewer than three dimensions
• wDesc, dxDesc and dyDesc have a non-matching data type
• wDesc and dxDesc have a non-matching number of input feature maps per image (or group in case of Grouped Convolutions)
• dyDesc's spatial sizes do not match the expected size as determined by cudnnGetConvolutionNdForwardOutputDim

CUDNN_STATUS_NOT_SUPPORTED

At least one of the following conditions is met:

• dyDesc or dxDesc has negative tensor striding
• dyDesc, wDesc or dxDesc has a number of dimensions that is not 4 or 5
• The chosen algo does not support the parameters provided; see above for an exhaustive list of parameter support for each algo
• dyDesc or wDesc indicates an output channel count that isn't a multiple of group count (if group count has been set in convDesc)

CUDNN_STATUS_MAPPING_ERROR

An error occurs during the texture binding of the filter data or the input differential tensor data.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 4.10. cudnnConvolutionBackwardFilter

```c
cudnnStatus_t cudnnConvolutionBackwardFilter(
    cudnnHandle_t                       handle,
    const void                         *alpha,
    const cudnnTensorDescriptor_t       xDesc,
    const void                         *x,
    const cudnnTensorDescriptor_t       dyDesc,
    const void                         *dy,
    const cudnnConvolutionDescriptor_t  convDesc,
    cudnnConvolutionBwdFilterAlgo_t     algo,
    void                               *workSpace,
    size_t                              workSpaceSizeInBytes,
    const void                         *beta,
    const cudnnFilterDescriptor_t       dwDesc,
    void                               *dw)
```

This function computes the convolution weight (filter) gradient of the tensor dy, where y is the output of the forward convolution in cudnnConvolutionForward(). It uses the specified algo, and returns the results in the output tensor dw. Scaling factors alpha and beta can be used to scale the computed result or accumulate with the current dw.
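
To make the filter gradient concrete, here is a small CPU sketch for the simplest case. It assumes a single image, a single channel, unit stride, no padding, no dilation, and cross-correlation mode; `conv2d_bwd_filter_ref` is a hypothetical helper name, not a cuDNN routine.

```c
#include <assert.h>

/* Hypothetical CPU reference, not a cuDNN routine.
 * Assumed forward pass (cross-correlation, unit stride, no padding):
 *   y[p][q] = sum over r,s of x[p+r][q+s] * w[r][s]
 * so the filter gradient is
 *   dw[r][s] = sum over p,q of dy[p][q] * x[p+r][q+s].
 * x is H x W, dy is P x Q, dw is (H-P+1) x (W-Q+1), all row-major. */
void conv2d_bwd_filter_ref(float alpha, const float *x, int H, int W,
                           const float *dy, int P, int Q,
                           float beta, float *dw)
{
    int R = H - P + 1;  /* filter height implied by input and output sizes */
    int S = W - Q + 1;  /* filter width implied by input and output sizes */
    assert(P > 0 && Q > 0 && R > 0 && S > 0);
    for (int r = 0; r < R; ++r) {
        for (int s = 0; s < S; ++s) {
            float acc = 0.0f;
            for (int p = 0; p < P; ++p)
                for (int q = 0; q < Q; ++q)
                    acc += dy[p * Q + q] * x[(p + r) * W + (q + s)];
            float *dst = &dw[r * S + s];
            /* dstValue = alpha*result + beta*priorDstValue; the prior value
             * is not read when beta is zero. */
            *dst = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * (*dst);
        }
    }
}
```

Calling the sketch repeatedly with beta = 1 accumulates gradients into dw, which is the usual reason to pass a nonzero beta.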

Parameters

handle

Input. Handle to a previously created cuDNN context.

alpha, beta

Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows: dstValue = alpha[0]*result + beta[0]*priorDstValue. Refer to this section for additional details.

xDesc

Input. Handle to a previously initialized tensor descriptor.

x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

dy

Input. Data pointer to GPU memory associated with the backpropagation gradient tensor descriptor dyDesc.

convDesc

Input. Previously initialized convolution descriptor.

algo

Input. Enumerant that specifies which convolution algorithm should be used to compute the results.

workSpace

Input. Data pointer to GPU memory to a workspace needed to be able to execute the specified algorithm. If no workspace is needed for a particular algorithm, that pointer can be NULL.

workSpaceSizeInBytes

Input. Specifies the size in bytes of the provided workSpace.

dwDesc

Input. Handle to a previously initialized filter gradient descriptor.

dw

Input/Output. Data pointer to GPU memory associated with the filter gradient descriptor dwDesc that carries the result.

TABLE OF THE SUPPORTED CONFIGURATIONS

This function supports the following combinations of data types for xDesc, dyDesc, convDesc, and dwDesc.

| Data Type Configurations | xDesc's, dyDesc's and dwDesc's Data Type | convDesc's Data Type |
| --- | --- | --- |
| TRUE_HALF_CONFIG (only supported on architectures with true fp16 support, i.e., compute capability 5.3 and later) | CUDNN_DATA_HALF | CUDNN_DATA_HALF |
| PSEUDO_HALF_CONFIG | CUDNN_DATA_HALF | CUDNN_DATA_FLOAT |
| FLOAT_CONFIG | CUDNN_DATA_FLOAT | CUDNN_DATA_FLOAT |
| DOUBLE_CONFIG | CUDNN_DATA_DOUBLE | CUDNN_DATA_DOUBLE |

Note: Specifying a separate algorithm can cause changes in performance, support, and computation determinism. See the tables below for an exhaustive list of algorithm options, their respective supported parameters, and their deterministic behavior.

TABLE OF THE SUPPORTED ALGORITHMS

The table below shows the list of the supported 2D and 3D convolutions. The 2D convolutions are described first, followed by the 3D convolutions.

For brevity, the tables below use the short-form versions shown in parentheses for the following terms:

• CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 (_ALGO_0)
• CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1 (_ALGO_1)
• CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3 (_ALGO_3)
• CUDNN_CONVOLUTION_BWD_FILTER_ALGO_FFT (_FFT)
• CUDNN_CONVOLUTION_BWD_FILTER_ALGO_FFT_TILING (_FFT_TILING)
• CUDNN_CONVOLUTION_BWD_FILTER_ALGO_WINOGRAD_NONFUSED (_WINOGRAD_NONFUSED)
• CUDNN_TENSOR_NCHW (_NCHW)
• CUDNN_TENSOR_NHWC (_NHWC)
• CUDNN_TENSOR_NCHW_VECT_C (_NCHW_VECT_C)

FOR 2D CONVOLUTIONS.

Filter descriptor dwDesc: _NHWC. See cudnnTensorFormat_t.

| Algo Name (see below for 3D Convolutions) | Deterministic (Yes or No) | Tensor Formats Supported for xDesc | Tensor Formats Supported for dyDesc | Data Type Configurations Supported | Important |
| --- | --- | --- | --- | --- | --- |
| _ALGO_0 and _ALGO_1 | | NHWC HWC-packed | NHWC HWC-packed | PSEUDO_HALF_CONFIG and FLOAT_CONFIG | |

Filter descriptor dwDesc: _NCHW.

| Algo Name | Deterministic (Yes or No) | Tensor Formats Supported for xDesc | Tensor Formats Supported for dyDesc | Data Type Configurations Supported | Important |
| --- | --- | --- | --- | --- | --- |
| _ALGO_0 | No | All except _NCHW_VECT_C | NCHW CHW-packed | PSEUDO_HALF_CONFIG, FLOAT_CONFIG, and DOUBLE_CONFIG | Dilation: greater than 0 for all dimensions. convDesc Group Count Support: Greater than 0. This algo is not supported if output is of type CUDNN_DATA_HALF and the number of elements in dw is odd. |
| _ALGO_1 | Yes | _NCHW or _NHWC | NCHW CHW-packed | TRUE_HALF_CONFIG, PSEUDO_HALF_CONFIG, FLOAT_CONFIG, and DOUBLE_CONFIG | Dilation: 1 for all dimensions. convDesc Group Count Support: Greater than 0. |
| _FFT | Yes | NCHW CHW-packed | NCHW CHW-packed | PSEUDO_HALF_CONFIG and FLOAT_CONFIG | Dilation: 1 for all dimensions. convDesc Group Count Support: Greater than 0. xDesc's feature map height + 2 * convDesc's zero-padding height must equal 256 or less. xDesc's feature map width + 2 * convDesc's zero-padding width must equal 256 or less. convDesc's vertical and horizontal filter stride must equal 1. dwDesc's filter height must be greater than convDesc's zero-padding height. dwDesc's filter width must be greater than convDesc's zero-padding width. |
| _ALGO_3 | Yes | All except _NCHW_VECT_C | NCHW CHW-packed | PSEUDO_HALF_CONFIG, FLOAT_CONFIG, and DOUBLE_CONFIG | Dilation: 1 for all dimensions. convDesc Group Count Support: Greater than 0. |
| _WINOGRAD_NONFUSED | Yes | All except _NCHW_VECT_C | NCHW CHW-packed | TRUE_HALF_CONFIG, PSEUDO_HALF_CONFIG, and FLOAT_CONFIG | Dilation: 1 for all dimensions. convDesc Group Count Support: Greater than 0. convDesc's vertical and horizontal filter stride must equal 1. dwDesc's filter (height, width) must be (3,3) or (5,5). If dwDesc's filter (height, width) is (5,5), then the data type config TRUE_HALF_CONFIG is not supported. |
| _FFT_TILING | Yes | NCHW CHW-packed | NCHW CHW-packed | PSEUDO_HALF_CONFIG, FLOAT_CONFIG, and DOUBLE_CONFIG | Dilation: 1 for all dimensions. convDesc Group Count Support: Greater than 0. xDesc's width or height must equal 1. dyDesc's width or height must equal 1 (the same dimension as in xDesc). The other dimension must be less than or equal to 256, i.e., the largest 1D tile size currently supported. convDesc's vertical and horizontal filter stride must equal 1. dwDesc's filter height must be greater than convDesc's zero-padding height. dwDesc's filter width must be greater than convDesc's zero-padding width. |

FOR 3D CONVOLUTIONS.

Filter descriptor dwDesc: _NCHW

| Algo Name | Deterministic (Yes or No) | Tensor Formats Supported for xDesc | Tensor Formats Supported for dyDesc | Data Type Configurations Support | Important |
| --- | --- | --- | --- | --- | --- |
| _ALGO_0 | No | All except _NCDHW_VECT_C | NCDHW CDHW-packed | PSEUDO_HALF_CONFIG, FLOAT_CONFIG, and DOUBLE_CONFIG | Dilation: greater than 0 for all dimensions. convDesc Group Count Support: Greater than 0. |
| _ALGO_3 | No | NCDHW-fully-packed | NCDHW-fully-packed | PSEUDO_HALF_CONFIG, FLOAT_CONFIG, and DOUBLE_CONFIG | Dilation: 1 for all dimensions. convDesc Group Count Support: Greater than 0. |

Returns

CUDNN_STATUS_SUCCESS

The operation was launched successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• At least one of the following is NULL: handle, xDesc, dyDesc, convDesc, dwDesc, x, dy, dw, alpha, beta
• xDesc and dyDesc have a non-matching number of dimensions
• xDesc and dwDesc have a non-matching number of dimensions
• xDesc has fewer than three dimensions
• xDesc, dyDesc and dwDesc have a non-matching data type
• xDesc and dwDesc have a non-matching number of input feature maps per image (or group in case of Grouped Convolutions)
• dyDesc or dwDesc indicates an output channel count that isn't a multiple of group count (if group count has been set in convDesc)

CUDNN_STATUS_NOT_SUPPORTED

At least one of the following conditions is met:

• xDesc or dyDesc has negative tensor striding
• xDesc, dyDesc or dwDesc has a number of dimensions that is not 4 or 5
• The chosen algo does not support the parameters provided; see above for an exhaustive list of parameter support for each algo

CUDNN_STATUS_MAPPING_ERROR

An error occurs during the texture binding of the filter data.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 4.11. cudnnConvolutionBiasActivationForward

```c
cudnnStatus_t cudnnConvolutionBiasActivationForward(
    cudnnHandle_t                       handle,
    const void                         *alpha1,
    const cudnnTensorDescriptor_t       xDesc,
    const void                         *x,
    const cudnnFilterDescriptor_t       wDesc,
    const void                         *w,
    const cudnnConvolutionDescriptor_t  convDesc,
    cudnnConvolutionFwdAlgo_t           algo,
    void                               *workSpace,
    size_t                              workSpaceSizeInBytes,
    const void                         *alpha2,
    const cudnnTensorDescriptor_t       zDesc,
    const void                         *z,
    const cudnnTensorDescriptor_t       biasDesc,
    const void                         *bias,
    const cudnnActivationDescriptor_t   activationDesc,
    const cudnnTensorDescriptor_t       yDesc,
    void                               *y)
```

This function applies a bias and then an activation to the convolutions or cross-correlations of cudnnConvolutionForward(), returning results in y. The full computation follows the equation y = act ( alpha1 * conv(x) + alpha2 * z + bias ).

Note: The routine cudnnGetConvolution2dForwardOutputDim or cudnnGetConvolutionNdForwardOutputDim can be used to determine the proper dimensions of the output tensor descriptor yDesc with respect to xDesc, convDesc and wDesc.
Note: Only the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algo is enabled with CUDNN_ACTIVATION_IDENTITY. In other words, if the mode of the activation descriptor activationDesc is set to CUDNN_ACTIVATION_IDENTITY, then the algo argument of cudnnConvolutionBiasActivationForward() must be set to CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM. See also the documentation for the function cudnnSetActivationDescriptor().
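
The element-wise part of the equation y = act ( alpha1 * conv(x) + alpha2 * z + bias ) can be sketched on the CPU as follows. This is only an illustration: `bias_activation_ref` is a hypothetical name, the convolution result is assumed to be already available in `conv`, a single image in NCHW order is assumed, the bias is assumed to hold one value per output channel, and ReLU is used as the activation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU sketch, not a cuDNN routine.
 * conv, z and y each hold channels * hw values for one image in NCHW order;
 * bias holds one value per output channel (assumption for this sketch). */
void bias_activation_ref(float alpha1, const float *conv,
                         float alpha2, const float *z,
                         const float *bias,
                         size_t channels, size_t hw, float *y)
{
    assert(conv != NULL && z != NULL && bias != NULL && y != NULL);
    for (size_t c = 0; c < channels; ++c) {
        for (size_t i = 0; i < hw; ++i) {
            size_t idx = c * hw + i;
            float v = alpha1 * conv[idx] + alpha2 * z[idx] + bias[c];
            y[idx] = v > 0.0f ? v : 0.0f;  /* ReLU activation */
        }
    }
}
```

Because each output element depends only on the matching elements of conv and z, the sketch still works when z and y point to the same buffer.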

Parameters

handle

Input. Handle to a previously created cuDNN context.

alpha1, alpha2

Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as described by the above equation. Please refer to this section for additional details.

xDesc

Input. Handle to a previously initialized tensor descriptor.

x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc.

wDesc

Input. Handle to a previously initialized filter descriptor.

w

Input. Data pointer to GPU memory associated with the filter descriptor wDesc.

convDesc

Input. Previously initialized convolution descriptor.

algo

Input. Enumerant that specifies which convolution algorithm should be used to compute the results.

workSpace

Input. Data pointer to GPU memory to a workspace needed to be able to execute the specified algorithm. If no workspace is needed for a particular algorithm, that pointer can be NULL.

workSpaceSizeInBytes

Input. Specifies the size in bytes of the provided workSpace.

zDesc

Input. Handle to a previously initialized tensor descriptor.

z

Input. Data pointer to GPU memory associated with the tensor descriptor zDesc.

biasDesc

Input. Handle to a previously initialized tensor descriptor.

bias

Input. Data pointer to GPU memory associated with the tensor descriptor biasDesc.

activationDesc

Input. Handle to a previously initialized activation descriptor.

yDesc

Input. Handle to a previously initialized tensor descriptor.

y

Input/Output. Data pointer to GPU memory associated with the tensor descriptor yDesc that carries the result of the convolution.

For the convolution step, this function supports the specific combinations of data types for xDesc, wDesc, convDesc and yDesc as listed in the documentation of cudnnConvolutionForward(). The following table specifies the supported combinations of data types for x, w, y, z, bias, and alpha1/alpha2.

Table Key: X = CUDNN_DATA

| x | w | y and z | bias | alpha1/alpha2 |
| --- | --- | --- | --- | --- |
| X_DOUBLE | X_DOUBLE | X_DOUBLE | X_DOUBLE | X_DOUBLE |
| X_FLOAT | X_FLOAT | X_FLOAT | X_FLOAT | X_FLOAT |
| X_HALF | X_HALF | X_HALF | X_HALF | X_FLOAT |
| X_INT8 | X_INT8 | X_INT8 | X_FLOAT | X_FLOAT |
| X_INT8 | X_INT8 | X_FLOAT | X_FLOAT | X_FLOAT |
| X_INT8x4 | X_INT8x4 | X_INT8x4 | X_FLOAT | X_FLOAT |
| X_INT8x4 | X_INT8x4 | X_FLOAT | X_FLOAT | X_FLOAT |
| X_UINT8 | X_INT8 | X_INT8 | X_FLOAT | X_FLOAT |
| X_UINT8 | X_INT8 | X_FLOAT | X_FLOAT | X_FLOAT |
| X_UINT8x4 | X_INT8x4 | X_INT8x4 | X_FLOAT | X_FLOAT |
| X_UINT8x4 | X_INT8x4 | X_FLOAT | X_FLOAT | X_FLOAT |

In addition to the error values listed by the documentation of cudnnConvolutionForward(), the possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The operation was launched successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• At least one of the following is NULL: zDesc, z, biasDesc, bias, activationDesc.
• The second dimension of biasDesc and the first dimension of wDesc are not equal.
• zDesc and yDesc do not match.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration. Some examples of unsupported configurations:

• The mode of activationDesc is neither CUDNN_ACTIVATION_RELU nor CUDNN_ACTIVATION_IDENTITY.
• The reluNanOpt of activationDesc is not CUDNN_NOT_PROPAGATE_NAN.
• The second stride of biasDesc is not equal to one.
• The data type of biasDesc does not correspond to the data type of yDesc as listed in the above data types table.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 4.12. cudnnConvolutionForward

```c
cudnnStatus_t cudnnConvolutionForward(
    cudnnHandle_t                       handle,
    const void                         *alpha,
    const cudnnTensorDescriptor_t       xDesc,
    const void                         *x,
    const cudnnFilterDescriptor_t       wDesc,
    const void                         *w,
    const cudnnConvolutionDescriptor_t  convDesc,
    cudnnConvolutionFwdAlgo_t           algo,
    void                               *workSpace,
    size_t                              workSpaceSizeInBytes,
    const void                         *beta,
    const cudnnTensorDescriptor_t       yDesc,
    void                               *y)
```

This function executes convolutions or cross-correlations over x using filters specified with w, returning results in y. Scaling factors alpha and beta can be used to scale the input tensor and the output tensor respectively.

Note: The routine cudnnGetConvolution2dForwardOutputDim or cudnnGetConvolutionNdForwardOutputDim can be used to determine the proper dimensions of the output tensor descriptor yDesc with respect to xDesc, convDesc and wDesc.
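
For planning buffer sizes on the host, the per-dimension output size can also be computed directly. The sketch below assumes the formula documented for those output-dimension routines, outputDim = 1 + (inputDim + 2*pad - (((filterDim-1)*dilation)+1))/stride with integer division; `conv_output_dim` is a hypothetical helper name, not a cuDNN function.

```c
#include <assert.h>

/* Hypothetical helper, not part of cuDNN: output size along one spatial
 * dimension for a given input size, zero-padding, filter size, stride
 * and dilation, using integer division. */
int conv_output_dim(int input, int pad, int filter, int stride, int dilation)
{
    assert(input > 0 && pad >= 0 && filter > 0 && stride > 0 && dilation > 0);
    int effective_filter = (filter - 1) * dilation + 1;  /* dilated extent */
    return 1 + (input + 2 * pad - effective_filter) / stride;
}
```

For example, a 224-wide input with a 7-wide filter, padding 3, stride 2 and dilation 1 gives an output width of 112. The library routines remain the authoritative source for the dimensions of yDesc.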

Parameters

handle

Input. Handle to a previously created cuDNN context.

alpha, beta

Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows: dstValue = alpha[0]*result + beta[0]*priorDstValue. Refer to this section for additional details.

xDesc

Input. Handle to a previously initialized tensor descriptor.

x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc.

wDesc

Input. Handle to a previously initialized filter descriptor.

w

Input. Data pointer to GPU memory associated with the filter descriptor wDesc.

convDesc

Input. Previously initialized convolution descriptor.

algo

Input. Enumerant that specifies which convolution algorithm should be used to compute the results.

workSpace

Input. Data pointer to GPU memory to a workspace needed to be able to execute the specified algorithm. If no workspace is needed for a particular algorithm, that pointer can be NULL.

workSpaceSizeInBytes

Input. Specifies the size in bytes of the provided workSpace.

yDesc

Input. Handle to a previously initialized tensor descriptor.

y

Input/Output. Data pointer to GPU memory associated with the tensor descriptor yDesc that carries the result of the convolution.

TABLE OF THE SUPPORTED CONFIGURATIONS

This function supports the following combinations of data types for xDesc, wDesc, convDesc, and yDesc.

| Data Type Configurations | xDesc and wDesc | convDesc | yDesc |
| --- | --- | --- | --- |
| TRUE_HALF_CONFIG (only supported on architectures with true fp16 support, i.e., compute capability 5.3 and later) | CUDNN_DATA_HALF | CUDNN_DATA_HALF | CUDNN_DATA_HALF |
| PSEUDO_HALF_CONFIG | CUDNN_DATA_HALF | CUDNN_DATA_FLOAT | CUDNN_DATA_HALF |
| FLOAT_CONFIG | CUDNN_DATA_FLOAT | CUDNN_DATA_FLOAT | CUDNN_DATA_FLOAT |
| DOUBLE_CONFIG | CUDNN_DATA_DOUBLE | CUDNN_DATA_DOUBLE | CUDNN_DATA_DOUBLE |
| INT8_CONFIG (only supported on architectures with DP4A support, i.e., compute capability 6.1 and later) | CUDNN_DATA_INT8 | CUDNN_DATA_INT32 | CUDNN_DATA_INT8 |
| INT8_EXT_CONFIG (only supported on architectures with DP4A support, i.e., compute capability 6.1 and later) | CUDNN_DATA_INT8 | CUDNN_DATA_INT32 | CUDNN_DATA_FLOAT |
| INT8x4_CONFIG (only supported on architectures with DP4A support, i.e., compute capability 6.1 and later) | CUDNN_DATA_INT8x4 | CUDNN_DATA_INT32 | CUDNN_DATA_INT8x4 |
| INT8x4_EXT_CONFIG (only supported on architectures with DP4A support, i.e., compute capability 6.1 and later) | CUDNN_DATA_INT8x4 | CUDNN_DATA_INT32 | CUDNN_DATA_FLOAT |
| UINT8x4_CONFIG (new for 7.1) (only supported on architectures with DP4A support, i.e., compute capability 6.1 and later) | CUDNN_DATA_UINT8x4 | CUDNN_DATA_INT32 | CUDNN_DATA_UINT8x4 |
| UINT8x4_EXT_CONFIG (new for 7.1) (only supported on architectures with DP4A support, i.e., compute capability 6.1 and later) | CUDNN_DATA_UINT8x4 | CUDNN_DATA_INT32 | CUDNN_DATA_FLOAT |

Note: For this function, all algorithms perform deterministic computations. Specifying a separate algorithm can cause changes in performance and support.

TABLE OF THE SUPPORTED ALGORITHMS

The table below shows the list of the supported 2D and 3D convolutions. The 2D convolutions are described first, followed by the 3D convolutions.

For brevity, the tables below use the short-form versions shown in parentheses for the following terms:

• CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM (_IMPLICIT_GEMM)
• CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM (_IMPLICIT_PRECOMP_GEMM)
• CUDNN_CONVOLUTION_FWD_ALGO_GEMM (_GEMM)
• CUDNN_CONVOLUTION_FWD_ALGO_DIRECT (_DIRECT)
• CUDNN_CONVOLUTION_FWD_ALGO_FFT (_FFT)
• CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING (_FFT_TILING)
• CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD (_WINOGRAD)
• CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED (_WINOGRAD_NONFUSED)
• CUDNN_TENSOR_NCHW (_NCHW)
• CUDNN_TENSOR_NHWC (_NHWC)
• CUDNN_TENSOR_NCHW_VECT_C (_NCHW_VECT_C)

FOR 2D CONVOLUTIONS.

Filter descriptor wDesc: _NCHW. See cudnnTensorFormat_t. convDesc Group count support: Greater than 0, for all algos.

| Algo Name (see below for 3D Convolutions) | Tensor Formats Supported for xDesc | Tensor Formats Supported for yDesc | Data Type Configurations Supported | Important |
| --- | --- | --- | --- | --- |
| _IMPLICIT_GEMM | All except _NCHW_VECT_C | All except _NCHW_VECT_C | PSEUDO_HALF_CONFIG, FLOAT_CONFIG, and DOUBLE_CONFIG | Dilation: Greater than 0 for all dimensions. |
| _IMPLICIT_PRECOMP_GEMM | All except _NCHW_VECT_C | All except _NCHW_VECT_C | TRUE_HALF_CONFIG, PSEUDO_HALF_CONFIG, FLOAT_CONFIG, and DOUBLE_CONFIG | Dilation: 1 for all dimensions. |
| _GEMM | All except _NCHW_VECT_C | All except _NCHW_VECT_C | PSEUDO_HALF_CONFIG, FLOAT_CONFIG, and DOUBLE_CONFIG | Dilation: 1 for all dimensions. |
| _FFT | NCHW HW-packed | NCHW HW-packed | PSEUDO_HALF_CONFIG and FLOAT_CONFIG | Dilation: 1 for all dimensions. xDesc's feature map height + 2 * convDesc's zero-padding height must equal 256 or less. xDesc's feature map width + 2 * convDesc's zero-padding width must equal 256 or less. convDesc's vertical and horizontal filter stride must equal 1. wDesc's filter height must be greater than convDesc's zero-padding height. wDesc's filter width must be greater than convDesc's zero-padding width. |
| _FFT_TILING | NCHW HW-packed | NCHW HW-packed | PSEUDO_HALF_CONFIG and FLOAT_CONFIG. DOUBLE_CONFIG is also supported when the task can be handled by 1D FFT, i.e., one of the filter dimensions, width or height, is 1. | Dilation: 1 for all dimensions. When neither of wDesc's filter dimensions is 1, the filter width and height must not be larger than 32. When either of wDesc's filter dimensions is 1, the largest filter dimension should not exceed 256. convDesc's vertical and horizontal filter stride must equal 1 when either the filter width or filter height is 1; otherwise the stride can be 1 or 2. wDesc's filter height must be greater than convDesc's zero-padding height. wDesc's filter width must be greater than convDesc's zero-padding width. |
| _WINOGRAD | All except _NCHW_VECT_C | All except _NCHW_VECT_C | PSEUDO_HALF_CONFIG and FLOAT_CONFIG | Dilation: 1 for all dimensions. convDesc's vertical and horizontal filter stride must equal 1. wDesc's filter height must be 3. wDesc's filter width must be 3. |
| _WINOGRAD_NONFUSED | All except _NCHW_VECT_C | All except _NCHW_VECT_C | TRUE_HALF_CONFIG, PSEUDO_HALF_CONFIG, and FLOAT_CONFIG | Dilation: 1 for all dimensions. convDesc's vertical and horizontal filter stride must equal 1. wDesc's filter (height, width) must be (3,3) or (5,5). If wDesc's filter (height, width) is (5,5), then data type config TRUE_HALF_CONFIG is not supported. |
| _DIRECT | Currently not implemented in cuDNN. | | | |

Filter descriptor wDesc: _NHWC. convDesc Group count support: Greater than 0.

| Algo Name | xDesc | yDesc | Data Type Configurations Support | Important |
| --- | --- | --- | --- | --- |
| _IMPLICIT_GEMM | NCHWC HWC-packed | NCHWC HWC-packed | PSEUDO_HALF_CONFIG and FLOAT_CONFIG | Dilation: Greater than 0 for all dimensions. |

Filter descriptor wDesc: _NHWC. convDesc Group count support: Greater than 0.

| Algo Name | xDesc | yDesc | Data Type Configurations Support | Important |
| --- | --- | --- | --- | --- |
| _IMPLICIT_PRECOMP_GEMM | NHWC | NHWC | INT8_CONFIG, INT8_EXT_CONFIG, INT8x4_CONFIG, INT8x4_EXT_CONFIG, UINT8x4_CONFIG, and UINT8x4_EXT_CONFIG | Dilation: 1 for all dimensions. Input and output feature maps must be a multiple of 4. |
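
The _FFT_TILING conditions in the 2D table above depend on whether the filter is effectively one-dimensional, which is easy to get wrong by hand. The following host-side sketch encodes only the dilation, filter-size, stride, and padding rules for that row; `fwd_fft_tiling_shape_ok` is a hypothetical name and not part of cuDNN.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not part of cuDNN: checks the shape rules listed
 * for the 2D forward _FFT_TILING algorithm. */
bool fwd_fft_tiling_shape_ok(int filt_h, int filt_w,
                             int pad_h, int pad_w,
                             int stride_h, int stride_w,
                             int dil_h, int dil_w)
{
    assert(filt_h > 0 && filt_w > 0 && pad_h >= 0 && pad_w >= 0);
    if (dil_h != 1 || dil_w != 1)
        return false;                       /* dilation must be 1 */
    if (filt_h <= pad_h || filt_w <= pad_w)
        return false;                       /* filter must exceed padding */
    if (filt_h == 1 || filt_w == 1) {
        /* 1D case: larger filter dimension up to 256, stride must be 1 */
        int largest = filt_h > filt_w ? filt_h : filt_w;
        return largest <= 256 && stride_h == 1 && stride_w == 1;
    }
    /* 2D case: both filter dimensions up to 32, stride 1 or 2 */
    bool stride_ok = (stride_h == 1 || stride_h == 2)
                  && (stride_w == 1 || stride_w == 2);
    return filt_h <= 32 && filt_w <= 32 && stride_ok;
}
```

Such a pre-check can avoid a round trip through the library, but an unsupported combination is still reported by cuDNN itself through CUDNN_STATUS_NOT_SUPPORTED.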

FOR 3D CONVOLUTIONS.

Filter descriptor wDesc: _NCHW. convDesc Group count support: Greater than 0, for all algos.

| Algo Name | xDesc | yDesc | Data Type Configurations Support | Important |
| --- | --- | --- | --- | --- |
| _IMPLICIT_GEMM | All except _NCHW_VECT_C | All except _NCHW_VECT_C | PSEUDO_HALF_CONFIG, FLOAT_CONFIG, and DOUBLE_CONFIG | Dilation: Greater than 0 for all dimensions. |
| _IMPLICIT_PRECOMP_GEMM | All except _NCHW_VECT_C | All except _NCHW_VECT_C | PSEUDO_HALF_CONFIG, FLOAT_CONFIG, and DOUBLE_CONFIG | Dilation: 1 for all dimensions. |
| _FFT_TILING | NCDHW DHW-packed | NCDHW DHW-packed | PSEUDO_HALF_CONFIG, FLOAT_CONFIG, and DOUBLE_CONFIG | Dilation: 1 for all dimensions. wDesc's filter height must equal 16 or less. wDesc's filter width must equal 16 or less. wDesc's filter depth must equal 16 or less. convDesc must have all filter strides equal to 1. wDesc's filter height must be greater than convDesc's zero-padding height. wDesc's filter width must be greater than convDesc's zero-padding width. wDesc's filter depth must be greater than convDesc's zero-padding width. |

Note: Tensors can be converted to, and from, CUDNN_TENSOR_NCHW_VECT_C with cudnnTransformTensor().

Returns

CUDNN_STATUS_SUCCESS

The operation was launched successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• At least one of the following is NULL: handle, xDesc, wDesc, convDesc, yDesc, x, w, y, alpha, beta
• xDesc and yDesc have a non-matching number of dimensions
• xDesc and wDesc have a non-matching number of dimensions
• xDesc has fewer than three dimensions
• xDesc's number of dimensions is not equal to convDesc's array length + 2
• xDesc and wDesc have a non-matching number of input feature maps per image (or group in case of Grouped Convolutions)
• yDesc or wDesc indicates an output channel count that isn't a multiple of group count (if group count has been set in convDesc)
• xDesc, wDesc and yDesc have a non-matching data type
• For some spatial dimension, wDesc has a spatial size that is larger than the input spatial size (including zero-padding size)

CUDNN_STATUS_NOT_SUPPORTED

At least one of the following conditions is met:

• xDesc or yDesc has negative tensor striding
• xDesc, wDesc or yDesc has a number of dimensions that is not 4 or 5
• yDesc's spatial sizes do not match the expected size as determined by cudnnGetConvolutionNdForwardOutputDim
• The chosen algo does not support the parameters provided; see above for an exhaustive list of parameter support for each algo

CUDNN_STATUS_MAPPING_ERROR

An error occurred during the texture binding of the filter data.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.
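
Two of the conditions above, that the filter must not be larger than the zero-padded input and that yDesc's spatial sizes must match the expected size, come down to the same per-dimension arithmetic. The sketch below is plain host-side C and assumes the usual output-size formula outputDim = 1 + (inputDim + 2*pad - (((filterDim-1)*dilation)+1))/stride. The helper name `conv_output_dim` is hypothetical, and applying the dilation to the filter-size check is also an assumption of this sketch.

```c
#include <assert.h>

/* Hypothetical helper, not a cuDNN function.
 * Returns the output extent along one spatial dimension, or -1 when the
 * (dilated) filter is larger than the zero-padded input. */
static int conv_output_dim(int input, int pad, int filter,
                           int dilation, int stride)
{
    assert(input > 0 && filter > 0 && dilation > 0 && stride > 0 && pad >= 0);
    int effective_filter = (filter - 1) * dilation + 1; /* extent after dilation */
    int padded_input = input + 2 * pad;                 /* padding on both sides */
    if (effective_filter > padded_input)
        return -1;
    return 1 + (padded_input - effective_filter) / stride;
}
```

Comparing the value returned for each spatial dimension with the sizes stored in yDesc gives an early host-side check before the convolution is launched.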

### 4.13. cudnnCreate

`cudnnStatus_t cudnnCreate(cudnnHandle_t *handle)`

This function initializes the cuDNN library and creates a handle to an opaque structure holding the cuDNN library context. It allocates hardware resources on the host and device and must be called prior to making any other cuDNN library calls. The cuDNN library handle is tied to the current CUDA device (context). To use the library on multiple devices, one cuDNN handle needs to be created for each device. For a given device, multiple cuDNN handles with different configurations (e.g., different current CUDA streams) may be created. Because cudnnCreate allocates some internal resources, the release of those resources by calling cudnnDestroy will implicitly call cudaDeviceSynchronize; therefore, the recommended best practice is to call cudnnCreate/cudnnDestroy outside of performance-critical code paths. For multithreaded applications that use the same device from different threads, the recommended programming model is to create one (or a few, as is convenient) cuDNN handle(s) per thread and use that cuDNN handle for the entire life of the thread.

Parameters

handle

Output. Pointer to the location in which to store the address of the allocated cuDNN handle.

Returns

CUDNN_STATUS_BAD_PARAM

Invalid (NULL) input pointer supplied.

CUDNN_STATUS_NOT_INITIALIZED

No compatible GPU found, CUDA driver not installed or disabled, CUDA runtime API initialization failed.

CUDNN_STATUS_ARCH_MISMATCH

NVIDIA GPU architecture is too old.

CUDNN_STATUS_ALLOC_FAILED

Host memory allocation failed.

CUDNN_STATUS_INTERNAL_ERROR

CUDA resource allocation failed.

CUDNN_STATUS_LICENSE_ERROR

cuDNN license validation failed (only when the feature is enabled).

CUDNN_STATUS_SUCCESS

cuDNN handle was created successfully.

### 4.14. cudnnCreateActivationDescriptor

```cudnnStatus_t cudnnCreateActivationDescriptor(
cudnnActivationDescriptor_t   *activationDesc)```

This function creates an activation descriptor object by allocating the memory needed to hold its opaque structure.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 4.15. cudnnCreateAlgorithmDescriptor

```cudnnStatus_t cudnnCreateAlgorithmDescriptor(
cudnnAlgorithmDescriptor_t *algoDesc)```

(New for 7.1)

This function creates an algorithm descriptor object by allocating the memory needed to hold its opaque structure.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 4.16. cudnnCreateAlgorithmPerformance

```cudnnStatus_t cudnnCreateAlgorithmPerformance(
cudnnAlgorithmPerformance_t *algoPerf,
int                         numberToCreate)```

(New for 7.1)

This function creates multiple algorithm performance objects by allocating the memory needed to hold their opaque structures.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 4.17. cudnnCreateCTCLossDescriptor

```cudnnStatus_t cudnnCreateCTCLossDescriptor(
cudnnCTCLossDescriptor_t* ctcLossDesc)```

This function creates a CTC loss function descriptor.

Parameters

ctcLossDesc

Output. CTC loss descriptor to be set.

Returns

CUDNN_STATUS_SUCCESS

The function returned successfully.

CUDNN_STATUS_BAD_PARAM

CTC loss descriptor passed to the function is invalid.

CUDNN_STATUS_ALLOC_FAILED

Memory allocation for this CTC loss descriptor failed.

### 4.18. cudnnCreateConvolutionDescriptor

```cudnnStatus_t cudnnCreateConvolutionDescriptor(
cudnnConvolutionDescriptor_t *convDesc)```

This function creates a convolution descriptor object by allocating the memory needed to hold its opaque structure.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 4.19. cudnnCreateDropoutDescriptor

```cudnnStatus_t cudnnCreateDropoutDescriptor(
cudnnDropoutDescriptor_t    *dropoutDesc)```

This function creates a generic dropout descriptor object by allocating the memory needed to hold its opaque structure.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 4.20. cudnnCreateFilterDescriptor

```cudnnStatus_t cudnnCreateFilterDescriptor(
cudnnFilterDescriptor_t *filterDesc)```

This function creates a filter descriptor object by allocating the memory needed to hold its opaque structure.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 4.21. cudnnCreateLRNDescriptor

```cudnnStatus_t cudnnCreateLRNDescriptor(
cudnnLRNDescriptor_t    *poolingDesc)```

This function allocates the memory needed to hold the data needed for LRN and DivisiveNormalization layers operation and returns a descriptor used with subsequent layer forward and backward calls.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 4.22. cudnnCreateOpTensorDescriptor

```cudnnStatus_t cudnnCreateOpTensorDescriptor(
cudnnOpTensorDescriptor_t* 	opTensorDesc)```

This function creates a Tensor Pointwise math descriptor.

Parameters

opTensorDesc

Output. Pointer to the structure holding the description of the Tensor Pointwise math such as Add, Multiply, and more.

Returns

CUDNN_STATUS_SUCCESS

The function returned successfully.

CUDNN_STATUS_BAD_PARAM

Tensor Pointwise math descriptor passed to the function is invalid.

CUDNN_STATUS_ALLOC_FAILED

Memory allocation for this Tensor Pointwise math descriptor failed.

### 4.23. cudnnCreatePersistentRNNPlan

```cudnnStatus_t cudnnCreatePersistentRNNPlan(
cudnnRNNDescriptor_t        rnnDesc,
const int                   minibatch,
const cudnnDataType_t       dataType,
cudnnPersistentRNNPlan_t   *plan)```

This function creates a plan to execute persistent RNNs when using the CUDNN_RNN_ALGO_PERSIST_DYNAMIC algo. This plan is tailored to the current GPU and problem hyperparameters. This function call is expected to be expensive in terms of runtime, and should be used infrequently.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING

A prerequisite runtime library cannot be found.

CUDNN_STATUS_NOT_SUPPORTED

The current hyperparameters are invalid.

### 4.24. cudnnCreatePoolingDescriptor

```cudnnStatus_t cudnnCreatePoolingDescriptor(
cudnnPoolingDescriptor_t    *poolingDesc)```

This function creates a pooling descriptor object by allocating the memory needed to hold its opaque structure.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 4.25. cudnnCreateRNNDescriptor

```cudnnStatus_t cudnnCreateRNNDescriptor(
cudnnRNNDescriptor_t    *rnnDesc)```

This function creates a generic RNN descriptor object by allocating the memory needed to hold its opaque structure.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 4.26. cudnnCreateRNNDataDescriptor

```cudnnStatus_t cudnnCreateRNNDataDescriptor(
```

This function creates an RNN data descriptor object by allocating the memory needed to hold its opaque structure.

Returns

CUDNN_STATUS_SUCCESS

The RNN data descriptor object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 4.27. cudnnCreateReduceTensorDescriptor

```cudnnStatus_t cudnnCreateReduceTensorDescriptor(
cudnnReduceTensorDescriptor_t*	reduceTensorDesc)```

This function creates a reduce tensor descriptor object by allocating the memory needed to hold its opaque structure.

Parameters

reduceTensorDesc

Output. Pointer to the reduce tensor descriptor object to be created.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_BAD_PARAM

reduceTensorDesc is a NULL pointer.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 4.28. cudnnCreateSpatialTransformerDescriptor

```cudnnStatus_t cudnnCreateSpatialTransformerDescriptor(
cudnnSpatialTransformerDescriptor_t *stDesc)```

This function creates a generic spatial transformer descriptor object by allocating the memory needed to hold its opaque structure.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 4.29. cudnnCreateTensorDescriptor

```cudnnStatus_t cudnnCreateTensorDescriptor(
cudnnTensorDescriptor_t *tensorDesc)```

This function creates a generic tensor descriptor object by allocating the memory needed to hold its opaque structure. The data is initialized to be all zero.

Parameters

tensorDesc

Input. Pointer to pointer where the address to the allocated tensor descriptor object should be stored.

Returns

CUDNN_STATUS_BAD_PARAM

Invalid input argument.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

CUDNN_STATUS_SUCCESS

The object was created successfully.

### 4.30. cudnnDeriveBNTensorDescriptor

```cudnnStatus_t cudnnDeriveBNTensorDescriptor(
cudnnTensorDescriptor_t         derivedBnDesc,
const cudnnTensorDescriptor_t   xDesc,
cudnnBatchNormMode_t            mode)```

Derives a secondary tensor descriptor for BatchNormalization scale, invVariance, bnBias, bnScale subtensors from the layer's x data descriptor. Use the tensor descriptor produced by this function as the bnScaleBiasMeanVarDesc and bnScaleBiasDiffDesc parameters in Spatial and Per-Activation Batch Normalization forward and backward functions. Resulting dimensions will be 1xC(x1)x1x1 for BATCHNORM_MODE_SPATIAL and 1xC(xD)xHxW for BATCHNORM_MODE_PER_ACTIVATION (parentheses for 5D). For HALF input data type the resulting tensor descriptor will have a FLOAT type. For other data types it will have the same type as the input data.

Note: Only 4D and 5D tensors are supported.
Note: derivedBnDesc has to be first created using cudnnCreateTensorDescriptor.
Note: xDesc is the descriptor for the layer's x data and has to be set up with proper dimensions prior to calling this function.
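
The shape and data type rules described above can be restated as a small host-side sketch. It is illustrative only: the enums and the functions `derive_bn_dims` and `derive_bn_dtype` are hypothetical stand-ins written for this example, not cuDNN types or calls, and the dimension arrays are assumed to be in NCHW or NCDHW order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the batch normalization modes and data types. */
typedef enum { BN_PER_ACTIVATION, BN_SPATIAL } bn_mode_t;
typedef enum { DT_FLOAT, DT_DOUBLE, DT_HALF } dtype_t;

/* Fills out[0..nb_dims-1] with the derived dimensions for an x tensor whose
 * dimensions are in[0..nb_dims-1]. */
static void derive_bn_dims(bn_mode_t mode, int nb_dims,
                           const int *in, int *out)
{
    assert(in != NULL && out != NULL);
    assert(nb_dims == 4 || nb_dims == 5); /* only 4D and 5D are supported */
    out[0] = 1;     /* N collapses to 1 in both modes */
    out[1] = in[1]; /* C is always kept */
    for (int i = 2; i < nb_dims; ++i)
        out[i] = (mode == BN_SPATIAL) ? 1 : in[i]; /* spatial: 1, per-activation: copy */
}

/* HALF input yields a FLOAT derived descriptor; other types are unchanged. */
static dtype_t derive_bn_dtype(dtype_t x_type)
{
    return (x_type == DT_HALF) ? DT_FLOAT : x_type;
}
```

For an 8x64x28x28 input, the sketch yields 1x64x1x1 in the spatial mode and 1x64x28x28 in the per-activation mode.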

Parameters

derivedBnDesc

Output. Handle to a previously created tensor descriptor.

xDesc

Input. Handle to a previously created and initialized layer's x data descriptor.

mode

Input. Batch normalization layer mode of operation.

Possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_BAD_PARAM

Invalid Batch Normalization mode.

### 4.31. cudnnDestroy

`cudnnStatus_t cudnnDestroy(cudnnHandle_t handle)`

This function releases resources used by the cuDNN handle. It is usually the last call made with a particular cuDNN handle. Because cudnnCreate allocates some internal resources, the release of those resources by calling cudnnDestroy will implicitly call cudaDeviceSynchronize; therefore, the recommended best practice is to call cudnnCreate/cudnnDestroy outside of performance-critical code paths.

Parameters

handle

Input. Pointer to the cuDNN handle to be destroyed.

Returns

CUDNN_STATUS_SUCCESS

The cuDNN context destruction was successful.

CUDNN_STATUS_BAD_PARAM

Invalid (NULL) pointer supplied.

### 4.32. cudnnDestroyActivationDescriptor

```cudnnStatus_t cudnnDestroyActivationDescriptor(
cudnnActivationDescriptor_t activationDesc)```

This function destroys a previously created activation descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 4.33. cudnnDestroyAlgorithmDescriptor

```cudnnStatus_t cudnnDestroyAlgorithmDescriptor(
cudnnAlgorithmDescriptor_t algorithmDesc)```

(New for 7.1)

This function destroys a previously created algorithm descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 4.34. cudnnDestroyAlgorithmPerformance

```cudnnStatus_t cudnnDestroyAlgorithmPerformance(
cudnnAlgorithmPerformance_t     algoPerf)```

(New for 7.1)

This function destroys a previously created algorithm performance object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 4.35. cudnnDestroyCTCLossDescriptor

```cudnnStatus_t cudnnDestroyCTCLossDescriptor(
cudnnCTCLossDescriptor_t 	ctcLossDesc)```

This function destroys a CTC loss function descriptor object.

Parameters

ctcLossDesc

Input. CTC loss function descriptor to be destroyed.

Returns

CUDNN_STATUS_SUCCESS

The function returned successfully.

### 4.36. cudnnDestroyConvolutionDescriptor

```cudnnStatus_t cudnnDestroyConvolutionDescriptor(
cudnnConvolutionDescriptor_t convDesc)```

This function destroys a previously created convolution descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 4.37. cudnnDestroyDropoutDescriptor

```cudnnStatus_t cudnnDestroyDropoutDescriptor(
cudnnDropoutDescriptor_t dropoutDesc)```

This function destroys a previously created dropout descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 4.38. cudnnDestroyFilterDescriptor

```cudnnStatus_t cudnnDestroyFilterDescriptor(
cudnnFilterDescriptor_t filterDesc)```

This function destroys a previously created filter descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 4.39. cudnnDestroyLRNDescriptor

```cudnnStatus_t cudnnDestroyLRNDescriptor(
cudnnLRNDescriptor_t lrnDesc)```

This function destroys a previously created LRN descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 4.40. cudnnDestroyOpTensorDescriptor

```cudnnStatus_t cudnnDestroyOpTensorDescriptor(
cudnnOpTensorDescriptor_t   opTensorDesc)```

This function deletes a Tensor Pointwise math descriptor object.

Parameters

opTensorDesc

Input. Pointer to the structure holding the description of the Tensor Pointwise math to be deleted.

Returns

CUDNN_STATUS_SUCCESS

The function returned successfully.

### 4.41. cudnnDestroyPersistentRNNPlan

```cudnnStatus_t cudnnDestroyPersistentRNNPlan(
cudnnPersistentRNNPlan_t plan)```

This function destroys a previously created persistent RNN plan object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 4.42. cudnnDestroyPoolingDescriptor

```cudnnStatus_t cudnnDestroyPoolingDescriptor(
cudnnPoolingDescriptor_t poolingDesc)```

This function destroys a previously created pooling descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 4.43. cudnnDestroyRNNDescriptor

```cudnnStatus_t cudnnDestroyRNNDescriptor(
cudnnRNNDescriptor_t rnnDesc)```

This function destroys a previously created RNN descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 4.44. cudnnDestroyRNNDataDescriptor

```cudnnStatus_t cudnnDestroyRNNDataDescriptor(
```

This function destroys a previously created RNN data descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The RNN data descriptor object was destroyed successfully.

### 4.45. cudnnDestroyReduceTensorDescriptor

```cudnnStatus_t cudnnDestroyReduceTensorDescriptor(
cudnnReduceTensorDescriptor_t   tensorDesc)```

This function destroys a previously created reduce tensor descriptor object. When the input pointer is NULL, this function performs no destroy operation.

Parameters

tensorDesc

Input. Pointer to the reduce tensor descriptor object to be destroyed.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 4.46. cudnnDestroySpatialTransformerDescriptor

```cudnnStatus_t cudnnDestroySpatialTransformerDescriptor(
cudnnSpatialTransformerDescriptor_t stDesc)```

This function destroys a previously created spatial transformer descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 4.47. cudnnDestroyTensorDescriptor

`cudnnStatus_t cudnnDestroyTensorDescriptor(cudnnTensorDescriptor_t tensorDesc)`

This function destroys a previously created tensor descriptor object. When the input pointer is NULL, this function performs no destroy operation.

Parameters

tensorDesc

Input. Pointer to the tensor descriptor object to be destroyed.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 4.48. cudnnDivisiveNormalizationBackward

``` cudnnStatus_t cudnnDivisiveNormalizationBackward(
cudnnHandle_t                    handle,
cudnnLRNDescriptor_t             normDesc,
cudnnDivNormMode_t               mode,
const void                      *alpha,
const cudnnTensorDescriptor_t    xDesc,
const void                      *x,
const void                      *means,
const void                      *dy,
void                            *temp,
void                            *temp2,
const void                      *beta,
const cudnnTensorDescriptor_t    dxDesc,
void                            *dx,
void                            *dMeans)```

This function performs the backward DivisiveNormalization layer computation.

Note: Supported tensor formats are NCHW for 4D and NCDHW for 5D with any non-overlapping non-negative strides. Only 4D and 5D tensors are supported.

Parameters

handle

Input. Handle to a previously created cuDNN library descriptor.

normDesc

Input. Handle to a previously initialized LRN parameter descriptor (this descriptor is used for both LRN and DivisiveNormalization layers).

mode

Input. DivisiveNormalization layer mode of operation. Currently only CUDNN_DIVNORM_PRECOMPUTED_MEANS is implemented. Normalization is performed using the means input tensor that is expected to be precomputed by the user.

alpha, beta

Input. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows: dstValue = alpha[0]*resultValue + beta[0]*priorDstValue. Please refer to this section for additional details.

xDesc, x, means

Input. Tensor descriptor and pointers in device memory for the layer's x and means data. Note: the means tensor is expected to be precomputed by the user. It can also contain any valid values (not required to be actual means, and can be for instance a result of a convolution with a Gaussian kernel).

dy

Input. Tensor pointer in device memory for the layer's dy cumulative loss differential data (error backpropagation).

temp, temp2

Workspace. Temporary tensors in device memory. These are used for computing intermediate values during the backward pass. These tensors do not have to be preserved from forward to backward pass. Both use xDesc as a descriptor.

dxDesc

Input. Tensor descriptor for dx and dMeans.

dx, dMeans

Output. Tensor pointers (in device memory) for the layer's resulting cumulative gradients dx and dMeans (dLoss/dx and dLoss/dMeans). Both share the same descriptor.
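
The blending rule given for alpha and beta above, dstValue = alpha[0]*resultValue + beta[0]*priorDstValue, can be illustrated with a short host-side sketch. The function `blend_into` is a hypothetical name and operates on plain host arrays rather than device tensors. Skipping the read of the prior value when beta is zero is an assumption of this sketch, made so that uninitialized destination data cannot affect the result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side helper, not a cuDNN function.
 * Applies dst[i] = alpha * result[i] + beta * dst[i] for n elements. */
static void blend_into(float *dst, const float *result, size_t n,
                       float alpha, float beta)
{
    assert(dst != NULL && result != NULL);
    for (size_t i = 0; i < n; ++i) {
        /* Assumption: with beta == 0 the prior value is ignored entirely,
         * so stale data (including NaN) in dst cannot leak through. */
        float prior = (beta == 0.0f) ? 0.0f : beta * dst[i];
        dst[i] = alpha * result[i] + prior;
    }
}
```

With alpha = 1 and beta = 0 the destination is simply overwritten; with alpha = 1 and beta = 1 the new gradients are accumulated onto the values already stored in the destination.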

Possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• One of the tensor pointers x, dx, temp, temp2, dy is NULL.
• Number of any of the input or output tensor dimensions is not within the [4,5] range.
• Either alpha or beta pointer is NULL.
• A mismatch in dimensions between xDesc and dxDesc.
• LRN descriptor parameters are outside of their valid ranges.
• Any of the tensor strides is negative.
CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration. See the following for some examples of non-supported configurations:

• Any of the input and output tensor strides mismatch (for the same dimension).

### 4.49. cudnnDivisiveNormalizationForward

```cudnnStatus_t cudnnDivisiveNormalizationForward(
cudnnHandle_t                    handle,
cudnnLRNDescriptor_t             normDesc,
cudnnDivNormMode_t               mode,
const void                      *alpha,
const cudnnTensorDescriptor_t    xDesc,
const void                      *x,
const void                      *means,
void                            *temp,
void                            *temp2,
const void                      *beta,
const cudnnTensorDescriptor_t    yDesc,
void                            *y)```

This function performs the forward spatial DivisiveNormalization layer computation. It divides every value in a layer by the standard deviation of its spatial neighbors as described in "What is the Best Multi-Stage Architecture for Object Recognition", Jarrett 2009, Local Contrast Normalization Layer section. Note that Divisive Normalization only implements the x/max(c, sigma_x) portion of the computation, where sigma_x is the variance over the spatial neighborhood of x. The full LCN (Local Contrastive Normalization) computation can be implemented as a two-step process:

x_m = x-mean(x);

y = x_m/max(c, sigma(x_m));

The "x-mean(x)" portion of the computation, often referred to as "subtractive normalization", can be implemented using a cuDNN average pooling layer followed by a call to addTensor.

Note: Supported tensor formats are NCHW for 4D and NCDHW for 5D with any non-overlapping non-negative strides. Only 4D and 5D tensors are supported.
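
The two-step process described above can be sketched on the host for a 1D signal. This is a simplified illustration, not the library's implementation: the function names are hypothetical, the neighborhood is a uniform window clamped at the borders, and the neighborhood statistic sigma is passed in by the caller, which is a simplification of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Step 1 (subtractive): x_m[i] = x[i] - mean of x over the window
 * [i - radius, i + radius], clamped at the borders. */
static void subtract_local_mean(const float *x, size_t n, size_t radius,
                                float *x_m)
{
    assert(x != NULL && x_m != NULL && x != x_m);
    for (size_t i = 0; i < n; ++i) {
        size_t lo = (i >= radius) ? i - radius : 0;
        size_t hi = (i + radius < n) ? i + radius : n - 1;
        float sum = 0.0f;
        for (size_t j = lo; j <= hi; ++j)
            sum += x[j];
        x_m[i] = x[i] - sum / (float)(hi - lo + 1);
    }
}

/* Step 2 (divisive): y[i] = x_m[i] / max(c, sigma[i]). */
static void divide_by_clamped_sigma(const float *x_m, const float *sigma,
                                    size_t n, float c, float *y)
{
    assert(x_m != NULL && sigma != NULL && y != NULL);
    assert(c > 0.0f); /* the floor c keeps the denominator away from zero */
    for (size_t i = 0; i < n; ++i) {
        float denom = (sigma[i] > c) ? sigma[i] : c;
        y[i] = x_m[i] / denom;
    }
}
```

The first function plays the role of the average pooling plus addTensor step, and the second corresponds to the x/max(c, sigma_x) portion that the text attributes to Divisive Normalization.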

Parameters

handle

Input. Handle to a previously created cuDNN library descriptor.

normDesc

Input. Handle to a previously initialized LRN parameter descriptor. This descriptor is used for both LRN and DivisiveNormalization layers.

mode

Input. DivisiveNormalization layer mode of operation. Currently only CUDNN_DIVNORM_PRECOMPUTED_MEANS is implemented. Normalization is performed using the means input tensor that is expected to be precomputed by the user.

alpha, beta

Input. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows: dstValue = alpha[0]*resultValue + beta[0]*priorDstValue. Please refer to this section for additional details.

xDesc, yDesc

Input. Tensor descriptor objects for the input and output tensors. Note that xDesc is shared between x, means, temp and temp2 tensors.

x

Input. Input tensor data pointer in device memory.

means

Input. Input means tensor data pointer in device memory. Note that this tensor can be NULL (in that case its values are assumed to be zero during the computation). This tensor also does not have to contain means; it can hold any values. A frequently used variation is the result of a convolution with a normalized positive kernel (such as a Gaussian).

temp, temp2

Workspace. Temporary tensors in device memory. These are used for computing intermediate values during the forward pass. These tensors do not have to be preserved as inputs from forward to the backward pass. Both use xDesc as their descriptor.

y

Output. Pointer in device memory to a tensor for the result of the forward DivisiveNormalization computation.

Possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• One of the tensor pointers x, y, temp, temp2 is NULL.
• Number of input tensor or output tensor dimensions is outside of [4,5] range.
• A mismatch in dimensions between any two of the input or output tensors.
• For in-place computation when pointers x == y, a mismatch in strides between the input data and output data tensors.
• Alpha or beta pointer is NULL.
• LRN descriptor parameters are outside of their valid ranges.
• Any of the tensor strides are negative.
CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration. See the following for some examples of non-supported configurations:

• Any of the input and output tensor strides mismatch (for the same dimension).

### 4.50. cudnnDropoutBackward

```cudnnStatus_t cudnnDropoutBackward(
cudnnHandle_t                   handle,
const cudnnDropoutDescriptor_t  dropoutDesc,
const cudnnTensorDescriptor_t   dydesc,
const void                     *dy,
const cudnnTensorDescriptor_t   dxdesc,
void                           *dx,
void                           *reserveSpace,
size_t                          reserveSpaceSizeInBytes)```

This function performs the backward dropout operation over dy, returning the results in dx. If a value from x was propagated to y during the forward dropout operation, then the corresponding value from dy is propagated to dx during the backward operation; otherwise, the dx value is set to 0.

Note: Better performance is obtained for fully packed tensors
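
The propagation rule described above can be sketched as a CPU reference. This is illustrative only: `dropout_backward_ref` is a hypothetical name, the byte-per-element `mask` array stands in for the contents of reserveSpace (whose real layout is not specified here), and scaling the propagated values by 1/(1-dropout) is an assumption chosen so that dx matches the gradient of a forward pass that scales the kept values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference, not a cuDNN function.
 * mask[i] != 0 means element i was kept during the forward pass. */
static void dropout_backward_ref(const float *dy, float *dx,
                                 const unsigned char *mask, size_t n,
                                 float dropout)
{
    assert(dy != NULL && dx != NULL && mask != NULL);
    assert(dropout >= 0.0f && dropout < 1.0f);
    const float scale = 1.0f / (1.0f - dropout); /* assumed backward scaling */
    for (size_t i = 0; i < n; ++i)
        dx[i] = mask[i] ? dy[i] * scale : 0.0f;  /* dropped elements get 0 */
}
```

The sketch shows why the reserve space must stay unchanged between the forward and backward calls: the backward pass has no other record of which elements were kept.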

Parameters

handle

Input. Handle to a previously created cuDNN context.

dropoutDesc

Input. Previously created dropout descriptor object.

dyDesc

Input. Handle to a previously initialized tensor descriptor.

dy

Input. Pointer to data of the tensor described by the dyDesc descriptor.

dxDesc

Input. Handle to a previously initialized tensor descriptor.

dx

Output. Pointer to data of the tensor described by the dxDesc descriptor.

reserveSpace

Input. Pointer to user-allocated GPU memory used by this function. It is expected that reserveSpace was populated during a call to cudnnDropoutForward and has not been changed.

reserveSpaceSizeInBytes

Input. Specifies size in bytes of the provided memory for the reserve space

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The call was successful.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• The number of elements of input tensor and output tensors differ.
• The datatype of the input tensor and output tensors differs.
• The strides of the input tensor and output tensors differ and in-place operation is used (i.e., x and y pointers are equal).
• The provided reserveSpaceSizeInBytes is less than the value returned by cudnnDropoutGetReserveSpaceSize.
• cudnnSetDropoutDescriptor has not been called on dropoutDesc with the non-NULL states argument.
CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 4.51. cudnnDropoutForward

```cudnnStatus_t cudnnDropoutForward(
cudnnHandle_t                       handle,
const cudnnDropoutDescriptor_t      dropoutDesc,
const cudnnTensorDescriptor_t       xdesc,
const void                         *x,
const cudnnTensorDescriptor_t       ydesc,
void                               *y,
void                               *reserveSpace,
size_t                              reserveSpaceSizeInBytes)```

This function performs the forward dropout operation over x, returning the results in y. If dropout was used as a parameter to cudnnSetDropoutDescriptor, approximately a dropout fraction of the x values will be replaced by 0, and the rest will be scaled by 1/(1-dropout). This function should not be run concurrently with another cudnnDropoutForward call that uses the same states.

Note: Better performance is obtained for fully packed tensors
Note: Should not be called during inference
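
The forward rule described above (drop a value with probability dropout, scale the rest by 1/(1-dropout)) can be sketched as a CPU reference. Everything here is illustrative: `dropout_forward_ref` and `next_uniform` are hypothetical names, the small linear congruential generator only stands in for the library's random number generator states, and the byte-per-element `mask` stands in for the reserve space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tiny LCG used as a stand-in generator; returns a value in [0, 1). */
static float next_uniform(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return (float)(*state >> 8) / 16777216.0f; /* top 24 bits / 2^24 */
}

/* Hypothetical CPU reference, not a cuDNN function.
 * Writes y and records in mask which elements were kept. */
static void dropout_forward_ref(const float *x, float *y,
                                unsigned char *mask, size_t n,
                                float dropout, uint32_t *state)
{
    assert(x != NULL && y != NULL && mask != NULL && state != NULL);
    assert(dropout >= 0.0f && dropout < 1.0f);
    const float scale = 1.0f / (1.0f - dropout);
    for (size_t i = 0; i < n; ++i) {
        mask[i] = (unsigned char)(next_uniform(state) >= dropout); /* 1 = kept */
        y[i] = mask[i] ? x[i] * scale : 0.0f;
    }
}
```

Because the generator state is advanced on every call, two calls sharing one state would interfere with each other, which is consistent with the restriction on concurrent use of the same states.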

Parameters

handle

Input. Handle to a previously created cuDNN context.

dropoutDesc

Input. Previously created dropout descriptor object.

xDesc

Input. Handle to a previously initialized tensor descriptor.

x

Input. Pointer to data of the tensor described by the xDesc descriptor.

yDesc

Input. Handle to a previously initialized tensor descriptor.

y

Output. Pointer to data of the tensor described by the yDesc descriptor.

reserveSpace

Output. Pointer to user-allocated GPU memory used by this function. It is expected that the contents of reserveSpace do not change between cudnnDropoutForward and cudnnDropoutBackward calls.

reserveSpaceSizeInBytes

Input. Specifies size in bytes of the provided memory for the reserve space.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The call was successful.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• The number of elements of input tensor and output tensors differ.
• The datatype of the input tensor and output tensors differs.
• The strides of the input tensor and output tensors differ and in-place operation is used (i.e., x and y pointers are equal).
• The provided reserveSpaceSizeInBytes is less than the value returned by cudnnDropoutGetReserveSpaceSize.
• cudnnSetDropoutDescriptor has not been called on dropoutDesc with the non-NULL states argument.
CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 4.52. cudnnDropoutGetReserveSpaceSize

```cudnnStatus_t cudnnDropoutGetReserveSpaceSize(
cudnnTensorDescriptor_t     xDesc,
size_t                     *sizeInBytes)```

This function is used to query the amount of reserve space needed to run dropout with the input dimensions given by xDesc. The same reserve space is expected to be passed to cudnnDropoutForward and cudnnDropoutBackward, and its contents are expected to remain unchanged between the cudnnDropoutForward and cudnnDropoutBackward calls.

Parameters

xDesc

Input. Handle to a previously initialized tensor descriptor, describing input to a dropout operation.

sizeInBytes

Output. Amount of GPU memory needed as reserve space to be able to run dropout with an input tensor descriptor specified by xDesc.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

### 4.53. cudnnDropoutGetStatesSize

```cudnnStatus_t cudnnDropoutGetStatesSize(
cudnnHandle_t       handle,
size_t             *sizeInBytes)```

This function is used to query the amount of space required to store the states of the random number generators used by cudnnDropoutForward function.

Parameters

handle

Input. Handle to a previously created cuDNN context.

sizeInBytes

Output. Amount of GPU memory needed to store random generator states.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

### 4.54. cudnnFindConvolutionBackwardDataAlgorithm

```cudnnStatus_t cudnnFindConvolutionBackwardDataAlgorithm(
cudnnHandle_t                          handle,
const cudnnFilterDescriptor_t          wDesc,
const cudnnTensorDescriptor_t          dyDesc,
const cudnnConvolutionDescriptor_t     convDesc,
const cudnnTensorDescriptor_t          dxDesc,
const int                              requestedAlgoCount,
int                                   *returnedAlgoCount,
cudnnConvolutionBwdDataAlgoPerf_t     *perfResults)```

This function attempts all cuDNN algorithms (including CUDNN_TENSOR_OP_MATH and CUDNN_DEFAULT_MATH versions of algorithms where CUDNN_TENSOR_OP_MATH may be available) for cudnnConvolutionBackwardData(), using memory allocated via cudaMalloc(), and outputs performance metrics to a user-allocated array of cudnnConvolutionBwdDataAlgoPerf_t. These metrics are written in sorted fashion where the first element has the lowest compute time. The total number of resulting algorithms can be queried through the API cudnnGetConvolutionBackwardDataAlgorithmMaxCount().

Note: This function is host blocking.
Note: It is recommended to run this function prior to allocating layer data; doing otherwise may needlessly inhibit some algorithm options due to resource usage.
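
Because the results are sorted by ascending compute time, a caller that also has a workspace budget can take the first entry that succeeded and fits. The sketch below shows that selection on the host. The struct `perf_entry_t` is a simplified, hypothetical stand-in for one perfResults entry (the real cudnnConvolutionBwdDataAlgoPerf_t is defined in cudnn.h), and `pick_fastest_within_budget` is likewise a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for one entry of perfResults. */
typedef struct {
    int    algo;    /* algorithm identifier */
    int    ok;      /* nonzero when the algorithm ran successfully */
    float  time_ms; /* measured compute time */
    size_t memory;  /* workspace size in bytes */
} perf_entry_t;

/* Returns the index of the first entry that succeeded and needs no more
 * than workspace_limit bytes, or -1 if there is none. With entries sorted
 * by ascending time, the first match is also the fastest match. */
static int pick_fastest_within_budget(const perf_entry_t *results, int count,
                                      size_t workspace_limit)
{
    assert(results != NULL && count >= 0);
    for (int i = 0; i < count; ++i) {
        if (results[i].ok && results[i].memory <= workspace_limit)
            return i;
    }
    return -1;
}
```

A linear scan is enough here because the sort order already encodes the preference; no second pass over the timings is needed.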

Parameters

handle

Input. Handle to a previously created cuDNN context.

wDesc

Input. Handle to a previously initialized filter descriptor.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

convDesc

Input. Previously initialized convolution descriptor.

dxDesc

Input. Handle to the previously initialized output tensor descriptor.

requestedAlgoCount

Input. The maximum number of elements to be stored in perfResults.

returnedAlgoCount

Output. The number of output elements stored in perfResults.

perfResults

Output. A user-allocated array to store performance metrics sorted ascending by compute time.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• handle is not allocated properly.
• wDesc, dyDesc or dxDesc is not allocated properly.
• wDesc, dyDesc or dxDesc has fewer than 1 dimension.
• Either returnedAlgoCount or perfResults is NULL.
• requestedAlgoCount is less than 1.

CUDNN_STATUS_ALLOC_FAILED

This function was unable to allocate memory to store sample input, filters and output.

CUDNN_STATUS_INTERNAL_ERROR

At least one of the following conditions is met:

• The function was unable to allocate necessary timing objects.
• The function was unable to deallocate necessary timing objects.
• The function was unable to deallocate sample input, filters and output.
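Because perfResults is sorted ascending by compute time, a caller can take the first entry that is usable under its own constraints. The following is a minimal sketch of that selection, not cuDNN code: PerfEntry is a hypothetical stand-in for the status, time and memory fields of cudnnConvolutionBwdDataAlgoPerf_t, the value 0 plays the role of CUDNN_STATUS_SUCCESS, and pick_fastest and workspaceLimit are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields of cudnnConvolutionBwdDataAlgoPerf_t
 * used in this sketch; real code would read them from the cuDNN struct. */
typedef struct {
    int    algo;    /* algorithm identifier */
    int    status;  /* 0 plays the role of CUDNN_STATUS_SUCCESS */
    float  time;    /* measured compute time */
    size_t memory;  /* workspace bytes the algorithm needs */
} PerfEntry;

/* Returns the index of the first usable entry, or -1 if none qualifies.
 * Relies on the entries already being sorted ascending by compute time,
 * so the first match is also the fastest match. */
static int pick_fastest(const PerfEntry *results, int returnedAlgoCount,
                        size_t workspaceLimit)
{
    assert(results != NULL || returnedAlgoCount == 0);
    for (int i = 0; i < returnedAlgoCount; ++i) {
        if (results[i].status == 0 && results[i].memory <= workspaceLimit)
            return i;
    }
    return -1;
}
```

Only the first returnedAlgoCount entries are inspected, since the remainder of the user-allocated array is not written by the find call.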

### 4.55. cudnnFindConvolutionBackwardDataAlgorithmEx

```cudnnStatus_t cudnnFindConvolutionBackwardDataAlgorithmEx(
cudnnHandle_t                          handle,
const cudnnFilterDescriptor_t          wDesc,
const void                            *w,
const cudnnTensorDescriptor_t          dyDesc,
const void                            *dy,
const cudnnConvolutionDescriptor_t     convDesc,
const cudnnTensorDescriptor_t          dxDesc,
void                                  *dx,
const int                              requestedAlgoCount,
int                                   *returnedAlgoCount,
cudnnConvolutionBwdDataAlgoPerf_t     *perfResults,
void                                  *workSpace,
size_t                                 workSpaceSizeInBytes)```

This function attempts all cuDNN algorithms (including CUDNN_TENSOR_OP_MATH and CUDNN_DEFAULT_MATH versions of algorithms where CUDNN_TENSOR_OP_MATH may be available) for cudnnConvolutionBackwardData(), using user-allocated GPU memory, and outputs performance metrics to a user-allocated array of cudnnConvolutionBwdDataAlgoPerf_t. These metrics are written in sorted fashion where the first element has the lowest compute time. The total number of resulting algorithms can be queried through the API cudnnGetConvolutionBackwardDataAlgorithmMaxCount().

Note: This function is host blocking.

Parameters

handle

Input. Handle to a previously created cuDNN context.

wDesc

Input. Handle to a previously initialized filter descriptor.

w

Input. Data pointer to GPU memory associated with the filter descriptor wDesc.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

dy

Input. Data pointer to GPU memory associated with the tensor descriptor dyDesc.

convDesc

Input. Previously initialized convolution descriptor.

dxDesc

Input. Handle to the previously initialized output tensor descriptor.

dx

Input/Output. Data pointer to GPU memory associated with the tensor descriptor dxDesc. The content of this tensor will be overwritten with arbitrary values.

requestedAlgoCount

Input. The maximum number of elements to be stored in perfResults.

returnedAlgoCount

Output. The number of output elements stored in perfResults.

perfResults

Output. A user-allocated array to store performance metrics sorted ascending by compute time.

workSpace

Input. Data pointer to GPU memory that is a necessary workspace for some algorithms. The size of this workspace will determine the availability of algorithms. A NULL pointer is considered a workSpace of 0 bytes.

workSpaceSizeInBytes

Input. Specifies the size in bytes of the provided workSpace.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• handle is not allocated properly.
• wDesc, dyDesc or dxDesc is not allocated properly.
• wDesc, dyDesc or dxDesc has fewer than 1 dimension.
• w, dy or dx is NULL.
• Either returnedAlgoCount or perfResults is NULL.
• requestedAlgoCount is less than 1.

CUDNN_STATUS_INTERNAL_ERROR

At least one of the following conditions is met:

• The function was unable to allocate necessary timing objects.
• The function was unable to deallocate necessary timing objects.
• The function was unable to deallocate sample input, filters and output.

### 4.56. cudnnFindConvolutionBackwardFilterAlgorithm

```cudnnStatus_t cudnnFindConvolutionBackwardFilterAlgorithm(
cudnnHandle_t                          handle,
const cudnnTensorDescriptor_t          xDesc,
const cudnnTensorDescriptor_t          dyDesc,
const cudnnConvolutionDescriptor_t     convDesc,
const cudnnFilterDescriptor_t          dwDesc,
const int                              requestedAlgoCount,
int                                   *returnedAlgoCount,
cudnnConvolutionBwdFilterAlgoPerf_t   *perfResults)```

This function attempts all cuDNN algorithms (including CUDNN_TENSOR_OP_MATH and CUDNN_DEFAULT_MATH versions of algorithms where CUDNN_TENSOR_OP_MATH may be available) for cudnnConvolutionBackwardFilter(), using GPU memory allocated via cudaMalloc(), and outputs performance metrics to a user-allocated array of cudnnConvolutionBwdFilterAlgoPerf_t. These metrics are written in sorted fashion where the first element has the lowest compute time. The total number of resulting algorithms can be queried through the API cudnnGetConvolutionBackwardFilterAlgorithmMaxCount().

Note: This function is host blocking.
Note: It is recommended to run this function prior to allocating layer data; doing otherwise may needlessly inhibit some algorithm options due to resource usage.

Parameters

handle

Input. Handle to a previously created cuDNN context.

xDesc

Input. Handle to the previously initialized input tensor descriptor.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

convDesc

Input. Previously initialized convolution descriptor.

dwDesc

Input. Handle to a previously initialized filter descriptor.

requestedAlgoCount

Input. The maximum number of elements to be stored in perfResults.

returnedAlgoCount

Output. The number of output elements stored in perfResults.

perfResults

Output. A user-allocated array to store performance metrics sorted ascending by compute time.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• handle is not allocated properly.
• xDesc, dyDesc or dwDesc is not allocated properly.
• xDesc, dyDesc or dwDesc has fewer than 1 dimension.
• Either returnedAlgoCount or perfResults is NULL.
• requestedAlgoCount is less than 1.

CUDNN_STATUS_ALLOC_FAILED

This function was unable to allocate memory to store sample input, filters and output.

CUDNN_STATUS_INTERNAL_ERROR

At least one of the following conditions is met:

• The function was unable to allocate necessary timing objects.
• The function was unable to deallocate necessary timing objects.
• The function was unable to deallocate sample input, filters and output.

### 4.57. cudnnFindConvolutionBackwardFilterAlgorithmEx

```cudnnStatus_t cudnnFindConvolutionBackwardFilterAlgorithmEx(
cudnnHandle_t                          handle,
const cudnnTensorDescriptor_t          xDesc,
const void                            *x,
const cudnnTensorDescriptor_t          dyDesc,
const void                            *dy,
const cudnnConvolutionDescriptor_t     convDesc,
const cudnnFilterDescriptor_t          dwDesc,
void                                  *dw,
const int                              requestedAlgoCount,
int                                   *returnedAlgoCount,
cudnnConvolutionBwdFilterAlgoPerf_t   *perfResults,
void                                  *workSpace,
size_t                                 workSpaceSizeInBytes)```

This function attempts all cuDNN algorithms (including CUDNN_TENSOR_OP_MATH and CUDNN_DEFAULT_MATH versions of algorithms where CUDNN_TENSOR_OP_MATH may be available) for cudnnConvolutionBackwardFilter(), using user-allocated GPU memory, and outputs performance metrics to a user-allocated array of cudnnConvolutionBwdFilterAlgoPerf_t. These metrics are written in sorted fashion where the first element has the lowest compute time. The total number of resulting algorithms can be queried through the API cudnnGetConvolutionBackwardFilterAlgorithmMaxCount().

Note: This function is host blocking.

Parameters

handle

Input. Handle to a previously created cuDNN context.

xDesc

Input. Handle to the previously initialized input tensor descriptor.

x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

dy

Input. Data pointer to GPU memory associated with the tensor descriptor dyDesc.

convDesc

Input. Previously initialized convolution descriptor.

dwDesc

Input. Handle to a previously initialized filter descriptor.

dw

Input/Output. Data pointer to GPU memory associated with the filter descriptor dwDesc. The content of this tensor will be overwritten with arbitrary values.

requestedAlgoCount

Input. The maximum number of elements to be stored in perfResults.

returnedAlgoCount

Output. The number of output elements stored in perfResults.

perfResults

Output. A user-allocated array to store performance metrics sorted ascending by compute time.

workSpace

Input. Data pointer to GPU memory that is a necessary workspace for some algorithms. The size of this workspace will determine the availability of algorithms. A NULL pointer is considered a workSpace of 0 bytes.

workSpaceSizeInBytes

Input. Specifies the size in bytes of the provided workSpace.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• handle is not allocated properly.
• xDesc, dyDesc or dwDesc is not allocated properly.
• xDesc, dyDesc or dwDesc has fewer than 1 dimension.
• x, dy or dw is NULL.
• Either returnedAlgoCount or perfResults is NULL.
• requestedAlgoCount is less than 1.

CUDNN_STATUS_INTERNAL_ERROR

At least one of the following conditions is met:

• The function was unable to allocate necessary timing objects.
• The function was unable to deallocate necessary timing objects.
• The function was unable to deallocate sample input, filters and output.

### 4.58. cudnnFindConvolutionForwardAlgorithm

```cudnnStatus_t cudnnFindConvolutionForwardAlgorithm(
cudnnHandle_t                      handle,
const cudnnTensorDescriptor_t      xDesc,
const cudnnFilterDescriptor_t      wDesc,
const cudnnConvolutionDescriptor_t convDesc,
const cudnnTensorDescriptor_t      yDesc,
const int                          requestedAlgoCount,
int                               *returnedAlgoCount,
cudnnConvolutionFwdAlgoPerf_t     *perfResults)```

This function attempts all cuDNN algorithms (including CUDNN_TENSOR_OP_MATH and CUDNN_DEFAULT_MATH versions of algorithms where CUDNN_TENSOR_OP_MATH may be available) for cudnnConvolutionForward(), using memory allocated via cudaMalloc(), and outputs performance metrics to a user-allocated array of cudnnConvolutionFwdAlgoPerf_t. These metrics are written in sorted fashion where the first element has the lowest compute time. The total number of resulting algorithms can be queried through the API cudnnGetConvolutionForwardAlgorithmMaxCount().

Note: This function is host blocking.
Note: It is recommended to run this function prior to allocating layer data; doing otherwise may needlessly inhibit some algorithm options due to resource usage.

Parameters

handle

Input. Handle to a previously created cuDNN context.

xDesc

Input. Handle to the previously initialized input tensor descriptor.

wDesc

Input. Handle to a previously initialized filter descriptor.

convDesc

Input. Previously initialized convolution descriptor.

yDesc

Input. Handle to the previously initialized output tensor descriptor.

requestedAlgoCount

Input. The maximum number of elements to be stored in perfResults.

returnedAlgoCount

Output. The number of output elements stored in perfResults.

perfResults

Output. A user-allocated array to store performance metrics sorted ascending by compute time.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• handle is not allocated properly.
• xDesc, wDesc or yDesc is not allocated properly.
• xDesc, wDesc or yDesc has fewer than 1 dimension.
• Either returnedAlgoCount or perfResults is NULL.
• requestedAlgoCount is less than 1.

CUDNN_STATUS_ALLOC_FAILED

This function was unable to allocate memory to store sample input, filters and output.

CUDNN_STATUS_INTERNAL_ERROR

At least one of the following conditions is met:

• The function was unable to allocate necessary timing objects.
• The function was unable to deallocate necessary timing objects.
• The function was unable to deallocate sample input, filters and output.

### 4.59. cudnnFindConvolutionForwardAlgorithmEx

```cudnnStatus_t cudnnFindConvolutionForwardAlgorithmEx(
cudnnHandle_t                      handle,
const cudnnTensorDescriptor_t      xDesc,
const void                        *x,
const cudnnFilterDescriptor_t      wDesc,
const void                        *w,
const cudnnConvolutionDescriptor_t convDesc,
const cudnnTensorDescriptor_t      yDesc,
void                              *y,
const int                          requestedAlgoCount,
int                               *returnedAlgoCount,
cudnnConvolutionFwdAlgoPerf_t     *perfResults,
void                              *workSpace,
size_t                             workSpaceSizeInBytes)```

This function attempts all available cuDNN algorithms (including CUDNN_TENSOR_OP_MATH and CUDNN_DEFAULT_MATH versions of algorithms where CUDNN_TENSOR_OP_MATH may be available) for cudnnConvolutionForward(), using user-allocated GPU memory, and outputs performance metrics to a user-allocated array of cudnnConvolutionFwdAlgoPerf_t. These metrics are written in sorted fashion where the first element has the lowest compute time. The total number of resulting algorithms can be queried through the API cudnnGetConvolutionForwardAlgorithmMaxCount().

Note: This function is host blocking.

Parameters

handle

Input. Handle to a previously created cuDNN context.

xDesc

Input. Handle to the previously initialized input tensor descriptor.

x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc.

wDesc

Input. Handle to a previously initialized filter descriptor.

w

Input. Data pointer to GPU memory associated with the filter descriptor wDesc.

convDesc

Input. Previously initialized convolution descriptor.

yDesc

Input. Handle to the previously initialized output tensor descriptor.

y

Input/Output. Data pointer to GPU memory associated with the tensor descriptor yDesc. The content of this tensor will be overwritten with arbitrary values.

requestedAlgoCount

Input. The maximum number of elements to be stored in perfResults.

returnedAlgoCount

Output. The number of output elements stored in perfResults.

perfResults

Output. A user-allocated array to store performance metrics sorted ascending by compute time.

workSpace

Input. Data pointer to GPU memory that is a necessary workspace for some algorithms. The size of this workspace will determine the availability of algorithms. A NULL pointer is considered a workSpace of 0 bytes.

workSpaceSizeInBytes

Input. Specifies the size in bytes of the provided workSpace.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• handle is not allocated properly.
• xDesc, wDesc or yDesc is not allocated properly.
• xDesc, wDesc or yDesc has fewer than 1 dimension.
• x, w or y is NULL.
• Either returnedAlgoCount or perfResults is NULL.
• requestedAlgoCount is less than 1.

CUDNN_STATUS_INTERNAL_ERROR

At least one of the following conditions is met:

• The function was unable to allocate necessary timing objects.
• The function was unable to deallocate necessary timing objects.
• The function was unable to deallocate sample input, filters and output.

### 4.60. cudnnFindRNNBackwardDataAlgorithmEx

```cudnnStatus_t cudnnFindRNNBackwardDataAlgorithmEx(
cudnnHandle_t                    handle,
const cudnnRNNDescriptor_t       rnnDesc,
const int                        seqLength,
const cudnnTensorDescriptor_t    *yDesc,
const void                       *y,
const cudnnTensorDescriptor_t    *dyDesc,
const void                       *dy,
const cudnnTensorDescriptor_t    dhyDesc,
const void                       *dhy,
const cudnnTensorDescriptor_t    dcyDesc,
const void                       *dcy,
const cudnnFilterDescriptor_t    wDesc,
const void                       *w,
const cudnnTensorDescriptor_t    hxDesc,
const void                       *hx,
const cudnnTensorDescriptor_t    cxDesc,
const void                       *cx,
const cudnnTensorDescriptor_t    *dxDesc,
void                             *dx,
const cudnnTensorDescriptor_t    dhxDesc,
void                             *dhx,
const cudnnTensorDescriptor_t    dcxDesc,
void                             *dcx,
const float                      findIntensity,
const int                        requestedAlgoCount,
int                              *returnedAlgoCount,
cudnnAlgorithmPerformance_t      *perfResults,
void                             *workspace,
size_t                           workSpaceSizeInBytes,
const void                       *reserveSpace,
size_t                           reserveSpaceSizeInBytes)```

(New for 7.1)

This function attempts all available cuDNN algorithms for cudnnRNNBackwardData, using user-allocated GPU memory. It outputs the parameters that influence the performance of the algorithm to a user-allocated array of cudnnAlgorithmPerformance_t. These parameter metrics are written in sorted fashion where the first element has the lowest compute time.

Parameters

handle

Input. Handle to a previously created cuDNN context.

rnnDesc

Input. A previously initialized RNN descriptor.

seqLength

Input. Number of iterations to unroll over. The value of seqLength must not exceed the value that was passed to cudnnGetRNNWorkspaceSize() when querying the workspace size required to execute the RNN.

yDesc

Input. An array of fully packed tensor descriptors describing the output from each recurrent iteration (one descriptor per iteration). The second dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:

• If direction is CUDNN_UNIDIRECTIONAL the second dimension should match the hiddenSize argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the second dimension should match double the hiddenSize argument passed to cudnnSetRNNDescriptor.

The first dimension of the tensor n must match the first dimension of the tensor n in dyDesc.

y

Input. Data pointer to GPU memory associated with the output tensor descriptor yDesc.

dyDesc

Input. An array of fully packed tensor descriptors describing the gradient at the output from each recurrent iteration (one descriptor per iteration). The second dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:

• If direction is CUDNN_UNIDIRECTIONAL the second dimension should match the hiddenSize argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the second dimension should match double the hiddenSize argument passed to cudnnSetRNNDescriptor.

The first dimension of the tensor n must match the first dimension of the tensor n in dxDesc.

dy

Input. Data pointer to GPU memory associated with the tensor descriptors in the array dyDesc.

dhyDesc

Input. A fully packed tensor descriptor describing the gradients at the final hidden state of the RNN. The first dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:

• If direction is CUDNN_UNIDIRECTIONAL the first dimension should match the numLayers argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the first dimension should match double the numLayers argument passed to cudnnSetRNNDescriptor.

The second dimension must match the first dimension of the tensors described in dxDesc. The third dimension must match the hiddenSize argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc. The tensor must be fully packed.

dhy

Input. Data pointer to GPU memory associated with the tensor descriptor dhyDesc. If a NULL pointer is passed, the gradients at the final hidden state of the network will be initialized to zero.

dcyDesc

Input. A fully packed tensor descriptor describing the gradients at the final cell state of the RNN. The first dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:

• If direction is CUDNN_UNIDIRECTIONAL the first dimension should match the numLayers argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the first dimension should match double the numLayers argument passed to cudnnSetRNNDescriptor.

The second dimension must match the first dimension of the tensors described in dxDesc. The third dimension must match the hiddenSize argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc. The tensor must be fully packed.

dcy

Input. Data pointer to GPU memory associated with the tensor descriptor dcyDesc. If a NULL pointer is passed, the gradients at the final cell state of the network will be initialized to zero.

wDesc

Input. Handle to a previously initialized filter descriptor describing the weights for the RNN.

w

Input. Data pointer to GPU memory associated with the filter descriptor wDesc.

hxDesc

Input. A fully packed tensor descriptor describing the initial hidden state of the RNN. The first dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:

• If direction is CUDNN_UNIDIRECTIONAL the first dimension should match the numLayers argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the first dimension should match double the numLayers argument passed to cudnnSetRNNDescriptor.

The second dimension must match the first dimension of the tensors described in dxDesc. The third dimension must match the hiddenSize argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc. The tensor must be fully packed.

hx

Input. Data pointer to GPU memory associated with the tensor descriptor hxDesc. If a NULL pointer is passed, the initial hidden state of the network will be initialized to zero.
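The same shape rule is stated for hxDesc, cxDesc, dhyDesc, dcyDesc, dhxDesc and dcxDesc. As a rough sketch of that rule in plain C (rnn_state_dims is an illustrative helper, not a cuDNN function, and batchSize is assumed to be the first dimension of the tensors described in dxDesc):

```c
#include <assert.h>
#include <stdbool.h>

/* Fills dims[3] with the shape expected for a hidden-state or cell-state
 * tensor: {numLayers (doubled when bidirectional), batchSize, hiddenSize}.
 * Illustrative helper only; it does not call into cuDNN. */
static void rnn_state_dims(int numLayers, bool bidirectional,
                           int batchSize, int hiddenSize, int dims[3])
{
    assert(numLayers > 0 && batchSize > 0 && hiddenSize > 0);
    dims[0] = bidirectional ? 2 * numLayers : numLayers;
    dims[1] = batchSize;   /* first dimension of the tensors in dxDesc */
    dims[2] = hiddenSize;  /* hiddenSize given to cudnnSetRNNDescriptor */
}
```

The resulting three values are what a caller would compare against the dimensions of the state descriptors before invoking the find function.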

cxDesc

Input. A fully packed tensor descriptor describing the initial cell state for LSTM networks. The first dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:

• If direction is CUDNN_UNIDIRECTIONAL the first dimension should match the numLayers argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the first dimension should match double the numLayers argument passed to cudnnSetRNNDescriptor.

The second dimension must match the first dimension of the tensors described in dxDesc. The third dimension must match the hiddenSize argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc. The tensor must be fully packed.

cx

Input. Data pointer to GPU memory associated with the tensor descriptor cxDesc. If a NULL pointer is passed, the initial cell state of the network will be initialized to zero.

dxDesc

Input. An array of fully packed tensor descriptors describing the gradient at the input of each recurrent iteration (one descriptor per iteration). The first dimension (batch size) of the tensors may decrease from element n to element n+1 but may not increase. Each tensor descriptor must have the same second dimension (vector length).

dx

Output. Data pointer to GPU memory associated with the tensor descriptors in the array dxDesc.

dhxDesc

Input. A fully packed tensor descriptor describing the gradient at the initial hidden state of the RNN. The first dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:

• If direction is CUDNN_UNIDIRECTIONAL the first dimension should match the numLayers argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the first dimension should match double the numLayers argument passed to cudnnSetRNNDescriptor.

The second dimension must match the first dimension of the tensors described in dxDesc. The third dimension must match the hiddenSize argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc. The tensor must be fully packed.

dhx

Output. Data pointer to GPU memory associated with the tensor descriptor dhxDesc. If a NULL pointer is passed, the gradient at the hidden input of the network will not be set.

dcxDesc

Input. A fully packed tensor descriptor describing the gradient at the initial cell state of the RNN. The first dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:

• If direction is CUDNN_UNIDIRECTIONAL the first dimension should match the numLayers argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the first dimension should match double the numLayers argument passed to cudnnSetRNNDescriptor.

The second dimension must match the first dimension of the tensors described in dxDesc. The third dimension must match the hiddenSize argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc. The tensor must be fully packed.

dcx

Output. Data pointer to GPU memory associated with the tensor descriptor dcxDesc. If a NULL pointer is passed, the gradient at the cell input of the network will not be set.

findIntensity

Input. This input was unused in versions prior to 7.2.0. It is used in cuDNN 7.2.0 and later versions to control the overall runtime of the RNN find algorithms, by selecting the percentage of a large Cartesian product space to be searched.

• Setting findIntensity within the range (0, 1.0] will set a percentage of the entire RNN search space to search. When findIntensity is set to 1.0, a full search is performed over all RNN parameters.
• When findIntensity is set to 0.0f, a quick, minimal search is performed. This setting has the best runtime. However, in this case the parameters returned by this function may not correspond to the best performance of the algorithm; a longer search might discover better parameters. This option will execute up to three instances of the configured RNN problem. Runtime will vary proportionally to RNN problem size, as it will in the other cases, hence no guarantee of an explicit time bound can be given.
• Setting findIntensity within the range [-1.0, 0) sets a percentage of a reduced Cartesian product space to be searched. This reduced search space has been heuristically selected to have good performance. The setting of -1.0 represents a full search over this reduced search space.
• Values outside the range [-1, 1] are truncated to the range [-1, 1], and then interpreted as per the above.
• Setting findIntensity to 1.0 in cuDNN 7.2 and later versions is equivalent to the behavior of this function in versions prior to cuDNN 7.2.0.
• This function times the single RNN executions over large parameter spaces, one execution per parameter combination. The times returned by this function are latencies.
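The findIntensity rules above can be summarized as a small host-side classification. This is only a reading of the documented ranges, not the library's internal logic: SearchMode, SearchPlan and interpret_find_intensity are hypothetical names, and treating the magnitude of a negative value as the searched fraction of the reduced space is an assumption drawn from the statement that -1.0 is a full search of that space.

```c
#include <assert.h>

/* Hypothetical labels for the three documented regimes. */
typedef enum {
    SEARCH_MINIMAL,       /* findIntensity == 0.0  */
    SEARCH_FULL_SPACE,    /* findIntensity in (0, 1]  */
    SEARCH_REDUCED_SPACE  /* findIntensity in [-1, 0) */
} SearchMode;

typedef struct {
    SearchMode mode;
    float      fraction;  /* share of the chosen space to search, 0..1 */
} SearchPlan;

static SearchPlan interpret_find_intensity(float findIntensity)
{
    SearchPlan plan;

    /* Values outside [-1, 1] are truncated to that range first. */
    if (findIntensity > 1.0f)  findIntensity = 1.0f;
    if (findIntensity < -1.0f) findIntensity = -1.0f;

    if (findIntensity > 0.0f) {
        plan.mode = SEARCH_FULL_SPACE;
        plan.fraction = findIntensity;
    } else if (findIntensity < 0.0f) {
        plan.mode = SEARCH_REDUCED_SPACE;
        plan.fraction = -findIntensity;  /* assumption: magnitude is the share */
    } else {
        plan.mode = SEARCH_MINIMAL;
        plan.fraction = 0.0f;
    }
    assert(plan.fraction >= 0.0f && plan.fraction <= 1.0f);
    return plan;
}
```

Under this reading, 0.5f searches half of the full space, -0.25f searches a quarter of the reduced space, and 3.0f behaves like 1.0f.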
requestedAlgoCount

Input. The maximum number of elements to be stored in perfResults.

returnedAlgoCount

Output. The number of output elements stored in perfResults.

perfResults

Output. A user-allocated array to store performance metrics sorted ascending by compute time.

workspace

Input. Data pointer to GPU memory to be used as a workspace for this call.

workSpaceSizeInBytes

Input. Specifies the size in bytes of the provided workspace.

reserveSpace

Input/Output. Data pointer to GPU memory to be used as a reserve space for this call.

reserveSpaceSizeInBytes

Input. Specifies the size in bytes of the provided reserveSpace.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• The descriptor rnnDesc is invalid.
• At least one of the descriptors dhxDesc, wDesc, hxDesc, cxDesc, dcxDesc, dhyDesc, dcyDesc or one of the descriptors in yDesc, dxDesc, dyDesc is invalid.
• The descriptors in one of yDesc, dxDesc, dyDesc, dhxDesc, wDesc, hxDesc, cxDesc, dcxDesc, dhyDesc, dcyDesc have incorrect strides or dimensions.
• workSpaceSizeInBytes is too small.
• reserveSpaceSizeInBytes is too small.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

CUDNN_STATUS_ALLOC_FAILED

The function was unable to allocate memory.

### 4.61. cudnnFindRNNBackwardWeightsAlgorithmEx

```cudnnStatus_t cudnnFindRNNBackwardWeightsAlgorithmEx(
cudnnHandle_t                    handle,
const cudnnRNNDescriptor_t       rnnDesc,
const int                        seqLength,
const cudnnTensorDescriptor_t    *xDesc,
const void                       *x,
const cudnnTensorDescriptor_t    hxDesc,
const void                       *hx,
const cudnnTensorDescriptor_t    *yDesc,
const void                       *y,
const float                      findIntensity,
const int                        requestedAlgoCount,
int                              *returnedAlgoCount,
cudnnAlgorithmPerformance_t      *perfResults,
const void                       *workspace,
size_t                           workSpaceSizeInBytes,
const cudnnFilterDescriptor_t    dwDesc,
void                             *dw,
const void                       *reserveSpace,
size_t                           reserveSpaceSizeInBytes)```

(New for 7.1)

This function attempts all available cuDNN algorithms for cudnnRNNBackwardWeights, using user-allocated GPU memory. It outputs the parameters that influence the performance of the algorithm to a user-allocated array of cudnnAlgorithmPerformance_t. These parameter metrics are written in sorted fashion where the first element has the lowest compute time.

Parameters

handle

Input. Handle to a previously created cuDNN context.

rnnDesc

Input. A previously initialized RNN descriptor.

seqLength

Input. Number of iterations to unroll over. The value of seqLength must not exceed the value that was passed to cudnnGetRNNWorkspaceSize() when querying the workspace size required to execute the RNN.

xDesc

Input. An array of fully packed tensor descriptors describing the input to each recurrent iteration (one descriptor per iteration). The first dimension (batch size) of the tensors may decrease from element n to element n+1 but may not increase. Each tensor descriptor must have the same second dimension (vector length).

x

Input. Data pointer to GPU memory associated with the tensor descriptors in the array xDesc.

hxDesc

Input. A fully packed tensor descriptor describing the initial hidden state of the RNN. The first dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:

• If direction is CUDNN_UNIDIRECTIONAL the first dimension should match the numLayers argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the first dimension should match double the numLayers argument passed to cudnnSetRNNDescriptor.

The second dimension must match the first dimension of the tensors described in xDesc. The third dimension must match the hiddenSize argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc. The tensor must be fully packed.

hx

Input. Data pointer to GPU memory associated with the tensor descriptor hxDesc. If a NULL pointer is passed, the initial hidden state of the network will be initialized to zero.

yDesc
Input. An array of fully packed tensor descriptors describing the output from each recurrent iteration (one descriptor per iteration). The second dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:
• If direction is CUDNN_UNIDIRECTIONAL the second dimension should match the hiddenSize argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the second dimension should match double the hiddenSize argument passed to cudnnSetRNNDescriptor.
The first dimension of the tensor n must match the first dimension of the tensor n in xDesc.
y

Input. Data pointer to GPU memory associated with the output tensor descriptor yDesc.

findIntensity

Input. This input was unused in versions prior to cuDNN 7.2.0. In cuDNN 7.2.0 and later versions, it controls the overall runtime of the RNN find algorithms by selecting the percentage of a large Cartesian product space to be searched.

• Setting findIntensity within the range (0, 1.0] selects the percentage of the entire RNN search space to search. When findIntensity is set to 1.0, a full search is performed over all RNN parameters.
• When findIntensity is set to 0.0f, a quick, minimal search is performed. This setting has the best runtime. However, in this case the parameters returned by this function will not correspond to the best performance of the algorithm; a longer search might discover better parameters. This option will execute up to three instances of the configured RNN problem. Runtime will vary proportionally to RNN problem size, as it will in the other cases, hence no guarantee of an explicit time bound can be given.
• Setting findIntensity within the range [-1.0, 0) selects the percentage of a reduced Cartesian product space to be searched. This reduced search space has been heuristically selected to have good performance. A setting of -1.0 represents a full search over this reduced search space.
• Values outside the range [-1, 1] are clamped to the range [-1, 1], and then interpreted as per the above.
• Setting findIntensity to 1.0 in cuDNN 7.2 and later versions is equivalent to the behavior of this function in versions prior to cuDNN 7.2.0.
• This function times single RNN executions over large parameter spaces, one execution per parameter combination. The times returned by this function are latencies.

requestedAlgoCount

Input. The maximum number of elements to be stored in perfResults.

returnedAlgoCount

Output. The number of output elements stored in perfResults.

perfResults

Output. A user-allocated array to store performance metrics sorted ascending by compute time.

workspace

Input. Data pointer to GPU memory to be used as a workspace for this call.

workSpaceSizeInBytes

Input. Specifies the size in bytes of the provided workspace.

dwDesc

Input. Handle to a previously initialized filter descriptor describing the gradients of the weights for the RNN.

dw

Input/Output. Data pointer to GPU memory associated with the filter descriptor dwDesc.

reserveSpace

Input. Data pointer to GPU memory to be used as a reserve space for this call.

reserveSpaceSizeInBytes

Input. Specifies the size in bytes of the provided reserveSpace.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• The descriptor rnnDesc is invalid.
• At least one of the descriptors hxDesc, dwDesc or one of the descriptors in xDesc, yDesc is invalid.
• The descriptors in one of xDesc, hxDesc, yDesc, dwDesc have incorrect strides or dimensions.
• workSpaceSizeInBytes is too small.
• reserveSpaceSizeInBytes is too small.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

CUDNN_STATUS_ALLOC_FAILED

The function was unable to allocate memory.
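
The findIntensity ranges described in the parameter list above can be sketched as a small classifier. This is an illustrative sketch in plain C, not cuDNN code: the names search_mode_t, clamp_find_intensity and classify_find_intensity are hypothetical, and the reported fraction assumes that the magnitude of findIntensity is the share of the selected search space.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the three search regimes; not part of the cuDNN API. */
typedef enum {
    SEARCH_FULL_SPACE,    /* findIntensity in (0, 1]  */
    SEARCH_MINIMAL,       /* findIntensity == 0       */
    SEARCH_REDUCED_SPACE  /* findIntensity in [-1, 0) */
} search_mode_t;

/* Clamp findIntensity to [-1, 1], as the parameter description states. */
float clamp_find_intensity(float v)
{
    if (v > 1.0f)  return 1.0f;
    if (v < -1.0f) return -1.0f;
    return v;
}

/* Classify a (possibly out-of-range) findIntensity value and report the
   assumed fraction of the selected space that would be searched. */
search_mode_t classify_find_intensity(float v, float *fraction)
{
    assert(fraction != NULL);
    float c = clamp_find_intensity(v);
    if (c > 0.0f) { *fraction = c;  return SEARCH_FULL_SPACE; }
    if (c < 0.0f) { *fraction = -c; return SEARCH_REDUCED_SPACE; }
    *fraction = 0.0f;
    return SEARCH_MINIMAL;
}
```

For instance, under these assumptions a value of 2.5 is clamped to 1.0 and treated as a full search, while -0.25 selects a quarter of the reduced search space.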

### 4.62. cudnnFindRNNForwardInferenceAlgorithmEx

```cudnnStatus_t cudnnFindRNNForwardInferenceAlgorithmEx(
cudnnHandle_t                   handle,
const cudnnRNNDescriptor_t      rnnDesc,
const int                       seqLength,
const cudnnTensorDescriptor_t  *xDesc,
const void                     *x,
const cudnnTensorDescriptor_t   hxDesc,
const void                     *hx,
const cudnnTensorDescriptor_t   cxDesc,
const void                     *cx,
const cudnnFilterDescriptor_t   wDesc,
const void                     *w,
const cudnnTensorDescriptor_t   *yDesc,
void                           *y,
const cudnnTensorDescriptor_t   hyDesc,
void                           *hy,
const cudnnTensorDescriptor_t   cyDesc,
void                           *cy,
const float                    findIntensity,
const int                      requestedAlgoCount,
int                            *returnedAlgoCount,
cudnnAlgorithmPerformance_t    *perfResults,
void                           *workspace,
size_t                          workSpaceSizeInBytes)```

(New for 7.1)

This function attempts all available cuDNN algorithms for cudnnRNNForwardInference, using user-allocated GPU memory. It outputs the parameters that influence the performance of the algorithm to a user-allocated array of cudnnAlgorithmPerformance_t. These parameter metrics are written in sorted fashion where the first element has the lowest compute time.

Parameters

handle

Input. Handle to a previously created cuDNN context.

rnnDesc

Input. A previously initialized RNN descriptor.

seqLength

Input. Number of iterations to unroll over. The value of this seqLength must not exceed the value that was used in cudnnGetRNNWorkspaceSize() function for querying the workspace size required to execute the RNN.

xDesc

Input. An array of fully packed tensor descriptors describing the input to each recurrent iteration (one descriptor per iteration). The first dimension (batch size) of the tensors may decrease from element n to element n+1 but may not increase. Each tensor descriptor must have the same second dimension (vector length).

x

Input. Data pointer to GPU memory associated with the tensor descriptors in the array xDesc. The data are expected to be packed contiguously with the first element of iteration n+1 following directly from the last element of iteration n.

hxDesc
Input. A fully packed tensor descriptor describing the initial hidden state of the RNN. The first dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:
• If direction is CUDNN_UNIDIRECTIONAL the first dimension should match the numLayers argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the first dimension should match double the numLayers argument passed to cudnnSetRNNDescriptor.
The second dimension must match the first dimension of the tensors described in xDesc. The third dimension must match the hiddenSize argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc. The tensor must be fully packed.
hx

Input. Data pointer to GPU memory associated with the tensor descriptor hxDesc. If a NULL pointer is passed, the initial hidden state of the network will be initialized to zero.

cxDesc
Input. A fully packed tensor descriptor describing the initial cell state for LSTM networks. The first dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:
• If direction is CUDNN_UNIDIRECTIONAL the first dimension should match the numLayers argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the first dimension should match double the numLayers argument passed to cudnnSetRNNDescriptor.
The second dimension must match the first dimension of the tensors described in xDesc. The third dimension must match the hiddenSize argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc. The tensor must be fully packed.
cx

Input. Data pointer to GPU memory associated with the tensor descriptor cxDesc. If a NULL pointer is passed, the initial cell state of the network will be initialized to zero.

wDesc

Input. Handle to a previously initialized filter descriptor describing the weights for the RNN.

w

Input. Data pointer to GPU memory associated with the filter descriptor wDesc.

yDesc
Input. An array of fully packed tensor descriptors describing the output from each recurrent iteration (one descriptor per iteration). The second dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:
• If direction is CUDNN_UNIDIRECTIONAL the second dimension should match the hiddenSize argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the second dimension should match double the hiddenSize argument passed to cudnnSetRNNDescriptor.
The first dimension of the tensor n must match the first dimension of the tensor n in xDesc.
y

Output. Data pointer to GPU memory associated with the output tensor descriptor yDesc. The data are expected to be packed contiguously with the first element of iteration n+1 following directly from the last element of iteration n.

hyDesc
Input. A fully packed tensor descriptor describing the final hidden state of the RNN. The first dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:
• If direction is CUDNN_UNIDIRECTIONAL the first dimension should match the numLayers argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the first dimension should match double the numLayers argument passed to cudnnSetRNNDescriptor.
The second dimension must match the first dimension of the tensors described in xDesc. The third dimension must match the hiddenSize argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc. The tensor must be fully packed.
hy

Output. Data pointer to GPU memory associated with the tensor descriptor hyDesc. If a NULL pointer is passed, the final hidden state of the network will not be saved.

cyDesc
Input. A fully packed tensor descriptor describing the final cell state for LSTM networks. The first dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:
• If direction is CUDNN_UNIDIRECTIONAL the first dimension should match the numLayers argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the first dimension should match double the numLayers argument passed to cudnnSetRNNDescriptor.
The second dimension must match the first dimension of the tensors described in xDesc. The third dimension must match the hiddenSize argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc. The tensor must be fully packed.
cy

Output. Data pointer to GPU memory associated with the tensor descriptor cyDesc. If a NULL pointer is passed, the final cell state of the network will not be saved.

findIntensity

Input. This input was unused in versions prior to cuDNN 7.2.0. In cuDNN 7.2.0 and later versions, it controls the overall runtime of the RNN find algorithms by selecting the percentage of a large Cartesian product space to be searched.

• Setting findIntensity within the range (0, 1.0] selects the percentage of the entire RNN search space to search. When findIntensity is set to 1.0, a full search is performed over all RNN parameters.
• When findIntensity is set to 0.0f, a quick, minimal search is performed. This setting has the best runtime. However, in this case the parameters returned by this function will not correspond to the best performance of the algorithm; a longer search might discover better parameters. This option will execute up to three instances of the configured RNN problem. Runtime will vary proportionally to RNN problem size, as it will in the other cases, hence no guarantee of an explicit time bound can be given.
• Setting findIntensity within the range [-1.0, 0) selects the percentage of a reduced Cartesian product space to be searched. This reduced search space has been heuristically selected to have good performance. A setting of -1.0 represents a full search over this reduced search space.
• Values outside the range [-1, 1] are clamped to the range [-1, 1], and then interpreted as per the above.
• Setting findIntensity to 1.0 in cuDNN 7.2 and later versions is equivalent to the behavior of this function in versions prior to cuDNN 7.2.0.
• This function times single RNN executions over large parameter spaces, one execution per parameter combination. The times returned by this function are latencies.

requestedAlgoCount

Input. The maximum number of elements to be stored in perfResults.

returnedAlgoCount

Output. The number of output elements stored in perfResults.

perfResults

Output. A user-allocated array to store performance metrics sorted ascending by compute time.

workspace

Input. Data pointer to GPU memory to be used as a workspace for this call.

workSpaceSizeInBytes

Input. Specifies the size in bytes of the provided workspace.

Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• The descriptor rnnDesc is invalid.
• At least one of the descriptors hxDesc, cxDesc, wDesc, hyDesc, cyDesc or one of the descriptors in xDesc, yDesc is invalid.
• The descriptors in one of xDesc, hxDesc, cxDesc, wDesc, yDesc, hyDesc, cyDesc have incorrect strides or dimensions.
• workSpaceSizeInBytes is too small.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

CUDNN_STATUS_ALLOC_FAILED

The function was unable to allocate memory.
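
The dimension rules given above for hxDesc, cxDesc, hyDesc, cyDesc and yDesc can be summarized in a short sketch. The helper names rnn_state_dims and rnn_output_width are hypothetical, and the boolean flag stands in for the CUDNN_UNIDIRECTIONAL / CUDNN_BIDIRECTIONAL direction argument; no cuDNN calls are made.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: expected dimensions of the hidden-state and
   cell-state tensors (hxDesc, cxDesc, hyDesc, cyDesc). */
void rnn_state_dims(int numLayers, int batchSize, int hiddenSize,
                    bool bidirectional, int dims[3])
{
    assert(dims != NULL);
    dims[0] = bidirectional ? 2 * numLayers : numLayers; /* layers, doubled if bidirectional */
    dims[1] = batchSize;   /* first dimension of the tensors described in xDesc */
    dims[2] = hiddenSize;  /* hiddenSize passed to cudnnSetRNNDescriptor */
}

/* Hypothetical helper: expected second dimension of each tensor in yDesc. */
int rnn_output_width(int hiddenSize, bool bidirectional)
{
    return bidirectional ? 2 * hiddenSize : hiddenSize;
}
```

With numLayers = 2, a batch size of 32 and hiddenSize = 128, a bidirectional RNN would therefore expect state tensors of dimensions 4 x 32 x 128 and an output width of 256.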

### 4.63. cudnnFindRNNForwardTrainingAlgorithmEx

```cudnnStatus_t cudnnFindRNNForwardTrainingAlgorithmEx(
cudnnHandle_t                   handle,
const cudnnRNNDescriptor_t      rnnDesc,
const int                       seqLength,
const cudnnTensorDescriptor_t  *xDesc,
const void                     *x,
const cudnnTensorDescriptor_t   hxDesc,
const void                     *hx,
const cudnnTensorDescriptor_t   cxDesc,
const void                     *cx,
const cudnnFilterDescriptor_t   wDesc,
const void                     *w,
const cudnnTensorDescriptor_t  *yDesc,
void                           *y,
const cudnnTensorDescriptor_t   hyDesc,
void                           *hy,
const cudnnTensorDescriptor_t   cyDesc,
void                           *cy,
const float                    findIntensity,
const int                      requestedAlgoCount,
int                            *returnedAlgoCount,
cudnnAlgorithmPerformance_t    *perfResults,
void                           *workspace,
size_t                          workSpaceSizeInBytes,
void                           *reserveSpace,
size_t                          reserveSpaceSizeInBytes)```

(New for 7.1)

This function attempts all available cuDNN algorithms for cudnnRNNForwardTraining, using user-allocated GPU memory. It outputs the parameters that influence the performance of the algorithm to a user-allocated array of cudnnAlgorithmPerformance_t. These parameter metrics are written in sorted fashion where the first element has the lowest compute time.

Parameters

handle

Input. Handle to a previously created cuDNN context.

rnnDesc

Input. A previously initialized RNN descriptor.

xDesc

Input. An array of fully packed tensor descriptors describing the input to each recurrent iteration (one descriptor per iteration). The first dimension (batch size) of the tensors may decrease from element n to element n+1 but may not increase. Each tensor descriptor must have the same second dimension (vector length).

seqLength

Input. Number of iterations to unroll over. The value of this seqLength must not exceed the value that was used in cudnnGetRNNWorkspaceSize() function for querying the workspace size required to execute the RNN.

x

Input. Data pointer to GPU memory associated with the tensor descriptors in the array xDesc.

hxDesc
Input. A fully packed tensor descriptor describing the initial hidden state of the RNN. The first dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:
• If direction is CUDNN_UNIDIRECTIONAL the first dimension should match the numLayers argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the first dimension should match double the numLayers argument passed to cudnnSetRNNDescriptor.
The second dimension must match the first dimension of the tensors described in xDesc. The third dimension must match the hiddenSize argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc. The tensor must be fully packed.
hx

Input. Data pointer to GPU memory associated with the tensor descriptor hxDesc. If a NULL pointer is passed, the initial hidden state of the network will be initialized to zero.

cxDesc
Input. A fully packed tensor descriptor describing the initial cell state for LSTM networks. The first dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:
• If direction is CUDNN_UNIDIRECTIONAL the first dimension should match the numLayers argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the first dimension should match double the numLayers argument passed to cudnnSetRNNDescriptor.
The second dimension must match the first dimension of the tensors described in xDesc. The third dimension must match the hiddenSize argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc. The tensor must be fully packed.
cx

Input. Data pointer to GPU memory associated with the tensor descriptor cxDesc. If a NULL pointer is passed, the initial cell state of the network will be initialized to zero.

wDesc

Input. Handle to a previously initialized filter descriptor describing the weights for the RNN.

w

Input. Data pointer to GPU memory associated with the filter descriptor wDesc.

yDesc
Input. An array of fully packed tensor descriptors describing the output from each recurrent iteration (one descriptor per iteration). The second dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:
• If direction is CUDNN_UNIDIRECTIONAL the second dimension should match the hiddenSize argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the second dimension should match double the hiddenSize argument passed to cudnnSetRNNDescriptor.
The first dimension of the tensor n must match the first dimension of the tensor n in xDesc.
y

Output. Data pointer to GPU memory associated with the output tensor descriptor yDesc.

hyDesc
Input. A fully packed tensor descriptor describing the final hidden state of the RNN. The first dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:
• If direction is CUDNN_UNIDIRECTIONAL the first dimension should match the numLayers argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the first dimension should match double the numLayers argument passed to cudnnSetRNNDescriptor.
The second dimension must match the first dimension of the tensors described in xDesc. The third dimension must match the hiddenSize argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc. The tensor must be fully packed.
hy

Output. Data pointer to GPU memory associated with the tensor descriptor hyDesc. If a NULL pointer is passed, the final hidden state of the network will not be saved.

cyDesc
Input. A fully packed tensor descriptor describing the final cell state for LSTM networks. The first dimension of the tensor depends on the direction argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc:
• If direction is CUDNN_UNIDIRECTIONAL the first dimension should match the numLayers argument passed to cudnnSetRNNDescriptor.
• If direction is CUDNN_BIDIRECTIONAL the first dimension should match double the numLayers argument passed to cudnnSetRNNDescriptor.
The second dimension must match the first dimension of the tensors described in xDesc. The third dimension must match the hiddenSize argument passed to the cudnnSetRNNDescriptor call used to initialize rnnDesc. The tensor must be fully packed.
cy

Output. Data pointer to GPU memory associated with the tensor descriptor cyDesc. If a NULL pointer is passed, the final cell state of the network will not be saved.

findIntensity

Input. This input was unused in versions prior to cuDNN 7.2.0. In cuDNN 7.2.0 and later versions, it controls the overall runtime of the RNN find algorithms by selecting the percentage of a large Cartesian product space to be searched.

• Setting findIntensity within the range (0, 1.0] selects the percentage of the entire RNN search space to search. When findIntensity is set to 1.0, a full search is performed over all RNN parameters.
• When findIntensity is set to 0.0f, a quick, minimal search is performed. This setting has the best runtime. However, in this case the parameters returned by this function will not correspond to the best performance of the algorithm; a longer search might discover better parameters. This option will execute up to three instances of the configured RNN problem. Runtime will vary proportionally to RNN problem size, as it will in the other cases, hence no guarantee of an explicit time bound can be given.
• Setting findIntensity within the range [-1.0, 0) selects the percentage of a reduced Cartesian product space to be searched. This reduced search space has been heuristically selected to have good performance. A setting of -1.0 represents a full search over this reduced search space.
• Values outside the range [-1, 1] are clamped to the range [-1, 1], and then interpreted as per the above.
• Setting findIntensity to 1.0 in cuDNN 7.2 and later versions is equivalent to the behavior of this function in versions prior to cuDNN 7.2.0.
• This function times single RNN executions over large parameter spaces, one execution per parameter combination. The times returned by this function are latencies.

requestedAlgoCount

Input. The maximum number of elements to be stored in perfResults.

returnedAlgoCount

Output. The number of output elements stored in perfResults.

perfResults

Output. A user-allocated array to store performance metrics sorted ascending by compute time.

workspace

Input. Data pointer to GPU memory to be used as a workspace for this call.

workSpaceSizeInBytes

Input. Specifies the size in bytes of the provided workspace.

reserveSpace

Input/Output. Data pointer to GPU memory to be used as a reserve space for this call.

reserveSpaceSizeInBytes

Input. Specifies the size in bytes of the provided reserveSpace.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• The descriptor rnnDesc is invalid.
• At least one of the descriptors hxDesc, cxDesc, wDesc, hyDesc, cyDesc or one of the descriptors in xDesc, yDesc is invalid.
• The descriptors in one of xDesc, hxDesc, cxDesc, wDesc, yDesc, hyDesc, cyDesc have incorrect strides or dimensions.
• workSpaceSizeInBytes is too small.
• reserveSpaceSizeInBytes is too small.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

CUDNN_STATUS_ALLOC_FAILED

The function was unable to allocate memory.
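
The xDesc constraints above (batch size may decrease but not increase from one iteration to the next, and the vector length is the same for every iteration) could be checked on the host before building the descriptor array. The following is a minimal sketch in plain C; rnn_input_shapes_valid is a hypothetical helper that operates on plain integer arrays rather than on cuDNN tensor descriptors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when batch sizes never increase from iteration n to n+1
   and every iteration uses the same vector length. */
bool rnn_input_shapes_valid(const int *batchSizes, const int *vectorLengths,
                            int seqLength)
{
    assert(batchSizes != NULL && vectorLengths != NULL);
    for (int n = 0; n + 1 < seqLength; ++n) {
        if (batchSizes[n + 1] > batchSizes[n]) {
            return false;  /* batch size grew between iterations */
        }
        if (vectorLengths[n + 1] != vectorLengths[n]) {
            return false;  /* second dimension differs between iterations */
        }
    }
    return true;
}
```

Batch sizes such as {4, 4, 3, 1} pass this check, whereas {4, 5, 3} do not because the second iteration has a larger batch than the first.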

### 4.64. cudnnGetActivationDescriptor

```cudnnStatus_t cudnnGetActivationDescriptor(
const cudnnActivationDescriptor_t   activationDesc,
cudnnActivationMode_t              *mode,
cudnnNanPropagation_t              *reluNanOpt,
double                             *coef)```

This function queries a previously initialized generic activation descriptor object.

Parameters

activationDesc

Input. Handle to a previously created activation descriptor.

mode

Output. Enumerant to specify the activation mode.

reluNanOpt

Output. Enumerant to specify the NaN propagation mode.

coef

Output. Floating point number to specify the clipping threshold when the activation mode is set to CUDNN_ACTIVATION_CLIPPED_RELU or to specify the alpha coefficient when the activation mode is set to CUDNN_ACTIVATION_ELU.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The object was queried successfully.

### 4.65. cudnnGetAlgorithmDescriptor

```cudnnStatus_t cudnnGetAlgorithmDescriptor(
const cudnnAlgorithmDescriptor_t    algoDesc,
cudnnAlgorithm_t                    *algorithm)```

(New for 7.1)

This function queries a previously initialized generic algorithm descriptor object.

Parameters

algoDesc

Input. Handle to a previously created algorithm descriptor.

algorithm

Output. Struct to specify the algorithm.

Returns

CUDNN_STATUS_SUCCESS

The object was queried successfully.

### 4.66. cudnnGetAlgorithmPerformance

```cudnnStatus_t cudnnGetAlgorithmPerformance(
const cudnnAlgorithmPerformance_t   algoPerf,
cudnnAlgorithmDescriptor_t*         algoDesc,
cudnnStatus_t*                      status,
float*                              time,
size_t*                             memory)```

(New for 7.1)

This function queries a previously initialized generic algorithm performance object.

Parameters

algoPerf

Input. Handle to a previously created algorithm performance object.

algoDesc

Output. The algorithm descriptor which the performance results describe.

status

Output. The cuDNN status returned from running the algoDesc algorithm.

time

Output. The GPU time spent running the algoDesc algorithm.

memory

Output. The GPU memory needed to run the algoDesc algorithm.

Returns

CUDNN_STATUS_SUCCESS

The object was queried successfully.

### 4.67. cudnnGetAlgorithmSpaceSize

```cudnnStatus_t cudnnGetAlgorithmSpaceSize(
cudnnHandle_t               handle,
cudnnAlgorithmDescriptor_t  algoDesc,
size_t*                     algoSpaceSizeInBytes)```

(New for 7.1)

This function queries for the amount of host memory needed to call cudnnSaveAlgorithm, much like the “get workspace size” functions query for the amount of device memory needed.

Parameters

handle

Input. Handle to a previously created cuDNN context.

algoDesc

Input. A previously created algorithm descriptor.

algoSpaceSizeInBytes

Output. Amount of host memory needed as workspace to be able to save the metadata from the specified algoDesc.

Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the arguments is NULL.

### 4.68. cudnnGetCTCLossDescriptor

```cudnnStatus_t cudnnGetCTCLossDescriptor(
cudnnCTCLossDescriptor_t         ctcLossDesc,
cudnnDataType_t*                 compType)```

This function returns the configuration of the passed CTC loss function descriptor.

Parameters

ctcLossDesc

Input. CTC loss function descriptor passed, from which to retrieve the configuration.

compType

Output. Compute type associated with this CTC loss function descriptor.

Returns

CUDNN_STATUS_SUCCESS

The function returned successfully.

CUDNN_STATUS_BAD_PARAM

The CTC loss descriptor passed is invalid.

### 4.69. cudnnGetCTCLossWorkspaceSize

```cudnnStatus_t cudnnGetCTCLossWorkspaceSize(
cudnnHandle_t                        handle,
const   cudnnTensorDescriptor_t      probsDesc,
const   cudnnTensorDescriptor_t      gradientsDesc,
const   int                         *labels,
const   int                         *labelLengths,
const   int                         *inputLengths,
cudnnCTCLossAlgo_t                   algo,
const   cudnnCTCLossDescriptor_t     ctcLossDesc,
size_t                              *sizeInBytes)```

This function returns the amount of GPU memory workspace the user needs to allocate to be able to call cudnnCTCLoss with the specified algorithm. The workspace allocated will then be passed to the routine cudnnCTCLoss.

Parameters

handle

Input. Handle to a previously created cuDNN context.

probsDesc

Input. Handle to the previously initialized probabilities tensor descriptor.

gradientsDesc

Input. Handle to a previously initialized gradients tensor descriptor.

labels

Input. Pointer to a previously initialized labels list.

labelLengths

Input. Pointer to a previously initialized lengths list, to walk the above labels list.

inputLengths

Input. Pointer to a previously initialized list of the lengths of the timing steps in each batch.

algo

Input. Enumerant that specifies the chosen CTC loss algorithm.

ctcLossDesc

Input. Handle to the previously initialized CTC loss descriptor.

sizeInBytes

Output. Amount of GPU memory needed as workspace to be able to execute the CTC loss computation with the specified algo.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• The dimensions of probsDesc do not match the dimensions of gradientsDesc.
• The inputLengths do not agree with the first dimension of probsDesc.
• The workSpaceSizeInBytes is not sufficient.
• The labelLengths is greater than 256.

CUDNN_STATUS_NOT_SUPPORTED

A compute or data type other than FLOAT was chosen, or an unknown algorithm type was chosen.

### 4.70. cudnnGetCallback

```cudnnStatus_t cudnnGetCallback(
unsigned            *mask,
void                **udata,
cudnnCallback_t     *fptr)```

(New for 7.1)

This function queries the internal states of cuDNN error reporting functionality.

Parameters

mask

Output. Pointer to the address where the current internal error reporting message bit mask will be stored.

udata

Output. Pointer to the address where the current internally stored udata address will be stored.

fptr

Output. Pointer to the address where the current internally stored callback function pointer will be stored. When the built-in default callback function is used, NULL will be returned.

Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the input parameters is NULL.

### 4.71. cudnnGetConvolution2dDescriptor

```cudnnStatus_t cudnnGetConvolution2dDescriptor(
const cudnnConvolutionDescriptor_t  convDesc,
int                                *pad_h,
int                                *pad_w,
int                                *u,
int                                *v,
int                                *dilation_h,
int                                *dilation_w,
cudnnConvolutionMode_t             *mode,
cudnnDataType_t                    *computeType)```

This function queries a previously initialized 2D convolution descriptor object.

Parameters

convDesc

Input. Handle to a previously created convolution descriptor.

pad_h

Output. Zero-padding height: number of rows of zeros implicitly concatenated onto the top and onto the bottom of input images.

pad_w

Output. Zero-padding width: number of columns of zeros implicitly concatenated onto the left and onto the right of input images.

u

Output. Vertical filter stride.

v

Output. Horizontal filter stride.

dilation_h

Output. Filter height dilation.

dilation_w

Output. Filter width dilation.

mode

Output. Convolution mode.

computeType

Output. Compute precision.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The operation was successful.

CUDNN_STATUS_BAD_PARAM

The parameter convDesc is NULL.

### 4.72. cudnnGetConvolution2dForwardOutputDim

```cudnnStatus_t cudnnGetConvolution2dForwardOutputDim(
const cudnnConvolutionDescriptor_t  convDesc,
const cudnnTensorDescriptor_t       inputTensorDesc,
const cudnnFilterDescriptor_t       filterDesc,
int                                *n,
int                                *c,
int                                *h,
int                                *w)```

This function returns the dimensions of the resulting 4D tensor of a 2D convolution, given the convolution descriptor, the input tensor descriptor and the filter descriptor. This function can help to set up the output tensor and allocate the proper amount of memory prior to launching the actual convolution.

Each dimension h and w of the output images is computed as follows:

```    outputDim = 1 + ( inputDim + 2*pad - (((filterDim-1)*dilation)+1) )/convolutionStride;
```
Note: The dimensions provided by this routine must be strictly respected when calling cudnnConvolutionForward() or cudnnConvolutionBackwardBias(). Providing a smaller or larger output tensor is not supported by the convolution routines.
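As a quick check of the formula above, the following C sketch evaluates it for one spatial dimension. The helper name `conv_output_dim` is hypothetical and is not part of cuDNN; it only mirrors the arithmetic, with C integer division truncating the quotient.

```c
#include <assert.h>

/* Hypothetical helper (not a cuDNN function): evaluates the documented
 * output-size formula for a single spatial dimension (h or w). */
int conv_output_dim(int inputDim, int pad, int filterDim,
                    int dilation, int convolutionStride)
{
    assert(convolutionStride > 0 && dilation > 0 && filterDim > 0);
    /* Extent covered by the filter once dilation is applied. */
    int dilatedFilterDim = (filterDim - 1) * dilation + 1;
    /* Integer division truncates, matching the formula as written. */
    return 1 + (inputDim + 2 * pad - dilatedFilterDim) / convolutionStride;
}
```

For example, an input dimension of 224 with pad 3, filter dimension 7, dilation 1 and stride 2 gives 1 + (224 + 6 - 7)/2 = 112.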

Parameters

convDesc

Input. Handle to a previously created convolution descriptor.

inputTensorDesc

Input. Handle to a previously initialized tensor descriptor.

filterDesc

Input. Handle to a previously initialized filter descriptor.

n

Output. Number of output images.

c

Output. Number of output feature maps per image.

h

Output. Height of each output feature map.

w

Output. Width of each output feature map.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_BAD_PARAM

One or more of the descriptors has not been created correctly or there is a mismatch between the feature maps of inputTensorDesc and filterDesc.

CUDNN_STATUS_SUCCESS

The query was successful.

### 4.73. cudnnGetConvolutionBackwardDataAlgorithm

```cudnnStatus_t cudnnGetConvolutionBackwardDataAlgorithm(
cudnnHandle_t                          handle,
const cudnnFilterDescriptor_t          wDesc,
const cudnnTensorDescriptor_t          dyDesc,
const cudnnConvolutionDescriptor_t     convDesc,
const cudnnTensorDescriptor_t          dxDesc,
cudnnConvolutionBwdDataPreference_t    preference,
size_t                                 memoryLimitInBytes,
cudnnConvolutionBwdDataAlgo_t         *algo)```

This function serves as a heuristic for obtaining the best suited algorithm for cudnnConvolutionBackwardData for the given layer specifications. Based on the input preference, this function will either return the fastest algorithm or the fastest algorithm within a given memory limit. For an exhaustive search for the fastest algorithm, please use cudnnFindConvolutionBackwardDataAlgorithm.

Parameters

handle

Input. Handle to a previously created cuDNN context.

wDesc

Input. Handle to a previously initialized filter descriptor.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

convDesc

Input. Previously initialized convolution descriptor.

dxDesc

Input. Handle to the previously initialized output tensor descriptor.

preference

Input. Enumerant to express the preference criteria in terms of memory requirement and speed.

memoryLimitInBytes

Input. The maximum amount of GPU memory the user is willing to use as a workspace. This is currently a placeholder and is not used.

algo

Output. Enumerant that specifies which convolution algorithm should be used to compute the results according to the specified preference.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• The numbers of feature maps of the input tensor and output tensor differ.
• The dataType of the two tensor descriptors or the filter are different.

### 4.74. cudnnGetConvolutionBackwardDataAlgorithmMaxCount

```cudnnStatus_t cudnnGetConvolutionBackwardDataAlgorithmMaxCount(
cudnnHandle_t       handle,
int                 *count)```

This function returns the maximum number of algorithms which can be returned from cudnnFindConvolutionBackwardDataAlgorithm() and cudnnGetConvolutionBackwardDataAlgorithm_v7(). This is the sum of all algorithms plus the sum of all algorithms with Tensor Core operations supported for the current device.

Parameters

handle

Input. Handle to a previously created cuDNN context.

count

Output. The resulting maximum number of algorithms.

Returns

CUDNN_STATUS_SUCCESS

The function was successful.

CUDNN_STATUS_BAD_PARAM

The provided handle is not allocated properly.

### 4.75. cudnnGetConvolutionBackwardDataAlgorithm_v7

```cudnnStatus_t cudnnGetConvolutionBackwardDataAlgorithm_v7(
cudnnHandle_t                          handle,
const cudnnFilterDescriptor_t          wDesc,
const cudnnTensorDescriptor_t          dyDesc,
const cudnnConvolutionDescriptor_t     convDesc,
const cudnnTensorDescriptor_t          dxDesc,
const int                              requestedAlgoCount,
int                                   *returnedAlgoCount,
cudnnConvolutionBwdDataAlgoPerf_t     *perfResults)```

This function serves as a heuristic for obtaining the best suited algorithm for cudnnConvolutionBackwardData for the given layer specifications. This function will return all algorithms (including CUDNN_TENSOR_OP_MATH and CUDNN_DEFAULT_MATH versions of algorithms where CUDNN_TENSOR_OP_MATH may be available) sorted by expected (based on internal heuristic) relative performance with fastest being index 0 of perfResults. For an exhaustive search for the fastest algorithm, please use cudnnFindConvolutionBackwardDataAlgorithm. The total number of resulting algorithms can be queried through the API cudnnGetConvolutionBackwardDataAlgorithmMaxCount().

Parameters

handle

Input. Handle to a previously created cuDNN context.

wDesc

Input. Handle to a previously initialized filter descriptor.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

convDesc

Input. Previously initialized convolution descriptor.

dxDesc

Input. Handle to the previously initialized output tensor descriptor.

requestedAlgoCount

Input. The maximum number of elements to be stored in perfResults.

returnedAlgoCount

Output. The number of output elements stored in perfResults.

perfResults

Output. A user-allocated array to store performance metrics sorted ascending by compute time.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• One of the parameters handle, wDesc, dyDesc, convDesc, dxDesc, perfResults, returnedAlgoCount is NULL.
• The numbers of feature maps of the input tensor and output tensor differ.
• The dataType of the two tensor descriptors or the filter are different.
• requestedAlgoCount is less than or equal to 0.
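Because perfResults is sorted with the expected fastest algorithm at index 0, a caller with a limited workspace budget can walk the array and take the first usable entry that fits. The sketch below illustrates that loop in plain C. The `AlgoPerf` struct is a stand-in with hypothetical fields; real code would use cudnnConvolutionBwdDataAlgoPerf_t from cudnn.h, whose exact members should be checked against the header. Treating a status of 0 as "usable" is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for one perfResults entry (hypothetical; real code would use
 * the cudnnConvolutionBwdDataAlgoPerf_t type from cudnn.h). */
typedef struct {
    int    algo;    /* algorithm identifier */
    int    status;  /* 0 is assumed to mean the entry is usable */
    size_t memory;  /* workspace bytes the algorithm needs */
} AlgoPerf;

/* Returns the index of the first usable entry whose workspace fits in
 * workspaceLimit, or -1 if none does. Entries are assumed to be sorted
 * fastest-first, so the first match is the fastest that fits. */
int pick_algo_within_limit(const AlgoPerf *perfResults,
                           int returnedAlgoCount,
                           size_t workspaceLimit)
{
    assert(perfResults != NULL || returnedAlgoCount == 0);
    for (int i = 0; i < returnedAlgoCount; ++i) {
        if (perfResults[i].status == 0 &&
            perfResults[i].memory <= workspaceLimit) {
            return i;
        }
    }
    return -1;
}
```

A return value of -1 means the caller has to raise the budget or request more results.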

### 4.76. cudnnGetConvolutionBackwardDataWorkspaceSize

```cudnnStatus_t cudnnGetConvolutionBackwardDataWorkspaceSize(
cudnnHandle_t                       handle,
const cudnnFilterDescriptor_t       wDesc,
const cudnnTensorDescriptor_t       dyDesc,
const cudnnConvolutionDescriptor_t  convDesc,
const cudnnTensorDescriptor_t       dxDesc,
cudnnConvolutionBwdDataAlgo_t       algo,
size_t                             *sizeInBytes)```

This function returns the amount of GPU memory workspace the user needs to allocate to be able to call cudnnConvolutionBackwardData with the specified algorithm. The workspace allocated will then be passed to the routine cudnnConvolutionBackwardData. The specified algorithm can be the result of the call to cudnnGetConvolutionBackwardDataAlgorithm or can be chosen arbitrarily by the user. Note that not every algorithm is available for every configuration of the input tensor and/or every configuration of the convolution descriptor.

Parameters

handle

Input. Handle to a previously created cuDNN context.

wDesc

Input. Handle to a previously initialized filter descriptor.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

convDesc

Input. Previously initialized convolution descriptor.

dxDesc

Input. Handle to the previously initialized output tensor descriptor.

algo

Input. Enumerant that specifies the chosen convolution algorithm.

sizeInBytes

Output. Amount of GPU memory needed as workspace to be able to execute a backward data convolution with the specified algo.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• The numbers of feature maps of the input tensor and output tensor differ.
• The dataType of the two tensor descriptors or the filter are different.
CUDNN_STATUS_NOT_SUPPORTED

The combination of the tensor descriptors, filter descriptor and convolution descriptor is not supported for the specified algorithm.

### 4.77. cudnnGetConvolutionBackwardFilterAlgorithm

```cudnnStatus_t cudnnGetConvolutionBackwardFilterAlgorithm(
cudnnHandle_t                          handle,
const cudnnTensorDescriptor_t          xDesc,
const cudnnTensorDescriptor_t          dyDesc,
const cudnnConvolutionDescriptor_t     convDesc,
const cudnnFilterDescriptor_t          dwDesc,
cudnnConvolutionBwdFilterPreference_t  preference,
size_t                                 memoryLimitInBytes,
cudnnConvolutionBwdFilterAlgo_t       *algo)```

This function serves as a heuristic for obtaining the best suited algorithm for cudnnConvolutionBackwardFilter for the given layer specifications. Based on the input preference, this function will either return the fastest algorithm or the fastest algorithm within a given memory limit. For an exhaustive search for the fastest algorithm, please use cudnnFindConvolutionBackwardFilterAlgorithm.

Parameters

handle

Input. Handle to a previously created cuDNN context.

xDesc

Input. Handle to the previously initialized input tensor descriptor.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

convDesc

Input. Previously initialized convolution descriptor.

dwDesc

Input. Handle to a previously initialized filter descriptor.

preference

Input. Enumerant to express the preference criteria in terms of memory requirement and speed.

memoryLimitInBytes

Input. The maximum amount of GPU memory the user is willing to use as a workspace. This is currently a placeholder and is not used.

algo

Output. Enumerant that specifies which convolution algorithm should be used to compute the results according to the specified preference.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• The numbers of feature maps of the input tensor and output tensor differ.
• The dataType of the two tensor descriptors or the filter are different.

### 4.78. cudnnGetConvolutionBackwardFilterAlgorithmMaxCount

```cudnnStatus_t cudnnGetConvolutionBackwardFilterAlgorithmMaxCount(
cudnnHandle_t       handle,
int                 *count)```

This function returns the maximum number of algorithms which can be returned from cudnnFindConvolutionBackwardFilterAlgorithm() and cudnnGetConvolutionBackwardFilterAlgorithm_v7(). This is the sum of all algorithms plus the sum of all algorithms with Tensor Core operations supported for the current device.

Parameters

handle

Input. Handle to a previously created cuDNN context.

count

Output. The resulting maximum count of algorithms.

Returns

CUDNN_STATUS_SUCCESS

The function was successful.

CUDNN_STATUS_BAD_PARAM

The provided handle is not allocated properly.

### 4.79. cudnnGetConvolutionBackwardFilterAlgorithm_v7

```cudnnStatus_t cudnnGetConvolutionBackwardFilterAlgorithm_v7(
cudnnHandle_t                          handle,
const cudnnTensorDescriptor_t          xDesc,
const cudnnTensorDescriptor_t          dyDesc,
const cudnnConvolutionDescriptor_t     convDesc,
const cudnnFilterDescriptor_t          dwDesc,
const int                              requestedAlgoCount,
int                                   *returnedAlgoCount,
cudnnConvolutionBwdFilterAlgoPerf_t   *perfResults)```

This function serves as a heuristic for obtaining the best suited algorithm for cudnnConvolutionBackwardFilter for the given layer specifications. This function will return all algorithms (including CUDNN_TENSOR_OP_MATH and CUDNN_DEFAULT_MATH versions of algorithms where CUDNN_TENSOR_OP_MATH may be available) sorted by expected (based on internal heuristic) relative performance with fastest being index 0 of perfResults. For an exhaustive search for the fastest algorithm, please use cudnnFindConvolutionBackwardFilterAlgorithm. The total number of resulting algorithms can be queried through the API cudnnGetConvolutionBackwardFilterAlgorithmMaxCount().

Parameters

handle

Input. Handle to a previously created cuDNN context.

xDesc

Input. Handle to the previously initialized input tensor descriptor.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

convDesc

Input. Previously initialized convolution descriptor.

dwDesc

Input. Handle to a previously initialized filter descriptor.

requestedAlgoCount

Input. The maximum number of elements to be stored in perfResults.

returnedAlgoCount

Output. The number of output elements stored in perfResults.

perfResults

Output. A user-allocated array to store performance metrics sorted ascending by compute time.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• One of the parameters handle, xDesc, dyDesc, convDesc, dwDesc, perfResults, returnedAlgoCount is NULL.
• The numbers of feature maps of the input tensor and output tensor differ.
• The dataType of the two tensor descriptors or the filter are different.
• requestedAlgoCount is less than or equal to 0.

### 4.80. cudnnGetConvolutionBackwardFilterWorkspaceSize

```cudnnStatus_t cudnnGetConvolutionBackwardFilterWorkspaceSize(
cudnnHandle_t                       handle,
const cudnnTensorDescriptor_t       xDesc,
const cudnnTensorDescriptor_t       dyDesc,
const cudnnConvolutionDescriptor_t  convDesc,
const cudnnFilterDescriptor_t       dwDesc,
cudnnConvolutionBwdFilterAlgo_t     algo,
size_t                             *sizeInBytes)```

This function returns the amount of GPU memory workspace the user needs to allocate to be able to call cudnnConvolutionBackwardFilter with the specified algorithm. The workspace allocated will then be passed to the routine cudnnConvolutionBackwardFilter. The specified algorithm can be the result of the call to cudnnGetConvolutionBackwardFilterAlgorithm or can be chosen arbitrarily by the user. Note that not every algorithm is available for every configuration of the input tensor and/or every configuration of the convolution descriptor.

Parameters

handle

Input. Handle to a previously created cuDNN context.

xDesc

Input. Handle to the previously initialized input tensor descriptor.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

convDesc

Input. Previously initialized convolution descriptor.

dwDesc

Input. Handle to a previously initialized filter descriptor.

algo

Input. Enumerant that specifies the chosen convolution algorithm.

sizeInBytes

Output. Amount of GPU memory needed as workspace to be able to execute a backward filter convolution with the specified algo.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• The numbers of feature maps of the input tensor and output tensor differ.
• The dataType of the two tensor descriptors or the filter are different.
CUDNN_STATUS_NOT_SUPPORTED

The combination of the tensor descriptors, filter descriptor and convolution descriptor is not supported for the specified algorithm.

### 4.81. cudnnGetConvolutionForwardAlgorithm

```cudnnStatus_t cudnnGetConvolutionForwardAlgorithm(
cudnnHandle_t                      handle,
const cudnnTensorDescriptor_t      xDesc,
const cudnnFilterDescriptor_t      wDesc,
const cudnnConvolutionDescriptor_t convDesc,
const cudnnTensorDescriptor_t      yDesc,
cudnnConvolutionFwdPreference_t    preference,
size_t                             memoryLimitInBytes,
cudnnConvolutionFwdAlgo_t         *algo)```

This function serves as a heuristic for obtaining the best suited algorithm for cudnnConvolutionForward for the given layer specifications. Based on the input preference, this function will either return the fastest algorithm or the fastest algorithm within a given memory limit. For an exhaustive search for the fastest algorithm, please use cudnnFindConvolutionForwardAlgorithm.

Parameters

handle

Input. Handle to a previously created cuDNN context.

xDesc

Input. Handle to the previously initialized input tensor descriptor.

wDesc

Input. Handle to a previously initialized convolution filter descriptor.

convDesc

Input. Previously initialized convolution descriptor.

yDesc

Input. Handle to the previously initialized output tensor descriptor.

preference

Input. Enumerant to express the preference criteria in terms of memory requirement and speed.

memoryLimitInBytes

Input. Used when the enumerant preference is set to CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT to specify the maximum amount of GPU memory the user is willing to use as a workspace.

algo

Output. Enumerant that specifies which convolution algorithm should be used to compute the results according to the specified preference.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• One of the parameters handle, xDesc, wDesc, convDesc, yDesc is NULL.
• Either yDesc or wDesc has a different number of dimensions from xDesc.
• The data types of tensors xDesc, yDesc or wDesc are not all the same.
• The number of feature maps in xDesc and wDesc differs.
• The tensor xDesc has a dimension smaller than 3.

### 4.82. cudnnGetConvolutionForwardAlgorithmMaxCount

```cudnnStatus_t cudnnGetConvolutionForwardAlgorithmMaxCount(
cudnnHandle_t   handle,
int             *count)```

This function returns the maximum number of algorithms which can be returned from cudnnFindConvolutionForwardAlgorithm() and cudnnGetConvolutionForwardAlgorithm_v7(). This is the sum of all algorithms plus the sum of all algorithms with Tensor Core operations supported for the current device.

Parameters

handle

Input. Handle to a previously created cuDNN context.

count

Output. The resulting maximum number of algorithms.

Returns

CUDNN_STATUS_SUCCESS

The function was successful.

CUDNN_STATUS_BAD_PARAM

The provided handle is not allocated properly.

### 4.83. cudnnGetConvolutionForwardAlgorithm_v7

```cudnnStatus_t cudnnGetConvolutionForwardAlgorithm_v7(
cudnnHandle_t                       handle,
const cudnnTensorDescriptor_t       xDesc,
const cudnnFilterDescriptor_t       wDesc,
const cudnnConvolutionDescriptor_t  convDesc,
const cudnnTensorDescriptor_t       yDesc,
const int                           requestedAlgoCount,
int                                *returnedAlgoCount,
cudnnConvolutionFwdAlgoPerf_t      *perfResults)```

This function serves as a heuristic for obtaining the best suited algorithm for cudnnConvolutionForward for the given layer specifications. This function will return all algorithms (including CUDNN_TENSOR_OP_MATH and CUDNN_DEFAULT_MATH versions of algorithms where CUDNN_TENSOR_OP_MATH may be available) sorted by expected (based on internal heuristic) relative performance with fastest being index 0 of perfResults. For an exhaustive search for the fastest algorithm, please use cudnnFindConvolutionForwardAlgorithm. The total number of resulting algorithms can be queried through the API cudnnGetConvolutionForwardAlgorithmMaxCount().

Parameters

handle

Input. Handle to a previously created cuDNN context.

xDesc

Input. Handle to the previously initialized input tensor descriptor.

wDesc

Input. Handle to a previously initialized convolution filter descriptor.

convDesc

Input. Previously initialized convolution descriptor.

yDesc

Input. Handle to the previously initialized output tensor descriptor.

requestedAlgoCount

Input. The maximum number of elements to be stored in perfResults.

returnedAlgoCount

Output. The number of output elements stored in perfResults.

perfResults

Output. A user-allocated array to store performance metrics sorted ascending by compute time.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• One of the parameters handle, xDesc, wDesc, convDesc, yDesc, perfResults, returnedAlgoCount is NULL.
• Either yDesc or wDesc has a different number of dimensions from xDesc.
• The data types of tensors xDesc, yDesc or wDesc are not all the same.
• The number of feature maps in xDesc and wDesc differs.
• The tensor xDesc has a dimension smaller than 3.
• requestedAlgoCount is less than or equal to 0.

### 4.84. cudnnGetConvolutionForwardWorkspaceSize

```cudnnStatus_t cudnnGetConvolutionForwardWorkspaceSize(
cudnnHandle_t   handle,
const   cudnnTensorDescriptor_t         xDesc,
const   cudnnFilterDescriptor_t         wDesc,
const   cudnnConvolutionDescriptor_t    convDesc,
const   cudnnTensorDescriptor_t         yDesc,
cudnnConvolutionFwdAlgo_t               algo,
size_t                                 *sizeInBytes)```

This function returns the amount of GPU memory workspace the user needs to allocate to be able to call cudnnConvolutionForward with the specified algorithm. The workspace allocated will then be passed to the routine cudnnConvolutionForward. The specified algorithm can be the result of the call to cudnnGetConvolutionForwardAlgorithm or can be chosen arbitrarily by the user. Note that not every algorithm is available for every configuration of the input tensor and/or every configuration of the convolution descriptor.

Parameters

handle

Input. Handle to a previously created cuDNN context.

xDesc

Input. Handle to the previously initialized x tensor descriptor.

wDesc

Input. Handle to a previously initialized filter descriptor.

convDesc

Input. Previously initialized convolution descriptor.

yDesc

Input. Handle to the previously initialized y tensor descriptor.

algo

Input. Enumerant that specifies the chosen convolution algorithm.

sizeInBytes

Output. Amount of GPU memory needed as workspace to be able to execute a forward convolution with the specified algo.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• One of the parameters handle, xDesc, wDesc, convDesc, yDesc is NULL.
• Either yDesc or wDesc has a different number of dimensions from xDesc.
• The data types of xDesc, yDesc and wDesc are not all the same.
• The numbers of feature maps of xDesc and wDesc differ.
• The tensor xDesc has a dimension smaller than 3.
CUDNN_STATUS_NOT_SUPPORTED

The combination of the tensor descriptors, filter descriptor and convolution descriptor is not supported for the specified algorithm.
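When several convolutions run one after another, one common pattern (an assumption here, not something this guide prescribes) is to query sizeInBytes for each of them and allocate a single buffer of the largest size. The helper below is a plain-C sketch of that bookkeeping only. The name `max_workspace_size` is hypothetical, and the sizes are assumed to come from successful workspace-size queries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a cuDNN function): returns the largest of the
 * given workspace sizes, i.e. the size of one buffer that can serve every
 * queried convolution when they do not run concurrently. */
size_t max_workspace_size(const size_t *sizesInBytes, size_t count)
{
    assert(sizesInBytes != NULL || count == 0);
    size_t largest = 0;  /* 0 means no workspace is required */
    for (size_t i = 0; i < count; ++i) {
        if (sizesInBytes[i] > largest) {
            largest = sizesInBytes[i];
        }
    }
    return largest;
}
```

Sharing one buffer is only safe if the convolutions using it are serialized, for example by running on the same stream.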

### 4.85. cudnnGetConvolutionGroupCount

```cudnnStatus_t cudnnGetConvolutionGroupCount(
cudnnConvolutionDescriptor_t    convDesc,
int                            *groupCount)```

This function returns the group count specified in the given convolution descriptor.

Returns

CUDNN_STATUS_SUCCESS

The group count was returned successfully.

CUDNN_STATUS_BAD_PARAM

An invalid convolution descriptor was provided.

### 4.86. cudnnGetConvolutionMathType

```cudnnStatus_t cudnnGetConvolutionMathType(
cudnnConvolutionDescriptor_t    convDesc,
cudnnMathType_t                *mathType)```

This function returns the math type specified in a given convolution descriptor.

Returns

CUDNN_STATUS_SUCCESS

The math type was returned successfully.

CUDNN_STATUS_BAD_PARAM

An invalid convolution descriptor was provided.

### 4.87. cudnnGetConvolutionNdDescriptor

```cudnnStatus_t cudnnGetConvolutionNdDescriptor(
const cudnnConvolutionDescriptor_t  convDesc,
int                                 arrayLengthRequested,
int                                *arrayLength,
int                                 padA[],
int                                 filterStrideA[],
int                                 dilationA[],
cudnnConvolutionMode_t             *mode,
cudnnDataType_t                    *dataType)```

This function queries a previously initialized convolution descriptor object.

Parameters

convDesc

Input. Handle to a previously created convolution descriptor.

arrayLengthRequested

Input. Dimension of the expected convolution descriptor. It is also the minimum size of the arrays padA, filterStrideA and dilationA in order to be able to hold the results.

arrayLength

Output. Actual dimension of the convolution descriptor.

padA

Output. Array of dimension of at least arrayLengthRequested that will be filled with the padding parameters from the provided convolution descriptor.

filterStrideA

Output. Array of dimension of at least arrayLengthRequested that will be filled with the filter stride from the provided convolution descriptor.

dilationA

Output. Array of dimension of at least arrayLengthRequested that will be filled with the dilation parameters from the provided convolution descriptor.

mode

Output. Convolution mode of the provided descriptor.

dataType

Output. Datatype of the provided descriptor.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• The descriptor convDesc is NULL.
• The arrayLengthRequested is negative.
CUDNN_STATUS_NOT_SUPPORTED

The arrayLengthRequested is greater than CUDNN_DIM_MAX-2.

### 4.88. cudnnGetConvolutionNdForwardOutputDim

```cudnnStatus_t cudnnGetConvolutionNdForwardOutputDim(
const cudnnConvolutionDescriptor_t  convDesc,
const cudnnTensorDescriptor_t       inputTensorDesc,
const cudnnFilterDescriptor_t       filterDesc,
int                                 nbDims,
int                                 tensorOuputDimA[])```

This function returns the dimensions of the resulting n-D tensor of a (nbDims-2)-D convolution, given the convolution descriptor, the input tensor descriptor and the filter descriptor. This function can help to set up the output tensor and allocate the proper amount of memory prior to launching the actual convolution.

Each dimension of the (nbDims-2)-D images of the output tensor is computed as follows:

```    outputDim = 1 + ( inputDim + 2*pad - (((filterDim-1)*dilation)+1) )/convolutionStride;
```
Note: The dimensions provided by this routine must be strictly respected when calling cudnnConvolutionForward() or cudnnConvolutionBackwardBias(). Providing a smaller or larger output tensor is not supported by the convolution routines.
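The following C sketch applies the formula above to every spatial dimension of an n-D tensor. It is an illustration only: `conv_nd_output_dims` is a hypothetical helper, not a cuDNN function, and the handling of the first two entries (batch size copied from the input, number of feature maps taken from the first filter dimension) is an assumption based on the usual batch, feature-map, spatial ordering.

```c
#include <assert.h>

/* Hypothetical helper (not a cuDNN function). Dimension order is assumed
 * to be batch, feature maps, then nbDims-2 spatial dimensions. */
void conv_nd_output_dims(int nbDims,
                         const int inputDimA[],     /* nbDims entries */
                         const int filterDimA[],    /* nbDims entries */
                         const int padA[],          /* nbDims-2 entries */
                         const int filterStrideA[], /* nbDims-2 entries */
                         const int dilationA[],     /* nbDims-2 entries */
                         int outputDimA[])          /* nbDims entries */
{
    assert(nbDims >= 3);
    outputDimA[0] = inputDimA[0];   /* batch size is unchanged (assumed) */
    outputDimA[1] = filterDimA[0];  /* one output map per filter (assumed) */
    for (int i = 0; i < nbDims - 2; ++i) {
        assert(filterStrideA[i] > 0);
        /* Extent covered by the filter once dilation is applied. */
        int dilatedFilterDim = (filterDimA[i + 2] - 1) * dilationA[i] + 1;
        outputDimA[i + 2] =
            1 + (inputDimA[i + 2] + 2 * padA[i] - dilatedFilterDim)
                    / filterStrideA[i];
    }
}
```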

Parameters

convDesc

Input. Handle to a previously created convolution descriptor.

inputTensorDesc

Input. Handle to a previously initialized tensor descriptor.

filterDesc

Input. Handle to a previously initialized filter descriptor.

nbDims

Input. Dimension of the output tensor.

tensorOuputDimA

Output. Array of dimension nbDims that, on exit from this routine, contains the sizes of the output tensor.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• One of the parameters convDesc, inputTensorDesc, or filterDesc is NULL.
• The dimension of the filter descriptor filterDesc is different from the dimension of the input tensor descriptor inputTensorDesc.
• The dimension of the convolution descriptor is different from the dimension of the input tensor descriptor inputTensorDesc minus 2.
• The number of feature maps of the filter descriptor filterDesc is different from that of the input tensor descriptor inputTensorDesc.
• The size of the dilated filter filterDesc is larger than the padded sizes of the input tensor.
• The dimension nbDims of the output array is negative or greater than the dimension of the input tensor descriptor inputTensorDesc.
CUDNN_STATUS_SUCCESS

The routine exits successfully.

### 4.89. cudnnGetCudartVersion

`size_t cudnnGetCudartVersion()`

The same version of a given cuDNN library can be compiled against different CUDA Toolkit versions. This routine returns the CUDA Toolkit version that the currently used cuDNN library has been compiled against.
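The routine returns a single integer. Assuming that value follows the CUDA runtime's CUDART_VERSION convention of 1000*major + 10*minor (for example, 9020 for CUDA 9.2), which this section does not state explicitly, it can be decoded as in the sketch below. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: splits a version number encoded as
 * 1000*major + 10*minor (assumed encoding) into its two parts. */
void decode_cudart_version(size_t version, int *major, int *minor)
{
    assert(major != NULL && minor != NULL);
    *major = (int)(version / 1000);
    *minor = (int)((version % 1000) / 10);
}
```

A caller would pass the result of cudnnGetCudartVersion() as the first argument.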

### 4.90. cudnnGetDropoutDescriptor

```cudnnStatus_t cudnnGetDropoutDescriptor(
cudnnDropoutDescriptor_t    dropoutDesc,
cudnnHandle_t               handle,
float                      *dropout,
void                       **states,
unsigned long long         *seed)```

This function queries the fields of a previously initialized dropout descriptor.

Parameters

dropoutDesc

Input. Previously initialized dropout descriptor.

handle

Input. Handle to a previously created cuDNN context.

dropout

Output. The probability with which the value from input is set to 0 during the dropout layer.

states

Output. Pointer to user-allocated GPU memory that holds random number generator states.

seed

Output. Seed used to initialize random number generator states.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The call was successful.

CUDNN_STATUS_BAD_PARAM

One or more of the arguments was an invalid pointer.

### 4.91. cudnnGetErrorString

`const char * cudnnGetErrorString(cudnnStatus_t status)`

This function converts the cuDNN status code to a NUL terminated (ASCIIZ) static string. For example, when the input argument is CUDNN_STATUS_SUCCESS, the returned string is "CUDNN_STATUS_SUCCESS". When an invalid status value is passed to the function, the returned string is "CUDNN_UNKNOWN_STATUS".

Parameters

status

Input. cuDNN enumerated status code.

Returns

Pointer to a static, NUL terminated string with the status name.

### 4.92. cudnnGetFilter4dDescriptor

```cudnnStatus_t cudnnGetFilter4dDescriptor(
const cudnnFilterDescriptor_t     filterDesc,
cudnnDataType_t            *dataType,
cudnnTensorFormat_t        *format,
int                        *k,
int                        *c,
int                        *h,
int                        *w)```

This function queries the parameters of the previously initialized filter descriptor object.

Parameters

filterDesc

Input. Handle to a previously created filter descriptor.

dataType

Output. Data type.

format

Output. Type of format.

k

Output. Number of output feature maps.

c

Output. Number of input feature maps.

h

Output. Height of each filter.

w

Output. Width of each filter.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

### 4.93. cudnnGetFilterNdDescriptor

```cudnnStatus_t cudnnGetFilterNdDescriptor(
const cudnnFilterDescriptor_t   wDesc,
int                             nbDimsRequested,
cudnnDataType_t                *dataType,
cudnnTensorFormat_t            *format,
int                            *nbDims,
int                             filterDimA[])```

This function queries a previously initialized filter descriptor object.

Parameters

wDesc

Input. Handle to a previously initialized filter descriptor.

nbDimsRequested

Input. Dimension of the expected filter descriptor. It is also the minimum size of the array filterDimA in order to be able to hold the results.

dataType

Output. Data type.

format

Output. Type of format.

nbDims

Output. Actual dimension of the filter.

filterDimA

Output. Array of dimension of at least nbDimsRequested that will be filled with the filter parameters from the provided filter descriptor.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

The parameter nbDimsRequested is negative.

### 4.94. cudnnGetLRNDescriptor

```cudnnStatus_t cudnnGetLRNDescriptor(
cudnnLRNDescriptor_t    normDesc,
unsigned               *lrnN,
double                 *lrnAlpha,
double                 *lrnBeta,
double                 *lrnK)```

This function retrieves values stored in the previously initialized LRN descriptor object.

Parameters

normDesc

Input. Handle to a previously created LRN descriptor.

lrnN, lrnAlpha, lrnBeta, lrnK

Output. Pointers to receive values of parameters stored in the descriptor object. See cudnnSetLRNDescriptor for more details. Any of these pointers can be NULL (no value is returned for the corresponding parameter).

Possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

Function completed successfully.

### 4.95. cudnnGetOpTensorDescriptor

```cudnnStatus_t cudnnGetOpTensorDescriptor(
const cudnnOpTensorDescriptor_t opTensorDesc,
cudnnOpTensorOp_t               *opTensorOp,
cudnnDataType_t                 *opTensorCompType,
cudnnNanPropagation_t           *opTensorNanOpt)```

This function returns the configuration of the given tensor pointwise math descriptor.

Parameters

opTensorDesc

Input. Tensor pointwise math descriptor whose configuration is to be retrieved.

opTensorOp

Output. Pointer to the Tensor Pointwise math operation type, associated with this Tensor Pointwise math descriptor.

opTensorCompType

Output. Pointer to the cuDNN data-type associated with this Tensor Pointwise math descriptor.

opTensorNanOpt

Output. Pointer to the NaN propagation option associated with this Tensor Pointwise math descriptor.

Returns

CUDNN_STATUS_SUCCESS

The function returned successfully.

CUDNN_STATUS_BAD_PARAM

The tensor pointwise math descriptor passed as input is invalid.

### 4.96. cudnnGetPooling2dDescriptor

```cudnnStatus_t cudnnGetPooling2dDescriptor(
const cudnnPoolingDescriptor_t      poolingDesc,
cudnnPoolingMode_t                 *mode,
cudnnNanPropagation_t              *maxpoolingNanOpt,
int                                *windowHeight,
int                                *windowWidth,
int                                *verticalPadding,
int                                *horizontalPadding,
int                                *verticalStride,
int                                *horizontalStride)```

This function queries a previously created 2D pooling descriptor object.

Parameters

poolingDesc

Input. Handle to a previously created pooling descriptor.

mode

Output. Enumerant to specify the pooling mode.

maxpoolingNanOpt

Output. Enumerant to specify the NaN propagation mode.

windowHeight

Output. Height of the pooling window.

windowWidth

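Output. Width of the pooling window.

verticalPadding

Output. Size of the vertical padding.

horizontalPadding

Output. Size of the horizontal padding.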

verticalStride

Output. Pooling vertical stride.

horizontalStride

Output. Pooling horizontal stride.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The object was queried successfully.

### 4.97. cudnnGetPooling2dForwardOutputDim

```cudnnStatus_t cudnnGetPooling2dForwardOutputDim(
const cudnnPoolingDescriptor_t      poolingDesc,
const cudnnTensorDescriptor_t       inputDesc,
int                                *outN,
int                                *outC,
int                                *outH,
int                                *outW)```

This function provides the output dimensions of a tensor after 2D pooling has been applied.

Each dimension h and w of the output images is computed as follows:

```    outputDim = 1 + (inputDim + 2*padding - windowDim)/poolingStride;
```
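The formula above can be sketched in plain C, without linking against cuDNN. The helper name `pooling_output_dim` is hypothetical and is not part of the library; the sketch assumes non-negative sizes, so C integer division truncates the same way the formula intends.

```c
#include <assert.h>

/* Hypothetical helper, not a cuDNN function: applies
   outputDim = 1 + (inputDim + 2*padding - windowDim)/poolingStride
   to a single spatial dimension (h or w). */
int pooling_output_dim(int inputDim, int padding, int windowDim, int poolingStride)
{
    assert(poolingStride > 0);                      /* stride must be positive */
    assert(inputDim + 2 * padding >= windowDim);    /* window must fit the padded input */
    return 1 + (inputDim + 2 * padding - windowDim) / poolingStride;
}
```

For example, a 224-pixel dimension pooled with a window of 3, padding of 1, and stride of 2 gives `pooling_output_dim(224, 1, 3, 2) == 112`.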

Parameters

poolingDesc

Input. Handle to a previously initialized pooling descriptor.

inputDesc

Input. Handle to the previously initialized input tensor descriptor.

outN

Output. Number of images in the output.

outC

Output. Number of channels in the output.

outH

Output. Height of images in the output.

outW

Output. Width of images in the output.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• poolingDesc has not been initialized.
• poolingDesc or inputDesc has an invalid number of dimensions (2 and 4 respectively are required).

### 4.98. cudnnGetPoolingNdDescriptor

```cudnnStatus_t cudnnGetPoolingNdDescriptor(
const cudnnPoolingDescriptor_t      poolingDesc,
int                                 nbDimsRequested,
cudnnPoolingMode_t                 *mode,
cudnnNanPropagation_t              *maxpoolingNanOpt,
int                                *nbDims,
int                                 windowDimA[],
int                                 paddingA[],
int                                 strideA[])```

This function queries a previously initialized generic pooling descriptor object.

Parameters

poolingDesc

Input. Handle to a previously created pooling descriptor.

nbDimsRequested

Input. Dimension of the expected pooling descriptor. It is also the minimum size of the arrays windowDimA, paddingA and strideA in order to be able to hold the results.

mode

Output. Enumerant to specify the pooling mode.

maxpoolingNanOpt

Output. Enumerant to specify the NaN propagation mode.

nbDims

Output. Actual dimension of the pooling descriptor.

windowDimA

Output. Array of dimension of at least nbDimsRequested that will be filled with the window parameters from the provided pooling descriptor.

paddingA

Output. Array of dimension of at least nbDimsRequested that will be filled with the padding parameters from the provided pooling descriptor.

strideA

Output. Array of dimension at least nbDimsRequested that will be filled with the stride parameters from the provided pooling descriptor.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The object was queried successfully.

CUDNN_STATUS_NOT_SUPPORTED

The parameter nbDimsRequested is greater than CUDNN_DIM_MAX.

### 4.99. cudnnGetPoolingNdForwardOutputDim

```cudnnStatus_t cudnnGetPoolingNdForwardOutputDim(
const cudnnPoolingDescriptor_t  poolingDesc,
const cudnnTensorDescriptor_t   inputDesc,
int                             nbDims,
int                             outDimA[])```

This function provides the output dimensions of a tensor after Nd pooling has been applied.

Each dimension of the (nbDims-2)-D images of the output tensor is computed as follows:

```    outputDim = 1 + (inputDim + 2*padding - windowDim)/poolingStride;
```
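As a rough illustration, the same formula can be applied per spatial dimension to fill an array shaped like outDimA. The helper below is hypothetical and does not call cuDNN. It assumes the first two entries of the tensor dimensions are the batch size and channel count, that pooling leaves both unchanged, and that the window, padding, and stride arrays each hold nbDims-2 entries.

```c
#include <assert.h>

/* Hypothetical helper, not a cuDNN function. inDimA holds nbDims tensor
   dimensions (N, C, then nbDims-2 spatial sizes); windowDimA, paddingA and
   strideA each hold one entry per spatial dimension. */
void pooling_nd_output_dims(int nbDims, const int inDimA[],
                            const int windowDimA[], const int paddingA[],
                            const int strideA[], int outDimA[])
{
    assert(nbDims >= 3);
    outDimA[0] = inDimA[0]; /* assumption: batch size is unchanged by pooling */
    outDimA[1] = inDimA[1]; /* assumption: channel count is unchanged by pooling */
    for (int i = 0; i < nbDims - 2; ++i) {
        assert(strideA[i] > 0);
        outDimA[i + 2] =
            1 + (inDimA[i + 2] + 2 * paddingA[i] - windowDimA[i]) / strideA[i];
    }
}
```

With a 5-D input of dimensions {8, 16, 10, 32, 32}, a 2x2x2 window, no padding, and stride 2 in every spatial dimension, the sketch produces {8, 16, 5, 16, 16}.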

Parameters

poolingDesc

Input. Handle to a previously initialized pooling descriptor.

inputDesc

Input. Handle to the previously initialized input tensor descriptor.

nbDims

Input. Number of dimensions in which pooling is to be applied.

outDimA

Output. Array of nbDims output dimensions.

The possible error values returned by this function and their meanings are listed below.

Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions is met:

• poolingDesc has not been initialized.
• The value of nbDims is inconsistent with the dimensionality of poolingDesc and inputDesc.

### 4.100. cudnnGetProperty

```cudnnStatus_t cudnnGetProperty(
libraryPropertyType     type,
int                    *value)```

This function writes a specific part of the cuDNN library version number into the provided host storage.

Parameters

type

Input. Enumerated type that instructs the function to report the numerical value of the cuDNN major version, minor version, or the patch level.

value

Output. Host pointer where the version information should be written.

Returns

CUDNN_STATUS_INVALID_VALUE

Invalid value of the type argument.

CUDNN_STATUS_SUCCESS

Version information was stored successfully at the provided address.
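The calling pattern can be sketched with a stand-in that does not link against cuDNN. Everything named `mock_*` below is hypothetical, and the values 7, 3, and 1 are hard-coded for illustration. The combined encoding major*1000 + minor*100 + patch is an assumption about how the three parts are usually folded into a single number for this release series; check it against the installed cudnn.h before relying on it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins that mimic the calling pattern of
   cudnnGetProperty(); none of these names belong to cuDNN. */
typedef enum { MOCK_MAJOR_VERSION, MOCK_MINOR_VERSION, MOCK_PATCH_LEVEL } mock_property_t;

/* Returns 0 on success and -1 for an unrecognized type or a NULL pointer.
   The reported values are hard-coded for illustration. */
int mock_get_property(mock_property_t type, int *value)
{
    if (value == NULL) return -1;
    switch (type) {
    case MOCK_MAJOR_VERSION: *value = 7; return 0;
    case MOCK_MINOR_VERSION: *value = 3; return 0;
    case MOCK_PATCH_LEVEL:   *value = 1; return 0;
    default:                 return -1;
    }
}

/* Queries the three parts one at a time and folds them into one integer
   (assumed encoding: major*1000 + minor*100 + patch). */
int combined_version(void)
{
    int major = 0, minor = 0, patch = 0;
    int rc = mock_get_property(MOCK_MAJOR_VERSION, &major)
           | mock_get_property(MOCK_MINOR_VERSION, &minor)
           | mock_get_property(MOCK_PATCH_LEVEL, &patch);
    assert(rc == 0);
    return major * 1000 + minor * 100 + patch;
}
```

In real code each `mock_get_property` call would be replaced by a call to cudnnGetProperty() with the corresponding libraryPropertyType value, and the returned status would be compared against CUDNN_STATUS_SUCCESS.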

### 4.101. cudnnGetRNNDataDescriptor

```cudnnStatus_t cudnnGetRNNDataDescriptor(
cudnnRNNDataDescriptor_t       RNNDataDesc,
cudnnDataType_t                *dataType,
cudnnRNNDataLayout_t           *layout,
int                            *maxSeqLength,
int                            *batchSize,
int                            *vectorSize,
int                            arrayLengthRequested,
int                            seqLengthArray[],
void                           *paddingFill)```

This function retrieves a previously created RNN data descriptor object.

Parameters

RNNDataDesc

Input. A previously created and initialized RNN data descriptor.

dataType

Output. Pointer to the host memory location to store the datatype of the RNN data tensor.

layout

Output. Pointer to the host memory location to store the memory layout of the RNN data tensor.

maxSeqLength

Output. The maximum sequence length within this RNN data tensor, including the padding vectors.

batchSize

Output. The number of sequences within the mini-batch.

vectorSize

Output. The vector length (i.e. embedding size) of the input or output tensor at each timestep.

arrayLengthRequested

Input. The number of elements that the user requested for seqLengthArray.

seqLengthArray

Output. Pointer to the host memory location to store the integer array describing the length (i.e. number of timesteps) of each sequence. This is allowed to be a NULL pointer if arrayLengthRequested is zero.

paddingFill

Output. Pointer to the host memory location to store the user-defined symbol. The symbol should be interpreted as the same data type as the RNN data tensor.

Returns

CUDNN_STATUS_SUCCESS

The parameters are fetched successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the following has occurred:

• Any of RNNDataDesc, dataType, layout, maxSeqLength, batchSize, vectorSize, paddingFill is NULL.
• seqLengthArray is NULL while arrayLengthRequested is greater than zero.
• arrayLengthRequested is less than zero.
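Because seqLengthArray may be NULL when arrayLengthRequested is zero, one plausible usage is a two-step query: first ask only for the batch size, then allocate a buffer and fetch the lengths. The sketch below uses a mock in place of cuDNN; `mock_rnn_data_t`, `mock_get_seq_lengths`, and `fetch_seq_lengths` are hypothetical names, and the mock's behavior of writing at most arrayLengthRequested entries is an assumption rather than documented library behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an RNN data descriptor; not a cuDNN type. */
typedef struct {
    int batchSize;
    const int *seqLengths; /* batchSize entries */
} mock_rnn_data_t;

/* Mock query that follows the argument rules listed above. Returns 0 on
   success and -1 where the arguments would be rejected.
   Assumption: at most arrayLengthRequested lengths are written. */
int mock_get_seq_lengths(const mock_rnn_data_t *desc, int *batchSize,
                         int arrayLengthRequested, int seqLengthArray[])
{
    if (desc == NULL || batchSize == NULL) return -1;
    if (arrayLengthRequested < 0) return -1;
    if (seqLengthArray == NULL && arrayLengthRequested > 0) return -1;
    *batchSize = desc->batchSize;
    for (int i = 0; i < arrayLengthRequested && i < desc->batchSize; ++i)
        seqLengthArray[i] = desc->seqLengths[i];
    return 0;
}

/* Two-step pattern: query the batch size with a zero-length request, then
   allocate and fetch the lengths. The caller must free() the result. */
int *fetch_seq_lengths(const mock_rnn_data_t *desc, int *batchSize)
{
    if (mock_get_seq_lengths(desc, batchSize, 0, NULL) != 0) return NULL;
    assert(*batchSize >= 0);
    int *lengths = malloc((size_t)*batchSize * sizeof *lengths);
    if (lengths == NULL) return NULL;
    if (mock_get_seq_lengths(desc, batchSize, *batchSize, lengths) != 0) {
        free(lengths);
        return NULL;
    }
    return lengths;
}
```

The first call passes NULL with a zero-length request, which the rules above allow; passing NULL with a positive arrayLengthRequested is rejected.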

### 4.102. cudnnGetRNNDescriptor

```cudnnStatus_t cudnnGetRNNDescriptor(
cudnnHandle_t               handle,
cudnnRNNDescriptor_t        rnnDesc,
int *                       hiddenSize,
int *                       numLayers,
cudnnDropoutDescriptor_t *  dropoutDesc,
cudnnRNNInputMode_t *       inputMode,
cudnnDirectionMode_t *      direction,
cudnnRNNMode_t *            mode,
cudnnRNNAlgo_t *            algo,
cudnnDataType_t *           dataType)```

This function retrieves RNN network parameters that were configured by cudnnSetRNNDescriptor(). All pointers passed to the function must be non-NULL; otherwise CUDNN_STATUS_BAD_PARAM is returned. The function does not check the validity of the retrieved network parameters, because they are verified when they are written to the RNN descriptor.

Parameters

handle

Input. Handle to a previously created cuDNN library descriptor.

rnnDesc

Input. A previously created and initialized RNN descriptor.

hiddenSize

Output. Pointer where the size of the hidden state should be stored (the same value is used in every layer).

numLayers

Output. Pointer where the number of RNN layers should be stored.

dropoutDesc

Output. Pointer where the handle to a previously configured dropout descriptor should be stored.

inputMode

Output. Pointer where the mode of the first RNN layer should be saved.

direction

Output. Pointer where RNN uni-directional/bi-directional mode should be saved.

mode

Output. Pointer where RNN cell type should be saved.

algo

Output. Pointer where RNN algorithm type should be stored.

dataType

Output. Pointer where the data type of RNN weights/biases should be stored.

Returns

CUDNN_STATUS_SUCCESS

The RNN parameters were retrieved successfully from the RNN descriptor.

CUDNN_STATUS_BAD_PARAM

At least one pointer passed to the cudnnGetRNNDescriptor() function is NULL.

### 4.103. cudnnGetRNNLinLayerBiasParams

```cudnnStatus_t cudnnGetRNNLinLayerBiasParams(
cudnnHandle_t                   handle,
const cudnnRNNDescriptor_t      rnnDesc,
const int                       pseudoLayer,
const cudnnTensorDescriptor_t   xDesc,
const cudnnFilterDescriptor_t   wDesc,
const void                     *w,
const int                       linLayerID,
cudnnFilterDescriptor_t         linLayerBiasDesc,
void                           **linLayerBias)```

This function is used to obtain a pointer and a descriptor of every RNN bias column vector in each pseudo-layer within the recurrent network defined by rnnDesc and its input width specified in xDesc.
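The pseudo-layer numbering used here (defined under the pseudoLayer parameter below) can be sketched as a small mapping from a pseudo-layer index to a "physical" layer and a direction. The type and function names are hypothetical and are not part of cuDNN; the arithmetic simply restates the numbering rule from the parameter description.

```c
#include <assert.h>

/* Hypothetical helper type; not part of the cuDNN API. */
typedef struct {
    int physicalLayer; /* 0 = RNN input layer, 1 = first hidden layer, ... */
    int isBackward;    /* 0 = forward part, 1 = backward part */
} layer_position_t;

/* Maps a pseudo-layer index to its physical layer and direction.
   Uni-directional: pseudo-layer == physical layer.
   Bi-directional: two pseudo-layers per physical layer, forward first. */
layer_position_t pseudo_layer_position(int pseudoLayer, int bidirectional)
{
    assert(pseudoLayer >= 0);
    layer_position_t pos;
    if (bidirectional) {
        pos.physicalLayer = pseudoLayer / 2;
        pos.isBackward = pseudoLayer % 2;
    } else {
        pos.physicalLayer = pseudoLayer;
        pos.isBackward = 0;
    }
    return pos;
}
```

For a bi-directional RNN, `pseudo_layer_position(2, 1)` yields physical layer 1 (the first hidden layer) in the forward direction, and `pseudo_layer_position(1, 1)` yields the backward part of the input layer.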

Note: The cudnnGetRNNLinLayerBiasParams() function was changed in cuDNN version 7.1.1 to match the behavior of cudnnGetRNNLinLayerMatrixParams().

The cudnnGetRNNLinLayerBiasParams() function returns the RNN bias vector size in two dimensions: rows and columns. Due to historical reasons, the minimum number of dimensions in the filter descriptor is three. In previous versions of the cuDNN library, the function returned the total number of vector elements in linLayerBiasDesc as follows: filterDimA[0]=total_size, filterDimA[1]=1, filterDimA[2]=1 (see the description of the cudnnGetFilterNdDescriptor() function). In v7.1.1, the format was changed to: filterDimA[0]=1, filterDimA[1]=rows, filterDimA[2]=1 (number of columns). In both cases, the "format" field of the filter descriptor should be ignored when retrieved by cudnnGetFilterNdDescriptor(). Note that the RNN implementation in cuDNN uses two bias vectors before the cell non-linear function (see equations in Chapter 3 describing the cudnnRNNMode_t enumerated type).
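Code that only needs the number of bias elements can avoid depending on which of the two layouts is in use by multiplying the three entries of filterDimA, since both layouts give the same product. The helper below is a hypothetical sketch in plain C; it assumes filterDimA has already been filled, for example by cudnnGetFilterNdDescriptor() with nbDimsRequested set to 3.

```c
#include <assert.h>

/* Hypothetical helper, not a cuDNN function: returns the number of elements
   in a bias vector from a three-entry filterDimA. The product is the same
   for the older layout {total_size, 1, 1} and the v7.1.1 layout {1, rows, 1}. */
int bias_element_count(const int filterDimA[3])
{
    assert(filterDimA[0] > 0 && filterDimA[1] > 0 && filterDimA[2] > 0);
    return filterDimA[0] * filterDimA[1] * filterDimA[2];
}
```

For a bias vector with 512 entries, both `{512, 1, 1}` and `{1, 512, 1}` give 512.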

Parameters

handle

Input. Handle to a previously created cuDNN library descriptor.

rnnDesc

Input. A previously initialized RNN descriptor.

pseudoLayer

Input. The pseudo-layer to query. In uni-directional RNNs, a pseudo-layer is the same as a "physical" layer (pseudoLayer=0 is the RNN input layer, pseudoLayer=1 is the first hidden layer). In bi-directional RNNs, there are twice as many pseudo-layers as "physical" layers (pseudoLayer=0 and pseudoLayer=1 are both input layers; pseudoLayer=0 refers to the forward part and pseudoLayer=1 refers to the backward part of the "physical" input layer; pseudoLayer=2 is the forward part of the first hidden layer, and so on).

xDesc

Input. A fully packed tensor descriptor describing the input to one recurrent iteration (to retrieve the RNN input width).

wDesc

Input. Handle to a previously initialized filter descriptor describing the weights for the RNN.

w

Input. Data pointer to GPU memory associated with the filter