Abstract
This cuDNN Developer Guide provides an overview of cuDNN v7.6.4, and details about the types, enums, and routines within the cuDNN library API.
For previously released cuDNN developer documentation, see cuDNN Archives.
NVIDIA® cuDNN is a GPU-accelerated library of primitives for deep neural networks. It provides highly tuned implementations of routines arising frequently in DNN applications:
- Convolution forward and backward, including cross-correlation
- Pooling forward and backward
- Softmax forward and backward
- Neuron activations forward and backward:
- Rectified linear (ReLU)
- Sigmoid
- Hyperbolic tangent (TANH)
- Tensor transformation functions
- LRN, LCN and batch normalization forward and backward
cuDNN's convolution routines aim for a performance that is competitive with the fastest GEMM (matrix multiply)-based implementations of such routines, while using significantly less memory.
cuDNN features include customizable data layouts, supporting flexible dimension ordering, striding, and subregions for the 4D tensors used as inputs and outputs to all of its routines. This flexibility allows easy integration into any neural network implementation, and avoids the input/output transposition steps sometimes necessary with GEMM-based convolutions.
cuDNN offers a context-based API that allows for easy multithreading and (optional) interoperability with CUDA streams.
Basic concepts are described in this section.
2.1. Programming Model
The cuDNN Library exposes a Host API but assumes that for operations using the GPU, the necessary data is directly accessible from the device.
An application using cuDNN must initialize a handle to the library context by calling cudnnCreate()
. This handle is explicitly passed to every subsequent library function that operates on GPU data. Once the application finishes using cuDNN, it can release the resources associated with the library handle using cudnnDestroy()
. This approach allows the user to explicitly control the library's functioning when using multiple host threads, GPUs and CUDA Streams.
For example, an application can use cudaSetDevice()
to associate different devices with different host threads, and in each of those host threads, use a unique cuDNN handle that directs the library calls to the device associated with it. Thus the cuDNN library calls made with different handles will automatically run on different devices.
The device associated with a particular cuDNN context is assumed to remain unchanged between the corresponding cudnnCreate()
and cudnnDestroy()
calls. In order for the cuDNN library to use a different device within the same host thread, the application must set the new device to be used by calling cudaSetDevice()
and then create another cuDNN context, which will be associated with the new device, by calling cudnnCreate()
.
cuDNN API Compatibility
Beginning in cuDNN 7, the binary compatibility of patch and minor releases is maintained as follows:
- Any patch release x.y.z is forward- or backward-compatible with applications built against another cuDNN patch release x.y.w (i.e., of the same major and minor version number, but having w!=z)
- cuDNN minor releases beginning with cuDNN 7 are binary backward-compatible with applications built against the same or earlier patch release (i.e., an app built against cuDNN 7.x is binary compatible with cuDNN library 7.y, where y>=x)
- Applications compiled with a cuDNN version 7.y are not guaranteed to work with 7.x release when y > x.
2.2. Convolution Formulas
This section describes the various convolution formulas implemented in cuDNN convolution functions.
The convolution terms described in the table below apply to all the convolution formulas that follow.
Term | Description |
---|---|
Input (image) Tensor | |
Weight Tensor | |
Output Tensor | |
Current Batch Size | |
Current Input Channel | |
Total Input Channels | |
Input Image Height | |
Input Image Width | |
Current Output Channel | |
Total Output Channels | |
Current Output Height Position | |
Current Output Width Position | |
Group Count | |
Padding Value | |
Vertical Subsample Stride (along Height) | |
Horizontal Subsample Stride (along Width) | |
Vertical Dilation (along Height) | |
Horizontal Dilation (along Width) | |
Current Filter Height | |
Total Filter Height | |
Current Filter Width | |
Total Filter Width | |
Normal Convolution (using cross-correlation mode)
Convolution with Padding
Convolution with Subsample-Striding
Convolution with Dilation
Convolution using Convolution Mode
Convolution using Grouped Convolution
2.3. Notation
As of CUDNN v4 we have adopted a mathematicaly-inspired notation for layer inputs and outputs using x,y,dx,dy,b,w
for common layer parameters. This was done to improve the readability and ease of understanding of the meaning of the parameters. All layers now follow a uniform convention as below:
During Inference:
y = layerFunction(x, otherParams)
.
During backpropagation:
(dx, dOtherParams) = layerFunctionGradient(x,y,dy,otherParams)
For convolution the notation is
y = x*w+b
where w
is the matrix of filter weights, x
is the previous layer's data (during inference), y
is the next layer's data, b
is the bias and *
is the convolution operator.
In backpropagation routines the parameters keep their meanings.
The parameters dx,dy,dw,db
always refer to the gradient of the final network error function with respect to a given parameter. So dy
in all backpropagation routines always refers to error gradient backpropagated through the network computation graph so far. Similarly other parameters in more specialized layers, such as, for instance, dMeans
or dBnBias
refer to gradients of the loss function wrt those parameters.
w
is used in the API for both the width of the x
tensor and convolution filter matrix. To resolve this ambiguity we use w
and filter
notation interchangeably for convolution filter weight matrix. The meaning is clear from the context since the layer width is always referenced near its height.
2.4. Tensor Descriptor
The cuDNN Library describes data holding images, videos and any other data with contents with a generic n-D tensor defined with the following parameters :
- a dimension
nbDims
from 3 to 8 - a data type (32-bit floating point, 64 bit-floating point, 16 bit floating point...)
dimA
integer array defining the size of each dimensionstrideA
integer array defining the stride of each dimension (e.g the number of elements to add to reach the next element from the same dimension)
The first dimension of the tensor defines the batch size n
, and the second dimension defines the number of features maps c
. This tensor definition allows for example to have some dimensions overlapping each others within the same tensor by having the stride of one dimension smaller than the product of the dimension and the stride of the next dimension. In cuDNN, unless specified otherwise, all routines will support tensors with overlapping dimensions for forward pass input tensors, however, dimensions of the output tensors cannot overlap. Even though this tensor format supports negative strides (which can be useful for data mirroring), cuDNN routines do not support tensors with negative strides unless specified otherwise.
2.4.1. WXYZ Tensor Descriptor
Tensor descriptor formats are identified using acronyms, with each letter referencing a corresponding dimension. In this document, the usage of this terminology implies :
- all the strides are strictly positive
- the dimensions referenced by the letters are sorted in decreasing order of their respective strides
2.4.2. 4-D Tensor Descriptor
A 4-D Tensor descriptor is used to define the format for batches of 2D images with 4 letters : N,C,H,W for respectively the batch size, the number of feature maps, the height and the width. The letters are sorted in decreasing order of the strides. The commonly used 4-D tensor formats are :
- NCHW
- NHWC
- CHWN
2.4.3. 5-D Tensor Description
A 5-D Tensor descriptor is used to define the format of batch of 3D images with 5 letters : N,C,D,H,W for respectively the batch size, the number of feature maps, the depth, the height and the width. The letters are sorted in descreasing order of the strides. The commonly used 5-D tensor formats are called :
- NCDHW
- NDHWC
- CDHWN
2.4.4. Fully-packed tensors
A tensor is defined as XYZ-fully-packed
if and only if :
- the number of tensor dimensions is equal to the number of letters preceding the
fully-packed
suffix. - the stride of the i-th dimension is equal to the product of the (i+1)-th dimension by the (i+1)-th stride.
- the stride of the last dimension is 1.
2.4.5. Partially-packed tensors
The partially 'XYZ-packed' terminology only applies in a context of a tensor format described with a superset of the letters used to define a partially-packed tensor. A WXYZ tensor is defined as XYZ-packed
if and only if :
- the strides of all dimensions NOT referenced in the -packed suffix are greater or equal to the product of the next dimension by the next stride.
- the stride of each dimension referenced in the -packed suffix in position i is equal to the product of the (i+1)-st dimension by the (i+1)-st stride.
- if last tensor's dimension is present in the -packed suffix, its stride is 1.
For example a NHWC tensor WC-packed means that the c_stride is equal to 1 and w_stride is equal to c_dim x c_stride. In practice, the -packed suffix is usually with slowest changing dimensions of a tensor but it is also possible to refer to a NCHW tensor that is only N-packed.
2.4.6. Spatially packed tensors
Spatially-packed tensors are defined as partially-packed in spatial dimensions.
For example a spatially-packed 4D tensor would mean that the tensor is either NCHW HW-packed or CNHW HW-packed.
2.4.7. Overlapping tensors
A tensor is defined to be overlapping if a iterating over a full range of dimensions produces the same address more than once.
In practice an overlapped tensor will have stride[i-1] < stride[i]*dim[i] for some of the i from [1,nbDims] interval.
2.5. Data Layout Formats
This section describes how cuDNN Tensors are arranged in memory. See cudnnTensorFormat_t for enumerated Tensor format types.
2.5.1. Example
Consider a batch of images in 4D with the following dimensions:
- N, the batch size, is 1
- C, the number of feature maps (i.e., number of channels), is 64
- H, the image height, is 5, and
- W, the image width, is 4
To keep the example simple, the image pixel elements are expressed as a sequence of integers, 0, 1, 2, 3, and so on. See Figure 1.
Figure 1. Example with N=1, C=64, H=5, W=4.
2.5.2. NCHW Memory Layout
The above 4D Tensor is laid out in the memory in the NCHW format as below:
- Beginning with the first channel (c=0), the elements are arranged contiguously in row-major order.
- Continue with second and subsequent channels until the elements of all the channels are laid out.
See Figure 2.
- Proceed to the next batch (if N is > 1).
Figure 2. NCHW Memory Layout
2.5.3. NHWC Memory Layout
For the NHWC memory layout, the corresponding elements in all the C channels are laid out first, as below:
- Begin with the first element of channel 0, then proceed to the first element of channel 1, and so on, until the first elements of all the C channels are laid out.
- Next, select the second element of channel 0, then proceed to the second element of channel 1, and so on, until the second element of all the channels are laid out.
- Follow the row-major order in channel 0 and complete all the elements. See Figure 3.
- Proceed to the next batch (if N is > 1).
Figure 3. NHWC Memory Layout
2.5.4. NC/32HW32 Memory Layout
The NC/32HW32 is similar to NHWC, with a key difference. For the NC/32HW32 memory layout, the 64 channels are grouped into two groups of 32 channels each—first group consisting of channels c0 through c31, and the second group consisting of channels c32 through c63. Then each group is laid out using the NHWC format. See Figure 4.
Figure 4. NC/32HW32 Memory Layout
For the generalized NC/xHWx layout format, the following observations apply:
-
Only the channel dimension, C, is grouped into x channels each.
-
When x = 1, each group has only one channel. Hence, the elements of one channel (i.e, one group) are arranged contiguously (in the row-major order), before proceeding to the next group (i.e., next channel). This is the same as NCHW format.
-
When x = C, then NC/xHWx is identical to NHWC, i.e., the entire channel depth C is considered as a single group. The case x = C can be thought of as vectorizing entire C dimension as one big vector, laying out all the Cs, followed by the remaining dimensions, just like NHWC.
-
The tensor format CUDNN_TENSOR_NCHW_VECT_C can also be interpreted in the following way: The NCHW INT8x32 format is really N x (C/32) x H x W x 32 (32 Cs for every W), just as the NCHW INT8x4 format is N x (C/4) x H x W x 4 (4 Cs for every W). Hence the "VECT_C" name - each W is a vector (4 or 32) of Cs.
2.6. Thread Safety
The library is thread safe and its functions can be called from multiple host threads, as long as threads to do not share the same cuDNN handle simultaneously.
2.7. Reproducibility (determinism)
By design, most of cuDNN's routines from a given version generate the same bit-wise results across runs when executed on GPUs with the same architecture and the same number of SMs. However, bit-wise reproducibility is not guaranteed across versions, as the implementation of a given routine may change. With the current release, the following routines do not guarantee reproducibility because they use atomic operations:
cudnnConvolutionBackwardFilter
whenCUDNN_CONVOLUTION_BWD_FILTER_ALGO_0
orCUDNN_CONVOLUTION_BWD_FILTER_ALGO_3
is usedcudnnConvolutionBackwardData
whenCUDNN_CONVOLUTION_BWD_DATA_ALGO_0
is usedcudnnPoolingBackward
whenCUDNN_POOLING_MAX
is usedcudnnSpatialTfSamplerBackward
2.8. Scaling Parameters
Many cuDNN routines like cudnnConvolutionForward accept pointers in host memory to scaling factors alpha
and beta
. These scaling factors are used to blend the computed values with the prior values in the destination tensor as follows (see Figure 5):
dstValue = alpha*computedValue + beta*priorDstValue.
The dstValue
is written to after being read.
Figure 5. Scaling Parameters for Convolution
When beta
is zero, the output is not read and may contain uninitialized data (including NaN).
These parameters are passed using a host memory pointer. The storage data types for alpha
and beta
are:
-
float
for HALF and FLOAT tensors, and -
double
for DOUBLE tensors.
For improved performance use beta
= 0.0. Use a non-zero value for beta only when you need to blend the current output tensor values with the prior values of the output tensor.
Type Conversion
When the data input x
, the filter input w
and the output y
are all in INT8 data type, the function cudnnConvolutionBiasActivationForward()
will perform the type conversion as shown in Figure 6:
Accumulators are 32-bit integers which wrap on overflow.
Figure 6. INT8 for cudnnConvolutionBiasActivationForward
2.9. Tensor Core Operations
The cuDNN v7 library introduced the acceleration of compute-intensive routines using Tensor Core hardware on supported GPU SM versions. Tensor core operations are supported on the Volta and Turing GPU families.
2.9.1. Basics
Tensor core operations perform parallel floating point accumulation of multiple floating point product terms. Setting the math mode to CUDNN_TENSOR_OP_MATH via the cudnnMathType_t enumerator indicates that the library will use Tensor Core operations. This enumerator specifies the available options to enable the Tensor Core, and should be applied on a per-routine basis.
The default math mode is CUDNN_DEFAULT_MATH, which indicates that the Tensor Core operations will be avoided by the library. Because the CUDNN_TENSOR_OP_MATH mode uses the Tensor Cores, it is possible that these two modes generate slightly different numerical results due to different sequencing of the floating point operations.
For example, the result of multiplying two matrices using Tensor Core operations is very close to, but not always identical, the result achieved using a sequence of scalar floating point operations. For this reason, the cuDNN library requires an explicit user opt-in before enabling the use of Tensor Core operations.
However, experiments with training common deep learning models show negligible differences between using Tensor Core operations and scalar floating point paths, as measured by both the final network accuracy and the iteration count to convergence. Consequently, the cuDNN library treats both modes of operation as functionally indistinguishable, and allows for the scalar paths to serve as legitimate fallbacks for cases in which the use of Tensor Core operations is unsuitable.
Kernels using Tensor Core operations are available for both convolutions and RNNs.
See also Training with Mixed Precision.
2.9.2. Convolution Functions
2.9.2.1. Prerequisite
For the supported GPUs, the Tensor Core operations will be triggered for convolution functions only when cudnnSetConvolutionMathType is called on the appropriate convolution descriptor by setting the mathType
to CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION.
2.9.2.2. Supported Algorithms
When the prerequisite is met, the below convolution functions can be run as Tensor Core operations:
See the table below for supported algorithms:
Supported Convolution Function | Supported Algos |
cudnnConvolutionForward | -CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM, -CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED |
cudnnConvolutionBackwardData | -CUDNN_CONVOLUTION_BWD_DATA_ALGO_1, -CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD_NONFUSED |
cudnnConvolutionBackwardFilter | -CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1, -CUDNN_CONVOLUTION_BWD_FILTER_ALGO_WINOGRAD_NONFUSED |
2.9.2.3. Data and Filter Formats
The cuDNN library may use padding, folding, and NCHW-to-NHWC transformations to call the Tensor Core operations. See Tensor Transformations.
For algorithms other than *_ALGO_WINOGRAD_NONFUSED, when the following requirements are met, the cuDNN library will trigger the Tensor Core operations:
2.9.3. RNN Functions
2.9.3.1. Prerequisite
Tensor core operations will be triggered for these RNN functions only when cudnnSetRNNMatrixMathType is called on the appropriate RNN descriptor setting mathType
to CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION.
2.9.3.2. Supported Algorithms
When the above prerequisite is met, the RNN functions below can be run as Tensor Core operations:
See the table below for the supported algorithms:
RNN Function | Support Algos |
All RNN functions that support Tensor Core operations | -CUDNN_RNN_ALGO_STANDARD -CUDNN_RNN_ALGO_PERSIST_STATIC (new for cuDNN 7.1) |
2.9.3.3. Data and Filter Formats
When the following requirements are met, then the cuDNN library will trigger the Tensor Core operations:
See also Features of RNN Functions.
2.9.4. Tensor Transformations
A few functions in the cuDNN library will perform transformations such as folding, padding, and NCHW-to-NHWC conversion while performing the actual function operation. See below.
2.9.4.1. FP16 Data
Tensor Cores operate on FP16 input data with FP32 accumulation. The FP16 multiply leads to a full-precision result that is accumulated in FP32 operations with the other products in a given dot product for a matrix with m x n x k
dimensions. See Figure 7.
Figure 7. Tensor Operation with FP16 Inputs
2.9.4.2. FP32-to-FP16 Conversion
The cuDNN API for allows the user to specify that FP32 input data may be copied and converted to FP16 data internally to use Tensor Core Operations for potentially improved performance. This can be achieved by selecting CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION enum for cudnnMathType_t. In this mode, the FP32 Tensors are internally down-converted to FP16, the Tensor Op math is performed, and finally up-converted to FP32 as outputs. See Figure 8.
Figure 8. Tensor Operation with FP32 Inputs
For Convolutions:
For convolutions, the FP32-to-FP16 conversion can be achieved by passing the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION enum value to the cudnnSetConvolutionMathType() call. See the below code snippet:
// Set the math type to allow cuDNN to use Tensor Cores:
checkCudnnErr(cudnnSetConvolutionMathType(cudnnConvDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION));
For RNNs:
For RNNs, the FP32-to-FP16 conversion can be achieved by passing the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION enum value to the cudnnSetRNNMatrixMathType() call to allow FP32 data to be converted for use in RNNs. See the below code snippet example:
// Set the math type to allow cuDNN to use Tensor Cores:
checkCudnnErr(cudnnSetRNNMatrixMathType(cudnnRnnDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION));
2.9.4.3. Padding
For packed NCHW data, when the channel dimension is not a multiple of 8, then the cuDNN library will pad the tensors as needed to enable Tensor Core operations. This padding is automatic for packed NCHW data in both the CUDNN_TENSOR_OP_MATH and the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION cases.
The padding occurs with a negligible loss of performance. Hence, the NCHW Tensor dimensions such as below are allowed:
// Set NCHW Tensor dimensions, not necessarily as multiples of eight (only the input tensor is shown here):
int dimA[] = {1, 7, 32, 32};
int strideA[] = {7168, 1024, 32, 1};
2.9.4.4. Folding
In the folding operation the cuDNN library implicitly performs the formatting of input tensors and saves the input tensors in an internal workspace. This can lead to an acceleration of the call to Tensor Cores.
Folding enables the input Tensors to be transformed to a format that the Tensor Cores support (i.e., no strides).
2.9.4.5. Conversion Between NCHW and NHWC
Tensor Cores require that the Tensors be in NHWC data layout. Conversion between NCHW and NHWC is performed when the user requests Tensor Op math. However, as stated in Basics, a request to use Tensor Cores is just that, a request, and Tensor Cores may not be used in some cases. The cuDNN library converts between NCHW and NHWC if and only if Tensor Cores are requested and are actually used.
If your input (and output) are NCHW, then expect a layout change. See also for packed NCHW data.
Non-Tensor Op convolutions will not perform conversions between NCHW and NHWC.
In very rare, and difficult-to-qualify, cases that are a complex function of padding and filter sizes, it is possible that Tensor Ops are not enabled. In such cases, users should pre-pad.
2.9.5. Guidelines for a Deep Learning Compiler
For a deep learning compiler, the following are the key guidelines:
2.10. GPU and driver requirements
cuDNN v7.0 supports NVIDIA GPUs of compute capability 3.0 and higher. For x86_64 platform, cuDNN v7.0 comes with two deliverables: one requires a NVIDIA Driver compatible with CUDA Toolkit 8.0, the other requires a NVIDIA Driver compatible with CUDA Toolkit 9.0.
If you are using cuDNN with a Volta GPU, version 7 or later is required.
2.11. Backward compatibility and deprecation policy
When changing the API of an existing cuDNN function "foo" (usually to support some new functionality), first, a new routine "foo_v<n>
" is created where n
represents the cuDNN version where the new API is first introduced, leaving "foo" untouched. This ensures backward compatibility with the version n-1
of cuDNN. At this point, "foo" is considered deprecated, and should be treated as such by users of cuDNN. We gradually eliminate deprecated and suffixed API entries over the course of a few releases of the library per the following policy:
As a rule of thumb, when a routine appears in two forms, one with a suffix and one with no suffix, the non-suffixed entry is to be treated as deprecated. In this case, it is strongly advised that users migrate to the new suffixed API entry to guarantee backwards compatibility in the following cuDNN release. When a routine appears with multiple suffixes, the unsuffixed API entry is mapped to the higher numbered suffix. In that case it is strongly advised to use the non-suffixed API entry to guarantee backward compatibiliy with the following cuDNN release.
2.12. Grouped Convolutions
cuDNN supports grouped convolutions by setting groupCount > 1 for the convolution descriptor convDesc
, using cudnnSetConvolutionGroupCount()
.
By default the convolution descriptor convDesc
is set to groupCount of 1.
Basic Idea
Conceptually, in grouped convolutions the input channels and the filter channels are split into groupCount number of independent groups, with each group having a reduced number of channels. Convolution operation is then performed separately on these input and filter groups.
For example, consider the following: if the number of input channels is 4, and the number of filter channels of 12. For a normal, ungrouped convolution, the number of computation operations performed are 12*4.
If the groupCount is set to 2, then there are now two input channel groups of two input channels each, and two filter channel groups of six filter channels each.
As a result, each grouped convolution will now perform 2*6 computation operations, and two such grouped convolutions are performed. Hence the computation savings are 2x: (12*4)/(2*(2*6))
See Convolution Formulas for the math behind the cuDNN Grouped Convolution.
Example
Below is an example showing the dimensions and strides for grouped convolutions for NCHW format, for 2D convolution.
Note that the symbols "*" and "/" are used to indicate multiplication and division.
- Group Count:
groupCount
2.13. API Logging
cuDNN API logging is a tool that records all input parameters passed into every cuDNN API function call. This functionality is disabled by default, and can be enabled through methods described in this section.
The log output contains variable names, data types, parameter values, device pointers, process ID, thread ID, cuDNN handle, cuda stream ID, and metadata such as time of the function call in microseconds.
When logging is enabled, the log output will be handled by the built-in default callback function. The user may also write their own callback function, and use the cudnnSetCallback
to pass in the function pointer of their own callback function. The following is a sample output of the API log.
Function cudnnSetActivationDescriptor() called:
mode: type=cudnnActivationMode_t; val=CUDNN_ACTIVATION_RELU (1);
reluNanOpt: type=cudnnNanPropagation_t; val=CUDNN_NOT_PROPAGATE_NAN (0);
coef: type=double; val=1000.000000;
Time: 2017-11-21T14:14:21.366171 (0d+0h+1m+5s since start)
Process: 21264, Thread: 21264, cudnn_handle: NULL, cudnn_stream: NULL.
There are two methods to enable API logging.
Method 1: Using Environment Variables
To enable API logging using environment variables, follow these steps:
See also Table 1 for the impact on performance of API logging using environment variables.
Environment variables | CUDNN_LOGINFO_DBG=0 | CUDNN_LOGINFO_DBG=1 |
---|---|---|
CUDNN_LOGDEST_DBG not set |
- No logging output - No performance loss |
- No logging output - No performance loss |
CUDNN_LOGDEST_DBG= |
- No logging output - No performance loss |
- No logging output - No performance loss |
CUDNN_LOGDEST_DBG= |
- No logging output - No performance loss |
- Logging to - Some performance loss |
CUDNN_LOGDEST_DBG=
|
- No logging output - No performance loss |
- Logging to - Some performance loss |
Method 2
Method 2: To use API function calls to enable API logging, refer to the API description of cudnnSetCallback()
and cudnnGetCallback()
.
2.14. Features of RNN Functions
See the table below for a list of features supported by each RNN function:
For each of these terms, the short-form versions shown in the paranthesis are used in the tables below for brevity: CUDNN_RNN_ALGO_STANDARD
(_ALGO_STANDARD), CUDNN_RNN_ALGO_PERSIST_STATIC
(_ALGO_PERSIST_STATIC), CUDNN_RNN_ALGO_PERSIST_DYNAMIC
(_ALGO_PERSIST_DYNAMIC), and CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION
(_ALLOW_CONVERSION).
Functions | Input output layout supported | Supports variable sequence length in batch | Commonly supported |
cudnnRNNForwardInference |
Only Sequence major, packed (non-padded) | Only with Require input sequences descending sorted according to length |
Mode (cell type) supported:
Algo supported* (see the table below for an elaboration on these algorithms):
Math mode supported:
(will automatically fall back if run on pre-Volta or if algo doesn’t support Tensor Cores)
Direction mode supported:
RNN input mode:
|
|
|||
cudnnRNNBackwardData |
|||
|
|||
cudnnRNNForwardInferenceEx |
Sequence major unpacked, Batch major unpacked**, Sequence major packed** |
Only with For unpacked layout**, no input sorting required. For packed layout, require input sequences descending sorted according to length |
|
|
|||
cudnnRNNBackwardDataEx |
|||
cudnnRNNBackwardWeightsEx |
* Do not mix different algos for different steps of training. It’s also not recommended to mix non-extended and extended API for different steps of training.
** To use unpacked layout, user need to set CUDNN_RNN_PADDED_IO_ENABLED through cudnnSetRNNPaddingMode
.
The following table provides the features supported by the algorithms referred in the above table: CUDNN_RNN_ALGO_STANDARD
, CUDNN_RNN_ALGO_PERSIST_STATIC
, and CUDNN_RNN_ALGO_PERSIST_DYNAMIC
.
Features | _ALGO_STANDARD |
_ALGO_PERSIST_STATIC |
_ALGO_PERSIST_DYNAMIC |
Half input Single accumulation Half output |
Supported Half intermediate storage Single accumulation |
||
Single input Single accumulation Single output |
Supported If running on Volta, with Otherwise: Single intermediate storage Single accumulation |
||
Double input Double accumulation Double output |
Supported Double intermediate storage Double accumulation |
Not Supported | Supported Double intermediate storage Double accumulation |
LSTM recurrent projection | Supported | Not Supported | Not Supported |
LSTM cell clipping | Supported | ||
Variable sequence length in batch | Supported | Not Supported | Not Supported |
Tensor Cores on Volta/Xavier | Supported For half input/output, acceleration requires setting
Acceleration requires For single input/output, acceleration requires setting
Acceleration requires |
Not Supported, will execute normally ignoring CUDNN_TENSOR_OP_MATH! or
|
|
Other limitations | Max problem size is limited by GPU specifications. | Requires real time compilation through NVRTC |
!CUDNN_TENSOR_OP_MATH
or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION
can be set through cudnnSetRNNMatrixMathType
.
2.15. Mixed Precision Numerical Accuracy
When the computation precision and the output precision are not the same, it is possible that the numerical accuracy will vary from one algorithm to the other.
For example, when the computation is performed in FP32 and the output is in FP16, the CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 ("ALGO_0") has lower accuracy compared to the CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1 ("ALGO_1"). This is because ALGO_0 does not use extra workspace, and is forced to accumulate the intermediate results in FP16, i.e., half precision float, and this reduces the accuracy. The ALGO_1, on the other hand, uses additonal workspace to accumulate the intermediate values in FP32, i.e., full precision float.
This chapter describes all the types and enums of the cuDNN library API.
3.1. cudnnActivationDescriptor_t
cudnnActivationDescriptor_t
is a pointer to an opaque structure holding the description of an activation operation. cudnnCreateActivationDescriptor is used to create one instance, and cudnnSetActivationDescriptor must be used to initialize this instance.
3.2. cudnnActivationMode_t
cudnnActivationMode_t
is an enumerated type used to select the neuron activation function used in cudnnActivationForward()
, cudnnActivationBackward()
and cudnnConvolutionBiasActivationForward()
.
Values
-
CUDNN_ACTIVATION_SIGMOID
-
Selects the sigmoid function.
-
CUDNN_ACTIVATION_RELU
-
Selects the rectified linear function.
-
CUDNN_ACTIVATION_TANH
-
Selects the hyperbolic tangent function.
-
CUDNN_ACTIVATION_CLIPPED_RELU
-
Selects the clipped rectified linear function.
-
CUDNN_ACTIVATION_ELU
-
Selects the exponential linear function.
-
CUDNN_ACTIVATION_IDENTITY
-
Selects the identity function, intended for bypassing the activation step in
cudnnConvolutionBiasActivationForward().
(ThecudnnConvolutionBiasActivationForward()
function must useCUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
.) Does not work withcudnnActivationForward()
orcudnnActivationBackward()
.
3.3. cudnnAttnDescriptor_t
cudnnAttnDescriptor_t
is a pointer to an opaque structure holding parameters of the multi-head attention layer such as:
- weight and bias tensor shapes (vector lengths before and after linear projections)
- parameters that can be set in advance and do not change when invoking functions to evaluate forward responses and gradients (number of attention heads, softmax smoothing/sharpening coefficient)
- other settings that are necessary to compute temporary buffer sizes.
Use the cudnnCreateAttnDescriptor function to create an instance of the attention descriptor object and cudnnDestroyAttnDescriptor to delete the previously created descriptor. Use the cudnnSetAttnDescriptor function to configure the descriptor.
3.4. cudnnBatchNormMode_t
cudnnBatchNormMode_t
is an enumerated type used to specify the mode of operation in cudnnBatchNormalizationForwardInference, cudnnBatchNormalizationForwardTraining, cudnnBatchNormalizationBackward and cudnnDeriveBNTensorDescriptor routines.
Values
-
CUDNN_BATCHNORM_PER_ACTIVATION
-
Normalization is performed per-activation. This mode is intended to be used after the non-convolutional network layers. In this mode, the tensor dimensions of
bnBias
andbnScale
and the parameters used in thecudnnBatchNormalization*
functions, are 1xCxHxW. -
CUDNN_BATCHNORM_SPATIAL
-
Normalization is performed over N+spatial dimensions. This mode is intended for use after convolutional layers (where spatial invariance is desired). In this mode the
bnBias
andbnScale
tensor dimensions are 1xCx1x1. -
CUDNN_BATCHNORM_SPATIAL_PERSISTENT
-
This mode is similar to
CUDNN_BATCHNORM_SPATIAL
but it can be faster for some tasks.An optimized path may be selected for
CUDNN_DATA_FLOAT
andCUDNN_DATA_HALF
types, compute capability 6.0 or higher for the following two batch normalization API calls: cudnnBatchNormalizationForwardTraining, and cudnnBatchNormalizationBackward. In the case of cudnnBatchNormalizationBackward, thesavedMean
andsavedInvVariance
arguments should not beNULL
.The rest of this section applies to
NCHW
mode only:This mode may use a scaled atomic integer reduction that is deterministic but imposes more restrictions on the input data range. When a numerical overflow occurs, the algorithm may produce NaN-s or Inf-s (infinity) in output buffers.
When Inf-s/NaN-s are present in the input data, the output in this mode is the same as from a pure floating-point implementation.
For finite but very large input values, the algorithm may encounter overflows more frequently due to a lower dynamic range and emit Inf-s/NaN-s while
CUDNN_BATCHNORM_SPATIAL
will produce finite results. The user can invoke cudnnQueryRuntimeError to check if a numerical overflow occurred in this mode.
3.5. cudnnBatchNormOps_t
cudnnBatchNormOps_t
is an enumerated type used to specify the mode of operation in cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize()
, cudnnBatchNormalizationForwardTrainingEx()
, cudnnGetBatchNormalizationBackwardExWorkspaceSize()
, cudnnBatchNormalizationBackwardEx()
, and cudnnGetBatchNormalizationTrainingExReserveSpaceSize()
functions.
Values
-
CUDNN_BATCHNORM_OPS_BN
-
Only batch normalization is performed, per-activation.
-
CUDNN_BATCHNORM_OPS_BN_ACTIVATION
-
First, the batch normalization is performed, and then the activation is performed.
-
CUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION
-
Performs the batch normalization, then element-wise addition, followed by the activation operation.
3.6. cudnnConvolutionBwdDataAlgo_t
cudnnConvolutionBwdDataAlgo_t
is an enumerated type that exposes the different algorithms available to execute the backward data convolution operation.
Values
-
CUDNN_CONVOLUTION_BWD_DATA_ALGO_0
-
This algorithm expresses the convolution as a sum of matrix product without actually explicitly form the matrix that holds the input tensor data. The sum is done using atomic adds operation, thus the results are non-deterministic.
-
CUDNN_CONVOLUTION_BWD_DATA_ALGO_1
-
This algorithm expresses the convolution as a matrix product without actually explicitly form the matrix that holds the input tensor data. The results are deterministic.
-
CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT
-
This algorithm uses a Fast-Fourier Transform approach to compute the convolution. A significant memory workspace is needed to store intermediate results. The results are deterministic.
-
CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING
-
This algorithm uses the Fast-Fourier Transform approach but splits the inputs into tiles. A significant memory workspace is needed to store intermediate results but less than
CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT
for large size images. The results are deterministic. -
CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD
-
This algorithm uses the Winograd Transform approach to compute the convolution. A reasonably sized workspace is needed to store intermediate results. The results are deterministic.
-
CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD_NONFUSED
-
This algorithm uses the Winograd Transform approach to compute the convolution. A significant workspace may be needed to store intermediate results. The results are deterministic.
3.7. cudnnConvolutionBwdDataAlgoPerf_t
cudnnConvolutionBwdDataAlgoPerf_t
is a structure containing performance results returned by cudnnFindConvolutionBackwardDataAlgorithm()
or heuristic results returned by cudnnGetConvolutionBackwardDataAlgorithm_v7()
.
Data Members
-
cudnnConvolutionBwdDataAlgo_t algo
-
The algorithm runs to obtain the associated performance metrics.
-
cudnnStatus_t status
-
If any error occurs during the workspace allocation or timing of
cudnnConvolutionBackwardData()
, this status will represent that error. Otherwise, this status will be the return status ofcudnnConvolutionBackwardData()
.CUDNN_STATUS_ALLOC_FAILED
if any error occurred during workspace allocation or if the provided workspace is insufficient.CUDNN_STATUS_INTERNAL_ERROR
if any error occurred during timing calculations or workspace deallocation.- Otherwise, this will be the return status of
cudnnConvolutionBackwardData()
.
-
float time
-
The execution time of
cudnnConvolutionBackwardData()
(in milliseconds). -
size_t memory
-
The workspace size (in bytes).
-
cudnnDeterminism_t determinism
-
The determinism of the algorithm.
-
cudnnMathType_t mathType
-
The math type provided to the algorithm.
-
int reserved[3]
-
Reserved space for future properties.
3.8. cudnnConvolutionBwdDataPreference_t
cudnnConvolutionBwdDataPreference_t
is an enumerated type used by cudnnGetConvolutionBackwardDataAlgorithm()
to help the choice of the algorithm used for the backward data convolution.
Values
-
CUDNN_CONVOLUTION_BWD_DATA_NO_WORKSPACE
-
In this configuration, the routine
cudnnGetConvolutionBackwardDataAlgorithm()
is guaranteed to return an algorithm that does not require any extra workspace to be provided by the user. -
CUDNN_CONVOLUTION_BWD_DATA_PREFER_FASTEST
-
In this configuration, the routine
cudnnGetConvolutionBackwardDataAlgorithm()
will return the fastest algorithm regardless of how much workspace is needed to execute it. -
CUDNN_CONVOLUTION_BWD_DATA_SPECIFY_WORKSPACE_LIMIT
-
In this configuration, the routine
cudnnGetConvolutionBackwardDataAlgorithm()
will return the fastest algorithm that fits within the memory limit that the user provided.
3.9. cudnnConvolutionBwdFilterAlgo_t
cudnnConvolutionBwdFilterAlgo_t
is an enumerated type that exposes the different algorithms available to execute the backward filter convolution operation.
Values
-
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0
-
This algorithm expresses the convolution as a sum of matrix product without actually explicitly form the matrix that holds the input tensor data. The sum is done using atomic adds operation, thus the results are non-deterministic.
-
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1
-
This algorithm expresses the convolution as a matrix product without actually explicitly form the matrix that holds the input tensor data. The results are deterministic.
-
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_FFT
-
This algorithm uses the Fast-Fourier Transform approach to compute the convolution. A significant workspace is needed to store intermediate results. The results are deterministic.
-
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3
-
This algorithm is similar to
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0
but uses some small workspace to precomputes some indices. The results are also non-deterministic. -
CUDNN_CONVOLUTION_BWD_FILTER_WINOGRAD_NONFUSED
-
This algorithm uses the Winograd Transform approach to compute the convolution. A significant workspace may be needed to store intermediate results. The results are deterministic.
-
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_FFT_TILING
-
This algorithm uses the Fast-Fourier Transform approach to compute the convolution but splits the input tensor into tiles. A significant workspace may be needed to store intermediate results. The results are deterministic.
3.10. cudnnConvolutionBwdFilterAlgoPerf_t
cudnnConvolutionBwdFilterAlgoPerf_t
is a structure containing performance results returned by cudnnFindConvolutionBackwardFilterAlgorithm()
or heuristic results returned by cudnnGetConvolutionBackwardFilterAlgorithm_v7()
.
Data Members
-
cudnnConvolutionBwdFilterAlgo_t algo
-
The algorithm runs to obtain the associated performance metrics.
-
cudnnStatus_t status
-
If any error occurs during the workspace allocation or timing of
cudnnConvolutionBackwardFilter()
, this status will represent that error. Otherwise, this status will be the return status ofcudnnConvolutionBackwardFilter()
.CUDNN_STATUS_ALLOC_FAILED
if any error occurred during workspace allocation or if the provided workspace is insufficient.CUDNN_STATUS_INTERNAL_ERROR
if any error occurred during timing calculations or workspace deallocation.- Otherwise, this will be the return status of
cudnnConvolutionBackwardFilter()
.
-
float time
-
The execution time of
cudnnConvolutionBackwardFilter()
(in milliseconds). -
size_t memory
-
The workspace size (in bytes).
-
cudnnDeterminism_t determinism
-
The determinism of the algorithm.
-
cudnnMathType_t mathType
-
The math type provided to the algorithm.
-
int reserved[3]
-
Reserved space for future properties.
3.11. cudnnConvolutionBwdFilterPreference_t
cudnnConvolutionBwdFilterPreference_t
is an enumerated type used by cudnnGetConvolutionBackwardFilterAlgorithm()
to help the choice of the algorithm used for the backward filter convolution.
Values
-
CUDNN_CONVOLUTION_BWD_FILTER_NO_WORKSPACE
-
In this configuration, the routine
cudnnGetConvolutionBackwardFilterAlgorithm()
is guaranteed to return an algorithm that does not require any extra workspace to be provided by the user. -
CUDNN_CONVOLUTION_BWD_FILTER_PREFER_FASTEST
-
In this configuration, the routine
cudnnGetConvolutionBackwardFilterAlgorithm()
will return the fastest algorithm regardless of how much workspace is needed to execute it. -
CUDNN_CONVOLUTION_BWD_FILTER_SPECIFY_WORKSPACE_LIMIT
-
In this configuration, the routine
cudnnGetConvolutionBackwardFilterAlgorithm()
will return the fastest algorithm that fits within the memory limit that the user provided.
3.12. cudnnConvolutionDescriptor_t
cudnnConvolutionDescriptor_t
is a pointer to an opaque structure holding the description of a convolution operation. cudnnCreateConvolutionDescriptor()
is used to create one instance, and cudnnSetConvolutionNdDescriptor()
or cudnnSetConvolution2dDescriptor()
must be used to initialize this instance.
3.13. cudnnConvolutionFwdAlgo_t
cudnnConvolutionFwdAlgo_t
is an enumerated type that exposes the different algorithms available to execute the forward convolution operation.
Values
-
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM
-
This algorithm expresses the convolution as a matrix product without actually explicitly form the matrix that holds the input tensor data.
-
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
-
This algorithm expresses convolution as a matrix product without actually explicitly form the matrix that holds the input tensor data, but still needs some memory workspace to precompute some indices in order to facilitate the implicit construction of the matrix that holds the input tensor data.
-
CUDNN_CONVOLUTION_FWD_ALGO_GEMM
-
This algorithm expresses the convolution as an explicit matrix product. A significant memory workspace is needed to store the matrix that holds the input tensor data.
-
CUDNN_CONVOLUTION_FWD_ALGO_DIRECT
-
This algorithm expresses the convolution as a direct convolution (for example, without implicitly or explicitly doing a matrix multiplication).
-
CUDNN_CONVOLUTION_FWD_ALGO_FFT
-
This algorithm uses the Fast-Fourier Transform approach to compute the convolution. A significant memory workspace is needed to store intermediate results.
-
CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING
-
This algorithm uses the Fast-Fourier Transform approach but splits the inputs into tiles. A significant memory workspace is needed to store intermediate results but less than
CUDNN_CONVOLUTION_FWD_ALGO_FFT
for large size images. -
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
-
This algorithm uses the Winograd Transform approach to compute the convolution. A reasonably sized workspace is needed to store intermediate results.
-
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED
-
This algorithm uses the Winograd Transform approach to compute the convolution. A significant workspace may be needed to store intermediate results.
3.14. cudnnConvolutionFwdAlgoPerf_t
cudnnConvolutionFwdAlgoPerf_t
is a structure containing performance results returned by cudnnFindConvolutionForwardAlgorithm()
or heuristic results returned by cudnnGetConvolutionForwardAlgorithm_v7()
.
Data Members
-
cudnnConvolutionFwdAlgo_t algo
-
The algorithm runs to obtain the associated performance metrics.
-
cudnnStatus_t status
-
If any error occurs during the workspace allocation or timing of
cudnnConvolutionForward()
, this status will represent that error. Otherwise, this status will be the return status ofcudnnConvolutionForward()
.CUDNN_STATUS_ALLOC_FAILED
if any error occurred during workspace allocation or if the provided workspace is insufficient.CUDNN_STATUS_INTERNAL_ERROR
if any error occurred during timing calculations or workspace deallocation.- Otherwise, this will be the return status of
cudnnConvolutionForward()
.
-
float time
-
The execution time of
cudnnConvolutionForward()
(in milliseconds). -
size_t memory
-
The workspace size (in bytes).
-
cudnnDeterminism_t determinism
-
The determinism of the algorithm.
-
cudnnMathType_t mathType
-
The math type provided to the algorithm.
-
int reserved[3]
-
Reserved space for future properties.
3.15. cudnnConvolutionFwdPreference_t
cudnnConvolutionFwdPreference_t
is an enumerated type used by cudnnGetConvolutionForwardAlgorithm()
to help the choice of the algorithm used for the forward convolution.
Values
-
CUDNN_CONVOLUTION_FWD_NO_WORKSPACE
-
In this configuration, the routine
cudnnGetConvolutionForwardAlgorithm()
is guaranteed to return an algorithm that does not require any extra workspace to be provided by the user. -
CUDNN_CONVOLUTION_FWD_PREFER_FASTEST
-
In this configuration, the routine
cudnnGetConvolutionForwardAlgorithm()
will return the fastest algorithm regardless of how much workspace is needed to execute it. -
CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT
-
In this configuration, the routine
cudnnGetConvolutionForwardAlgorithm()
will return the fastest algorithm that fits within the memory limit that the user provided.
3.16. cudnnConvolutionMode_t
cudnnConvolutionMode_t
is an enumerated type used by cudnnSetConvolutionDescriptor()
to configure a convolution descriptor. The filter used for the convolution can be applied in two different ways, corresponding mathematically to a convolution or to a cross-correlation. (A cross-correlation is equivalent to a convolution with its filter rotated by 180 degrees.)
Values
-
CUDNN_CONVOLUTION
-
In this mode, a convolution operation will be done when applying the filter to the images.
-
CUDNN_CROSS_CORRELATION
-
In this mode, a cross-correlation operation will be done when applying the filter to the images.
3.17. cudnnCTCLossAlgo_t
cudnnCTCLossAlgo_t
is an enumerated type that exposes the different algorithms available to execute the CTC loss operation.
Values
-
CUDNN_CTC_LOSS_ALGO_DETERMINISTIC
-
Results are guaranteed to be reproducible
-
CUDNN_CTC_LOSS_ALGO_NON_DETERMINISTIC
-
Results are not guaranteed to be reproducible
3.18. cudnnCTCLossDescriptor_t
cudnnCTCLossDescriptor_t
is a pointer to an opaque structure holding the description of a CTC loss operation. cudnnCreateCTCLossDescriptor()
is used to create one instance, cudnnSetCTCLossDescriptor()
is used to initialize this instance, and cudnnDestroyCTCLossDescriptor()
is used to destroy this instance.
3.19. cudnnDataType_t
cudnnDataType_t
is an enumerated type indicating the data type to which a tensor descriptor or filter descriptor refers.
Values
-
CUDNN_DATA_FLOAT
-
The data is a 32-bit single-precision floating-point (
float
). -
CUDNN_DATA_DOUBLE
-
The data is a 64-bit double-precision floating-point (
double
). -
CUDNN_DATA_HALF
-
The data is a 16-bit floating-point.
-
CUDNN_DATA_INT8
-
The data is an 8-bit signed integer.
-
CUDNN_DATA_UINT8
-
The data is an 8-bit unsigned integer.
-
CUDNN_DATA_INT32
-
The data is a 32-bit signed integer.
-
CUDNN_DATA_INT8x4
-
The data is 32-bit elements each composed of 4 8-bit signed integers. This data type is only supported with tensor format
CUDNN_TENSOR_NCHW_VECT_C
. -
CUDNN_DATA_INT8x32
-
The data is 32-element vectors, each element being an 8-bit signed integer. This data type is only supported with the tensor format
CUDNN_TENSOR_NCHW_VECT_C
. Moreover, this data type can only be used withalgo 1
, meaning,CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
. For more information, see cudnnConvolutionFwdAlgo_t. -
CUDNN_DATA_UINT8x4
-
The data is 32-bit elements each composed of 4 8-bit unsigned integers. This data type is only supported with tensor format
CUDNN_TENSOR_NCHW_VECT_C
.
3.20. cudnnDeterminism_t
cudnnDeterminism_t
is an enumerated type used to indicate if the computed results are deterministic (reproducible). For more information, see Reproducibility (determinism).
Values
-
CUDNN_NON_DETERMINISTIC
-
Results are not guaranteed to be reproducible.
-
CUDNN_DETERMINISTIC
-
Results are guaranteed to be reproducible.
3.21. cudnnDirectionMode_t
cudnnDirectionMode_t
is an enumerated type used to specify the recurrence pattern in the cudnnRNNForwardInference()
, cudnnRNNForwardTraining()
, cudnnRNNBackwardData()
and cudnnRNNBackwardWeights()
routines.
Values
-
CUDNN_UNIDIRECTIONAL
- The network iterates recurrently from the first input to the last.
-
CUDNN_BIDIRECTIONAL
- Each layer of the network iterates recurrently from the first input to the last and separately from the last input to the first. The outputs of the two are concatenated at each iteration giving the output of the layer.
3.22. cudnnDivNormMode_t
cudnnDivNormMode_t
is an enumerated type used to specify the mode of operation in cudnnDivisiveNormalizationForward()
and cudnnDivisiveNormalizationBackward()
.
Values
-
CUDNN_DIVNORM_PRECOMPUTED_MEANS
-
The means tensor data pointer is expected to contain means or other kernel convolution values precomputed by the user. The means pointer can also be
NULL
, in that case, it's considered to be filled with zeroes. This is equivalent to spatial LRN.Note:In the backward pass, the means are treated as independent inputs and the gradient over means is computed independently. In this mode, to yield a net gradient over the entire LCN computational graph, the
destDiffMeans
result should be backpropagated through the user's means layer (which can be implemented using average pooling) and added to thedestDiffData
tensor produced bycudnnDivisiveNormalizationBackward()
.
3.23. cudnnDropoutDescriptor_t
cudnnDropoutDescriptor_t
is a pointer to an opaque structure holding the description of a dropout operation. cudnnCreateDropoutDescriptor()
is used to create one instance, cudnnSetDropoutDescriptor()
is used to initialize this instance, cudnnDestroyDropoutDescriptor()
is used to destroy this instance, cudnnGetDropoutDescriptor()
is used to query fields of a previously initialized instance, cudnnRestoreDropoutDescriptor()
is used to restore an instance to a previously saved off state.
3.24. cudnnErrQueryMode_t
cudnnErrQueryMode_t
is an enumerated type passed to cudnnQueryRuntimeError()
to select the remote kernel error query mode.
Values
-
CUDNN_ERRQUERY_RAWCODE
- Read the error storage location regardless of the kernel completion status.
-
CUDNN_ERRQUERY_NONBLOCKING
- Report if all tasks in the user stream of the cuDNN handle were completed. If that is the case, report the remote kernel error code.
-
CUDNN_ERRQUERY_BLOCKING
- Wait for all tasks to complete in the user stream before reporting the remote kernel error code.
3.25. cudnnFilterDescriptor_t
cudnnFilterDescriptor_t
is a pointer to an opaque structure holding the description of a filter dataset. cudnnCreateFilterDescriptor()
is used to create one instance, and cudnnSetFilter4dDescriptor()
or cudnnSetFilterNdDescriptor()
must be used to initialize this instance.
3.26. cudnnFoldingDirection_t
cudnnFoldingDirection_t
is an enumerated type used to select the folding direction. For more information, see cudnnTensorTransformDescriptor_t.
Member | Description |
---|---|
CUDNN_TRANSFORM_FOLD = 0U |
Selects folding. |
CUDNN_TRANSFORM_UNFOLD = 1U |
Selects unfolding. |
3.27. cudnnFusedOps_t
The cudnnFusedOps_t
type is an enumerated type to select a specific sequence of computations to perform in the fused operations.
Member | Description |
---|---|
CUDNN_FUSED_SCALE_BIAS_ACTIVATION_CONV_BNSTATS = 0 |
On a per-channel basis, performs these operations in this order: scale, add bias, activation, convolution, and generate batchnorm statistics. |
|
|
CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD = 1 |
On a per-channel basis, performs these operations in this order: scale, add bias, activation, convolution backward weights, and generate batchnorm statistics. |
CUDNN_FUSED_BN_FINALIZE_STATISTICS_TRAINING = 2 |
Computes the equivalent scale and bias from ySum , ySqSum and learned scale , bias . Optionally update running statistics and generate saved stats |
CUDNN_FUSED_BN_FINALIZE_STATISTICS_INFERENCE = 3 |
Computes the equivalent scale and bias from the learned running statistics and the learned scale, bias. |
CUDNN_FUSED_CONV_SCALE_BIAS_ADD_ACTIVATION = 4 |
On a per-channel basis, performs these operations in this order: convolution, scale, add bias, element-wise addition with another tensor, and activation. |
CUDNN_FUSED_SCALE_BIAS_ADD_ACTIVATION_GEN_BITMASK = 5 |
On a per-channel basis, performs these operations in this order: scale and bias on one tensor, scale, and bias on a second tensor, element-wise addition of these two tensors, and on the resulting tensor perform activation, and generate activation bit mask. |
CUDNN_FUSED_DACTIVATION_FORK_DBATCHNORM = 6 |
On a per-channel basis, performs these operations in this order: backward activation, fork (meaning, write out gradient for the residual branch), and backward batch norm. |
3.28. cudnnFusedOpsConstParamLabel_t
The cudnnFusedOpsConstParamLabel_t
is an enumerated type for the selection of the type of the cudnnFusedOps
descriptor. For more information, see cudnnSetFusedOpsConstParamPackAttribute.
typedef enum {
CUDNN_PARAM_XDESC = 0,
CUDNN_PARAM_XDATA_PLACEHOLDER = 1,
CUDNN_PARAM_BN_MODE = 2,
CUDNN_PARAM_BN_EQSCALEBIAS_DESC = 3,
CUDNN_PARAM_BN_EQSCALE_PLACEHOLDER = 4,
CUDNN_PARAM_BN_EQBIAS_PLACEHOLDER = 5,
CUDNN_PARAM_ACTIVATION_DESC = 6,
CUDNN_PARAM_CONV_DESC = 7,
CUDNN_PARAM_WDESC = 8,
CUDNN_PARAM_WDATA_PLACEHOLDER = 9,
CUDNN_PARAM_DWDESC = 10,
CUDNN_PARAM_DWDATA_PLACEHOLDER = 11,
CUDNN_PARAM_YDESC = 12,
CUDNN_PARAM_YDATA_PLACEHOLDER = 13,
CUDNN_PARAM_DYDESC = 14,
CUDNN_PARAM_DYDATA_PLACEHOLDER = 15,
CUDNN_PARAM_YSTATS_DESC = 16,
CUDNN_PARAM_YSUM_PLACEHOLDER = 17,
CUDNN_PARAM_YSQSUM_PLACEHOLDER = 18,
CUDNN_PARAM_BN_SCALEBIAS_MEANVAR_DESC = 19,
CUDNN_PARAM_BN_SCALE_PLACEHOLDER = 20,
CUDNN_PARAM_BN_BIAS_PLACEHOLDER = 21,
CUDNN_PARAM_BN_SAVED_MEAN_PLACEHOLDER = 22,
CUDNN_PARAM_BN_SAVED_INVSTD_PLACEHOLDER = 23,
CUDNN_PARAM_BN_RUNNING_MEAN_PLACEHOLDER = 24,
CUDNN_PARAM_BN_RUNNING_VAR_PLACEHOLDER = 25,
CUDNN_PARAM_ZDESC = 26,
CUDNN_PARAM_ZDATA_PLACEHOLDER = 27,
CUDNN_PARAM_BN_Z_EQSCALEBIAS_DESC = 28,
CUDNN_PARAM_BN_Z_EQSCALE_PLACEHOLDER = 29,
CUDNN_PARAM_BN_Z_EQBIAS_PLACEHOLDER = 30,
CUDNN_PARAM_ACTIVATION_BITMASK_DESC = 31,
CUDNN_PARAM_ACTIVATION_BITMASK_PLACEHOLDER = 32,
CUDNN_PARAM_DXDESC = 33,
CUDNN_PARAM_DXDATA_PLACEHOLDER = 34,
CUDNN_PARAM_DZDESC = 35,
CUDNN_PARAM_DZDATA_PLACEHOLDER = 36,
CUDNN_PARAM_BN_DSCALE_PLACEHOLDER = 37,
CUDNN_PARAM_BN_DBIAS_PLACEHOLDER = 38,
} cudnnFusedOpsConstParamLabel_t;
Short-form used | Stands for |
---|---|
Setter | cudnnSetFusedOpsConstParamPackAttribute |
Getter | cudnnGetFusedOpsConstParamPackAttribute |
X_PointerPlaceHolder_t |
cudnnFusedOpsPointerPlaceHolder_t |
X_ prefix in the Attribute column |
Stands for CUDNN_PARAM_ in the enumerator name |
For the attribute CUDNN_FUSED_SCALE_BIAS_ACTIVATION_CONV_BNSTATS in cudnnFusedOp_t |
|||
---|---|---|---|
Attribute | Expected Descriptor Type Passed in, in the Setter | Description | Default Value After Creation |
X_XDESC |
In the setter, the *param should be xDesc , a pointer to a previously initialized cudnnTensorDescriptor_t . |
Tensor descriptor describing the size, layout, and datatype of the x (input) tensor. |
NULL |
X_XDATA_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether xData pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . |
CUDNN_PTR_NULL |
X_BN_MODE |
In the setter, the *param should be a pointer to a previously initialized cudnnBatchNormMode_t*. |
Describes the mode of operation for the scale, bias and the statistics. As of cuDNN 7.6.0, only |
CUDNN_BATCHNORM_PER_ACTIVATION |
X_BN_EQSCALEBIAS_DESC |
In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t. |
Tensor descriptor describing the size, layout, and datatype of the batchNorm equivalent scale and bias tensors. The shapes must match the mode specified in CUDNN_PARAM_BN_MODE . If set to NULL , both scale and bias operation will become a NOP. |
NULL |
X_BN_EQSCALE_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether batchnorm equivalent scale pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . If set to |
CUDNN_PTR_NULL |
X_BN_EQBIAS_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether batchnorm equivalent bias pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . If set to |
CUDNN_PTR_NULL |
X_ACTIVATION_DESC |
In the setter, the *param should be a pointer to a previously initialized cudnnActivationDescriptor_t*. |
Describes the activation operation. As of 7.6.0, only activation mode of |
NULL |
X_CONV_DESC |
In the setter, the *param should be a pointer to a previously initialized cudnnConvolutionDescriptor_t*. |
Describes the convolution operation. | NULL |
X_WDESC |
In the setter, the *param should be a pointer to a previously initialized cudnnFilterDescriptor_t*. |
Filter descriptor describing the size, layout and datatype of the w (filter) tensor. |
NULL |
X_WDATA_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether w (filter) tensor pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . |
CUDNN_PTR_NULL |
X_YDESC |
In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t*. |
Tensor descriptor describing the size, layout and datatype of the y (output) tensor. |
NULL |
X_YDATA_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether y (output) tensor pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . |
CUDNN_PTR_NULL |
X_YSTATS_DESC |
In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t* . |
Tensor descriptor describing the size, layout and datatype of the sum of y and sum of y square tensors. The shapes need to match the mode specified in CUDNN_PARAM_BN_MODE . If set to |
NULL |
X_YSUM_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether sum of y pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . If set to |
CUDNN_PTR_NULL |
X_YSQSUM_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether sum of y square pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . If set to |
CUDNN_PTR_NULL |
- If the corresponding pointer placeholder in ConstParamPack is set to
CUDNN_PTR_NULL
, then the device pointer in theVariantParamPack
need to beNULL
as well. - If the corresponding pointer placeholder in
ConstParamPack
is set toCUDNN_PTR_ELEM_ALIGNED
orCUDNN_PTR_16B_ALIGNED
, then the device pointer in theVariantParamPack
may not beNULL
and need to be at least element-aligned or 16 bytes-aligned, respectively.
As of cuDNN 7.6.0, if the conditions in Table 3 are met, then the fully fused fast path will be triggered. Otherwise, a slower partially fused path will be triggered.
Parameter | Condition |
---|---|
Device compute capability | Need to be one of 7.0 , 7.2 or 7.5 . |
CUDNN_PARAM_XDESC
|
Tensor is 4 dimensional Datatype is Layout is Alignment is Tensor’s |
CUDNN_PARAM_BN_EQSCALEBIAS_DESC
|
If either one of scale and bias operation is not turned into a NOP: Tensor is 4 dimensional with shape 1xCx1x1 Datatype is Layout is fully packed Alignment is |
CUDNN_PARAM_CONV_DESC
|
Convolution descriptor’s mode needs to be CUDNN_CROSS_CORRELATION . Convolution descriptor’s Convolution descriptor’s Convolution descriptor’s group count needs to be Convolution descriptor’s Filter is in Filter’s data type is Filter’s K dimension is a multiple of 32 Filter size RxS is either 1x1 or 3x3 If filter size RxS is 1x1, convolution descriptor’s Filter’s alignment is |
CUDNN_PARAM_YDESC
|
Tensor is 4 dimensional Datatype is Layout is Alignment is |
CUDNN_PARAM_YSTATS_DESC
|
If the generate statistics operation is not turned into a NOP: Tensor is 4 dimensional with shape 1xKx1x1 Datatype is Layout is fully packed Alignment is |
For the attribute CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD in cudnnFusedOp_t |
|||
---|---|---|---|
Attribute | Expected Descriptor Type Passed in, in the Setter | Description | Default Value After Creation |
X_XDESC |
In the setter, the *param should be xDesc , a pointer to a previously initialized cudnnTensorDescriptor_t . |
Tensor descriptor describing the size, layout and datatype of the x (input) tensor |
NULL |
X_XDATA_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether xData pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . |
CUDNN_PTR_NULL |
X_BN_MODE |
In the setter, the *param should be a pointer to a previously initialized cudnnBatchNormMode_t*. |
Describes the mode of operation for the scale, bias and the statistics. As of cuDNN 7.6.0, only |
CUDNN_BATCHNORM_PER_ACTIVATION |
X_BN_EQSCALEBIAS_DESC |
In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t. |
Tensor descriptor describing the size, layout and datatype of the batchNorm equivalent scale and bias tensors. The shapes must match the mode specified in CUDNN_PARAM_BN_MODE . If set to NULL , both scale and bias operation will become a NOP. |
NULL |
X_BN_EQSCALE_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether batchnorm equivalent scale pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . If set to |
CUDNN_PTR_NULL |
X_BN_EQBIAS_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether batchnorm equivalent bias pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . If set to |
CUDNN_PTR_NULL |
X_ACTIVATION_DESC |
In the setter, the *param should be a pointer to a previously initialized cudnnActivationDescriptor_t*. |
Describes the activation operation. As of 7.6.0, only activation mode of |
NULL |
X_CONV_DESC |
In the setter, the *param should be a pointer to a previously initialized cudnnConvolutionDescriptor_t*. |
Describes the convolution operation. | NULL |
X_DWDESC |
In the setter, the *param should be a pointer to a previously initialized cudnnFilterDescriptor_t*. |
Filter descriptor describing the size, layout and datatype of the dw (filter gradient output) tensor. |
NULL |
X_DWDATA_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether dw (filter gradient output) tensor pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . |
CUDNN_PTR_NULL |
X_DYDESC |
In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t*. |
Tensor descriptor describing the size, layout and datatype of the dy (gradient input) tensor. |
NULL |
X_DYDATA_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether dy (gradient input) tensor pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment *. |
CUDNN_PTR_NULL |
- If the corresponding pointer placeholder in
ConstParamPack
is set toCUDNN_PTR_NULL
, then the device pointer in theVariantParamPack
needs to beNULL
as well. - If the corresponding pointer placeholder in
ConstParamPack
is set toCUDNN_PTR_ELEM_ALIGNED
orCUDNN_PTR_16B_ALIGNED
, then the device pointer in theVariantParamPack
may not beNULL
and needs to be at least element-aligned or 16 bytes-aligned, respectively.
As of cuDNN 7.6.0, if the conditions in Table 5 are met, then the fully fused fast path will be triggered. Otherwise a slower partially fused path will be triggered.
Parameter | Condition |
---|---|
Device compute capability | Needs to be one of 7.0 , 7.2 or 7.5 . |
CUDNN_PARAM_XDESC
|
Tensor is 4 dimensional Datatype is Layout is Alignment is Tensor’s |
CUDNN_PARAM_BN_EQSCALEBIAS_DESC
|
If either one of scale and bias operation is not turned into a NOP: Tensor is 4 dimensional with shape 1xCx1x1 Datatype is Layout is fully packed Alignment is |
CUDNN_PARAM_CONV_DESC
|
Convolution descriptor’s mode needs to be CUDNN_CROSS_CORRELATION . Convolution descriptor’s dataType needs to be Convolution descriptor’s Convolution descriptor’s group count needs to be Convolution descriptor’s Filter gradient is in Filter gradient’s data type is Filter gradient’s Filter gradient size RxS is either 1x1 or 3x3 If filter gradient size RxS is 1x1, convolution descriptor’s Filter gradient’s alignment is |
CUDNN_PARAM_DYDESC
|
Tensor is 4 dimensional Datatype is Layout is Alignment is |
For the attribute CUDNN_FUSED_BN_FINALIZE_STATISTICS_TRAINING in cudnnFusedOp_t |
|||
---|---|---|---|
Attribute | Expected Descriptor Type Passed in, in the Setter | Description | Default Value After Creation |
X_BN_MODE |
In the setter, the *param should be a pointer to a previously initialized cudnnBatchNormMode_t*. |
Describes the mode of operation for the scale, bias and the statistics. As of cuDNN 7.6.0, only |
CUDNN_BATCHNORM_PER_ACTIVATION |
X_YSTATS_DESC |
In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t* . |
Tensor descriptor describing the size, layout and datatype of the sum of y and sum of y square tensors. The shapes need to match the mode specified in CUDNN_PARAM_BN_MODE . |
NULL |
X_YSUM_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether sum of y pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . |
CUDNN_PTR_NULL |
X_YSQSUM_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether sum of y square pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . |
CUDNN_PTR_NULL |
X_BN_SCALEBIAS_MEANVAR_DESC |
In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t. |
A common tensor descriptor describing the size, layout and datatype of the batchNorm trained scale, bias and statistics tensors. The shapes need to match the mode specified in CUDNN_PARAM_BN_MODE (similar to the bnScaleBiasMeanVarDesc field in the cudnnBatchNormalization* API). |
NULL |
X_BN_SCALE_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether the batchNorm trained scale pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . If the output of |
CUDNN_PTR_NULL |
X_BN_BIAS_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether the batchNorm trained bias pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . If neither output of |
CUDNN_PTR_NULL |
X_BN_SAVED_MEAN_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether the batchNorm saved mean pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . If set to |
CUDNN_PTR_NULL |
X_BN_SAVED_INVSTD_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether the batchNorm saved inverse standard deviation pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . If set to |
CUDNN_PTR_NULL |
X_BN_RUNNING_MEAN_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether the batchNorm running mean pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . If set to |
CUDNN_PTR_NULL |
X_BN_RUNNING_VAR_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether the batchNorm running variance pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . If set to |
CUDNN_PTR_NULL |
X_BN_EQSCALEBIAS_DESC |
In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t*. |
Tensor descriptor describing the size, layout and datatype of the batchNorm equivalent scale and bias tensors. The shapes need to match the mode specified in CUDNN_PARAM_BN_MODE . If neither output of |
NULL |
X_BN_EQSCALE_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether batchnorm equivalent scale pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . If set to |
CUDNN_PTR_NULL |
X_BN_EQBIAS_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether batchnorm equivalent bias pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . If set to |
CUDNN_PTR_NULL |
For the attribute CUDNN_FUSED_BN_FINALIZE_STATISTICS_INFERENCE in cudnnFusedOp_t |
|||
---|---|---|---|
Attribute | Expected Descriptor Type Passed in, in the Setter | Description | Default Value After Creation |
X_BN_MODE |
In the setter, the *param should be a pointer to a previously initialized cudnnBatchNormMode_t*. |
Describes the mode of operation for the scale, bias and the statistics. As of cuDNN 7.6.0, only |
CUDNN_BATCHNORM_PER_ACTIVATION |
X_BN_SCALEBIAS_MEANVAR_DESC |
In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t. |
A common tensor descriptor describing the size, layout and datatype of the batchNorm trained scale, bias and statistics tensors. The shapes need to match the mode specified in CUDNN_PARAM_BN_MODE (similar to the bnScaleBiasMeanVarDesc field in the cudnnBatchNormalization* API). |
NULL |
X_BN_SCALE_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether the batchNorm trained scale pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . |
CUDNN_PTR_NULL |
X_BN_BIAS_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether the batchNorm trained bias pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . |
CUDNN_PTR_NULL |
X_BN_RUNNING_MEAN_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether the batchNorm running mean pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . |
CUDNN_PTR_NULL |
X_BN_RUNNING_VAR_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether the batchNorm running variance pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . |
CUDNN_PTR_NULL |
X_BN_EQSCALEBIAS_DESC |
In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t*. |
Tensor descriptor describing the size, layout and datatype of the batchNorm equivalent scale and bias tensors. The shapes need to match the mode specified in CUDNN_PARAM_BN_MODE . |
NULL |
X_BN_EQSCALE_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether batchnorm equivalent scale pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . If set to |
CUDNN_PTR_NULL |
X_BN_EQBIAS_PLACEHOLDER |
In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t* . |
Describes whether batchnorm equivalent bias pointer in the VariantParamPack will be NULL , or if not, user promised pointer alignment * . If set to |
CUDNN_PTR_NULL |
3.29. cudnnFusedOpsConstParamPack_t
cudnnFusedOpsConstParamPack_t
is a pointer to an opaque structure holding the description of the cudnnFusedOps
constant parameters. Use the function cudnnCreateFusedOpsConstParamPack to create one instance of this structure, and the function cudnnDestroyFusedOpsConstParamPack to destroy a previously-created descriptor.
3.30. cudnnFusedOpsPlan_t
cudnnFusedOpsPlan_t
is a pointer to an opaque structure holding the description of the cudnnFusedOpsPlan
. This descriptor contains the plan information, including the problem type and size, which kernels should be run, and the internal workspace partition. Use the function cudnnCreateFusedOpsPlan to create one instance of this structure, and the function cudnnDestroyFusedOpsPlan to destroy a previously-created descriptor.
3.31. cudnnFusedOpsPointerPlaceHolder_t
cudnnFusedOpsPointerPlaceHolder_t is an enumerated type used to select the alignment type of the cudnnFusedOps
descriptor pointer.
Member | Description |
---|---|
CUDNN_PTR_NULL = 0 |
Indicates that the pointer to the tensor in the variantPack will be NULL . |
CUDNN_PTR_ELEM_ALIGNED = 1 |
Indicates that the pointer to the tensor in the variantPack will not be NULL , and will have element alignment. |
CUDNN_PTR_16B_ALIGNED = 2 |
Indicates that the pointer to the tensor in the variantPack will not be NULL , and will have 16 byte alignment. |
3.32. cudnnFusedOpsVariantParamLabel_t
The cudnnFusedOpsVariantParamLabel_t
is an enumerated type that is used to set the buffer pointers. These buffer pointers can be changed in each iteration.
typedef enum {
CUDNN_PTR_XDATA = 0,
CUDNN_PTR_BN_EQSCALE = 1,
CUDNN_PTR_BN_EQBIAS = 2,
CUDNN_PTR_WDATA = 3,
CUDNN_PTR_DWDATA = 4,
CUDNN_PTR_YDATA = 5,
CUDNN_PTR_DYDATA = 6,
CUDNN_PTR_YSUM = 7,
CUDNN_PTR_YSQSUM = 8,
CUDNN_PTR_WORKSPACE = 9,
CUDNN_PTR_BN_SCALE = 10,
CUDNN_PTR_BN_BIAS = 11,
CUDNN_PTR_BN_SAVED_MEAN = 12,
CUDNN_PTR_BN_SAVED_INVSTD = 13,
CUDNN_PTR_BN_RUNNING_MEAN = 14,
CUDNN_PTR_BN_RUNNING_VAR = 15,
CUDNN_PTR_ZDATA = 16,
CUDNN_PTR_BN_Z_EQSCALE = 17,
CUDNN_PTR_BN_Z_EQBIAS = 18,
CUDNN_PTR_ACTIVATION_BITMASK = 19,
CUDNN_PTR_DXDATA = 20,
CUDNN_PTR_DZDATA = 21,
CUDNN_PTR_BN_DSCALE = 22,
CUDNN_PTR_BN_DBIAS = 23,
CUDNN_SCALAR_SIZE_T_WORKSPACE_SIZE_IN_BYTES = 100,
CUDNN_SCALAR_INT64_T_BN_ACCUMULATION_COUNT = 101,
CUDNN_SCALAR_DOUBLE_BN_EXP_AVG_FACTOR = 102,
CUDNN_SCALAR_DOUBLE_BN_EPSILON = 103,
} cudnnFusedOpsVariantParamLabel_t;
Short-form used | Stands for |
---|---|
Setter | cudnnSetFusedOpsVariantParamPackAttribute |
Getter | cudnnGetFusedOpsVariantParamPackAttribute |
X_ prefix in the Attribute key column |
Stands for CUDNN_PTR_ or CUDNN_SCALAR_ in the enumerator name. |
For the attribute CUDNN_FUSED_SCALE_BIAS_ACTIVATION_CONV_BNSTATS in cudnnFusedOp_t |
||||
---|---|---|---|---|
Attribute key | Expected Descriptor Type Passed in, in the Setter | I/O Type | Description | Default Value |
X_XDATA |
void * |
input | Pointer to x (input) tensor on device, need to agree with previously set CUDNN_PARAM_XDATA_PLACEHOLDER attribute * . |
NULL |
X_BN_EQSCALE |
void * |
input | Pointer to batchnorm equivalent scale tensor on device, need to agree with previously set CUDNN_PARAM_BN_EQSCALE_PLACEHOLDER attribute * . |
NULL |
X_BN_EQBIAS |
void * |
input | Pointer to batchnorm equivalent bias tensor on device, need to agree with previously set CUDNN_PARAM_BN_EQBIAS_PLACEHOLDER attribute * . |
NULL |
X_WDATA |
void * |
input | Pointer to w (filter) tensor on device, need to agree with previously set CUDNN_PARAM_WDATA_PLACEHOLDER attribute * . |
NULL |
X_YDATA |
void * |
output | Pointer to y (output) tensor on device, need to agree with previously set CUDNN_PARAM_YDATA_PLACEHOLDER attribute * . |
NULL |
X_YSUM |
void * |
output | Pointer to sum of y tensor on device, need to agree with previously set CUDNN_PARAM_YSUM_PLACEHOLDER attribute * . |
NULL |
X_YSQSUM |
void * |
output | Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_YSQSUM_PLACEHOLDER attribute * . |
NULL |
X_WORKSPACE |
void * |
input | Pointer to user allocated workspace on device. Can be NULL if the workspace size requested is 0 . |
NULL |
X_SIZE_T_WORKSPACE_SIZE_IN_BYTES |
size_t * |
input | Pointer to a size_t value in host memory describing the user allocated workspace size in bytes. The amount needs to be equal or larger than the amount requested in cudnnMakeFusedOpsPlan . |
0 |
- If the corresponding pointer placeholder in
ConstParamPack
is set toCUDNN_PTR_NULL
, then the device pointer in theVariantParamPack
needs to beNULL
as well - If the corresponding pointer placeholder in
ConstParamPack
is set toCUDNN_PTR_ELEM_ALIGNED
orCUDNN_PTR_16B_ALIGNED
, then the device pointer in theVariantParamPack
may not beNULL
and needs to be at least element-aligned or 16 bytes-aligned, respectively.
For the attribute CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD in cudnnFusedOp_t |
||||
---|---|---|---|---|
Attribute key | Expected Descriptor Type Passed in, in the Setter | I/O Type | Description | Default Value |
X_XDATA |
void * |
input | Pointer to x (input) tensor on device, need to agree with previously set CUDNN_PARAM_XDATA_PLACEHOLDER attribute * . |
NULL |
X_BN_EQSCALE |
void * |
input | Pointer to batchnorm equivalent scale tensor on device, need to agree with previously set CUDNN_PARAM_BN_EQSCALE_PLACEHOLDER attribute * . |
NULL |
X_BN_EQBIAS |
void * |
input | Pointer to batchnorm equivalent bias tensor on device, need to agree with previously set CUDNN_PARAM_BN_EQBIAS_PLACEHOLDER attribute * . |
NULL |
X_DWDATA |
void * |
output | Pointer to dw (filter gradient output) tensor on device, need to agree with previously set CUDNN_PARAM_WDATA_PLACEHOLDER attribute * . |
NULL |
X_DYDATA |
void * |
input | Pointer to dy (gradient input) tensor on device, need to agree with previously set CUDNN_PARAM_YDATA_PLACEHOLDER attribute * . |
NULL |
X_WORKSPACE |
void * |
input | Pointer to user allocated workspace on device. Can be NULL if the workspace size requested is 0 . |
NULL |
X_SIZE_T_WORKSPACE_SIZE_IN_BYTES |
size_t * |
input | Pointer to a size_t value in host memory describing the user allocated workspace size in bytes. The amount needs to be equal or larger than the amount requested in cudnnMakeFusedOpsPlan . |
0 |
- If the corresponding pointer placeholder in
ConstParamPack
is set toCUDNN_PTR_NULL
, then the device pointer in theVariantParamPack
needs to beNULL
as well. - If the corresponding pointer placeholder in
ConstParamPack
is set toCUDNN_PTR_ELEM_ALIGNED
orCUDNN_PTR_16B_ALIGNED
, then the device pointer in theVariantParamPack
may not beNULL
and needs to be at least element-aligned or 16 bytes-aligned, respectively.
For the attribute CUDNN_FUSED_BN_FINALIZE_STATISTICS_TRAINING in cudnnFusedOp_t |
||||
---|---|---|---|---|
Attribute key | Expected Descriptor Type Passed in, in the Setter | I/O Type | Description | Default Value |
X_YSUM |
void * |
input | Pointer to sum of y tensor on device, need to agree with previously set CUDNN_PARAM_YSUM_PLACEHOLDER attribute * . |
NULL |
X_YSQSUM |
void * |
input | Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_YSQSUM_PLACEHOLDER attribute * . |
NULL |
X_BN_SCALE |
void * |
input | Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_SCALE_PLACEHOLDER attribute * . |
NULL |
X_BN_BIAS |
void * |
input | Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_BIAS_PLACEHOLDER attribute * . |
NULL |
X_BN_SAVED_MEAN |
void * |
output | Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_SAVED_MEAN_PLACEHOLDER attribute * . |
NULL |
X_BN_SAVED_INVSTD |
void * |
output | Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_SAVED_INVSTD_PLACEHOLDER attribute * . |
NULL |
X_BN_RUNNING_MEAN |
void * |
input/output | Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_RUNNING_MEAN_PLACEHOLDER attribute * . |
NULL |
X_BN_RUNNING_VAR |
void * |
input/output | Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_RUNNING_VAR_PLACEHOLDER attribute * . |
NULL |
X_BN_EQSCALE |
void * |
output | Pointer to batchnorm equivalent scale tensor on device, need to agree with previously set CUDNN_PARAM_BN_EQSCALE_PLACEHOLDER attribute * . |
NULL |
X_BN_EQBIAS |
void * |
output | Pointer to batchnorm equivalent bias tensor on device, need to agree with previously set CUDNN_PARAM_BN_EQBIAS_PLACEHOLDER attribute * . |
NULL |
X_INT64_T_BN_ACCUMULATION_COUNT |
int64_t * |
input | Pointer to a scalar value in int64_t on host memory. This value should describe the number of tensor elements accumulated in the sum of For example, in the single GPU use case, if the mode is In multi-GPU use case, if all-reduce has been performed on the sum of |
0 |
X_DOUBLE_BN_EXP_AVG_FACTOR |
double * |
input | Pointer to a scalar value in double on host memory. Factor used in the moving average computation. See |
0.0 |
X_DOUBLE_BN_EPSILON |
double * |
input | Pointer to a scalar value in double on host memory. A conditioning constant used in the batch normalization formula. Its value should be equal to or greater than the value defined for See |
0.0 |
X_WORKSPACE |
void * |
input | Pointer to user allocated workspace on device. Can be NULL if the workspace size requested is 0 . |
NULL |
X_SIZE_T_WORKSPACE_SIZE_IN_BYTES |
size_t * |
input | Pointer to a size_t value in host memory describing the user allocated workspace size in bytes. The amount need to be equal or larger than the amount requested in cudnnMakeFusedOpsPlan . |
0 |
- If the corresponding pointer placeholder in
ConstParamPack
is set toCUDNN_PTR_NULL
, then the device pointer in theVariantParamPack
need to beNULL
as well. - If the corresponding pointer placeholder in
ConstParamPack
is set toCUDNN_PTR_ELEM_ALIGNED
orCUDNN_PTR_16B_ALIGNED
, then the device pointer in theVariantParamPack
may not beNULL
and needs to be at least element-aligned or 16 bytes-aligned, respectively.
For the attribute CUDNN_FUSED_BN_FINALIZE_STATISTICS_INFERENCE in cudnnFusedOp_t |
||||
---|---|---|---|---|
Attribute key | Expected Descriptor Type Passed in, in the Setter | I/O Type | Description | Default Value |
X_BN_SCALE |
void * |
input | Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_SCALE_PLACEHOLDER attribute * . |
NULL |
X_BN_BIAS |
void * |
input | Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_BIAS_PLACEHOLDER attribute * . |
NULL |
X_BN_RUNNING_MEAN |
void * |
input/output | Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_RUNNING_MEAN_PLACEHOLDER attribute * . |
NULL |
X_BN_RUNNING_VAR |
void * |
input/output | Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_RUNNING_VAR_PLACEHOLDER attribute * . |
NULL |
X_BN_EQSCALE |
void * |
output | Pointer to batchnorm equivalent scale tensor on device, need to agree with previously set CUDNN_PARAM_BN_EQSCALE_PLACEHOLDER attribute * . |
NULL |
X_BN_EQBIAS |
void * |
output | Pointer to batchnorm equivalent bias tensor on device, need to agree with previously set CUDNN_PARAM_BN_EQBIAS_PLACEHOLDER attribute * . |
NULL |
X_DOUBLE_BN_EPSILON |
double * |
input | Pointer to a scalar value in double on host memory. A conditioning constant used in the batch normalization formula. Its value should be equal to or greater than the value defined for See |
0.0 |
X_WORKSPACE |
void * |
input | Pointer to user allocated workspace on device. Can be NULL if the workspace size requested is 0 . |
NULL |
X_SIZE_T_WORKSPACE_SIZE_IN_BYTES |
size_t * |
input | Pointer to a size_t value in host memory describing the user allocated workspace size in bytes. The amount need to be equal or larger than the amount requested in cudnnMakeFusedOpsPlan . |
0 |
- If the corresponding pointer placeholder in
ConstParamPack
is set toCUDNN_PTR_NULL
, then the device pointer in theVariantParamPack
needs to beNULL
as well. - If the corresponding pointer placeholder in
ConstParamPack
is set toCUDNN_PTR_ELEM_ALIGNED
orCUDNN_PTR_16B_ALIGNED
, then the device pointer in theVariantParamPack
may not beNULL
and needs to be at least element-aligned or 16 bytes-aligned, respectively.
3.33. cudnnFusedOpsVariantParamPack_t
cudnnFusedOpsVariantParamPack_t
is a pointer to an opaque structure holding the description of the cudnnFusedOps
variant parameters. Use the function cudnnCreateFusedOpsVariantParamPack to create one instance of this structure, and the function cudnnDestroyFusedOpsVariantParamPack to destroy a previously-created descriptor.
3.34. cudnnHandle_t
cudnnHandle_t
is a pointer to an opaque structure holding the cuDNN library context. The cuDNN library context must be created using cudnnCreate()
and the returned handle must be passed to all subsequent library function calls. The context should be destroyed at the end using cudnnDestroy()
. The context is associated with only one GPU device, the current device at the time of the call to cudnnCreate()
. However, multiple contexts can be created on the same GPU device.
3.35. cudnnIndicesType_t
cudnnIndicesType_t
is an enumerated type used to indicate the data type for the indices to be computed by the cudnnReduceTensor()
routine. This enumerated type is used as a field for the cudnnReduceTensorDescriptor_t
descriptor.
Values
-
CUDNN_32BIT_INDICES
-
Compute unsigned int indices.
-
CUDNN_64BIT_INDICES
-
Compute unsigned long indices.
-
CUDNN_16BIT_INDICES
-
Compute unsigned short indices.
-
CUDNN_8BIT_INDICES
-
Compute unsigned char indices.
3.36. cudnnLossNormalizationMode_t
cudnnLossNormalizationMode_t
is an enumerated type that controls the input normalization mode for a loss function. This type can be used with cudnnSetCTCLossDescriptorEx.
Values
-
CUDNN_LOSS_NORMALIZATION_NONE
-
The input
probs
of cudnnCTCLoss function is expected to be the normalized probability, and the outputgradients
is the gradient of loss with respect to the unnormalized probability. -
CUDNN_LOSS_NORMALIZATION_SOFTMAX
-
The input
probs
of cudnnCTCLoss function is expected to be the unnormalized activation from the previous layer, and the outputgradients
is the gradient with respect to the activation. Internally the probability is computed by softmax normalization.
3.37. cudnnLRNMode_t
cudnnLRNMode_t
is an enumerated type used to specify the mode of operation in cudnnLRNCrossChannelForward()
and cudnnLRNCrossChannelBackward()
.
Values
-
CUDNN_LRN_CROSS_CHANNEL_DIM1
-
LRN computation is performed across tensor's dimension
dimA[1]
.
3.38. cudnnMathType_t
cudnnMathType_t
is an enumerated type used to indicate if the use of Tensor Core operations is permitted a given library routine.
Values
-
CUDNN_DEFAULT_MATH
-
Tensor Core operations are not used.
-
CUDNN_TENSOR_OP_MATH
-
The use of Tensor Core operations is permitted.
-
CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION
-
Enables the use of FP32 tensors for both input and output.
3.39. cudnnMultiHeadAttnWeightKind_t
cudnnMultiHeadAttnWeightKind_t
is an enumerated type that specifies a group of weights or biases in the cudnnGetMultiHeadAttnWeights()
function.
Values
-
CUDNN_MH_ATTN_Q_WEIGHTS
-
Selects the input projection weights for
queries
. -
CUDNN_MH_ATTN_K_WEIGHTS
-
Selects the input projection weights for
keys
. -
CUDNN_MH_ATTN_V_WEIGHTS
-
Selects the input projection weights for
values
. -
CUDNN_MH_ATTN_O_WEIGHTS
-
Selects the output projection weights.
-
CUDNN_MH_ATTN_Q_BIASES
-
Selects the input projection biases for
queries
. -
CUDNN_MH_ATTN_K_BIASES
-
Selects the input projection biases for
keys
. -
CUDNN_MH_ATTN_V_BIASES
-
Selects the input projection biases for
values
. -
CUDNN_MH_ATTN_O_BIASES
-
Selects the output projection biases.
3.40. cudnnNanPropagation_t
cudnnNanPropagation_t
is an enumerated type used to indicate if a given routine should propagate Nan
numbers. This enumerated type is used as a field for the cudnnActivationDescriptor_t
descriptor and cudnnPoolingDescriptor_t
descriptor.
Values
-
CUDNN_NOT_PROPAGATE_NAN
-
Nan
numbers are not propagated. -
CUDNN_PROPAGATE_NAN
-
Nan
numbers are propagated.
3.41. cudnnOpTensorDescriptor_t
cudnnOpTensorDescriptor_t
is a pointer to an opaque structure holding the description of a Tensor Core operation, used as a parameter to cudnnOpTensor()
. cudnnCreateOpTensorDescriptor()
is used to create one instance, and cudnnSetOpTensorDescriptor()
must be used to initialize this instance.
3.42. cudnnOpTensorOp_t
cudnnOpTensorOp_t
is an enumerated type used to indicate the Tensor Core operation to be used by the cudnnOpTensor()
routine. This enumerated type is used as a field for the cudnnOpTensorDescriptor_t
descriptor.
Values
-
CUDNN_OP_TENSOR_ADD
-
The operation to be performed is addition.
-
CUDNN_OP_TENSOR_MUL
-
The operation to be performed is multiplication.
-
CUDNN_OP_TENSOR_MIN
-
The operation to be performed is a minimum comparison.
-
CUDNN_OP_TENSOR_MAX
-
The operation to be performed is a maximum comparison.
-
CUDNN_OP_TENSOR_SQRT
-
The operation to be performed is square root, performed on only the
A
tensor. -
CUDNN_OP_TENSOR_NOT
-
The operation to be performed is negation, performed on only the
A
tensor.
3.43. cudnnPersistentRNNPlan_t
cudnnPersistentRNNPlan_t
is a pointer to an opaque structure holding a plan to execute a dynamic persistent RNN. cudnnCreatePersistentRNNPlan()
is used to create and initialize one instance.
3.44. cudnnPoolingDescriptor_t
cudnnPoolingDescriptor_t
is a pointer to an opaque structure holding the description of a pooling operation. cudnnCreatePoolingDescriptor()
is used to create one instance, and cudnnSetPoolingNdDescriptor()
or cudnnSetPooling2dDescriptor()
must be used to initialize this instance.
3.45. cudnnPoolingMode_t
cudnnPoolingMode_t
is an enumerated type passed to cudnnSetPoolingDescriptor()
to select the pooling method to be used by cudnnPoolingForward()
and cudnnPoolingBackward()
.
Values
-
CUDNN_POOLING_MAX
-
The maximum value inside the pooling window is used.
-
CUDNN_POOLING_AVERAGE_COUNT_INCLUDE_PADDING
-
Values inside the pooling window are averaged. The number of elements used to calculate the average includes spatial locations falling in the padding region.
-
CUDNN_POOLING_AVERAGE_COUNT_EXCLUDE_PADDING
-
Values inside the pooling window are averaged. The number of elements used to calculate the average excludes spatial locations falling in the padding region.
-
CUDNN_POOLING_MAX_DETERMINISTIC
-
The maximum value inside the pooling window is used. The algorithm used is deterministic.
3.46. cudnnReduceTensorDescriptor_t
cudnnReduceTensorDescriptor_t
is a pointer to an opaque structure holding the description of a tensor reduction operation, used as a parameter to cudnnReduceTensor()
. cudnnCreateReduceTensorDescriptor()
is used to create one instance, and cudnnSetReduceTensorDescriptor()
must be used to initialize this instance.
cudnnReduceTensorIndices_t
cudnnReduceTensorIndices_t
is an enumerated type used to indicate whether indices are to be computed by the cudnnReduceTensor()
routine. This enumerated type is used as a field for the cudnnReduceTensorDescriptor_t
descriptor.
Values
-
CUDNN_REDUCE_TENSOR_NO_INDICES
-
Do not compute indices.
-
CUDNN_REDUCE_TENSOR_FLATTENED_INDICES
-
Compute indices. The resulting indices are relative, and flattened.
3.48. cudnnReduceTensorOp_t
cudnnReduceTensorOp_t
is an enumerated type used to indicate the Tensor Core operation to be used by the cudnnReduceTensor()
routine. This enumerated type is used as a field for the cudnnReduceTensorDescriptor_t
descriptor.
Values
-
CUDNN_REDUCE_TENSOR_ADD
-
The operation to be performed is addition.
-
CUDNN_REDUCE_TENSOR_MUL
-
The operation to be performed is multiplication.
-
CUDNN_REDUCE_TENSOR_MIN
-
The operation to be performed is a minimum comparison.
-
CUDNN_REDUCE_TENSOR_MAX
-
The operation to be performed is a maximum comparison.
-
CUDNN_REDUCE_TENSOR_AMAX
-
The operation to be performed is a maximum comparison of absolute values.
-
CUDNN_REDUCE_TENSOR_AVG
-
The operation to be performed is averaging.
-
CUDNN_REDUCE_TENSOR_NORM1
-
The operation to be performed is addition of absolute values.
-
CUDNN_REDUCE_TENSOR_NORM2
-
The operation to be performed is a square root of sum of squares.
-
CUDNN_REDUCE_TENSOR_MUL_NO_ZEROS
-
The operation to be performed is multiplication, not including elements of value zero.
3.49. cudnnReorderType_t
typedef enum {
CUDNN_DEFAULT_REORDER = 0,
CUDNN_NO_REORDER = 1,
} cudnnReorderType_t;
cudnnReorderType_t
is an enumerated type to set the convolution reordering type. The reordering type can be set by cudnnSetConvolutionReorderType and its status can be read by cudnnGetConvolutionReorderType.
3.50. cudnnRNNAlgo_t
cudnnRNNAlgo_t
is an enumerated type used to specify the algorithm used in the cudnnRNNForwardInference()
, cudnnRNNForwardTraining()
, cudnnRNNBackwardData()
and cudnnRNNBackwardWeights()
routines.
Values
-
CUDNN_RNN_ALGO_STANDARD
- Each RNN layer is executed as a sequence of operations. This algorithm is expected to have robust performance across a wide range of network parameters.
-
CUDNN_RNN_ALGO_PERSIST_STATIC
-
The recurrent parts of the network are executed using a persistent kernel approach. This method is expected to be fast when the first dimension of the input tensor is small (meaning, a small minibatch).
CUDNN_RNN_ALGO_PERSIST_STATIC
is only supported on devices with compute capability >= 6.0. -
CUDNN_RNN_ALGO_PERSIST_DYNAMIC
-
The recurrent parts of the network are executed using a persistent kernel approach. This method is expected to be fast when the first dimension of the input tensor is small (ie. a small minibatch). When using
CUDNN_RNN_ALGO_PERSIST_DYNAMIC
persistent kernels are prepared at runtime and are able to optimize using the specific parameters of the network and active GPU. As such, when usingCUDNN_RNN_ALGO_PERSIST_DYNAMIC
a one-time plan preparation stage must be executed. These plans can then be reused in repeated calls with the same model parameters.The limits on the maximum number of hidden units supported when using
CUDNN_RNN_ALGO_PERSIST_DYNAMIC
are significantly higher than the limits when usingCUDNN_RNN_ALGO_PERSIST_STATIC
, however throughput is likely to significantly reduce when exceeding the maximums supported byCUDNN_RNN_ALGO_PERSIST_STATIC
. In this regime, this method will still outperformCUDNN_RNN_ALGO_STANDARD
for some cases.CUDNN_RNN_ALGO_PERSIST_DYNAMIC
is only supported on devices with compute capability >= 6.0 on Linux machines.
3.51. cudnnRNNBiasMode_t
cudnnRNNBiasMode_t
is an enumerated type used to specify the number of bias vectors for RNN functions. See the description of the cudnnRNNMode_t enumerated type for the equations for each cell type based on the bias mode.
Values
-
CUDNN_RNN_NO_BIAS
-
Applies RNN cell formulas that do not use biases.
-
CUDNN_RNN_SINGLE_INP_BIAS
-
Applies RNN cell formulas that use one input bias vector in the input GEMM.
-
CUDNN_RNN_DOUBLE_BIAS
-
Applies RNN cell formulas that use two bias vectors.
-
CUDNN_RNN_SINGLE_REC_BIAS
-
Applies RNN cell formulas that use one recurrent bias vector in the recurrent GEMM.
3.52. cudnnRNNClipMode_t
cudnnRNNClipMode_t
is an enumerated type used to select the LSTM cell clipping mode. It is used with cudnnRNNSetClip()
, cudnnRNNGetClip()
functions, and internally within LSTM cells.
Values
-
CUDNN_RNN_CLIP_NONE
-
Disables LSTM cell clipping.
-
CUDNN_RNN_CLIP_MINMAX
-
Enables LSTM cell clipping.
3.53. cudnnRNNDataDescriptor_t
cudnnRNNDataDescriptor_t
is a pointer to an opaque structure holding the description of an RNN data set. The function cudnnCreateRNNDataDescriptor()
is used to create one instance, and cudnnSetRNNDataDescriptor()
must be used to initialize this instance.
3.54. cudnnRNNDataLayout_t
cudnnRNNDataLayout_t
is an enumerated type used to select the RNN data layout. It is used used in the API calls cudnnGetRNNDataDescriptor and cudnnSetRNNDataDescriptor.
Values
-
CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED
-
Data layout is padded, with outer stride from one time-step to the next.
-
CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_PACKED
-
The sequence length is sorted and packed as in basic RNN API.
-
CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED
-
Data layout is padded, with outer stride from one batch to the next.
3.55. cudnnRNNDescriptor_t
cudnnRNNDescriptor_t
is a pointer to an opaque structure holding the description of an RNN operation. cudnnCreateRNNDescriptor()
is used to create one instance, and cudnnSetRNNDescriptor()
must be used to initialize this instance.
3.56. cudnnRNNInputMode_t
cudnnRNNInputMode_t
is an enumerated type used to specify the behavior of the first layer in the cudnnRNNForwardInference()
, cudnnRNNForwardTraining()
, cudnnRNNBackwardData()
and cudnnRNNBackwardWeights()
routines.
Values
-
CUDNN_LINEAR_INPUT
- A biased matrix multiplication is performed at the input of the first recurrent layer.
-
CUDNN_SKIP_INPUT
-
No operation is performed at the input of the first recurrent layer. If
CUDNN_SKIP_INPUT
is used the leading dimension of the input tensor must be equal to the hidden state size of the network.
3.57. cudnnRNNMode_t
cudnnRNNMode_t
is an enumerated type used to specify the type of network used in the cudnnRNNForwardInference, cudnnRNNForwardTraining, cudnnRNNBackwardData and cudnnRNNBackwardWeights routines.
Values
-
CUDNN_RNN_RELU
-
A single-gate recurrent neural network with a ReLU activation function.
In the forward pass, the output
ht
for a given iteration can be computed from the recurrent inputht-1
and the previous layer inputxt
, given the matricesW, R
and the bias vectors, whereReLU(x) = max(x, 0)
.If
cudnnRNNBiasMode_t biasMode
inrnnDesc
isCUDNN_RNN_DOUBLE_BIAS
(default mode), then the following equation with biasesbW
andbR
applies:ht = ReLU(Wixt + Riht-1 + bWi + bRi)
IfcudnnRNNBiasMode_t biasMode
inrnnDesc
isCUDNN_RNN_SINGLE_INP_BIAS
orCUDNN_RNN_SINGLE_REC_BIAS
, then the following equation with biasb
applies:ht = ReLU(Wixt + Riht-1 + bi)
IfcudnnRNNBiasMode_t biasMode
inrnnDesc
isCUDNN_RNN_NO_BIAS
, then the following equation applies:ht = ReLU(Wixt + Riht-1)
-
CUDNN_RNN_TANH
-
A single-gate recurrent neural network with a
tanh
activation function.In the forward pass, the output
ht
for a given iteration can be computed from the recurrent inputht-1
and the previous layer inputxt
, given the matricesW, R
and the bias vectors, and wheretanh
is the hyperbolic tangent function.If
cudnnRNNBiasMode_t biasMode
inrnnDesc
isCUDNN_RNN_DOUBLE_BIAS
(default mode), then the following equation with biasesbW
andbR
applies:ht = tanh(Wixt + Riht-1 + bWi + bRi)
IfcudnnRNNBiasMode_t biasMode
inrnnDesc
isCUDNN_RNN_SINGLE_INP_BIAS
orCUDNN_RNN_SINGLE_REC_BIAS
, then the following equation with biasb
applies:ht = tanh(Wixt + Riht-1 + bi)
IfcudnnRNNBiasMode_t biasMode
inrnnDesc
isCUDNN_RNN_NO_BIAS
, then the following equation applies:ht = tanh(Wixt + Riht-1)
-
CUDNN_LSTM
-
A four-gate Long Short-Term Memory network with no peephole connections.
In the forward pass, the output
ht
and cell outputct
for a given iteration can be computed from the recurrent inputht-1
, the cell inputct-1
and the previous layer inputxt
, given the matricesW, R
and the bias vectors.In addition, the following applies:
If
cudnnRNNBiasMode_t biasMode
inrnnDesc
isCUDNN_RNN_DOUBLE_BIAS
(default mode), then the following equations with biasesbW
andbR
apply:it = σ(Wixt + Riht-1 + bWi + bRi) ft = σ(Wfxt + Rfht-1 + bWf + bRf) ot = σ(Woxt + Roht-1 + bWo + bRo) c't = tanh(Wcxt + Rcht-1 + bWc + bRc) ct = ft ◦ ct-1 + it ◦ c't ht = ot ◦ tanh(ct)
IfcudnnRNNBiasMode_t biasMode
inrnnDesc
isCUDNN_RNN_SINGLE_INP_BIAS
orCUDNN_RNN_SINGLE_REC_BIAS
, then the following equations with biasb
apply:it = σ(Wixt + Riht-1 + bi) ft = σ(Wfxt + Rfht-1 + bf) ot = σ(Woxt + Roht-1 + bo) c't = tanh(Wcxt + Rcht-1 + bc) ct = ft ◦ ct-1 + it ◦ c't ht = ot ◦ tanh(ct)
IfcudnnRNNBiasMode_t biasMode
inrnnDesc
isCUDNN_RNN_NO_BIAS
, then the following equations apply:it = σ(Wixt + Riht-1) ft = σ(Wfxt + Rfht-1) ot = σ(Woxt + Roht-1) c't = tanh(Wcxt + Rcht-1) ct = ft ◦ ct-1 + it ◦ c't ht = ot◦tanh(ct)
-
CUDNN_GRU
-
A three-gate network consisting of Gated Recurrent Units.
In the forward pass, the output
ht
for a given iteration can be computed from the recurrent inputht-1
and the previous layer inputxt
given matricesW, R
and the bias vectors.In addition, the following applies:
If
cudnnRNNBiasMode_t biasMode
inrnnDesc
isCUDNN_RNN_DOUBLE_BIAS
(default mode), then the following equations with biasesbW
andbR
apply:it = σ(Wixt + Riht-1 + bWi + bRu) rt = σ(Wrxt + Rrht-1 + bWr + bRr) h't = tanh(Whxt + rt◦(Rhht-1 + bRh) + bWh) ht = (1 - it) ◦ h't + it ◦ ht-1
IfcudnnRNNBiasMode_t biasMode
inrnnDesc
isCUDNN_RNN_SINGLE_INP_BIAS
, then the following equations with biasb
apply:it = σ(Wixt + Riht-1 + bi) rt = σ(Wrxt + Rrht-1 + br) h't = tanh(Whxt + rt ◦ (Rhht-1) + bWh) ht = (1 - it) ◦ h't + it ◦ ht-1
IfcudnnRNNBiasMode_t biasMode
inrnnDesc
isCUDNN_RNN_SINGLE_REC_BIAS
, then the following equations with biasb
apply:it = σ(Wixt + Riht-1 + bi) rt = σ(Wrxt + Rrht-1 + br) h't = tanh(Whxt + rt ◦ (Rhht-1 + bRh)) ht = (1 - it) ◦ h't + it ◦ ht-1
IfcudnnRNNBiasMode_t biasMode
inrnnDesc
isCUDNN_RNN_NO_BIAS
, then the following equations apply:it = σ(Wixt + Riht-1) rt = σ(Wrxt + Rrht-1) h't = tanh(Whxt + rt ◦ (Rhht-1)) ht = (1 - it) ◦ h't + it ◦ ht-1
3.58. cudnnRNNPaddingMode_t
cudnnRNNPaddingMode_t
is an enumerated type used to enable or disable the padded input/output.
Values
-
CUDNN_RNN_PADDED_IO_DISABLED
- Disables the padded input/output.
-
CUDNN_RNN_PADDED_IO_ENABLED
- Enables the padded input/output.
3.59. cudnnSamplerType_t
cudnnSamplerType_t
is an enumerated type passed to cudnnSetSpatialTransformerNdDescriptor()
to select the sampler type to be used by cudnnSpatialTfSamplerForward()
and cudnnSpatialTfSamplerBackward()
.
Values
-
CUDNN_SAMPLER_BILINEAR
- Selects the bilinear sampler.
3.60. cudnnSeqDataAxis_t
cudnnSeqDataAxis_t
is an enumerated type that indexes active dimensions in the dimA[]
argument that is passed to the cudnnSetSeqDataDescriptor()
function to configure the sequence data descriptor of type cudnnSeqDataDescriptor_t
.
cudnnSeqDataAxis_t
constants are also used in the axis[]
argument of the cudnnSetSeqDataDescriptor()
call to define the layout of the sequence data buffer in memory.
See cudnnSetSeqDataDescriptor()
for a detailed description on how to use the cudnnSeqDataAxis_t
enumerated type.
The CUDNN_SEQDATA_DIM_COUNT
macro defines the number of constants in the cudnnSeqDataAxis_t
enumerated type. This value is currently set to 4
.
Values
-
CUDNN_SEQDATA_TIME_DIM
-
Identifies the
TIME
(sequence length) dimension or specifies theTIME
in the data layout. -
CUDNN_SEQDATA_BATCH_DIM
-
Identifies the
BATCH
dimension or specifies theBATCH
in the data layout. -
CUDNN_SEQDATA_BEAM_DIM
-
Identifies the
BEAM
dimension or specifies theBEAM
in the data layout. -
CUDNN_SEQDATA_VECT_DIM
-
Identifies the
VECT
(vector) dimension or specifies theVECT
in the data layout.
3.61. cudnnSeqDataDescriptor_t
cudnnSeqDataDescriptor_t
is a pointer to an opaque structure holding parameters of the sequence data container or buffer. The sequence data container is used to store fixed size vectors defined by the VECT
dimension. Vectors are arranged in additional three dimensions: TIME
, BATCH
and BEAM
.
The TIME
dimension is used to bundle vectors into sequences of vectors. The actual sequences can be shorter than the TIME
dimension, therefore, additional information is needed about each sequence length and how unused (padding) vectors should be saved.
It is assumed that the sequence data container is fully packed. The TIME
, BATCH
and BEAM
dimensions can be in any order when vectors are traversed in the ascending order of addresses. Six data layouts (permutation of TIME
, BATCH
and BEAM
) are possible.
The cudnnSeqDataDescriptor_t
object holds the following parameters:
- data type used by vectors
TIME
,BATCH
,BEAM
andVECT
dimensions- data layout
- the length of each sequence along the
TIME
dimension - an optional value to be copied to output padding vectors
Use the cudnnCreateSeqDataDescriptor()
function to create one instance of the sequence data descriptor object and cudnnDestroySeqDataDescriptor()
to delete a previously created descriptor. Use the cudnnSetSeqDataDescriptor()
function to configure the descriptor.
This descriptor is used by multi-head attention API functions.
3.62. cudnnSoftmaxAlgorithm_t
cudnnSoftmaxAlgorithm_t
is used to select an implementation of the softmax function used in cudnnSoftmaxForward()
and cudnnSoftmaxBackward()
.
Values
-
CUDNN_SOFTMAX_FAST
-
This implementation applies the straightforward softmax operation.
-
CUDNN_SOFTMAX_ACCURATE
-
This implementation scales each point of the softmax input domain by its maximum value to avoid potential floating point overflows in the softmax evaluation.
-
CUDNN_SOFTMAX_LOG
-
This entry performs the log softmax operation, avoiding overflows by scaling each point in the input domain as in
CUDNN_SOFTMAX_ACCURATE
.
3.63. cudnnSoftmaxMode_t
cudnnSoftmaxMode_t
is used to select over which data the cudnnSoftmaxForward()
and cudnnSoftmaxBackward()
are computing their results.
Values
-
CUDNN_SOFTMAX_MODE_INSTANCE
-
The softmax operation is computed per image (
N
) across the dimensions C,H,W. -
CUDNN_SOFTMAX_MODE_CHANNEL
-
The softmax operation is computed per spatial location (
H,W
) per image (N
) across the dimensionC
.
3.64. cudnnSpatialTransformerDescriptor_t
cudnnSpatialTransformerDescriptor_t
is a pointer to an opaque structure holding the description of a spatial transformation operation. cudnnCreateSpatialTransformerDescriptor()
is used to create one instance, cudnnSetSpatialTransformerNdDescriptor()
is used to initialize this instance, and cudnnDestroySpatialTransformerDescriptor()
is used to destroy this instance.
3.65. cudnnStatus_t
cudnnStatus_t
is an enumerated type used for function status returns. All cuDNN library functions return their status, which can be one of the following values:
Values
-
CUDNN_STATUS_SUCCESS
-
The operation completed successfully.
-
CUDNN_STATUS_NOT_INITIALIZED
-
The cuDNN library was not initialized properly. This error is usually returned when a call to
cudnnCreate()
fails or whencudnnCreate()
has not been called prior to calling another cuDNN routine. In the former case, it is usually due to an error in the CUDA Runtime API called bycudnnCreate()
or by an error in the hardware setup. -
CUDNN_STATUS_ALLOC_FAILED
-
Resource allocation failed inside the cuDNN library. This is usually caused by an internal
cudaMalloc()
failure.To correct: prior to the function call, deallocate previously allocated memory as much as possible.
-
CUDNN_STATUS_BAD_PARAM
-
An incorrect value or parameter was passed to the function.
To correct: ensure that all the parameters being passed have valid values.
-
CUDNN_STATUS_ARCH_MISMATCH
-
The function requires a feature absent from the current GPU device. Note that cuDNN only supports devices with compute capabilities greater than or equal to 3.0.
To correct: compile and run the application on a device with appropriate compute capability.
-
CUDNN_STATUS_MAPPING_ERROR
-
An access to GPU memory space failed, which is usually caused by a failure to bind a texture.
To correct: prior to the function call, unbind any previously bound textures.
Otherwise, this may indicate an internal error/bug in the library.
-
CUDNN_STATUS_EXECUTION_FAILED
-
The GPU program failed to execute. This is usually caused by a failure to launch some cuDNN kernel on the GPU, which can occur for multiple reasons.
To correct: check that the hardware, an appropriate version of the driver, and the cuDNN library are correctly installed.
Otherwise, this may indicate an internal error/bug in the library.
-
CUDNN_STATUS_INTERNAL_ERROR
-
An internal cuDNN operation failed.
-
CUDNN_STATUS_NOT_SUPPORTED
-
The functionality requested is not presently supported by cuDNN.
-
CUDNN_STATUS_LICENSE_ERROR
-
The functionality requested requires some license and an error was detected when trying to check the current licensing. This error can happen if the license is not present or is expired or if the environment variable
NVIDIA_LICENSE_FILE
is not set properly. -
CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING
-
Runtime library required by RNN calls (
libcuda.so
ornvcuda.dll
) cannot be found in predefined search paths. -
CUDNN_STATUS_RUNTIME_IN_PROGRESS
-
Some tasks in the user stream are not completed.
-
CUDNN_STATUS_RUNTIME_FP_OVERFLOW
-
Numerical overflow occurred during the GPU kernel execution.
3.66. cudnnTensorDescriptor_t
cudnnCreateTensorDescriptor_t
is a pointer to an opaque structure holding the description of a generic n-D dataset. cudnnCreateTensorDescriptor()
is used to create one instance, and one of the routines cudnnSetTensorNdDescriptor()
, cudnnSetTensor4dDescriptor()
or cudnnSetTensor4dDescriptorEx()
must be used to initialize this instance.
3.67. cudnnTensorFormat_t
cudnnTensorFormat_t
is an enumerated type used by cudnnSetTensor4dDescriptor()
to create a tensor with a pre-defined layout. For a detailed explanation of how these tensors are arranged in memory, see Data Layout Formats.
Values
-
CUDNN_TENSOR_NCHW
-
This tensor format specifies that the data is laid out in the following order: batch size, feature maps, rows, columns. The strides are implicitly defined in such a way that the data are contiguous in memory with no padding between images, feature maps, rows, and columns; the columns are the inner dimension and the images are the outermost dimension.
-
CUDNN_TENSOR_NHWC
-
This tensor format specifies that the data is laid out in the following order: batch size, rows, columns, feature maps. The strides are implicitly defined in such a way that the data are contiguous in memory with no padding between images, rows, columns, and feature maps; the feature maps are the inner dimension and the images are the outermost dimension.
-
CUDNN_TENSOR_NCHW_VECT_C
-
This tensor format specifies that the data is laid out in the following order: batch size, feature maps, rows, columns. However, each element of the tensor is a vector of multiple feature maps. The length of the vector is carried by the data type of the tensor. The strides are implicitly defined in such a way that the data are contiguous in memory with no padding between images, feature maps, rows, and columns; the columns are the inner dimension and the images are the outermost dimension. This format is only supported with tensor data types
CUDNN_DATA_INT8x4
,CUDNN_DATA_INT8x32
, andCUDNN_DATA_UINT8x4
.The
CUDNN_TENSOR_NCHW_VECT_C
can also be interpreted in the following way: The NCHW INT8x32 format is really N x (C/32) x H x W x 32 (32 Cs for every W), just as the NCHW INT8x4 format is N x (C/4) x H x W x 4 (4 Cs for every W). Hence, theVECT_C
name - each W is a vector (4 or 32) of Cs.
3.68. cudnnTensorTransformDescriptor_t
cudnnTensorTransformDescriptor_t
is an opaque structure containing the description of the tensor transform. Use the cudnnCreateTensorTransformDescriptor function to create an instance of this descriptor, and cudnnDestroyTensorTransformDescriptor function to destroy a previously created instance.
3.69. cudnnWgradMode_t
cudnnWgradMode_t
is an enumerated type that selects how buffers holding gradients of the loss function, computed with respect to trainable parameters, are updated. Currently, this type is used by the cudnnGetMultiHeadAttnWeights()
function only.
Values
-
CUDNN_WGRAD_MODE_ADD
-
A weight gradient component corresponding to a new batch of inputs is added to previously evaluated weight gradients. Before using this mode, the buffer holding weight gradients should be initialized to zero. Alternatively, the first API call outputting to an uninitialized buffer should use the
CUDNN_WGRAD_MODE_SET
option. -
CUDNN_WGRAD_MODE_SET
- A weight gradient component, corresponding to a new batch of inputs, overwrites previously stored weight gradients in the output buffer.
This chapter describes the API of all the routines of the cuDNN library.
4.1. cudnnActivationBackward
cudnnStatus_t cudnnActivationBackward(
cudnnHandle_t handle,
cudnnActivationDescriptor_t activationDesc,
const void *alpha,
const cudnnTensorDescriptor_t yDesc,
const void *y,
const cudnnTensorDescriptor_t dyDesc,
const void *dy,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const void *beta,
const cudnnTensorDescriptor_t dxDesc,
void *dx)
This routine computes the gradient of a neuron activation function.
- In-place operation is allowed for this routine; meaning
dy
anddx
pointers may be equal. However, this requires the corresponding tensor descriptors to be identical (particularly, the strides of the input and output must match for an in-place operation to be allowed). - All tensor formats are supported for 4 and 5 dimensions, however, the best performance is obtained when the strides of
yDesc
andxDesc
are equal andHW-packed
. For more than 5 dimensions the tensors must have their spatial dimensions packed.
Parameters
-
handle
-
Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.
-
activationDesc
-
Input. Activation descriptor. See cudnnActivationDescriptor_t.
-
alpha
,beta
-
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue
For more information, see Scaling Parameters in the cuDNN Developer Guide.
-
yDesc
-
Input. Handle to the previously initialized input tensor descriptor. For more information, see cudnnTensorDescriptor_t.
-
y
-
Input. Data pointer to GPU memory associated with the tensor descriptor
yDesc
. -
dyDesc
-
Input. Handle to the previously initialized input differential tensor descriptor.
-
dy
-
Input. Data pointer to GPU memory associated with the tensor descriptor
dyDesc
. -
xDesc
-
Input. Handle to the previously initialized output tensor descriptor.
-
x
-
Input. Data pointer to GPU memory associated with the output tensor descriptor
xDesc
. -
dxDesc
-
Input. Handle to the previously initialized output differential tensor descriptor.
-
dx
-
Output. Data pointer to GPU memory associated with the output tensor descriptor
dxDesc
.
Returns
-
CUDNN_STATUS_SUCCESS
-
The function launched successfully.
-
CUDNN_STATUS_BAD_PARAM
-
At least one of the following conditions are met:
- The strides
nStride, cStride, hStride, wStride
of the input differential tensor and output differential tensor differ and in-place operation is used.
- The strides
-
CUDNN_STATUS_NOT_SUPPORTED
-
The function does not support the provided configuration. See the following for some examples of non-supported configurations:
- The dimensions
n, c, h, w
of the input tensor and output tensor differ. - The
datatype
of the input tensor and output tensor differs. - The strides
nStride, cStride, hStride, wStride
of the input tensor and the input differential tensor differ. - The strides
nStride, cStride, hStride, wStride
of the output tensor and the output differential tensor differ.
- The dimensions
-
CUDNN_STATUS_EXECUTION_FAILED
-
The function failed to launch on the GPU.
4.2. cudnnActivationForward
cudnnStatus_t cudnnActivationForward(
cudnnHandle_t handle,
cudnnActivationDescriptor_t activationDesc,
const void *alpha,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const void *beta,
const cudnnTensorDescriptor_t yDesc,
void *y)
This routine applies a specified neuron activation function element-wise over each input value.
- In-place operation is allowed for this routine; meaning,
xData
andyData
pointers may be equal. However, this requiresxDesc
andyDesc
descriptors to be identical (particularly, the strides of the input and output must match for an in-place operation to be allowed). - All tensor formats are supported for 4 and 5 dimensions, however, the best performance is obtained when the strides of
xDesc
andyDesc
are equal andHW-packed
. For more than 5 dimensions the tensors must have their spatial dimensions packed.
Parameters
-
handle
-
Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.
-
activationDesc
-
Input. Activation descriptor. For more information, see cudnnActivationDescriptor_t.
-
alpha
,beta
-
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue
For more information, see Scaling Parameters in the cuDNN Developer Guide.
-
xDesc
-
Input. Handle to the previously initialized input tensor descriptor. For more information, see cudnnTensorDescriptor_t.
-
x
-
Input. Data pointer to GPU memory associated with the tensor descriptor
xDesc
. -
yDesc
-
Input. Handle to the previously initialized output tensor descriptor.
-
y
-
Output. Data pointer to GPU memory associated with the output tensor descriptor
yDesc
.
Returns
-
CUDNN_STATUS_SUCCESS
-
The function launched successfully.
-
CUDNN_STATUS_NOT_SUPPORTED
-
The function does not support the provided configuration.
-
CUDNN_STATUS_BAD_PARAM
-
At least one of the following conditions are met:
- The parameter
mode
has an invalid enumerant value. - The dimensions
n, c, h, w
of the input tensor and output tensor differ. - The
datatype
of the input tensor and output tensor differs. - The strides
nStride, cStride, hStride, wStride
of the input tensor and output tensor differ and in-place operation is used (meaning,x
andy
pointers are equal).
- The parameter
-
CUDNN_STATUS_EXECUTION_FAILED
-
The function failed to launch on the GPU.
4.3. cudnnAddTensor
cudnnStatus_t cudnnAddTensor(
cudnnHandle_t handle,
const void *alpha,
const cudnnTensorDescriptor_t aDesc,
const void *A,
const void *beta,
const cudnnTensorDescriptor_t cDesc,
void *C)
This function adds the scaled values of a bias tensor to another tensor. Each dimension of the bias tensor A
must match the corresponding dimension of the destination tensor C
or must be equal to 1. In the latter case, the same value from the bias tensor for those dimensions will be used to blend into the C
tensor.
Up to dimension 5, all tensor formats are supported. Beyond those dimensions, this routine is not supported
Parameters
-
handle
-
Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.
-
alpha
,beta
-
Input. Pointers to scaling factors (in host memory) used to blend the source value with the prior value in the destination tensor as follows:
dstValue = alpha[0]*srcValue + beta[0]*priorDstValue
For more information, see Scaling Parameters in the cuDNN Developer Guide.
-
aDesc
-
Input. Handle to a previously initialized tensor descriptor. For more information, see cudnnTensorDescriptor_t.
-
A
-
Input. Pointer to data of the tensor described by the
aDesc
descriptor. -
cDesc
-
Input. Handle to a previously initialized tensor descriptor.
-
C
-
Input/Output. Pointer to data of the tensor described by the
cDesc
descriptor.
Returns
-
CUDNN_STATUS_SUCCESS
-
The function executed successfully.
-
CUDNN_STATUS_NOT_SUPPORTED
-
The function does not support the provided configuration.
-
CUDNN_STATUS_BAD_PARAM
-
The dimensions of the bias tensor refer to an amount of data that is incompatible with the output tensor dimensions or the
dataType
of the two tensor descriptors are different. -
CUDNN_STATUS_EXECUTION_FAILED
-
The function failed to launch on the GPU.
4.4. cudnnBatchNormalizationBackward
cudnnStatus_t cudnnBatchNormalizationBackward(
cudnnHandle_t handle,
cudnnBatchNormMode_t mode,
const void *alphaDataDiff,
const void *betaDataDiff,
const void *alphaParamDiff,
const void *betaParamDiff,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const cudnnTensorDescriptor_t dyDesc,
const void *dy,
const cudnnTensorDescriptor_t dxDesc,
void *dx,
const cudnnTensorDescriptor_t bnScaleBiasDiffDesc,
const void *bnScale,
void *resultBnScaleDiff,
void *resultBnBiasDiff,
double epsilon,
const void *savedMean,
const void *savedInvVariance)
This function performs the backward batch normalization layer computation. This layer is based on the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe, C. Szegedy, 2015. .
For more information, see cudnnDeriveBNTensorDescriptor
for the secondary tensor descriptor generation for the parameters used in this function.
Parameters
-
handle
-
Input. Handle to a previously created cuDNN library descriptor. For more information, see cudnnHandle_t.
-
mode
-
Input. Mode of operation (spatial or per-activation). For more information, see cudnnBatchNormMode_t.
-
*alphaDataDiff
,*betaDataDiff
-
Inputs. Pointers to scaling factors (in host memory) used to blend the gradient output
dx
with a prior value in the destination tensor as follows:dstValue = alphaDataDiff[0]*resultValue + betaDataDiff[0]*priorDstValue
For more information, see Scaling Parameters in the cuDNN Developer Guide.
-
*alphaParamDiff
,*betaParamDiff
-
Inputs. Pointers to scaling factors (in host memory) used to blend the gradient outputs
resultBnScaleDiff
andresultBnBiasDiff
with prior values in the destination tensor as follows:dstValue = alphaParamDiff[0]*resultValue + betaParamDiff[0]*priorDstValue
For more information, see Scaling Parameters.
-
xDesc, dxDesc, dyDesc
-
Inputs. Handles to the previously initialized tensor descriptors.
-
*x
-
Input. Data pointer to GPU memory associated with the tensor descriptor
xDesc
, for the layer’sx
data. -
*dy
-
Inputs. Data pointer to GPU memory associated with the tensor descriptor
dyDesc
, for the backpropagated differentialdy
input. -
*dx
-
Inputs. Data pointer to GPU memory associated with the tensor descriptor
dxDesc
, for the resulting differential output with respect tox
. -
bnScaleBiasDiffDesc
-
Input. Shared tensor descriptor for the following five tensors:
bnScale, resultBnScaleDiff, resultBnBiasDiff, savedMean, savedInvVariance
. The dimensions for this tensor descriptor are dependent on normalization mode. For more information, see cudnnDeriveBNTensorDescriptor.Note:The data type of this tensor descriptor must be
float
for FP16 and FP32 input tensors, anddouble
for FP64 input tensors.
-
*bnScale
-
Input. Pointer in the device memory for the batch normalization
scale
parameter (in the original paper the quantityscale
is referred to as gamma).Note:The
bnBias
parameter is not needed for this layer's computation. -
resultBnScaleDiff
,resultBnBiasDiff
- Outputs. Pointers in device memory for the resulting scale and bias differentials computed by this routine. Note that these scale and bias gradients are weight gradients specific to this batch normalization operation, and by definition are not backpropagated.
-
epsilon
-
Input. Epsilon value used in batch normalization formula. Its value should be equal to or greater than the value defined for
CUDNN_BN_MIN_EPSILON
incudnn.h
. The sameepsilon
value should be used in forward and backward functions. -
*savedMean
,*savedInvVariance
-
Inputs. Optional cache parameters containing saved intermediate results that were computed during the forward pass. For this to work correctly, the layer's
x
andbnScale
data have to remain unchanged until this backward function is called.Note:Both these parameters can be
NULL
but only at the same time. It is recommended to use this cache since the memory overhead is relatively small.
Returns
-
CUDNN_STATUS_SUCCESS
-
The computation was performed successfully.
-
CUDNN_STATUS_NOT_SUPPORTED
-
The function does not support the provided configuration.
-
CUDNN_STATUS_BAD_PARAM
-
At least one of the following conditions are met:
- Any of the pointers
alpha, beta, x, dy, dx, bnScale, resultBnScaleDiff, resultBnBiasDiff
isNULL
. - The number of
xDesc
oryDesc
ordxDesc
tensor descriptor dimensions is not within the range of [4,5] (only 4D and 5D tensors are supported). bnScaleBiasDiffDesc
dimensions are not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for per-activation mode.- Exactly one of
savedMean
,savedInvVariance
pointers isNULL
. epsilon
value is less thanCUDNN_BN_MIN_EPSILON
.- Dimensions or data types mismatch for any pair of
xDesc, dyDesc, dxDesc
.
- Any of the pointers
4.5. cudnnBatchNormalizationBackwardEx
cudnnStatus_t cudnnBatchNormalizationBackwardEx (
cudnnHandle_t handle,
cudnnBatchNormMode_t mode,
cudnnBatchNormOps_t bnOps,
const void *alphaDataDiff,
const void *betaDataDiff,
const void *alphaParamDiff,
const void *betaParamDiff,
const cudnnTensorDescriptor_t xDesc,
const void *xData,
const cudnnTensorDescriptor_t yDesc,
const void *yData,
const cudnnTensorDescriptor_t dyDesc,
const void *dyData,
const cudnnTensorDescriptor_t dzDesc,
void *dzData,
const cudnnTensorDescriptor_t dxDesc,
void *dxData,
const cudnnTensorDescriptor_t dBnScaleBiasDesc,
const void *bnScaleData,
const void *bnBiasData,
void *dBnScaleData,
void *dBnBiasData,
double epsilon,
const void *savedMean,
const void *savedInvVariance,
const cudnnActivationDescriptor_t activationDesc,
void *workspace,
size_t workSpaceSizeInBytes
void *reserveSpace
size_t reserveSpaceSizeInBytes);
This function is an extension of the cudnnBatchNormalizationBackward for performing the backward batch normalization layer computation with a fast NHWC semi-persistent kernel. This API will trigger the new semi-persistent NHWC kernel when the following conditions are true:
- All tensors, namely,
x, y, dz, dy, dx
must be NHWC-fully packed, and must be of the typeCUDNN_DATA_HALF
. - The tensor C dimension should be a multiple of 4.
- The input parameter
mode
must be set toCUDNN_BATCHNORM_SPATIAL_PERSISTENT
. workspace
is notNULL
.workSpaceSizeInBytes
is equal or larger than the amount required bycudnnGetBatchNormalizationBackwardExWorkspaceSize()
.reserveSpaceSizeInBytes
is equal or larger than the amount required bycudnnGetBatchNormalizationTrainingExReserveSpaceSize()
.- The content in
reserveSpace
stored by cudnnBatchNormalizationForwardTrainingEx must be preserved.
If workspace
is NULL
and workSpaceSizeInBytes
of zero is passed in, this API will function exactly like the non-extended function cudnnBatchNormalizationBackward
.
This workspace
is not required to be clean. Moreover, the workspace
does not have to remain unchanged between the forward and backward pass, as it is not used for passing any information.
This extended function can accept a *workspace
pointer to the GPU workspace, and workSpaceSizeInBytes
, the size of the workspace, from the user.
The bnOps
input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by element-wise addition and then activation.
Only 4D and 5D tensors are supported. The epsilon
value has to be the same during the training, the backpropagation, and the inference.
When the tensor layout is NCHW, higher performance can be obtained when HW-packed tensors are used for x, dy, dx
.
Parameters
-
handle
-
Input. Handle to a previously created cuDNN library descriptor. For more information, see cudnnHandle_t.
-
mode
-
Input. Mode of operation (spatial or per-activation). For more information, see cudnnBatchNormMode_t.
-
bnOps
- Input. Mode of operation for the fast NHWC kernel. For more information, see cudnnBatchNormOps_t. This input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by element-wise addition and then activation.
-
*alphaDataDiff
,*betaDataDiff
-
Inputs. Pointers to scaling factors (in host memory) used to blend the gradient output
dx
with a prior value in the destination tensor as follows:dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, see Scaling Parameters in the cuDNN Developer Guide.
-
*alphaParamDiff
,*betaParamDiff
-
Inputs. Pointers to scaling factors (in host memory) used to blend the gradient outputs
dBnScaleData
anddBnBiasData
with prior values in the destination tensor as follows:dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, see Scaling Parameters in the cuDNN Developer Guide.
-
xDesc
,*x
,yDesc
,*yData
,dyDesc
,*dyData
,dzDesc
,*dzData
,dxDesc
,*dx/dt
-
Inputs. Tensor descriptors and pointers in the device memory for the layer's
x
data, backpropagated differentialdy
(inputs), the optionaly
input data, the optionaldz
output, and thedx
output, which is the resulting differential with respect tox
. For more information, see cudnnTensorDescriptor_t. -
dBnScaleBiasDesc
-
Input. Shared tensor descriptor for the following six tensors:
bnScaleData
,bnBiasData
,dBnScaleData
,dBnBiasData
,savedMean
, andsavedInvVariance
. For more information, see cudnnDeriveBNTensorDescriptor.The dimensions for this tensor descriptor are dependent on normalization mode.
Note:Note: The data type of this tensor descriptor must be
float
for FP16 and FP32 input tensors anddouble
for FP64 input tensors.For more information, see cudnnTensorDescriptor_t.
-
*bnScaleData
-
Input. Pointer in the device memory for the batch normalization scale parameter (in the original paper the quantity scale is referred to as gamma).
-
*bnBiasData
- Input. Pointers in the device memory for the batch normalization bias parameter (in the original paper bias is referred to as beta). This parameter is used only when activation should be performed.
-
*dBnScaleData
,dBnBiasData
-
Inputs. Pointers in the device memory for the gradients of
bnScaleData
andbnBiasData
, respectively. -
epsilon
-
Input. Epsilon value used in batch normalization formula. Its value should be equal to or greater than the value defined for
CUDNN_BN_MIN_EPSILON
incudnn.h
. The same epsilon value should be used in forward and backward functions. -
*savedMean
,*savedInvVariance
-
Inputs. Optional cache parameters containing saved intermediate results computed during the forward pass. For this to work correctly, the layer's
x
andbnScaleData
,bnBiasData
data has to remain unchanged until this backward function is called. Note that both these parameters can beNULL
but only at the same time. It is recommended to use this cache since the memory overhead is relatively small. -
activationDesc
- Input. The tensor descriptor for the activation operation.
-
workspace
-
Input. Pointer to the GPU workspace. If
workspace
isNULL
andworkSpaceSizeInBytes
of zero is passed in, then this API will function exactly like the non-extended function cudnnBatchNormalizationBackward. -
workSpaceSizeInBytes
- Input. The size of the workspace. It must be large enough to trigger the fast NHWC semi-persistent kernel by this function.
-
*reserveSpace
-
Input. Pointer to the GPU workspace for the
reserveSpace
. -
reserveSpaceSizeInBytes
-
Input. The size of the
reserveSpace
. It must be equal or larger than the amount required bycudnnGetBatchNormalizationTrainingExReserveSpaceSize()
.
Returns
-
CUDNN_STATUS_SUCCESS
-
The computation was performed successfully.
-
CUDNN_STATUS_NOT_SUPPORTED
-
The function does not support the provided configuration.
-
CUDNN_STATUS_BAD_PARAM
-
At least one of the following conditions are met:
- Any of the pointers
alphaDataDiff, betaDataDiff, alphaParamDiff, betaParamDiff, x, dy, dx, bnScale, resultBnScaleDiff, resultBnBiasDiff
isNULL
. - The number of
xDesc
oryDesc
ordxDesc
tensor descriptor dimensions is not within the range of [4,5] (only 4D and 5D tensors are supported). dBnScaleBiasDesc
dimensions not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for per-activation mode.- Exactly one of
savedMean
,savedInvVariance
pointers isNULL
. epsilon
value is less thanCUDNN_BN_MIN_EPSILON
.- Dimensions or data types mismatch for any pair of
xDesc
,dyDesc
,dxDesc
.
- Any of the pointers
4.6. cudnnBatchNormalizationForwardInference
cudnnStatus_t cudnnBatchNormalizationForwardInference(
cudnnHandle_t handle,
cudnnBatchNormMode_t mode,
const void *alpha,
const void *beta,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const cudnnTensorDescriptor_t yDesc,
void *y,
const cudnnTensorDescriptor_t bnScaleBiasMeanVarDesc,
const void *bnScale,
const void *bnBias,
const void *estimatedMean,
const void *estimatedVariance,
double epsilon)
This function performs the forward batch normalization layer computation for the inference phase. This layer is based on the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe, C. Szegedy, 2015.
- Only 4D and 5D tensors are supported.
- The input transformation performed by this function is defined as:
y = beta*y + alpha *[bnBias + (bnScale * (x-estimatedMean)/sqrt(epsilon + estimatedVariance)]
- The
epsilon
value has to be the same during training, backpropagation and inference. - For the training phase, use cudnnBatchNormalizationForwardTraining.
- Higher performance can be obtained when HW-packed tensors are used for all of
x
anddx
.
For more information, see cudnnDeriveBNTensorDescriptor for the secondary tensor descriptor generation for the parameters used in this function.
Parameters
-
handle
-
Input. Handle to a previously created cuDNN library descriptor. For more information, see cudnnHandle_t.
-
mode
-
Input. Mode of operation (spatial or per-activation). For more information, see cudnnBatchNormMode_t.
-
alpha
,beta
-
Inputs. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, see Scaling Parameters in the cuDNN Developer Guide.
-
xDesc
,yDesc
-
Input. Handles to the previously initialized tensor descriptors.
-
*x
-
Input. Data pointer to GPU memory associated with the tensor descriptor
xDesc
, for the layer’sx
input data. -
*y
-
Input. Data pointer to GPU memory associated with the tensor descriptor
yDesc
, for they
output of the batch normalization layer. -
bnScaleBiasMeanVarDesc
,bnScale
,bnBias
-
Inputs. Tensor descriptors and pointers in device memory for the batch normalization scale and bias parameters (in the original paper bias is referred to as beta and scale as gamma).
-
estimatedMean
,estimatedVariance
-
Inputs. Mean and variance tensors (these have the same descriptor as the bias and scale). The
resultRunningMean
andresultRunningVariance
, accumulated during the training phase from thecudnnBatchNormalizationForwardTraining()
call, should be passed as inputs here. -
epsilon
-
Input. Epsilon value used in the batch normalization formula. Its value should be equal to or greater than the value defined for
CUDNN_BN_MIN_EPSILON
incudnn.h
.
Returns
-
CUDNN_STATUS_SUCCESS
-
The computation was performed successfully.
-
CUDNN_STATUS_NOT_SUPPORTED
-
The function does not support the provided configuration.
-
CUDNN_STATUS_BAD_PARAM
-
At least one of the following conditions are met:
- One of the pointers
alpha, beta, x, y, bnScale, bnBias, estimatedMean, estimatedInvVariance
isNULL
. - The number of
xDesc
oryDesc
tensor descriptor dimensions is not within the range of [4,5] (only 4D and 5D tensors are supported.) bnScaleBiasMeanVarDesc
dimensions are not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for per-activation mode.epsilon
value is less thanCUDNN_BN_MIN_EPSILON
.- Dimensions or data types mismatch for
xDesc
,yDesc.
- One of the pointers
4.7. cudnnBatchNormalizationForwardTraining
cudnnStatus_t cudnnBatchNormalizationForwardTraining(
cudnnHandle_t handle,
cudnnBatchNormMode_t mode,
const void *alpha,
const void *beta,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const cudnnTensorDescriptor_t yDesc,
void *y,
const cudnnTensorDescriptor_t bnScaleBiasMeanVarDesc,
const void *bnScale,
const void *bnBias,
double exponentialAverageFactor,
void *resultRunningMean,
void *resultRunningVariance,
double epsilon,
void *resultSaveMean,
void *resultSaveInvVariance)
This function performs the forward batch normalization layer computation for the training phase. This layer is based on the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe, C. Szegedy, 2015.
- Only 4D and 5D tensors are supported.
- The epsilon value has to be the same during training, backpropagation, and inference.
- For the inference phase, use cudnnBatchNormalizationForwardInference.
- Higher performance can be obtained when HW-packed tensors are used for both x and y.
See cudnnDeriveBNTensorDescriptor for the secondary tensor descriptor generation for the parameters used in this function.
Parameters
-
handle
-
Handle to a previously created cuDNN library descriptor. For more information, see cudnnHandle_t.
-
mode
-
Mode of operation (spatial or per-activation). For more information, see cudnnBatchNormMode_t.
-
alpha
,beta
-
Inputs. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, see Scaling Parameters in the cuDNN Developer Guide.
-
xDesc
,yDesc
-
Tensor descriptors and pointers in device memory for the layer's
x
andy
data. For more information, see cudnnTensorDescriptor_t. -
*x
-
Input. Data pointer to GPU memory associated with the tensor descriptor
xDesc
, for the layer’sx
input data. -
*y
-
Input. Data pointer to GPU memory associated with the tensor descriptor
yDesc
, for they
output of the batch normalization layer. -
bnScaleBiasMeanVarDesc
-
Shared tensor descriptor
desc
for the secondary tensor that was derived by cudnnDeriveBNTensorDescriptor. The dimensions for this tensor descriptor are dependent on the normalization mode. -
bnScale
,bnBias
-
Inputs. Pointers in device memory for the batch normalization scale and bias parameters (in the original paper bias is referred to as beta and scale as gamma). Note that
bnBias
parameter can replace the previous layer's bias parameter for improved efficiency. -
exponentialAverageFactor
-
Input. Factor used in the moving average computation as follows:
runningMean = runningMean*(1-factor) + newMean*factor
factor=1/(1+n)
atN
-th call to the function to get Cumulative Moving Average (CMA) behavior such that:CMA[n] = (x[1]+...+x[n])/n
This is proved below:
Writing
CMA[n+1] = (n*CMA[n]+x[n+1])/(n+1) = ((n+1)*CMA[n]-CMA[n])/(n+1) + x[n+1]/(n+1) = CMA[n]*(1-1/(n+1))+x[n+1]*1/(n+1) = CMA[n]*(1-factor) + x(n+1)*factor
-
resultRunningMean
,resultRunningVariance
-
Inputs/Outputs. Running mean and variance tensors (these have the same descriptor as the bias and scale). Both of these pointers can be
NULL
but only at the same time. The value stored inresultRunningVariance
(or passed as an input in inference mode) is the sample variance and is the moving average ofvariance[x]
where the variance is computed either over batch or spatial+batch dimensions depending on the mode. If these pointers are notNULL
, the tensors should be initialized to some reasonable values or to 0. -
epsilon
-
Input. Epsilon value used in the batch normalization formula. Its value should be equal to or greater than the value defined for
CUDNN_BN_MIN_EPSILON
incudnn.h
. The sameepsilon
value should be used in forward and backward functions. -
resultSaveMean
,resultSaveInvVariance
-
Outputs. Optional cache to save intermediate results computed during the forward pass. These buffers can be used to speed up the backward pass when supplied to the cudnnBatchNormalizationBackward function. The intermediate results stored in
resultSaveMean
andresultSaveInvVariance
buffers should not be used directly by the user. Depending on the batch normalization mode, the results stored inresultSaveInvVariance
may vary. For the cache to work correctly, the input layer data must remain unchanged until the backward function is called. Note that both parameters can beNULL
but only at the same time. In such a case, intermediate statistics will not be saved, and cudnnBatchNormalizationBackward will have to re-compute them. It is recommended to use this cache as the memory overhead is relatively small because these tensors have a much lower product of dimensions than the data tensors.
Returns
-
CUDNN_STATUS_SUCCESS
-
The computation was performed successfully.
-
CUDNN_STATUS_NOT_SUPPORTED
-
The function does not support the provided configuration.
-
CUDNN_STATUS_BAD_PARAM
-
At least one of the following conditions are met:
- One of the pointers
alpha, beta, x, y, bnScale, bnBias
isNULL
. - The number of
xDesc
oryDesc
tensor descriptor dimensions is not within the range of [4,5] (only 4D and 5D tensors are supported). bnScaleBiasMeanVarDesc
dimensions are not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for per-activation mode.- Exactly one of
resultSaveMean
,resultSaveInvVariance
pointers areNULL
. - Exactly one of
resultRunningMean
,resultRunningInvVariance
pointers areNULL
. epsilon
value is less thanCUDNN_BN_MIN_EPSILON
.- Dimensions or data types mismatch for
xDesc
,yDesc
- One of the pointers
4.8. cudnnBatchNormalizationForwardTrainingEx
cudnnStatus_t cudnnBatchNormalizationForwardTrainingEx(
cudnnHandle_t handle,
cudnnBatchNormMode_t mode,
cudnnBatchNormOps_t bnOps,
const void *alpha,
const void *beta,
const cudnnTensorDescriptor_t xDesc,
const void *xData,
const cudnnTensorDescriptor_t zDesc,
const void *zData,
const cudnnTensorDescriptor_t yDesc,
void *yData,
const cudnnTensorDescriptor_t bnScaleBiasMeanVarDesc,
const void *bnScaleData,
const void *bnBiasData,
double exponentialAverageFactor,
void *resultRunningMeanData,
void *resultRunningVarianceData,
double epsilon,
void *saveMean,
void *saveInvVariance,
const cudnnActivationDescriptor_t activationDesc,
void *workspace,
size_t workSpaceSizeInBytes
void *reserveSpace
size_t reserveSpaceSizeInBytes);
This function is an extension of the cudnnBatchNormalizationForwardTraining()
for performing the forward batch normalization layer computation.
This API will trigger the new semi-persistent NHWC kernel when the following conditions are true:
- All tensors, namely,
x, y, dz, dy, dx
must be NHWC-fully packed and must be of the typeCUDNN_DATA_HALF
. - The tensor
C
dimension should be a multiple of 4. - The input parameter
mode
must be set toCUDNN_BATCHNORM_SPATIAL_PERSISTENT
. workspace
is notNULL
.workSpaceSizeInBytes
is equal or larger than the amount required bycudnnGetBatchNormalizationForwardTrainingExWorkspaceSize()
.reserveSpaceSizeInBytes
is equal or larger than the amount required bycudnnGetBatchNormalizationTrainingExReserveSpaceSize()
.- The content in
reserveSpace
stored by cudnnBatchNormalizationForwardTrainingEx must be preserved.
If workspace
is NULL
and workSpaceSizeInBytes
of zero is passed in, this API will function exactly like the non-extended function cudnnBatchNormalizationForwardTraining()
.
This workspace is not required to be clean. Moreover, the workspace does not have to remain unchanged between the forward and backward pass, as it is not used for passing any information.
This extended function can accept a *workspace
pointer to the GPU workspace, and workSpaceSizeInBytes
, the size of the workspace, from the user.
The bnOps
input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by element-wise addition and then activation.
Only 4D and 5D tensors are supported. The epsilon
value has to be the same during the training, the backpropagation, and the inference.
When the tensor layout is NCHW, higher performance can be obtained when HW-packed tensors are used for x, dy, dx
.
Parameters
-
handle
-
Handle to a previously created cuDNN library descriptor. For more information, see cudnnHandle_t.
-
mode
-
Mode of operation (spatial or per-activation). For more information, see cudnnBatchNormMode_t.
-
bnOps
- Input. Mode of operation for the fast NHWC kernel. See cudnnBatchNormOps_t. This input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by element-wise addition and then activation.
-
*alpha
,*beta
-
Inputs. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, see Scaling Parameters in the cuDNN Developer Guide.
-
xDesc
,*xData
,zDesc
,*zData
,yDesc
,*yData
-
Tensor descriptors and pointers in device memory for the layer's
x
andy
data, and for the optionalz
tensor input for residual addition to the result of the batch normalization operation, prior to the activation. The optional tensor inputz
should be exactly the same size asx
and the final outputy
. Thisz
input is element-wise added to the output of batch normalization. This addition optionally happens after batch normalization and before the activation. For more information, see cudnnTensorDescriptor_t . -
bnScaleBiasMeanVarDesc
-
Shared tensor descriptor
desc
for the secondary tensor that was derived by cudnnDeriveBNTensorDescriptor. The dimensions for this tensor descriptor are dependent on the normalization mode. -
*bnScaleData
,*bnBiasData
-
Inputs. Pointers in device memory for the batch normalization scale and bias parameters (in the original paper, bias is referred to as beta and scale as gamma). Note that
bnBiasData
parameter can replace the previous layer's bias parameter for improved efficiency. -
exponentialAverageFactor
-
Input. Factor used in the moving average computation as follows:
runningMean = runningMean*(1-factor) + newMean*factor
factor=1/(1+n)
atN
-th call to the function to get Cumulative Moving Average (CMA) behavior such that:CMA[n] = (x[1]+...+x[n])/n
This is proved below:
Writing
CMA[n+1] = (n*CMA[n]+x[n+1])/(n+1) = ((n+1)*CMA[n]-CMA[n])/(n+1) + x[n+1]/(n+1) = CMA[n]*(1-1/(n+1))+x[n+1]*1/(n+1) = CMA[n]*(1-factor) + x(n+1)*factor
-
*resultRunningMeanData
,*resultRunningVarianceData
-
Inputs/Outputs. Pointers to the running mean and running variance data. Both these pointers can be
NULL
but only at the same time. The value stored inresultRunningVarianceData
(or passed as an input in inference mode) is the sample variance and is the moving average ofvariance[x]
where the variance is computed either over batch or spatial+batch dimensions depending on the mode. If these pointers are notNULL
, the tensors should be initialized to some reasonable values or to 0. -
epsilon
-
Input. Epsilon value used in the batch normalization formula. Its value should be equal to or greater than the value defined for
CUDNN_BN_MIN_EPSILON
incudnn.h
. The sameepsilon
value should be used in forward and backward functions. -
*saveMean
,*saveInvVariance
-
Outputs. Optional cache parameters containing saved intermediate results computed during the forward pass. For this to work correctly, the layer's
x
andbnScaleData
,bnBiasData
data has to remain unchanged until this backward function is called. Note that both these parameters can beNULL
but only at the same time. It is recommended to use this cache since the memory overhead is relatively small. -
activationDesc
-
Input. The tensor descriptor for the activation operation. When the
bnOps
input is set to eitherCUDNN_BATCHNORM_OPS_BN_ACTIVATION
orCUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION
then this activation is used. -
*workspace
,workSpaceSizeInBytes
-
Inputs.
*workspace
is a pointer to the GPU workspace, andworkSpaceSizeInBytes
is the size of the workspace. When*workspace
is notNULL
and*workSpaceSizeInBytes
is large enough, and the tensor layout is NHWC and the data type configuration is supported, then this function will trigger a new semi-persistent NHWC kernel for batch normalization. The workspace is not required to be clean. Also, the workspace does not need to remain unchanged between the forward and backward passes. -
*reserveSpace
-
Input. Pointer to the GPU workspace for the
reserveSpace
. -
reserveSpaceSizeInBytes
-
Input. The size of the
reserveSpace
. Must be equal or larger than the amount required bycudnnGetBatchNormalizationTrainingExReserveSpaceSize()
.
Returns
-
CUDNN_STATUS_SUCCESS
-
The computation was performed successfully.
-
CUDNN_STATUS_NOT_SUPPORTED
-
The function does not support the provided configuration.
-
CUDNN_STATUS_BAD_PARAM
-
At least one of the following conditions are met:
- One of the pointers
alpha, beta, x, y, bnScaleData, bnBiasData
isNULL
. - The number of
xDesc
oryDesc
tensor descriptor dimensions is not within the [4,5] range (only 4D and 5D tensors are supported). bnScaleBiasMeanVarDesc
dimensions are not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for per-activation mode.- Exactly one of
saveMean
,saveInvVariance
pointers areNULL
. - Exactly one of
resultRunningMeanData
,resultRunningInvVarianceData
pointers areNULL
. epsilon
value is less thanCUDNN_BN_MIN_EPSILON
.- Dimensions or data types mismatch for
xDesc
,yDesc
- One of the pointers
4.9. cudnnConvolutionBackwardBias
cudnnStatus_t cudnnConvolutionBackwardBias(
cudnnHandle_t handle,
const void *alpha,
const cudnnTensorDescriptor_t dyDesc,
const void *dy,
const void *beta,
const cudnnTensorDescriptor_t dbDesc,
void *db)
This function computes the convolution function gradient with respect to the bias, which is the sum of every element belonging to the same feature map across all of the images of the input tensor. Therefore, the number of elements produced is equal to the number of features maps of the input tensor.
Parameters
-
handle
-
Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.
-
alpha
,beta
-
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, see Scaling Parameters in the cuDNN Developer Guide.
-
dyDesc
-
Input. Handle to the previously initialized input tensor descriptor. For more information, see cudnnTensorDescriptor_t.
-
dy
-
Input. Data pointer to GPU memory associated with the tensor descriptor
dyDesc
. -
dbDesc
-
Input. Handle to the previously initialized output tensor descriptor.
-
db
-
Output. Data pointer to GPU memory associated with the output tensor descriptor
dbDesc
.
Returns
-
CUDNN_STATUS_SUCCESS
-
The operation was launched successfully.
-
CUDNN_STATUS_NOT_SUPPORTED
-
The function does not support the provided configuration.
-
CUDNN_STATUS_BAD_PARAM
-
At least one of the following conditions are met:
- One of the parameters
n, height, width
of the output tensor is not 1. - The numbers of feature maps of the input tensor and output tensor differ.
- The
dataType
of the two tensor descriptors is different.
- One of the parameters
4.10. cudnnConvolutionBackwardData
cudnnStatus_t cudnnConvolutionBackwardData(
cudnnHandle_t handle,
const void *alpha,
const cudnnFilterDescriptor_t wDesc,
const void *w,
const cudnnTensorDescriptor_t dyDesc,
const void *dy,
const cudnnConvolutionDescriptor_t convDesc,
cudnnConvolutionBwdDataAlgo_t algo,
void *workSpace,
size_t workSpaceSizeInBytes,
const void *beta,
const cudnnTensorDescriptor_t dxDesc,
void *dx)
This function computes the convolution data gradient of the tensor dy
, where y
is the output of the forward convolution in cudnnConvolutionForward()
. It uses the specified algo
, and returns the results in the output tensor dx
. Scaling factors alpha
and beta
can be used to scale the computed result or accumulate with the current dx
.
Parameters
-
handle
-
Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.
-
alpha
,beta
-
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue
For more information, see Scaling Parameters in the cuDNN Developer Guide.
-
wDesc
-
Input. Handle to a previously initialized filter descriptor. For more information, see cudnnFilterDescriptor_t.
-
w
-
Input. Data pointer to GPU memory associated with the filter descriptor
wDesc
. -
dyDesc
-
Input. Handle to the previously initialized input differential tensor descriptor. For more information, see cudnnTensorDescriptor_t.
-
dy
-
Input. Data pointer to GPU memory associated with the input differential tensor descriptor
dyDesc
. -
convDesc
-
Input. Previously initialized convolution descriptor. For more information, see cudnnConvolutionDescriptor_t.
-
algo
-
Input. Enumerant that specifies which backward data convolution algorithm should be used to compute the results. For more information, see cudnnConvolutionBwdDataAlgo_t.
-
workSpace
-
Input. Data pointer to GPU memory to a workspace needed to able to execute the specified algorithm. If no workspace is needed for a particular algorithm, that pointer can be nil.
-
workSpaceSizeInBytes
-
Input. Specifies the size in bytes of the provided
workSpace
. -
dxDesc
-
Input. Handle to the previously initialized output tensor descriptor.
-
dx
-
Input/Output. Data pointer to GPU memory associated with the output tensor descriptor
dxDesc
that carries the result.
Supported configurations
This function supports the following combinations of data types for wDesc
, dyDesc
, convDesc
, and dxDesc
.
Data Type Configurations | wDesc , dyDesc and dxDesc Data Type |
convDesc Data Type |
---|---|---|
TRUE_HALF_CONFIG (only supported on architectures with true FP16 support, meaning, compute capability 5.3 and later) |
CUDNN_DATA_HALF |
CUDNN_DATA_HALF |
PSEUDO_HALF_CONFIG |
CUDNN_DATA_HALF |
CUDNN_DATA_FLOAT |
FLOAT_CONFIG |
CUDNN_DATA_FLOAT |
CUDNN_DATA_FLOAT |
DOUBLE_CONFIG |
CUDNN_DATA_DOUBLE |
CUDNN_DATA_DOUBLE |
Supported algorithms
Specifying a separate algorithm can cause changes in performance, support and computation determinism. See the following for a list of algorithm options, and their respective supported parameters and deterministic behavior.
The table below shows the list of the supported 2D and 3D convolutions. The 2D convolutions are described first, followed by the 3D convolutions. For the following terms, the short-form versions shown in the parentheses are used in the table below, for brevity:
CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 (_ALGO_0)
CUDNN_CONVOLUTION_BWD_DATA_ALGO_1 (_ALGO_1)
CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT (_FFT)
CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING (_FFT_TILING)
CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD (_WINOGRAD)
CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD_NONFUSED (_WINOGRAD_NONFUSED)
CUDNN_TENSOR_NCHW (_NCHW)
CUDNN_TENSOR_NHWC (_NHWC)
CUDNN_TENSOR_NCHW_VECT_C (_NCHW_VECT_C)
Filter descriptor
wDesc: _NHWC (see cudnnTensorFormat_t) |
|||||
---|---|---|---|---|---|
Algo Name | Deterministic (Yes or No) | Tensor Formats Supported for dyDesc |
Tensor Formats Supported for dxDesc |
Data Type Configurations Supported | Important |
_ALGO_1 |
NHWC HWC-packed | NHWC HWC-packed | TRUE_HALF_CONFIG
|
Filter descriptor
wDesc: _NCHW . |
|||||
---|---|---|---|---|---|
Algo Name | Deterministic (Yes or No) | Tensor Formats Supported for dyDesc |
Tensor Formats Supported for dxDesc |
Data Type Configurations Supported | Important |
_ALGO_0 |
No | NCHW CHW-packed | All except _NCHW_VECT_C . |
PSEUDO_HALF_CONFIG
|
Dilation: greater than 0 for all dimensions
|
_ALGO_1 |
Yes | NCHW CHW-packed | All except _NCHW_VECT_C . |
TRUE_HALF_CONFIG
|
Dilation: 1 for all dimensions
|
_FFT |
Yes | NCHW CHW-packed | NCHW HW-packed | PSEUDO_HALF_CONFIG
|
Dilation: 1 for all dimensions
|
_FFT_TILING |
Yes | NCHW CHW-packed | NCHW HW-packed | PSEUDO_HALF_CONFIG
|
Dilation: 1 for all dimensions
When neither of When either of
|
_WINOGRAD |
Yes | NCHW CHW-packed | All except _NCHW_VECT_C . |
PSEUDO_HALF_CONFIG
|
Dilation: 1 for all dimensions
|
_WINOGRAD_NONFUSED |
Yes | NCHW CHW-packed | All except _NCHW_VECT_C . |
TRUE_HALF_CONFIG
|
Dilation: 1 for all dimensions
If |
Filter descriptor
wDesc: _NCHW . |
|||||
---|---|---|---|---|---|
Algo Name | Deterministic (Yes or No) | Tensor Formats Supported for dyDesc |
Tensor Formats Supported for dxDesc |
Data Type Configurations Supported | Important |
_ALGO_0 |
Yes | NCDHW CDHW-packed | All except _NCDHW_VECT_C . |
PSEUDO_HALF_CONFIG
|
Dilation: greater than 0 for all dimensions
|
_ALGO_1 |
Yes | NCDHW fully-packed | NCDHW fully-packed | TRUE_HALF_CONFIG
|
Dilation: 1 for all dimensions
|
_FFT_TILING |
Yes | NCDHW CDHW-packed | NCDHW DHW-packed | PSEUDO_HALF_CONFIG
|
Dilation: 1 for all dimensions
|
Returns
-
CUDNN_STATUS_SUCCESS
-
The operation was launched successfully.
-
CUDNN_STATUS_BAD_PARAM
-
At least one of the following conditions are met:
- At least one of the following is
NULL
:handle
,dyDesc
,wDesc
,convDesc
,dxDesc
,dy
,w
,dx
,alpha
,beta
wDesc
anddyDesc
have a non-matching number of dimensionswDesc
anddxDesc
have a non-matching number of dimensionswDesc
has fewer than three number of dimensionswDesc
,dxDesc
, anddyDesc
have a non-matching data type.wDesc
anddxDesc
have a non-matching number of input feature maps per image (or group in case of grouped convolutions).dyDesc
spatial sizes do not match with the expected size as determined bycudnnGetConvolutionNdForwardOutputDim
- At least one of the following is
-
CUDNN_STATUS_NOT_SUPPORTED
-
At least one of the following conditions are met:
dyDesc
ordxDesc
have a negative tensor stridingdyDesc
,wDesc
ordxDesc
has a number of dimensions that is not 4 or 5- The chosen algo does not support the parameters provided; see above for an exhaustive list of parameters that support each algo
dyDesc
orwDesc
indicate an output channel count that isn't a multiple of group count (if group count has been set inconvDesc
).
-
CUDNN_STATUS_MAPPING_ERROR
-
An error occurs during the texture binding of the filter data or the input differential tensor data
-
CUDNN_STATUS_EXECUTION_FAILED
-
The function failed to launch on the GPU.
4.11. cudnnConvolutionBackwardFilter
cudnnStatus_t cudnnConvolutionBackwardFilter(
cudnnHandle_t handle,
const void *alpha,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const cudnnTensorDescriptor_t dyDesc,
const void *dy,
const cudnnConvolutionDescriptor_t convDesc,
cudnnConvolutionBwdFilterAlgo_t algo,
void *workSpace,
size_t workSpaceSizeInBytes,
const void *beta,
const cudnnFilterDescriptor_t dwDesc,
void *dw)
This function computes the convolution weight (filter) gradient of the tensor dy
, where y
is the output of the forward convolution in cudnnConvolutionForward()
. It uses the specified algo
, and returns the results in the output tensor dw
. Scaling factors alpha
and beta
can be used to scale the computed result or accumulate with the current dw
.
Parameters
-
handle
-
Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.
-
alpha
,beta
-
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue
For more information, see Scaling Parameters in the cuDNN Developer Guide.
-
xDesc
-
Input. Handle to a previously initialized tensor descriptor. For more information, see cudnnTensorDescriptor_t.
-
x
-
Input. Data pointer to GPU memory associated with the tensor descriptor
xDesc
. -
dyDesc
-
Input. Handle to the previously initialized input differential tensor descriptor.
-
dy
-
Input. Data pointer to GPU memory associated with the backpropagation gradient tensor descriptor
dyDesc
. -
convDesc
-
Input. Previously initialized convolution descriptor. For more information, see cudnnConvolutionDescriptor_t.
-
algo
-
Input. Enumerant that specifies which convolution algorithm should be used to compute the results. For more information, see cudnnConvolutionBwdFilterAlgo_t.
-
workSpace
-
Input. Data pointer to GPU memory to a workspace needed to able to execute the specified algorithm. If no workspace is needed for a particular algorithm, that pointer can be nil.
-
workSpaceSizeInBytes
-
Input. Specifies the size in bytes of the provided
workSpace
. -
dwDesc
-
Input. Handle to a previously initialized filter gradient descriptor. For more information, see cudnnFilterDescriptor_t.
-
dw
-
Input/Output. Data pointer to GPU memory associated with the filter gradient descriptor
dwDesc
that carries the result.
Supported configurations
This function supports the following combinations of data types for xDesc
, dyDesc
, convDesc
, and dwDesc
.
Data Type Configurations | xDesc , dyDesc , and dwDesc Data Type |
convDesc Data Type |
---|---|---|
TRUE_HALF_CONFIG (only supported on architectures with true FP16 support, meaning, compute capability 5.3 and later) |
CUDNN_DATA_HALF |
CUDNN_DATA_HALF |
PSEUDO_HALF_CONFIG |
CUDNN_DATA_HALF |
CUDNN_DATA_FLOAT |
FLOAT_CONFIG |
CUDNN_DATA_FLOAT |
CUDNN_DATA_FLOAT |
DOUBLE_CONFIG |
CUDNN_DATA_DOUBLE |
CUDNN_DATA_DOUBLE |
Supported algorithms
Specifying a separate algorithm can cause changes in performance, support and computation determinism. See the following table for an exhaustive list of algorithm options and their respective supported parameters and deterministic behavior.
The table below shows the list of the supported 2D and 3D convolutions. The 2D convolutions are described first, followed by the 3D convolutions. For the following terms, the short-form versions shown in the parentheses are used in the table below, for brevity:
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 (_ALGO_0)
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1 (_ALGO_1)
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3 (_ALGO_3)
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_FFT (_FFT)
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_FFT_TILING (_FFT_TILING)
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_WINOGRAD_NONFUSED (_WINOGRAD_NONFUSED)
CUDNN_TENSOR_NCHW (_NCHW)
CUDNN_TENSOR_NHWC (_NHWC)
CUDNN_TENSOR_NCHW_VECT_C (_NCHW_VECT_C)
Filter descriptor
dwDesc: _NHWC (see cudnnTensorFormat_t) |
|||||
---|---|---|---|---|---|
Algo Name | Deterministic (Yes or No) | Tensor Formats Supported for dyDesc |
Tensor Formats Supported for dxDesc |
Data Type Configurations Supported | Important |
_ALGO_0 and _ALGO_1 |
NHWC HWC-packed | NHWC HWC-packed | PSEUDO_HALF_CONFIG
|
Filter descriptor
wDesc: _NCHW |
|||||
---|---|---|---|---|---|
Algo Name | Deterministic (Yes or No) | Tensor Formats Supported for dyDesc |
Tensor Formats Supported for dxDesc |
Data Type Configurations Supported | Important |
_ALGO_0 |
No | All except _NCHW_VECT_C . |
NCHW CHW-packed | PSEUDO_HALF_CONFIG
|
Dilation: greater than 0 for all dimensions
This algo is not supported if output is of type |
_ALGO_1 |
Yes | _NCHW or _NHWC |
NCHW CHW-packed | TRUE_HALF_CONFIG
|
Dilation: 1 for all dimensions
|
_FFT |
Yes | NCHW CHW-packed | NCHW CHW-packed | PSEUDO_HALF_CONFIG
|
Dilation: 1 for all dimensions
|
_ALGO_3 |
Yes | All except _NCHW_VECT_C |
NCHW CHW-packed | PSEUDO_HALF_CONFIG
|
Dilation: 1 for all dimensions
|
_WINOGRAD_NONFUSED |
Yes | All except _NCHW_VECT_C |
NCHW CHW-packed | TRUE_HALF_CONFIG
|
Dilation: 1 for all dimensions
If |
_FFT_TILING |
Yes | NCHW CHW-packed | NCHW CHW-packed | PSEUDO_HALF_CONFIG
|
Dilation: 1 for all dimensions
|
Filter descriptor
wDesc: _NCHW . |
|||||
---|---|---|---|---|---|
Algo Name (3D Convolutions) | Deterministic (Yes or No) | Tensor Formats Supported for dyDesc |
Tensor Formats Supported for dxDesc |
Data Type Configurations Supported | Important |
_ALGO_0 |
No | All except _NCDHW_VECT_C . |
NCDHW CDHW-packed | PSEUDO_HALF_CONFIG
|
Dilation: greater than 0 for all dimensions
|
_ALGO_3 |
No | NCDHW fully-packed | NCDHW fully-packed | PSEUDO_HALF_CONFIG
|
Dilation: greater than 0 for all dimensions
|
Returns
-
CUDNN_STATUS_SUCCESS
-
The operation was launched successfully.
-
CUDNN_STATUS_BAD_PARAM
-
At least one of the following conditions are met:
- At least one of the following is NULL:
handle
,xDesc
,dyDesc
,convDesc
,dwDesc
,xData
,dyData
,dwData
,alpha
,beta
xDesc
anddyDesc
have a non-matching number of dimensionsxDesc
anddwDesc
have a non-matching number of dimensionsxDesc
has fewer than three number of dimensionsxDesc
,dyDesc
, anddwDesc
have a non-matching data type.xDesc
anddwDesc
have a non-matching number of input feature maps per image (or group in case of grouped convolutions).yDesc
orwDesc
indicate an output channel count that isn't a multiple of group count (if group count has been set inconvDesc
).
- At least one of the following is NULL:
-
CUDNN_STATUS_NOT_SUPPORTED
-
At least one of the following conditions are met:
xDesc
ordyDesc
have negative tensor stridingxDesc
,dyDesc
ordwDesc
has a number of dimensions that is not 4 or 5- The chosen algo does not support the parameters provided; see above for exhaustive list of parameter support for each algo
-
CUDNN_STATUS_MAPPING_ERROR
-
An error occurs during the texture binding of the filter data.
-
CUDNN_STATUS_EXECUTION_FAILED
-
The function failed to launch on the GPU.
4.12. cudnnConvolutionBiasActivationForward
cudnnStatus_t cudnnConvolutionBiasActivationForward(
cudnnHandle_t handle,
const void *alpha1,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const cudnnFilterDescriptor_t wDesc,
const void *w,
const cudnnConvolutionDescriptor_t convDesc,
cudnnConvolutionFwdAlgo_t algo,
void *workSpace,
size_t workSpaceSizeInBytes,
const void *alpha2,
const cudnnTensorDescriptor_t zDesc,
const void *z,
const cudnnTensorDescriptor_t biasDesc,
const void *bias,
const cudnnActivationDescriptor_t activationDesc,
const cudnnTensorDescriptor_t yDesc,
void *y)
This function applies a bias and then an activation to the convolutions or cross-correlations of cudnnConvolutionForward(), returning results in y
. The full computation follows the equation y = act ( alpha1 * conv(x) + alpha2 * z + bias )
.
- The routine
cudnnGetConvolution2dForwardOutputDim
orcudnnGetConvolutionNdForwardOutputDim
can be used to determine the proper dimensions of the output tensor descriptoryDesc
with respect toxDesc
,convDesc
, andwDesc
. - Only the
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
algo is enabled withCUDNN_ACTIVATION_IDENTITY
. In other words, in thecudnnActivationDescriptor_t
structure of the inputactivationDesc
, if the mode of thecudnnActivationMode_t
field is set to the enum valueCUDNN_ACTIVATION_IDENTITY
, then the inputcudnnConvolutionFwdAlgo_t
of this functioncudnnConvolutionBiasActivationForward()
must be set to the enum valueCUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
. For more information, seecudnnSetActivationDescriptor()
.
Parameters
-
handle
-
Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.
-
alpha1
,alpha2
-
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue
For more information, see Scaling Parameters in the cuDNN Developer Guide.
-
xDesc
-
Input. Handle to a previously initialized tensor descriptor. For more information, see cudnnTensorDescriptor_t.
-
x
-
Input. Data pointer to GPU memory associated with the tensor descriptor
xDesc
. -
wDesc
-
Input. Handle to a previously initialized filter descriptor. For more information, see cudnnFilterDescriptor_t.
-
w
-
Input. Data pointer to GPU memory associated with the filter descriptor
wDesc
. -
convDesc
-
Input. Previously initialized convolution descriptor. For more information, see cudnnConvolutionDescriptor_t.
-
algo
-
Input. Enumerant that specifies which convolution algorithm should be used to compute the results. For more information, see cudnnConvolutionFwdAlgo_t.
-
workSpace
-
Input. Data pointer to GPU memory to a workspace needed to able to execute the specified algorithm. If no workspace is needed for a particular algorithm, that pointer can be nil.
-
workSpaceSizeInBytes
-
Input. Specifies the size in bytes of the provided
workSpace
. -
zDesc
-
Input. Handle to a previously initialized tensor descriptor.
-
z
-
Input. Data pointer to GPU memory associated with the tensor descriptor
zDesc
. -
biasDesc
-
Input. Handle to a previously initialized tensor descriptor.
-
bias
-
Input. Data pointer to GPU memory associated with the tensor descriptor
biasDesc
. -
activationDesc
-
Input. Handle to a previously initialized activation descriptor. For more information, see cudnnActivationDescriptor_t.
-
yDesc
-
Input. Handle to a previously initialized tensor descriptor.
-
y
-
Input/Output. Data pointer to GPU memory associated with the tensor descriptor
yDesc
that carries the result of the convolution.
For the convolution step, this function supports the specific combinations of data types for xDesc
, wDesc
, convDesc
, and yDesc
as listed in the documentation of cudnnConvolutionForward()
. The following table specifies the supported combinations of data types for x
, y
, z
, bias
, and alpha1/alpha2
.
x |
w |
y and z |
bias |
alpha1/alpha2 |
---|---|---|---|---|
X_DOUBLE |
X_DOUBLE |
X_DOUBLE |
X_DOUBLE |
X_DOUBLE |
X_FLOAT |
X_FLOAT |
X_FLOAT |
X_FLOAT |
X_FLOAT |
X_HALF |
X_HALF |
X_HALF |
X_HALF |
X_FLOAT |
X_INT8 |
X_INT8 |
X_INT8 |
X_FLOAT |
X_FLOAT |
X_INT8 |
X_INT8 |
X_FLOAT |
X_FLOAT |
X_FLOAT |
X_INT8x4 |
X_INT8x4 |
X_INT8x4 |
X_FLOAT |
X_FLOAT |
X_INT8x4 |
X_INT8x4 |
X_FLOAT |
X_FLOAT |
X_FLOAT |
X_UINT8 |
X_INT8 |
X_INT8 |
X_FLOAT |
X_FLOAT |
X_UINT8 |
X_INT8 |
X_FLOAT |
X_FLOAT |
X_FLOAT |
X_UINT8x4 |
X_INT8x4 |
X_INT8x4 |
X_FLOAT |
X_FLOAT |
X_UINT8x4 |
X_INT8x4 |
X_FLOAT |
X_FLOAT |
X_FLOAT |
Returns
In addition to the error values listed by the documentation of cudnnConvolutionForward()
, the possible error values returned by this function and their meanings are listed below.
-
CUDNN_STATUS_SUCCESS
-
The operation was launched successfully.
-
CUDNN_STATUS_BAD_PARAM
-
At least one of the following conditions are met:
- At least one of the following is
NULL
:zDesc
,zData
,biasDesc
,bias
,activationDesc
. - The second dimension of
biasDesc
and the first dimension offilterDesc
are not equal. zDesc
anddestDesc
do not match.
- At least one of the following is
-
CUDNN_STATUS_NOT_SUPPORTED
-
The function does not support the provided configuration. Some examples of non-supported configurations are as follows:
- The
mode
ofactivationDesc
is neitherCUDNN_ACTIVATION_RELU
orCUDNN_ACTIVATION_IDENTITY
. - The
reluNanOpt
ofactivationDesc
is notCUDNN_NOT_PROPAGATE_NAN
. - The second stride of
biasDesc
is not equal to one. - The data type of
biasDesc
does not correspond to the data type ofyDesc
as listed in the above data types table.
- The
-
CUDNN_STATUS_EXECUTION_FAILED
-
The function failed to launch on the GPU.
4.13. cudnnConvolutionForward
cudnnStatus_t cudnnConvolutionForward(
cudnnHandle_t handle,
const void *alpha,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const cudnnFilterDescriptor_t wDesc,
const void *w,
const cudnnConvolutionDescriptor_t convDesc,
cudnnConvolutionFwdAlgo_t algo,
void *workSpace,
size_t workSpaceSizeInBytes,
const void *beta,
const cudnnTensorDescriptor_t yDesc,
void *y)
This function executes convolutions or cross-correlations over x
using filters specified with w
, returning results in y
. Scaling factors alpha
and beta
can be used to scale the input tensor and the output tensor respectively.
The routine cudnnGetConvolution2dForwardOutputDim
or cudnnGetConvolutionNdForwardOutputDim
can be used to determine the proper dimensions of the output tensor descriptor yDesc
with respect to xDesc
, convDesc
, and wDesc
.
Parameters
-
handle
-
Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.
-
alpha
,beta
-
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue
For more information, see Scaling Parameters in the cuDNN Developer Guide.
-
xDesc
-
Input. Handle to a previously initialized tensor descriptor. For more information, see cudnnTensorDescriptor_t.
-
x
-
Input. Data pointer to GPU memory associated with the tensor descriptor
xDesc
. -
wDesc
-
Input. Handle to a previously initialized filter descriptor. For more information, see cudnnFilterDescriptor_t.
-
w
-
Input. Data pointer to GPU memory associated with the filter descriptor
wDesc
. -
convDesc
-
Input. Previously initialized convolution descriptor. For more information, see cudnnConvolutionDescriptor_t.
-
algo
-
Input. Enumerant that specifies which convolution algorithm should be used to compute the results. For more information, see cudnnConvolutionFwdAlgo_t.
-
workSpace
-
Input. Data pointer to GPU memory to a workspace needed to able to execute the specified algorithm. If no workspace is needed for a particular algorithm, that pointer can be nil.
-
workSpaceSizeInBytes
-
Input. Specifies the size in bytes of the provided
workSpace
. -
yDesc
-
Input. Handle to a previously initialized tensor descriptor.
-
y
-
Input/Output. Data pointer to GPU memory associated with the tensor descriptor
yDesc
that carries the result of the convolution.
Supported configurations
This function supports the following combinations of data types for xDesc
, wDesc
, convDesc
, and yDesc
.
Data Type Configurations | xDesc and wDesc |
convDesc |
yDesc |
---|---|---|---|
TRUE_HALF_CONFIG (only supported on architectures with true FP16 support, meaning, compute capability 5.3 and later) |
CUDNN_DATA_HALF |
CUDNN_DATA_HALF |
CUDNN_DATA_HALF |
PSEUDO_HALF_CONFIG |
CUDNN_DATA_HALF |
CUDNN_DATA_FLOAT |
CUDNN_DATA_HALF |
FLOAT_CONFIG |
CUDNN_DATA_FLOAT |
CUDNN_DATA_FLOAT |
CUDNN_DATA_FLOAT |
DOUBLE_CONFIG |
CUDNN_DATA_DOUBLE |
CUDNN_DATA_DOUBLE |
CUDNN_DATA_DOUBLE |
INT8_CONFIG (only supported on architectures with DP4A support, meaning, compute capability 6.1 and later) |
CUDNN_DATA_INT8 |
CUDNN_DATA_INT32 |
CUDNN_DATA_INT8 |
INT8_EXT_CONFIG (only supported on architectures with DP4A support, meaning, compute capability 6.1 and later) |
CUDNN_DATA_INT8 |
CUDNN_DATA_INT32 |
CUDNN_DATA_FLOAT |
INT8x4_CONFIG (only supported on architectures with DP4A support, meaning, compute capability 6.1 and later) |
CUDNN_DATA_INT8x4 |
CUDNN_DATA_INT32 |
CUDNN_DATA_INT8x4 |
INT8x4_EXT_CONFIG (only supported on architectures with DP4A support, meaning, compute capability 6.1 and later) |
CUDNN_DATA_INT8x4 |
CUDNN_DATA_INT32 |
CUDNN_DATA_FLOAT |
UINT8x4_CONFIG (only supported on architectures with DP4A support, meaning, compute capability 6.1 and later) |
CUDNN_DATA_UINT8x4 |
CUDNN_DATA_INT32 |
CUDNN_DATA_UINT8x4 |
UINT8x4_EXT_CONFIG (only supported on architectures with DP4A support, meaning, compute capability 6.1 and later) |
CUDNN_DATA_UINT8x4 |
CUDNN_DATA_INT32 |
CUDNN_DATA_FLOAT |
Supported algorithms
For this function, all algorithms perform deterministic computations. Specifying a separate algorithm can cause changes in performance and support.
The table below shows the list of the supported 2D and 3D convolutions. The 2D convolutions are described first, followed by the 3D convolutions. For the following terms, the short-form versions shown in the paranthesis are used in the table below, for brevity:
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM (_IMPLICIT_GEMM)
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM (_IMPLICIT_PRECOMP_GEMM)
CUDNN_CONVOLUTION_FWD_ALGO_GEMM (_GEMM)
CUDNN_CONVOLUTION_FWD_ALGO_DIRECT (_DIRECT)
CUDNN_CONVOLUTION_FWD_ALGO_FFT (_FFT)
CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING (_FFT_TILING)
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD (_WINOGRAD)
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED (_WINOGRAD_NONFUSED)
CUDNN_TENSOR_NCHW (_NCHW)
CUDNN_TENSOR_NHWC (_NHWC)
CUDNN_TENSOR_NCHW_VECT_C (_NCHW_VECT_C)
Filter descriptor
wDesc: _NCHW (see cudnnTensorFormat_t)
|
||||
---|---|---|---|---|
Algo Name | Tensor Formats Supported for xDesc |
Tensor Formats Supported for yDesc |
Data Type Configurations Supported | Important |
_IMPLICIT_GEMM |
All except _NCHW_VECT_C . |
All except _NCHW_VECT_C . |
PSEUDO_HALF_CONFIG
|
Dilation: Greater than 0 for all dimensions |
_IMPLICIT_PRECOMP_GEMM |
TRUE_HALF_CONFIG
|
Dilation: 1 for all dimensions | ||
_GEMM |
PSEUDO_HALF_CONFIG
|
Dilation: 1 for all dimensions | ||
_FFT |
NCHW HW-packed | NCHW HW-packed | PSEUDO_HALF_CONFIG
|
Dilation: 1 for all dimensions
|
_FFT_TILING |
PSEUDO_HALF_CONFIG
|
Dilation: 1 for all dimensions When neither of When either of
|
||
_WINOGRAD |
All except_NCHW_VECT_C . |
All except_NCHW_VECT_C . |
PSEUDO_HALF_CONFIG
|
Dilation: 1 for all dimensions
|
_WINOGRAD_NONFUSED |
TRUE_HALF_CONFIG
|
Dilation: 1 for all dimensions
If |
||
_DIRECT |
Currently not implemented in cuDNN. |
Filter descriptor
wDesc: _NCHWC
|
||||
---|---|---|---|---|
Algo Name | xDesc |
yDesc |
Data Type Configurations Supported | Important |
_IMPLICIT_GEMM |
NCHWC HWC-packed | NCHWC HWC-packed | PSEUDO_HALF_CONFIG
|
Dilation: Greater than 0 for all dimensions |
Filter descriptor
wDesc: _NHWC
|
||||
---|---|---|---|---|
Algo Name | xDesc |
yDesc |
Data Type Configurations Supported | Important |
_IMPLICIT_PRECOMP_GEMM |
NHWC | NHWC | INT8_CONFIG
|
Dilation: 1 for all dimensions Input and output features maps must be a multiple of 4. |
Filter descriptor
wDesc: _NCHW
|
||||
---|---|---|---|---|
Algo Name | xDesc |
yDesc |
Data Type Configurations Supported | Important |
_IMPLICIT_GEMM |
All except _NCHW_VECT_C . |
All except _NCHW_VECT_C . |
PSEUDO_HALF_CONFIG
|
Dilation: Greater than 0 for all dimensions |
_IMPLICIT_PRECOMP_GEMM |
Dilation: 1 for all dimensions | |||
_FFT_TILING |
NCDHW DHW-packed | NCDHW DHW-packed | Dilation: 1 for all dimensions
|
Tensors can be converted to and from CUDNN_TENSOR_NCHW_VECT_C
with cudnnTransformTensor()
.
Returns
-
CUDNN_STATUS_SUCCESS
-
The operation was launched successfully.
-
CUDNN_STATUS_BAD_PARAM
-
At least one of the following conditions are met:
- At least one of the following is
NULL
: handle,xDesc
,wDesc
,convDesc
,yDesc
,xData
,w
,yData
,alpha
,beta
xDesc
andyDesc
have a non-matching number of dimensionsxDesc
andwDesc
have a non-matching number of dimensionsxDesc
has fewer than three number of dimensionsxDesc
's number of dimensions is not equal toconvDesc
array length + 2xDesc
andwDesc
have a non-matching number of input feature maps per image (or group in case of grouped convolutions)yDesc
orwDesc
indicate an output channel count that isn't a multiple of group count (if group count has been set inconvDesc
).xDesc
,wDesc
, andyDesc
have a non-matching data type- For some spatial dimension,
wDesc
has a spatial size that is larger than the input spatial size (including zero-padding size)
- At least one of the following is
-
CUDNN_STATUS_NOT_SUPPORTED
-
At least one of the following conditions are met:
xDesc
oryDesc
have negative tensor stridingxDesc
,wDesc
, oryDesc
has a number of dimensions that is not 4 or 5yDesc
spatial sizes do not match with the expected size as determined bycudnnGetConvolutionNdForwardOutputDim
- The chosen algo does not support the parameters provided; see above for an exhaustive list of parameters supported for each algo
-
CUDNN_STATUS_MAPPING_ERROR
-
An error occurred during the texture binding of the filter data.
-
CUDNN_STATUS_EXECUTION_FAILED
-
The function failed to launch on the GPU.
4.14. cudnnCreate
cudnnStatus_t cudnnCreate(cudnnHandle_t *handle)
This function initializes the cuDNN library and creates a handle to an opaque structure holding the cuDNN library context. It allocates hardware resources on the host and device and must be called prior to making any other cuDNN library calls.
The cuDNN library handle is tied to the current CUDA device (context). To use the library on multiple devices, one cuDNN handle needs to be created for each device.
For a given device, multiple cuDNN handles with different configurations (for example, different current CUDA streams) may be created. Because cudnnCreate
allocates some internal resources, the release of those resources by calling cudnnDestroy
will implicitly call cudaDeviceSynchronize; therefore, the recommended best practice is to call cudnnCreate/cudnnDestroy
outside of performance-critical code paths.
For multithreaded applications that use the same device from different threads, the recommended programming model is to create one (or a few, as is convenient) cuDNN handle(s) per thread and use that cuDNN handle for the entire life of the thread.
Parameters
-
handle
-
Output. Pointer to pointer where to store the address to the allocated cuDNN handle. For more information, see cudnnHandle_t.
Returns
-
CUDNN_STATUS_BAD_PARAM
-
Invalid (
NULL
) input pointer supplied. -
CUDNN_STATUS_NOT_INITIALIZED
-
No compatible GPU found, CUDA driver not installed or disabled, CUDA runtime API initialization failed.
-
CUDNN_STATUS_ARCH_MISMATCH
-
NVIDIA GPU architecture is too old.
-
CUDNN_STATUS_ALLOC_FAILED
-
Host memory allocation failed.
-
CUDNN_STATUS_INTERNAL_ERROR
-
CUDA resource allocation failed.
-
CUDNN_STATUS_LICENSE_ERROR
-
cuDNN license validation failed (only when the feature is enabled).
-
CUDNN_STATUS_SUCCESS
-
cuDNN handle was created successfully.
4.15. cudnnCreateActivationDescriptor
cudnnStatus_t cudnnCreateActivationDescriptor(
cudnnActivationDescriptor_t *activationDesc)
This function creates an activation descriptor object by allocating the memory needed to hold its opaque structure. For more information, see cudnnActivationDescriptor_t.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was created successfully.
-
CUDNN_STATUS_ALLOC_FAILED
-
The resources could not be allocated.
4.16. cudnnCreateAlgorithmDescriptor
cudnnStatus_t cudnnCreateAlgorithmDescriptor(
cudnnAlgorithmDescriptor_t *algoDesc)
This function creates an algorithm descriptor object by allocating the memory needed to hold its opaque structure.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was created successfully.
-
CUDNN_STATUS_ALLOC_FAILED
-
The resources could not be allocated.
4.17. cudnnCreateAlgorithmPerformance
cudnnStatus_t cudnnCreateAlgorithmPerformance(
cudnnAlgorithmPerformance_t *algoPerf,
int numberToCreate)
This function creates multiple algorithm performance objects by allocating the memory needed to hold their opaque structures.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was created successfully.
-
CUDNN_STATUS_ALLOC_FAILED
-
The resources could not be allocated.
4.18. cudnnCreateAttnDescriptor
cudnnStatus_t cudnnCreateAttnDescriptor(cudnnAttnDescriptor_t *attnDesc);
This function creates one instance of an opaque attention descriptor object by allocating the host memory for it and initializing all descriptor fields. The function writes NULL
to attnDesc
when the attention descriptor object cannot be allocated.
Use the cudnnSetAttnDescriptor()
function to configure the attention descriptor and cudnnDestroyAttnDescriptor()
to destroy it and release the allocated memory.
Parameters
-
attnDesc
- Output. Pointer where the address to the newly created attention descriptor should be written.
Returns
-
CUDNN_STATUS_SUCCESS
- The descriptor object was created successfully.
-
CUDNN_STATUS_BAD_PARAM
-
An invalid input argument was encountered (
attnDesc=NULL
). -
CUDNN_STATUS_ALLOC_FAILED
- The memory allocation failed.
4.19. cudnnCreateConvolutionDescriptor
cudnnStatus_t cudnnCreateConvolutionDescriptor(
cudnnConvolutionDescriptor_t *convDesc)
This function creates a convolution descriptor object by allocating the memory needed to hold its opaque structure. For more information, see cudnnConvolutionDescriptor_t.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was created successfully.
-
CUDNN_STATUS_ALLOC_FAILED
-
The resources could not be allocated.
4.20. cudnnCreateCTCLossDescriptor
cudnnStatus_t cudnnCreateCTCLossDescriptor(
cudnnCTCLossDescriptor_t* ctcLossDesc)
This function creates a CTC loss function descriptor.
Parameters
-
ctcLossDesc
-
Output. CTC loss descriptor to be set. For more information, see cudnnCTCLossDescriptor_t.
Returns
-
CUDNN_STATUS_SUCCESS
-
The function returned successfully.
-
CUDNN_STATUS_BAD_PARAM
-
CTC loss descriptor passed to the function is invalid.
-
CUDNN_STATUS_ALLOC_FAILED
-
Memory allocation for this CTC loss descriptor failed.
4.21. cudnnCreateDropoutDescriptor
cudnnStatus_t cudnnCreateDropoutDescriptor(
cudnnDropoutDescriptor_t *dropoutDesc)
This function creates a generic dropout descriptor object by allocating the memory needed to hold its opaque structure. For more information, see cudnnDropoutDescriptor_t.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was created successfully.
-
CUDNN_STATUS_ALLOC_FAILED
-
The resources could not be allocated.
4.22. cudnnCreateFilterDescriptor
cudnnStatus_t cudnnCreateFilterDescriptor(
cudnnFilterDescriptor_t *filterDesc)
This function creates a filter descriptor object by allocating the memory needed to hold its opaque structure. For more information, see cudnnFilterDescriptor_t.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was created successfully.
-
CUDNN_STATUS_ALLOC_FAILED
-
The resources could not be allocated.
4.23. cudnnCreateFusedOpsConstParamPack
cudnnStatus_t cudnnCreateFusedOpsConstParamPack(
cudnnFusedOpsConstParamPack_t *constPack,
cudnnFusedOps_t ops);
This function creates an opaque structure to store the various problem size information, such as the shape, layout and the type of tensors, and the descriptors for convolution and activation, for the selected sequence of cudnnFusedOps
computations.
Parameters
-
constPack
- Input. The opaque structure that is created by this function. For more information, see cudnnFusedOpsConstParamPack_t.
-
ops
-
Input. The specific sequence of computations to perform in the
cudnnFusedOps
computations, as defined in the enumerant type cudnnFusedOps_t.
Returns
-
CUDNN_STATUS_BAD_PARAM
-
If either
constPack
orops
isNULL
. -
CUDNN_STATUS_SUCCESS
- If the descriptor is created successfully.
-
CUDNN_STATUS_NOT_SUPPORTED
-
If the
ops
enum value is not supported or reserved for future use.
4.24. cudnnCreateFusedOpsPlan
cudnnStatus_t cudnnCreateFusedOpsPlan(
cudnnFusedOpsPlan_t *plan,
cudnnFusedOps_t ops);
This function creates the plan descriptor for the cudnnFusedOps
computation. This descriptor contains the plan information, including the problem type and size, which kernels should be run, and the internal workspace partition.
Parameters
-
plan
- Input. A pointer to the instance of the descriptor created by this function.
-
ops
- Input. The specific sequence of fused operations computations for which this plan descriptor should be created. For more information, see cudnnFusedOps_t.
Returns
-
CUDNN_STATUS_BAD_PARAM
-
If either the input
*plan
isNULL
or theops
input is not a validcudnnFusedOp
enum. -
CUDNN_STATUS_NOT_SUPPORTED
-
The
ops
input provided is not supported. -
CUDNN_STATUS_SUCCESS
- The plan descriptor is created successfully.
4.25. cudnnCreateFusedOpsVariantParamPack
cudnnStatus_t cudnnCreateFusedOpsVariantParamPack(
cudnnFusedOpsVariantParamPack_t *varPack,
cudnnFusedOps_t ops);
This function creates a descriptor for cudnnFusedOps
constant parameters.
Parameters
-
varPack
- Input. Pointer to the descriptor created by this function. For more information, see cudnnFusedOpsVariantParamPack_t.
-
ops
- Input. The specific sequence of fused operations computations for which this descriptor should be created.
Returns
-
CUDNN_STATUS_SUCCESS
- The descriptor is successfully created.
-
CUDNN_STATUS_BAD_PARAM
- If any input is invalid.
4.26. cudnnCreateLRNDescriptor
cudnnStatus_t cudnnCreateLRNDescriptor(
cudnnLRNDescriptor_t *poolingDesc)
This function allocates the memory needed to hold the data needed for LRN and DivisiveNormalization
layers operation and returns a descriptor used with subsequent layer forward and backward calls.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was created successfully.
-
CUDNN_STATUS_ALLOC_FAILED
-
The resources could not be allocated.
cudnnCreateOpTensorDescriptor
cudnnStatus_t cudnnCreateOpTensorDescriptor(
cudnnOpTensorDescriptor_t* opTensorDesc)
This function creates a tensor pointwise math descriptor. For more information, see cudnnOpTensorDescriptor_t.
Parameters
-
opTensorDesc
-
Output. Pointer to the structure holding the description of the Tensor Pointwise math such as add, multiply, and more.
Returns
-
CUDNN_STATUS_SUCCESS
-
The function returned successfully.
-
CUDNN_STATUS_BAD_PARAM
-
Tensor pointwise math descriptor passed to the function is invalid.
-
CUDNN_STATUS_ALLOC_FAILED
-
Memory allocation for this tensor pointwise math descriptor failed.
4.28. cudnnCreatePersistentRNNPlan
cudnnStatus_t cudnnCreatePersistentRNNPlan(
cudnnRNNDescriptor_t rnnDesc,
const int minibatch,
const cudnnDataType_t dataType,
cudnnPersistentRNNPlan_t *plan)
This function creates a plan to execute persistent RNNs when using the CUDNN_RNN_ALGO_PERSIST_DYNAMIC
algo. This plan is tailored to the current GPU and problem hyperparameters. This function call is expected to be expensive in terms of runtime and should be used infrequently. For more information, see cudnnRNNDescriptor_t, cudnnDataType_t, and cudnnPersistentRNNPlan_t.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was created successfully.
-
CUDNN_STATUS_ALLOC_FAILED
-
The resources could not be allocated.
-
CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING
-
A prerequisite runtime library cannot be found.
-
CUDNN_STATUS_NOT_SUPPORTED
-
The current hyperparameters are invalid.
4.29. cudnnCreatePoolingDescriptor
cudnnStatus_t cudnnCreatePoolingDescriptor(
cudnnPoolingDescriptor_t *poolingDesc)
This function creates a pooling descriptor object by allocating the memory needed to hold its opaque structure.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was created successfully.
-
CUDNN_STATUS_ALLOC_FAILED
-
The resources could not be allocated.
4.30. cudnnCreateReduceTensorDescriptor
cudnnStatus_t cudnnCreateReduceTensorDescriptor(
cudnnReduceTensorDescriptor_t* reduceTensorDesc)
This function creates a reduce tensor descriptor object by allocating the memory needed to hold its opaque structure.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was created successfully.
-
CUDNN_STATUS_BAD_PARAM
-
reduceTensorDesc
is aNULL
pointer. -
CUDNN_STATUS_ALLOC_FAILED
-
The resources could not be allocated.
4.31. cudnnCreateRNNDataDescriptor
cudnnStatus_t cudnnCreateRNNDataDescriptor(
cudnnRNNDataDescriptor_t *RNNDataDesc)
This function creates a RNN data descriptor object by allocating the memory needed to hold its opaque structure.
Returns
-
CUDNN_STATUS_SUCCESS
-
The RNN data descriptor object was created successfully.
-
CUDNN_STATUS_BAD_PARAM
-
RNNDataDesc
isNULL
. -
CUDNN_STATUS_ALLOC_FAILED
-
The resources could not be allocated.
4.32. cudnnCreateRNNDescriptor
cudnnStatus_t cudnnCreateRNNDescriptor(
cudnnRNNDescriptor_t *rnnDesc)
This function creates a generic RNN descriptor object by allocating the memory needed to hold its opaque structure.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was created successfully.
-
CUDNN_STATUS_ALLOC_FAILED
-
The resources could not be allocated.
4.33. cudnnCreateSeqDataDescriptor
cudnnStatus_t cudnnCreateSeqDataDescriptor(cudnnSeqDataDescriptor_t *seqDataDesc);
This function creates one instance of an opaque sequence data descriptor object by allocating the host memory for it and initializing all descriptor fields. The function writes NULL
to seqDataDesc
when the sequence data descriptor object cannot be allocated.
Use the cudnnSetSeqDataDescriptor()
function to configure the sequence data descriptor and cudnnDestroySeqDataDescriptor()
to destroy it and release the allocated memory.
Parameters
-
seqDataDesc
- Output. Pointer where the address to the newly created sequence data descriptor should be written.
Returns
-
CUDNN_STATUS_SUCCESS
- The descriptor object was created successfully.
-
CUDNN_STATUS_BAD_PARAM
-
An invalid input argument was encountered (
seqDataDesc=NULL
). -
CUDNN_STATUS_ALLOC_FAILED
- The memory allocation failed.
4.34. cudnnCreateSpatialTransformerDescriptor
cudnnStatus_t cudnnCreateSpatialTransformerDescriptor(
cudnnSpatialTransformerDescriptor_t *stDesc)
This function creates a generic spatial transformer descriptor object by allocating the memory needed to hold its opaque structure.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was created successfully.
-
CUDNN_STATUS_ALLOC_FAILED
-
The resources could not be allocated.
4.35. cudnnCreateTensorDescriptor
cudnnStatus_t cudnnCreateTensorDescriptor(
cudnnTensorDescriptor_t *tensorDesc)
This function creates a generic tensor descriptor object by allocating the memory needed to hold its opaque structure. The data is initialized to all zeros.
Parameters
-
tensorDesc
-
Input. Pointer to pointer where the address to the allocated tensor descriptor object should be stored.
Returns
-
CUDNN_STATUS_BAD_PARAM
-
Invalid input argument.
-
CUDNN_STATUS_ALLOC_FAILED
-
The resources could not be allocated.
-
CUDNN_STATUS_SUCCESS
-
The object was created successfully.
4.36. cudnnCreateTensorTransformDescriptor
cudnnStatus_t cudnnCreateTensorTransformDescriptor(
cudnnTensorTransformDescriptor_t *transformDesc);
This function creates a Tensor transform descriptor object by allocating the memory needed to hold its opaque structure. The Tensor data is initialized to be all zero. Use the cudnnSetTensorTransformDescriptor function to initialize the descriptor created by this function.
Parameters
-
transformDesc
- Output. A pointer to an uninitialized tensor transform descriptor.
Returns
-
CUDNN_STATUS_SUCCESS
- The descriptor object was created successfully.
-
CUDNN_STATUS_BAD_PARAM
-
The
transformDesc
isNULL
. -
CUDNN_STATUS_ALLOC_FAILED
- The memory allocation failed.
4.37. cudnnCTCLoss
cudnnStatus_t cudnnCTCLoss(
cudnnHandle_t handle,
const cudnnTensorDescriptor_t probsDesc,
const void *probs,
const int *labels,
const int *labelLengths,
const int *inputLengths,
void *costs,
const cudnnTensorDescriptor_t gradientsDesc,
const void *gradients,
cudnnCTCLossAlgo_t algo,
const cudnnCTCLossDescriptor_t ctcLossDesc,
void *workspace,
size_t *workSpaceSizeInBytes)
This function returns the CTC costs and gradients, given the probabilities and labels.
This function has an inconsistent interface, for example, the probs
input is probability normalized by softmax, but the gradients
output is with respect to the unnormalized activation.
Parameters
-
handle
-
Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.
-
probsDesc
-
Input. Handle to the previously initialized probabilities tensor descriptor. For more information, see cudnnTensorDescriptor_t.
-
probs
-
Input. Pointer to a previously initialized probabilities tensor. These input probabilities are normalized by softmax.
-
labels
-
Input. Pointer to a previously initialized labels list.
-
labelLengths
-
Input. Pointer to a previously initialized lengths list, to walk the above labels list.
-
inputLengths
-
Input. Pointer to a previously initialized list of the lengths of the timing steps in each batch.
-
costs
-
Output. Pointer to the computed costs of CTC.
-
gradientsDesc
-
Input. Handle to a previously initialized gradients tensor descriptor.
-
gradients
-
Output. Pointer to the computed gradients of CTC. These computed gradient outputs are with respect to the unnormalized activation.
-
algo
-
Input. Enumerant that specifies the chosen CTC loss algorithm. For more information, see cudnnCTCLossAlgo_t.
-
ctcLossDesc
-
Input. Handle to the previously initialized CTC loss descriptor. For more information, see cudnnCTCLossDescriptor_t.
-
workspace
-
Input. Pointer to GPU memory of a workspace needed to able to execute the specified algorithm.
-
sizeInBytes
-
Input. Amount of GPU memory needed as workspace to be able to execute the CTC loss computation with the specified
algo
.
Returns
-
CUDNN_STATUS_SUCCESS
-
The query was successful.
-
CUDNN_STATUS_BAD_PARAM
-
At least one of the following conditions are met:
- The dimensions of
probsDesc
do not match the dimensions ofgradientsDesc
. - The
inputLengths
do not agree with the first dimension ofprobsDesc
. - The
workSpaceSizeInBytes
is not sufficient. - The
labelLengths
is greater than 256.
- The dimensions of
-
CUDNN_STATUS_NOT_SUPPORTED
-
A compute or data type other than
FLOAT
was chosen, or an unknown algorithm type was chosen. -
CUDNN_STATUS_EXECUTION_FAILED
-
The function failed to launch on the GPU.
4.38. cudnnDeriveBNTensorDescriptor
cudnnStatus_t cudnnDeriveBNTensorDescriptor(
cudnnTensorDescriptor_t derivedBnDesc,
const cudnnTensorDescriptor_t xDesc,
cudnnBatchNormMode_t mode)
This function derives a secondary tensor descriptor for the batch normalization scale
, invVariance
, bnBias
, and bnScale
subtensors from the layer's x
data descriptor.
Use the tensor descriptor produced by this function as the bnScaleBiasMeanVarDesc
parameter for the cudnnBatchNormalizationForwardInference and cudnnBatchNormalizationForwardTraining functions, and as the bnScaleBiasDiffDesc
parameter in the cudnnBatchNormalizationBackward function.
The resulting dimensions will be:
- 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for
BATCHNORM_MODE_SPATIAL
- 1xCxHxW for 4D and 1xCxDxHxW for 5D for
BATCHNORM_MODE_PER_ACTIVATION
mode
For HALF
input data type the resulting tensor descriptor will have a FLOAT
type. For other data types, it will have the same type as the input data.
- Only 4D and 5D tensors are supported.
- The
derivedBnDesc
should be first created using cudnnCreateTensorDescriptor. xDesc
is the descriptor for the layer'sx
data and has to be setup with proper dimensions prior to calling this function.
Parameters
-
derivedBnDesc
-
Output. Handle to a previously created tensor descriptor.
-
xDesc
-
Input. Handle to a previously created and initialized layer's
x
data descriptor. -
mode
-
Input. Batch normalization layer mode of operation.
Returns
-
CUDNN_STATUS_SUCCESS
-
The computation was performed successfully.
-
CUDNN_STATUS_BAD_PARAM
-
Invalid Batch Normalization mode.
4.39. cudnnDestroy
cudnnStatus_t cudnnDestroy(cudnnHandle_t handle)
This function releases the resources used by the cuDNN handle. This function is usually the last call with a particular handle to the cuDNN handle. Because cudnnCreate
allocates some internal resources, the release of those resources by calling cudnnDestroy
will implicitly call cudaDeviceSynchronize
; therefore, the recommended best practice is to call cudnnCreate/cudnnDestroy
outside of performance-critical code paths.
Parameters
-
handle
-
Input. Pointer to the cuDNN handle to be destroyed.
Returns
-
CUDNN_STATUS_SUCCESS
-
The cuDNN context destruction was successful.
-
CUDNN_STATUS_BAD_PARAM
-
Invalid (
NULL
) pointer supplied.
4.40. cudnnDestroyActivationDescriptor
cudnnStatus_t cudnnDestroyActivationDescriptor(
cudnnActivationDescriptor_t activationDesc)
This function destroys a previously created activation descriptor object.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was destroyed successfully.
4.41. cudnnDestroyAlgorithmDescriptor
cudnnStatus_t cudnnDestroyAlgorithmDescriptor(
cudnnActivationDescriptor_t algorithmDesc)
This function destroys a previously created algorithm descriptor object.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was destroyed successfully.
4.42. cudnnDestroyAlgorithmPerformance
cudnnStatus_t cudnnDestroyAlgorithmPerformance(
cudnnAlgorithmPerformance_t algoPerf)
This function destroys a previously created algorithm descriptor object.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was destroyed successfully.
4.43. cudnnDestroyAttnDescriptor
cudnnStatus_t cudnnDestroyAttnDescriptor(cudnnAttnDescriptor_t attnDesc);
This function destroys the attention descriptor object and releases its memory. The attnDesc
argument can be NULL
. Invoking cudnnDestroyAttnDescriptor()
with a NULL
argument is a no operation (NOP).
The cudnnDestroyAttnDescriptor()
function is not able to detect if the attnDesc
argument holds a valid address. Undefined behavior will occur in case of passing an invalid pointer, not returned by the cudnnCreateAttnDescriptor()
function, or in the double deletion scenario of a valid address.
Parameters
-
attnDesc
- Input. Pointer to the attention descriptor object to be destroyed.
Returns
-
CUDNN_STATUS_SUCCESS
- The descriptor was destroyed successfully.
4.44. cudnnDestroyConvolutionDescriptor
cudnnStatus_t cudnnDestroyConvolutionDescriptor(
cudnnConvolutionDescriptor_t convDesc)
This function destroys a previously created convolution descriptor object.
Returns
-
CUDNN_STATUS_SUCCESS
- The descriptor was destroyed successfully.
4.45. cudnnDestroyCTCLossDescriptor
cudnnStatus_t cudnnDestroyCTCLossDescriptor(
cudnnCTCLossDescriptor_t ctcLossDesc)
This function destroys a CTC loss function descriptor object.
Parameters
-
ctcLossDesc
-
Input. CTC loss function descriptor to be destroyed.
Returns
-
CUDNN_STATUS_SUCCESS
-
The function returned successfully.
4.46. cudnnDestroyDropoutDescriptor
cudnnStatus_t cudnnDestroyDropoutDescriptor(
cudnnDropoutDescriptor_t dropoutDesc)
This function destroys a previously created dropout descriptor object.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was destroyed successfully.
4.47. cudnnDestroyFilterDescriptor
cudnnStatus_t cudnnDestroyFilterDescriptor(
cudnnFilterDescriptor_t filterDesc)
This function destroys a previously created tensor 4D descriptor object.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was destroyed successfully.
4.48. cudnnDestroyFusedOpsConstParamPack
cudnnStatus_t cudnnDestroyFusedOpsConstParamPack(
cudnnFusedOpsConstParamPack_t constPack);
This function destroys a previously-created cudnnFusedOpsConstParamPack_t structure.
Parameters
-
constPack
-
Input. The
cudnnFusedOpsConstParamPack_t
structure that should be destroyed.
Returns
-
CUDNN_STATUS_SUCCESS
- If the descriptor is destroyed successfully.
-
CUDNN_STATUS_INTERNAL_ERROR
- If the ops enum value is not supported or invalid.
4.49. cudnnDestroyFusedOpsPlan
cudnnStatus_t cudnnDestroyFusedOpsPlan(
cudnnFusedOpsPlan_t plan);
This function destroys the plan descriptor provided.
Parameters
-
plan
- Input. The descriptor that should be destroyed by this function.
Returns
-
CUDNN_STATUS_SUCCESS
-
If either the plan descriptor is
NULL
or the descriptor is successfully destroyed.
4.50. cudnnDestroyFusedOpsVariantParamPack
cudnnStatus_t cudnnDestroyFusedOpsVariantParamPack(
cudnnFusedOpsVariantParamPack_t varPack);
This function destroys a previously-created descriptor for cudnnFusedOps
constant parameters.
Parameters
-
varPack
- Input. The descriptor that should be destroyed.
Returns
-
CUDNN_STATUS_SUCCESS
- The descriptor is successfully destroyed.
4.51. cudnnDestroyLRNDescriptor
cudnnStatus_t cudnnDestroyLRNDescriptor(
cudnnLRNDescriptor_t lrnDesc)
This function destroys a previously created LRN descriptor object.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was destroyed successfully.
4.52. cudnnDestroyOpTensorDescriptor
cudnnStatus_t cudnnDestroyOpTensorDescriptor(
cudnnOpTensorDescriptor_t opTensorDesc)
This function deletes a tensor pointwise math descriptor object.
Parameters
-
opTensorDesc
-
Input. Pointer to the structure holding the description of the tensor pointwise math to be deleted.
Returns
-
CUDNN_STATUS_SUCCESS
-
The function returned successfully.
4.53. cudnnDestroyPersistentRNNPlan
cudnnStatus_t cudnnDestroyPersistentRNNPlan(
cudnnPersistentRNNPlan_t plan)
This function destroys a previously created persistent RNN plan object.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was destroyed successfully.
4.54. cudnnDestroyPoolingDescriptor
cudnnStatus_t cudnnDestroyPoolingDescriptor(
cudnnPoolingDescriptor_t poolingDesc)
This function destroys a previously created pooling descriptor object.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was destroyed successfully.
4.55. cudnnDestroyReduceTensorDescriptor
cudnnStatus_t cudnnDestroyReduceTensorDescriptor(
cudnnReduceTensorDescriptor_t tensorDesc)
This function destroys a previously created reduce tensor descriptor object. When the input pointer is NULL
, this function performs no destroy operation.
Parameters
-
tensorDesc
-
Input. Pointer to the reduce tensor descriptor object to be destroyed.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was destroyed successfully.
4.56. cudnnDestroyRNNDataDescriptor
cudnnStatus_t cudnnDestroyRNNDataDescriptor(
cudnnRNNDataDescriptor_t RNNDataDesc)
This function destroys a previously created RNN data descriptor object.
Returns
-
CUDNN_STATUS_SUCCESS
-
The RNN data descriptor object was destroyed successfully.
4.57. cudnnDestroyRNNDescriptor
cudnnStatus_t cudnnDestroyRNNDescriptor(
cudnnRNNDescriptor_t rnnDesc)
This function destroys a previously created RNN descriptor object.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was destroyed successfully.
4.58. cudnnDestroySeqDataDescriptor
cudnnStatus_t cudnnDestroySeqDataDescriptor(cudnnSeqDataDescriptor_t seqDataDesc);
This function destroys the sequence data descriptor object and releases its memory. The seqDataDesc
argument can be NULL
. Invoking cudnnDestroySeqDataDescriptor()
with a NULL
argument is a no operation (NOP).
The cudnnDestroySeqDataDescriptor()
function is not able to detect if the seqDataDesc
argument holds a valid address. Undefined behavior will occur in case of passing an invalid pointer, not returned by the cudnnCreateSeqDataDescriptor()
function, or in the double deletion scenario of a valid address.
Parameters
-
seqDataDesc
- Input. Pointer to the sequence data descriptor object to be destroyed.
Returns
-
CUDNN_STATUS_SUCCESS
- The descriptor was destroyed successfully.
4.59. cudnnDestroySpatialTransformerDescriptor
cudnnStatus_t cudnnDestroySpatialTransformerDescriptor(
cudnnSpatialTransformerDescriptor_t stDesc)
This function destroys a previously created spatial transformer descriptor object.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was destroyed successfully.
4.60. cudnnDestroyTensorDescriptor
cudnnStatus_t cudnnDestroyTensorDescriptor(cudnnTensorDescriptor_t tensorDesc)
This function destroys a previously created tensor descriptor object. When the input pointer is NULL
, this function performs no destroy operation.
Parameters
-
tensorDesc
-
Input. Pointer to the tensor descriptor object to be destroyed.
Returns
-
CUDNN_STATUS_SUCCESS
-
The object was destroyed successfully.
4.61. cudnnDestroyTensorTransformDescriptor
cudnnStatus_t cudnnDestroyTensorTransformDescriptor(
cudnnTensorTransformDescriptor_t transformDesc);
Destroys a previously created tensor transform descriptor.
Parameters
-
transformDesc
- Input. The tensor transform descriptor to be destroyed.
Returns
-
CUDNN_STATUS_SUCCESS
- The descriptor was destroyed successfully.
4.62. cudnnDivisiveNormalizationBackward
cudnnStatus_t cudnnDivisiveNormalizationBackward(
cudnnHandle_t handle,
cudnnLRNDescriptor_t normDesc,
cudnnDivNormMode_t mode,
const void *alpha,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const void *means,
const void *dy,
void *temp,
void *temp2,
const void *beta,
const cudnnTensorDescriptor_t dxDesc,
void *dx,
void *dMeans)
This function performs the backward DivisiveNormalization
layer computation.
Supported tensor formats are NCHW for 4D and NCDHW for 5D with any non-overlapping non-negative strides. Only 4D and 5D tensors are supported.
Parameters
-
handle
-
Input. Handle to a previously created cuDNN library descriptor.
-
normDesc
-
Input. Handle to a previously initialized LRN parameter descriptor (this descriptor is used for both LRN and
DivisiveNormalization
layers). -
mode
-
Input.
DivisiveNormalization
layer mode of operation. Currently onlyCUDNN_DIVNORM_PRECOMPUTED_MEANS
is implemented. Normalization is performed using the means input tensor that is expected to be precomputed by the user. -
alpha
,beta
-
Input. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, see Scaling Parameters in the cuDNN Developer Guide.
-
xDesc
,x
,means
-
Input. Tensor descriptor and pointers in device memory for the layer's x and means data. Note that the
means
tensor is expected to be precomputed by the user. It can also contain any valid values (not required to be actualmeans
, and can be for instance a result of a convolution with a Gaussian kernel). -
dy
-
Input. Tensor pointer in device memory for the layer's
dy
cumulative loss differential data (error backpropagation). -
temp
,temp2
-
Workspace. Temporary tensors in device memory. These are used for computing intermediate values during the backward pass. These tensors do not have to be preserved from forward to backward pass. Both use
xDesc
as a descriptor. -
dxDesc
-
Input. Tensor descriptor for
dx
anddMeans
. -
dx
,dMeans
-
Output. Tensor pointers (in device memory) for the layers resulting cumulative gradients
dx
anddMeans
(dLoss/dx
anddLoss/dMeans
). Both share the same descriptor.
Returns
-
CUDNN_STATUS_SUCCESS
-
The computation was performed successfully.
-
CUDNN_STATUS_BAD_PARAM
-
At least one of the following conditions are met:
- One of the tensor pointers
x, dx, temp, tmep2, dy
isNULL
. - Number of any of the input or output tensor dimensions is not within the [4,5] range.
- Either alpha or beta pointer is
NULL
. - A mismatch in dimensions between
xDesc
anddxDesc
. - LRN descriptor parameters are outside of their valid ranges.
- Any of the tensor strides is negative.
- One of the tensor pointers
-
CUDNN_STATUS_UNSUPPORTED
-
The function does not support the provided configuration, for example, any of the input and output tensor strides mismatch (for the same dimension) is a non-supported configuration.
4.63. cudnnDivisiveNormalizationForward
cudnnStatus_t cudnnDivisiveNormalizationForward(
cudnnHandle_t handle,
cudnnLRNDescriptor_t normDesc,
cudnnDivNormMode_t mode,
const void *alpha,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const void *means,
void *temp,
void *temp2,
const void *beta,
const cudnnTensorDescriptor_t yDesc,
void *y)
This function performs the forward spatial DivisiveNormalization
layer computation. It divides every value in a layer by the standard deviation of its spatial neighbors as described in What is the Best Multi-Stage Architecture for Object Recognition, Jarrett 2009, Local Contrast Normalization Layer section. Note that DivisiveNormalization
only implements the x/max(c, sigma_x)
portion of the computation, where sigma_x
is the variance over the spatial neighborhood of x
. The full LCN (Local Contrastive Normalization) computation can be implemented as a two-step process:
x_m = x-mean(x);
y = x_m/max(c, sigma(x_m));
The x-mean(x)
which is often referred to as "subtractive normalization" portion of the computation can be implemented using cuDNN average pooling layer followed by a call to addTensor
.
Supported tensor formats are NCHW for 4D and NCDHW for 5D with any non-overlapping non-negative strides. Only 4D and 5D tensors are supported.
Parameters
-
handle
-
Input. Handle to a previously created cuDNN library descriptor.
-
normDesc
-
Input. Handle to a previously initialized LRN parameter descriptor. This descriptor is used for both LRN and
DivisiveNormalization
layers. -
divNormMode
-
Input.
DivisiveNormalization
layer mode of operation. Currently onlyCUDNN_DIVNORM_PRECOMPUTED_MEANS
is implemented. Normalization is performed using the means input tensor that is expected to be precomputed by the user. -
alpha
,beta
-
Input. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, see Scaling Parameters in the cuDNN Developer Guide.
-
xDesc
,yDesc
-
Input. Tensor descriptor objects for the input and output tensors. Note that
xDesc
is shared betweenx
,means
,temp
, andtemp2
tensors. -
x
-
Input. Input tensor data pointer in device memory.
-
means
-
Input. Input means tensor data pointer in device memory. Note that this tensor can be
NULL
(in that case its values are assumed to be zero during the computation). This tensor also doesn't have to containmeans
, these can be any values, a frequently used variation is a result of convolution with a normalized positive kernel (such as Gaussian). -
temp
,temp2
-
Workspace. Temporary tensors in device memory. These are used for computing intermediate values during the forward pass. These tensors do not have to be preserved as inputs from forward to the backward pass. Both use
xDesc
as their descriptor. -
y
-
Output. Pointer in device memory to a tensor for the result of the forward
DivisiveNormalization
computation.
Returns
-
CUDNN_STATUS_SUCCESS
-
The computation was performed successfully.
-
CUDNN_STATUS_BAD_PARAM
-
At least one of the following conditions are met:
- One of the tensor pointers
x, y, temp, temp2
isNULL
. - Number of input tensor or output tensor dimensions is outside of [4,5] range.
- A mismatch in dimensions between any two of the input or output tensors.
- For in-place computation when pointers
x == y
, a mismatch in strides between the input data and output data tensors. - Alpha or beta pointer is
NULL
. - LRN descriptor parameters are outside of their valid ranges.
- Any of the tensor strides are negative.
- One of the tensor pointers
-
CUDNN_STATUS_UNSUPPORTED
-
The function does not support the provided configuration, for example, any of the input and output tensor strides mismatch (for the same dimension) is a non-supported configuration.
4.64. cudnnDropoutBackward
cudnnStatus_t cudnnDropoutBackward(
cudnnHandle_t handle,
const cudnnDropoutDescriptor_t dropoutDesc,
const cudnnTensorDescriptor_t dydesc,
const void *dy,
const cudnnTensorDescriptor_t dxdesc,
void *dx,
void *reserveSpace,
size_t reserveSpaceSizeInBytes)
This function performs backward dropout operation over dy
returning results in dx
. If during forward dropout operation value from x
was propagated to y
then during backward operation value from dy
will be propagated to dx
, otherwise, dx
value will be set to 0
.
Better performance is obtained for fully packed tensors.
Parameters
-
handle
-
Input. Handle to a previously created cuDNN context.
-
dropoutDesc
-
Input. Previously created dropout descriptor object.
-
dyDesc
-
Input. Handle to a previously initialized tensor descriptor.
-
dy
-
Input. Pointer to data of the tensor described by the
dyDesc
descriptor. -
dxDesc
-
Input. Handle to a previously initialized tensor descriptor.
-
dx
-
Output. Pointer to data of the tensor described by the
dxDesc
descriptor. -
reserveSpace
-
Input. Pointer to user-allocated GPU memory used by this function. It is expected that
reserveSpace
was populated during a call tocudnnDropoutForward
and has not been changed. -
reserveSpaceSizeInBytes
-
Input. Specifies the size in bytes of the provided memory for the reserve space
Returns
-
CUDNN_STATUS_SUCCESS
-
The call was successful.
-
CUDNN_STATUS_NOT_SUPPORTED
-
The function does not support the provided configuration.
-
CUDNN_STATUS_BAD_PARAM
-
At least one of the following conditions are met:
- The number of elements of input tensor and output tensors differ.
- The
datatype
of the input tensor and output tensors differs. - The strides of the input tensor and output tensors differ and in-place operation is used (i.e.,
x
andy
pointers are equal). - The provided
reserveSpaceSizeInBytes
is less then the value returned bycudnnDropoutGetReserveSpaceSize
. cudnnSetDropoutDescriptor
has not been called ondropoutDesc
with the non-NULL
states
argument.
-
CUDNN_STATUS_EXECUTION_FAILED
-
The function failed to launch on the GPU.
4.65. cudnnDropoutForward
cudnnStatus_t cudnnDropoutForward(
cudnnHandle_t handle,
const cudnnDropoutDescriptor_t dropoutDesc,
const cudnnTensorDescriptor_t xdesc,
const void *x,
const cudnnTensorDescriptor_t ydesc,
void *y,
void *reserveSpace,
size_t reserveSpaceSizeInBytes)
This function performs forward dropout operation ove