Abstract

This is the API documentation for the cuDNN library. This API Guide consists of the cuDNN datatype reference chapter which describes the types of enums and the cuDNN API reference chapter which describes all routines in the cuDNN library API. The cuDNN API is a context-based API that allows for easy multithreading and (optional) interoperability with CUDA streams.

For previously released cuDNN developer documentation, see cuDNN Archives.

1. Introduction

cuDNN offers a context-based API that allows for easy multithreading and (optional) interoperability with CUDA streams. The cuDNN Datatypes Reference API describes all the types and enums of the cuDNN library API. The cuDNN API Reference describes the API of all the routines in the cuDNN library.

2. cuDNN Datatypes Reference

This chapter describes all the types and enums of the cuDNN library API.

2.1. cudnnActivationDescriptor_t

cudnnActivationDescriptor_t is a pointer to an opaque structure holding the description of an activation operation. cudnnCreateActivationDescriptor() is used to create one instance, and cudnnSetActivationDescriptor() must be used to initialize this instance.

2.2. cudnnActivationMode_t

cudnnActivationMode_t is an enumerated type used to select the neuron activation function used in cudnnActivationForward(), cudnnActivationBackward(), and cudnnConvolutionBiasActivationForward().

Values

CUDNN_ACTIVATION_SIGMOID

Selects the sigmoid function.

CUDNN_ACTIVATION_RELU

Selects the rectified linear function.

CUDNN_ACTIVATION_TANH

Selects the hyperbolic tangent function.

CUDNN_ACTIVATION_CLIPPED_RELU

Selects the clipped rectified linear function.

CUDNN_ACTIVATION_ELU

Selects the exponential linear function.

CUDNN_ACTIVATION_IDENTITY

Selects the identity function, intended for bypassing the activation step in cudnnConvolutionBiasActivationForward(). (The cudnnConvolutionBiasActivationForward() function must use CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_​PRECOMP_GEMM.) Does not work with cudnnActivationForward() or cudnnActivationBackward().

2.3. cudnnAttnDescriptor_t

cudnnAttnDescriptor_t is a pointer to an opaque structure holding parameters of the multi-head attention layer such as:
  • weight and bias tensor shapes (vector lengths before and after linear projections)
  • parameters that can be set in advance and do not change when invoking functions to evaluate forward responses and gradients (number of attention heads, softmax smoothing/sharpening coefficient)
  • other settings that are necessary to compute temporary buffer sizes.

Use the cudnnCreateAttnDescriptor() function to create an instance of the attention descriptor object and cudnnDestroyAttnDescriptor() to delete the previously created descriptor. Use the cudnnSetAttnDescriptor() function to configure the descriptor.

2.4. cudnnBatchNormMode_t

cudnnBatchNormMode_t is an enumerated type used to specify the mode of operation in cudnnBatchNormalizationForwardInference(), cudnnBatchNormalizationForwardTraining(), cudnnBatchNormalizationBackward() and cudnnDeriveBNTensorDescriptor() routines.

Values

CUDNN_BATCHNORM_PER_ACTIVATION

Normalization is performed per-activation. This mode is intended to be used after the non-convolutional network layers. In this mode, the tensor dimensions of bnBias and bnScale and the parameters used in the cudnnBatchNormalization* functions, are 1xCxHxW.

CUDNN_BATCHNORM_SPATIAL

Normalization is performed over N+spatial dimensions. This mode is intended for use after convolutional layers (where spatial invariance is desired). In this mode the bnBias and bnScale tensor dimensions are 1xCx1x1.

CUDNN_BATCHNORM_SPATIAL_PERSISTENT

This mode is similar to CUDNN_BATCHNORM_SPATIAL but it can be faster for some tasks.

An optimized path may be selected for CUDNN_DATA_FLOAT and CUDNN_DATA_HALF types, compute capability 6.0 or higher for the following two batch normalization API calls: cudnnBatchNormalizationForwardTraining(), and cudnnBatchNormalizationBackward(). In the case of cudnnBatchNormalizationBackward(), the savedMean and savedInvVariance arguments should not be NULL.

The rest of this section applies to NCHW mode only:

This mode may use a scaled atomic integer reduction that is deterministic but imposes more restrictions on the input data range. When a numerical overflow occurs, the algorithm may produce NaN-s or Inf-s (infinity) in output buffers.

When Inf-s/NaN-s are present in the input data, the output in this mode is the same as from a pure floating-point implementation.

For finite but very large input values, the algorithm may encounter overflows more frequently due to a lower dynamic range and emit Inf-s/NaN-s while CUDNN_BATCHNORM_SPATIAL will produce finite results. The user can invoke cudnnQueryRuntimeError() to check if a numerical overflow occurred in this mode.

2.5. cudnnBatchNormOps_t

cudnnBatchNormOps_t is an enumerated type used to specify the mode of operation in cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize(), cudnnBatchNormalizationForwardTrainingEx(), cudnnGetBatchNormalizationBackwardExWorkspaceSize(), cudnnBatchNormalizationBackwardEx(), and cudnnGetBatchNormalizationTrainingExReserveSpaceSize() functions.

Values

CUDNN_BATCHNORM_OPS_BN

Only batch normalization is performed, per-activation.

CUDNN_BATCHNORM_OPS_BN_ACTIVATION

First, the batch normalization is performed, and then the activation is performed.

CUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION

Performs the batch normalization, then element-wise addition, followed by the activation operation.

2.6. cudnnConvolutionBwdDataAlgo_t

cudnnConvolutionBwdDataAlgo_t is an enumerated type that exposes the different algorithms available to execute the backward data convolution operation.

Values

CUDNN_CONVOLUTION_BWD_DATA_ALGO_0

This algorithm expresses the convolution as a sum of matrix product without actually explicitly form the matrix that holds the input tensor data. The sum is done using atomic adds operation, thus the results are non-deterministic.

CUDNN_CONVOLUTION_BWD_DATA_ALGO_1

This algorithm expresses the convolution as a matrix product without actually explicitly form the matrix that holds the input tensor data. The results are deterministic.

CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT

This algorithm uses a Fast-Fourier Transform approach to compute the convolution. A significant memory workspace is needed to store intermediate results. The results are deterministic.

CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING

This algorithm uses the Fast-Fourier Transform approach but splits the inputs into tiles. A significant memory workspace is needed to store intermediate results but less than CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT for large size images. The results are deterministic.

CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD

This algorithm uses the Winograd Transform approach to compute the convolution. A reasonably sized workspace is needed to store intermediate results. The results are deterministic.

CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD_NONFUSED

This algorithm uses the Winograd Transform approach to compute the convolution. A significant workspace may be needed to store intermediate results. The results are deterministic.

2.7. cudnnConvolutionBwdDataAlgoPerf_t

cudnnConvolutionBwdDataAlgoPerf_t is a structure containing performance results returned by cudnnFindConvolutionBackwardDataAlgorithm() or heuristic results returned by cudnnGetConvolutionBackwardDataAlgorithm_v7().

Data Members

cudnnConvolutionBwdDataAlgo_t algo

The algorithm runs to obtain the associated performance metrics.

cudnnStatus_t status

If any error occurs during the workspace allocation or timing of cudnnConvolutionBackwardData(), this status will represent that error. Otherwise, this status will be the return status of cudnnConvolutionBackwardData().

  • CUDNN_STATUS_ALLOC_FAILED if any error occurred during workspace allocation or if the provided workspace is insufficient.
  • CUDNN_STATUS_INTERNAL_ERROR if any error occurred during timing calculations or workspace deallocation.
  • Otherwise, this will be the return status of cudnnConvolutionBackwardData().
float time

The execution time of cudnnConvolutionBackwardData() (in milliseconds).

size_t memory

The workspace size (in bytes).

cudnnDeterminism_t determinism

The determinism of the algorithm.

cudnnMathType_t mathType

The math type provided to the algorithm.

int reserved[3]

Reserved space for future properties.

2.8. cudnnConvolutionBwdDataPreference_t

cudnnConvolutionBwdDataPreference_t is an enumerated type used by cudnnGetConvolutionBackwardDataAlgorithm() to help the choice of the algorithm used for the backward data convolution.

Values

CUDNN_CONVOLUTION_BWD_DATA_NO_WORKSPACE

In this configuration, the routine cudnnGetConvolutionBackwardDataAlgorithm() is guaranteed to return an algorithm that does not require any extra workspace to be provided by the user.

CUDNN_CONVOLUTION_BWD_DATA_PREFER_FASTEST

In this configuration, the routine cudnnGetConvolutionBackwardDataAlgorithm() will return the fastest algorithm regardless of how much workspace is needed to execute it.

CUDNN_CONVOLUTION_BWD_DATA_SPECIFY_WORKSPACE_LIMIT

In this configuration, the routine cudnnGetConvolutionBackwardDataAlgorithm() will return the fastest algorithm that fits within the memory limit that the user provided.

2.9. cudnnConvolutionBwdFilterAlgo_t

cudnnConvolutionBwdFilterAlgo_t is an enumerated type that exposes the different algorithms available to execute the backward filter convolution operation.

Values

CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0

This algorithm expresses the convolution as a sum of matrix product without actually explicitly form the matrix that holds the input tensor data. The sum is done using atomic adds operation, thus the results are non-deterministic.

CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1

This algorithm expresses the convolution as a matrix product without actually explicitly form the matrix that holds the input tensor data. The results are deterministic.

CUDNN_CONVOLUTION_BWD_FILTER_ALGO_FFT

This algorithm uses the Fast-Fourier Transform approach to compute the convolution. A significant workspace is needed to store intermediate results. The results are deterministic.

CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3

This algorithm is similar to CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 but uses some small workspace to precomputes some indices. The results are also non-deterministic.

CUDNN_CONVOLUTION_BWD_FILTER_WINOGRAD_NONFUSED

This algorithm uses the Winograd Transform approach to compute the convolution. A significant workspace may be needed to store intermediate results. The results are deterministic.

CUDNN_CONVOLUTION_BWD_FILTER_ALGO_FFT_TILING

This algorithm uses the Fast-Fourier Transform approach to compute the convolution but splits the input tensor into tiles. A significant workspace may be needed to store intermediate results. The results are deterministic.

2.10. cudnnConvolutionBwdFilterAlgoPerf_t

cudnnConvolutionBwdFilterAlgoPerf_t is a structure containing performance results returned by cudnnFindConvolutionBackwardFilterAlgorithm() or heuristic results returned by cudnnGetConvolutionBackwardFilterAlgorithm_v7().

Data Members

cudnnConvolutionBwdFilterAlgo_t algo

The algorithm runs to obtain the associated performance metrics.

cudnnStatus_t status

If any error occurs during the workspace allocation or timing of cudnnConvolutionBackwardFilter(), this status will represent that error. Otherwise, this status will be the return status of cudnnConvolutionBackwardFilter().

  • CUDNN_STATUS_ALLOC_FAILED if any error occurred during workspace allocation or if the provided workspace is insufficient.
  • CUDNN_STATUS_INTERNAL_ERROR if any error occurred during timing calculations or workspace deallocation.
  • Otherwise, this will be the return status of cudnnConvolutionBackwardFilter().
float time

The execution time of cudnnConvolutionBackwardFilter() (in milliseconds).

size_t memory

The workspace size (in bytes).

cudnnDeterminism_t determinism

The determinism of the algorithm.

cudnnMathType_t mathType

The math type provided to the algorithm.

int reserved[3]

Reserved space for future properties.

2.11. cudnnConvolutionBwdFilterPreference_t

cudnnConvolutionBwdFilterPreference_t is an enumerated type used by cudnnGetConvolutionBackwardFilterAlgorithm() to help the choice of the algorithm used for the backward filter convolution.

Values

CUDNN_CONVOLUTION_BWD_FILTER_NO_WORKSPACE

In this configuration, the routine cudnnGetConvolutionBackwardFilterAlgorithm() is guaranteed to return an algorithm that does not require any extra workspace to be provided by the user.

CUDNN_CONVOLUTION_BWD_FILTER_PREFER_FASTEST

In this configuration, the routine cudnnGetConvolutionBackwardFilterAlgorithm() will return the fastest algorithm regardless of how much workspace is needed to execute it.

CUDNN_CONVOLUTION_BWD_FILTER_SPECIFY_WORKSPACE_LIMIT

In this configuration, the routine cudnnGetConvolutionBackwardFilterAlgorithm() will return the fastest algorithm that fits within the memory limit that the user provided.

2.12. cudnnConvolutionDescriptor_t

cudnnConvolutionDescriptor_t is a pointer to an opaque structure holding the description of a convolution operation. cudnnCreateConvolutionDescriptor() is used to create one instance, and cudnnSetConvolutionNdDescriptor() or cudnnSetConvolution2dDescriptor() must be used to initialize this instance.

2.13. cudnnConvolutionFwdAlgo_t

cudnnConvolutionFwdAlgo_t is an enumerated type that exposes the different algorithms available to execute the forward convolution operation.

Values

CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM

This algorithm expresses the convolution as a matrix product without actually explicitly form the matrix that holds the input tensor data.

CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM

This algorithm expresses convolution as a matrix product without actually explicitly form the matrix that holds the input tensor data, but still needs some memory workspace to precompute some indices in order to facilitate the implicit construction of the matrix that holds the input tensor data.

CUDNN_CONVOLUTION_FWD_ALGO_GEMM

This algorithm expresses the convolution as an explicit matrix product. A significant memory workspace is needed to store the matrix that holds the input tensor data.

CUDNN_CONVOLUTION_FWD_ALGO_DIRECT

This algorithm expresses the convolution as a direct convolution (for example, without implicitly or explicitly doing a matrix multiplication).

CUDNN_CONVOLUTION_FWD_ALGO_FFT

This algorithm uses the Fast-Fourier Transform approach to compute the convolution. A significant memory workspace is needed to store intermediate results.

CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING

This algorithm uses the Fast-Fourier Transform approach but splits the inputs into tiles. A significant memory workspace is needed to store intermediate results but less than CUDNN_CONVOLUTION_FWD_ALGO_FFT for large size images.

CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD

This algorithm uses the Winograd Transform approach to compute the convolution. A reasonably sized workspace is needed to store intermediate results.

CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED

This algorithm uses the Winograd Transform approach to compute the convolution. A significant workspace may be needed to store intermediate results.

2.14. cudnnConvolutionFwdAlgoPerf_t

cudnnConvolutionFwdAlgoPerf_t is a structure containing performance results returned by cudnnFindConvolutionForwardAlgorithm() or heuristic results returned by cudnnGetConvolutionForwardAlgorithm_v7().

Data Members

cudnnConvolutionFwdAlgo_t algo

The algorithm runs to obtain the associated performance metrics.

cudnnStatus_t status

If any error occurs during the workspace allocation or timing of cudnnConvolutionForward(), this status will represent that error. Otherwise, this status will be the return status of cudnnConvolutionForward().

  • CUDNN_STATUS_ALLOC_FAILED if any error occurred during workspace allocation or if the provided workspace is insufficient.
  • CUDNN_STATUS_INTERNAL_ERROR if any error occurred during timing calculations or workspace deallocation.
  • Otherwise, this will be the return status of cudnnConvolutionForward().
float time

The execution time of cudnnConvolutionForward() (in milliseconds).

size_t memory

The workspace size (in bytes).

cudnnDeterminism_t determinism

The determinism of the algorithm.

cudnnMathType_t mathType

The math type provided to the algorithm.

int reserved[3]

Reserved space for future properties.

2.15. cudnnConvolutionFwdPreference_t

cudnnConvolutionFwdPreference_t is an enumerated type used by cudnnGetConvolutionForwardAlgorithm() to help the choice of the algorithm used for the forward convolution.

Values

CUDNN_CONVOLUTION_FWD_NO_WORKSPACE

In this configuration, the routine cudnnGetConvolutionForwardAlgorithm() is guaranteed to return an algorithm that does not require any extra workspace to be provided by the user.

CUDNN_CONVOLUTION_FWD_PREFER_FASTEST

In this configuration, the routine cudnnGetConvolutionForwardAlgorithm() will return the fastest algorithm regardless of how much workspace is needed to execute it.

CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT

In this configuration, the routine cudnnGetConvolutionForwardAlgorithm() will return the fastest algorithm that fits within the memory limit that the user provided.

2.16. cudnnConvolutionMode_t

cudnnConvolutionMode_t is an enumerated type used by cudnnSetConvolution2dDescriptor() to configure a convolution descriptor. The filter used for the convolution can be applied in two different ways, corresponding mathematically to a convolution or to a cross-correlation. (A cross-correlation is equivalent to a convolution with its filter rotated by 180 degrees.)

Values

CUDNN_CONVOLUTION

In this mode, a convolution operation will be done when applying the filter to the images.

CUDNN_CROSS_CORRELATION

In this mode, a cross-correlation operation will be done when applying the filter to the images.

2.17. cudnnCTCLossAlgo_t

cudnnCTCLossAlgo_t is an enumerated type that exposes the different algorithms available to execute the CTC loss operation.

Values

CUDNN_CTC_LOSS_ALGO_DETERMINISTIC

Results are guaranteed to be reproducible

CUDNN_CTC_LOSS_ALGO_NON_DETERMINISTIC

Results are not guaranteed to be reproducible

2.18. cudnnCTCLossDescriptor_t

cudnnCTCLossDescriptor_t is a pointer to an opaque structure holding the description of a CTC loss operation. cudnnCreateCTCLossDescriptor() is used to create one instance, cudnnSetCTCLossDescriptor() is used to initialize this instance, and cudnnDestroyCTCLossDescriptor() is used to destroy this instance.

2.19. cudnnDataType_t

cudnnDataType_t is an enumerated type indicating the data type to which a tensor descriptor or filter descriptor refers.

Values

CUDNN_DATA_FLOAT

The data is a 32-bit single-precision floating-point (float).

CUDNN_DATA_DOUBLE

The data is a 64-bit double-precision floating-point (double).

CUDNN_DATA_HALF

The data is a 16-bit floating-point.

CUDNN_DATA_INT8

The data is an 8-bit signed integer.

CUDNN_DATA_UINT8

The data is an 8-bit unsigned integer.

CUDNN_DATA_INT32

The data is a 32-bit signed integer.

CUDNN_DATA_INT8x4

The data is 32-bit elements each composed of 4 8-bit signed integers. This data type is only supported with tensor format CUDNN_TENSOR_NCHW_VECT_C.

CUDNN_DATA_INT8x32

The data is 32-element vectors, each element being an 8-bit signed integer. This data type is only supported with the tensor format CUDNN_TENSOR_NCHW_VECT_C. Moreover, this data type can only be used with algo 1, meaning, CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_​PRECOMP_GEMM. For more information, see cudnnConvolutionFwdAlgo_t.

CUDNN_DATA_UINT8x4

The data is 32-bit elements each composed of 4 8-bit unsigned integers. This data type is only supported with tensor format CUDNN_TENSOR_NCHW_VECT_C.

2.20. cudnnDeterminism_t

cudnnDeterminism_t is an enumerated type used to indicate if the computed results are deterministic (reproducible). For more information, see Reproducibility (determinism).

Values

CUDNN_NON_DETERMINISTIC

Results are not guaranteed to be reproducible.

CUDNN_DETERMINISTIC

Results are guaranteed to be reproducible.

2.21. cudnnDirectionMode_t

cudnnDirectionMode_t is an enumerated type used to specify the recurrence pattern in the cudnnRNNForwardInference(), cudnnRNNForwardTraining(), cudnnRNNBackwardData() and cudnnRNNBackwardWeights() routines.

Values

CUDNN_UNIDIRECTIONAL
The network iterates recurrently from the first input to the last.
CUDNN_BIDIRECTIONAL
Each layer of the network iterates recurrently from the first input to the last and separately from the last input to the first. The outputs of the two are concatenated at each iteration giving the output of the layer.

2.22. cudnnDivNormMode_t

cudnnDivNormMode_t is an enumerated type used to specify the mode of operation in cudnnDivisiveNormalizationForward() and cudnnDivisiveNormalizationBackward().

Values

CUDNN_DIVNORM_PRECOMPUTED_MEANS
The means tensor data pointer is expected to contain means or other kernel convolution values precomputed by the user. The means pointer can also be NULL, in that case, it's considered to be filled with zeroes. This is equivalent to spatial LRN.
Note: In the backward pass, the means are treated as independent inputs and the gradient over means is computed independently. In this mode, to yield a net gradient over the entire LCN computational graph, the destDiffMeans result should be backpropagated through the user's means layer (which can be implemented using average pooling) and added to the destDiffData tensor produced by cudnnDivisiveNormalizationBackward().

2.23. cudnnDropoutDescriptor_t

cudnnDropoutDescriptor_t is a pointer to an opaque structure holding the description of a dropout operation. cudnnCreateDropoutDescriptor() is used to create one instance, cudnnSetDropoutDescriptor() is used to initialize this instance, cudnnDestroyDropoutDescriptor() is used to destroy this instance, cudnnGetDropoutDescriptor() is used to query fields of a previously initialized instance, cudnnRestoreDropoutDescriptor() is used to restore an instance to a previously saved off state.

2.24. cudnnErrQueryMode_t

cudnnErrQueryMode_t is an enumerated type passed to cudnnQueryRuntimeError() to select the remote kernel error query mode.

Values

CUDNN_ERRQUERY_RAWCODE

Read the error storage location regardless of the kernel completion status.

CUDNN_ERRQUERY_NONBLOCKING

Report if all tasks in the user stream of the cuDNN handle were completed. If that is the case, report the remote kernel error code.

CUDNN_ERRQUERY_BLOCKING

Wait for all tasks to complete in the user stream before reporting the remote kernel error code.

2.25. cudnnFilterDescriptor_t

cudnnFilterDescriptor_t is a pointer to an opaque structure holding the description of a filter dataset. cudnnCreateFilterDescriptor() is used to create one instance, and cudnnSetFilter4dDescriptor() or cudnnSetFilterNdDescriptor() must be used to initialize this instance.

2.26. cudnnFoldingDirection_t

cudnnFoldingDirection_t is an enumerated type used to select the folding direction. For more information, see cudnnTensorTransformDescriptor_t.

Data Member

CUDNN_TRANSFORM_FOLD = 0U

Selects folding.

CUDNN_TRANSFORM_UNFOLD = 1U

Selects unfolding.

2.27. cudnnFusedOps_t

The cudnnFusedOps_t type is an enumerated type to select a specific sequence of computations to perform in the fused operations.

Member Description
CUDNN_FUSED_SCALE_BIAS_ACTIVATION_CONV_BNSTATS = 0 On a per-channel basis, performs these operations in this order: scale, add bias, activation, convolution, and generate batchnorm statistics.
CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD = 1 On a per-channel basis, performs these operations in this order: scale, add bias, activation, convolution backward weights, and generate batchnorm statistics.
CUDNN_FUSED_BN_FINALIZE_STATISTICS_TRAINING = 2 Computes the equivalent scale and bias from ySum, ySqSum and learned scale, bias.

Optionally update running statistics and generate saved stats

CUDNN_FUSED_BN_FINALIZE_STATISTICS_INFERENCE = 3 Computes the equivalent scale and bias from the learned running statistics and the learned scale, bias.
CUDNN_FUSED_CONV_SCALE_BIAS_ADD_ACTIVATION = 4 On a per-channel basis, performs these operations in this order: convolution, scale, add bias, element-wise addition with another tensor, and activation.
CUDNN_FUSED_SCALE_BIAS_ADD_ACTIVATION_GEN_BITMASK = 5 On a per-channel basis, performs these operations in this order: scale and bias on one tensor, scale, and bias on a second tensor, element-wise addition of these two tensors, and on the resulting tensor perform activation, and generate activation bit mask.
CUDNN_FUSED_DACTIVATION_FORK_DBATCHNORM = 6 On a per-channel basis, performs these operations in this order: backward activation, fork (meaning, write out gradient for the residual branch), and backward batch norm.

2.28. cudnnFusedOpsConstParamLabel_t

The cudnnFusedOpsConstParamLabel_t is an enumerated type for the selection of the type of the cudnnFusedOps descriptor. For more information, see cudnnSetFusedOpsConstParamPackAttribute().

typedef enum {
	CUDNN_PARAM_XDESC                          = 0,
	CUDNN_PARAM_XDATA_PLACEHOLDER              = 1,
	CUDNN_PARAM_BN_MODE                        = 2,
	CUDNN_PARAM_BN_EQSCALEBIAS_DESC            = 3,
	CUDNN_PARAM_BN_EQSCALE_PLACEHOLDER         = 4,
	CUDNN_PARAM_BN_EQBIAS_PLACEHOLDER          = 5,
	CUDNN_PARAM_ACTIVATION_DESC                = 6,
	CUDNN_PARAM_CONV_DESC                      = 7,
	CUDNN_PARAM_WDESC                          = 8,
	CUDNN_PARAM_WDATA_PLACEHOLDER              = 9,
	CUDNN_PARAM_DWDESC                         = 10,
	CUDNN_PARAM_DWDATA_PLACEHOLDER             = 11,
	CUDNN_PARAM_YDESC                          = 12,
	CUDNN_PARAM_YDATA_PLACEHOLDER              = 13,
	CUDNN_PARAM_DYDESC                         = 14,
	CUDNN_PARAM_DYDATA_PLACEHOLDER             = 15,
	CUDNN_PARAM_YSTATS_DESC                    = 16,
	CUDNN_PARAM_YSUM_PLACEHOLDER               = 17,
	CUDNN_PARAM_YSQSUM_PLACEHOLDER             = 18,
	CUDNN_PARAM_BN_SCALEBIAS_MEANVAR_DESC      = 19,
	CUDNN_PARAM_BN_SCALE_PLACEHOLDER           = 20,
	CUDNN_PARAM_BN_BIAS_PLACEHOLDER            = 21,
	CUDNN_PARAM_BN_SAVED_MEAN_PLACEHOLDER      = 22,
	CUDNN_PARAM_BN_SAVED_INVSTD_PLACEHOLDER    = 23,
	CUDNN_PARAM_BN_RUNNING_MEAN_PLACEHOLDER    = 24,
	CUDNN_PARAM_BN_RUNNING_VAR_PLACEHOLDER     = 25,
	CUDNN_PARAM_ZDESC                          = 26,
	CUDNN_PARAM_ZDATA_PLACEHOLDER              = 27,
	CUDNN_PARAM_BN_Z_EQSCALEBIAS_DESC          = 28,
	CUDNN_PARAM_BN_Z_EQSCALE_PLACEHOLDER       = 29,
	CUDNN_PARAM_BN_Z_EQBIAS_PLACEHOLDER        = 30,
	CUDNN_PARAM_ACTIVATION_BITMASK_DESC        = 31,
	CUDNN_PARAM_ACTIVATION_BITMASK_PLACEHOLDER = 32,
	CUDNN_PARAM_DXDESC                         = 33,
	CUDNN_PARAM_DXDATA_PLACEHOLDER             = 34,
	CUDNN_PARAM_DZDESC                         = 35,
	CUDNN_PARAM_DZDATA_PLACEHOLDER             = 36,
	CUDNN_PARAM_BN_DSCALE_PLACEHOLDER          = 37,
	CUDNN_PARAM_BN_DBIAS_PLACEHOLDER           = 38,
	} cudnnFusedOpsConstParamLabel_t;
Short-form used Stands for
Setter cudnnSetFusedOpsConstParamPackAttribute()
Getter cudnnGetFusedOpsConstParamPackAttribute()
X_PointerPlaceHolder_t cudnnFusedOpsPointerPlaceHolder_t
X_ prefix in the Attribute column Stands for CUDNN_PARAM_ in the enumerator name
Table 1. CUDNN_FUSED_SCALE_BIAS_ACTIVATION_CONV_BNSTATS
For the attribute CUDNN_FUSED_SCALE_BIAS_ACTIVATION_CONV_BNSTATS in cudnnFusedOp_t
Attribute Expected Descriptor Type Passed in, in the Setter Description Default Value After Creation
X_XDESC In the setter, the *param should be xDesc, a pointer to a previously initialized cudnnTensorDescriptor_t. Tensor descriptor describing the size, layout, and datatype of the x (input) tensor. NULL
X_XDATA_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether xData pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *. CUDNN_PTR_NULL
X_BN_MODE In the setter, the *param should be a pointer to a previously initialized cudnnBatchNormMode_t*. Describes the mode of operation for the scale, bias and the statistics.

As of cuDNN 7.6.0, only CUDNN_BATCHNORM_SPATIAL and CUDNN_BATCHNORM_SPATIAL_PERSISTENT are supported, meaning, scale, bias, and statistics are all per-channel.

CUDNN_BATCHNORM_PER_ACTIVATION
X_BN_EQSCALEBIAS_DESC In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t. Tensor descriptor describing the size, layout, and datatype of the batchNorm equivalent scale and bias tensors. The shapes must match the mode specified in CUDNN_PARAM_BN_MODE. If set to NULL, both scale and bias operation will become a NOP. NULL
X_BN_EQSCALE_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether batchnorm equivalent scale pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *.

If set to CUDNN_PTR_NULL, then the scale operation becomes a NOP.

CUDNN_PTR_NULL
X_BN_EQBIAS_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether batchnorm equivalent bias pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *.

If set to CUDNN_PTR_NULL, then the bias operation becomes a NOP.

CUDNN_PTR_NULL
X_ACTIVATION_DESC In the setter, the *param should be a pointer to a previously initialized cudnnActivationDescriptor_t*. Describes the activation operation.

As of 7.6.0, only activation mode of CUDNN_ACTIVATION_RELU and CUDNN_ACTIVATION_IDENTITY are supported. If set to NULL or if the activation mode is set to CUDNN_ACTIVATION_IDENTITY, then the activation in the op sequence becomes a NOP.

NULL
X_CONV_DESC In the setter, the *param should be a pointer to a previously initialized cudnnConvolutionDescriptor_t*. Describes the convolution operation. NULL
X_WDESC In the setter, the *param should be a pointer to a previously initialized cudnnFilterDescriptor_t*. Filter descriptor describing the size, layout and datatype of the w (filter) tensor. NULL
X_WDATA_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether w (filter) tensor pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *. CUDNN_PTR_NULL
X_YDESC In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t*. Tensor descriptor describing the size, layout and datatype of the y (output) tensor. NULL
X_YDATA_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether y (output) tensor pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *. CUDNN_PTR_NULL
X_YSTATS_DESC In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t*. Tensor descriptor describing the size, layout and datatype of the sum of y and sum of y square tensors. The shapes need to match the mode specified in CUDNN_PARAM_BN_MODE.

If set to NULL, the y statistics generation operation will be become a NOP.

NULL
X_YSUM_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether sum of y pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *.

If set to CUDNN_PTR_NULL, the y statistics generation operation will be become a NOP.

CUDNN_PTR_NULL
X_YSQSUM_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether sum of y square pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *.

If set to CUDNN_PTR_NULL, the y statistics generation operation will be become a NOP.

CUDNN_PTR_NULL
Note:
  • If the corresponding pointer placeholder in ConstParamPack is set to CUDNN_PTR_NULL, then the device pointer in the VariantParamPack need to be NULL as well.
  • If the corresponding pointer placeholder in ConstParamPack is set to CUDNN_PTR_ELEM_ALIGNED or CUDNN_PTR_16B_ALIGNED, then the device pointer in the VariantParamPack may not be NULL and need to be at least element-aligned or 16 bytes-aligned, respectively.
As of cuDNN 7.6.0, if the conditions in Table 2 are met, then the fully fused fast path will be triggered. Otherwise, a slower partially fused path will be triggered.
Table 2. Conditions for Fully Fused Fast Path (Forward)
Parameter Condition
Device compute capability Need to be one of 7.0, 7.2 or 7.5.
CUDNN_PARAM_XDESC

CUDNN_PARAM_XDATA_PLACEHOLDER

Tensor is 4 dimensional

Datatype is CUDNN_DATA_HALF

Layout is NHWC fully packed

Alignment is CUDNN_PTR_16B_ALIGNED

Tensor’s C dimension is a multiple of 8.

CUDNN_PARAM_BN_EQSCALEBIAS_DESC

CUDNN_PARAM_BN_EQSCALE_PLACEHOLDER

CUDNN_PARAM_BN_EQBIAS_PLACEHOLDER

If either one of scale and bias operation is not turned into a NOP:

Tensor is 4 dimensional with shape 1xCx1x1

Datatype is CUDNN_DATA_HALF

Layout is fully packed

Alignment is CUDNN_PTR_16B_ALIGNED

CUDNN_PARAM_CONV_DESC

CUDNN_PARAM_WDESC

CUDNN_PARAM_WDATA_PLACEHOLDER

Convolution descriptor’s mode needs to be CUDNN_CROSS_CORRELATION.

Convolution descriptor’s dataType needs to be CUDNN_DATA_FLOAT.

Convolution descriptor’s dilationA is (1,1).

Convolution descriptor’s group count needs to be 1.

Convolution descriptor’s mathType needs to be CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION.

Filter is in NHWC layout

Filter’s data type is CUDNN_DATA_HALF

Filter’s K dimension is a multiple of 32

Filter size RxS is either 1x1 or 3x3

If filter size RxS is 1x1, convolution descriptor’s padA needs to be (0,0) and filterStrideA needs to be (1,1).

Filter’s alignment is CUDNN_PTR_16B_ALIGNED

CUDNN_PARAM_YDESC

CUDNN_PARAM_YDATA_PLACEHOLDER

Tensor is 4 dimensional

Datatype is CUDNN_DATA_HALF

Layout is NHWC fully packed

Alignment is CUDNN_PTR_16B_ALIGNED

CUDNN_PARAM_YSTATS_DESC

CUDNN_PARAM_YSUM_PLACEHOLDER

CUDNN_PARAM_YSQSUM_PLACEHOLDER

If the generate statistics operation is not turned into a NOP:

Tensor is 4 dimensional with shape 1xKx1x1

Datatype is CUDNN_DATA_FLOAT

Layout is fully packed

Alignment is CUDNN_PTR_16B_ALIGNED

Table 3. CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD
For the attribute CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD in cudnnFusedOp_t
Attribute Expected Descriptor Type Passed in, in the Setter Description Default Value After Creation
X_XDESC In the setter, the *param should be xDesc, a pointer to a previously initialized cudnnTensorDescriptor_t. Tensor descriptor describing the size, layout and datatype of the x (input) tensor NULL
X_XDATA_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether xData pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *. CUDNN_PTR_NULL
X_BN_MODE In the setter, the *param should be a pointer to a previously initialized cudnnBatchNormMode_t*. Describes the mode of operation for the scale, bias and the statistics.

As of cuDNN 7.6.0, only CUDNN_BATCHNORM_SPATIAL and CUDNN_BATCHNORM_SPATIAL_PERSISTENT are supported, meaning, scale, bias, and statistics are all per-channel.

CUDNN_BATCHNORM_PER_ACTIVATION
X_BN_EQSCALEBIAS_DESC In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t. Tensor descriptor describing the size, layout and datatype of the batchNorm equivalent scale and bias tensors. The shapes must match the mode specified in CUDNN_PARAM_BN_MODE. If set to NULL, both scale and bias operation will become a NOP. NULL
X_BN_EQSCALE_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether batchnorm equivalent scale pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *.

If set to CUDNN_PTR_NULL, then the scale operation becomes a NOP.

CUDNN_PTR_NULL
X_BN_EQBIAS_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether batchnorm equivalent bias pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *.

If set to CUDNN_PTR_NULL, then the bias operation becomes a NOP.

CUDNN_PTR_NULL
X_ACTIVATION_DESC In the setter, the *param should be a pointer to a previously initialized cudnnActivationDescriptor_t*. Describes the activation operation.

As of 7.6.0, only activation mode of CUDNN_ACTIVATION_RELU and CUDNN_ACTIVATION_IDENTITY is supported. If set to NULL or if the activation mode is set to CUDNN_ACTIVATION_IDENTITY, then the activation in the op sequence becomes a NOP.

NULL
X_CONV_DESC In the setter, the *param should be a pointer to a previously initialized cudnnConvolutionDescriptor_t*. Describes the convolution operation. NULL
X_DWDESC In the setter, the *param should be a pointer to a previously initialized cudnnFilterDescriptor_t*. Filter descriptor describing the size, layout and datatype of the dw (filter gradient output) tensor. NULL
X_DWDATA_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether dw (filter gradient output) tensor pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *. CUDNN_PTR_NULL
X_DYDESC In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t*. Tensor descriptor describing the size, layout and datatype of the dy (gradient input) tensor. NULL
X_DYDATA_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether dy (gradient input) tensor pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *. CUDNN_PTR_NULL
Note:
  • If the corresponding pointer placeholder in ConstParamPack is set to CUDNN_PTR_NULL, then the device pointer in the VariantParamPack needs to be NULL as well.
  • If the corresponding pointer placeholder in ConstParamPack is set to CUDNN_PTR_ELEM_ALIGNED or CUDNN_PTR_16B_ALIGNED, then the device pointer in the VariantParamPack may not be NULL and needs to be at least element-aligned or 16 bytes-aligned, respectively.
As of cuDNN 7.6.0, if the conditions in Table 4 are met, then the fully fused fast path will be triggered. Otherwise a slower partially fused path will be triggered.
Table 4. Conditions for Fully Fused Fast Path (Backward)
Parameter Condition
Device compute capability Needs to be one of 7.0, 7.2 or 7.5.
CUDNN_PARAM_XDESC

CUDNN_PARAM_XDATA_PLACEHOLDER

Tensor is 4 dimensional

Datatype is CUDNN_DATA_HALF

Layout is NHWC fully packed

Alignment is CUDNN_PTR_16B_ALIGNED

Tensor’s C dimension is a multiple of 8.

CUDNN_PARAM_BN_EQSCALEBIAS_DESC

CUDNN_PARAM_BN_EQSCALE_PLACEHOLDER

CUDNN_PARAM_BN_EQBIAS_PLACEHOLDER

If either one of scale and bias operation is not turned into a NOP:

Tensor is 4 dimensional with shape 1xCx1x1

Datatype is CUDNN_DATA_HALF

Layout is fully packed

Alignment is CUDNN_PTR_16B_ALIGNED

CUDNN_PARAM_CONV_DESC

CUDNN_PARAM_DWDESC

CUDNN_PARAM_DWDATA_PLACEHOLDER

Convolution descriptor’s mode needs to be CUDNN_CROSS_CORRELATION.

Convolution descriptor’s dataType needs to be CUDNN_DATA_FLOAT.

Convolution descriptor’s dilationA is (1,1)

Convolution descriptor’s group count needs to be 1.

Convolution descriptor’s mathType needs to be CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION.

Filter gradient is in NHWC layout

Filter gradient’s data type is CUDNN_DATA_HALF

Filter gradient’s K dimension is a multiple of 32.

Filter gradient size RxS is either 1x1 or 3x3

If filter gradient size RxS is 1x1, convolution descriptor’s padA needs to be (0,0) and filterStrideA needs to be (1,1).

Filter gradient’s alignment is CUDNN_PTR_16B_ALIGNED

CUDNN_PARAM_DYDESC

CUDNN_PARAM_DYDATA_PLACEHOLDER

Tensor is 4 dimensional

Datatype is CUDNN_DATA_HALF

Layout is NHWC fully packed

Alignment is CUDNN_PTR_16B_ALIGNED

Table 5. CUDNN_FUSED_BN_FINALIZE_STATISTICS_TRAINING
For the attribute CUDNN_FUSED_BN_FINALIZE_STATISTICS_TRAINING in cudnnFusedOp_t
Attribute Expected Descriptor Type Passed in, in the Setter Description Default Value After Creation
X_BN_MODE In the setter, the *param should be a pointer to a previously initialized cudnnBatchNormMode_t*. Describes the mode of operation for the scale, bias and the statistics.

As of cuDNN 7.6.0, only CUDNN_BATCHNORM_SPATIAL and CUDNN_BATCHNORM_SPATIAL_PERSISTENT are supported, meaning, scale, bias and statistics are all per-channel.

CUDNN_BATCHNORM_PER_ACTIVATION
X_YSTATS_DESC In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t*. Tensor descriptor describing the size, layout and datatype of the sum of y and sum of y square tensors. The shapes need to match the mode specified in CUDNN_PARAM_BN_MODE. NULL
X_YSUM_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether sum of y pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *. CUDNN_PTR_NULL
X_YSQSUM_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether sum of y square pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *. CUDNN_PTR_NULL
X_BN_SCALEBIAS_MEANVAR_DESC In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t. A common tensor descriptor describing the size, layout and datatype of the batchNorm trained scale, bias and statistics tensors. The shapes need to match the mode specified in CUDNN_PARAM_BN_MODE (similar to the bnScaleBiasMeanVarDesc field in the cudnnBatchNormalization* API). NULL
X_BN_SCALE_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether the batchNorm trained scale pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *.

If the output of BN_EQSCALE is not needed, then this is not needed and may be NULL.

CUDNN_PTR_NULL
X_BN_BIAS_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether the batchNorm trained bias pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *.

If neither output of BN_EQSCALE or BN_EQBIAS is needed, then this is not needed and may be NULL.

CUDNN_PTR_NULL
X_BN_SAVED_MEAN_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether the batchNorm saved mean pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *.

If set to CUDNN_PTR_NULL, then the computation for this output becomes a NOP.

CUDNN_PTR_NULL
X_BN_SAVED_INVSTD_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether the batchNorm saved inverse standard deviation pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *.

If set to CUDNN_PTR_NULL, then the computation for this output becomes a NOP.

CUDNN_PTR_NULL
X_BN_RUNNING_MEAN_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether the batchNorm running mean pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *.

If set to CUDNN_PTR_NULL, then the computation for this output becomes a NOP.

CUDNN_PTR_NULL
X_BN_RUNNING_VAR_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether the batchNorm running variance pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *.

If set to CUDNN_PTR_NULL, then the computation for this output becomes a NOP.

CUDNN_PTR_NULL
X_BN_EQSCALEBIAS_DESC In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t*. Tensor descriptor describing the size, layout and datatype of the batchNorm equivalent scale and bias tensors. The shapes need to match the mode specified in CUDNN_PARAM_BN_MODE.

If neither output of BN_EQSCALE or BN_EQBIAS is needed, then this is not needed and may be NULL.

NULL
X_BN_EQSCALE_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether batchnorm equivalent scale pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *.

If set to CUDNN_PTR_NULL, then the computation for this output becomes a NOP.

CUDNN_PTR_NULL
X_BN_EQBIAS_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether batchnorm equivalent bias pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *.

If set to CUDNN_PTR_NULL, then the computation for this output becomes a NOP.

CUDNN_PTR_NULL
Table 6. CUDNN_FUSED_BN_FINALIZE_STATISTICS_INFERENCE
For the attribute CUDNN_FUSED_BN_FINALIZE_STATISTICS_INFERENCE in cudnnFusedOp_t
Attribute Expected Descriptor Type Passed in, in the Setter Description Default Value After Creation
X_BN_MODE In the setter, the *param should be a pointer to a previously initialized cudnnBatchNormMode_t*. Describes the mode of operation for the scale, bias and the statistics.

As of cuDNN 7.6.0, only CUDNN_BATCHNORM_SPATIAL and CUDNN_BATCHNORM_SPATIAL_PERSISTENT are supported, meaning, scale, bias and statistics are all per-channel.

CUDNN_BATCHNORM_PER_ACTIVATION
X_BN_SCALEBIAS_MEANVAR_DESC In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t. A common tensor descriptor describing the size, layout and datatype of the batchNorm trained scale, bias and statistics tensors. The shapes need to match the mode specified in CUDNN_PARAM_BN_MODE (similar to the bnScaleBiasMeanVarDesc field in the cudnnBatchNormalization* API). NULL
X_BN_SCALE_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether the batchNorm trained scale pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *. CUDNN_PTR_NULL
X_BN_BIAS_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether the batchNorm trained bias pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *. CUDNN_PTR_NULL
X_BN_RUNNING_MEAN_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether the batchNorm running mean pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *. CUDNN_PTR_NULL
X_BN_RUNNING_VAR_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether the batchNorm running variance pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *. CUDNN_PTR_NULL
X_BN_EQSCALEBIAS_DESC In the setter, the *param should be a pointer to a previously initialized cudnnTensorDescriptor_t*. Tensor descriptor describing the size, layout and datatype of the batchNorm equivalent scale and bias tensors. The shapes need to match the mode specified in CUDNN_PARAM_BN_MODE. NULL
X_BN_EQSCALE_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether batchnorm equivalent scale pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *.

If set to CUDNN_PTR_NULL, then the computation for this output becomes a NOP.

CUDNN_PTR_NULL
X_BN_EQBIAS_PLACEHOLDER In the setter, the *param should be a pointer to a previously initialized X_PointerPlaceHolder_t*. Describes whether batchnorm equivalent bias pointer in the VariantParamPack will be NULL, or if not, user promised pointer alignment *.

If set to CUDNN_PTR_NULL, then the computation for this output becomes a NOP.

CUDNN_PTR_NULL

2.29. cudnnFusedOpsConstParamPack_t

cudnnFusedOpsConstParamPack_t is a pointer to an opaque structure holding the description of the cudnnFusedOps constant parameters. Use the function cudnnCreateFusedOpsConstParamPack() ​to create one instance of this structure, and the function cudnnDestroyFusedOpsConstParamPack() to destroy a previously-created descriptor.

2.30. cudnnFusedOpsPlan_t

cudnnFusedOpsPlan_t is a pointer to an opaque structure holding the description of the cudnnFusedOpsPlan. This descriptor contains the plan information, including the problem type and size, which kernels should be run, and the internal workspace partition. Use the function cudnnCreateFusedOpsPlan() to create one instance of this structure, and the function cudnnDestroyFusedOpsPlan() to destroy a previously-created descriptor.

2.31. cudnnFusedOpsPointerPlaceHolder_t

cudnnFusedOpsPointerPlaceHolder_t is an enumerated type used to select the alignment type of the cudnnFusedOps descriptor pointer.
Member Description
CUDNN_PTR_NULL = 0 Indicates that the pointer to the tensor in the variantPack will be NULL.
CUDNN_PTR_ELEM_ALIGNED = 1 Indicates that the pointer to the tensor in the variantPack will not be NULL, and will have element alignment.
CUDNN_PTR_16B_ALIGNED = 2 Indicates that the pointer to the tensor in the variantPack will not be NULL, and will have 16 byte alignment.

2.32. cudnnFusedOpsVariantParamLabel_t

The cudnnFusedOpsVariantParamLabel_t is an enumerated type that is used to set the buffer pointers. These buffer pointers can be changed in each iteration.

typedef enum {
	CUDNN_PTR_XDATA                              = 0,
	CUDNN_PTR_BN_EQSCALE                         = 1,
	CUDNN_PTR_BN_EQBIAS                          = 2,
	CUDNN_PTR_WDATA                              = 3,
	CUDNN_PTR_DWDATA                             = 4,
	CUDNN_PTR_YDATA                              = 5,
	CUDNN_PTR_DYDATA                             = 6,
	CUDNN_PTR_YSUM                               = 7,
	CUDNN_PTR_YSQSUM                             = 8,
	CUDNN_PTR_WORKSPACE                          = 9,
	CUDNN_PTR_BN_SCALE                           = 10,
	CUDNN_PTR_BN_BIAS                            = 11,
	CUDNN_PTR_BN_SAVED_MEAN                      = 12,
	CUDNN_PTR_BN_SAVED_INVSTD                    = 13,
	CUDNN_PTR_BN_RUNNING_MEAN                    = 14,
	CUDNN_PTR_BN_RUNNING_VAR                     = 15,
	CUDNN_PTR_ZDATA                              = 16,
	CUDNN_PTR_BN_Z_EQSCALE                       = 17,
	CUDNN_PTR_BN_Z_EQBIAS                        = 18,
	CUDNN_PTR_ACTIVATION_BITMASK                 = 19,
	CUDNN_PTR_DXDATA                             = 20,
	CUDNN_PTR_DZDATA                             = 21,
	CUDNN_PTR_BN_DSCALE                          = 22,
	CUDNN_PTR_BN_DBIAS                           = 23,
	CUDNN_SCALAR_SIZE_T_WORKSPACE_SIZE_IN_BYTES  = 100,
	CUDNN_SCALAR_INT64_T_BN_ACCUMULATION_COUNT   = 101,
	CUDNN_SCALAR_DOUBLE_BN_EXP_AVG_FACTOR        = 102,
	CUDNN_SCALAR_DOUBLE_BN_EPSILON               = 103,
	} cudnnFusedOpsVariantParamLabel_t;
Table 7. Legend For Tables in This Section
Short-form used Stands for
Setter cudnnSetFusedOpsVariantParamPackAttribute()
Getter cudnnGetFusedOpsVariantParamPackAttribute()
X_ prefix in the Attribute key column Stands for CUDNN_PTR_ or CUDNN_SCALAR_ in the enumerator name.
Table 8. CUDNN_FUSED_SCALE_BIAS_ACTIVATION_CONV_BNSTATS
For the attribute CUDNN_FUSED_SCALE_BIAS_ACTIVATION_CONV_BNSTATS in cudnnFusedOp_t
Attribute key Expected Descriptor Type Passed in, in the Setter I/O Type Description Default Value
X_XDATA void * input Pointer to x (input) tensor on device, need to agree with previously set CUDNN_PARAM_XDATA_PLACEHOLDER attribute *. NULL
X_BN_EQSCALE void * input Pointer to batchnorm equivalent scale tensor on device, need to agree with previously set CUDNN_PARAM_BN_EQSCALE_PLACEHOLDER attribute *. NULL
X_BN_EQBIAS void * input Pointer to batchnorm equivalent bias tensor on device, need to agree with previously set CUDNN_PARAM_BN_EQBIAS_PLACEHOLDER attribute *. NULL
X_WDATA void * input Pointer to w (filter) tensor on device, need to agree with previously set CUDNN_PARAM_WDATA_PLACEHOLDER attribute *. NULL
X_YDATA void * output Pointer to y (output) tensor on device, need to agree with previously set CUDNN_PARAM_YDATA_PLACEHOLDER attribute *. NULL
X_YSUM void * output Pointer to sum of y tensor on device, need to agree with previously set CUDNN_PARAM_YSUM_PLACEHOLDER attribute *. NULL
X_YSQSUM void * output Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_YSQSUM_PLACEHOLDER attribute *. NULL
X_WORKSPACE void * input Pointer to user allocated workspace on device. Can be NULL if the workspace size requested is 0. NULL
X_SIZE_T_WORKSPACE_SIZE_IN_BYTES size_t * input Pointer to a size_t value in host memory describing the user allocated workspace size in bytes. The amount needs to be equal or larger than the amount requested in cudnnMakeFusedOpsPlan. 0
Note:
  • If the corresponding pointer placeholder in ConstParamPack is set to CUDNN_PTR_NULL, then the device pointer in the VariantParamPack needs to be NULL as well
  • If the corresponding pointer placeholder in ConstParamPack is set to CUDNN_PTR_ELEM_ALIGNED or CUDNN_PTR_16B_ALIGNED, then the device pointer in the VariantParamPack may not be NULL and needs to be at least element-aligned or 16 bytes-aligned, respectively.
Table 9. CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD
For the attribute CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD in cudnnFusedOp_t
Attribute key Expected Descriptor Type Passed in, in the Setter I/O Type Description Default Value
X_XDATA void * input Pointer to x (input) tensor on device, need to agree with previously set CUDNN_PARAM_XDATA_PLACEHOLDER attribute *. NULL
X_BN_EQSCALE void * input Pointer to batchnorm equivalent scale tensor on device, need to agree with previously set CUDNN_PARAM_BN_EQSCALE_PLACEHOLDER attribute *. NULL
X_BN_EQBIAS void * input Pointer to batchnorm equivalent bias tensor on device, need to agree with previously set CUDNN_PARAM_BN_EQBIAS_PLACEHOLDER attribute *. NULL
X_DWDATA void * output Pointer to dw (filter gradient output) tensor on device, need to agree with previously set CUDNN_PARAM_WDATA_PLACEHOLDER attribute *. NULL
X_DYDATA void * input Pointer to dy (gradient input) tensor on device, need to agree with previously set CUDNN_PARAM_YDATA_PLACEHOLDER attribute *. NULL
X_WORKSPACE void * input Pointer to user allocated workspace on device. Can be NULL if the workspace size requested is 0. NULL
X_SIZE_T_WORKSPACE_SIZE_IN_BYTES size_t * input Pointer to a size_t value in host memory describing the user allocated workspace size in bytes. The amount needs to be equal or larger than the amount requested in cudnnMakeFusedOpsPlan. 0
Note:
  • If the corresponding pointer placeholder in ConstParamPack is set to CUDNN_PTR_NULL, then the device pointer in the VariantParamPack needs to be NULL as well.
  • If the corresponding pointer placeholder in ConstParamPack is set to CUDNN_PTR_ELEM_ALIGNED or CUDNN_PTR_16B_ALIGNED, then the device pointer in the VariantParamPack may not be NULL and needs to be at least element-aligned or 16 bytes-aligned, respectively.
Table 10. CUDNN_FUSED_BN_FINALIZE_STATISTICS_TRAINING
For the attribute CUDNN_FUSED_BN_FINALIZE_STATISTICS_TRAINING in cudnnFusedOp_t
Attribute key Expected Descriptor Type Passed in, in the Setter I/O Type Description Default Value
X_YSUM void * input Pointer to sum of y tensor on device, need to agree with previously set CUDNN_PARAM_YSUM_PLACEHOLDER attribute *. NULL
X_YSQSUM void * input Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_YSQSUM_PLACEHOLDER attribute *. NULL
X_BN_SCALE void * input Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_SCALE_PLACEHOLDER attribute *. NULL
X_BN_BIAS void * input Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_BIAS_PLACEHOLDER attribute *. NULL
X_BN_SAVED_MEAN void * output Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_SAVED_MEAN_PLACEHOLDER attribute *. NULL
X_BN_SAVED_INVSTD void * output Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_SAVED_INVSTD_PLACEHOLDER attribute *. NULL
X_BN_RUNNING_MEAN void * input/output Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_RUNNING_MEAN_PLACEHOLDER attribute *. NULL
X_BN_RUNNING_VAR void * input/output Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_RUNNING_VAR_PLACEHOLDER attribute *. NULL
X_BN_EQSCALE void * output Pointer to batchnorm equivalent scale tensor on device, need to agree with previously set CUDNN_PARAM_BN_EQSCALE_PLACEHOLDER attribute *. NULL
X_BN_EQBIAS void * output Pointer to batchnorm equivalent bias tensor on device, need to agree with previously set CUDNN_PARAM_BN_EQBIAS_PLACEHOLDER attribute *. NULL
X_INT64_T_BN_ACCUMULATION_COUNT int64_t * input Pointer to a scalar value in int64_t on host memory.

This value should describe the number of tensor elements accumulated in the sum of y and sum of y square tensors.

For example, in the single GPU use case, if the mode is CUDNN_BATCHNORM_SPATIAL or CUDNN_BATCHNORM_SPATIAL_PERSISTENT, the value should be equal to N*H*W of the tensor from which the statistics are calculated.

In multi-GPU use case, if all-reduce has been performed on the sum of y and sum of y square tensors, this value should be the sum of the single GPU accumulation count on each of the GPUs.

0
X_DOUBLE_BN_EXP_AVG_FACTOR double * input Pointer to a scalar value in double on host memory.

Factor used in the moving average computation. See exponentialAverageFactor in cudnnBatchNormalization* APIs.

0.0
X_DOUBLE_BN_EPSILON double * input Pointer to a scalar value in double on host memory.

A conditioning constant used in the batch normalization formula. Its value should be equal to or greater than the value defined for CUDNN_BN_MIN_EPSILON in cudnn.h.

See exponentialAverageFactor in cudnnBatchNormalization* APIs.

0.0
X_WORKSPACE void * input Pointer to user allocated workspace on device. Can be NULL if the workspace size requested is 0. NULL
X_SIZE_T_WORKSPACE_SIZE_IN_BYTES size_t * input Pointer to a size_t value in host memory describing the user allocated workspace size in bytes. The amount need to be equal or larger than the amount requested in cudnnMakeFusedOpsPlan. 0
Note:
  • If the corresponding pointer placeholder in ConstParamPack is set to CUDNN_PTR_NULL, then the device pointer in the VariantParamPack need to be NULL as well.
  • If the corresponding pointer placeholder in ConstParamPack is set to CUDNN_PTR_ELEM_ALIGNED or CUDNN_PTR_16B_ALIGNED, then the device pointer in the VariantParamPack may not be NULL and needs to be at least element-aligned or 16 bytes-aligned, respectively.
Table 11. CUDNN_FUSED_BN_FINALIZE_STATISTICS_INFERENCE
For the attribute CUDNN_FUSED_BN_FINALIZE_STATISTICS_INFERENCE in cudnnFusedOp_t
Attribute key Expected Descriptor Type Passed in, in the Setter I/O Type Description Default Value
X_BN_SCALE void * input Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_SCALE_PLACEHOLDER attribute *. NULL
X_BN_BIAS void * input Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_BIAS_PLACEHOLDER attribute *. NULL
X_BN_RUNNING_MEAN void * input/output Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_RUNNING_MEAN_PLACEHOLDER attribute *. NULL
X_BN_RUNNING_VAR void * input/output Pointer to sum of y square tensor on device, need to agree with previously set CUDNN_PARAM_BN_RUNNING_VAR_PLACEHOLDER attribute *. NULL
X_BN_EQSCALE void * output Pointer to batchnorm equivalent scale tensor on device, need to agree with previously set CUDNN_PARAM_BN_EQSCALE_PLACEHOLDER attribute *. NULL
X_BN_EQBIAS void * output Pointer to batchnorm equivalent bias tensor on device, need to agree with previously set CUDNN_PARAM_BN_EQBIAS_PLACEHOLDER attribute *. NULL
X_DOUBLE_BN_EPSILON double * input Pointer to a scalar value in double on host memory.

A conditioning constant used in the batch normalization formula. Its value should be equal to or greater than the value defined for CUDNN_BN_MIN_EPSILON in cudnn.h.

See exponentialAverageFactor in cudnnBatchNormalization* APIs.

0.0
X_WORKSPACE void * input Pointer to user allocated workspace on device. Can be NULL if the workspace size requested is 0. NULL
X_SIZE_T_WORKSPACE_SIZE_IN_BYTES size_t * input Pointer to a size_t value in host memory describing the user allocated workspace size in bytes. The amount need to be equal or larger than the amount requested in cudnnMakeFusedOpsPlan. 0
Note:
  • If the corresponding pointer placeholder in ConstParamPack is set to CUDNN_PTR_NULL, then the device pointer in the VariantParamPack needs to be NULL as well.
  • If the corresponding pointer placeholder in ConstParamPack is set to CUDNN_PTR_ELEM_ALIGNED or CUDNN_PTR_16B_ALIGNED, then the device pointer in the VariantParamPack may not be NULL and needs to be at least element-aligned or 16 bytes-aligned, respectively.

2.33. cudnnFusedOpsVariantParamPack_t

cudnnFusedOpsVariantParamPack_t is a pointer to an opaque structure holding the description of the cudnnFusedOps variant parameters. Use the function cudnnCreateFusedOpsVariantParamPack() to create one instance of this structure, and the function cudnnDestroyFusedOpsVariantParamPack() to destroy a previously-created descriptor.

2.34. cudnnHandle_t

cudnnHandle_t is a pointer to an opaque structure holding the cuDNN library context. The cuDNN library context must be created using cudnnCreate() and the returned handle must be passed to all subsequent library function calls. The context should be destroyed at the end using cudnnDestroy(). The context is associated with only one GPU device, the current device at the time of the call to cudnnCreate(). However, multiple contexts can be created on the same GPU device.

2.35. cudnnIndicesType_t

cudnnIndicesType_t is an enumerated type used to indicate the data type for the indices to be computed by the cudnnReduceTensor() routine. This enumerated type is used as a field for the cudnnReduceTensorDescriptor_t descriptor.

Values

CUDNN_32BIT_INDICES

Compute unsigned int indices.

CUDNN_64BIT_INDICES

Compute unsigned long indices.

CUDNN_16BIT_INDICES

Compute unsigned short indices.

CUDNN_8BIT_INDICES

Compute unsigned char indices.

2.36. cudnnLossNormalizationMode_t

cudnnLossNormalizationMode_t is an enumerated type that controls the input normalization mode for a loss function. This type can be used with cudnnSetCTCLossDescriptorEx().

Values

CUDNN_LOSS_NORMALIZATION_NONE

The input probs of cudnnCTCLoss() function is expected to be the normalized probability, and the output gradients is the gradient of loss with respect to the unnormalized probability.

CUDNN_LOSS_NORMALIZATION_SOFTMAX

The input probs of cudnnCTCLoss() function is expected to be the unnormalized activation from the previous layer, and the output gradients is the gradient with respect to the activation. Internally the probability is computed by softmax normalization.

2.37. cudnnLRNMode_t

cudnnLRNMode_t is an enumerated type used to specify the mode of operation in cudnnLRNCrossChannelForward() and cudnnLRNCrossChannelBackward().

Values

CUDNN_LRN_CROSS_CHANNEL_DIM1

LRN computation is performed across tensor's dimension dimA[1].

2.38. cudnnMathType_t

cudnnMathType_t is an enumerated type used to indicate if the use of Tensor Core operations is permitted a given library routine.

Values

CUDNN_DEFAULT_MATH

Tensor Core operations are not used.

CUDNN_TENSOR_OP_MATH

The use of Tensor Core operations is permitted.

CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION

Enables the use of FP32 tensors for both input and output.

2.39. cudnnMultiHeadAttnWeightKind_t

cudnnMultiHeadAttnWeightKind_t is an enumerated type that specifies a group of weights or biases in the cudnnGetMultiHeadAttnWeights() function.

Values

CUDNN_MH_ATTN_Q_WEIGHTS

Selects the input projection weights for queries.

CUDNN_MH_ATTN_K_WEIGHTS

Selects the input projection weights for keys.

CUDNN_MH_ATTN_V_WEIGHTS

Selects the input projection weights for values.

CUDNN_MH_ATTN_O_WEIGHTS

Selects the output projection weights.

CUDNN_MH_ATTN_Q_BIASES

Selects the input projection biases for queries.

CUDNN_MH_ATTN_K_BIASES

Selects the input projection biases for keys.

CUDNN_MH_ATTN_V_BIASES

Selects the input projection biases for values.

CUDNN_MH_ATTN_O_BIASES

Selects the output projection biases.

2.40. cudnnNanPropagation_t

cudnnNanPropagation_t is an enumerated type used to indicate if a given routine should propagate Nan numbers. This enumerated type is used as a field for the cudnnActivationDescriptor_t descriptor and cudnnPoolingDescriptor_t descriptor.

Values

CUDNN_NOT_PROPAGATE_NAN

Nan numbers are not propagated.

CUDNN_PROPAGATE_NAN

Nan numbers are propagated.

2.41. cudnnOpTensorDescriptor_t

cudnnOpTensorDescriptor_t is a pointer to an opaque structure holding the description of a Tensor Core operation, used as a parameter to cudnnOpTensor(). cudnnCreateOpTensorDescriptor() is used to create one instance, and cudnnSetOpTensorDescriptor() must be used to initialize this instance.

2.42. cudnnOpTensorOp_t

cudnnOpTensorOp_t is an enumerated type used to indicate the Tensor Core operation to be used by the cudnnOpTensor() routine. This enumerated type is used as a field for the cudnnOpTensorDescriptor_t descriptor.

Values

CUDNN_OP_TENSOR_ADD

The operation to be performed is addition.

CUDNN_OP_TENSOR_MUL

The operation to be performed is multiplication.

CUDNN_OP_TENSOR_MIN

The operation to be performed is a minimum comparison.

CUDNN_OP_TENSOR_MAX

The operation to be performed is a maximum comparison.

CUDNN_OP_TENSOR_SQRT

The operation to be performed is square root, performed on only the A tensor.

CUDNN_OP_TENSOR_NOT

The operation to be performed is negation, performed on only the A tensor.

2.43. cudnnPersistentRNNPlan_t

cudnnPersistentRNNPlan_t is a pointer to an opaque structure holding a plan to execute a dynamic persistent RNN. cudnnCreatePersistentRNNPlan() is used to create and initialize one instance.

2.44. cudnnPoolingDescriptor_t

cudnnPoolingDescriptor_t is a pointer to an opaque structure holding the description of a pooling operation. cudnnCreatePoolingDescriptor() is used to create one instance, and cudnnSetPoolingNdDescriptor() or cudnnSetPooling2dDescriptor() must be used to initialize this instance.

2.45. cudnnPoolingMode_t

cudnnPoolingMode_t is an enumerated type passed to cudnnSetPooling2dDescriptor() to select the pooling method to be used by cudnnPoolingForward() and cudnnPoolingBackward().

Values

CUDNN_POOLING_MAX

The maximum value inside the pooling window is used.

CUDNN_POOLING_AVERAGE_COUNT_INCLUDE_PADDING

Values inside the pooling window are averaged. The number of elements used to calculate the average includes spatial locations falling in the padding region.

CUDNN_POOLING_AVERAGE_COUNT_EXCLUDE_PADDING

Values inside the pooling window are averaged. The number of elements used to calculate the average excludes spatial locations falling in the padding region.

CUDNN_POOLING_MAX_DETERMINISTIC

The maximum value inside the pooling window is used. The algorithm used is deterministic.

2.46. cudnnReduceTensorDescriptor_t

cudnnReduceTensorDescriptor_t is a pointer to an opaque structure holding the description of a tensor reduction operation, used as a parameter to cudnnReduceTensor(). cudnnCreateReduceTensorDescriptor() is used to create one instance, and cudnnSetReduceTensorDescriptor() must be used to initialize this instance.

2.47. cudnnReduceTensorIndices_t

cudnnReduceTensorIndices_t is an enumerated type used to indicate whether indices are to be computed by the cudnnReduceTensor() routine. This enumerated type is used as a field for the cudnnReduceTensorDescriptor_t descriptor.

Values

CUDNN_REDUCE_TENSOR_NO_INDICES

Do not compute indices.

CUDNN_REDUCE_TENSOR_FLATTENED_INDICES

Compute indices. The resulting indices are relative, and flattened.

2.48. cudnnReduceTensorOp_t

cudnnReduceTensorOp_t is an enumerated type used to indicate the Tensor Core operation to be used by the cudnnReduceTensor() routine. This enumerated type is used as a field for the cudnnReduceTensorDescriptor_t descriptor.

Values

CUDNN_REDUCE_TENSOR_ADD

The operation to be performed is addition.

CUDNN_REDUCE_TENSOR_MUL

The operation to be performed is multiplication.

CUDNN_REDUCE_TENSOR_MIN

The operation to be performed is a minimum comparison.

CUDNN_REDUCE_TENSOR_MAX

The operation to be performed is a maximum comparison.

CUDNN_REDUCE_TENSOR_AMAX

The operation to be performed is a maximum comparison of absolute values.

CUDNN_REDUCE_TENSOR_AVG

The operation to be performed is averaging.

CUDNN_REDUCE_TENSOR_NORM1

The operation to be performed is addition of absolute values.

CUDNN_REDUCE_TENSOR_NORM2

The operation to be performed is a square root of sum of squares.

CUDNN_REDUCE_TENSOR_MUL_NO_ZEROS

The operation to be performed is multiplication, not including elements of value zero.

2.49. cudnnReorderType_t

typedef enum {
	CUDNN_DEFAULT_REORDER = 0,
	CUDNN_NO_REORDER      = 1,
	} cudnnReorderType_t;		

cudnnReorderType_t is an enumerated type to set the convolution reordering type. The reordering type can be set by cudnnSetConvolutionReorderType() and its status can be read by cudnnGetConvolutionReorderType().

2.50. cudnnRNNAlgo_t

cudnnRNNAlgo_t is an enumerated type used to specify the algorithm used in the cudnnRNNForwardInference(), cudnnRNNForwardTraining(), cudnnRNNBackwardData() and cudnnRNNBackwardWeights() routines.

Values

CUDNN_RNN_ALGO_STANDARD
Each RNN layer is executed as a sequence of operations. This algorithm is expected to have robust performance across a wide range of network parameters.
CUDNN_RNN_ALGO_PERSIST_STATIC

The recurrent parts of the network are executed using a persistent kernel approach. This method is expected to be fast when the first dimension of the input tensor is small (meaning, a small minibatch).

CUDNN_RNN_ALGO_PERSIST_STATIC is only supported on devices with compute capability >= 6.0.

CUDNN_RNN_ALGO_PERSIST_DYNAMIC

The recurrent parts of the network are executed using a persistent kernel approach. This method is expected to be fast when the first dimension of the input tensor is small (meaning, a small minibatch). When using CUDNN_RNN_ALGO_PERSIST_DYNAMIC persistent kernels are prepared at runtime and are able to optimize using the specific parameters of the network and active GPU. As such, when using CUDNN_RNN_ALGO_PERSIST_DYNAMIC a one-time plan preparation stage must be executed. These plans can then be reused in repeated calls with the same model parameters.

The limits on the maximum number of hidden units supported when using CUDNN_RNN_ALGO_PERSIST_DYNAMIC are significantly higher than the limits when using CUDNN_RNN_ALGO_PERSIST_STATIC, however throughput is likely to significantly reduce when exceeding the maximums supported by CUDNN_RNN_ALGO_PERSIST_STATIC. In this regime, this method will still outperform CUDNN_RNN_ALGO_STANDARD for some cases.

CUDNN_RNN_ALGO_PERSIST_DYNAMIC is only supported on devices with compute capability >= 6.0 on Linux machines.

2.51. cudnnRNNBiasMode_t

cudnnRNNBiasMode_t is an enumerated type used to specify the number of bias vectors for RNN functions. See the description of the cudnnRNNMode_t enumerated type for the equations for each cell type based on the bias mode.

Values

CUDNN_RNN_NO_BIAS

Applies RNN cell formulas that do not use biases.

CUDNN_RNN_SINGLE_INP_BIAS

Applies RNN cell formulas that use one input bias vector in the input GEMM.

CUDNN_RNN_DOUBLE_BIAS

Applies RNN cell formulas that use two bias vectors.

CUDNN_RNN_SINGLE_REC_BIAS

Applies RNN cell formulas that use one recurrent bias vector in the recurrent GEMM.

2.52. cudnnRNNClipMode_t

cudnnRNNClipMode_t is an enumerated type used to select the LSTM cell clipping mode. It is used with cudnnRNNSetClip(), cudnnRNNGetClip() functions, and internally within LSTM cells.

Values

CUDNN_RNN_CLIP_NONE

Disables LSTM cell clipping.

CUDNN_RNN_CLIP_MINMAX

Enables LSTM cell clipping.

2.53. cudnnRNNDataDescriptor_t

cudnnRNNDataDescriptor_t is a pointer to an opaque structure holding the description of an RNN data set. The function cudnnCreateRNNDataDescriptor() is used to create one instance, and cudnnSetRNNDataDescriptor() must be used to initialize this instance.

2.54. cudnnRNNDataLayout_t

cudnnRNNDataLayout_t is an enumerated type used to select the RNN data layout. It is used used in the API calls cudnnGetRNNDataDescriptor() and cudnnSetRNNDataDescriptor().

Values

CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED

Data layout is padded, with outer stride from one time-step to the next.

CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_PACKED

The sequence length is sorted and packed as in basic RNN API.

CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED

Data layout is padded, with outer stride from one batch to the next.

2.55. cudnnRNNDescriptor_t

cudnnRNNDescriptor_t is a pointer to an opaque structure holding the description of an RNN operation. cudnnCreateRNNDescriptor() is used to create one instance, and cudnnSetRNNDescriptor() must be used to initialize this instance.

2.56. cudnnRNNInputMode_t

cudnnRNNInputMode_t is an enumerated type used to specify the behavior of the first layer in the cudnnRNNForwardInference(), cudnnRNNForwardTraining(), cudnnRNNBackwardData() and cudnnRNNBackwardWeights() routines.

Values

CUDNN_LINEAR_INPUT

A biased matrix multiplication is performed at the input of the first recurrent layer.

CUDNN_SKIP_INPUT

No operation is performed at the input of the first recurrent layer. If CUDNN_SKIP_INPUT is used the leading dimension of the input tensor must be equal to the hidden state size of the network.

2.57. cudnnRNNMode_t

cudnnRNNMode_t is an enumerated type used to specify the type of network used in the cudnnRNNForwardInference, cudnnRNNForwardTraining, cudnnRNNBackwardData and cudnnRNNBackwardWeights routines.

Values

CUDNN_RNN_RELU

A single-gate recurrent neural network with a ReLU activation function.

In the forward pass, the output ht for a given iteration can be computed from the recurrent input ht-1 and the previous layer input xt, given the matrices W, R and the bias vectors, where ReLU(x) = max(x, 0).

If cudnnRNNBiasMode_t biasMode in rnnDesc is CUDNN_RNN_DOUBLE_BIAS (default mode), then the following equation with biases bW and bR applies:
ht = ReLU(Wixt + Riht-1 + bWi + bRi)
If cudnnRNNBiasMode_t biasMode in rnnDesc is CUDNN_RNN_SINGLE_INP_BIAS or CUDNN_RNN_SINGLE_REC_BIAS, then the following equation with bias b applies:
ht = ReLU(Wixt + Riht-1 + bi)
If cudnnRNNBiasMode_t biasMode in rnnDesc is CUDNN_RNN_NO_BIAS, then the following equation applies:
ht = ReLU(Wixt + Riht-1)
CUDNN_RNN_TANH

A single-gate recurrent neural network with a tanh activation function.

In the forward pass, the output ht for a given iteration can be computed from the recurrent input ht-1 and the previous layer input xt, given the matrices W, R and the bias vectors, and where tanh is the hyperbolic tangent function.

If cudnnRNNBiasMode_t biasMode in rnnDesc is CUDNN_RNN_DOUBLE_BIAS (default mode), then the following equation with biasesbW and bR applies:
ht = tanh(Wixt + Riht-1 + bWi + bRi)
If cudnnRNNBiasMode_t biasMode in rnnDesc is CUDNN_RNN_SINGLE_INP_BIAS or CUDNN_RNN_SINGLE_REC_BIAS, then the following equation with bias b applies:
ht = tanh(Wixt + Riht-1 + bi)
If cudnnRNNBiasMode_t biasMode in rnnDesc is CUDNN_RNN_NO_BIAS, then the following equation applies:
ht = tanh(Wixt + Riht-1)
CUDNN_LSTM

A four-gate Long Short-Term Memory network with no peephole connections.

In the forward pass, the output ht and cell output ct for a given iteration can be computed from the recurrent input ht-1, the cell input ct-1 and the previous layer input xt, given the matrices W, R and the bias vectors.

In addition, the following applies:
  • σ is the sigmoid operator such that: σ(x) = 1 / (1 + e-x),
  • represents a point-wise multiplication,
  • tanh is the hyperbolic tangent function, and
  • it, ft, ot, c't represent the input, forget, output and new gates respectively.
If cudnnRNNBiasMode_t biasMode in rnnDesc is CUDNN_RNN_DOUBLE_BIAS (default mode), then the following equations with biases bW and bR apply:
it = σ(Wixt + Riht-1 + bWi + bRi)
ft = σ(Wfxt + Rfht-1 + bWf + bRf)
ot = σ(Woxt + Roht-1 + bWo + bRo)
c't = tanh(Wcxt + Rcht-1 + bWc + bRc)
ct = ft ◦ ct-1 + it ◦ c't
ht = ot ◦ tanh(ct)
If cudnnRNNBiasMode_t biasMode in rnnDesc is CUDNN_RNN_SINGLE_INP_BIAS or CUDNN_RNN_SINGLE_REC_BIAS, then the following equations with bias b apply:
it = σ(Wixt + Riht-1 + bi)
ft = σ(Wfxt + Rfht-1 + bf)
ot = σ(Woxt + Roht-1 + bo)
c't = tanh(Wcxt + Rcht-1 + bc)
ct = ft ◦ ct-1 + it ◦ c't
ht = ot ◦ tanh(ct)
If cudnnRNNBiasMode_t biasMode in rnnDesc is CUDNN_RNN_NO_BIAS, then the following equations apply:
it = σ(Wixt + Riht-1)
ft = σ(Wfxt + Rfht-1)
ot = σ(Woxt + Roht-1)
c't = tanh(Wcxt + Rcht-1)
ct = ft ◦ ct-1 + it ◦ c't
ht = ot◦tanh(ct)
CUDNN_GRU

A three-gate network consisting of Gated Recurrent Units.

In the forward pass, the output ht for a given iteration can be computed from the recurrent input ht-1 and the previous layer input xt given matrices W, R and the bias vectors.

In addition, the following applies:
  • σ is the sigmoid operator such that: σ(x) = 1 / (1 + e-x),
  • represents a point-wise multiplication,
  • tanh is the hyperbolic tangent function, and
  • it, rt, h't represent the input, reset, new gates respectively.
If cudnnRNNBiasMode_t biasMode in rnnDesc is CUDNN_RNN_DOUBLE_BIAS (default mode), then the following equations with biases bW and bR apply:
it = σ(Wixt + Riht-1 + bWi + bRu)
rt = σ(Wrxt + Rrht-1 + bWr + bRr)
h't = tanh(Whxt + rt◦(Rhht-1 + bRh) + bWh)
ht = (1 - it) ◦ h't + it ◦ ht-1
If cudnnRNNBiasMode_t biasMode in rnnDesc is CUDNN_RNN_SINGLE_INP_BIAS, then the following equations with bias b apply:
it = σ(Wixt + Riht-1 + bi)
rt = σ(Wrxt + Rrht-1 + br)
h't = tanh(Whxt + rt ◦ (Rhht-1) + bWh)
ht = (1 - it) ◦ h't + it ◦ ht-1
If cudnnRNNBiasMode_t biasMode in rnnDesc is CUDNN_RNN_SINGLE_REC_BIAS, then the following equations with bias b apply:
it = σ(Wixt + Riht-1 + bi)
rt = σ(Wrxt + Rrht-1 + br)
h't = tanh(Whxt + rt ◦ (Rhht-1 + bRh))
ht = (1 - it) ◦ h't + it ◦ ht-1
If cudnnRNNBiasMode_t biasMode in rnnDesc is CUDNN_RNN_NO_BIAS, then the following equations apply:
it = σ(Wixt + Riht-1)
rt = σ(Wrxt + Rrht-1)
h't = tanh(Whxt + rt ◦ (Rhht-1))
ht = (1 - it) ◦ h't + it ◦ ht-1

2.58. cudnnRNNPaddingMode_t

cudnnRNNPaddingMode_t is an enumerated type used to enable or disable the padded input/output.

Values

CUDNN_RNN_PADDED_IO_DISABLED

Disables the padded input/output.

CUDNN_RNN_PADDED_IO_ENABLED

Enables the padded input/output.

2.59. cudnnSamplerType_t

cudnnSamplerType_t is an enumerated type passed to cudnnSetSpatialTransformerNdDescriptor() to select the sampler type to be used by cudnnSpatialTfSamplerForward() and cudnnSpatialTfSamplerBackward().

Values

CUDNN_SAMPLER_BILINEAR
Selects the bilinear sampler.

2.60. cudnnSeqDataAxis_t

cudnnSeqDataAxis_t is an enumerated type that indexes active dimensions in the dimA[] argument that is passed to the cudnnSetSeqDataDescriptor() function to configure the sequence data descriptor of type cudnnSeqDataDescriptor_t.

cudnnSeqDataAxis_t constants are also used in the axis[] argument of the cudnnSetSeqDataDescriptor() call to define the layout of the sequence data buffer in memory.

See cudnnSetSeqDataDescriptor() for a detailed description on how to use the cudnnSeqDataAxis_t enumerated type.

The CUDNN_SEQDATA_DIM_COUNT macro defines the number of constants in the cudnnSeqDataAxis_t enumerated type. This value is currently set to 4.

Values

CUDNN_SEQDATA_TIME_DIM

Identifies the TIME (sequence length) dimension or specifies the TIME in the data layout.

CUDNN_SEQDATA_BATCH_DIM

Identifies the BATCH dimension or specifies the BATCH in the data layout.

CUDNN_SEQDATA_BEAM_DIM

Identifies the BEAM dimension or specifies the BEAM in the data layout.

CUDNN_SEQDATA_VECT_DIM

Identifies the VECT (vector) dimension or specifies the VECT in the data layout.

2.61. cudnnSeqDataDescriptor_t

cudnnSeqDataDescriptor_t is a pointer to an opaque structure holding parameters of the sequence data container or buffer. The sequence data container is used to store fixed size vectors defined by the VECT dimension. Vectors are arranged in additional three dimensions: TIME, BATCH and BEAM.

The TIME dimension is used to bundle vectors into sequences of vectors. The actual sequences can be shorter than the TIME dimension, therefore, additional information is needed about each sequence length and how unused (padding) vectors should be saved.

It is assumed that the sequence data container is fully packed. The TIME, BATCH and BEAM dimensions can be in any order when vectors are traversed in the ascending order of addresses. Six data layouts (permutation of TIME, BATCH and BEAM) are possible.

The cudnnSeqDataDescriptor_t object holds the following parameters:
  • data type used by vectors
  • TIME, BATCH, BEAM and VECT dimensions
  • data layout
  • the length of each sequence along the TIME dimension
  • an optional value to be copied to output padding vectors

Use the cudnnCreateSeqDataDescriptor() function to create one instance of the sequence data descriptor object and cudnnDestroySeqDataDescriptor() to delete a previously created descriptor. Use the cudnnSetSeqDataDescriptor() function to configure the descriptor.

This descriptor is used by multi-head attention API functions.

2.62. cudnnSoftmaxAlgorithm_t

cudnnSoftmaxAlgorithm_t is used to select an implementation of the softmax function used in cudnnSoftmaxForward() and cudnnSoftmaxBackward().

Values

CUDNN_SOFTMAX_FAST

This implementation applies the straightforward softmax operation.

CUDNN_SOFTMAX_ACCURATE

This implementation scales each point of the softmax input domain by its maximum value to avoid potential floating point overflows in the softmax evaluation.

CUDNN_SOFTMAX_LOG

This entry performs the log softmax operation, avoiding overflows by scaling each point in the input domain as in CUDNN_SOFTMAX_ACCURATE.

2.63. cudnnSoftmaxMode_t

cudnnSoftmaxMode_t is used to select over which data the cudnnSoftmaxForward() and cudnnSoftmaxBackward() are computing their results.

Values

CUDNN_SOFTMAX_MODE_INSTANCE

The softmax operation is computed per image (N) across the dimensions C,H,W.

CUDNN_SOFTMAX_MODE_CHANNEL

The softmax operation is computed per spatial location (H,W) per image (N) across the dimension C.

2.64. cudnnSpatialTransformerDescriptor_t

cudnnSpatialTransformerDescriptor_t is a pointer to an opaque structure holding the description of a spatial transformation operation. cudnnCreateSpatialTransformerDescriptor() is used to create one instance, cudnnSetSpatialTransformerNdDescriptor() is used to initialize this instance, and cudnnDestroySpatialTransformerDescriptor() is used to destroy this instance.

2.65. cudnnStatus_t

cudnnStatus_t is an enumerated type used for function status returns. All cuDNN library functions return their status, which can be one of the following values:

Values

CUDNN_STATUS_SUCCESS

The operation completed successfully.

CUDNN_STATUS_NOT_INITIALIZED

The cuDNN library was not initialized properly. This error is usually returned when a call to cudnnCreate() fails or when cudnnCreate() has not been called prior to calling another cuDNN routine. In the former case, it is usually due to an error in the CUDA Runtime API called by cudnnCreate() or by an error in the hardware setup.

CUDNN_STATUS_ALLOC_FAILED

Resource allocation failed inside the cuDNN library. This is usually caused by an internal cudaMalloc() failure.

To correct: prior to the function call, deallocate previously allocated memory as much as possible.

CUDNN_STATUS_BAD_PARAM

An incorrect value or parameter was passed to the function.

To correct: ensure that all the parameters being passed have valid values.

CUDNN_STATUS_ARCH_MISMATCH

The function requires a feature absent from the current GPU device. Note that cuDNN only supports devices with compute capabilities greater than or equal to 3.0.

To correct: compile and run the application on a device with appropriate compute capability.

CUDNN_STATUS_MAPPING_ERROR

An access to GPU memory space failed, which is usually caused by a failure to bind a texture.

To correct: prior to the function call, unbind any previously bound textures.

Otherwise, this may indicate an internal error/bug in the library.

CUDNN_STATUS_EXECUTION_FAILED

The GPU program failed to execute. This is usually caused by a failure to launch some cuDNN kernel on the GPU, which can occur for multiple reasons.

To correct: check that the hardware, an appropriate version of the driver, and the cuDNN library are correctly installed.

Otherwise, this may indicate an internal error/bug in the library.

CUDNN_STATUS_INTERNAL_ERROR

An internal cuDNN operation failed.

CUDNN_STATUS_NOT_SUPPORTED

The functionality requested is not presently supported by cuDNN.

CUDNN_STATUS_LICENSE_ERROR

The functionality requested requires some license and an error was detected when trying to check the current licensing. This error can happen if the license is not present or is expired or if the environment variable NVIDIA_LICENSE_FILE is not set properly.

CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING

Runtime library required by RNN calls (libcuda.so or nvcuda.dll) cannot be found in predefined search paths.

CUDNN_STATUS_RUNTIME_IN_PROGRESS

Some tasks in the user stream are not completed.

CUDNN_STATUS_RUNTIME_FP_OVERFLOW

Numerical overflow occurred during the GPU kernel execution.

2.66. cudnnTensorDescriptor_t

cudnnCreateTensorDescriptor_t is a pointer to an opaque structure holding the description of a generic n-D dataset. cudnnCreateTensorDescriptor() is used to create one instance, and one of the routines cudnnSetTensorNdDescriptor(), cudnnSetTensor4dDescriptor() or cudnnSetTensor4dDescriptorEx() must be used to initialize this instance.

2.67. cudnnTensorFormat_t

cudnnTensorFormat_t is an enumerated type used by cudnnSetTensor4dDescriptor() to create a tensor with a pre-defined layout. For a detailed explanation of how these tensors are arranged in memory, see the Data Layout Formats section in the cuDNN Developer Guide.

Values

CUDNN_TENSOR_NCHW

This tensor format specifies that the data is laid out in the following order: batch size, feature maps, rows, columns. The strides are implicitly defined in such a way that the data are contiguous in memory with no padding between images, feature maps, rows, and columns; the columns are the inner dimension and the images are the outermost dimension.

CUDNN_TENSOR_NHWC

This tensor format specifies that the data is laid out in the following order: batch size, rows, columns, feature maps. The strides are implicitly defined in such a way that the data are contiguous in memory with no padding between images, rows, columns, and feature maps; the feature maps are the inner dimension and the images are the outermost dimension.

CUDNN_TENSOR_NCHW_VECT_C

This tensor format specifies that the data is laid out in the following order: batch size, feature maps, rows, columns. However, each element of the tensor is a vector of multiple feature maps. The length of the vector is carried by the data type of the tensor. The strides are implicitly defined in such a way that the data are contiguous in memory with no padding between images, feature maps, rows, and columns; the columns are the inner dimension and the images are the outermost dimension. This format is only supported with tensor data types CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32, and CUDNN_DATA_UINT8x4.

The CUDNN_TENSOR_NCHW_VECT_C can also be interpreted in the following way: The NCHW INT8x32 format is really N x (C/32) x H x W x 32 (32 Cs for every W), just as the NCHW INT8x4 format is N x (C/4) x H x W x 4 (4 Cs for every W). Hence, the VECT_C name - each W is a vector (4 or 32) of Cs.

2.68. cudnnTensorTransformDescriptor_t

cudnnTensorTransformDescriptor_t is an opaque structure containing the description of the tensor transform. Use the cudnnCreateTensorTransformDescriptor() function to create an instance of this descriptor, and cudnnDestroyTensorTransformDescriptor() function to destroy a previously created instance.

2.69. cudnnWgradMode_t

cudnnWgradMode_t is an enumerated type that selects how buffers holding gradients of the loss function, computed with respect to trainable parameters, are updated. Currently, this type is used by the cudnnGetMultiHeadAttnWeights() function only.

Values

CUDNN_WGRAD_MODE_ADD

A weight gradient component corresponding to a new batch of inputs is added to previously evaluated weight gradients. Before using this mode, the buffer holding weight gradients should be initialized to zero. Alternatively, the first API call outputting to an uninitialized buffer should use the CUDNN_WGRAD_MODE_SET option.

CUDNN_WGRAD_MODE_SET
A weight gradient component, corresponding to a new batch of inputs, overwrites previously stored weight gradients in the output buffer.

3. cuDNN API Reference

This chapter describes the API of all the routines of the cuDNN library.

3.1. cudnnActivationBackward()

cudnnStatus_t cudnnActivationBackward(
    cudnnHandle_t                    handle,
    cudnnActivationDescriptor_t      activationDesc,
    const void                      *alpha,
    const cudnnTensorDescriptor_t    yDesc,
    const void                      *y,
    const cudnnTensorDescriptor_t    dyDesc,
    const void                      *dy,
    const cudnnTensorDescriptor_t    xDesc,
    const void                      *x,
    const void                      *beta,
    const cudnnTensorDescriptor_t    dxDesc,
    void                            *dx)
This routine computes the gradient of a neuron activation function.
Note:
  • In-place operation is allowed for this routine; meaning dy and dx pointers may be equal. However, this requires the corresponding tensor descriptors to be identical (particularly, the strides of the input and output must match for an in-place operation to be allowed).
  • All tensor formats are supported for 4 and 5 dimensions, however, the best performance is obtained when the strides of yDesc and xDesc are equal and HW-packed. For more than 5 dimensions the tensors must have their spatial dimensions packed.

Parameters

handle

Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.

activationDesc

Input. Activation descriptor. See cudnnActivationDescriptor_t.

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue

For more information, see Scaling Parameters in the cuDNN Developer Guide.

yDesc

Input. Handle to the previously initialized input tensor descriptor. For more information, see cudnnTensorDescriptor_t.

y

Input. Data pointer to GPU memory associated with the tensor descriptor yDesc.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

dy

Input. Data pointer to GPU memory associated with the tensor descriptor dyDesc.

xDesc

Input. Handle to the previously initialized output tensor descriptor.

x

Input. Data pointer to GPU memory associated with the output tensor descriptor xDesc.

dxDesc

Input. Handle to the previously initialized output differential tensor descriptor.

dx

Output. Data pointer to GPU memory associated with the output tensor descriptor dxDesc.

Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:

  • The strides nStride, cStride, hStride, wStride of the input differential tensor and output differential tensor differ and in-place operation is used.
CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration. See the following for some examples of non-supported configurations:

  • The dimensions n, c, h, w of the input tensor and output tensor differ.
  • The datatype of the input tensor and output tensor differs.
  • The strides nStride, cStride, hStride, wStride of the input tensor and the input differential tensor differ.
  • The strides nStride, cStride, hStride, wStride of the output tensor and the output differential tensor differ.
CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

3.2. cudnnActivationForward()

cudnnStatus_t cudnnActivationForward(
    cudnnHandle_t handle,
    cudnnActivationDescriptor_t     activationDesc,
    const void                     *alpha,
    const cudnnTensorDescriptor_t   xDesc,
    const void                     *x,
    const void                     *beta,
    const cudnnTensorDescriptor_t   yDesc,
    void                           *y)
This routine applies a specified neuron activation function element-wise over each input value.
Note:
  • In-place operation is allowed for this routine; meaning, xData and yData pointers may be equal. However, this requires xDesc and yDesc descriptors to be identical (particularly, the strides of the input and output must match for an in-place operation to be allowed).
  • All tensor formats are supported for 4 and 5 dimensions, however, the best performance is obtained when the strides of xDesc and yDesc are equal and HW-packed. For more than 5 dimensions the tensors must have their spatial dimensions packed.

Parameters

handle

Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.

activationDesc

Input. Activation descriptor. For more information, see cudnnActivationDescriptor_t.

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue

For more information, see Scaling Parameters in the cuDNN Developer Guide.

xDesc

Input. Handle to the previously initialized input tensor descriptor. For more information, see cudnnTensorDescriptor_t.

x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc.

yDesc

Input. Handle to the previously initialized output tensor descriptor.

y

Output. Data pointer to GPU memory associated with the output tensor descriptor yDesc.

Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:

  • The parameter mode has an invalid enumerant value.
  • The dimensions n, c, h, w of the input tensor and output tensor differ.
  • The datatype of the input tensor and output tensor differs.
  • The strides nStride, cStride, hStride, wStride of the input tensor and output tensor differ and in-place operation is used (meaning, x and y pointers are equal).
CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

3.3. cudnnAddTensor()

cudnnStatus_t cudnnAddTensor(
    cudnnHandle_t                     handle,
    const void                       *alpha,
    const cudnnTensorDescriptor_t     aDesc,
    const void                       *A,
    const void                       *beta,
    const cudnnTensorDescriptor_t     cDesc,
    void                             *C)

This function adds the scaled values of a bias tensor to another tensor. Each dimension of the bias tensor A must match the corresponding dimension of the destination tensor C or must be equal to 1. In the latter case, the same value from the bias tensor for those dimensions will be used to blend into the C tensor.

Note: Up to dimension 5, all tensor formats are supported. Beyond those dimensions, this routine is not supported

Parameters

handle

Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the source value with the prior value in the destination tensor as follows:
dstValue = alpha[0]*srcValue + beta[0]*priorDstValue

For more information, see Scaling Parameters in the cuDNN Developer Guide.

aDesc

Input. Handle to a previously initialized tensor descriptor. For more information, see cudnnTensorDescriptor_t.

A

Input. Pointer to data of the tensor described by the aDesc descriptor.

cDesc

Input. Handle to a previously initialized tensor descriptor.

C

Input/Output. Pointer to data of the tensor described by the cDesc descriptor.

Returns

CUDNN_STATUS_SUCCESS

The function executed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

The dimensions of the bias tensor refer to an amount of data that is incompatible with the output tensor dimensions or the dataType of the two tensor descriptors are different.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

3.4. cudnnBatchNormalizationBackward()

cudnnStatus_t cudnnBatchNormalizationBackward(
      cudnnHandle_t                    handle,
      cudnnBatchNormMode_t             mode,
      const void                      *alphaDataDiff,
      const void                      *betaDataDiff,
      const void                      *alphaParamDiff,
      const void                      *betaParamDiff,
      const cudnnTensorDescriptor_t    xDesc,
      const void                      *x,
      const cudnnTensorDescriptor_t    dyDesc,
      const void                      *dy,
      const cudnnTensorDescriptor_t    dxDesc,
      void                            *dx,
      const cudnnTensorDescriptor_t    bnScaleBiasDiffDesc,
      const void                      *bnScale,
      void                            *resultBnScaleDiff,
      void                            *resultBnBiasDiff,
      double                           epsilon,
      const void                      *savedMean,
      const void                      *savedInvVariance)
This function performs the backward batch normalization layer computation. This layer is based on the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe, C. Szegedy, 2015. .
Note:
  • Only 4D and 5D tensors are supported.
  • The epsilon value has to be the same during training, backpropagation, and inference.
  • Higher performance can be obtained when HW-packed tensors are used for all of x, dy, dx.

For more information, see cudnnDeriveBNTensorDescriptor() for the secondary tensor descriptor generation for the parameters used in this function.

Parameters

handle

Input. Handle to a previously created cuDNN library descriptor. For more information, see cudnnHandle_t.

mode

Input. Mode of operation (spatial or per-activation). For more information, see cudnnBatchNormMode_t.

*alphaDataDiff, *betaDataDiff
Inputs. Pointers to scaling factors (in host memory) used to blend the gradient output dx with a prior value in the destination tensor as follows:
dstValue = alphaDataDiff[0]*resultValue + betaDataDiff[0]*priorDstValue

For more information, see Scaling Parameters in the cuDNN Developer Guide.

*alphaParamDiff, *betaParamDiff
Inputs. Pointers to scaling factors (in host memory) used to blend the gradient outputs resultBnScaleDiff and resultBnBiasDiff with prior values in the destination tensor as follows:
dstValue = alphaParamDiff[0]*resultValue + betaParamDiff[0]*priorDstValue

For more information, see Scaling Parameters.

xDesc, dxDesc, dyDesc

Inputs. Handles to the previously initialized tensor descriptors.

*x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc, for the layer’s x data.

*dy

Inputs. Data pointer to GPU memory associated with the tensor descriptor dyDesc, for the backpropagated differential dy input.

*dx

Inputs. Data pointer to GPU memory associated with the tensor descriptor dxDesc, for the resulting differential output with respect to x.

bnScaleBiasDiffDesc
Input. Shared tensor descriptor for the following five tensors: bnScale, resultBnScaleDiff, resultBnBiasDiff, savedMean, savedInvVariance. The dimensions for this tensor descriptor are dependent on normalization mode. For more information, see cudnnDeriveBNTensorDescriptor().
Note: The data type of this tensor descriptor must be float for FP16 and FP32 input tensors, and double for FP64 input tensors.
*bnScale
Input. Pointer in the device memory for the batch normalization scale parameter (in the original paper the quantity scale is referred to as gamma).
Note: The bnBias parameter is not needed for this layer's computation.
resultBnScaleDiff, resultBnBiasDiff
Outputs. Pointers in device memory for the resulting scale and bias differentials computed by this routine. Note that these scale and bias gradients are weight gradients specific to this batch normalization operation, and by definition are not backpropagated.
epsilon

Input. Epsilon value used in batch normalization formula. Its value should be equal to or greater than the value defined for CUDNN_BN_MIN_EPSILON in cudnn.h. The same epsilon value should be used in forward and backward functions.

*savedMean, *savedInvVariance
Inputs. Optional cache parameters containing saved intermediate results that were computed during the forward pass. For this to work correctly, the layer's x and bnScale data have to remain unchanged until this backward function is called.
Note: Both these parameters can be NULL but only at the same time. It is recommended to use this cache since the memory overhead is relatively small.

Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:

  • Any of the pointers alpha, beta, x, dy, dx, bnScale, resultBnScaleDiff, resultBnBiasDiff is NULL.
  • The number of xDesc or yDesc or dxDesc tensor descriptor dimensions is not within the range of [4,5] (only 4D and 5D tensors are supported).
  • bnScaleBiasDiffDesc dimensions are not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for per-activation mode.
  • Exactly one of savedMean, savedInvVariance pointers is NULL.
  • epsilon value is less than CUDNN_BN_MIN_EPSILON.
  • Dimensions or data types mismatch for any pair of xDesc, dyDesc, dxDesc.

3.5. cudnnBatchNormalizationBackwardEx()

cudnnStatus_t cudnnBatchNormalizationBackwardEx (
    cudnnHandle_t                       handle,
    cudnnBatchNormMode_t                mode,
    cudnnBatchNormOps_t                 bnOps,
    const void                          *alphaDataDiff,
    const void                          *betaDataDiff,
    const void                          *alphaParamDiff,
    const void                          *betaParamDiff,
    const cudnnTensorDescriptor_t       xDesc,
    const void                          *xData,
    const cudnnTensorDescriptor_t       yDesc,
    const void                          *yData,
    const cudnnTensorDescriptor_t       dyDesc,
    const void                          *dyData,
    const cudnnTensorDescriptor_t       dzDesc,
    void                                *dzData,
    const cudnnTensorDescriptor_t       dxDesc,
    void                                *dxData,
    const cudnnTensorDescriptor_t       dBnScaleBiasDesc,
    const void                          *bnScaleData,
    const void                          *bnBiasData,
    void                                *dBnScaleData,
    void                                *dBnBiasData,
    double                              epsilon,
    const void                          *savedMean,
    const void                          *savedInvVariance,
    const cudnnActivationDescriptor_t   activationDesc,
    void                                *workspace,
    size_t                              workSpaceSizeInBytes
    void                                *reserveSpace
    size_t                              reserveSpaceSizeInBytes);
This function is an extension of the cudnnBatchNormalizationBackward() for performing the backward batch normalization layer computation with a fast NHWC semi-persistent kernel. This API will trigger the new semi-persistent NHWC kernel when the following conditions are true:

If workspace is NULL and workSpaceSizeInBytes of zero is passed in, this API will function exactly like the non-extended function cudnnBatchNormalizationBackward.

This workspace is not required to be clean. Moreover, the workspace does not have to remain unchanged between the forward and backward pass, as it is not used for passing any information.

This extended function can accept a *workspace pointer to the GPU workspace, and workSpaceSizeInBytes, the size of the workspace, from the user.

The bnOps input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by element-wise addition and then activation.

Only 4D and 5D tensors are supported. The epsilon value has to be the same during the training, the backpropagation, and the inference.

When the tensor layout is NCHW, higher performance can be obtained when HW-packed tensors are used for x, dy, dx.

Parameters

handle

Input. Handle to a previously created cuDNN library descriptor. For more information, see cudnnHandle_t.

mode

Input. Mode of operation (spatial or per-activation). For more information, see cudnnBatchNormMode_t.

bnOps
Input. Mode of operation. Currently CUDNN_BATCHNORM_OPS_BN_ACTIVATION and CUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION are only supported in the NHWC layout. For more information, see cudnnBatchNormOps_t. This input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by element-wise addition and then activation.
*alphaDataDiff, *betaDataDiff
Inputs. Pointers to scaling factors (in host memory) used to blend the gradient output dx with a prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, see Scaling Parameters in the cuDNN Developer Guide.

*alphaParamDiff, *betaParamDiff
Inputs. Pointers to scaling factors (in host memory) used to blend the gradient outputs dBnScaleData and dBnBiasData with prior values in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, see Scaling Parameters in the cuDNN Developer Guide.

xDesc, *x, yDesc, *yData, dyDesc, *dyData

Inputs. Tensor descriptors and pointers in the device memory for the layer's x data, backpropagated gradient input dy, the original forward output y data. yDesc and yData are not needed if bnOps is set to CUDNN_BATCHNORM_OPS_BN, user may pass NULL. For more information, see cudnnTensorDescriptor_t.

dzDesc, *dzData, dxDesc, *dxData
Outputs. Tensor descriptors and pointers in the device memory for the computed gradient output dz, and dx. dzDesc and *dzData are not needed when bnOps is CUDNN_BATCHNORM_OPS_BN or CUDNN_BATCHNORM_OPS_BN_ACTIVATION, user may pass NULL. For more information, see cudnnTensorDescriptor_t.
dBnScaleBiasDesc

Input. Shared tensor descriptor for the following six tensors: bnScaleData, bnBiasData, dBnScaleData, dBnBiasData, savedMean, and savedInvVariance. For more information, see cudnnDeriveBNTensorDescriptor().

The dimensions for this tensor descriptor are dependent on normalization mode.
Note: Note: The data type of this tensor descriptor must be float for FP16 and FP32 input tensors and double for FP64 input tensors.
For more information, see cudnnTensorDescriptor_t.
*bnScaleData

Input. Pointer in the device memory for the batch normalization scale parameter (in the original paper the quantity scale is referred to as gamma).

*bnBiasData
Input. Pointers in the device memory for the batch normalization bias parameter (in the original paper bias is referred to as beta). This parameter is used only when activation should be performed.
*dBnScaleData, dBnBiasData
Inputs. Pointers in the device memory for the gradients of bnScaleData and bnBiasData, respectively.
epsilon

Input. Epsilon value used in batch normalization formula. Its value should be equal to or greater than the value defined for CUDNN_BN_MIN_EPSILON in cudnn.h. The same epsilon value should be used in forward and backward functions.

*savedMean, *savedInvVariance
Inputs. Optional cache parameters containing saved intermediate results computed during the forward pass. For this to work correctly, the layer's x and bnScaleData, bnBiasData data has to remain unchanged until this backward function is called. Note that both these parameters can be NULL but only at the same time. It is recommended to use this cache since the memory overhead is relatively small.
activationDesc
Input. Descriptor for the activation operation. When the bnOps input is set to either CUDNN_BATCHNORM_OPS_BN_ACTIVATION or CUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION then this activation is used, otherwise user may pass NULL.
workspace
Input. Pointer to the GPU workspace. If workspace is NULL and workSpaceSizeInBytes of zero is passed in, then this API will function exactly like the non-extended function cudnnBatchNormalizationBackward().
workSpaceSizeInBytes
Input. The size of the workspace. It must be large enough to trigger the fast NHWC semi-persistent kernel by this function.
*reserveSpace
Input. Pointer to the GPU workspace for the reserveSpace.
reserveSpaceSizeInBytes
Input. The size of the reserveSpace. It must be equal or larger than the amount required by cudnnGetBatchNormalizationTrainingExReserveSpaceSize().

Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:

  • Any of the pointers alphaDataDiff, betaDataDiff, alphaParamDiff, betaParamDiff, x, dy, dx, bnScale, resultBnScaleDiff, resultBnBiasDiff is NULL.
  • The number of xDesc or yDesc or dxDesc tensor descriptor dimensions is not within the range of [4,5] (only 4D and 5D tensors are supported).
  • dBnScaleBiasDesc dimensions not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for per-activation mode.
  • Exactly one of savedMean, savedInvVariance pointers is NULL.
  • epsilon value is less than CUDNN_BN_MIN_EPSILON.
  • Dimensions or data types mismatch for any pair of xDesc, dyDesc, dxDesc.

3.6. cudnnBatchNormalizationForwardInference()

 cudnnStatus_t cudnnBatchNormalizationForwardInference(
      cudnnHandle_t                    handle,
      cudnnBatchNormMode_t             mode,
      const void                      *alpha,
      const void                      *beta,
      const cudnnTensorDescriptor_t    xDesc,
      const void                      *x,
      const cudnnTensorDescriptor_t    yDesc,
      void                            *y,
      const cudnnTensorDescriptor_t    bnScaleBiasMeanVarDesc,
      const void                      *bnScale,
      const void                      *bnBias,
      const void                      *estimatedMean,
      const void                      *estimatedVariance,
      double                           epsilon)
This function performs the forward batch normalization layer computation for the inference phase. This layer is based on the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe, C. Szegedy, 2015.
Note:
  • Only 4D and 5D tensors are supported.
  • The input transformation performed by this function is defined as:
    y = beta*y + alpha *[bnBias + (bnScale * (x-estimatedMean)/sqrt(epsilon + estimatedVariance)]
  • The epsilon value has to be the same during training, backpropagation and inference.
  • For the training phase, use cudnnBatchNormalizationForwardTraining().
  • Higher performance can be obtained when HW-packed tensors are used for all of x and dx.
For more information, see cudnnDeriveBNTensorDescriptor() for the secondary tensor descriptor generation for the parameters used in this function.

Parameters

handle

Input. Handle to a previously created cuDNN library descriptor. For more information, see cudnnHandle_t.

mode

Input. Mode of operation (spatial or per-activation). For more information, see cudnnBatchNormMode_t.

alpha, beta
Inputs. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, see Scaling Parameters in the cuDNN Developer Guide.

xDesc, yDesc

Input. Handles to the previously initialized tensor descriptors.

*x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc, for the layer’s x input data.

*y

Input. Data pointer to GPU memory associated with the tensor descriptor yDesc, for the youtput of the batch normalization layer.

bnScaleBiasMeanVarDesc, bnScale, bnBias

Inputs. Tensor descriptors and pointers in device memory for the batch normalization scale and bias parameters (in the original paper bias is referred to as beta and scale as gamma).

estimatedMean, estimatedVariance

Inputs. Mean and variance tensors (these have the same descriptor as the bias and scale). The resultRunningMean and resultRunningVariance, accumulated during the training phase from the cudnnBatchNormalizationForwardTraining() call, should be passed as inputs here.

epsilon

Input. Epsilon value used in the batch normalization formula. Its value should be equal to or greater than the value defined for CUDNN_BN_MIN_EPSILON in cudnn.h.

Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:

  • One of the pointers alpha, beta, x, y, bnScale, bnBias, estimatedMean, estimatedInvVariance is NULL.
  • The number of xDesc or yDesc tensor descriptor dimensions is not within the range of [4,5] (only 4D and 5D tensors are supported.)
  • bnScaleBiasMeanVarDesc dimensions are not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for per-activation mode.
  • epsilon value is less than CUDNN_BN_MIN_EPSILON.
  • Dimensions or data types mismatch for xDesc, yDesc.

3.7. cudnnBatchNormalizationForwardTraining()

 cudnnStatus_t cudnnBatchNormalizationForwardTraining(
      cudnnHandle_t                    handle,
      cudnnBatchNormMode_t             mode,
      const void                      *alpha,
      const void                      *beta,
      const cudnnTensorDescriptor_t    xDesc,
      const void                      *x,
      const cudnnTensorDescriptor_t    yDesc,
      void                            *y,
      const cudnnTensorDescriptor_t    bnScaleBiasMeanVarDesc,
      const void                      *bnScale,
      const void                      *bnBias,
      double                           exponentialAverageFactor,
      void                            *resultRunningMean,
      void                            *resultRunningVariance,
      double                           epsilon,
      void                            *resultSaveMean,
      void                            *resultSaveInvVariance)
This function performs the forward batch normalization layer computation for the training phase. This layer is based on the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe, C. Szegedy, 2015.
Note:
  • Only 4D and 5D tensors are supported.
  • The epsilon value has to be the same during training, backpropagation, and inference.
  • For the inference phase, use cudnnBatchNormalizationForwardInference.
  • Higher performance can be obtained when HW-packed tensors are used for both x and y.
See cudnnDeriveBNTensorDescriptor() for the secondary tensor descriptor generation for the parameters used in this function.

Parameters

handle

Handle to a previously created cuDNN library descriptor. For more information, see cudnnHandle_t.

mode

Mode of operation (spatial or per-activation). For more information, see cudnnBatchNormMode_t.

alpha, beta
Inputs. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, see Scaling Parameters in the cuDNN Developer Guide.

xDesc, yDesc

Tensor descriptors and pointers in device memory for the layer's x and y data. For more information, see cudnnTensorDescriptor_t.

*x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc, for the layer’s x input data.

*y

Input. Data pointer to GPU memory associated with the tensor descriptor yDesc, for the youtput of the batch normalization layer.

bnScaleBiasMeanVarDesc

Shared tensor descriptor desc for the secondary tensor that was derived by cudnnDeriveBNTensorDescriptor(). The dimensions for this tensor descriptor are dependent on the normalization mode.

bnScale, bnBias
Inputs. Pointers in device memory for the batch normalization scale and bias parameters (in the original paper bias is referred to as beta and scale as gamma). Note that bnBias parameter can replace the previous layer's bias parameter for improved efficiency.
exponentialAverageFactor
Input. Factor used in the moving average computation as follows:
runningMean = runningMean*(1-factor) + newMean*factor
Use a factor=1/(1+n) at N-th call to the function to get Cumulative Moving Average (CMA) behavior such that:
CMA[n] = (x[1]+...+x[n])/n

This is proved below:

Writing
CMA[n+1] = (n*CMA[n]+x[n+1])/(n+1)
= ((n+1)*CMA[n]-CMA[n])/(n+1) + x[n+1]/(n+1)
= CMA[n]*(1-1/(n+1))+x[n+1]*1/(n+1)
= CMA[n]*(1-factor) + x(n+1)*factor
resultRunningMean, resultRunningVariance

Inputs/Outputs. Running mean and variance tensors (these have the same descriptor as the bias and scale). Both of these pointers can be NULL but only at the same time. The value stored in resultRunningVariance (or passed as an input in inference mode) is the sample variance and is the moving average of variance[x] where the variance is computed either over batch or spatial+batch dimensions depending on the mode. If these pointers are not NULL, the tensors should be initialized to some reasonable values or to 0.

epsilon

Input. Epsilon value used in the batch normalization formula. Its value should be equal to or greater than the value defined for CUDNN_BN_MIN_EPSILON in cudnn.h. The same epsilon value should be used in forward and backward functions.

resultSaveMean, resultSaveInvVariance
Outputs. Optional cache to save intermediate results computed during the forward pass. These buffers can be used to speed up the backward pass when supplied to the cudnnBatchNormalizationBackward() function. The intermediate results stored in resultSaveMean and resultSaveInvVariance buffers should not be used directly by the user. Depending on the batch normalization mode, the results stored in resultSaveInvVariance may vary. For the cache to work correctly, the input layer data must remain unchanged until the backward function is called. Note that both parameters can be NULL but only at the same time. In such a case, intermediate statistics will not be saved, and cudnnBatchNormalizationBackward() will have to re-compute them. It is recommended to use this cache as the memory overhead is relatively small because these tensors have a much lower product of dimensions than the data tensors.

Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:

  • One of the pointers alpha, beta, x, y, bnScale, bnBias is NULL.
  • The number of xDesc or yDesc tensor descriptor dimensions is not within the range of [4,5] (only 4D and 5D tensors are supported).
  • bnScaleBiasMeanVarDesc dimensions are not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for per-activation mode.
  • Exactly one of resultSaveMean, resultSaveInvVariance pointers are NULL.
  • Exactly one of resultRunningMean, resultRunningInvVariance pointers are NULL.
  • epsilon value is less than CUDNN_BN_MIN_EPSILON.
  • Dimensions or data types mismatch for xDesc, yDesc

3.8. cudnnBatchNormalizationForwardTrainingEx()

 cudnnStatus_t cudnnBatchNormalizationForwardTrainingEx(
    cudnnHandle_t                       handle,
    cudnnBatchNormMode_t                mode,
    cudnnBatchNormOps_t                 bnOps,
    const void                          *alpha,
    const void                          *beta,
    const cudnnTensorDescriptor_t       xDesc,
    const void                          *xData,
    const cudnnTensorDescriptor_t       zDesc, 
    const void                          *zData, 
    const cudnnTensorDescriptor_t       yDesc,
    void                                *yData,
    const cudnnTensorDescriptor_t       bnScaleBiasMeanVarDesc,
    const void                          *bnScaleData,
    const void                          *bnBiasData, 
    double                              exponentialAverageFactor,
    void                                *resultRunningMeanData,
    void                                *resultRunningVarianceData,
    double                              epsilon,
    void                                *saveMean,
    void                                *saveInvVariance,
    const cudnnActivationDescriptor_t   activationDesc, 
    void                                *workspace,
    size_t                              workSpaceSizeInBytes
    void                                *reserveSpace
    size_t                              reserveSpaceSizeInBytes);

This function is an extension of the cudnnBatchNormalizationForwardTraining() for performing the forward batch normalization layer computation.

This API will trigger the new semi-persistent NHWC kernel when the following conditions are true:

If workspace is NULL and workSpaceSizeInBytes of zero is passed in, this API will function exactly like the non-extended function cudnnBatchNormalizationForwardTraining().

This workspace is not required to be clean. Moreover, the workspace does not have to remain unchanged between the forward and backward pass, as it is not used for passing any information.

This extended function can accept a *workspace pointer to the GPU workspace, and workSpaceSizeInBytes, the size of the workspace, from the user.

The bnOps input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by element-wise addition and then activation.

Only 4D and 5D tensors are supported. The epsilon value has to be the same during the training, the backpropagation, and the inference.

When the tensor layout is NCHW, higher performance can be obtained when HW-packed tensors are used for x, dy, dx.

Parameters

handle

Handle to a previously created cuDNN library descriptor. For more information, see cudnnHandle_t.

mode

Mode of operation (spatial or per-activation). For more information, see cudnnBatchNormMode_t.

bnOps
Input. Mode of operation for the fast NHWC kernel. See cudnnBatchNormOps_t. This input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by element-wise addition and then activation.
*alpha, *beta
Inputs. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, see Scaling Parameters in the cuDNN Developer Guide.

xDesc, *xData, zDesc, *zData, yDesc, *yData

Tensor descriptors and pointers in device memory for the layer's input x and output y, and for the optional z tensor input for residual addition to the result of the batch normalization operation, prior to the activation. The optional zDes and *zData descriptors is only used when bnOps is CUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION, otherwise user may pass NULL. When in use, z should have exactly the same dimension as x and the final output y. For more information, see cudnnTensorDescriptor_t.

bnScaleBiasMeanVarDesc

Shared tensor descriptor desc for the secondary tensor that was derived by cudnnDeriveBNTensorDescriptor(). The dimensions for this tensor descriptor are dependent on the normalization mode.

*bnScaleData, *bnBiasData
Inputs. Pointers in device memory for the batch normalization scale and bias parameters (in the original paper, bias is referred to as beta and scale as gamma). Note that bnBiasData parameter can replace the previous layer's bias parameter for improved efficiency.
exponentialAverageFactor
Input. Factor used in the moving average computation as follows:
runningMean = runningMean*(1-factor) + newMean*factor
Use a factor=1/(1+n) at N-th call to the function to get Cumulative Moving Average (CMA) behavior such that:
CMA[n] = (x[1]+...+x[n])/n

This is proved below:

Writing
CMA[n+1] = (n*CMA[n]+x[n+1])/(n+1)
= ((n+1)*CMA[n]-CMA[n])/(n+1) + x[n+1]/(n+1)
= CMA[n]*(1-1/(n+1))+x[n+1]*1/(n+1)
= CMA[n]*(1-factor) + x(n+1)*factor
*resultRunningMeanData, *resultRunningVarianceData

Inputs/Outputs. Pointers to the running mean and running variance data. Both these pointers can be NULL but only at the same time. The value stored in resultRunningVarianceData (or passed as an input in inference mode) is the sample variance and is the moving average of variance[x] where the variance is computed either over batch or spatial+batch dimensions depending on the mode. If these pointers are not NULL, the tensors should be initialized to some reasonable values or to 0.

epsilon

Input. Epsilon value used in the batch normalization formula. Its value should be equal to or greater than the value defined for CUDNN_BN_MIN_EPSILON in cudnn.h. The same epsilon value should be used in forward and backward functions.

*saveMean, *saveInvVariance
Outputs. Optional cache parameters containing saved intermediate results computed during the forward pass. For this to work correctly, the layer's x and bnScaleData, bnBiasData data has to remain unchanged until this backward function is called. Note that both these parameters can be NULL but only at the same time. It is recommended to use this cache since the memory overhead is relatively small.
activationDesc
Input. The tensor descriptor for the activation operation. When the bnOps input is set to either CUDNN_BATCHNORM_OPS_BN_ACTIVATION or CUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION then this activation is used, otherwise user may pass NULL.
*workspace, workSpaceSizeInBytes
Inputs. *workspace is a pointer to the GPU workspace, and workSpaceSizeInBytes is the size of the workspace. When *workspace is not NULL and *workSpaceSizeInBytes is large enough, and the tensor layout is NHWC and the data type configuration is supported, then this function will trigger a new semi-persistent NHWC kernel for batch normalization. The workspace is not required to be clean. Also, the workspace does not need to remain unchanged between the forward and backward passes.
*reserveSpace
Input. Pointer to the GPU workspace for the reserveSpace.
reserveSpaceSizeInBytes
Input. The size of the reserveSpace. Must be equal or larger than the amount required by cudnnGetBatchNormalizationTrainingExReserveSpaceSize().

Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:

  • One of the pointers alpha, beta, x, y, bnScaleData, bnBiasData is NULL.
  • The number of xDesc or yDesc tensor descriptor dimensions is not within the [4,5] range (only 4D and 5D tensors are supported).
  • bnScaleBiasMeanVarDesc dimensions are not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for per-activation mode.
  • Exactly one of saveMean, saveInvVariance pointers are NULL.
  • Exactly one of resultRunningMeanData, resultRunningInvVarianceData pointers are NULL.
  • epsilon value is less than CUDNN_BN_MIN_EPSILON.
  • Dimensions or data types mismatch for xDesc, yDesc

3.9. cudnnConvolutionBackwardBias()

cudnnStatus_t cudnnConvolutionBackwardBias(
    cudnnHandle_t                    handle,
    const void                      *alpha,
    const cudnnTensorDescriptor_t    dyDesc,
    const void                      *dy,
    const void                      *beta,
    const cudnnTensorDescriptor_t    dbDesc,
    void                            *db)

This function computes the convolution function gradient with respect to the bias, which is the sum of every element belonging to the same feature map across all of the images of the input tensor. Therefore, the number of elements produced is equal to the number of features maps of the input tensor.

Parameters

handle

Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, see Scaling Parameters in the cuDNN Developer Guide.

dyDesc

Input. Handle to the previously initialized input tensor descriptor. For more information, see cudnnTensorDescriptor_t.

dy

Input. Data pointer to GPU memory associated with the tensor descriptor dyDesc.

dbDesc

Input. Handle to the previously initialized output tensor descriptor.

db

Output. Data pointer to GPU memory associated with the output tensor descriptor dbDesc.

Returns

CUDNN_STATUS_SUCCESS

The operation was launched successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:

  • One of the parameters n, height, width of the output tensor is not 1.
  • The numbers of feature maps of the input tensor and output tensor differ.
  • The dataType of the two tensor descriptors is different.

3.10. cudnnConvolutionBackwardData()

cudnnStatus_t cudnnConvolutionBackwardData(
    cudnnHandle_t                       handle,
    const void                         *alpha,
    const cudnnFilterDescriptor_t       wDesc,
    const void                         *w,
    const cudnnTensorDescriptor_t       dyDesc,
    const void                         *dy,
    const cudnnConvolutionDescriptor_t  convDesc,
    cudnnConvolutionBwdDataAlgo_t       algo,
    void                               *workSpace,
    size_t                              workSpaceSizeInBytes,
    const void                         *beta,
    const cudnnTensorDescriptor_t       dxDesc,
    void                               *dx)

This function computes the convolution data gradient of the tensor dy, where y is the output of the forward convolution in cudnnConvolutionForward(). It uses the specified algo, and returns the results in the output tensor dx. Scaling factors alpha and beta can be used to scale the computed result or accumulate with the current dx.

Parameters

handle

Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue

For more information, see Scaling Parameters in the cuDNN Developer Guide.

wDesc

Input. Handle to a previously initialized filter descriptor. For more information, see cudnnFilterDescriptor_t.

w

Input. Data pointer to GPU memory associated with the filter descriptor wDesc.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor. For more information, see cudnnTensorDescriptor_t.

dy

Input. Data pointer to GPU memory associated with the input differential tensor descriptor dyDesc.

convDesc

Input. Previously initialized convolution descriptor. For more information, see cudnnConvolutionDescriptor_t.

algo

Input. Enumerant that specifies which backward data convolution algorithm should be used to compute the results. For more information, see cudnnConvolutionBwdDataAlgo_t.

workSpace

Input. Data pointer to GPU memory to a workspace needed to able to execute the specified algorithm. If no workspace is needed for a particular algorithm, that pointer can be nil.

workSpaceSizeInBytes

Input. Specifies the size in bytes of the provided workSpace.

dxDesc

Input. Handle to the previously initialized output tensor descriptor.

dx

Input/Output. Data pointer to GPU memory associated with the output tensor descriptor dxDesc that carries the result.

Supported configurations

This function supports the following combinations of data types for wDesc, dyDesc, convDesc, and dxDesc.
Data Type Configurations wDesc, dyDesc and dxDesc Data Type convDesc Data Type
TRUE_HALF_CONFIG (only supported on architectures with true FP16 support, meaning, compute capability 5.3 and later) CUDNN_DATA_HALF CUDNN_DATA_HALF
PSEUDO_HALF_CONFIG CUDNN_DATA_HALF CUDNN_DATA_FLOAT
FLOAT_CONFIG CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT
DOUBLE_CONFIG CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE

Supported algorithms

Note: Specifying a separate algorithm can cause changes in performance, support and computation determinism. See the following for a list of algorithm options, and their respective supported parameters and deterministic behavior.

The table below shows the list of the supported 2D and 3D convolutions. The 2D convolutions are described first, followed by the 3D convolutions.

For the following terms, the short-form versions shown in the parentheses are used in the table below, for brevity:
  • CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 (_ALGO_0)
  • CUDNN_CONVOLUTION_BWD_DATA_ALGO_1 (_ALGO_1)
  • CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT (_FFT)
  • CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING (_FFT_TILING)
  • CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD (_WINOGRAD)
  • CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD_NONFUSED (_WINOGRAD_NONFUSED)
  • CUDNN_TENSOR_NCHW (_NCHW)
  • CUDNN_TENSOR_NHWC (_NHWC)
  • CUDNN_TENSOR_NCHW_VECT_C (_NCHW_VECT_C)
Table 12. For 2D convolutions: wDesc: _NHWC
Filter descriptor wDesc: _NHWC (see cudnnTensorFormat_t)
Algo Name Deterministic (Yes or No) Tensor Formats Supported for dyDesc Tensor Formats Supported for dxDesc Data Type Configurations Supported Important
_ALGO_1   NHWC HWC-packed NHWC HWC-packed TRUE_HALF_CONFIG

PSEUDO_HALF_CONFIG

FLOAT_CONFIG

 
Table 13. For 2D convolutions: wDesc: _NCHW
Filter descriptor wDesc: _NCHW.
Algo Name Deterministic (Yes or No) Tensor Formats Supported for dyDesc Tensor Formats Supported for dxDesc Data Type Configurations Supported Important
_ALGO_0 No NCHW CHW-packed All except _NCHW_VECT_C. PSEUDO_HALF_CONFIG

FLOAT_CONFIG

DOUBLE_CONFIG

Dilation: greater than 0 for all dimensions

convDesc Group Count Support: Greater than 0

_ALGO_1 Yes NCHW CHW-packed All except _NCHW_VECT_C. TRUE_HALF_CONFIG

PSEUDO_HALF_CONFIG

FLOAT_CONFIG

DOUBLE_CONFIG

Dilation: greater than 0 for all dimensions

convDesc Group Count Support: Greater than 0

_FFT Yes NCHW CHW-packed NCHW HW-packed PSEUDO_HALF_CONFIG

FLOAT_CONFIG

Dilation: 1 for all dimensions

convDesc Group Count Support: Greater than 0

dxDesc feature map height + 2 * convDesc zero-padding height must equal 256 or less

dxDesc feature map width + 2 * convDesc zero-padding width must equal 256 or less

convDesc vertical and horizontal filter stride must equal 1

wDesc filter height must be greater than convDesc zero-padding height

wDesc filter width must be greater than convDesc zero-padding width

_FFT_TILING Yes NCHW CHW-packed NCHW HW-packed PSEUDO_HALF_CONFIG

FLOAT_CONFIG

DOUBLE_CONFIG is also supported when the task can be handled by 1D FFT, meaning, one of the filter dimensions, width or height is 1.

Dilation: 1 for all dimensions

convDesc Group Count Support: Greater than 0

When neither of wDesc filter dimension is 1, the filter width and height must not be larger than 32

When either of wDesc filter dimension is 1, the largest filter dimension should not exceed 256

convDesc vertical and horizontal filter stride must equal 1 when either the filter width or filter height is 1, otherwise, the stride can be 1 or 2

wDesc filter height must be greater than convDesc zero-padding height

wDesc filter width must be greater than convDesc zero-padding width

_WINOGRAD Yes NCHW CHW-packed All except _NCHW_VECT_C. PSEUDO_HALF_CONFIG

FLOAT_CONFIG

Dilation: 1 for all dimensions

convDesc Group Count Support: Greater than 0

convDesc vertical and horizontal filter stride must equal 1

wDesc filter height must be 3

wDesc filter width must be 3

_WINOGRAD_NONFUSED Yes NCHW CHW-packed All except _NCHW_VECT_C. TRUE_HALF_CONFIG

PSEUDO_HALF_CONFIG

FLOAT_CONFIG

Dilation: 1 for all dimensions

convDesc Group Count Support: Greater than 0

convDesc vertical and horizontal filter stride must equal 1

wDesc filter (height, width) must be (3,3) or (5,5)

If wDesc filter (height, width) is (5,5) then the data type config TRUE_HALF_CONFIG is not supported

Table 14. For 3D convolutions: wDesc: _NCHW
Filter descriptor wDesc: _NCHW.
Algo Name Deterministic (Yes or No) Tensor Formats Supported for dyDesc Tensor Formats Supported for dxDesc Data Type Configurations Supported Important
_ALGO_0 Yes NCDHW CDHW-packed All except _NCDHW_VECT_C. PSEUDO_HALF_CONFIG

FLOAT_CONFIG

DOUBLE_CONFIG

Dilation: greater than 0 for all dimensions

convDesc Group Count Support: Greater than 0

_ALGO_1 Yes NCDHW fully-packed NCDHW fully-packed TRUE_HALF_CONFIG

PSEUDO_HALF_CONFIG

FLOAT_CONFIGDOUBLE_CONFIG

Dilation: 1 for all dimensions

convDesc Group Count Support: Greater than 0

_FFT_TILING Yes NCDHW CDHW-packed NCDHW DHW-packed PSEUDO_HALF_CONFIG

FLOAT_CONFIG

DOUBLE_CONFIG

Dilation: 1 for all dimensions

convDesc Group Count Support: Greater than 0

wDesc filter height must equal 16 or less

wDesc filter width must equal 16 or less

wDesc filter depth must equal 16 or less

convDesc must have all filter strides equal to 1

wDesc filter height must be greater than convDesc zero-padding height

wDesc filter width must be greater than convDesc zero-padding width

wDesc filter depth must be greater than convDesc zero-padding width

Returns

CUDNN_STATUS_SUCCESS

The operation was launched successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:

  • At least one of the following is NULL: handle, dyDesc, wDesc, convDesc, dxDesc, dy, w, dx, alpha, beta
  • wDesc and dyDesc have a non-matching number of dimensions
  • wDesc and dxDesc have a non-matching number of dimensions
  • wDesc has fewer than three number of dimensions
  • wDesc, dxDesc, and dyDesc have a non-matching data type.
  • wDesc and dxDesc have a non-matching number of input feature maps per image (or group in case of grouped convolutions).
  • dyDesc spatial sizes do not match with the expected size as determined by cudnnGetConvolutionNdForwardOutputDim
CUDNN_STATUS_NOT_SUPPORTED
At least one of the following conditions are met:
  • dyDesc or dxDesc have a negative tensor striding
  • dyDesc, wDesc or dxDesc has a number of dimensions that is not 4 or 5
  • The chosen algo does not support the parameters provided; see above for an exhaustive list of parameters that support each algo
  • dyDesc or wDesc indicate an output channel count that isn't a multiple of group count (if group count has been set in convDesc).
CUDNN_STATUS_MAPPING_ERROR

An error occurs during the texture binding of the filter data or the input differential tensor data

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

3.11. cudnnConvolutionBackwardFilter()

cudnnStatus_t cudnnConvolutionBackwardFilter(
    cudnnHandle_t                       handle,
    const void                         *alpha,
    const cudnnTensorDescriptor_t       xDesc,
    const void                         *x,
    const cudnnTensorDescriptor_t       dyDesc,
    const void                         *dy,
    const cudnnConvolutionDescriptor_t  convDesc,
    cudnnConvolutionBwdFilterAlgo_t     algo,
    void                               *workSpace,
    size_t                              workSpaceSizeInBytes,
    const void                         *beta,
    const cudnnFilterDescriptor_t       dwDesc,
    void                               *dw)

This function computes the convolution weight (filter) gradient of the tensor dy, where y is the output of the forward convolution in cudnnConvolutionForward(). It uses the specified algo, and returns the results in the output tensor dw. Scaling factors alpha and beta can be used to scale the computed result or accumulate with the current dw.

Parameters

handle

Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue

For more information, see Scaling Parameters in the cuDNN Developer Guide.

xDesc

Input. Handle to a previously initialized tensor descriptor. For more information, see cudnnTensorDescriptor_t.

x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

dy

Input. Data pointer to GPU memory associated with the backpropagation gradient tensor descriptor dyDesc.

convDesc

Input. Previously initialized convolution descriptor. For more information, see cudnnConvolutionDescriptor_t.

algo

Input. Enumerant that specifies which convolution algorithm should be used to compute the results. For more information, see cudnnConvolutionBwdFilterAlgo_t.

workSpace

Input. Data pointer to GPU memory to a workspace needed to able to execute the specified algorithm. If no workspace is needed for a particular algorithm, that pointer can be nil.

workSpaceSizeInBytes

Input. Specifies the size in bytes of the provided workSpace.

dwDesc

Input. Handle to a previously initialized filter gradient descriptor. For more information, see cudnnFilterDescriptor_t.

dw

Input/Output. Data pointer to GPU memory associated with the filter gradient descriptor dwDesc that carries the result.

Supported configurations

This function supports the following combinations of data types for xDesc, dyDesc, convDesc, and dwDesc.

Data Type Configurations xDesc, dyDesc, and dwDesc Data Type convDesc Data Type
TRUE_HALF_CONFIG (only supported on architectures with true FP16 support, meaning, compute capability 5.3 and later) CUDNN_DATA_HALF CUDNN_DATA_HALF
PSEUDO_HALF_CONFIG CUDNN_DATA_HALF CUDNN_DATA_FLOAT
FLOAT_CONFIG CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT
DOUBLE_CONFIG CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE

Supported algorithms

Note: Specifying a separate algorithm can cause changes in performance, support and computation determinism. See the following table for an exhaustive list of algorithm options and their respective supported parameters and deterministic behavior.

The table below shows the list of the supported 2D and 3D convolutions. The 2D convolutions are described first, followed by the 3D convolutions.

For the following terms, the short-form versions shown in the parentheses are used in the table below, for brevity:
  • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 (_ALGO_0)
  • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1 (_ALGO_1)
  • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3 (_ALGO_3)
  • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_FFT (_FFT)
  • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_FFT_TILING (_FFT_TILING)
  • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_WINOGRAD_NONFUSED (_WINOGRAD_NONFUSED)
  • CUDNN_TENSOR_NCHW (_NCHW)
  • CUDNN_TENSOR_NHWC (_NHWC)
  • CUDNN_TENSOR_NCHW_VECT_C (_NCHW_VECT_C)
Table 15. For 2D convolutions: dwDesc: _NHWC
Filter descriptor dwDesc: _NHWC (see cudnnTensorFormat_t)
Algo Name Deterministic (Yes or No) Tensor Formats Supported for dyDesc Tensor Formats Supported for dxDesc Data Type Configurations Supported Important
_ALGO_0 and _ALGO_1   NHWC HWC-packed NHWC HWC-packed PSEUDO_HALF_CONFIG

FLOAT_CONFIG

 
Table 16. For 2D convolutions: wDesc: _NCHW
Filter descriptor wDesc: _NCHW
Algo Name Deterministic (Yes or No) Tensor Formats Supported for dyDesc Tensor Formats Supported for dxDesc Data Type Configurations Supported Important
_ALGO_0 No All except _NCHW_VECT_C. NCHW CHW-packed PSEUDO_HALF_CONFIG

FLOAT_CONFIG

DOUBLE_CONFIG

Dilation: greater than 0 for all dimensions

convDesc Group Count Support: Greater than 0

This algo is not supported if output is of type CUDNN_DATA_HALF and the number of elements in dw is odd.

_ALGO_1 Yes _NCHW or _NHWC NCHW CHW-packed TRUE_HALF_CONFIG

PSEUDO_HALF_CONFIG

FLOAT_CONFIG

DOUBLE_CONFIG

Dilation: 1 for all dimensions

convDesc Group Count Support: Greater than 0

_FFT Yes NCHW CHW-packed NCHW CHW-packed PSEUDO_HALF_CONFIG

FLOAT_CONFIG

Dilation: 1 for all dimensions

convDesc Group Count Support: Greater than 0

xDesc feature map height + 2 * convDesc zero-padding height must equal 256 or less

xDesc feature map width + 2 * convDesc zero-padding width must equal 256 or less

convDesc vertical and horizontal filter stride must equal 1

dwDesc filter height must be greater than convDesc zero-padding height

dwDesc filter width must be greater than convDesc zero-padding width

_ALGO_3 Yes All except _NCHW_VECT_C NCHW CHW-packed PSEUDO_HALF_CONFIG

FLOAT_CONFIG

DOUBLE_CONFIG

Dilation: 1 for all dimensions

convDesc Group Count Support: Greater than 0

_WINOGRAD_NONFUSED Yes All except _NCHW_VECT_C NCHW CHW-packed TRUE_HALF_CONFIG

PSEUDO_HALF_CONFIG

FLOAT_CONFIG

Dilation: 1 for all dimensions

convDesc Group Count Support: Greater than 0

convDesc vertical and horizontal filter stride must equal 1

wDesc filter (height, width) must be (3,3) or (5,5)

If wDesc filter (height, width) is (5,5), then the data type config TRUE_HALF_CONFIG is not supported.

_FFT_TILING Yes NCHW CHW-packed NCHW CHW-packed PSEUDO_HALF_CONFIG

FLOAT_CONFIG

DOUBLE_CONFIG

Dilation: 1 for all dimensions

convDesc Group Count Support: Greater than 0

xDesc width or height must equal 1

dyDesc width or height must equal 1 (the same dimension as in xDesc). The other dimension must be less than or equal to 256, meaning, the largest 1D tile size currently supported.

convDesc vertical and horizontal filter stride must equal 1

dwDesc filter height must be greater than convDesc zero-padding height

dwDesc filter width must be greater than convDesc zero-padding width

Table 17. For 3D convolutions: wDesc: _NCHW
Filter descriptor wDesc: _NCHW.
Algo Name (3D Convolutions) Deterministic (Yes or No) Tensor Formats Supported for dyDesc Tensor Formats Supported for dxDesc Data Type Configurations Supported Important
_ALGO_0 No All except _NCDHW_VECT_C. NCDHW CDHW-packed PSEUDO_HALF_CONFIG

FLOAT_CONFIG

DOUBLE_CONFIG

Dilation: greater than 0 for all dimensions

convDesc Group Count Support: Greater than 0

_ALGO_3 No NCDHW fully-packed NCDHW fully-packed PSEUDO_HALF_CONFIG

FLOAT_CONFIG

DOUBLE_CONFIG

Dilation: greater than 0 for all dimensions

convDesc Group Count Support: Greater than 0

Returns

CUDNN_STATUS_SUCCESS

The operation was launched successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:

  • At least one of the following is NULL: handle, xDesc, dyDesc, convDesc, dwDesc, xData, dyData, dwData, alpha, beta
  • xDesc and dyDesc have a non-matching number of dimensions
  • xDesc and dwDesc have a non-matching number of dimensions
  • xDesc has fewer than three number of dimensions
  • xDesc, dyDesc, and dwDesc have a non-matching data type.
  • xDesc and dwDesc have a non-matching number of input feature maps per image (or group in case of grouped convolutions).
  • yDesc or wDesc indicate an output channel count that isn't a multiple of group count (if group count has been set in convDesc).
CUDNN_STATUS_NOT_SUPPORTED
At least one of the following conditions are met:
  • xDesc or dyDesc have negative tensor striding
  • xDesc, dyDesc or dwDesc has a number of dimensions that is not 4 or 5
  • The chosen algo does not support the parameters provided; see above for exhaustive list of parameter support for each algo
CUDNN_STATUS_MAPPING_ERROR

An error occurs during the texture binding of the filter data.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

3.12. cudnnConvolutionBiasActivationForward()

cudnnStatus_t cudnnConvolutionBiasActivationForward(
    cudnnHandle_t                       handle,
    const void                         *alpha1,
    const cudnnTensorDescriptor_t       xDesc,
    const void                         *x,
    const cudnnFilterDescriptor_t       wDesc,
    const void                         *w,
    const cudnnConvolutionDescriptor_t  convDesc,
    cudnnConvolutionFwdAlgo_t           algo,
    void                               *workSpace,
    size_t                              workSpaceSizeInBytes,
    const void                         *alpha2,
    const cudnnTensorDescriptor_t       zDesc,
    const void                         *z,
    const cudnnTensorDescriptor_t       biasDesc,
    const void                         *bias,
    const cudnnActivationDescriptor_t   activationDesc,
    const cudnnTensorDescriptor_t       yDesc,
    void                               *y)
This function applies a bias and then an activation to the convolutions or cross-correlations of cudnnConvolutionForward(), returning results in y. The full computation follows the equation y = act ( alpha1 * conv(x) + alpha2 * z + bias ).
Note:

Parameters

handle

Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.

alpha1, alpha2
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue

For more information, see Scaling Parameters in the cuDNN Developer Guide.

xDesc

Input. Handle to a previously initialized tensor descriptor. For more information, see cudnnTensorDescriptor_t.

x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc.

wDesc

Input. Handle to a previously initialized filter descriptor. For more information, see cudnnFilterDescriptor_t.

w

Input. Data pointer to GPU memory associated with the filter descriptor wDesc.

convDesc

Input. Previously initialized convolution descriptor. For more information, see cudnnConvolutionDescriptor_t.

algo

Input. Enumerant that specifies which convolution algorithm should be used to compute the results. For more information, see cudnnConvolutionFwdAlgo_t.

workSpace

Input. Data pointer to GPU memory to a workspace needed to able to execute the specified algorithm. If no workspace is needed for a particular algorithm, that pointer can be nil.

workSpaceSizeInBytes

Input. Specifies the size in bytes of the provided workSpace.

zDesc

Input. Handle to a previously initialized tensor descriptor.

z

Input. Data pointer to GPU memory associated with the tensor descriptor zDesc.

biasDesc

Input. Handle to a previously initialized tensor descriptor.

bias

Input. Data pointer to GPU memory associated with the tensor descriptor biasDesc.

activationDesc

Input. Handle to a previously initialized activation descriptor. For more information, see cudnnActivationDescriptor_t.

yDesc

Input. Handle to a previously initialized tensor descriptor.

y

Input/Output. Data pointer to GPU memory associated with the tensor descriptor yDesc that carries the result of the convolution.

For the convolution step, this function supports the specific combinations of data types for xDesc, wDesc, convDesc, and yDesc as listed in the documentation of cudnnConvolutionForward(). The following table specifies the supported combinations of data types for x, y, z, bias, and alpha1/alpha2.

Table 18. Supported combinations of data types (X = CUDNN_DATA)
x w y and z bias alpha1/alpha2
X_DOUBLE X_DOUBLE X_DOUBLE X_DOUBLE X_DOUBLE
X_FLOAT X_FLOAT X_FLOAT X_FLOAT X_FLOAT
X_HALF X_HALF X_HALF X_HALF X_FLOAT
X_INT8 X_INT8 X_INT8 X_FLOAT X_FLOAT
X_INT8 X_INT8 X_FLOAT X_FLOAT X_FLOAT
X_INT8x4 X_INT8x4 X_INT8x4 X_FLOAT X_FLOAT
X_INT8x4 X_INT8x4 X_FLOAT X_FLOAT X_FLOAT
X_UINT8 X_INT8 X_INT8 X_FLOAT X_FLOAT
X_UINT8 X_INT8 X_FLOAT X_FLOAT X_FLOAT
X_UINT8x4 X_INT8x4 X_INT8x4 X_FLOAT X_FLOAT
X_UINT8x4 X_INT8x4 X_FLOAT X_FLOAT X_FLOAT

Returns

In addition to the error values listed by the documentation of cudnnConvolutionForward(), the possible error values returned by this function and their meanings are listed below.
CUDNN_STATUS_SUCCESS

The operation was launched successfully.

CUDNN_STATUS_BAD_PARAM
At least one of the following conditions are met:
  • At least one of the following is NULL: zDesc, zData, biasDesc, bias, activationDesc.
  • The second dimension of biasDesc and the first dimension of filterDesc are not equal.
  • zDesc and destDesc do not match.
CUDNN_STATUS_NOT_SUPPORTED
The function does not support the provided configuration. Some examples of non-supported configurations are as follows:
  • The mode of activationDesc is neither CUDNN_ACTIVATION_RELU or CUDNN_ACTIVATION_IDENTITY.
  • The reluNanOpt of activationDesc is not CUDNN_NOT_PROPAGATE_NAN.
  • The second stride of biasDesc is not equal to one.
  • The data type of biasDesc does not correspond to the data type of yDesc as listed in the above data types table.
CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

3.13. cudnnConvolutionForward()

cudnnStatus_t cudnnConvolutionForward(
    cudnnHandle_t                       handle,
    const void                         *alpha,
    const cudnnTensorDescriptor_t       xDesc,
    const void                         *x,
    const cudnnFilterDescriptor_t       wDesc,
    const void                         *w,
    const cudnnConvolutionDescriptor_t  convDesc,
    cudnnConvolutionFwdAlgo_t           algo,
    void                               *workSpace,
    size_t                              workSpaceSizeInBytes,
    const void                         *beta,
    const cudnnTensorDescriptor_t       yDesc,
    void                               *y)

This function executes convolutions or cross-correlations over x using filters specified with w, returning results in y. Scaling factors alpha and beta can be used to scale the input tensor and the output tensor respectively.

Note: The routine cudnnGetConvolution2dForwardOutputDim() or cudnnGetConvolutionNdForwardOutputDim() can be used to determine the proper dimensions of the output tensor descriptor yDesc with respect to xDesc, convDesc, and wDesc.

Parameters

handle

Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue

For more information, see Scaling Parameters in the cuDNN Developer Guide.

xDesc

Input. Handle to a previously initialized tensor descriptor. For more information, see cudnnTensorDescriptor_t.

x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc.

wDesc

Input. Handle to a previously initialized filter descriptor. For more information, see cudnnFilterDescriptor_t.

w

Input. Data pointer to GPU memory associated with the filter descriptor wDesc.

convDesc

Input. Previously initialized convolution descriptor. For more information, see cudnnConvolutionDescriptor_t.

algo

Input. Enumerant that specifies which convolution algorithm should be used to compute the results. For more information, see cudnnConvolutionFwdAlgo_t.

workSpace

Input. Data pointer to GPU memory to a workspace needed to able to execute the specified algorithm. If no workspace is needed for a particular algorithm, that pointer can be nil.

workSpaceSizeInBytes

Input. Specifies the size in bytes of the provided workSpace.

yDesc

Input. Handle to a previously initialized tensor descriptor.

y

Input/Output. Data pointer to GPU memory associated with the tensor descriptor yDesc that carries the result of the convolution.

Supported configurations

This function supports the following combinations of data types for xDesc, wDesc, convDesc, and yDesc.

Table 19. Supported configurations
Data Type Configurations xDesc and wDesc convDesc yDesc
TRUE_HALF_CONFIG (only supported on architectures with true FP16 support, meaning, compute capability 5.3 and later) CUDNN_DATA_HALF CUDNN_DATA_HALF CUDNN_DATA_HALF
PSEUDO_HALF_CONFIG CUDNN_DATA_HALF CUDNN_DATA_FLOAT CUDNN_DATA_HALF
FLOAT_CONFIG CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT
DOUBLE_CONFIG CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE
INT8_CONFIG (only supported on architectures with DP4A support, meaning, compute capability 6.1 and later) CUDNN_DATA_INT8 CUDNN_DATA_INT32 CUDNN_DATA_INT8
INT8_EXT_CONFIG (only supported on architectures with DP4A support, meaning, compute capability 6.1 and later) CUDNN_DATA_INT8 CUDNN_DATA_INT32 CUDNN_DATA_FLOAT
INT8x4_CONFIG (only supported on architectures with DP4A support, meaning, compute capability 6.1 and later) CUDNN_DATA_INT8x4 CUDNN_DATA_INT32 CUDNN_DATA_INT8x4
INT8x4_EXT_CONFIG (only supported on architectures with DP4A support, meaning, compute capability 6.1 and later) CUDNN_DATA_INT8x4 CUDNN_DATA_INT32 CUDNN_DATA_FLOAT
UINT8x4_CONFIG (only supported on architectures with DP4A support, meaning, compute capability 6.1 and later) CUDNN_DATA_UINT8x4 CUDNN_DATA_INT32 CUDNN_DATA_UINT8x4
UINT8x4_EXT_CONFIG (only supported on architectures with DP4A support, meaning, compute capability 6.1 and later) CUDNN_DATA_UINT8x4 CUDNN_DATA_INT32 CUDNN_DATA_FLOAT

Supported algorithms

Note: For this function, all algorithms perform deterministic computations. Specifying a separate algorithm can cause changes in performance and support.

The table below shows the list of the supported 2D and 3D convolutions. The 2D convolutions are described first, followed by the 3D convolutions.

For the following terms, the short-form versions shown in the parenthesis are used in the table below, for brevity:
  • CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM (_IMPLICIT_GEMM)
  • CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM (_IMPLICIT_PRECOMP_GEMM)
  • CUDNN_CONVOLUTION_FWD_ALGO_GEMM (_GEMM)
  • CUDNN_CONVOLUTION_FWD_ALGO_DIRECT (_DIRECT)
  • CUDNN_CONVOLUTION_FWD_ALGO_FFT (_FFT)
  • CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING (_FFT_TILING)
  • CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD (_WINOGRAD)
  • CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED (_WINOGRAD_NONFUSED)
  • CUDNN_TENSOR_NCHW (_NCHW)
  • CUDNN_TENSOR_NHWC (_NHWC)
  • CUDNN_TENSOR_NCHW_VECT_C (_NCHW_VECT_C)
Table 20. For 2D convolutions: wDesc: _NCHW
Filter descriptor wDesc: _NCHW (see cudnnTensorFormat_t)

convDesc Group count support: Greater than 0, for all algos.

Algo Name Tensor Formats Supported for xDesc Tensor Formats Supported for yDesc Data Type Configurations Supported Important
_IMPLICIT_GEMM All except _NCHW_VECT_C. All except _NCHW_VECT_C. PSEUDO_HALF_CONFIG

FLOAT_CONFIG

DOUBLE_CONFIG

Dilation: Greater than 0 for all dimensions
_IMPLICIT_PRECOMP_GEMM All except _NCHW_VECT_C. All except _NCHW_VECT_C. TRUE_HALF_CONFIG

PSEUDO_HALF_CONFIG

FLOAT_CONFIG

DOUBLE_CONFIG

Dilation: 1 for all dimensions
_GEMM All except _NCHW_VECT_C. All except _NCHW_VECT_C. PSEUDO_HALF_CONFIG

FLOAT_CONFIG

DOUBLE_CONFIG

Dilation: 1 for all dimensions
_FFT NCHW HW-packed NCHW HW-packed PSEUDO_HALF_CONFIG

FLOAT_CONFIG

Dilation: 1 for all dimensions

xDesc feature map height + 2 * convDesc zero-padding height must equal 256 or less

xDesc feature map width + 2 * convDesc zero-padding width must equal 256 or less

convDesc vertical and horizontal filter stride must equal 1

wDesc filter height must be greater than convDesc zero-padding height

wDesc filter width must be greater than convDesc zero-padding width

_FFT_TILING PSEUDO_HALF_CONFIG

FLOAT_CONFIG

DOUBLE_CONFIG is also supported when the task can be handled by 1D FFT, meaning, one of the filter dimension, width or height is 1.

Dilation: 1 for all dimensions

When neither of wDesc filter dimension is 1, the filter width and height must not be larger than 32

When either of wDesc filter dimension is 1, the largest filter dimension should not exceed 256

convDesc vertical and horizontal filter stride must equal 1 when either the filter width or filter height is 1, otherwise the stride can be a 1 or 2

wDesc filter height must be greater than convDesc zero-padding height

wDesc filter width must be greater than convDesc zero-padding width

_WINOGRAD All except_NCHW_VECT_C. All except_NCHW_VECT_C. PSEUDO_HALF_CONFIG

FLOAT_CONFIG

Dilation: 1 for all dimensions

convDesc vertical and horizontal filter stride must equal 1

wDesc filter height must be 3

wDesc filter width must be 3

_WINOGRAD_NONFUSED TRUE_HALF_CONFIG

PSEUDO_HALF_CONFIG

FLOAT_CONFIG

Dilation: 1 for all dimensions

convDesc vertical and horizontal filter stride must equal 1

wDesc filter (height, width) must be (3,3) or (5,5)

If wDesc filter (height, width) is (5,5), then data type config TRUE_HALF_CONFIG is not supported.

_DIRECT Currently not implemented in cuDNN.
Table 21. For 2D convolutions: wDesc: _NCHWC
Filter descriptor wDesc: _NCHWC

convDesc Group count support: Greater than 0.

Algo Name xDesc yDesc Data Type Configurations Supported Important
_IMPLICIT_GEMM NCHWC HWC-packed NCHWC HWC-packed PSEUDO_HALF_CONFIG

FLOAT_CONFIG

Dilation: Greater than 0 for all dimensions
_IMPLICIT_PRECOMP_GEMM All except _NCHW_VECT_C. All except _NCHW_VECT_C. INT8x4_CONFIG

INT8x4_EXT_CONFIG

UINT8x4_CONFIG

UINT8x4_EXT_CONFIG

Dilation: 1 for all dimensions
Table 22. For 2D convolutions: wDesc: _NHWC
Filter descriptor wDesc: _NHWC

convDesc Group count support: Greater than 0.

Algo Name xDesc yDesc Data Type Configurations Supported Important
_IMPLICIT_PRECOMP_GEMM NHWC NHWC INT8_CONFIG

INT8_EXT_CONFIG

INT8x4_CONFIG

INT8x4_EXT_CONFIG

UINT8x4_CONFIG

UINT8x4_EXT_CONFIG

Dilation: 1 for all dimensions

Input and output feature maps must be a multiple of 4.

_IMPLICIT_PRECOMP_GEMM NHWC HWC-packed. NHWC HWC-packed. TRUE_HALF_CONFIG

PSEUDO_HALF_CONFIG

FLOAT_CONFIG

 
Table 23. For 3D convolutions: wDesc: _NCHW
Filter descriptor wDesc: _NCHW

convDesc Group count support: Greater than 0, for all algos.

Algo Name xDesc yDesc Data Type Configurations Supported Important
_IMPLICIT_GEMM All except _NCHW_VECT_C. All except _NCHW_VECT_C. PSEUDO_HALF_CONFIG

FLOAT_CONFIG

DOUBLE_CONFIG

Dilation: Greater than 0 for all dimensions
_IMPLICIT_PRECOMP_GEMM Dilation: 1 for all dimensions
_FFT_TILING NCDHW DHW-packed NCDHW DHW-packed Dilation: 1 for all dimensions

wDesc filter height must equal 16 or less

wDesc filter width must equal 16 or less

wDesc filter depth must equal 16 or less

convDesc must have all filter strides equal to 1

wDesc filter height must be greater than convDesc zero-padding height

wDesc filter width must be greater than convDesc zero-padding width

wDesc filter depth must be greater than convDesc zero-padding width

Note: Tensors can be converted to and from CUDNN_TENSOR_NCHW_VECT_C with cudnnTransformTensor().

Returns

CUDNN_STATUS_SUCCESS

The operation was launched successfully.

CUDNN_STATUS_BAD_PARAM
At least one of the following conditions are met:
  • At least one of the following is NULL: handle, xDesc, wDesc, convDesc, yDesc, xData, w, yData, alpha, beta
  • xDesc and yDesc have a non-matching number of dimensions
  • xDesc and wDesc have a non-matching number of dimensions
  • xDesc has fewer than three number of dimensions
  • xDesc's number of dimensions is not equal to convDesc array length + 2
  • xDesc and wDesc have a non-matching number of input feature maps per image (or group in case of grouped convolutions)
  • yDesc or wDesc indicate an output channel count that isn't a multiple of group count (if group count has been set in convDesc).
  • xDesc, wDesc, and yDesc have a non-matching data type
  • For some spatial dimension, wDesc has a spatial size that is larger than the input spatial size (including zero-padding size)
CUDNN_STATUS_NOT_SUPPORTED
At least one of the following conditions are met:
  • xDesc or yDesc have negative tensor striding
  • xDesc, wDesc, or yDesc has a number of dimensions that is not 4 or 5
  • yDesc spatial sizes do not match with the expected size as determined by cudnnGetConvolutionNdForwardOutputDim()
  • The chosen algo does not support the parameters provided; see above for an exhaustive list of parameters supported for each algo
CUDNN_STATUS_MAPPING_ERROR

An error occurred during the texture binding of the filter data.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

3.14. cudnnCreate()

cudnnStatus_t cudnnCreate(cudnnHandle_t *handle)

This function initializes the cuDNN library and creates a handle to an opaque structure holding the cuDNN library context. It allocates hardware resources on the host and device and must be called prior to making any other cuDNN library calls.

The cuDNN library handle is tied to the current CUDA device (context). To use the library on multiple devices, one cuDNN handle needs to be created for each device.

For a given device, multiple cuDNN handles with different configurations (for example, different current CUDA streams) may be created. Because cudnnCreate() allocates some internal resources, the release of those resources by calling cudnnDestroy() will implicitly call cudaDeviceSynchronize; therefore, the recommended best practice is to call cudnnCreate/cudnnDestroy outside of performance-critical code paths.

For multithreaded applications that use the same device from different threads, the recommended programming model is to create one (or a few, as is convenient) cuDNN handle(s) per thread and use that cuDNN handle for the entire life of the thread.

Parameters

handle

Output. Pointer to pointer where to store the address to the allocated cuDNN handle. For more information, see cudnnHandle_t.

Returns

CUDNN_STATUS_BAD_PARAM

Invalid (NULL) input pointer supplied.

CUDNN_STATUS_NOT_INITIALIZED

No compatible GPU found, CUDA driver not installed or disabled, CUDA runtime API initialization failed.

CUDNN_STATUS_ARCH_MISMATCH

NVIDIA GPU architecture is too old.

CUDNN_STATUS_ALLOC_FAILED

Host memory allocation failed.

CUDNN_STATUS_INTERNAL_ERROR

CUDA resource allocation failed.

CUDNN_STATUS_LICENSE_ERROR

cuDNN license validation failed (only when the feature is enabled).

CUDNN_STATUS_SUCCESS

cuDNN handle was created successfully.

3.15. cudnnCreateActivationDescriptor()

cudnnStatus_t cudnnCreateActivationDescriptor(
        cudnnActivationDescriptor_t   *activationDesc)

This function creates an activation descriptor object by allocating the memory needed to hold its opaque structure. For more information, see cudnnActivationDescriptor_t.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

3.16. cudnnCreateAlgorithmDescriptor()

cudnnStatus_t cudnnCreateAlgorithmDescriptor(
    cudnnAlgorithmDescriptor_t *algoDesc)

This function creates an algorithm descriptor object by allocating the memory needed to hold its opaque structure.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

3.17. cudnnCreateAlgorithmPerformance()

cudnnStatus_t cudnnCreateAlgorithmPerformance(
    cudnnAlgorithmPerformance_t *algoPerf,
    int                         numberToCreate)

This function creates multiple algorithm performance objects by allocating the memory needed to hold their opaque structures.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

3.18. cudnnCreateAttnDescriptor()

cudnnStatus_t cudnnCreateAttnDescriptor(cudnnAttnDescriptor_t *attnDesc);

This function creates one instance of an opaque attention descriptor object by allocating the host memory for it and initializing all descriptor fields. The function writes NULL to attnDesc when the attention descriptor object cannot be allocated.

Use the cudnnSetAttnDescriptor() function to configure the attention descriptor and cudnnDestroyAttnDescriptor() to destroy it and release the allocated memory.

Parameters

attnDesc
Output. Pointer where the address to the newly created attention descriptor should be written.

Returns

CUDNN_STATUS_SUCCESS
The descriptor object was created successfully.
CUDNN_STATUS_BAD_PARAM
An invalid input argument was encountered (attnDesc=NULL).
CUDNN_STATUS_ALLOC_FAILED
The memory allocation failed.

3.19. cudnnCreateConvolutionDescriptor()

cudnnStatus_t cudnnCreateConvolutionDescriptor(
    cudnnConvolutionDescriptor_t *convDesc)

This function creates a convolution descriptor object by allocating the memory needed to hold its opaque structure. For more information, see cudnnConvolutionDescriptor_t.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

3.20. cudnnCreateCTCLossDescriptor()

cudnnStatus_t cudnnCreateCTCLossDescriptor(
    cudnnCTCLossDescriptor_t* ctcLossDesc)

This function creates a CTC loss function descriptor.

Parameters

ctcLossDesc

Output. CTC loss descriptor to be set. For more information, see cudnnCTCLossDescriptor_t.

Returns

CUDNN_STATUS_SUCCESS

The function returned successfully.

CUDNN_STATUS_BAD_PARAM

CTC loss descriptor passed to the function is invalid.

CUDNN_STATUS_ALLOC_FAILED

Memory allocation for this CTC loss descriptor failed.

3.21. cudnnCreateDropoutDescriptor()

cudnnStatus_t cudnnCreateDropoutDescriptor(
    cudnnDropoutDescriptor_t    *dropoutDesc)

This function creates a generic dropout descriptor object by allocating the memory needed to hold its opaque structure. For more information, see cudnnDropoutDescriptor_t.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

3.22. cudnnCreateFilterDescriptor()

cudnnStatus_t cudnnCreateFilterDescriptor(
    cudnnFilterDescriptor_t *filterDesc)

This function creates a filter descriptor object by allocating the memory needed to hold its opaque structure. For more information, see cudnnFilterDescriptor_t.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

3.23. cudnnCreateFusedOpsConstParamPack()

cudnnStatus_t cudnnCreateFusedOpsConstParamPack(
	cudnnFusedOpsConstParamPack_t *constPack, 
	cudnnFusedOps_t ops);		

This function creates an opaque structure to store the various problem size information, such as the shape, layout and the type of tensors, and the descriptors for convolution and activation, for the selected sequence of cudnnFusedOps computations.

Parameters

constPack
Input. The opaque structure that is created by this function. For more information, see cudnnFusedOpsConstParamPack_t.
ops
Input. The specific sequence of computations to perform in the cudnnFusedOps computations, as defined in the enumerant type cudnnFusedOps_t.

Returns

CUDNN_STATUS_BAD_PARAM
If either constPack or ops is NULL.
CUDNN_STATUS_SUCCESS
If the descriptor is created successfully.
CUDNN_STATUS_NOT_SUPPORTED
If the ops enum value is not supported or reserved for future use.

3.24. cudnnCreateFusedOpsPlan()

cudnnStatus_t cudnnCreateFusedOpsPlan(
	cudnnFusedOpsPlan_t *plan, 
	cudnnFusedOps_t ops);		

This function creates the plan descriptor for the cudnnFusedOps computation. This descriptor contains the plan information, including the problem type and size, which kernels should be run, and the internal workspace partition.

Parameters

plan
Input. A pointer to the instance of the descriptor created by this function.
ops
Input. The specific sequence of fused operations computations for which this plan descriptor should be created. For more information, see cudnnFusedOps_t.

Returns

CUDNN_STATUS_BAD_PARAM
If either the input *plan is NULL or the ops input is not a valid cudnnFusedOp enum.
CUDNN_STATUS_NOT_SUPPORTED
The ops input provided is not supported.
CUDNN_STATUS_SUCCESS
The plan descriptor is created successfully.

3.25. cudnnCreateFusedOpsVariantParamPack()

cudnnStatus_t cudnnCreateFusedOpsVariantParamPack(
	cudnnFusedOpsVariantParamPack_t *varPack, 
	cudnnFusedOps_t ops);		

This function creates a descriptor for cudnnFusedOps constant parameters.

Parameters

varPack
Input. Pointer to the descriptor created by this function. For more information, see cudnnFusedOpsVariantParamPack_t.
ops
Input. The specific sequence of fused operations computations for which this descriptor should be created.

Returns

CUDNN_STATUS_SUCCESS
The descriptor is successfully created.
CUDNN_STATUS_BAD_PARAM
If any input is invalid.

3.26. cudnnCreateLRNDescriptor()

cudnnStatus_t cudnnCreateLRNDescriptor(
            cudnnLRNDescriptor_t    *poolingDesc)

This function allocates the memory needed to hold the data needed for LRN and DivisiveNormalization layers operation and returns a descriptor used with subsequent layer forward and backward calls.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

cudnnCreateOpTensorDescriptor()

cudnnStatus_t cudnnCreateOpTensorDescriptor(
    cudnnOpTensorDescriptor_t* 	opTensorDesc)

This function creates a tensor pointwise math descriptor. For more information, see cudnnOpTensorDescriptor_t.

Parameters

opTensorDesc

Output. Pointer to the structure holding the description of the Tensor Pointwise math such as add, multiply, and more.

Returns

CUDNN_STATUS_SUCCESS

The function returned successfully.

CUDNN_STATUS_BAD_PARAM

Tensor pointwise math descriptor passed to the function is invalid.

CUDNN_STATUS_ALLOC_FAILED

Memory allocation for this tensor pointwise math descriptor failed.

3.28. cudnnCreatePersistentRNNPlan()

cudnnStatus_t cudnnCreatePersistentRNNPlan(
    cudnnRNNDescriptor_t        rnnDesc,
    const int                   minibatch,
    const cudnnDataType_t       dataType,
    cudnnPersistentRNNPlan_t   *plan)

This function creates a plan to execute persistent RNNs when using the CUDNN_RNN_ALGO_PERSIST_DYNAMIC algo. This plan is tailored to the current GPU and problem hyperparameters. This function call is expected to be expensive in terms of runtime and should be used infrequently. For more information, see cudnnRNNDescriptor_t, cudnnDataType_t, and cudnnPersistentRNNPlan_t.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING

A prerequisite runtime library cannot be found.

CUDNN_STATUS_NOT_SUPPORTED

The current hyperparameters are invalid.

3.29. cudnnCreatePoolingDescriptor()

cudnnStatus_t cudnnCreatePoolingDescriptor(
    cudnnPoolingDescriptor_t    *poolingDesc)

This function creates a pooling descriptor object by allocating the memory needed to hold its opaque structure.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

3.30. cudnnCreateReduceTensorDescriptor()

cudnnStatus_t cudnnCreateReduceTensorDescriptor(
	cudnnReduceTensorDescriptor_t*	reduceTensorDesc)

This function creates a reduce tensor descriptor object by allocating the memory needed to hold its opaque structure.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_BAD_PARAM

reduceTensorDesc is a NULL pointer.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

3.31. cudnnCreateRNNDataDescriptor()

cudnnStatus_t cudnnCreateRNNDataDescriptor(
    cudnnRNNDataDescriptor_t *RNNDataDesc)

This function creates a RNN data descriptor object by allocating the memory needed to hold its opaque structure.

Returns

CUDNN_STATUS_SUCCESS

The RNN data descriptor object was created successfully.

CUDNN_STATUS_BAD_PARAM

RNNDataDesc is NULL.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

3.32. cudnnCreateRNNDescriptor()

cudnnStatus_t cudnnCreateRNNDescriptor(
    cudnnRNNDescriptor_t    *rnnDesc)

This function creates a generic RNN descriptor object by allocating the memory needed to hold its opaque structure.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

3.33. cudnnCreateSeqDataDescriptor()

cudnnStatus_t cudnnCreateSeqDataDescriptor(cudnnSeqDataDescriptor_t *seqDataDesc);	

This function creates one instance of an opaque sequence data descriptor object by allocating the host memory for it and initializing all descriptor fields. The function writes NULL to seqDataDesc when the sequence data descriptor object cannot be allocated.

Use the cudnnSetSeqDataDescriptor() function to configure the sequence data descriptor and cudnnDestroySeqDataDescriptor() to destroy it and release the allocated memory.

Parameters

seqDataDesc
Output. Pointer where the address to the newly created sequence data descriptor should be written.

Returns

CUDNN_STATUS_SUCCESS
The descriptor object was created successfully.
CUDNN_STATUS_BAD_PARAM
An invalid input argument was encountered (seqDataDesc=NULL).
CUDNN_STATUS_ALLOC_FAILED
The memory allocation failed.

3.34. cudnnCreateSpatialTransformerDescriptor()

cudnnStatus_t cudnnCreateSpatialTransformerDescriptor(
    cudnnSpatialTransformerDescriptor_t *stDesc)

This function creates a generic spatial transformer descriptor object by allocating the memory needed to hold its opaque structure.

Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

3.35. cudnnCreateTensorDescriptor()

cudnnStatus_t cudnnCreateTensorDescriptor(
    cudnnTensorDescriptor_t *tensorDesc)

This function creates a generic tensor descriptor object by allocating the memory needed to hold its opaque structure. The data is initialized to all zeros.

Parameters

tensorDesc

Input. Pointer to pointer where the address to the allocated tensor descriptor object should be stored.

Returns

CUDNN_STATUS_BAD_PARAM

Invalid input argument.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

CUDNN_STATUS_SUCCESS

The object was created successfully.

3.36. cudnnCreateTensorTransformDescriptor()

cudnnStatus_t cudnnCreateTensorTransformDescriptor(
	cudnnTensorTransformDescriptor_t *transformDesc);

This function creates a tensor transform descriptor object by allocating the memory needed to hold its opaque structure. The tensor data is initialized to be all zero. Use the cudnnSetTensorTransformDescriptor() function to initialize the descriptor created by this function.

Parameters

transformDesc
Output. A pointer to an uninitialized tensor transform descriptor.

Returns

CUDNN_STATUS_SUCCESS
The descriptor object was created successfully.
CUDNN_STATUS_BAD_PARAM
The transformDesc is NULL.
CUDNN_STATUS_ALLOC_FAILED
The memory allocation failed.

3.37. cudnnCTCLoss()

cudnnStatus_t cudnnCTCLoss(
    cudnnHandle_t                        handle,
    const   cudnnTensorDescriptor_t      probsDesc,
    const   void                        *probs,
    const   int                         *labels,
    const   int                         *labelLengths,
    const   int                         *inputLengths,
    void                                *costs,
    const   cudnnTensorDescriptor_t      gradientsDesc,
    const   void                        *gradients,
    cudnnCTCLossAlgo_t                   algo,
    const   cudnnCTCLossDescriptor_t     ctcLossDesc,
    void                                *workspace,
    size_t                              *workSpaceSizeInBytes)
This function returns the CTC costs and gradients, given the probabilities and labels.
Note: This function has an inconsistent interface, for example, the probs input is probability normalized by softmax, but the gradients output is with respect to the unnormalized activation.

Parameters

handle

Input. Handle to a previously created cuDNN context. For more information, see cudnnHandle_t.

probsDesc

Input. Handle to the previously initialized probabilities tensor descriptor. For more information, see cudnnTensorDescriptor_t.

probs

Input. Pointer to a previously initialized probabilities tensor. These input probabilities are normalized by softmax.

labels

Input. Pointer to a previously initialized labels list.

labelLengths

Input. Pointer to a previously initialized lengths list, to walk the above labels list.

inputLengths

Input. Pointer to a previously initialized list of the lengths of the timing steps in each batch.

costs

Output. Pointer to the computed costs of CTC.

gradientsDesc

Input. Handle to a previously initialized gradients tensor descriptor.

gradients

Output. Pointer to the computed gradients of CTC. These computed gradient outputs are with respect to the unnormalized activation.

algo

Input. Enumerant that specifies the chosen CTC loss algorithm. For more information, see cudnnCTCLossAlgo_t.

ctcLossDesc

Input. Handle to the previously initialized CTC loss descriptor. For more information, see cudnnCTCLossDescriptor_t.

workspace

Input. Pointer to GPU memory of a workspace needed to able to execute the specified algorithm.

sizeInBytes

Input. Amount of GPU memory needed as workspace to be able to execute the CTC loss computation with the specified algo.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:

  • The dimensions of probsDesc do not match the dimensions of gradientsDesc.
  • The inputLengths do not agree with the first dimension of probsDesc.
  • The workSpaceSizeInBytes is not sufficient.
  • The labelLengths is greater than 256.
CUDNN_STATUS_NOT_SUPPORTED

A compute or data type other than FLOAT was chosen, or an unknown algorithm type was chosen.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

3.38. cudnnDeriveBNTensorDescriptor()

cudnnStatus_t cudnnDeriveBNTensorDescriptor(
    cudnnTensorDescriptor_t         derivedBnDesc,
    const cudnnTensorDescriptor_t   xDesc,
    cudnnBatchNormMode_t            mode)

This function derives a secondary tensor descriptor for the batch normalization scale, invVariance, bnBias, and bnScale subtensors from the layer's x data descriptor.

Use the tensor descriptor produced by this function as the bnScaleBiasMeanVarDesc parameter for the cudnnBatchNormalizationForwardInference() and cudnnBatchNormalizationForwardTraining() functions, and as the bnScaleBiasDiffDesc parameter in the cudnnBatchNormalizationBackward() function.

The resulting dimensions will be:
  • 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for BATCHNORM_MODE_SPATIAL
  • 1xCxHxW for 4D and 1xCxDxHxW for 5D for BATCHNORM_MODE_PER_ACTIVATION mode
For HALF input data type the resulting tensor descriptor will have a FLOAT type. For other data types, it will have the same type as the input data.
Note:
  • Only 4D and 5D tensors are supported.
  • The derivedBnDesc should be first created using cudnnCreateTensorDescriptor().
  • xDesc is the descriptor for the layer's x data and has to be setup with proper dimensions prior to calling this function.

Parameters

derivedBnDesc

Output. Handle to a previously created tensor descriptor.

xDesc

Input. Handle to a previously created and initialized layer's x data descriptor.

mode

Input. Batch normalization layer mode of operation.

Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_BAD_PARAM

Invalid Batch Normalization mode.

3.39. cudnnDestroy()

cudnnStatus_t cudnnDestroy(cudnnHandle_t handle)

This function releases the resources used by the cuDNN handle. This function is usually the last call with a particular handle to the cuDNN handle. Because cudnnCreate() allocates some internal resources, the release of those resources by calling cudnnDestroy() will implicitly call cudaDeviceSynchronize; therefore, the recommended best practice is to call cudnnCreate/cudnnDestroy outside of performance-critical code paths.

Parameters

handle

Input. Pointer to the cuDNN handle to be destroyed.

Returns

CUDNN_STATUS_SUCCESS

The cuDNN context destruction was successful.

CUDNN_STATUS_BAD_PARAM

Invalid (NULL) pointer supplied.

3.40. cudnnDestroyActivationDescriptor()

cudnnStatus_t cudnnDestroyActivationDescriptor(
        cudnnActivationDescriptor_t activationDesc)

This function destroys a previously created activation descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

3.41. cudnnDestroyAlgorithmDescriptor()

cudnnStatus_t cudnnDestroyAlgorithmDescriptor(
        cudnnActivationDescriptor_t algorithmDesc)

This function destroys a previously created algorithm descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

3.42. cudnnDestroyAlgorithmPerformance()

cudnnStatus_t cudnnDestroyAlgorithmPerformance(
        cudnnAlgorithmPerformance_t     algoPerf)

This function destroys a previously created algorithm descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

3.43. cudnnDestroyAttnDescriptor()

cudnnStatus_t cudnnDestroyAttnDescriptor(cudnnAttnDescriptor_t attnDesc);

This function destroys the attention descriptor object and releases its memory. The attnDesc argument can be NULL. Invoking cudnnDestroyAttnDescriptor() with a NULL argument is a no operation (NOP).

The cudnnDestroyAttnDescriptor() function is not able to detect if the attnDesc argument holds a valid address. Undefined behavior will occur in case of passing an invalid pointer, not returned by the cudnnCreateAttnDescriptor() function, or in the double deletion scenario of a valid address.

Parameters

attnDesc
Input. Pointer to the attention descriptor object to be destroyed.

Returns

CUDNN_STATUS_SUCCESS
The descriptor was destroyed successfully.

3.44. cudnnDestroyConvolutionDescriptor()

cudnnStatus_t cudnnDestroyConvolutionDescriptor(
    cudnnConvolutionDescriptor_t convDesc)

This function destroys a previously created convolution descriptor object.

Returns

CUDNN_STATUS_SUCCESS
The descriptor was destroyed successfully.

3.45. cudnnDestroyCTCLossDescriptor()

cudnnStatus_t cudnnDestroyCTCLossDescriptor(
    cudnnCTCLossDescriptor_t 	ctcLossDesc)

This function destroys a CTC loss function descriptor object.

Parameters

ctcLossDesc

Input. CTC loss function descriptor to be destroyed.

Returns

CUDNN_STATUS_SUCCESS

The function returned successfully.

3.46. cudnnDestroyDropoutDescriptor()

cudnnStatus_t cudnnDestroyDropoutDescriptor(
    cudnnDropoutDescriptor_t dropoutDesc)

This function destroys a previously created dropout descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

3.47. cudnnDestroyFilterDescriptor()

cudnnStatus_t cudnnDestroyFilterDescriptor(
    cudnnFilterDescriptor_t filterDesc)

This function destroys a previously created tensor 4D descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

3.48. cudnnDestroyFusedOpsConstParamPack()

cudnnStatus_t cudnnDestroyFusedOpsConstParamPack(
	cudnnFusedOpsConstParamPack_t constPack);		

This function destroys a previously-created cudnnFusedOpsConstParamPack_t structure.

Parameters

constPack
Input. The cudnnFusedOpsConstParamPack_t structure that should be destroyed.

Returns

CUDNN_STATUS_SUCCESS
If the descriptor is destroyed successfully.
CUDNN_STATUS_INTERNAL_ERROR
If the ops enum value is not supported or invalid.

3.49. cudnnDestroyFusedOpsPlan()

cudnnStatus_t cudnnDestroyFusedOpsPlan(
	cudnnFusedOpsPlan_t plan);		

This function destroys the plan descriptor provided.

Parameters

plan
Input. The descriptor that should be destroyed by this function.

Returns

CUDNN_STATUS_SUCCESS
If either the plan descriptor is NULL or the descriptor is successfully destroyed.

3.50. cudnnDestroyFusedOpsVariantParamPack()

cudnnStatus_t cudnnDestroyFusedOpsVariantParamPack(
	cudnnFusedOpsVariantParamPack_t varPack);		

This function destroys a previously-created descriptor for cudnnFusedOps constant parameters.

Parameters

varPack
Input. The descriptor that should be destroyed.

Returns

CUDNN_STATUS_SUCCESS
The descriptor is successfully destroyed.

3.51. cudnnDestroyLRNDescriptor()

cudnnStatus_t cudnnDestroyLRNDescriptor(
    cudnnLRNDescriptor_t lrnDesc)

This function destroys a previously created LRN descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

3.52. cudnnDestroyOpTensorDescriptor()

cudnnStatus_t cudnnDestroyOpTensorDescriptor(
    cudnnOpTensorDescriptor_t   opTensorDesc)

This function deletes a tensor pointwise math descriptor object.

Parameters

opTensorDesc

Input. Pointer to the structure holding the description of the tensor pointwise math to be deleted.

Returns

CUDNN_STATUS_SUCCESS

The function returned successfully.

3.53. cudnnDestroyPersistentRNNPlan()

cudnnStatus_t cudnnDestroyPersistentRNNPlan(
    cudnnPersistentRNNPlan_t plan)

This function destroys a previously created persistent RNN plan object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

3.54. cudnnDestroyPoolingDescriptor()

cudnnStatus_t cudnnDestroyPoolingDescriptor(
    cudnnPoolingDescriptor_t poolingDesc)

This function destroys a previously created pooling descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

3.55. cudnnDestroyReduceTensorDescriptor()

cudnnStatus_t cudnnDestroyReduceTensorDescriptor(
    cudnnReduceTensorDescriptor_t   tensorDesc)

This function destroys a previously created reduce tensor descriptor object. When the input pointer is NULL, this function performs no destroy operation.

Parameters

tensorDesc

Input. Pointer to the reduce tensor descriptor object to be destroyed.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

3.56. cudnnDestroyRNNDataDescriptor()

cudnnStatus_t cudnnDestroyRNNDataDescriptor(
    cudnnRNNDataDescriptor_t RNNDataDesc)

This function destroys a previously created RNN data descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The RNN data descriptor object was destroyed successfully.

3.57. cudnnDestroyRNNDescriptor()

cudnnStatus_t cudnnDestroyRNNDescriptor(
    cudnnRNNDescriptor_t rnnDesc)

This function destroys a previously created RNN descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

3.58. cudnnDestroySeqDataDescriptor()

cudnnStatus_t cudnnDestroySeqDataDescriptor(cudnnSeqDataDescriptor_t seqDataDesc);

This function destroys the sequence data descriptor object and releases its memory. The seqDataDesc argument can be NULL. Invoking cudnnDestroySeqDataDescriptor() with a NULL argument is a no operation (NOP).

The cudnnDestroySeqDataDescriptor() function is not able to detect if the seqDataDesc argument holds a valid address. Undefined behavior will occur in case of passing an invalid pointer, not returned by the cudnnCreateSeqDataDescriptor() function, or in the double deletion scenario of a valid address.

Parameters

seqDataDesc
Input. Pointer to the sequence data descriptor object to be destroyed.

Returns

CUDNN_STATUS_SUCCESS
The descriptor was destroyed successfully.

3.59. cudnnDestroySpatialTransformerDescriptor()

cudnnStatus_t cudnnDestroySpatialTransformerDescriptor(
    cudnnSpatialTransformerDescriptor_t stDesc)

This function destroys a previously created spatial transformer descriptor object.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

3.60. cudnnDestroyTensorDescriptor()

cudnnStatus_t cudnnDestroyTensorDescriptor(cudnnTensorDescriptor_t tensorDesc)

This function destroys a previously created tensor descriptor object. When the input pointer is NULL, this function performs no destroy operation.

Parameters

tensorDesc

Input. Pointer to the tensor descriptor object to be destroyed.

Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

3.61. cudnnDestroyTensorTransformDescriptor()

cudnnStatus_t cudnnDestroyTensorTransformDescriptor(
	cudnnTensorTransformDescriptor_t transformDesc);

Destroys a previously created tensor transform descriptor.

Parameters

transformDesc
Input. The tensor transform descriptor to be destroyed.

Returns

CUDNN_STATUS_SUCCESS
The descriptor was destroyed successfully.

3.62. cudnnDivisiveNormalizationBackward()

 cudnnStatus_t cudnnDivisiveNormalizationBackward(
    cudnnHandle_t                    handle,
    cudnnLRNDescriptor_t             normDesc,
    cudnnDivNormMode_t               mode,
    const void                      *alpha,
    const cudnnTensorDescriptor_t    xDesc,
    const void                      *x,
    const void                      *means,
    const void                      *dy,
    void                            *temp,
    void                            *temp2,
    const void                      *beta,
    const cudnnTensorDescriptor_t    dxDesc,
    void                            *dx,
    void                            *dMeans)

This function performs the backward DivisiveNormalization layer computation.

Note: Supported tensor formats are NCHW for 4D and NCDHW for 5D with any non-overlapping non-negative strides. Only 4D and 5D tensors are supported.

Parameters

handle

Input. Handle to a previously created cuDNN library descriptor.

normDesc

Input. Handle to a previously initialized LRN parameter descriptor (this descriptor is used for both LRN and DivisiveNormalization layers).

mode

Input. DivisiveNormalization layer mode of operation. Currently only CUDNN_DIVNORM_PRECOMPUTED_MEANS is implemented. Normalization is performed using the means input tensor that is expected to be precomputed by the user.

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, see Scaling Parameters in the cuDNN Developer Guide.

xDesc, x, means

Input. Tensor descriptor and pointers in device memory for the layer's x and means data. Note that the means tensor is expected to be precomputed by the user. It can also contain any valid values (not required to be actual means, and can be for instance a result of a convolution with a Gaussian kernel).

dy

Input. Tensor pointer in device memory for the layer's dy cumulative loss differential data (error backpropagation).

temp, temp2

Workspace. Temporary tensors in device memory. These are used for computing intermediate values during the backward pass. These tensors do not have to be preserved from forward to backward pass. Both use xDesc as a descriptor.

dxDesc

Input. Tensor descriptor for dx and dMeans.

dx, dMeans

Output. Tensor pointers (in device memory) for the layers resulting cumulative gradients dx and dMeans (dLoss/dx and dLoss/dMeans). Both share the same descriptor.

Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:

  • One of the tensor pointers x, dx, temp, tmep2, dy is NULL.
  • Number of any of the input or output tensor dimensions is not within the [4,5] range.
  • Either alpha or beta pointer is NULL.
  • A mismatch in dimensions between xDesc and dxDesc.
  • LRN descriptor parameters are outside of their valid ranges.
  • Any of the tensor strides is negative.
CUDNN_STATUS_UNSUPPORTED

The function does not support the provided configuration, for example, any of the input and output tensor strides mismatch (for the same dimension) is a non-supported configuration.

3.63. cudnnDivisiveNormalizationForward()

cudnnStatus_t cudnnDivisiveNormalizationForward(
    cudnnHandle_t                    handle,
    cudnnLRNDescriptor_t             normDesc,
    cudnnDivNormMode_t               mode,
    const void                      *alpha,
    const cudnnTensorDescriptor_t    xDesc,
    const void                      *x,
    const void                      *means,
    void                            *temp,
    void                            *temp2,
    const void                      *beta,
    const cudnnTensorDescriptor_t    yDesc,
    void                            *y)
This function performs the forward spatial DivisiveNormalization layer computation. It divides every value in a layer by the standard deviation of its spatial neighbors as described in What is the Best Multi-Stage Architecture for Object Recognition, Jarrett 2009, Local Contrast Normalization Layer section. Note that DivisiveNormalization only implements the x/max(c, sigma_x) portion of the computation, where sigma_x is the variance over the spatial neighborhood of x. The full LCN (Local Contrastive Normalization) computation can be implemented as a two-step process:
x_m = x-mean(x);
y = x_m/max(c, sigma(x_m));

The x-mean(x) which is often referred to as "subtractive normalization" portion of the computation can be implemented using cuDNN average pooling layer followed by a call to addTensor.

Note: Supported tensor formats are NCHW for 4D and NCDHW for 5D with any non-overlapping non-negative strides. Only 4D and 5D tensors are supported.

Parameters

handle

Input. Handle to a previously created cuDNN library descriptor.

normDesc

Input. Handle to a previously initialized LRN parameter descriptor. This descriptor is used for both LRN and DivisiveNormalization layers.

divNormMode

Input. DivisiveNormalization layer mode of operation. Currently only CUDNN_DIVNORM_PRECOMPUTED_MEANS is implemented. Normalization is performed using the means input tensor that is expected to be precomputed by the user.

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, see Scaling Parameters in the cuDNN Developer Guide.

xDesc, yDesc

Input. Tensor descriptor objects for the input and output tensors. Note that xDesc is shared between x, means, temp, and temp2 tensors.

x

Input. Input tensor data pointer in device memory.

means

Input. Input means tensor data pointer in device memory. Note that this tensor can be NULL (in that case its values are assumed to be zero during the computation). This tensor also doesn't have to contain means, these can be any values, a frequently used variation is a result of convolution with a normalized positive kernel (such as Gaussian).

temp, temp2

Workspace. Temporary tensors in device memory. These are used for computing intermediate values during the forward pass. These tensors do not have to be preserved as inputs from forward to the backward pass. Both use xDesc as their descriptor.

y

Output. Pointer in device memory to a tensor for the result of the forward DivisiveNormalization computation.

Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:

  • One of the tensor pointers x, y, temp, temp2 is NULL.
  • Number of input tensor or output tensor dimensions is outside of [4,5] range.
  • A mismatch in dimensions between any two of the input or output tensors.
  • For in-place computation when pointers x == y, a mismatch in strides between the input data and output data tensors.
  • Alpha or beta pointer is NULL.
  • LRN descriptor parameters are outside of their valid ranges.
  • Any of the tensor strides are negative.
CUDNN_STATUS_UNSUPPORTED

The function does not support the provided configuration, for example, any of the input and output tensor strides mismatch (for the same dimension) is a non-supported configuration.

3.64. cudnnDropoutBackward()

cudnnStatus_t cudnnDropoutBackward(
    cudnnHandle_t                   handle,
    const cudnnDropoutDescriptor_t  dropoutDesc,
    const cudnnTensorDescriptor_t   dydesc,
    const void                     *dy,
    const cudnnTensorDescriptor_t   dxdesc,
    void                           *dx,
    void                           *reserveSpace,
    size_t                          reserveSpaceSizeInBytes)

This function performs backward dropout operation over dy returning results in dx. If during forward dropout operation value from x was propagated to y then during backward operation value from dy will be propagated to dx, otherwise, dx value will be set to 0.

Note: Better performance is obtained for fully packed tensors.

Parameters

handle

Input. Handle to a previously created cuDNN context.

dropoutDesc

Input. Previously created dropout descriptor object.

dyDesc

Input. Handle to a previously initialized tensor descriptor.

dy

Input. Pointer to data of the tensor described by the dyDesc descriptor.

dxDesc

Input. Handle to a previously initialized tensor descriptor.

dx

Output. Pointer to data of the tensor described by the dxDesc descriptor.

reserveSpace

Input. Pointer to user-allocated GPU memory used by this function. It is expected that reserveSpace was populated during a call to cudnnDropoutForward and has not been changed.

reserveSpaceSizeInBytes

Input. Specifies the size in bytes of the provided memory for the reserve space

Returns

CUDNN_STATUS_SUCCESS

The call was successful.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:

  • The number of elements of input tensor and output tensors differ.
  • The datatype of the input tensor and output tensors differs.
  • The strides of the input tensor and output tensors differ and in-place operation is used (i.e., x and y pointers are equal).
  • The provided reserveSpaceSizeInBytes is less then the value returned by cudnnDropoutGetReserveSpaceSize.
  • cudnnSetDropoutDescriptor has not been called on dropoutDesc with the non-NULLstates argument.
CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

3.65. cudnnDropoutForward()

cudnnStatus_t cudnnDropoutForward(
    cudnnHandle_t                       handle,
    const cudnnDropoutDescriptor_t      dropoutDesc,
    const cudnnTensorDescriptor_t       xdesc,
    const void                         *x,
    const cudnnTensorDescriptor_t       ydesc,
    void                               *y,
    void                               *reserveSpace,
    size_t                              reserveSpaceSizeInBytes)

This function performs forward dropout operation over x returning results in y. If dropout was used as a parameter to cudnnSetDropoutDescriptor(), the approximately dropout fraction of x values will be replaced by a 0, and the rest will be scaled by 1/(1-dropout). This function should not be running concurrently with another cudnnDropoutForward() function using the same states.

Note:
  • Better performance is obtained for fully packed tensors.
  • This function should not be called during inference.

Parameters

handle

Input. Handle to a previously created cuDNN context.

dropoutDesc

Input. Previously created dropout descriptor object.

xDesc

Input. Handle to a previously initialized tensor descriptor.

x

Input. Pointer to data of the tensor described by the xDesc descriptor.

yDesc

Input. Handle to a previously initialized tensor descriptor.

y

Output. Pointer to data of the tensor described by the yDesc descriptor.

reserveSpace

Output. Pointer to user-allocated GPU memory used by this function. It is expected that the contents of reserveSpace does not change between cudnnDropoutForward() and cudnnDropoutBackward() calls.

reserveSpaceSizeInBytes

Input. Specifies the size in bytes of the provided memory for the reserve space.

Returns

CUDNN_STATUS_SUCCESS

The call was successful.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:

  • The number of elements of input tensor and output tensors differ.
  • The datatype of the input tensor and output tensors differs.
  • The strides of the input tensor and output tensors differ and in-place operation is used (meaning, x and y pointers are equal).
  • The provided reserveSpaceSizeInBytes is less than the value returned by cudnnDropoutGetReserveSpaceSize().
  • cudnnSetDropoutDescriptor() has not been called on dropoutDesc with the non-NULLstates argument.
CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

3.66. cudnnDropoutGetReserveSpaceSize()

cudnnStatus_t cudnnDropoutGetReserveSpaceSize(
    cudnnTensorDescriptor_t     xDesc,
    size_t                     *sizeInBytes)

This function is used to query the amount of reserve needed to run dropout with the input dimensions given by xDesc. The same reserve space is expected to be passed to cudnnDropoutForward() and cudnnDropoutBackward(), and its contents is expected to remain unchanged between cudnnDropoutForward() and cudnnDropoutBackward() calls.

Parameters

xDesc

Input. Handle to a previously initialized tensor descriptor, describing input to a dropout operation.

sizeInBytes

Output. Amount of GPU memory needed as reserve space to be able to run dropout with an input tensor descriptor specified by xDesc.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

3.67. cudnnDropoutGetStatesSize()

cudnnStatus_t cudnnDropoutGetStatesSize(
    cudnnHandle_t       handle,
    size_t             *sizeInBytes)

This function is used to query the amount of space required to store the states of the random number generators used by cudnnDropoutForward() function.

Parameters

handle

Input. Handle to a previously created cuDNN context.

sizeInBytes

Output. Amount of GPU memory needed to store random generator states.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

3.68. cudnnFindConvolutionBackwardDataAlgorithm()

cudnnStatus_t cudnnFindConvolutionBackwardDataAlgorithm(
    cudnnHandle_t                          handle,
    const cudnnFilterDescriptor_t          wDesc,
    const cudnnTensorDescriptor_t          dyDesc,
    const cudnnConvolutionDescriptor_t     convDesc,
    const cudnnTensorDescriptor_t          dxDesc,
    const int                              requestedAlgoCount,
    int                                   *returnedAlgoCount,
    cudnnConvolutionBwdDataAlgoPerf_t     *perfResults)

This function attempts all algorithms available for cudnnConvolutionBackwardData(). It will attempt both the provided convDescmathType and CUDNN_DEFAULT_MATH (assuming the two differ).

Note: Algorithms without the CUDNN_TENSOR_OP_MATH availability will only be tried with CUDNN_DEFAULT_MATH, and returned as such.

Memory is allocated via cudaMalloc(). The performance metrics are returned in the user-allocated array of cudnnConvolutionBwdDataAlgoPerf_t. These metrics are written in a sorted fashion where the first element has the lowest compute time. The total number of resulting algorithms can be queried through the API cudnnGetConvolutionBackwardDataAlgorithmMaxCount().

Note:
  • This function is host blocking.
  • It is recommended to run this function prior to allocating layer data; doing otherwise may needlessly inhibit some algorithm options due to resource usage.

Parameters

handle

Input. Handle to a previously created cuDNN context.

wDesc

Input. Handle to a previously initialized filter descriptor.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

convDesc

Input. Previously initialized convolution descriptor.

dxDesc

Input. Handle to the previously initialized output tensor descriptor.

requestedAlgoCount

Input. The maximum number of elements to be stored in perfResults.

returnedAlgoCount

Output. The number of output elements stored in perfResults.

perfResults

Output. A user-allocated array to store performance metrics sorted ascending by compute time.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:

  • handle is not allocated properly.
  • wDesc, dyDesc, or dxDesc is not allocated properly.
  • wDesc, dyDesc, or dxDesc has fewer than 1 dimension.
  • Either returnedCount or perfResults is nil.
  • requestedCount is less than 1.
CUDNN_STATUS_ALLOC_FAILED

This function was unable to allocate memory to store sample input, filters and output.

CUDNN_STATUS_INTERNAL_ERROR

At least one of the following conditions are met:

  • The function was unable to allocate necessary timing objects.
  • The function was unable to deallocate necessary timing objects.
  • The function was unable to deallocate sample input, filters and output.

3.69. cudnnFindConvolutionBackwardDataAlgorithmEx()

cudnnStatus_t cudnnFindConvolutionBackwardDataAlgorithmEx(
    cudnnHandle_t                          handle,
    const cudnnFilterDescriptor_t          wDesc,
    const void                            *w,
    const cudnnTensorDescriptor_t          dyDesc,
    const void                            *dy,
    const cudnnConvolutionDescriptor_t     convDesc,
    const cudnnTensorDescriptor_t          dxDesc,
    void                                  *dx,
    const int                              requestedAlgoCount,
    int                                   *returnedAlgoCount,
    cudnnConvolutionBwdDataAlgoPerf_t     *perfResults,
    void                                  *workSpace,
    size_t                                 workSpaceSizeInBytes)

This function attempts all algorithms available for cudnnConvolutionBackwardData(). It will attempt both the provided convDescmathType and CUDNN_DEFAULT_MATH (assuming the two differ).

Note: Algorithms without the CUDNN_TENSOR_OP_MATH availability will only be tried with CUDNN_DEFAULT_MATH, and returned as such.

Memory is allocated via cudaMalloc(). The performance metrics are returned in the user-allocated array of cudnnConvolutionBwdDataAlgoPerf_t. These metrics are written in a sorted fashion where the first element has the lowest compute time. The total number of resulting algorithms can be queried through the API cudnnGetConvolutionBackwardDataAlgorithmMaxCount().

Note: This function is host blocking.

Parameters

handle

Input. Handle to a previously created cuDNN context.

wDesc

Input. Handle to a previously initialized filter descriptor.

w

Input. Data pointer to GPU memory associated with the filter descriptor wDesc.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

dy

Input. Data pointer to GPU memory associated with the filter descriptor dyDesc.

convDesc

Input. Previously initialized convolution descriptor.

dxDesc

Input. Handle to the previously initialized output tensor descriptor.

dxDesc

Input/Output. Data pointer to GPU memory associated with the tensor descriptor dxDesc. The content of this tensor will be overwritten with arbitrary values.

requestedAlgoCount

Input. The maximum number of elements to be stored in perfResults.

returnedAlgoCount

Output. The number of output elements stored in perfResults.

perfResults

Output. A user-allocated array to store performance metrics sorted ascending by compute time.

workSpace

Input. Data pointer to GPU memory that is a necessary workspace for some algorithms. The size of this workspace will determine the availability of algorithms. A nil pointer is considered a workSpace of 0 bytes.

workSpaceSizeInBytes

Input. Specifies the size in bytes of the provided workSpace.

Returns

CUDNN_STATUS_SUCCESS

The query was successful.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:

  • handle is not allocated properly.
  • wDesc, dyDesc, or dxDesc is not allocated properly.
  • wDesc, dyDesc, or dxDesc has fewer than 1 dimension.
  • w, dy, or dx is nil.
  • Either returnedCount or perfResults is nil.
  • requestedCount is less than 1.
CUDNN_STATUS_INTERNAL_ERROR

At least one of the following conditions are met:

  • The function was unable to allocate necessary timing objects.
  • The function was unable to deallocate necessary timing objects.
  • The function was unable to deallocate sample input, filters and output.