Abstract
This is the API Reference documentation for the NVIDIA cuDNN version 8.9.2 library. This API Reference lists the datatyes and functions per library. Specifically, this reference consists of a cuDNN datatype reference section that describes the types of enums and a cuDNN API reference section that describes all routines in the cuDNN library API. The cuDNN API is a contextbased API that allows for easy multithreading and (optional) interoperability with CUDA streams.
NVIDIA^{®} CUDA^{®} Deep Neural Network (cuDNN) library offers a contextbased API that allows for easy multithreading and (optional) interoperability with CUDA streams. This API Reference lists the datatyes and functions per library. Specifically, this reference consists of a cuDNN datatype reference section that describes the types of enums and a cuDNN API reference section that describes all routines in the cuDNN library API. The cuDNN library as well as this API document has been split into the following libraries:

cudnn_ops_infer
 This entity contains the routines related to cuDNN context creation and destruction, tensor descriptor management, tensor utility routines, and the inference portion of common machine learning algorithms such as batch normalization, softmax, dropout, and so on.

cudnn_ops_train

This entity contains common training routines and algorithms, such as batch normalization, softmax, dropout, and so on. The
cudnn_ops_train
library depends oncudnn_ops_infer
. 
cudnn_cnn_infer

This entity contains all routines related to convolutional neural networks needed at inference time. The
cudnn_cnn_infer
library depends oncudnn_ops_infer
. 
cudnn_cnn_train

This entity contains all routines related to convolutional neural networks needed during training time. The
cudnn_cnn_train
library depends oncudnn_ops_infer
,cudnn_ops_train
, andcudnn_cnn_infer
. 
cudnn_adv_infer

This entity contains all other features and algorithms. This includes RNNs, CTC loss, and multihead attention. The
cudnn_adv_infer
library depends oncudnn_ops_infer
. 
cudnn_adv_train

This entity contains all the training counterparts of
cudnn_adv_infer
. Thecudnn_adv_train
library depends oncudnn_ops_infer
,cudnn_ops_train
, andcudnn_adv_infer
. 
cudnnBackend*
 Introduced in cuDNN version 8.x, this entity contains a list of valid cuDNN backend descriptor types, a list of valid attributes, a subset of valid attribute values, and a full description of each backend descriptor type and their attributes.

cudnn
 This is an optional shim layer between the application layer and the cuDNN code. This layer opportunistically opens the correct library for the API at runtime.
2.1. API Changes for cuDNN 8.7.0
The following tables show which API functions were added, deprecated, and removed for the cuDNN 8.7.0.
Backend descriptor types 

cudnnRngDistribution_t 
CUDNN_BACKEND_OPERATION_RNG_DESCRIPTOR 
CUDNN_BACKEND_RNG_DESCRIPTOR 
2.2. API Changes for cuDNN 8.5.0
The following tables show which API functions were added, deprecated, and removed for the cuDNN 8.5.0.
Backend descriptor types 

cudnnBackendNormFwdPhase_t 
cudnnBackendNormMode_t 
CUDNN_BACKEND_OPERATION_CONCAT_DESCRIPTOR 
CUDNN_BACKEND_OPERATION_NORM_BACKWARD_DESCRIPTOR 
CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR 
CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR 
cudnnFraction_t 
cudnnSignalMode_t 
2.3. API Changes for cuDNN 8.4.0
The following tables show which API functions were added, deprecated, and removed for the cuDNN 8.4.0.
Backend descriptor types 

cudnnBackendBehaviorNote_t 
CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR 
CUDNN_BACKEND_POINTWISE_DESCRIPTOR 
CUDNN_BACKEND_REDUCTION_DESCRIPTOR 
cudnnBackendTensorReordering_t 
cudnnBnFinalizeStatsMode_t 
cudnnPaddingMode_t 
cudnnResampleMode_t 
2.4. API Changes for cuDNN 8.3.0
The following tables show which API functions were added, deprecated, and removed for the cuDNN 8.3.0.
Backend descriptor types 

CUDNN_BACKEND_OPERATION_RESAMPLE_BWD_DESCRIPTOR 
CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR 
CUDNN_BACKEND_RESAMPLE_DESCRIPTOR 
2.5. API Changes for cuDNN 8.2.0
The following tables show which API functions were added, deprecated, and removed for the cuDNN 8.2.0.
New functions 

cudnnGetActivationDescriptorSwishBeta() 
cudnnSetActivationDescriptorSwishBeta() 
2.6. API Changes for cuDNN 8.1.0
The following tables show which API functions were added, deprecated, and removed for the cuDNN 8.1.0.
Backend descriptor types 

CUDNN_BACKEND_MATMUL_DESCRIPTOR 
CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR 
2.7. API Changes for cuDNN 8.0.3
The following tables show which API functions were added, deprecated, and removed for the cuDNN 8.0.3.
2.8. API Changes for cuDNN 8.0.2
The following tables show which API functions were added, deprecated, and removed for the cuDNN 8.0.2.
New functions and data types 

cudnnRNNBackwardData_v8() 
cudnnRNNBackwardWeights_v8() 
2.9. API Changes for cuDNN 8.0.0 Preview
The following tables show which API functions were added, deprecated, and removed for the cuDNN 8.0.0 Preview Release.
For our deprecation policy, refer to the Backward Compatibility And Deprecation Policy.
Deprecated functions and data types  Replaced with 

cudnnCopyAlgorithmDescriptor() 

cudnnCreateAlgorithmDescriptor() 

cudnnCreatePersistentRNNPlan() 
cudnnBuildRNNDynamic() 
cudnnDestroyAlgorithmDescriptor() 

cudnnDestroyPersistentRNNPlan() 

cudnnFindRNNBackwardDataAlgorithmEx() 

cudnnFindRNNBackwardWeightsAlgorithmEx() 

cudnnFindRNNForwardInferenceAlgorithmEx() 

cudnnFindRNNForwardTrainingAlgorithmEx() 

cudnnGetAlgorithmDescriptor() 

cudnnGetAlgorithmPerformance() 

cudnnGetAlgorithmSpaceSize() 

cudnnGetRNNBackwardDataAlgorithmMaxCount() 

cudnnGetRNNBackwardWeightsAlgorithmMaxCount() 


cudnnGetRNNDescriptor_v8() 
cudnnGetRNNForwardInferenceAlgorithmMaxCount() 

cudnnGetRNNForwardTrainingAlgorithmMaxCount() 


cudnnGetRNNWeightParams() 
cudnnGetRNNParamsSize() 
cudnnGetRNNWeightSpaceSize() 

cudnnGetRNNTempSpaceSizes() 
cudnnPersistentRNNPlan_t 

cudnnRestoreAlgorithm() 


cudnnRNNBackwardData_v8() 

cudnnRNNBackwardWeights_v8() 

cudnnRNNForward() 
cudnnRNNGetClip() 
cudnnRNNGetClip_v8() 
cudnnRNNSetClip() 
cudnnRNNSetClip_v8() 
cudnnSaveAlgorithm() 

cudnnSetAlgorithmDescriptor() 

cudnnSetAlgorithmPerformance() 

cudnnSetPersistentRNNPlan() 

cudnnSetRNNAlgorithmDescriptor() 


cudnnSetRNNDescriptor_v8() 
Removed functions and data types 

cudnnConvolutionBwdDataPreference_t 
cudnnConvolutionBwdFilterPreference_t 
cudnnConvolutionFwdPreference_t 
cudnnGetConvolutionBackwardDataAlgorithm() 
cudnnGetConvolutionBackwardFilterAlgorithm() 
cudnnGetConvolutionForwardAlgorithm() 
cudnnGetRNNDescriptor() 
cudnnSetRNNDescriptor() 
This entity contains the routines related to cuDNN context creation and destruction, tensor descriptor management, tensor utility routines, and the inference portion of common machine learning algorithms such as batch normalization, softmax, dropout, and so on.
3.1. Data Type References
These are the data type references in the cudnn_ops_infer.so
library.
3.1.1. Pointer To Opaque Struct Types
These are the pointers to the opaque struct types in the cudnn_ops_infer.so
library.
3.1.1.1. cudnnActivationDescriptor_t
cudnnActivationDescriptor_t
is a pointer to an opaque structure holding the description of an activation operation. cudnnCreateActivationDescriptor()
is used to create one instance, and cudnnSetActivationDescriptor()
must be used to initialize this instance.
3.1.1.2. cudnnCTCLossDescriptor_t
cudnnCTCLossDescriptor_t
is a pointer to an opaque structure holding the description of a CTC loss operation. cudnnCreateCTCLossDescriptor()
is used to create one instance, cudnnSetCTCLossDescriptor()
is used to initialize this instance, and cudnnDestroyCTCLossDescriptor()
is used to destroy this instance.
3.1.1.3. cudnnDropoutDescriptor_t
cudnnDropoutDescriptor_t
is a pointer to an opaque structure holding the description of a dropout operation. cudnnCreateDropoutDescriptor()
is used to create one instance, cudnnSetDropoutDescriptor()
is used to initialize this instance, cudnnDestroyDropoutDescriptor()
is used to destroy this instance, cudnnGetDropoutDescriptor()
is used to query fields of a previously initialized instance, cudnnRestoreDropoutDescriptor()
is used to restore an instance to a previously saved off state.
3.1.1.4. cudnnFilterDescriptor_t
cudnnFilterDescriptor_t
is a pointer to an opaque structure holding the description of a filter dataset. cudnnCreateFilterDescriptor()
is used to create one instance, and cudnnSetFilter4dDescriptor()
or cudnnSetFilterNdDescriptor()
must be used to initialize this instance.
3.1.1.5. cudnnHandle_t
cudnnHandle_t
is a pointer to an opaque structure holding the cuDNN library context. The cuDNN library context must be created using cudnnCreate()
and the returned handle must be passed to all subsequent library function calls. The context should be destroyed at the end using cudnnDestroy()
. The context is associated with only one GPU device, the current device at the time of the call to cudnnCreate()
. However, multiple contexts can be created on the same GPU device.
3.1.1.6. cudnnLRNDescriptor_t
cudnnLRNDescriptor_t
is a pointer to an opaque structure holding the parameters of a local response normalization. cudnnCreateLRNDescriptor()
is used to create one instance, and the routine cudnnSetLRNDescriptor()
must be used to initialize this instance.
3.1.1.7. cudnnOpTensorDescriptor_t
cudnnOpTensorDescriptor_t
is a pointer to an opaque structure holding the description of a Tensor Core operation, used as a parameter to cudnnOpTensor()
. cudnnCreateOpTensorDescriptor()
is used to create one instance, and cudnnSetOpTensorDescriptor()
must be used to initialize this instance.
3.1.1.8. cudnnPoolingDescriptor_t
cudnnPoolingDescriptor_t
is a pointer to an opaque structure holding the description of a pooling operation. cudnnCreatePoolingDescriptor()
is used to create one instance, and cudnnSetPoolingNdDescriptor()
or cudnnSetPooling2dDescriptor()
must be used to initialize this instance.
3.1.1.9. cudnnReduceTensorDescriptor_t
cudnnReduceTensorDescriptor_t
is a pointer to an opaque structure holding the description of a tensor reduction operation, used as a parameter to cudnnReduceTensor()
. cudnnCreateReduceTensorDescriptor()
is used to create one instance, and cudnnSetReduceTensorDescriptor()
must be used to initialize this instance.
3.1.1.10. cudnnSpatialTransformerDescriptor_t
cudnnSpatialTransformerDescriptor_t
is a pointer to an opaque structure holding the description of a spatial transformation operation. cudnnCreateSpatialTransformerDescriptor()
is used to create one instance, cudnnSetSpatialTransformerNdDescriptor()
is used to initialize this instance, and cudnnDestroySpatialTransformerDescriptor()
is used to destroy this instance.
3.1.1.11. cudnnTensorDescriptor_t
cudnnTensorDescriptor_t
is a pointer to an opaque structure holding the description of a generic nD dataset. cudnnCreateTensorDescriptor()
is used to create one instance, and one of the routines cudnnSetTensorNdDescriptor()
, cudnnSetTensor4dDescriptor()
or cudnnSetTensor4dDescriptorEx()
must be used to initialize this instance.
3.1.1.12. cudnnTensorTransformDescriptor_t
cudnnTensorTransformDescriptor_t
is an opaque structure containing the description of the tensor transform. Use the cudnnCreateTensorTransformDescriptor()
function to create an instance of this descriptor, and cudnnDestroyTensorTransformDescriptor()
function to destroy a previously created instance.
3.1.2. Enumeration Types
These are the enumeration types in the cudnn_ops_infer.so
library.
3.1.2.1. cudnnActivationMode_t
cudnnActivationMode_t
is an enumerated type used to select the neuron activation function used in cudnnActivationForward()
, cudnnActivationBackward()
, and cudnnConvolutionBiasActivationForward()
.
Values
 Selects the sigmoid function.
 Selects the rectified linear function.
 Selects the hyperbolic tangent function.
 Selects the clipped rectified linear function.
 Selects the exponential linear function.

Selects the identity function, intended for bypassing the activation step in
cudnnConvolutionBiasActivationForward()
. (ThecudnnConvolutionBiasActivationForward()
function must useCUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
.) Does not work withcudnnActivationForward()
orcudnnActivationBackward()
.  Selects the swish function.
3.1.2.2. cudnnAlgorithm_t
This function has been deprecated in cuDNN 8.0.
3.1.2.3. cudnnBatchNormMode_t
cudnnBatchNormMode_t
is an enumerated type used to specify the mode of operation in cudnnBatchNormalizationForwardInference()
, cudnnBatchNormalizationForwardTraining()
, cudnnBatchNormalizationBackward()
and cudnnDeriveBNTensorDescriptor()
routines.
Values

Normalization is performed peractivation. This mode is intended to be used after the nonconvolutional network layers. In this mode, the tensor dimensions of
bnBias
andbnScale
and the parameters used in thecudnnBatchNormalization*
functions are 1xCxHxW. 
Normalization is performed over N+spatial dimensions. This mode is intended for use after convolutional layers (where spatial invariance is desired). In this mode the
bnBias
andbnScale
tensor dimensions are 1xCx1x1. 
This mode is similar to
CUDNN_BATCHNORM_SPATIAL
but it can be faster for some tasks.An optimized path may be selected for
CUDNN_DATA_FLOAT
andCUDNN_DATA_HALF
types, compute capability 6.0 or higher for the following two batch normalization API calls:cudnnBatchNormalizationForwardTraining()
andcudnnBatchNormalizationBackward()
. In the case ofcudnnBatchNormalizationBackward()
, thesavedMean
andsavedInvVariance
arguments should not beNULL
.The rest of this section applies to
NCHW
mode only: This mode may use a scaled atomic integer reduction that is deterministic but imposes more restrictions on the input data range. When a numerical overflow occurs, the algorithm may produce NaNs or Infs (infinity) in output buffers.When Infs/NaNs are present in the input data, the output in this mode is the same as from a pure floatingpoint implementation.
For finite but very large input values, the algorithm may encounter overflows more frequently due to a lower dynamic range and emit Infs/NaNs while
CUDNN_BATCHNORM_SPATIAL
will produce finite results. The user can invokecudnnQueryRuntimeError()
to check if a numerical overflow occurred in this mode.
3.1.2.4. cudnnBatchNormOps_t
cudnnBatchNormOps_t
is an enumerated type used to specify the mode of operation in cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize()
, cudnnBatchNormalizationForwardTrainingEx()
, cudnnGetBatchNormalizationBackwardExWorkspaceSize()
, cudnnBatchNormalizationBackwardEx()
, and cudnnGetBatchNormalizationTrainingExReserveSpaceSize()
functions.
Values
 Only batch normalization is performed, peractivation.
 First, the batch normalization is performed, and then the activation is performed.
 Performs the batch normalization, then elementwise addition, followed by the activation operation.
3.1.2.5. cudnnCTCLossAlgo_t
cudnnCTCLossAlgo_t
is an enumerated type that exposes the different algorithms available to execute the CTC loss operation.
Values
 Results are guaranteed to be reproducible.
 Results are not guaranteed to be reproducible.
3.1.2.6. cudnnDataType_t
cudnnDataType_t
is an enumerated type indicating the data type to which a tensor descriptor or filter descriptor refers.
Values

The data is a 32bit singleprecision floatingpoint (
float
). 
The data is a 64bit doubleprecision floatingpoint (
double
).  The data is a 16bit floatingpoint.
 The data is an 8bit signed integer.
 The data is a 32bit signed integer.

The data is 32bit elements each composed of 4 8bit signed integers. This data type is only supported with the tensor format
CUDNN_TENSOR_NCHW_VECT_C
.  The data is an 8bit unsigned integer.

The data is 32bit elements each composed of 4 8bit unsigned integers. This data type is only supported with the tensor format
CUDNN_TENSOR_NCHW_VECT_C
. 
The data is 32element vectors, each element being an 8bit signed integer. This data type is only supported with the tensor format
CUDNN_TENSOR_NCHW_VECT_C
. Moreover, this data type can only be used withalgo 1
, meaning,CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
. For more information, refer tocudnnConvolutionFwdAlgo_t
.  The data is a 16bit quantity, with 7 mantissa bits, 8 exponent bits, and 1 sign bit.
 The data is a 64bit signed integer.

The data is a boolean (
bool
).Note that for type
CUDNN_TYPE_BOOLEAN
, elements are expected to be "packed": that is, one byte contains 8 elements of typeCUDNN_TYPE_BOOLEAN
. Further, within each byte, elements are indexed from the least significant bit to the most significant bit. For example, a 1 dimensional tensor of 8 elements containing 01001111 has value 1 for elements 0 through 3, 0 for elements 4 and 5, 1 for element 6 and 0 for element 7.Tensors with more than 8 elements simply use more bytes, where the order is also from least significant to most significant byte. Note, CUDA is littleendian, meaning that the least significant byte has the lower memory address address. For example, in the case of 16 elements, 01001111 11111100 has value 1 for elements 0 through 3, 0 for elements 4 and 5, 1 for element 6 and 0 for element 7, value 0 for elements 8 and 9, 1 for elements 10 through 15.
 The data is an 8bit quantity, with 3 mantissa bits, 4 exponent bits, and 1 sign bit.
 The data is an 8bit quantity, with 2 mantissa bits, 5 exponent bits, and 1 sign bit.

The data type is a higher throughput but lower precision compute type (compared to
CUDNN_DATA_FLOAT
) used for FP8 tensor core operations
3.1.2.7. cudnnDeterminism_t
cudnnDeterminism_t
is an enumerated type used to indicate if the computed results are deterministic (reproducible). For more information, refer to Reproducibility (Determinism).
Values
 Results are not guaranteed to be reproducible.
 Results are guaranteed to be reproducible.
3.1.2.8. cudnnDivNormMode_t
cudnnDivNormMode_t
is an enumerated type used to specify the mode of operation in cudnnDivisiveNormalizationForward()
and cudnnDivisiveNormalizationBackward()
.
Values

The means tensor data pointer is expected to contain means or other kernel convolution values precomputed by the user. The means pointer can also be
NULL
, in that case, it's considered to be filled with zeroes. This is equivalent to spatial LRN.Note:In the backward pass, the means are treated as independent inputs and the gradient over means is computed independently. In this mode, to yield a net gradient over the entire LCN computational graph, the
destDiffMeans
result should be backpropagated through the user's means layer (which can be implemented using average pooling) and added to thedestDiffData
tensor produced bycudnnDivisiveNormalizationBackward()
.
3.1.2.9. cudnnErrQueryMode_t
cudnnErrQueryMode_t
is an enumerated type passed to cudnnQueryRuntimeError()
to select the remote kernel error query mode.
Values
 Read the error storage location regardless of the kernel completion status.
 Report if all tasks in the user stream of the cuDNN handle were completed. If that is the case, report the remote kernel error code.
 Wait for all tasks to complete in the user stream before reporting the remote kernel error code.
3.1.2.10. cudnnFoldingDirection_t
cudnnFoldingDirection_t
is an enumerated type used to select the folding direction. For more information, refer to cudnnTensorTransformDescriptor_t
.
Data Member
 Selects folding.
 Selects unfolding.
3.1.2.11. cudnnIndicesType_t
cudnnIndicesType_t
is an enumerated type used to indicate the data type for the indices to be computed by the cudnnReduceTensor()
routine. This enumerated type is used as a field for the cudnnReduceTensorDescriptor_t
descriptor.
Values
 Compute unsigned int indices.
 Compute unsigned long indices.
 Compute unsigned short indices.
 Compute unsigned char indices.
3.1.2.12. cudnnLRNMode_t
cudnnLRNMode_t
is an enumerated type used to specify the mode of operation in cudnnLRNCrossChannelForward()
and cudnnLRNCrossChannelBackward()
.
Values

LRN computation is performed across the tensor's dimension
dimA[1]
.
3.1.2.13. cudnnMathType_t
cudnnMathType_t
is an enumerated type used to indicate if the use of Tensor Core operations is permitted in a given library routine.
Values
 Tensor Core operations are not used on preNVIDIA A100 GPU devices. On A100 GPU architecture devices, Tensor Core TF32 operation is permitted.
 The use of Tensor Core operations is permitted but will not actively perform datatype down conversion on tensors in order to utilize Tensor Cores.
 The use of Tensor Core operations is permitted and will actively perform datatype down conversion on tensors in order to utilize Tensor Cores.
 Restricted to only kernels that use FMA instructions.
On preNVIDIA A100 GPU devices, CUDNN_DEFAULT_MATH
and CUDNN_FMA_MATH
have the same behavior: Tensor Core kernels will not be selected. With NVIDIA Ampere architecture and CUDA toolkit 11, CUDNN_DEFAULT_MATH
permits TF32 Tensor Core operation and CUDNN_FMA_MATH
does not. The TF32 behavior for CUDNN_DEFAULT_MATH
and the other Tensor Core math types can be explicitly disabled by the environment variable NVIDIA_TF32_OVERRIDE=0
.
3.1.2.14. cudnnNanPropagation_t
cudnnNanPropagation_t
is an enumerated type used to indicate if a given routine should propagate Nan
numbers. This enumerated type is used as a field for the cudnnActivationDescriptor_t
descriptor and cudnnPoolingDescriptor_t
descriptor.
Values

Nan
numbers are not propagated. 
Nan
numbers are propagated.
3.1.2.15. cudnnNormAlgo_t
cudnnNormAlgo_t
is an enumerated type used to specify the algorithm to execute the normalization operation.
Values
 Standard normalization is performed.

This mode is similar to
CUDNN_NORM_ALGO_STANDARD
, however it only supportsCUDNN_NORM_PER_CHANNEL
and can be faster for some tasks.An optimized path may be selected for
CUDNN_DATA_FLOAT
andCUDNN_DATA_HALF
types, compute capability 6.0 or higher for the following two normalization API calls:cudnnNormalizationForwardTraining()
andcudnnNormalizationBackward()
. In the case ofcudnnNormalizationBackward()
, thesavedMean
andsavedInvVariance
arguments should not beNULL
.The rest of this section applies to NCHW mode only: This mode may use a scaled atomic integer reduction that is deterministic but imposes more restrictions on the input data range. When a numerical overflow occurs, the algorithm may produce NaNs or Infs (infinity) in output buffers.
When Infs/NaNs are present in the input data, the output in this mode is the same as from a pure floatingpoint implementation.
For finite but very large input values, the algorithm may encounter overflows more frequently due to a lower dynamic range and emit Infs/NaNs while
CUDNN_NORM_ALGO_STANDARD
will produce finite results. The user can invokecudnnQueryRuntimeError()
to check if a numerical overflow occurred in this mode.
3.1.2.16. cudnnNormMode_t
cudnnNormMode_t
is an enumerated type used to specify the mode of operation in cudnnNormalizationForwardInference()
, cudnnNormalizationForwardTraining()
, cudnnBatchNormalizationBackward()
, cudnnGetNormalizationForwardTrainingWorkspaceSize()
, cudnnGetNormalizationBackwardWorkspaceSize()
, and cudnnGetNormalizationTrainingReserveSpaceSize()
routines.
Values

Normalization is performed peractivation. This mode is intended to be used after the nonconvolutional network layers. In this mode, the tensor dimensions of
normBias
andnormScale
and the parameters used in thecudnnNormalization*
functions are 1xCxHxW. 
Normalization is performed perchannel over N+spatial dimensions. This mode is intended for use after convolutional layers (where spatial invariance is desired). In this mode, the
normBias
andnormScale
tensor dimensions are 1xCx1x1.
3.1.2.17. cudnnNormOps_t
cudnnNormOps_t
is an enumerated type used to specify the mode of operation in cudnnGetNormalizationForwardTrainingWorkspaceSize()
, cudnnNormalizationForwardTraining()
, cudnnGetNormalizationBackwardWorkspaceSize()
, cudnnNormalizationBackward()
, and cudnnGetNormalizationTrainingReserveSpaceSize()
functions.
Values
 Only normalization is performed.
 First, the normalization is performed, then the activation is performed.
 Performs the normalization, then elementwise addition, followed by the activation operation.
3.1.2.18. cudnnOpTensorOp_t
cudnnOpTensorOp_t
is an enumerated type used to indicate the Tensor Core operation to be used by the cudnnOpTensor()
routine. This enumerated type is used as a field for the cudnnOpTensorDescriptor_t
descriptor.
Values
 The operation to be performed is addition.
 The operation to be performed is multiplication.
 The operation to be performed is a minimum comparison.
 The operation to be performed is a maximum comparison.

The operation to be performed is square root, performed on only the
A
tensor. 
The operation to be performed is negation, performed on only the
A
tensor.
3.1.2.19. cudnnPoolingMode_t
cudnnPoolingMode_t
is an enumerated type passed to cudnnSetPooling2dDescriptor()
to select the pooling method to be used by cudnnPoolingForward()
and cudnnPoolingBackward()
.
Values
 The maximum value inside the pooling window is used.
 Values inside the pooling window are averaged. The number of elements used to calculate the average includes spatial locations falling in the padding region.
 Values inside the pooling window are averaged. The number of elements used to calculate the average excludes spatial locations falling in the padding region.
 The maximum value inside the pooling window is used. The algorithm used is deterministic.
3.1.2.20. cudnnReduceTensorIndices_t
cudnnReduceTensorIndices_t
is an enumerated type used to indicate whether indices are to be computed by the cudnnReduceTensor()
routine. This enumerated type is used as a field for the cudnnReduceTensorDescriptor_t
descriptor.
Values
 Do not compute indices.
 Compute indices. The resulting indices are relative, and flattened.
3.1.2.21. cudnnReduceTensorOp_t
cudnnReduceTensorOp_t
is an enumerated type used to indicate the Tensor Core operation to be used by the cudnnReduceTensor()
routine. This enumerated type is used as a field for the cudnnReduceTensorDescriptor_t
descriptor.
Values
 The operation to be performed is addition.
 The operation to be performed is multiplication.
 The operation to be performed is a minimum comparison.
 The operation to be performed is a maximum comparison.
 The operation to be performed is a maximum comparison of absolute values.
 The operation to be performed is averaging.
 The operation to be performed is addition of absolute values.
 The operation to be performed is a square root of the sum of squares.
 The operation to be performed is multiplication, not including elements of value zero.
3.1.2.22. cudnnRNNAlgo_t
cudnnRNNAlgo_t
is an enumerated type used to specify the algorithm used in the cudnnRNNForwardInference()
, cudnnRNNForwardTraining()
, cudnnRNNBackwardData()
and cudnnRNNBackwardWeights()
routines.
Values
 Each RNN layer is executed as a sequence of operations. This algorithm is expected to have robust performance across a wide range of network parameters.

The recurrent parts of the network are executed using a persistent kernel approach. This method is expected to be fast when the first dimension of the input tensor is small (meaning, a small minibatch).
CUDNN_RNN_ALGO_PERSIST_STATIC
is only supported on devices with compute capability >= 6.0. 
The recurrent parts of the network are executed using a persistent kernel approach. This method is expected to be fast when the first dimension of the input tensor is small (meaning, a small minibatch). When using
CUDNN_RNN_ALGO_PERSIST_DYNAMIC
persistent kernels are prepared at runtime and are able to optimize using the specific parameters of the network and active GPU. As such, when usingCUDNN_RNN_ALGO_PERSIST_DYNAMIC
a onetime plan preparation stage must be executed. These plans can then be reused in repeated calls with the same model parameters.The limits on the maximum number of hidden units supported when using
CUDNN_RNN_ALGO_PERSIST_DYNAMIC
are significantly higher than the limits when usingCUDNN_RNN_ALGO_PERSIST_STATIC
, however throughput is likely to significantly reduce when exceeding the maximums supported byCUDNN_RNN_ALGO_PERSIST_STATIC
. In this regime, this method will still outperformCUDNN_RNN_ALGO_STANDARD
for some cases.CUDNN_RNN_ALGO_PERSIST_DYNAMIC
is only supported on devices with compute capability >= 6.0 on Linux machines.
3.1.2.23. cudnnSamplerType_t
cudnnSamplerType_t
is an enumerated type passed to cudnnSetSpatialTransformerNdDescriptor()
to select the sampler type to be used by cudnnSpatialTfSamplerForward()
and cudnnSpatialTfSamplerBackward()
.
Values
 Selects the bilinear sampler.
3.1.2.24. cudnnSeverity_t
cudnnSeverity_t
is an enumerated type passed to the customized callback function for logging that users may set. This enumerate describes the severity level of the item, so the customized logging call back may react differently.
Values
 This value indicates a fatal error emitted by cuDNN.
 This value indicates a normal error emitted by cuDNN.
 This value indicates a warning emitted by cuDNN.
 This value indicates a piece of information (for example, API log) emitted by cuDNN.
3.1.2.25. cudnnSoftmaxAlgorithm_t
cudnnSoftmaxAlgorithm_t
is used to select an implementation of the softmax function used in cudnnSoftmaxForward()
and cudnnSoftmaxBackward()
.
Values
 This implementation applies the straightforward softmax operation.
 This implementation scales each point of the softmax input domain by its maximum value to avoid potential floating point overflows in the softmax evaluation.

This entry performs the log softmax operation, avoiding overflows by scaling each point in the input domain as in
CUDNN_SOFTMAX_ACCURATE
.
3.1.2.26. cudnnSoftmaxMode_t
cudnnSoftmaxMode_t
is used to select over which data the cudnnSoftmaxForward()
and cudnnSoftmaxBackward()
are computing their results.
Values

The softmax operation is computed per image (
N
) across the dimensionsC,H,W
. 
The softmax operation is computed per spatial location (
H,W
) per image (N
) across dimensionC
.
3.1.2.27. cudnnStatus_t
cudnnStatus_t
is an enumerated type used for function status returns. All cuDNN library functions return their status, which can be one of the following values:
Values
 The operation was completed successfully.

The cuDNN library was not initialized properly. This error is usually returned when a call to
cudnnCreate()
fails or whencudnnCreate()
has not been called prior to calling another cuDNN routine. In the former case, it is usually due to an error in the CUDA Runtime API called bycudnnCreate()
or by an error in the hardware setup. 
Resource allocation failed inside the cuDNN library. This is usually caused by an internal
cudaMalloc()
failure.To correct, prior to the function call, deallocate previously allocated memory as much as possible.

An incorrect value or parameter was passed to the function.
To correct, ensure that all the parameters being passed have valid values.

The function requires a feature absent from the current GPU device. Note that cuDNN only supports devices with compute capabilities greater than or equal to 3.0.
To correct, compile and run the application on a device with appropriate compute capability.

An access to GPU memory space failed, which is usually caused by a failure to bind a texture.
To correct, prior to the function call, unbind any previously bound textures.
Otherwise, this may indicate an internal error/bug in the library.

The GPU program failed to execute. This is usually caused by a failure to launch some cuDNN kernel on the GPU, which can occur for multiple reasons.
To correct, check that the hardware, an appropriate version of the driver, and the cuDNN library are correctly installed.
Otherwise, this may indicate an internal error/bug in the library.
 An internal cuDNN operation failed.
 The functionality requested is not presently supported by cuDNN.

The functionality requested requires some license and an error was detected when trying to check the current licensing. This error can happen if the license is not present or is expired or if the environment variable
NVIDIA_LICENSE_FILE
is not set properly. 
A runtime library required by cuDNN cannot be found in the predefined search paths. These libraries are
libcuda.so
(nvcuda.dll
) andlibnvrtc.so
(nvrtc64_<Major Release Version><Minor Release Version>_0.dll
andnvrtcbuiltins64_<Major Release Version><Minor Release Version>.dll
).  Some tasks in the user stream are not completed.
 Numerical overflow occurred during the GPU kernel execution.
3.1.2.28. cudnnTensorFormat_t
cudnnTensorFormat_t
is an enumerated type used by cudnnSetTensor4dDescriptor()
to create a tensor with a predefined layout. For a detailed explanation of how these tensors are arranged in memory, refer to Data Layout Formats.
Values
 This tensor format specifies that the data is laid out in the following order: batch size, feature maps, rows, columns. The strides are implicitly defined in such a way that the data are contiguous in memory with no padding between images, feature maps, rows, and columns; the columns are the inner dimension and the images are the outermost dimension.
 This tensor format specifies that the data is laid out in the following order: batch size, rows, columns, feature maps. The strides are implicitly defined in such a way that the data are contiguous in memory with no padding between images, rows, columns, and feature maps; the feature maps are the inner dimension and the images are the outermost dimension.

This tensor format specifies that the data is laid out in the following order: batch size, feature maps, rows, columns. However, each element of the tensor is a vector of multiple feature maps. The length of the vector is carried by the data type of the tensor. The strides are implicitly defined in such a way that the data are contiguous in memory with no padding between images, feature maps, rows, and columns; the columns are the inner dimension and the images are the outermost dimension. This format is only supported with tensor data types
CUDNN_DATA_INT8x4
,CUDNN_DATA_INT8x32
, andCUDNN_DATA_UINT8x4
.The
CUDNN_TENSOR_NCHW_VECT_C
can also be interpreted in the following way: The NCHW INT8x32 format is really N x (C/32) x H x W x 32 (32 Cs for every W), just as the NCHW INT8x4 format is N x (C/4) x H x W x 4 (4 Cs for every W). Hence, theVECT_C
name  each W is a vector (4 or 32) of Cs.
3.2. API Functions
These are the API functions in the cudnn_ops_infer.so
library.
3.2.1. cudnnActivationForward()
This routine applies a specified neuron activation function elementwise over each input value.
cudnnStatus_t cudnnActivationForward(
cudnnHandle_t handle,
cudnnActivationDescriptor_t activationDesc,
const void *alpha,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const void *beta,
const cudnnTensorDescriptor_t yDesc,
void *y)
Inplace operation is allowed for this routine; meaning, xData
and yData
pointers may be equal. However, this requires xDesc
and yDesc
descriptors to be identical (particularly, the strides of the input and output must match for an inplace operation to be allowed).
All tensor formats are supported for 4 and 5 dimensions, however, the best performance is obtained when the strides of xDesc
and yDesc
are equal and HWpacked
. For more than 5 dimensions the tensors must have their spatial dimensions packed.
Parameters

Input. Handle to a previously created cuDNN context. For more information, refer to
cudnnHandle_t
. 
Input. Activation descriptor. For more information, refer to
cudnnActivationDescriptor_t
. 
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.

Input. Handle to the previously initialized input tensor descriptor. For more information, refer to
cudnnTensorDescriptor_t
. 
Input. Data pointer to GPU memory associated with the tensor descriptor
xDesc
.  Input. Handle to the previously initialized output tensor descriptor.

Output. Data pointer to GPU memory associated with the output tensor descriptor
yDesc
.
Returns

CUDNN_STATUS_SUCCESS
 The function launched successfully.

CUDNN_STATUS_NOT_SUPPORTED
 The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:
 The parameter
mode
has an invalid enumerant value.  The dimensions
n
,c
,h
, andw
of the input tensor and output tensor differ.  The
datatype
of the input tensor and output tensor differs.  The strides
nStride
,cStride
,hStride
, andwStride
of the input tensor and output tensor differ and inplace operation is used (meaning,x
andy
pointers are equal).
 The parameter

CUDNN_STATUS_EXECUTION_FAILED
 The function failed to launch on the GPU.
3.2.2. cudnnAddTensor()
This function adds the scaled values of a bias tensor to another tensor. Each dimension of the bias tensor A
must match the corresponding dimension of the destination tensor C
or must be equal to 1. In the latter case, the same value from the bias tensor for those dimensions will be used to blend into the C
tensor.
cudnnStatus_t cudnnAddTensor(
cudnnHandle_t handle,
const void *alpha,
const cudnnTensorDescriptor_t aDesc,
const void *A,
const void *beta,
const cudnnTensorDescriptor_t cDesc,
void *C)
Only 4D and 5D tensors are supported. Beyond these dimensions, this routine is not supported.
Parameters

Input. Handle to a previously created cuDNN context. For more information, refer to
cudnnHandle_t
. 
Input. Pointers to scaling factors (in host memory) used to blend the source value with the prior value in the destination tensor as follows:
dstValue = alpha[0]*srcValue + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.

Input. Handle to a previously initialized tensor descriptor. For more information, refer to
cudnnTensorDescriptor_t
. 
Input. Pointer to data of the tensor described by the
aDesc
descriptor.  Input. Handle to a previously initialized tensor descriptor.

Input/Output. Pointer to data of the tensor described by the
cDesc
descriptor.
Returns
 The function executed successfully.
 The function does not support the provided configuration.

The dimensions of the bias tensor refer to an amount of data that is incompatible with the output tensor dimensions or the
dataType
of the two tensor descriptors are different.  The function failed to launch on the GPU.
3.2.3. cudnnBatchNormalizationForwardInference()
This function performs the forward batch normalization layer computation for the inference phase. This layer is based on the Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift paper.
cudnnStatus_t cudnnBatchNormalizationForwardInference(
cudnnHandle_t handle,
cudnnBatchNormMode_t mode,
const void *alpha,
const void *beta,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const cudnnTensorDescriptor_t yDesc,
void *y,
const cudnnTensorDescriptor_t bnScaleBiasMeanVarDesc,
const void *bnScale,
const void *bnBias,
const void *estimatedMean,
const void *estimatedVariance,
double epsilon)
Only 4D and 5D tensors are supported. The input transformation performed by this function is defined as:
y = beta*y + alpha *[bnBias + (bnScale * (xestimatedMean)/sqrt(epsilon + estimatedVariance)]
For the training phase, refer to cudnnBatchNormalizationForwardTraining()
.
Higher performance can be obtained when HWpacked tensors are used for all of x
and dx
.
For more information, refer to cudnnDeriveBNTensorDescriptor()
for the secondary tensor descriptor generation for the parameters used in this function.
Parameters

Input. Handle to a previously created cuDNN library descriptor. For more information, refer to
cudnnHandle_t
. 
Input. Mode of operation (spatial or peractivation). For more information, refer to
cudnnBatchNormMode_t
. 
Inputs. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.
 Input. Handles to the previously initialized tensor descriptors.

Input. Data pointer to GPU memory associated with the tensor descriptor
xDesc
, for the layer’sx
input data. 
Input/Output. Data pointer to GPU memory associated with the tensor descriptor
yDesc
, for they
output of the batch normalization layer.  Inputs. Tensor descriptors and pointers in device memory for the batch normalization scale and bias parameters (in the Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift paper, bias is referred to as beta and scale as gamma).

Inputs. Mean and variance tensors (these have the same descriptor as the bias and scale). The
resultRunningMean
andresultRunningVariance
, accumulated during the training phase from thecudnnBatchNormalizationForwardTraining()
call, should be passed as inputs here. 
Input. Epsilon value used in the batch normalization formula. Its value should be equal to or greater than the value defined for
CUDNN_BN_MIN_EPSILON
incudnn.h
.
Supported configurations
This function supports the following combinations of data types for various descriptors.
Data Type Configurations  xDesc 
bnScaleBiasMeanVarDesc 
alpha , beta 
yDesc 

INT8_CONFIG 
CUDNN_DATA_INT8 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_INT8 
PSEUDO_HALF_CONFIG 
CUDNN_DATA_HALF 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_HALF 
FLOAT_CONFIG 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
DOUBLE_CONFIG 
CUDNN_DATA_DOUBLE 
CUDNN_DATA_DOUBLE 
CUDNN_DATA_DOUBLE 
CUDNN_DATA_DOUBLE 
BFLOAT16_CONFIG 
CUDNN_DATA_BFLOAT16 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_BFLOAT16 
Returns
 The computation was performed successfully.
 The function does not support the provided configuration.

At least one of the following conditions are met:
 One of the pointers
alpha
,beta
,x
,y
,bnScale
,bnBias
,estimatedMean
, andestimatedInvVariance
isNULL
.  The number of
xDesc
oryDesc
tensor descriptor dimensions is not within the range of[4,5]
(only 4D and 5D tensors are supported.) bnScaleBiasMeanVarDesc
dimensions are not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for peractivation mode.epsilon
value is less thanCUDNN_BN_MIN_EPSILON
. Dimensions or data types mismatch for
xDesc
,yDesc
.
 One of the pointers
3.2.4. cudnnCopyAlgorithmDescriptor()
This function has been deprecated in cuDNN 8.0.
3.2.5. cudnnCreate()
This function initializes the cuDNN library and creates a handle to an opaque structure holding the cuDNN library context. It allocates hardware resources on the host and device and must be called prior to making any other cuDNN library calls.
cudnnStatus_t cudnnCreate(cudnnHandle_t *handle)
The cuDNN library handle is tied to the current CUDA device (context). To use the library on multiple devices, one cuDNN handle needs to be created for each device.
For a given device, multiple cuDNN handles with different configurations (for example, different current CUDA streams) may be created. Because cudnnCreate()
allocates some internal resources, the release of those resources by calling cudnnDestroy()
will implicitly call cudaDeviceSynchronize()
; therefore, the recommended best practice is to call cudnnCreate/cudnnDestroy
outside of performancecritical code paths.
For multithreaded applications that use the same device from different threads, the recommended programming model is to create one (or a few, as is convenient) cuDNN handle(s) per thread and use that cuDNN handle for the entire life of the thread.
Parameters

Output. Pointer to pointer where to store the address to the allocated cuDNN handle. For more information, refer to
cudnnHandle_t
.
Returns

Invalid (
NULL
) input pointer supplied.  No compatible GPU found, CUDA driver not installed or disabled, CUDA runtime API initialization failed.
 NVIDIA GPU architecture is too old.
 Host memory allocation failed.
 CUDA resource allocation failed.
 cuDNN license validation failed (only when the feature is enabled).
 cuDNN handle was created successfully.
3.2.6. cudnnCreateActivationDescriptor()
This function creates an activation descriptor object by allocating the memory needed to hold its opaque structure. For more information, refer to cudnnActivationDescriptor_t
.
cudnnStatus_t cudnnCreateActivationDescriptor(
cudnnActivationDescriptor_t *activationDesc)
Returns
 The object was created successfully.
 The resources could not be allocated.
3.2.7. cudnnCreateAlgorithmDescriptor()
This function has been deprecated in cuDNN 8.0.
This function creates an algorithm descriptor object by allocating the memory needed to hold its opaque structure.
cudnnStatus_t cudnnCreateAlgorithmDescriptor(
cudnnAlgorithmDescriptor_t *algoDesc)
Returns
 The object was created successfully.
 The resources could not be allocated.
3.2.8. cudnnCreateAlgorithmPerformance()
This function creates multiple algorithm performance objects by allocating the memory needed to hold their opaque structures.
cudnnStatus_t cudnnCreateAlgorithmPerformance(
cudnnAlgorithmPerformance_t *algoPerf,
int numberToCreate)
Returns
 The object was created successfully.
 The resources could not be allocated.
3.2.9. cudnnCreateDropoutDescriptor()
This function creates a generic dropout descriptor object by allocating the memory needed to hold its opaque structure. For more information, refer to cudnnDropoutDescriptor_t
.
cudnnStatus_t cudnnCreateDropoutDescriptor(
cudnnDropoutDescriptor_t *dropoutDesc)
Returns
 The object was created successfully.
 The resources could not be allocated.
3.2.10. cudnnCreateFilterDescriptor()
This function creates a filter descriptor object by allocating the memory needed to hold its opaque structure. For more information, refer to cudnnFilterDescriptor_t
.
cudnnStatus_t cudnnCreateFilterDescriptor(
cudnnFilterDescriptor_t *filterDesc)
Returns
 The object was created successfully.
 The resources could not be allocated.
3.2.11. cudnnCreateLRNDescriptor()
This function allocates the memory needed to hold the data needed for LRN and DivisiveNormalization
layers operation and returns a descriptor used with subsequent layer forward and backward calls.
cudnnStatus_t cudnnCreateLRNDescriptor(
cudnnLRNDescriptor_t *poolingDesc)
Returns
 The object was created successfully.
 The resources could not be allocated.
cudnnCreateOpTensorDescriptor()
This function creates a tensor pointwise math descriptor. For more information, refer to cudnnOpTensorDescriptor_t
.
cudnnStatus_t cudnnCreateOpTensorDescriptor(
cudnnOpTensorDescriptor_t* opTensorDesc)
Parameters
 Output. Pointer to the structure holding the description of the tensor pointwise math such as add, multiply, and more.
Returns
 The function returned successfully.
 Tensor pointwise math descriptor passed to the function is invalid.
 Memory allocation for this tensor pointwise math descriptor failed.
3.2.13. cudnnCreatePoolingDescriptor()
This function creates a pooling descriptor object by allocating the memory needed to hold its opaque structure.
cudnnStatus_t cudnnCreatePoolingDescriptor(
cudnnPoolingDescriptor_t *poolingDesc)
Returns
 The object was created successfully.
 The resources could not be allocated.
3.2.14. cudnnCreateReduceTensorDescriptor()
This function creates a reduced tensor descriptor object by allocating the memory needed to hold its opaque structure.
cudnnStatus_t cudnnCreateReduceTensorDescriptor(
cudnnReduceTensorDescriptor_t* reduceTensorDesc)
Returns
 The object was created successfully.

reduceTensorDesc
is aNULL
pointer.  The resources could not be allocated.
3.2.15. cudnnCreateSpatialTransformerDescriptor()
This function creates a generic spatial transformer descriptor object by allocating the memory needed to hold its opaque structure.
cudnnStatus_t cudnnCreateSpatialTransformerDescriptor(
cudnnSpatialTransformerDescriptor_t *stDesc)
Returns
 The object was created successfully.
 The resources could not be allocated.
3.2.16. cudnnCreateTensorDescriptor()
This function creates a generic tensor descriptor object by allocating the memory needed to hold its opaque structure. The data is initialized to all zeros.
cudnnStatus_t cudnnCreateTensorDescriptor(
cudnnTensorDescriptor_t *tensorDesc)
Parameters
 Output. Pointer to pointer where the address to the allocated tensor descriptor object should be stored.
Returns
 Invalid input argument.
 The resources could not be allocated.
 The object was created successfully.
3.2.17. cudnnCreateTensorTransformDescriptor()
This function creates a tensor transform descriptor object by allocating the memory needed to hold its opaque structure. The tensor data is initialized to be all zero. Use the cudnnSetTensorTransformDescriptor()
function to initialize the descriptor created by this function.
cudnnStatus_t cudnnCreateTensorTransformDescriptor(
cudnnTensorTransformDescriptor_t *transformDesc);
Parameters
 Output. A pointer to an uninitialized tensor transform descriptor.
Returns
 The descriptor object was created successfully.

The
transformDesc
isNULL
.  The memory allocation failed.
3.2.18. cudnnDeriveBNTensorDescriptor()
This function derives a secondary tensor descriptor for the batch normalization scale
, invVariance
, bnBias
, and bnScale
subtensors from the layer's x
data descriptor.
cudnnStatus_t cudnnDeriveBNTensorDescriptor(
cudnnTensorDescriptor_t derivedBnDesc,
const cudnnTensorDescriptor_t xDesc,
cudnnBatchNormMode_t mode)
Use the tensor descriptor produced by this function as the bnScaleBiasMeanVarDesc
parameter for the cudnnBatchNormalizationForwardInference()
and cudnnBatchNormalizationForwardTraining()
functions, and as the bnScaleBiasDiffDesc
parameter in the cudnnBatchNormalizationBackward()
function.
The resulting dimensions will be:
 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for
BATCHNORM_MODE_SPATIAL
 1xCxHxW for 4D and 1xCxDxHxW for 5D for
BATCHNORM_MODE_PER_ACTIVATION
mode
For HALF
input data type the resulting tensor descriptor will have a FLOAT
type. For other data types, it will have the same type as the input data.
 Only 4D and 5D tensors are supported.
 The
derivedBnDesc
should be first created usingcudnnCreateTensorDescriptor()
. xDesc
is the descriptor for the layer'sx
data and has to be set up with proper dimensions prior to calling this function.
Parameters
 Output. Handle to a previously created tensor descriptor.

Input. Handle to a previously created and initialized layer's
x
data descriptor.  Input. Batch normalization layer mode of operation.
Returns
 The computation was performed successfully.
 Invalid batch normalization mode.
3.2.19. cudnnDeriveNormTensorDescriptor()
This function derives tensor descriptors for the normalization mean
, invariance
, normBias
, and normScale
subtensors from the layer's x
data descriptor and norm mode. normalization
, mean
, and invariance
share the same descriptor while bias
and scale
share the same descriptor.
cudnnStatus_t CUDNNWINAPI
cudnnDeriveNormTensorDescriptor(cudnnTensorDescriptor_t derivedNormScaleBiasDesc,
cudnnTensorDescriptor_t derivedNormMeanVarDesc,
const cudnnTensorDescriptor_t xDesc,
cudnnNormMode_t mode,
int groupCnt)
Use the tensor descriptor produced by this function as the normScaleBiasDesc
or normMeanVarDesc
parameter for the cudnnNormalizationForwardInference()
and cudnnNormalizationForwardTraining()
functions, and as the dNormScaleBiasDesc
and normMeanVarDesc
parameters in the cudnnNormalizationBackward()
function.
The resulting dimensions will be:
 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for
CUDNN_NORM_PER_ACTIVATION
 1xCxHxW for 4D and 1xCxDxHxW for 5D for
CUDNN_NORM_PER_CHANNEL
mode
For HALF
input data type the resulting tensor descriptor will have a FLOAT
type. For other data types, it will have the same type as the input data.
 Only 4D and 5D tensors are supported.
 The
derivedNormScaleBiasDesc
andderivedNormMeanVarDesc
should be created first usingcudnnCreateTensorDescriptor()
. xDesc
is the descriptor for the layer'sx
data and has to be set up with proper dimensions prior to calling this function.
Parameters
 Output. Handle to a previously created tensor descriptor.
 Output. Handle to a previously created tensor descriptor.

Input. Handle to a previously created and initialized layer's
x
data descriptor.  Input. The normalization layer mode of operation.

Input. The number of grouped convolutions. Currently, only
1
is supported.
Returns
 The computation was performed successfully.
 Invalid batch normalization mode.
3.2.20. cudnnDestroy()
This function releases the resources used by the cuDNN handle. This function is usually the last call made to cuDNN with a particular handle. Because cudnnCreate()
allocates internal resources, the release of those resources by calling cudnnDestroy()
will implicitly call cudaDeviceSynchronize()
; therefore, the recommended best practice is to call cudnnCreate/cudnnDestroy
outside of performancecritical code paths.
cudnnStatus_t cudnnDestroy(cudnnHandle_t handle)
Parameters
 Input. The cuDNN handle to be destroyed.
Returns
 The cuDNN context destruction was successful.

Invalid (
NULL
) pointer supplied.
3.2.21. cudnnDestroyActivationDescriptor()
This function destroys a previously created activation descriptor object.
cudnnStatus_t cudnnDestroyActivationDescriptor(
cudnnActivationDescriptor_t activationDesc)
Returns
 The object was destroyed successfully.
3.2.22. cudnnDestroyAlgorithmDescriptor()
This function has been deprecated in cuDNN 8.0.
This function destroys a previously created algorithm descriptor object.
cudnnStatus_t cudnnDestroyAlgorithmDescriptor(
cudnnActivationDescriptor_t algorithmDesc)
Returns
 The object was destroyed successfully.
3.2.23. cudnnDestroyAlgorithmPerformance()
This function destroys a previously created algorithm descriptor object.
cudnnStatus_t cudnnDestroyAlgorithmPerformance(
cudnnAlgorithmPerformance_t algoPerf)
Returns
 The object was destroyed successfully.
3.2.24. cudnnDestroyDropoutDescriptor()
This function destroys a previously created dropout descriptor object.
cudnnStatus_t cudnnDestroyDropoutDescriptor(
cudnnDropoutDescriptor_t dropoutDesc)
Returns
 The object was destroyed successfully.
3.2.25. cudnnDestroyFilterDescriptor()
This function destroys a filter object.
cudnnStatus_t cudnnDestroyFilterDescriptor(
cudnnFilterDescriptor_t filterDesc)
Returns
 The object was destroyed successfully.
3.2.26. cudnnDestroyLRNDescriptor()
This function destroys a previously created LRN descriptor object.
cudnnStatus_t cudnnDestroyLRNDescriptor(
cudnnLRNDescriptor_t lrnDesc)
Returns
 The object was destroyed successfully.
3.2.27. cudnnDestroyOpTensorDescriptor()
This function deletes a tensor pointwise math descriptor object.
cudnnStatus_t cudnnDestroyOpTensorDescriptor(
cudnnOpTensorDescriptor_t opTensorDesc)
Parameters
 Input. Pointer to the structure holding the description of the tensor pointwise math to be deleted.
Returns
 The function returned successfully.
3.2.28. cudnnDestroyPoolingDescriptor()
This function destroys a previously created pooling descriptor object.
cudnnStatus_t cudnnDestroyPoolingDescriptor(
cudnnPoolingDescriptor_t poolingDesc)
Returns
 The object was destroyed successfully.
3.2.29. cudnnDestroyReduceTensorDescriptor()
This function destroys a previously created reduce tensor descriptor object. When the input pointer is NULL
, this function performs no destroy operation.
cudnnStatus_t cudnnDestroyReduceTensorDescriptor(
cudnnReduceTensorDescriptor_t tensorDesc)
Parameters
 Input. Pointer to the reduce tensor descriptor object to be destroyed.
Returns
 The object was destroyed successfully.
3.2.30. cudnnDestroySpatialTransformerDescriptor()
This function destroys a previously created spatial transformer descriptor object.
cudnnStatus_t cudnnDestroySpatialTransformerDescriptor(
cudnnSpatialTransformerDescriptor_t stDesc)
Returns
 The object was destroyed successfully.
3.2.31. cudnnDestroyTensorDescriptor()
This function destroys a previously created tensor descriptor object. When the input pointer is NULL
, this function performs no destroy operation.
cudnnStatus_t cudnnDestroyTensorDescriptor(cudnnTensorDescriptor_t tensorDesc)
Parameters
 Input. Pointer to the tensor descriptor object to be destroyed.
Returns
 The object was destroyed successfully.
3.2.32. cudnnDestroyTensorTransformDescriptor()
Destroys a previously created tensor transform descriptor.
cudnnStatus_t cudnnDestroyTensorTransformDescriptor(
cudnnTensorTransformDescriptor_t transformDesc);
Parameters
 Input. The tensor transform descriptor to be destroyed.
Returns
 The descriptor was destroyed successfully.
3.2.33. cudnnDivisiveNormalizationForward()
This function performs the forward spatial DivisiveNormalization
layer computation. It divides every value in a layer by the standard deviation of its spatial neighbors as described in the What is the Best MultiStage Architecture for Object Recognition paper. Note that DivisiveNormalization
only implements the x/max(c, sigma_x)
portion of the computation, where sigma_x
is the variance over the spatial neighborhood of x
.
cudnnStatus_t cudnnDivisiveNormalizationForward(
cudnnHandle_t handle,
cudnnLRNDescriptor_t normDesc,
cudnnDivNormMode_t mode,
const void *alpha,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const void *means,
void *temp,
void *temp2,
const void *beta,
const cudnnTensorDescriptor_t yDesc,
void *y)
The full LCN (Local Contrastive Normalization) computation can be implemented as a twostep process:
x_m = xmean(x);
y = x_m/max(c, sigma(x_m));
The xmean(x)
which is often referred to as "subtractive normalization" portion of the computation can be implemented using cuDNN average pooling layer followed by a call to addTensor
.
Supported tensor formats are NCHW for 4D and NCDHW for 5D with any nonoverlapping nonnegative strides. Only 4D and 5D tensors are supported.
Parameters
 Input. Handle to a previously created cuDNN library descriptor.

Input. Handle to a previously initialized LRN parameter descriptor. This descriptor is used for both LRN and
DivisiveNormalization
layers. 
Input.
DivisiveNormalization
layer mode of operation. Currently onlyCUDNN_DIVNORM_PRECOMPUTED_MEANS
is implemented. Normalization is performed using the means input tensor that is expected to be precomputed by the user. 
Input. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.

Input. Tensor descriptor objects for the input and output tensors. Note that
xDesc
is shared betweenx
,means
,temp
, andtemp2
tensors.  Input. Input tensor data pointer in device memory.

Input. Input means tensor data pointer in device memory. Note that this tensor can be
NULL
(in that case its values are assumed to be zero during the computation). This tensor also doesn't have to containmeans
, these can be any values, a frequently used variation is a result of convolution with a normalized positive kernel (such as Gaussian). 
Workspace. Temporary tensors in device memory. These are used for computing intermediate values during the forward pass. These tensors do not have to be preserved as inputs from forward to the backward pass. Both use
xDesc
as their descriptor. 
Output. Pointer in device memory to a tensor for the result of the forward
DivisiveNormalization
computation.
Returns
 The computation was performed successfully.

At least one of the following conditions are met:
 One of the tensor pointers
x
,y
,temp
, andtemp2
isNULL
.  Number of input tensor or output tensor dimensions is outside of
[4,5]
range.  A mismatch in dimensions between any two of the input or output tensors.
 For inplace computation when pointers
x == y
, a mismatch in strides between the input data and output data tensors.  Alpha or beta pointer is
NULL
.  LRN descriptor parameters are outside of their valid ranges.
 Any of the tensor strides are negative.
 One of the tensor pointers
 The function does not support the provided configuration, for example, any of the input and output tensor strides mismatch (for the same dimension) is a nonsupported configuration.
3.2.34. cudnnDropoutForward()
This function performs forward dropout operation over x
returning results in y
. If dropout
was used as a parameter to cudnnSetDropoutDescriptor()
, the approximate dropout
fraction of x
values will be replaced by a 0
, and the rest will be scaled by 1/(1dropout)
. This function should not be running concurrently with another cudnnDropoutForward()
function using the same states
.
cudnnStatus_t cudnnDropoutForward(
cudnnHandle_t handle,
const cudnnDropoutDescriptor_t dropoutDesc,
const cudnnTensorDescriptor_t xdesc,
const void *x,
const cudnnTensorDescriptor_t ydesc,
void *y,
void *reserveSpace,
size_t reserveSpaceSizeInBytes)
Parameters
 Input. Handle to a previously created cuDNN context.
 Input. Previously created dropout descriptor object.
 Input. Handle to a previously initialized tensor descriptor.

Input. Pointer to data of the tensor described by the
xDesc
descriptor.  Input. Handle to a previously initialized tensor descriptor.

Output. Pointer to data of the tensor described by the
yDesc
descriptor. 
Output. Pointer to userallocated GPU memory used by this function. It is expected that the contents of
reserveSpace
does not change betweencudnnDropoutForward()
andcudnnDropoutBackward()
calls.  Input. Specifies the size in bytes of the provided memory for the reserve space.
Returns
 The call was successful.
 The function does not support the provided configuration.

At least one of the following conditions are met:
 The number of elements of input tensor and output tensors differ.
 The
datatype
of the input tensor and output tensors differs.  The strides of the input tensor and output tensors differ and inplace operation is used (meaning,
x
andy
pointers are equal).  The provided
reserveSpaceSizeInBytes
is less than the value returned bycudnnDropoutGetReserveSpaceSize()
. cudnnSetDropoutDescriptor()
has not been called ondropoutDesc
with the nonNULL
states
argument.
 The function failed to launch on the GPU.
3.2.35. cudnnDropoutGetReserveSpaceSize()
This function is used to query the amount of reserve needed to run dropout with the input dimensions given by xDesc
. The same reserve space is expected to be passed to cudnnDropoutForward()
and cudnnDropoutBackward()
, and its contents is expected to remain unchanged between cudnnDropoutForward()
and cudnnDropoutBackward()
calls.
cudnnStatus_t cudnnDropoutGetReserveSpaceSize(
cudnnTensorDescriptor_t xDesc,
size_t *sizeInBytes)
Parameters
 Input. Handle to a previously initialized tensor descriptor, describing input to a dropout operation.

Output. Amount of GPU memory needed as reserve space to be able to run dropout with an input tensor descriptor specified by
xDesc
.
Returns
 The query was successful.
3.2.36. cudnnDropoutGetStatesSize()
This function is used to query the amount of space required to store the states of the random number generators used by the cudnnDropoutForward()
function.
cudnnStatus_t cudnnDropoutGetStatesSize(
cudnnHandle_t handle,
size_t *sizeInBytes)
Parameters
 Input. Handle to a previously created cuDNN context.
 Output. Amount of GPU memory needed to store random generator states.
Returns
 The query was successful.
3.2.37. cudnnGetActivationDescriptor()
This function queries a previously initialized generic activation descriptor object.
cudnnStatus_t cudnnGetActivationDescriptor(
const cudnnActivationDescriptor_t activationDesc,
cudnnActivationMode_t *mode,
cudnnNanPropagation_t *reluNanOpt,
double *coef)
Parameters
 Input. Handle to a previously created activation descriptor.
 Output. Enumerant to specify the activation mode.

Output. Enumerant to specify the
Nan
propagation mode. 
Output. Floating point number to specify the clipping threshold when the activation mode is set to
CUDNN_ACTIVATION_CLIPPED_RELU
or to specify the alpha coefficient when the activation mode is set toCUDNN_ACTIVATION_ELU
.
Returns
 The object was queried successfully.
3.2.38. cudnnGetActivationDescriptorSwishBeta()
This function queries the current beta parameter set for SWISH activation.
cudnnStatus_t
cudnnGetActivationDescriptorSwishBeta(cudnnActivationDescriptor_t
activationDesc, double* swish_beta)
Parameters
 Input. Handle to a previously created activation descriptor.
 Output. Pointer to a double value that will receive the currently configured SWISH beta parameter.
Returns
 The beta parameter was queried successfully.

At least one of
activationDesc
orswish_beta
wereNULL
.
3.2.39. cudnnGetAlgorithmDescriptor()
This function has been deprecated in cuDNN 8.0.
This function queries a previously initialized generic algorithm descriptor object.
cudnnStatus_t cudnnGetAlgorithmDescriptor(
const cudnnAlgorithmDescriptor_t algoDesc,
cudnnAlgorithm_t *algorithm)
Parameters
 Input. Handle to a previously created algorithm descriptor.
 Input. Struct to specify the algorithm.
Returns
 The object was queried successfully.
3.2.40. cudnnGetAlgorithmPerformance()
This function has been deprecated in cuDNN 8.0.
This function queries a previously initialized generic algorithm performance object.
cudnnStatus_t cudnnGetAlgorithmPerformance(
const cudnnAlgorithmPerformance_t algoPerf,
cudnnAlgorithmDescriptor_t* algoDesc,
cudnnStatus_t* status,
float* time,
size_t* memory)
Parameters
 Input/Output. Handle to a previously created algorithm performance object.
 Output. The algorithm descriptor which the performance results describe.

Output. The cuDNN status returned from running the
algoDesc
algorithm. 
Output. The GPU time spent running the
algoDesc
algorithm. 
Output. The GPU memory needed to run the
algoDesc
algorithm.
Returns
 The object was queried successfully.
3.2.41. cudnnGetAlgorithmSpaceSize()
This function has been deprecated in cuDNN 8.0.
This function queries for the amount of host memory needed to call cudnnSaveAlgorithm()
, much like the get workspace size function query for the amount of device memory needed.
cudnnStatus_t cudnnGetAlgorithmSpaceSize(
cudnnHandle_t handle,
cudnnAlgorithmDescriptor_t algoDesc,
size_t* algoSpaceSizeInBytes)
Parameters
 Input. Handle to a previously created cuDNN context.
 Input. A previously created algorithm descriptor.

Output. Amount of host memory needed as a workspace to be able to save the metadata from the specified
algoDesc
.
Returns
 The function launched successfully.

At least one of the arguments is
NULL
.
3.2.42. cudnnGetCallback()
This function queries the internal states of cuDNN error reporting functionality.
cudnnStatus_t cudnnGetCallback(
unsigned mask,
void **udata,
cudnnCallback_t fptr)
Parameters
 Output. Pointer to the address where the current internal error reporting message bit mask will be outputted.

Output. Pointer to the address where the current internally stored
udata
address will be stored. 
Output. Pointer to the address where the current internally stored
callback
function pointer will be stored. When the builtin default callback function is used,NULL
will be outputted.
Returns
 The function launched successfully.

If any of the input parameters are
NULL
.
3.2.43. cudnnGetCudartVersion()
The same version of a given cuDNN library can be compiled against different CUDA toolkit versions. This routine returns the CUDA toolkit version that the currently used cuDNN library has been compiled against.
size_t cudnnGetCudartVersion()
3.2.44. cudnnGetDropoutDescriptor()
This function queries the fields of a previously initialized dropout descriptor.
cudnnStatus_t cudnnGetDropoutDescriptor(
cudnnDropoutDescriptor_t dropoutDesc,
cudnnHandle_t handle,
float *dropout,
void **states,
unsigned long long *seed)
Parameters
 Input. Previously initialized dropout descriptor.
 Input. Handle to a previously created cuDNN context.
 Output. The probability with which the value from input is set to 0 during the dropout layer.
 Output. Pointer to userallocated GPU memory that holds random number generator states.
 Output. Seed used to initialize random number generator states.
Returns
 The call was successful.
 One or more of the arguments was an invalid pointer.
3.2.45. cudnnGetErrorString()
This function converts the cuDNN status code to a NULL
terminated (ASCIIZ) static string. For example, when the input argument is CUDNN_STATUS_SUCCESS
, the returned string is CUDNN_STATUS_SUCCESS
. When an invalid status value is passed to the function, the returned string is CUDNN_UNKNOWN_STATUS
.
const char * cudnnGetErrorString(cudnnStatus_t status)
Parameters
 Input. cuDNN enumerant status code.
Returns
Pointer to a static, NULL
terminated string with the status name.
3.2.46. cudnnGetFilter4dDescriptor()
This function queries the parameters of the previously initialized Filter4d
descriptor object.
cudnnStatus_t cudnnGetFilter4dDescriptor(
const cudnnFilterDescriptor_t filterDesc,
cudnnDataType_t *dataType,
cudnnTensorFormat_t *format,
int *k,
int *c,
int *h,
int *w)
Parameters
 Input. Handle to a previously created filter descriptor.
 Output. Data type.
 Output. Type of format.
 Output. Number of output feature maps.
 Output. Number of input feature maps.
 Output. Height of each filter.
 Output. Width of each filter.
Returns
 The object was set successfully.
3.2.47. cudnnGetFilterNdDescriptor()
This function queries a previously initialized FilterNd
descriptor object.
cudnnStatus_t cudnnGetFilterNdDescriptor(
const cudnnFilterDescriptor_t wDesc,
int nbDimsRequested,
cudnnDataType_t *dataType,
cudnnTensorFormat_t *format,
int *nbDims,
int filterDimA[])
Parameters
 Input. Handle to a previously initialized filter descriptor.

Input. Dimension of the expected filter descriptor. It is also the minimum size of the arrays
filterDimA
in order to be able to hold the results.  Output. Data type.
 Output. Type of format.
 Output. Actual dimension of the filter.

Output. Array of dimensions of at least
nbDimsRequested
that will be filled with the filter parameters from the provided filter descriptor.
Returns
 The object was set successfully.

The parameter
nbDimsRequested
is negative.
3.2.48. cudnnGetFilterSizeInBytes()
This function returns the size of the filter tensor in memory with respect to the given descriptor. It can be used to know the amount of GPU memory to be allocated to hold that filter tensor.
cudnnStatus_t
cudnnGetFilterSizeInBytes(const cudnnFilterDescriptor_t filterDesc, size_t *size) ;
Parameters
 Input. handle to a previously initialized filter descriptor.
 Output. size in bytes needed to hold the tensor in GPU memory.
Returns

filterDesc
is valid. 
filerDesc
is invald.
3.2.49. cudnnGetLRNDescriptor()
This function retrieves values stored in the previously initialized LRN
descriptor object.
cudnnStatus_t cudnnGetLRNDescriptor(
cudnnLRNDescriptor_t normDesc,
unsigned *lrnN,
double *lrnAlpha,
double *lrnBeta,
double *lrnK)
Parameters
 Input. Handle to a previously created LRN descriptor.

Output. Pointers to receive values of parameters stored in the descriptor object. For more information, refer to
cudnnSetLRNDescriptor()
. Any of these pointers can beNULL
(no value is returned for the corresponding parameter).
Returns
 Function completed successfully.
3.2.50. cudnnGetOpTensorDescriptor()
This function returns the configuration of the passed tensor pointwise math descriptor.
cudnnStatus_t cudnnGetOpTensorDescriptor(
const cudnnOpTensorDescriptor_t opTensorDesc,
cudnnOpTensorOp_t *opTensorOp,
cudnnDataType_t *opTensorCompType,
cudnnNanPropagation_t *opTensorNanOpt)
Parameters
 Input. Tensor pointwise math descriptor passed to get the configuration from.
 Output. Pointer to the tensor pointwise math operation type, associated with this tensor pointwise math descriptor.
 Output. Pointer to the cuDNN datatype associated with this tensor pointwise math descriptor.
 Output. Pointer to the NAN propagation option associated with this tensor pointwise math descriptor.
Returns
 The function returned successfully.
 Input tensor pointwise math descriptor passed is invalid.
3.2.51. cudnnGetPooling2dDescriptor()
This function queries a previously created Pooling2D
descriptor object.
cudnnStatus_t cudnnGetPooling2dDescriptor(
const cudnnPoolingDescriptor_t poolingDesc,
cudnnPoolingMode_t *mode,
cudnnNanPropagation_t *maxpoolingNanOpt,
int *windowHeight,
int *windowWidth,
int *verticalPadding,
int *horizontalPadding,
int *verticalStride,
int *horizontalStride)
Parameters
 Input. Handle to a previously created pooling descriptor.
 Output. Enumerant to specify the pooling mode.
 Output. Enumerant to specify the Nan propagation mode.
 Output. Height of the pooling window.
 Output. Width of the pooling window.
 Output. Size of vertical padding.
 Output. Size of horizontal padding.
 Output. Pooling vertical stride.
 Output. Pooling horizontal stride.
Returns
 The object was set successfully.
3.2.52. cudnnGetPooling2dForwardOutputDim()
This function provides the output dimensions of a tensor after Pooling2D
has been applied.
cudnnStatus_t cudnnGetPooling2dForwardOutputDim(
const cudnnPoolingDescriptor_t poolingDesc,
const cudnnTensorDescriptor_t inputDesc,
int *outN,
int *outC,
int *outH,
int *outW)
Each dimension h
and w
of the output images is computed as follows:
outputDim = 1 + (inputDim + 2*padding  windowDim)/poolingStride;
Parameters
 Input. Handle to a previously initialized pooling descriptor.
 Input. Handle to the previously initialized input tensor descriptor.
 Output. Number of images in the output.
 Output. Number of channels in the output.
 Output. Height of images in the output.
 Output. Width of images in the output.
Returns
 The function launched successfully.

At least one of the following conditions are met:
poolingDesc
has not been initialized.poolingDesc
orinputDesc
has an invalid number of dimensions (2 and 4 respectively are required).
3.2.53. cudnnGetPoolingNdDescriptor()
This function queries a previously initialized generic PoolingNd
descriptor object.
cudnnStatus_t cudnnGetPoolingNdDescriptor(
const cudnnPoolingDescriptor_t poolingDesc,
int nbDimsRequested,
cudnnPoolingMode_t *mode,
cudnnNanPropagation_t *maxpoolingNanOpt,
int *nbDims,
int windowDimA[],
int paddingA[],
int strideA[])
Parameters
 Input. Handle to a previously created pooling descriptor.

Input. Dimension of the expected pooling descriptor. It is also the minimum size of the arrays
windowDimA
,paddingA
, andstrideA
in order to be able to hold the results.  Output. Enumerant to specify the pooling mode.
 Output. Enumerant to specify the Nan propagation mode.
 Output. Actual dimension of the pooling descriptor.

Output. Array of dimension of at least
nbDimsRequested
that will be filled with the window parameters from the provided pooling descriptor. 
Output. Array of dimension of at least
nbDimsRequested
that will be filled with the padding parameters from the provided pooling descriptor. 
Output. Array of dimension at least
nbDimsRequested
that will be filled with the stride parameters from the provided pooling descriptor.
Returns
 The object was queried successfully.

The parameter
nbDimsRequested
is greater thanCUDNN_DIM_MAX
.
3.2.54. cudnnGetPoolingNdForwardOutputDim()
This function provides the output dimensions of a tensor after PoolingNd
has been applied.
cudnnStatus_t cudnnGetPoolingNdForwardOutputDim(
const cudnnPoolingDescriptor_t poolingDesc,
const cudnnTensorDescriptor_t inputDesc,
int nbDims,
int outDimA[])
Each dimension of the (nbDims2)D
images of the output tensor is computed as follows:
outputDim = 1 + (inputDim + 2*padding  windowDim)/poolingStride;
Parameters
 Input. Handle to a previously initialized pooling descriptor.
 Input. Handle to the previously initialized input tensor descriptor.
 Input. Number of dimensions in which pooling is to be applied.

Output. Array of
nbDims
output dimensions.
Returns
 The function launched successfully.

At least one of the following conditions are met:
poolingDesc
has not been initialized. The value of
nbDims
is inconsistent with the dimensionality ofpoolingDesc
andinputDesc
.
3.2.55. cudnnGetProperty()
This function writes a specific part of the cuDNN library version number into the provided host storage.
cudnnStatus_t cudnnGetProperty(
libraryPropertyType type,
int *value)
Parameters

Input. Enumerant type that instructs the function to report the numerical value of the cuDNN major version, minor version, or the patch level depending on whether
type
is set toMAJOR_VERSION
,MINOR_VERSION
, orPATCH_LEVEL
.  Output. Host pointer where the version information should be written.
Returns

Invalid value of the
type
argument.  Version information was stored successfully at the provided address.
3.2.56. cudnnGetReduceTensorDescriptor()
This function queries a previously initialized reduce tensor descriptor object.
cudnnStatus_t cudnnGetReduceTensorDescriptor(
const cudnnReduceTensorDescriptor_t reduceTensorDesc,
cudnnReduceTensorOp_t *reduceTensorOp,
cudnnDataType_t *reduceTensorCompType,
cudnnNanPropagation_t *reduceTensorNanOpt,
cudnnReduceTensorIndices_t *reduceTensorIndices,
cudnnIndicesType_t *reduceTensorIndicesType)
Parameters
 Input. Pointer to a previously initialized reduce tensor descriptor object.
 Output. Enumerant to specify the reduced tensor operation.
 Output. Enumerant to specify the computation datatype of the reduction.
 Output. Enumerant to specify the Nan propagation mode.
 Output. Enumerant to specify the reduced tensor indices.
 Output. Enumerant to specify the reduced tensor indices type.
Returns
 The object was queried successfully.

reduceTensorDesc
isNULL
.
3.2.57. cudnnGetReductionIndicesSize()
This is a helper function to return the minimum size of the index space to be passed to the reduction given the input and output tensors.
cudnnStatus_t cudnnGetReductionIndicesSize(
cudnnHandle_t handle,
const cudnnReduceTensorDescriptor_t reduceDesc,
const cudnnTensorDescriptor_t aDesc,
const cudnnTensorDescriptor_t cDesc,
size_t *sizeInBytes)
Parameters
 Input. Handle to a previously created cuDNN library descriptor.
 Input. Pointer to a previously initialized reduce tensor descriptor object.
 Input. Pointer to the input tensor descriptor.
 Input. Pointer to the output tensor descriptor.
 Output. Minimum size of the index space to be passed to the reduction.
Returns
 The index space size is returned successfully.
3.2.58. cudnnGetReductionWorkspaceSize()
This is a helper function to return the minimum size of the workspace to be passed to the reduction given the input and output tensors.
cudnnStatus_t cudnnGetReductionWorkspaceSize(
cudnnHandle_t handle,
const cudnnReduceTensorDescriptor_t reduceDesc,
const cudnnTensorDescriptor_t aDesc,
const cudnnTensorDescriptor_t cDesc,
size_t *sizeInBytes)
Parameters
 Input. Handle to a previously created cuDNN library descriptor.
 Input. Pointer to a previously initialized reduce tensor descriptor object.
 Input. Pointer to the input tensor descriptor.
 Input. Pointer to the output tensor descriptor.
 Output. Minimum size of the index space to be passed to the reduction.
Returns
 The workspace size is returned successfully.
3.2.59. cudnnGetStream()
This function retrieves the user CUDA stream programmed in the cuDNN handle. When the user's CUDA stream is not set in the cuDNN handle, this function reports the nullstream.
cudnnStatus_t cudnnGetStream(
cudnnHandle_t handle,
cudaStream_t *streamId)
Parameters
 Input. Pointer to the cuDNN handle.
 Output. Pointer where the current CUDA stream from the cuDNN handle should be stored.
Returns

Invalid (
NULL
) handle.  The stream identifier was retrieved successfully.
3.2.60. cudnnGetTensor4dDescriptor()
This function queries the parameters of the previously initialized Tensor4d
descriptor object.
cudnnStatus_t cudnnGetTensor4dDescriptor(
const cudnnTensorDescriptor_t tensorDesc,
cudnnDataType_t *dataType,
int *n,
int *c,
int *h,
int *w,
int *nStride,
int *cStride,
int *hStride,
int *wStride)
Parameters
 Input. Handle to a previously initialized tensor descriptor.
 Output. Data type.
 Output. Number of images.
 Output. Number of feature maps per image.
 Output. Height of each feature map.
 Output. Width of each feature map.
 Output. Stride between two consecutive images.
 Output. Stride between two consecutive feature maps.
 Output. Stride between two consecutive rows.
 Output. Stride between two consecutive columns.
Returns
 The operation succeeded.
3.2.61. cudnnGetTensorNdDescriptor()
This function retrieves values stored in a previously initialized TensorNd
descriptor object.
cudnnStatus_t cudnnGetTensorNdDescriptor(
const cudnnTensorDescriptor_t tensorDesc,
int nbDimsRequested,
cudnnDataType_t *dataType,
int *nbDims,
int dimA[],
int strideA[])
Parameters
 Input. Handle to a previously initialized tensor descriptor.

Input. Number of dimensions to extract from a given tensor descriptor. It is also the minimum size of the arrays
dimA
andstrideA
. If this number is greater than the resultingnbDims[0]
, onlynbDims[0]
dimensions will be returned.  Output. Data type.

Output. Actual number of dimensions of the tensor will be returned in
nbDims[0]
. 
Output. Array of dimensions of at least
nbDimsRequested
that will be filled with the dimensions from the provided tensor descriptor. 
Output. Array of dimensions of at least
nbDimsRequested
that will be filled with the strides from the provided tensor descriptor.
Returns
 The results were returned successfully.

Either
tensorDesc
ornbDims
pointer isNULL
.
3.2.62. cudnnGetTensorSizeInBytes()
This function returns the size of the tensor in memory in respect to the given descriptor. This function can be used to know the amount of GPU memory to be allocated to hold that tensor.
cudnnStatus_t cudnnGetTensorSizeInBytes(
const cudnnTensorDescriptor_t tensorDesc,
size_t *size)
Parameters
 Input. Handle to a previously initialized tensor descriptor.
 Output. Size in bytes needed to hold the tensor in GPU memory.
Returns
 The results were returned successfully.
3.2.63. cudnnGetTensorTransformDescriptor()
This function returns the values stored in a previously initialized tensor transform descriptor.
cudnnStatus_t cudnnGetTensorTransformDescriptor(
cudnnTensorTransformDescriptor_t transformDesc,
uint32_t nbDimsRequested,
cudnnTensorFormat_t *destFormat,
int32_t padBeforeA[],
int32_t padAfterA[],
uint32_t foldA[],
cudnnFoldingDirection_t *direction);
Parameters
 Input. A previously initialized tensor transform descriptor.
 Input. The number of dimensions to consider. For more information, refer to Tensor Descriptor.
 Output. The transform format that will be returned.

Output. An array filled with the amount of padding to add before each dimension. The dimension of this
padBeforeA[]
parameter is equal tonbDimsRequested
. 
Output. An array filled with the amount of padding to add after each dimension. The dimension of this
padBeforeA[]
parameter is equal tonbDimsRequested
. 
Output. An array that was filled with the folding parameters for each spatial dimension. The dimension of this
foldA[]
array isnbDimsRequested2
. 
Output. The setting that selects folding or unfolding. For more information, refer to
cudnnFoldingDirection_t
.
Returns
 The results were obtained successfully.

If
transformDesc
isNULL
or ifnbDimsRequested
is less than 3 or greater thanCUDNN_DIM_MAX
.
3.2.64. cudnnGetVersion()
This function returns the version number of the cuDNN library. It returns the CUDNN_VERSION
defined present in the cudnn.h
header file. Starting with release R2, the routine can be used to identify dynamically the current cuDNN library used by the application. The defined CUDNN_VERSION
can be used to have the same application linked against different cuDNN versions using conditional compilation statements.
size_t cudnnGetVersion()
3.2.65. cudnnInitTransformDest()
This function initializes and returns a destination tensor descriptor destDesc
for tensor transform operations. The initialization is done with the desired parameters described in the transform descriptor cudnnTensorDescriptor_t
.
cudnnStatus_t cudnnInitTransformDest(
const cudnnTensorTransformDescriptor_t transformDesc,
const cudnnTensorDescriptor_t srcDesc,
cudnnTensorDescriptor_t destDesc,
size_t *destSizeInBytes);
The returned tensor descriptor will be packed.
Parameters
 Input. Handle to a previously initialized tensor transform descriptor.
 Input. Handle to a previously initialized tensor descriptor.
 Output. Handle of the tensor descriptor that will be initialized and returned.
 Output. A pointer to hold the size, in bytes, of the new tensor.
Returns
 The tensor descriptor was initialized successfully.

If either
srcDesc
ordestDesc
isNULL
, or if the tensor descriptor’snbDims
is incorrect. For more information, refer to Tensor Descriptor.  If the provided configuration is not 4D.
 Function failed to launch on the GPU.
3.2.66. cudnnLRNCrossChannelForward()
This function performs the forward LRN layer computation.
cudnnStatus_t cudnnLRNCrossChannelForward(
cudnnHandle_t handle,
cudnnLRNDescriptor_t normDesc,
cudnnLRNMode_t lrnMode,
const void *alpha,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const void *beta,
const cudnnTensorDescriptor_t yDesc,
void *y)
Supported formats are: positivestrided
, NCHW and NHWC for 4D x
and y
, and only NCDHW DHWpacked for 5D (for both x
and y
). Only nonoverlapping 4D and 5D tensors are supported. NCHW layout is preferred for performance.
Parameters
 Input. Handle to a previously created cuDNN library descriptor.
 Input. Handle to a previously initialized LRN parameter descriptor.

Input. LRN layer mode of operation. Currently only
CUDNN_LRN_CROSS_CHANNEL_DIM1
is implemented. Normalization is performed along the tensor'sdimA[1]
. 
Input. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.
 Input. Tensor descriptor objects for the input and output tensors.
 Input. Input tensor data pointer in device memory.
 Output. Output tensor data pointer in device memory.
Returns

CUDNN_STATUS_SUCCESS
 The computation was performed successfully.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:
 One of the tensor pointers
x
,y
isNULL
.  Number of input tensor dimensions is 2 or less.
 LRN descriptor parameters are outside of their valid ranges.
 One of the tensor parameters is 5D but is not in NCDHW DHWpacked format.
 One of the tensor pointers

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration. Refer to the following examples of nonsupported configurations:
 Any of the input tensor datatypes is not the same as any of the output tensor datatype.
x
andy
tensor dimensions mismatch. Any tensor parameters strides are negative.
3.2.67. cudnnNormalizationForwardInference()
This function performs the forward normalization layer computation for the inference phase. Perchannel normalization layer is based on the Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift paper.
cudnnStatus_t
cudnnNormalizationForwardInference(cudnnHandle_t handle,
cudnnNormMode_t mode,
cudnnNormOps_t normOps,
cudnnNormAlgo_t algo,
const void *alpha,
const void *beta,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const cudnnTensorDescriptor_t normScaleBiasDesc,
const void *normScale,
const void *normBias,
const cudnnTensorDescriptor_t normMeanVarDesc,
const void *estimatedMean,
const void *estimatedVariance,
const cudnnTensorDescriptor_t zDesc,
const void *z,
cudnnActivationDescriptor_t activationDesc,
const cudnnTensorDescriptor_t yDesc,
void *y,
double epsilon,
int groupCnt);
Only 4D and 5D tensors are supported. The input transformation performed by this function is defined as:
y = beta*y + alpha *[normBias + (normScale * (xestimatedMean)/sqrt(epsilon + estimatedVariance)]
The epsilon
value has to be the same during training, backpropagation, and inference.
For the training phase, refer to cudnnNormalizationForwardTraining()
.
Higher performance can be obtained when HWpacked tensors are used for all of x
and y
.
Parameters

Input. Handle to a previously created cuDNN library descriptor. For more information, refer to
cudnnHandle_t
. 
Input. Mode of operation (perchannel or peractivation). For more information, refer to
cudnnNormMode_t
. 
Input. Mode of postoperative. Currently,
CUDNN_NORM_OPS_NORM_ACTIVATION
andCUDNN_NORM_OPS_NORM_ADD_ACTIVATION
are not supported. 
Input. Algorithm to be performed. For more information, refer to
cudnnNormAlgo_t
. 
Inputs. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.
 Input. Handles to the previously initialized tensor descriptors.

Input. Data pointer to GPU memory associated with the tensor descriptor
xDesc
, for the layer’sx
input data. 
Output. Data pointer to GPU memory associated with the tensor descriptor
yDesc
, for they
output of the normalization layer. 
Input. Tensor descriptors and pointers in device memory for residual addition to the result of the normalization operation, prior to the activation.
zDesc
and*z
are optional and are only used whennormOps
isCUDNN_NORM_OPS_NORM_ADD_ACTIVATION
, otherwise users may passNULL
. When in use,z
should have exactly the same dimension asx
and the final outputy
. For more information, refer tocudnnTensorDescriptor_t
.Since
normOps
is only supported forCUDNN_NORM_OPS_NORM
, we can set these toNULL
for now.  Inputs. Tensor descriptors and pointers in device memory for the normalization scale and bias parameters (in the Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift paper, bias is referred to as beta and scale as gamma).

Inputs. Mean and variance tensors and their tensor descriptors. The
estimatedMean
andestimatedVariance
inputs, accumulated during the training phase from thecudnnNormalizationForwardTraining()
call, should be passed as inputs here. 
Input. Descriptor for the activation operation. When the
normOps
input is set to eitherCUDNN_NORM_OPS_NORM_ACTIVATION
orCUDNN_NORM_OPS_NORM_ADD_ACTIVATION
then this activation is used, otherwise the user may passNULL
. SincenormOps
is only supported forCUDNN_NORM_OPS_NORM
, we can set these toNULL
for now.  Input. Epsilon value used in the normalization formula. Its value should be equal to or greater than zero.

Input. The number of grouped convolutions. Currently, only
1
is supported.
Returns
 The computation was performed successfully.
 A compute or data type other than what is supported was chosen, or an unknown algorithm type was chosen.

At least one of the following conditions are met:
 One of the pointers
alpha
,beta
,x
,y
,normScale
,normBias
,estimatedMean
, andestimatedInvVariance
isNULL
.  The number of
xDesc
oryDesc
tensor descriptor dimensions is not within the range of [4,5] (only 4D and 5D tensors are supported). normScaleBiasDesc
andnormMeanVarDesc
dimensions are not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for perchannel, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for peractivation mode.epsilon
value is less than zero. Dimensions or data types mismatch for
xDesc
andyDesc
.
 One of the pointers

A compute or data type other than
FLOAT
was chosen, or an unknown algorithm type was chosen.  The function failed to launch on the GPU.
3.2.68. cudnnOpsInferVersionCheck()
This function is the first of a series of corresponding functions that check for consistent library versions among DLL files for different modules.
cudnnStatus_t cudnnOpsInferVersionCheck(void)
Returns
 The version of this DLL file is consistent with cuDNN DLLs on which it depends.
 The version of this DLL file does not match that of a cuDNN DLLs on which it depends.
3.2.69. cudnnOpTensor()
This function implements the equation C = op(alpha1[0] * A, alpha2[0] * B) + beta[0] * C
, given the tensors A
, B
, and C
and the scaling factors alpha1
, alpha2
, and beta
. The op
to use is indicated by the descriptor cudnnOpTensorDescriptor_t
, meaning, the type of opTensorDesc
. Currentlysupported ops are listed by the cudnnOpTensorOp_t
enum.
cudnnStatus_t cudnnOpTensor(
cudnnHandle_t handle,
const cudnnOpTensorDescriptor_t opTensorDesc,
const void *alpha1,
const cudnnTensorDescriptor_t aDesc,
const void *A,
const void *alpha2,
const cudnnTensorDescriptor_t bDesc,
const void *B,
const void *beta,
const cudnnTensorDescriptor_t cDesc,
void *C)
The following restrictions on the input and destination tensors apply:
opTensorCompType in opTensorDesc 
A 
B 
C (destination) 

FLOAT 
FLOAT 
FLOAT 
FLOAT 
FLOAT 
INT8 
INT8 
FLOAT 
FLOAT 
HALF 
HALF 
FLOAT 
FLOAT 
BFLOAT16 
BFLOAT16 
FLOAT 
DOUBLE 
DOUBLE 
DOUBLE 
DOUBLE 
FLOAT 
FLOAT 
FLOAT 
HALF 
FLOAT 
HALF 
HALF 
HALF 
FLOAT 
INT8 
INT8 
INT8 
FLOAT 
FLOAT 
FLOAT 
INT8 
FLOAT 
FLOAT 
FLOAT 
BFLOAT16 
FLOAT 
BFLOAT16 
BFLOAT16 
BFLOAT16 
CUDNN_TENSOR_NCHW_VECT_C
is not supported as input tensor format. All tensors up to dimension five (5) are supported. This routine does not support tensor formats beyond these dimensions.
Parameters
 Input. Handle to a previously created cuDNN context.
 Input. Handle to a previously initialized op tensor descriptor.

Input. Pointers to scaling factors (in host memory) used to blend the source value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.
 Input. Handle to a previously initialized tensor descriptor.

Input. Pointer to data of the tensors described by the
aDesc
andbDesc
descriptors, respectively. 
Input/Output. Pointer to data of the tensor described by the
cDesc
descriptor.
Returns
 The function executed successfully.

The function does not support the provided configuration. Refer to the following examples of nonsupported configurations:
 The dimensions of the bias tensor and the output tensor dimensions are above 5.
opTensorCompType
is not set as stated above.

The data type of the destination tensor
C
is unrecognized, or the restrictions on the input and destination tensors, stated above, are not met.  The function failed to launch on the GPU.
3.2.70. cudnnPoolingForward()
This function computes pooling of input values (meaning, the maximum or average of several adjacent values) to produce an output with smaller height and/or width.
cudnnStatus_t cudnnPoolingForward(
cudnnHandle_t handle,
const cudnnPoolingDescriptor_t poolingDesc,
const void *alpha,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const void *beta,
const cudnnTensorDescriptor_t yDesc,
void *y)
All tensor formats are supported, best performance is expected when using HWpacked
tensors. Only 2 and 3 spatial dimensions are allowed. Vectorized tensors are only supported if they have 2 spatial dimensions.
The dimensions of the output tensor yDesc
can be smaller or bigger than the dimensions advised by the routine cudnnGetPooling2dForwardOutputDim()
or cudnnGetPoolingNdForwardOutputDim()
.
For average pooling, the compute type is float
even for integer input and output data type. Output round is nearesteven and clamp to the most negative or most positive value of type if out of range.
Parameters
 Input. Handle to a previously created cuDNN context.
 Input. Handle to a previously initialized pooling descriptor.

Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.

Input. Handle to the previously initialized input tensor descriptor. Must be of type
FLOAT
,DOUBLE
,HALF
,INT8
,INT8x4
,INT8x32
, orBFLOAT16
. For more information, refer tocudnnDataType_t
. 
Input. Data pointer to GPU memory associated with the tensor descriptor
xDesc
. 
Input. Handle to the previously initialized output tensor descriptor. Must be of type
FLOAT
,DOUBLE
,HALF
,INT8
,INT8x4
,INT8x32
, orBFLOAT16
. For more information, refer tocudnnDataType_t
. 
Output. Data pointer to GPU memory associated with the output tensor descriptor
yDesc
.
Returns
 The function launched successfully.

At least one of the following conditions are met:
 The dimensions
n
,c
of the input tensor and output tensors differ.  The
datatype
of the input tensor and output tensors differs.
 The dimensions
 The function does not support the provided configuration.
 The function failed to launch on the GPU.
3.2.71. cudnnQueryRuntimeError()
cuDNN library functions perform extensive input argument checking before launching GPU kernels. The last step is to verify that the GPU kernel actually started. When a kernel fails to start, CUDNN_STATUS_EXECUTION_FAILED
is returned by the corresponding API call. Typically, after a GPU kernel starts, no runtime checks are performed by the kernel itself  numerical results are simply written to output buffers.
cudnnStatus_t cudnnQueryRuntimeError(
cudnnHandle_t handle,
cudnnStatus_t *rstatus,
cudnnErrQueryMode_t mode,
cudnnRuntimeTag_t *tag)
When the CUDNN_BATCHNORM_SPATIAL_PERSISTENT
mode is selected in cudnnBatchNormalizationForwardTraining()
or cudnnBatchNormalizationBackward()
, the algorithm may encounter numerical overflows where CUDNN_BATCHNORM_SPATIAL
performs just fine albeit at a slower speed. The user can invoke cudnnQueryRuntimeError()
to make sure numerical overflows did not occur during the kernel execution. Those issues are reported by the kernel that performs computations.
cudnnQueryRuntimeError()
can be used in polling and blocking software control flows. There are two polling modes (CUDNN_ERRQUERY_RAWCODE
and CUDNN_ERRQUERY_NONBLOCKING
) and one blocking mode CUDNN_ERRQUERY_BLOCKING
.
CUDNN_ERRQUERY_RAWCODE
reads the error storage location regardless of the kernel completion status. The kernel might not even start and the error storage (allocated per cuDNN handle) might be used by an earlier call.
CUDNN_ERRQUERY_NONBLOCKING
checks if all tasks in the user stream are completed. The cudnnQueryRuntimeError()
function will return immediately and report CUDNN_STATUS_RUNTIME_IN_PROGRESS
in rstatus
if some tasks in the user stream are pending. Otherwise, the function will copy the remote kernel error code to rstatus
.
In the blocking mode (CUDNN_ERRQUERY_BLOCKING
), the function waits for all tasks to drain in the user stream before reporting the remote kernel error code. The blocking flavor can be further adjusted by calling cudaSetDeviceFlags
with the cudaDeviceScheduleSpin
, cudaDeviceScheduleYield
, or cudaDeviceScheduleBlockingSync
flag.
CUDNN_ERRQUERY_NONBLOCKING
and CUDNN_ERRQUERY_BLOCKING
modes should not be used when the user stream is changed in the cuDNN handle, meaning, cudnnSetStream()
is invoked between functions that report runtime kernel errors and the cudnnQueryRuntimeError()
function.
The remote error status reported in rstatus
can be set to: CUDNN_STATUS_SUCCESS
, CUDNN_STATUS_RUNTIME_IN_PROGRESS
, or CUDNN_STATUS_RUNTIME_FP_OVERFLOW
. The remote kernel error is automatically cleared by cudnnQueryRuntimeError()
.
The cudnnQueryRuntimeError()
function should be used in conjunction with cudnnBatchNormalizationForwardTraining()
and cudnnBatchNormalizationBackward()
when the cudnnBatchNormMode_t
argument is CUDNN_BATCHNORM_SPATIAL_PERSISTENT
.
Parameters
 Input. Handle to a previously created cuDNN context.
 Output. Pointer to the user's error code storage.
 Input. Remote error query mode.

Input/Output. Currently, this argument should be
NULL
.
Returns

No errors detected (
rstatus
holds a valid value).  Invalid input argument.
 A stream blocking synchronization or a nonblocking stream query failed.
 The device cannot access zerocopy memory to report kernel errors.
3.2.72. cudnnReduceTensor()
This function reduces tensor A
by implementing the equation C = alpha * reduce op ( A ) + beta * C
, given tensors A
and C
and scaling factors alpha
and beta
. The reduction op to use is indicated by the descriptor reduceTensorDesc
. Currentlysupported ops are listed by the cudnnReduceTensorOp_t
enum.
cudnnStatus_t cudnnReduceTensor(
cudnnHandle_t handle,
const cudnnReduceTensorDescriptor_t reduceTensorDesc,
void *indices,
size_t indicesSizeInBytes,
void *workspace,
size_t workspaceSizeInBytes,
const void *alpha,
const cudnnTensorDescriptor_t aDesc,
const void *A,
const void *beta,
const cudnnTensorDescriptor_t cDesc,
void *C)
Each dimension of the output tensor C
must match the corresponding dimension of the input tensor A
or must be equal to 1. The dimensions equal to 1 indicate the dimensions of A
to be reduced.
The implementation will generate indices for the min and max ops only, as indicated by the cudnnReduceTensorIndices_t
enum of the reduceTensorDesc
. Requesting indices for the other reduction ops results in an error. The data type of the indices is indicated by the cudnnIndicesType_t
enum; currently only the 32bit (unsigned int) type is supported.
The indices returned by the implementation are not absolute indices but relative to the dimensions being reduced. The indices are also flattened, meaning, not coordinate tuples.
The data types of the tensors A
and C
must match if of type double. In this case, alpha
and beta
and the computation enum of reduceTensorDesc
are all assumed to be of type double.
The HALF
and INT8
data types may be mixed with the FLOAT
data types. In these cases, the computation enum of reduceTensorDesc
is required to be of type FLOAT
.
Up to dimension 8, all tensor formats are supported. Beyond those dimensions, this routine is not supported.
Parameters
 Input. Handle to a previously created cuDNN context.
 Input. Handle to a previously initialized reduce tensor descriptor.
 Output. Handle to a previously allocated space for writing indices.
 Input. Size of the above previously allocated space.
 Input. Handle to a previously allocated space for the reduction implementation.
 Input. Size of the above previously allocated space.

Input. Pointers to scaling factors (in host memory) used to blend the source value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.
 Input. Handle to a previously initialized tensor descriptor.

Input. Pointer to data of the tensor described by the
aDesc
descriptor. 
Input/Output. Pointer to data of the tensor described by the
cDesc
descriptor.
Returns
 The function executed successfully.

The function does not support the provided configuration. See the following for some examples of nonsupported configurations:
 The dimensions of the input tensor and the output tensor are above 8.
reduceTensorCompType
is not set as stated above.
 The corresponding dimensions of the input and output tensors all match, or the conditions in the above paragraphs are unmet.
 The allocations for the indices or workspace are insufficient.
 The function failed to launch on the GPU.
3.2.73. cudnnRestoreAlgorithm()
This function has been deprecated in cuDNN 8.0.
This function reads algorithm metadata from the host memory space provided by the user in algoSpace
, allowing the user to use the results of RNN finds from previous cuDNN sessions.
cudnnStatus_t cudnnRestoreAlgorithm(
cudnnHandle_t handle,
void* algoSpace,
size_t algoSpaceSizeInBytes,
cudnnAlgorithmDescriptor_t algoDesc)
Parameters
 Input. Handle to a previously created cuDNN context.
 Input. A previously created algorithm descriptor.
 Input. Pointer to the host memory to be read.

Input. Amount of host memory needed as a workspace to be able to hold the metadata from the specified
algoDesc
.
Returns
 The function launched successfully.
 The metadata is from a different cuDNN version.

At least one of the following conditions is met:
 One of the arguments is
NULL
.  The metadata is corrupted.
 One of the arguments is
3.2.74. cudnnRestoreDropoutDescriptor()
This function restores a dropout descriptor to a previously savedoff state.
cudnnStatus_t cudnnRestoreDropoutDescriptor(
cudnnDropoutDescriptor_t dropoutDesc,
cudnnHandle_t handle,
float dropout,
void *states,
size_t stateSizeInBytes,
unsigned long long seed)
Parameters
 Input/Output. Previously created dropout descriptor.
 Input. Handle to a previously created cuDNN context.

Input. Probability with which the value from an input tensor is set to
0
when performing dropout. 
Input. Pointer to GPU memory that holds random number generator states initialized by a prior call to
cudnnSetDropoutDescriptor()
. 
Input. Size in bytes of buffer holding random number generator
states
. 
Input. Seed used in prior calls to
cudnnSetDropoutDescriptor()
that initializedstates
buffer. Using a different seed from this has no effect. A change of seed, and subsequent update to random number generator states can be achieved by callingcudnnSetDropoutDescriptor()
.
Returns
 The call was successful.

The
states
buffer size (as indicated instateSizeInBytes
) is too small.
3.2.75. cudnnSaveAlgorithm()
This function has been deprecated in cuDNN 8.0.
This function writes algorithm metadata into the host memory space provided by the user in algoSpace
, allowing the user to preserve the results of RNN finds after cuDNN exits.
cudnnStatus_t cudnnSaveAlgorithm(
cudnnHandle_t handle,
cudnnAlgorithmDescriptor_t algoDesc,
void* algoSpace
size_t algoSpaceSizeInBytes)
Parameters
 Input. Handle to a previously created cuDNN context.
 Input. A previously created algorithm descriptor.
 Input. Pointer to the host memory to be written.

Input. Amount of host memory needed as a workspace to be able to save the metadata from the specified
algoDesc
.
Returns
 The function launched successfully.

At least one of the following conditions is met:
 One of the arguments is
NULL
. algoSpaceSizeInBytes
is too small.
 One of the arguments is
3.2.76. cudnnScaleTensor()
This function scales all the elements of a tensor by a given factor.
cudnnStatus_t cudnnScaleTensor(
cudnnHandle_t handle,
const cudnnTensorDescriptor_t yDesc,
void *y,
const void *alpha)
Parameters
 Input. Handle to a previously created cuDNN context.
 Input. Handle to a previously initialized tensor descriptor.

Input/Output. Pointer to data of the tensor described by the
yDesc
descriptor.  Input. Pointer in the host memory to a single value that all elements of the tensor will be scaled with. For more information, refer to Scaling Parameters.
Returns
 The function launched successfully.
 The function does not support the provided configuration.

One of the provided pointers is
NIL
.  The function failed to launch on the GPU.
3.2.77. cudnnSetActivationDescriptor()
This function initializes a previously created generic activation descriptor object.
cudnnStatus_t cudnnSetActivationDescriptor(
cudnnActivationDescriptor_t activationDesc,
cudnnActivationMode_t mode,
cudnnNanPropagation_t reluNanOpt,
double coef)
Parameters
 Input/Output. Handle to a previously created activation descriptor.
 Input. Enumerant to specify the activation mode.

Input. Enumerant to specify the
Nan
propagation mode. 
Input. Floating point number. When the activation mode (refer to
cudnnActivationMode_t
) is set toCUDNN_ACTIVATION_CLIPPED_RELU
, this input specifies the clipping threshold; and when the activation mode is set toCUDNN_ACTIVATION_RELU
, this input specifies the upper bound.
Returns
 The object was set successfully.

mode
orreluNanOpt
has an invalid enumerant value.
3.2.78. cudnnSetActivationDescriptorSwishBeta()
This function sets the beta parameter of the SWISH activation function to swish_beta
.
cudnnStatus_t cudnnSetActivationDescriptorSwishBeta(cudnnActivationDescriptor_t activationDesc, double swish_beta)
Parameters
 Input/Output. Handle to a previously created activation descriptor.
 Input. The value to set the SWISH activations' beta parameter to.
Returns
 The value was set successfully.

The activation descriptor is a
NULL
pointer.
3.2.79. cudnnSetAlgorithmDescriptor()
This function has been deprecated in cuDNN 8.0.
This function initializes a previously created generic algorithm descriptor object.
cudnnStatus_t cudnnSetAlgorithmDescriptor(
cudnnAlgorithmDescriptor_t algorithmDesc,
cudnnAlgorithm_t algorithm)
Parameters
 Input/Output. Handle to a previously created algorithm descriptor.
 Input. Struct to specify the algorithm.
Returns
 The object was set successfully.
3.2.80. cudnnSetAlgorithmPerformance()
This function has been deprecated in cuDNN 8.0.
This function initializes a previously created generic algorithm performance object.
cudnnStatus_t cudnnSetAlgorithmPerformance(
cudnnAlgorithmPerformance_t algoPerf,
cudnnAlgorithmDescriptor_t algoDesc,
cudnnStatus_t status,
float time,
size_t memory)
Parameters
 Input/Output. Handle to a previously created algorithm performance object.
 Input. The algorithm descriptor which the performance results describe.

Input. The cuDNN status returned from running the
algoDesc
algorithm. 
Input. The GPU time spent running the
algoDesc
algorithm. 
Input. The GPU memory needed to run the
algoDesc
algorithm.
Returns
 The object was set successfully.

mode
orreluNanOpt
has an invalid enumerate value.
3.2.81. cudnnSetCallback()
This function sets the internal states of cuDNN error reporting functionality.
cudnnStatus_t cudnnSetCallback(
unsigned mask,
void *udata,
cudnnCallback_t fptr)
Parameters

Input. An unsigned integer. The four least significant bits (LSBs) of this unsigned integer are used for switching on and off the different levels of error reporting messages. This applies for both the default callbacks, and for the customized callbacks. The bit position is in correspondence with the enum of
cudnnSeverity_t
. The user may utilize the predefined macrosCUDNN_SEV_ERROR_EN
,CUDNN_SEV_WARNING_EN
, andCUDNN_SEV_INFO_EN
to form the bit mask. When a bit is set to1
, the corresponding message channel is enabled.For example, when bit 3 is set to
1
, the API logging is enabled. Currently, only the log output of levelCUDNN_SEV_INFO
is functional; the others are not yet implemented. When used for turning on and off the logging with the default callback, the user may passNULL
toudata
andfptr
. In addition, the environment variableCUDNN_LOGDEST_DBG
must be set. For more information, refer to Deprecation Policy.CUDNN_SEV_INFO_EN
= 0b1000 (functional).CUDNN_SEV_ERROR_EN
= 0b0010 (not yet functional).CUDNN_SEV_WARNING_EN
= 0b0100 (not yet functional).
The output of
CUDNN_SEV_FATAL
is always enabled and cannot be disabled. 
Input. A pointer provided by the user. This pointer will be passed to the user’s custom logging callback function. The data it points to will not be read, nor be changed by cuDNN. This pointer may be used in many ways, such as in a mutex or in a communication socket for the user’s callback function for logging. If the user is utilizing the default callback function, or doesn’t want to use this input in the customized callback function, they may pass in
NULL
. 
Input. A pointer to a usersupplied callback function. When
NULL
is passed to this pointer, then cuDNN switches back to the builtin default callback function. The usersupplied callback function prototype must be similar to the following (also defined in the header file):void customizedLoggingCallback (cudnnSeverity_t sev, void *udata, const cudnnDebug_t *dbg, const char *msg);
 The structure
cudnnDebug_t
is defined in the header file. It provides the metadata, such as time, time since start, stream ID, process and thread ID, that the user may choose to print or store in their customized callback.  The variable
msg
is the logging message generated by cuDNN. Each line of this message is terminated by\0
, and the end of the message is terminated by\0\0
. Users may select what is necessary to show in the log, and may reformat the string.
 The structure
Returns
 The function launched successfully.
3.2.82. cudnnSetDropoutDescriptor()
This function initializes a previously created dropout descriptor object. If the states
argument is equal to NULL
, then the random number generator states won't be initialized, and only the dropout
value will be set. The user is expected not to change the memory pointed at by states
for the duration of the computation.
cudnnStatus_t cudnnSetDropoutDescriptor(
cudnnDropoutDescriptor_t dropoutDesc,
cudnnHandle_t handle,
float dropout,
void *states,
size_t stateSizeInBytes,
unsigned long long seed)
When the states
argument is not NULL
, a cuRAND initialization kernel is invoked by cudnnSetDropoutDescriptor()
. This kernel requires a substantial amount of GPU memory for the stack. Memory is released when the kernel finishes. The CUDNN_STATUS_ALLOC_FAILED
status is returned when no sufficient free memory is available for the GPU stack.
Parameters
 Input/Output. Previously created dropout descriptor object.
 Input. Handle to a previously created cuDNN context.
 Input. The probability with which the value from input is set to zero during the dropout layer.
 Output. Pointer to userallocated GPU memory that will hold random number generator states.
 Input. Specifies the size in bytes of the provided memory for the states.
 Input. Seed used to initialize random number generator states.
Returns
 The call was successful.

The
sizeInBytes
argument is less than the value returned bycudnnDropoutGetStatesSize()
.  The function failed to temporarily extend the GPU stack.
 The function failed to launch on the GPU.
 Internally used CUDA functions returned an error status.
3.2.83. cudnnSetFilter4dDescriptor()
This function initializes a previously created filter descriptor object into a 4D filter. The layout of the filters must be contiguous in memory.
cudnnStatus_t cudnnSetFilter4dDescriptor(
cudnnFilterDescriptor_t filterDesc,
cudnnDataType_t dataType,
cudnnTensorFormat_t format,
int k,
int c,
int h,
int w)
Tensor format CUDNN_TENSOR_NHWC
has limited support in cudnnConvolutionForward()
, cudnnConvolutionBackwardData()
, and cudnnConvolutionBackwardFilter()
.
Parameters
 Input/Output. Handle to a previously created filter descriptor.
 Input. Data type.

Input.Type of the filter layout format. If this input is set to
CUDNN_TENSOR_NCHW
, which is one of the enumerant values allowed bycudnnTensorFormat_t
descriptor, then the layout of the filter is in the form ofKCRS
, where:K
represents the number of output feature mapsC
is the number of input feature mapsR
is the number of rows per filterS
is the number of columns per filter
If this input is set to
CUDNN_TENSOR_NHWC
, then the layout of the filter is in the form ofKRSC
. For more information, refer tocudnnTensorFormat_t
.  Input. Number of output feature maps.
 Input. Number of input feature maps.
 Input. Height of each filter.
 Input. Width of each filter.
Returns
 The object was set successfully.

At least one of the parameters
k
,c
,h
,w
is negative ordataType
orformat
has an invalid enumerant value.
3.2.84. cudnnSetFilterNdDescriptor()
This function initializes a previously created filter descriptor object. The layout of the filters must be contiguous in memory.
cudnnStatus_t cudnnSetFilterNdDescriptor(
cudnnFilterDescriptor_t filterDesc,
cudnnDataType_t dataType,
cudnnTensorFormat_t format,
int nbDims,
const int filterDimA[])
The tensor format CUDNN_TENSOR_NHWC
has limited support in cudnnConvolutionForward()
, cudnnConvolutionBackwardData()
, and cudnnConvolutionBackwardFilter()
.
Parameters
 Input/Output. Handle to a previously created filter descriptor.
 Input. Data type.

Input.Type of the filter layout format. If this input is set to
CUDNN_TENSOR_NCHW
, which is one of the enumerant values allowed bycudnnTensorFormat_t
descriptor, then the layout of the filter is as follows: For
N=4
, a 4D filter descriptor, the filter layout is in the form ofKCRS
:K
represents the number of output feature mapsC
is the number of input feature mapsR
is the number of rows per filterS
is the number of columns per filter
 For
N=3
, a 3D filter descriptor, the numberS
(number of columns per filter) is omitted.  For
N=5
and greater, the layout of the higher dimensions immediately followsRS
.
On the other hand, if this input is set to
CUDNN_TENSOR_NHWC
, then the layout of the filter is as follows: For
N=4
, a 4D filter descriptor, the filter layout is in the form ofKRSC
.  For
N=3
, a 3D filter descriptor, the numberS
(number of columns per filter) is omitted and the layout ofC
immediately followsR
.  For
N=5
and greater, the layout of the higher dimensions are inserted betweenS
andC
. For more information, refer tocudnnTensorFormat_t
.
 For
 Input. Dimension of the filter.

Input. Array of dimension
nbDims
containing the size of the filter for each dimension.
Returns
 The object was set successfully.

At least one of the elements of the array
filterDimA
is negative ordataType
orformat
has an invalid enumerant value. 
The parameter
nbDims
exceedsCUDNN_DIM_MAX
.
3.2.85. cudnnSetLRNDescriptor()
This function initializes a previously created LRN descriptor object.
cudnnStatus_t cudnnSetLRNDescriptor(
cudnnLRNDescriptor_t normDesc,
unsigned lrnN,
double lrnAlpha,
double lrnBeta,
double lrnK)
Parameters
 Output. Handle to a previously created LRN descriptor.

Input. Normalization window width in elements. The LRN layer uses a window
[centerlookBehind, center+lookAhead]
, wherelookBehind = floor( (lrnN1)/2 )
,lookAhead = lrnNlookBehind1
. So forn=10
, the window is[k4...k...k+5]
with a total of 10 samples. For theDivisiveNormalization
layer, the window has the same extent as above in all spatial dimensions (dimA[2]
,dimA[3]
,dimA[4]
). By default,lrnN
is set to5
incudnnCreateLRNDescriptor()
. 
Input. Value of the alpha variance scaling parameter in the normalization formula. Inside the library code, this value is divided by the window width for LRN and by
(window width)^#spatialDimensions
forDivisiveNormalization
. By default, this value is set to1e4
incudnnCreateLRNDescriptor()
. 
Input. Value of the beta power parameter in the normalization formula. By default, this value is set to
0.75
incudnnCreateLRNDescriptor()
. 
Input. Value of the
k
parameter in the normalization formula. By default, this value is set to2.0
.
Returns
 The object was set successfully.
 One of the input parameters was out of valid range as described above.
3.2.86. cudnnSetOpTensorDescriptor()
This function initializes a tensor pointwise math descriptor.
cudnnStatus_t cudnnSetOpTensorDescriptor(
cudnnOpTensorDescriptor_t opTensorDesc,
cudnnOpTensorOp_t opTensorOp,
cudnnDataType_t opTensorCompType,
cudnnNanPropagation_t opTensorNanOpt)
Parameters
 Output. Pointer to the structure holding the description of the tensor pointwise math descriptor.
 Input. Tensor pointwise math operation for this tensor pointwise math descriptor.
 Input. Computation datatype for this tensor pointwise math descriptor.
 Input. NAN propagation policy.
Returns
 The function returned successfully.
 At least one of the input parameters passed is invalid.
3.2.87. cudnnSetPooling2dDescriptor()
This function initializes a previously created generic pooling descriptor object into a 2D description.
cudnnStatus_t cudnnSetPooling2dDescriptor(
cudnnPoolingDescriptor_t poolingDesc,
cudnnPoolingMode_t mode,
cudnnNanPropagation_t maxpoolingNanOpt,
int windowHeight,
int windowWidth,
int verticalPadding,
int horizontalPadding,
int verticalStride,
int horizontalStride)
Parameters
 Input/Output. Handle to a previously created pooling descriptor.
 Input. Enumerant to specify the pooling mode.
 Input. Enumerant to specify the Nan propagation mode.
 Input. Height of the pooling window.
 Input. Width of the pooling window.
 Input. Size of vertical padding.
 Input. Size of horizontal padding.
 Input. Pooling vertical stride.
 Input. Pooling horizontal stride.
Returns
 The object was set successfully.

At least one of the parameters
windowHeight
,windowWidth
,verticalStride
,horizontalStride
is negative ormode
ormaxpoolingNanOpt
has an invalid enumerate value.
3.2.88. cudnnSetPoolingNdDescriptor()
This function initializes a previously created generic pooling descriptor object.
cudnnStatus_t cudnnSetPoolingNdDescriptor(
cudnnPoolingDescriptor_t poolingDesc,
const cudnnPoolingMode_t mode,
const cudnnNanPropagation_t maxpoolingNanOpt,
int nbDims,
const int windowDimA[],
const int paddingA[],
const int strideA[])
Parameters
 Input/Output. Handle to a previously created pooling descriptor.
 Input. Enumerant to specify the pooling mode.
 Input. Enumerant to specify the Nan propagation mode.
 Input. Dimension of the pooling operation. Must be greater than zero.

Input. Array of dimension
nbDims
containing the window size for each dimension. The value of array elements must be greater than zero. 
Input. Array of dimension
nbDims
containing the padding size for each dimension. Negative padding is allowed. 
Input. Array of dimension
nbDims
containing the striding size for each dimension. The value of array elements must be greater than zero (meaning, negative striding size is not allowed).
Returns
 The object was initialized successfully.

If (
nbDims
>CUDNN_DIM_MAX2
). 
Either
nbDims
, or at least one of the elements of the arrayswindowDimA
orstrideA
is negative, ormode
ormaxpoolingNanOpt
has an invalid enumerate value.
3.2.89. cudnnSetReduceTensorDescriptor()
This function initializes a previously created reduce tensor descriptor object.
cudnnStatus_t cudnnSetReduceTensorDescriptor(
cudnnReduceTensorDescriptor_t reduceTensorDesc,
cudnnReduceTensorOp_t reduceTensorOp,
cudnnDataType_t reduceTensorCompType,
cudnnNanPropagation_t reduceTensorNanOpt,
cudnnReduceTensorIndices_t reduceTensorIndices,
cudnnIndicesType_t reduceTensorIndicesType)
Parameters
 Input/Output. Handle to a previously created reduce tensor descriptor.
 Input. Enumerant to specify the reduce tensor operation.
 Input. Enumerant to specify the computation datatype of the reduction.
 Input. Enumerant to specify the Nan propagation mode.
 Input. Enumerant to specify the reduced tensor indices.
 Input. Enumerant to specify the reduce tensor indices type.
Returns
 The object was set successfully.

reduceTensorDesc
isNULL
(reduceTensorOp
,reduceTensorCompType
,reduceTensorNanOpt
,reduceTensorIndices
orreduceTensorIndicesType
has an invalid enumerant value).
3.2.90. cudnnSetSpatialTransformerNdDescriptor()
This function initializes a previously created generic spatial transformer descriptor object.
cudnnStatus_t cudnnSetSpatialTransformerNdDescriptor(
cudnnSpatialTransformerDescriptor_t stDesc,
cudnnSamplerType_t samplerType,
cudnnDataType_t dataType,
const int nbDims,
const int dimA[])
Parameters
 Input/Output. Previously created spatial transformer descriptor object.
 Input. Enumerant to specify the sampler type.
 Input. Data type.
 Input. Dimension of the transformed tensor.

Input. Array of dimension
nbDims
containing the size of the transformed tensor for every dimension.
Returns
 The call was successful.

At least one of the following conditions are met:
 Either
stDesc
ordimA
isNULL
.  Either
dataType
orsamplerType
has an invalid enumerant value.
 Either
3.2.91. cudnnSetStream()
This function sets the user's CUDA stream in the cuDNN handle. The new stream will be used to launch cuDNN GPU kernels or to synchronize to this stream when cuDNN kernels are launched in the internal streams. If the cuDNN library stream is not set, all kernels use the default (NULL
) stream. Setting the user stream in the cuDNN handle guarantees the issueorder execution of cuDNN calls and other GPU kernels launched in the same stream.
cudnnStatus_t cudnnSetStream(
cudnnHandle_t handle,
cudaStream_t streamId)
With CUDA 11.x or later, internal streams have the same priority as the stream set by the last call to this function. In CUDA graph capture mode, CUDA 11.8 or later is required in order for the stream priorities to match.
Parameters
 Input. Pointer to the cuDNN handle.
 Input. New CUDA stream to be written to the cuDNN handle.
Returns

Invalid (
NULL
) handle.  Mismatch between the user stream and the cuDNN handle context.
 The new stream was set successfully.
3.2.92. cudnnSetTensor()
This function sets all the elements of a tensor to a given value.
cudnnStatus_t cudnnSetTensor(
cudnnHandle_t handle,
const cudnnTensorDescriptor_t yDesc,
void *y,
const void *valuePtr)
Parameters
 Input. Handle to a previously created cuDNN context.
 Input. Handle to a previously initialized tensor descriptor.

Input/Output. Pointer to data of the tensor described by the
yDesc
descriptor. 
Input. Pointer in host memory to a single value. All elements of the
y
tensor will be set tovalue[0]
. The data type of the element invalue[0]
has to match the data type of tensory
.
Returns
 The function launched successfully.
 The function does not support the provided configuration.

One of the provided pointers is
NIL
.  The function failed to launch on the GPU.
3.2.93. cudnnSetTensor4dDescriptor()
This function initializes a previously created generic tensor descriptor object into a 4D tensor. The strides of the four dimensions are inferred from the format parameter and set in such a way that the data is contiguous in memory with no padding between dimensions.
cudnnStatus_t cudnnSetTensor4dDescriptor(
cudnnTensorDescriptor_t tensorDesc,
cudnnTensorFormat_t format,
cudnnDataType_t dataType,
int n,
int c,
int h,
int w)
The total size of a tensor including the potential padding between dimensions is limited to 2 Gigaelements of type datatype
.
Parameters
 Input/Output. Handle to a previously created tensor descriptor.
 Input. Type of format.
 Input. Data type.
 Input. Number of images.
 Input. Number of feature maps per image.
 Input. Height of each feature map.
 Input. Width of each feature map.
Returns
 The object was set successfully.

At least one of the parameters
n
,c
,h
,w
was negative orformat
has an invalid enumerant value ordataType
has an invalid enumerant value.  The total size of the tensor descriptor exceeds the maximum limit of 2 Gigaelements.
3.2.94. cudnnSetTensor4dDescriptorEx()
This function initializes a previously created generic tensor descriptor object into a 4D tensor, similarly to cudnnSetTensor4dDescriptor()
but with the strides explicitly passed as parameters. This can be used to lay out the 4D tensor in any order or simply to define gaps between dimensions.
cudnnStatus_t cudnnSetTensor4dDescriptorEx(
cudnnTensorDescriptor_t tensorDesc,
cudnnDataType_t dataType,
int n,
int c,
int h,
int w,
int nStride,
int cStride,
int hStride,
int wStride)
At present, some cuDNN routines have limited support for strides. Those routines will return CUDNN_STATUS_NOT_SUPPORTED
if a 4D tensor object with an unsupported stride is used. cudnnTransformTensor()
can be used to convert the data to a supported layout.
The total size of a tensor including the potential padding between dimensions is limited to 2 Gigaelements of type datatype
.
Parameters
 Input/Output. Handle to a previously created tensor descriptor.
 Input. Data type.
 Input. Number of images.
 Input. Number of feature maps per image.
 Input. Height of each feature map.
 Input. Width of each feature map.
 Input. Stride between two consecutive images.
 Input. Stride between two consecutive feature maps.
 Input. Stride between two consecutive rows.
 Input. Stride between two consecutive columns.
Returns
 The object was set successfully.

At least one of the parameters
n
,c
,h
,w
ornStride
,cStride
,hStride
,wStride
is negative ordataType
has an invalid enumerant value.  The total size of the tensor descriptor exceeds the maximum limit of 2 Gigaelements.
3.2.95. cudnnSetTensorNdDescriptor()
This function initializes a previously created generic tensor descriptor object.
cudnnStatus_t cudnnSetTensorNdDescriptor(
cudnnTensorDescriptor_t tensorDesc,
cudnnDataType_t dataType,
int nbDims,
const int dimA[],
const int strideA[])
The total size of a tensor including the potential padding between dimensions is limited to 2 Gigaelements of type datatype
. Tensors are restricted to having at least 4 dimensions, and at most CUDNN_DIM_MAX
dimensions (defined in cudnn.h
). When working with lower dimensional data, it is recommended that the user create a 4D tensor, and set the size along unused dimensions to 1.
Parameters
 Input/Output. Handle to a previously created tensor descriptor.
 Input. Data type.

Input. Dimension of the tensor.
Note:
Do not use 2 dimensions. Due to historical reasons, the minimum number of dimensions in the filter descriptor is three. For more information, refer to
cudnnGetRNNLinLayerBiasParams()
. 
Input. Array of dimension
nbDims
that contain the size of the tensor for every dimension. The size along unused dimensions should be set to1
. By convention, the ordering of dimensions in the array follows the format [N, C, D, H, W]
, withW
occupying the smallest index in the array. 
Input. Array of dimension
nbDims
that contain the stride of the tensor for every dimension. By convention, the ordering of the strides in the array follows the format [Nstride, Cstride, Dstride, Hstride, Wstride]
, withWstride
occupying the smallest index in the array.
Returns
 The object was set successfully.

At least one of the elements of the array
dimA
was negative or zero, ordataType
has an invalid enumerant value. 
The parameter
nbDims
is outside the range[4, CUDNN_DIM_MAX]
, or the total size of the tensor descriptor exceeds the maximum limit of 2 Gigaelements.
3.2.96. cudnnSetTensorNdDescriptorEx()
This function initializes an Nd
tensor descriptor.
cudnnStatus_t cudnnSetTensorNdDescriptorEx(
cudnnTensorDescriptor_t tensorDesc,
cudnnTensorFormat_t format,
cudnnDataType_t dataType,
int nbDims,
const int dimA[])
Parameters
 Output. Pointer to the tensor descriptor struct to be initialized.
 Input. Tensor format.
 Input. Tensor data type.

Input. Dimension of the tensor.
Note:
Do not use 2 dimensions. Due to historical reasons, the minimum number of dimensions in the filter descriptor is three. For more information, refer to
cudnnGetRNNLinLayerBiasParams()
.  Input. Array containing the size of each dimension.
Returns
 The function was successful.
 Tensor descriptor was not allocated properly; or input parameters are not set correctly.
 Dimension size requested is larger than maximum dimension size supported.
3.2.97. cudnnSetTensorTransformDescriptor()
This function initializes a tensor transform descriptor that was previously created using the cudnnCreateTensorTransformDescriptor()
function.
cudnnStatus_t cudnnSetTensorTransformDescriptor(
cudnnTensorTransformDescriptor_t transformDesc,
const uint32_t nbDims,
const cudnnTensorFormat_t destFormat,
const int32_t padBeforeA[],
const int32_t padAfterA[],
const uint32_t foldA[],
const cudnnFoldingDirection_t direction);
Parameters
 Output. The tensor transform descriptor to be initialized.
 Input. The dimensionality of the transform operands. Must be greater than 2. For more information, refer to Tensor Descriptor.
 Input. The desired destination format.

Input. An array that contains the amount of padding that should be added before each dimension. Set to
NULL
for no padding. 
Input. An array that contains the amount of padding that should be added after each dimension. Set to
NULL
for no padding. 
Input. An array that contains the folding parameters for each spatial dimension (dimensions 2 and up). Set to
NULL
for no folding. 
Input. Selects folding or unfolding. This input has no effect when folding parameters are all <= 1. For more information, refer to
cudnnFoldingDirection_t
.
Returns
 The function was launched successfully.

The parameter
transformDesc
isNULL
, or ifdirection
is invalid, ornbDims
is <= 2. 
If the dimension size requested is larger than maximum dimension size supported (meaning, one of the
nbDims
is larger thanCUDNN_DIM_MAX
), or ifdestFromat
is something other thanNCHW
orNHWC
.
3.2.98. cudnnSoftmaxForward()
This routine computes the softmax function.
cudnnStatus_t cudnnSoftmaxForward(
cudnnHandle_t handle,
cudnnSoftmaxAlgorithm_t algorithm,
cudnnSoftmaxMode_t mode,
const void *alpha,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const void *beta,
const cudnnTensorDescriptor_t yDesc,
void *y)
All tensor formats are supported for all modes and algorithms with 4 and 5D tensors. Performance is expected to be highest with NCHW fullypacked
tensors. For more than 5 dimensions tensors must be packed in their spatial dimensions.
Data Types
This function supports the following data types:
CUDNN_DATA_FLOAT
CUDNN_DATA_DOUBLE
CUDNN_DATA_HALF
CUDNN_DATA_BFLOAT16
CUDNN_DATA_INT8
Parameters
 Input. Handle to a previously created cuDNN context.
 Input. Enumerant to specify the softmax algorithm.
 Input. Enumerant to specify the softmax mode.

Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.
 Input. Handle to the previously initialized input tensor descriptor.

Input. Data pointer to GPU memory associated with the tensor descriptor
xDesc
.  Input. Handle to the previously initialized output tensor descriptor.

Output. Data pointer to GPU memory associated with the output tensor descriptor
yDesc
.
Returns
 The function launched successfully.
 The function does not support the provided configuration.

At least one of the following conditions are met:
 The dimensions
n
,c
,h
,w
of the input tensor and output tensors differ.  The
datatype
of the input tensor and output tensors differ.  The parameters
algorithm
ormode
have an invalid enumerant value.
 The dimensions
 The function failed to launch on the GPU.
3.2.99. cudnnSpatialTfGridGeneratorForward()
This function generates a grid of coordinates in the input tensor corresponding to each pixel from the output tensor.
cudnnStatus_t cudnnSpatialTfGridGeneratorForward(
cudnnHandle_t handle,
const cudnnSpatialTransformerDescriptor_t stDesc,
const void *theta,
void *grid)
Only 2D transformation is supported.
Parameters
 Input. Handle to a previously created cuDNN context.
 Input. Previously created spatial transformer descriptor object.

Input. Affine transformation matrix. It should be of size
n*2*3
for a 2d transformation, wheren
is the number of images specified instDesc
. 
Output. A grid of coordinates. It is of size
n*h*w*2
for a 2d transformation, wheren
,h
,w
is specified instDesc
. In the 4th dimension, the first coordinate isx
, and the second coordinate isy
.
Returns
 The call was successful.

At least one of the following conditions are met:
handle
isNULL
. One of the parameters
grid
ortheta
isNULL
.

The function does not support the provided configuration. Refer to the following examples of nonsupported configurations:
 The dimension of the transformed tensor specified in
stDesc
> 4.
 The dimension of the transformed tensor specified in
 The function failed to launch on the GPU.
3.2.100. cudnnSpatialTfSamplerForward()
This function performs a sampler operation and generates the output tensor using the grid given by the grid generator.
cudnnStatus_t cudnnSpatialTfSamplerForward(
cudnnHandle_t handle,
const cudnnSpatialTransformerDescriptor_t stDesc,
const void *alpha,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const void *grid,
const void *beta,
cudnnTensorDescriptor_t yDesc,
void *y)
Only 2D transformation is supported.
Parameters
 Input. Handle to a previously created cuDNN context.
 Input. Previously created spatial transformer descriptor object.

Input. Pointers to scaling factors (in host memory) used to blend the source value with prior value in the destination tensor as follows:
dstValue = alpha[0]*srcValue + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.
 Input. Handle to the previously initialized input tensor descriptor.

Input. Data pointer to GPU memory associated with the tensor descriptor
xDesc
. 
Input. A grid of coordinates generated by
cudnnSpatialTfGridGeneratorForward()
.  Input. Handle to the previously initialized output tensor descriptor.

Output. Data pointer to GPU memory associated with the output tensor descriptor
yDesc
.
Returns
 The call was successful.

At least one of the following conditions are met:
handle
isNULL
. One of the parameters
x
,y
orgrid
isNULL
.

The function does not support the provided configuration. Refer to the following examples of nonsupported configurations:
 The dimension of transformed tensor > 4.
 The function failed to launch on the GPU.
3.2.101. cudnnTransformFilter()
This function converts the filter between different formats, data types, or dimensions based on the described transformation. It can be used to convert a filter with an unsupported layout format to a filter with a supported layout format.
cudnnStatus_t cudnnTransformFilter(
cudnnHandle_t handle,
const cudnnTensorTransformDescriptor_t transDesc,
const void *alpha,
const cudnnFilterDescriptor_t srcDesc,
const void *srcData,
const void *beta,
const cudnnFilterDescriptor_t destDesc,
void *destData);
This function copies the scaled data from the input filter srcDesc
to the output tensor destDesc
with a different layout. If the filter descriptors of srcDesc
and destDesc
have different dimensions, they must be consistent with folding and padding amount and order specified in transDesc
.
The srcDesc
and destDesc
tensors must not overlap in any way (that is, tensors cannot be transformed in place).
When performing a folding transform or a zeropadding transform, the scaling factors (alpha
, beta
) should be set to (1, 0). However, unfolding transforms support any (alpha
, beta
) values. This function is thread safe.
Parameters

Input. Handle to a previously created cuDNN context. For more information, refer to
cudnnHandle_t
. 
Input. A descriptor containing the details of the requested filter transformation. For more information, refer to
cudnnTensorTransformDescriptor_t
. 
Input. Pointers, in the host memory, to the scaling factors used to scale the data in the input tensor
srcDesc
.beta
is used to scale the destination tensor, whilealpha
is used to scale the source tensor. For more information, refer to Scaling Parameters.The beta scaling value is not honored in the folding and zeropadding cases. Unfolding supports any (
alpha
,beta
). 
Input. Handles to the previously initiated filter descriptors.
srcDesc
anddestDesc
must not overlap. For more information, refer tocudnnTensorDescriptor_t
. 
Input. Pointers, in the host memory, to the data of the tensor described by
srcDesc
. 
Output. Pointers, in the host memory, to the data of the tensor described by
destDesc
.
Returns
 The function launched successfully.

A parameter is uninitialized or initialized incorrectly, or the number of dimensions is different between
srcDesc
anddestDesc
. 
The function does not support the provided configuration. Also, in the folding and padding paths, any value other than
A=1
andB=0
will result in aCUDNN_STATUS_NOT_SUPPORTED
.  The function failed to launch on the GPU.
3.2.102. cudnnTransformTensor()
This function copies the scaled data from one tensor to another tensor with a different layout. Those descriptors need to have the same dimensions but not necessarily the same strides. The input and output tensors must not overlap in any way (meaning, tensors cannot be transformed in place). This function can be used to convert a tensor with an unsupported format to a supported one.
cudnnStatus_t cudnnTransformTensor(
cudnnHandle_t handle,
const void *alpha,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const void *beta,
const cudnnTensorDescriptor_t yDesc,
void *y)
Parameters
 Input. Handle to a previously created cuDNN context.

Input. Pointers to scaling factors (in host memory) used to blend the source value with prior value in the destination tensor as follows:
dstValue = alpha[0]*srcValue + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.

Input. Handle to a previously initialized tensor descriptor. For more information, refer to
cudnnTensorDescriptor_t
. 
Input. Pointer to data of the tensor described by the
xDesc
descriptor. 
Input. Handle to a previously initialized tensor descriptor. For more information, refer to
cudnnTensorDescriptor_t
. 
Output. Pointer to data of the tensor described by the
yDesc
descriptor.
Returns
 The function launched successfully.
 The function does not support the provided configuration.

The dimensions
n
,c
,h
,w
or thedataType
of the two tensor descriptors are different.  The function failed to launch on the GPU.
3.2.103. cudnnTransformTensorEx()
This function converts the tensor layouts between different formats. It can be used to convert a tensor with an unsupported layout format to a tensor with a supported layout format.
cudnnStatus_t cudnnTransformTensorEx(
cudnnHandle_t handle,
const cudnnTensorTransformDescriptor_t transDesc,
const void *alpha,
const cudnnTensorDescriptor_t srcDesc,
const void *srcData,
const void *beta,
const cudnnTensorDescriptor_t destDesc,
void *destData);
This function copies the scaled data from the input tensor srcDesc
to the output tensor destDesc
with a different layout. The tensor descriptors of srcDesc
and destDesc
should have the same dimensions but need not have the same strides.
The srcDesc
and destDesc
tensors must not overlap in any way (that is, tensors cannot be transformed in place).
When performing a folding transform or a zeropadding transform, the scaling factors (alpha,beta)
should be set to (1, 0). However, unfolding transforms support any (alpha,beta)
values. This function is thread safe.
Parameters

Input. Handle to a previously created cuDNN context. For more information, refer to
cudnnHandle_t
. 
Input. A descriptor containing the details of the requested tensor transformation. For more information, refer to
cudnnTensorTransformDescriptor_t
. 
Input. Pointers, in the host memory, to the scaling factors used to scale the data in the input tensor
srcDesc
.beta
is used to scale the destination tensor, whilealpha
is used to scale the source tensor. For more information, refer to Scaling Parameters.The beta scaling value is not honored in the folding and zeropadding cases. Unfolding supports any (
alpha
,beta
). 
Input. Handles to the previously initiated tensor descriptors.
srcDesc
anddestDesc
must not overlap. For more information, refer tocudnnTensorDescriptor_t
. 
Input. Pointers, in the host memory, to the data of the tensor described by
srcDesc
. 
Output. Pointers, in the host memory, to the data of the tensor described by
destDesc
.
Returns
 The function was launched successfully.

A parameter is uninitialized or initialized incorrectly, or the number of dimensions is different between
srcDesc
anddestDesc
. 
Function does not support the provided configuration. Also, in the folding and padding paths, any value other than
A=1
andB=0
will result in aCUDNN_STATUS_NOT_SUPPORTED
.  Function failed to launch on the GPU.
This entity contains common training routines and algorithms, such as batch normalization, softmax, dropout, and so on. The cudnn_ops_train
library depends on cudnn_ops_infer
.
4.1. API Functions
These are the API functions in the cudnn_ops_train.so
library.
4.1.1. cudnnActivationBackward()
This routine computes the gradient of a neuron activation function.
cudnnStatus_t cudnnActivationBackward(
cudnnHandle_t handle,
cudnnActivationDescriptor_t activationDesc,
const void *alpha,
const cudnnTensorDescriptor_t yDesc,
const void *y,
const cudnnTensorDescriptor_t dyDesc,
const void *dy,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const void *beta,
const cudnnTensorDescriptor_t dxDesc,
void *dx)
Inplace operation is allowed for this routine; meaning dy
and dx
pointers may be equal. However, this requires the corresponding tensor descriptors to be identical (particularly, the strides of the input and output must match for an inplace operation to be allowed).
All tensor formats are supported for 4 and 5 dimensions, however, the best performance is obtained when the strides of yDesc
and xDesc
are equal and HWpacked
. For more than 5 dimensions the tensors must have their spatial dimensions packed.
Parameters

Input. Handle to a previously created cuDNN context. For more information, refer to
cudnnHandle_t
. 
Input. Activation descriptor. For more information, refer to
cudnnActivationDescriptor_t
. 
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.

Input. Handle to the previously initialized input tensor descriptor. For more information, refer to
cudnnTensorDescriptor_t
. 
Input. Data pointer to GPU memory associated with the tensor descriptor
yDesc
.  Input. Handle to the previously initialized input differential tensor descriptor.

Input. Data pointer to GPU memory associated with the tensor descriptor
dyDesc
.  Input. Handle to the previously initialized output tensor descriptor.

Input. Data pointer to GPU memory associated with the output tensor descriptor
xDesc
.  Input. Handle to the previously initialized output differential tensor descriptor.

Output. Data pointer to GPU memory associated with the output tensor descriptor
dxDesc
.
Returns
 The function launched successfully.

At least one of the following conditions are met:
 The strides
nStride
,cStride
,hStride
andwStride
of the input differential tensor and output differential tensor differ and inplace operation is used.
 The strides

The function does not support the provided configuration. Refer to the following examples of nonsupported configurations:
 The dimensions
n
,c
,h
, andw
of the input tensor and output tensor differ.  The
datatype
of the input tensor and output tensor differs.  The strides
nStride
,cStride
,hStride
, andwStride
of the input tensor and the input differential tensor differ.  The strides
nStride
,cStride
,hStride
, andwStride
of the output tensor and the output differential tensor differ.
 The dimensions
 The function failed to launch on the GPU.
4.1.2. cudnnBatchNormalizationBackward()
This function performs the backward batch normalization layer computation. This layer is based on the Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift paper.
cudnnStatus_t cudnnBatchNormalizationBackward(
cudnnHandle_t handle,
cudnnBatchNormMode_t mode,
const void *alphaDataDiff,
const void *betaDataDiff,
const void *alphaParamDiff,
const void *betaParamDiff,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const cudnnTensorDescriptor_t dyDesc,
const void *dy,
const cudnnTensorDescriptor_t dxDesc,
void *dx,
const cudnnTensorDescriptor_t bnScaleBiasDiffDesc,
const void *bnScale,
void *resultBnScaleDiff,
void *resultBnBiasDiff,
double epsilon,
const void *savedMean,
const void *savedInvVariance)
Only 4D and 5D tensors are supported.
The epsilon
value has to be the same during training, backpropagation, and inference.
Higher performance can be obtained when HWpacked
tensors are used for all of x
, dy
, and dx
.
For more information, refer to cudnnDeriveBNTensorDescriptor()
for the secondary tensor descriptor generation for the parameters used in this function.
Parameters

Input. Handle to a previously created cuDNN library descriptor. For more information, refer to
cudnnHandle_t
. 
Input. Mode of operation (spatial or peractivation). For more information, refer to
cudnnBatchNormMode_t
. 
Inputs. Pointers to scaling factors (in host memory) used to blend the gradient output
dx
with a prior value in the destination tensor as follows:dstValue = alphaDataDiff[0]*resultValue + betaDataDiff[0]*priorDstValue
For more information, refer to Scaling Parameters.

Inputs. Pointers to scaling factors (in host memory) used to blend the gradient outputs
resultBnScaleDiff
andresultBnBiasDiff
with prior values in the destination tensor as follows:dstValue = alphaParamDiff[0]*resultValue + betaParamDiff[0]*priorDstValue
For more information, refer to Scaling Parameters.
 Inputs. Handles to the previously initialized tensor descriptors.

Inputs. Data pointer to GPU memory associated with the tensor descriptor
xDesc
, for the layer’sx
data. 
Inputs. Data pointer to GPU memory associated with the tensor descriptor
dyDesc
, for the backpropagated differentialdy
input. 
Inputs/Outputs. Data pointer to GPU memory associated with the tensor descriptor
dxDesc
, for the resulting differential output with respect tox
. 
Input. Shared tensor descriptor for the following five tensors:
bnScale
,resultBnScaleDiff
,resultBnBiasDiff
,savedMean
, andsavedInvVariance
. The dimensions for this tensor descriptor are dependent on normalization mode. For more information, refer tocudnnDeriveBNTensorDescriptor()
.Note:The data type of this tensor descriptor must be
float
for FP16 and FP32 input tensors, anddouble
for FP64 input tensors. 
Input. Pointer in the device memory for the batch normalization
scale
parameter (in the original paper the quantityscale
is referred to as gamma).Note:The
bnBias
parameter is not needed for this layer's computation.  Outputs. Pointers in device memory for the resulting scale and bias differentials computed by this routine. Note that these scale and bias gradients are weight gradients specific to this batch normalization operation, and by definition are not backpropagated.

Input. Epsilon value used in batch normalization formula. Its value should be equal to or greater than the value defined for
CUDNN_BN_MIN_EPSILON
incudnn.h
. The sameepsilon
value should be used in forward and backward functions. 
Inputs. Optional cache parameters containing saved intermediate results that were computed during the forward pass. For this to work correctly, the layer's
x
andbnScale
data have to remain unchanged until this backward function is called.Note:Both these parameters can be
NULL
but only at the same time. It is recommended to use this cache since the memory overhead is relatively small.
Supported configurations
This function supports the following combinations of data types for various descriptors.
Data Type Configurations  xDesc 
bnScaleBiasMeanVarDesc 
alpha , beta 
yDesc 

PSEUDO_HALF_CONFIG 
CUDNN_DATA_HALF 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_HALF 
FLOAT_CONFIG 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
DOUBLE_CONFIG 
CUDNN_DATA_DOUBLE 
CUDNN_DATA_DOUBLE 
CUDNN_DATA_DOUBLE 
CUDNN_DATA_DOUBLE 
PSEUDO_BFLOAT16_CONFIG 
CUDNN_DATA_BFLOAT16 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_BFLOAT16 
Returns
 The computation was performed successfully.
 The function does not support the provided configuration.

At least one of the following conditions are met:
 Any of the pointers
alpha
,beta
,x
,dy
,dx
,bnScale
,resultBnScaleDiff
, andresultBnBiasDiff
isNULL
.  The number of
xDesc
,yDesc
ordxDesc
tensor descriptor dimensions is not within the range of[4,5]
(only 4D and 5D tensors are supported). bnScaleBiasDiffDesc
dimensions are not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for peractivation mode. Exactly one of
savedMean
,savedInvVariance
pointers isNULL
. epsilon
value is less thanCUDNN_BN_MIN_EPSILON
. Dimensions or data types mismatch for any pair of
xDesc
,dyDesc
, anddxDesc
.
 Any of the pointers
4.1.3. cudnnBatchNormalizationBackwardEx()
This function is an extension of the cudnnBatchNormalizationBackward()
for performing the backward batch normalization layer computation with a fast NHWC semipersistent kernel.
cudnnStatus_t cudnnBatchNormalizationBackwardEx (
cudnnHandle_t handle,
cudnnBatchNormMode_t mode,
cudnnBatchNormOps_t bnOps,
const void *alphaDataDiff,
const void *betaDataDiff,
const void *alphaParamDiff,
const void *betaParamDiff,
const cudnnTensorDescriptor_t xDesc,
const void *xData,
const cudnnTensorDescriptor_t yDesc,
const void *yData,
const cudnnTensorDescriptor_t dyDesc,
const void *dyData,
const cudnnTensorDescriptor_t dzDesc,
void *dzData,
const cudnnTensorDescriptor_t dxDesc,
void *dxData,
const cudnnTensorDescriptor_t dBnScaleBiasDesc,
const void *bnScaleData,
const void *bnBiasData,
void *dBnScaleData,
void *dBnBiasData,
double epsilon,
const void *savedMean,
const void *savedInvVariance,
const cudnnActivationDescriptor_t activationDesc,
void *workspace,
size_t workSpaceSizeInBytes
void *reserveSpace
size_t reserveSpaceSizeInBytes);
This API will trigger the new semipersistent NHWC kernel when the following conditions are true:
 All tensors, namely,
x
,y
,dz
,dy
anddx
must be NHWCfully packed, and must be of the typeCUDNN_DATA_HALF
.  The input parameter
mode
must be set toCUDNN_BATCHNORM_SPATIAL_PERSISTENT
. workspace
is notNULL
. Before cuDNN version 8.2.0, the tensor
C
dimension should always be a multiple of 4. After 8.2.0, the tensorC
dimension should be a multiple of 4 only whenbnOps
isCUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION
. workSpaceSizeInBytes
is equal to or larger than the amount required bycudnnGetBatchNormalizationBackwardExWorkspaceSize()
.reserveSpaceSizeInBytes
is equal to or larger than the amount required bycudnnGetBatchNormalizationTrainingExReserveSpaceSize()
. The content in
reserveSpace
stored bycudnnBatchNormalizationForwardTrainingEx()
must be preserved.
If workspace
is NULL
and workSpaceSizeInBytes
of zero is passed in, this API will function exactly like the nonextended function cudnnBatchNormalizationBackward
.
This workspace
is not required to be clean. Moreover, the workspace
does not have to remain unchanged between the forward and backward pass, as it is not used for passing any information.
This extended function can accept a *workspace
pointer to the GPU workspace, and workSpaceSizeInBytes
, the size of the workspace, from the user.
The bnOps
input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by elementwise addition and then activation.
Only 4D and 5D tensors are supported. The epsilon
value has to be the same during the training, the backpropagation, and the inference.
When the tensor layout is NCHW, higher performance can be obtained when HWpacked tensors are used for x
, dy
, and dx
.
Parameters

Input. Handle to a previously created cuDNN library descriptor. For more information, refer to
cudnnHandle_t
. 
Input. Mode of operation (spatial or peractivation). For more information, refer to
cudnnBatchNormMode_t
. 
Input. Mode of operation. Currently,
CUDNN_BATCHNORM_OPS_BN_ACTIVATION
andCUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION
are only supported in the NHWC layout. For more information, refer tocudnnBatchNormOps_t
. This input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by elementwise addition and then activation. 
Inputs. Pointers to scaling factors (in host memory) used to blend the gradient output
dx
with a prior value in the destination tensor as follows:dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.

Inputs. Pointers to scaling factors (in host memory) used to blend the gradient outputs
dBnScaleData
anddBnBiasData
with prior values in the destination tensor as follows:dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.

Inputs. Tensor descriptors and pointers in the device memory for the layer's
x
data, backpropagated gradient inputdy
, the original forward outputy
data.yDesc
andyData
are not needed ifbnOps
is set toCUDNN_BATCHNORM_OPS_BN
, users may passNULL
. For more information, refer tocudnnTensorDescriptor_t
. 
Inputs. Tensor descriptors and pointers in the device memory for the computed gradient output
dz
, anddx
.dzDesc
is not needed whenbnOps
isCUDNN_BATCHNORM_OPS_BN
orCUDNN_BATCHNORM_OPS_BN_ACTIVATION
, users may passNULL
. For more information, refer tocudnnTensorDescriptor_t
. 
Outputs. Tensor descriptors and pointers in the device memory for the computed gradient output
dz
, anddx
.*dzData
is not needed whenbnOps
isCUDNN_BATCHNORM_OPS_BN
orCUDNN_BATCHNORM_OPS_BN_ACTIVATION
, users may passNULL
. For more information, refer tocudnnTensorDescriptor_t
. 
Input. Shared tensor descriptor for the following six tensors:
bnScaleData
,bnBiasData
,dBnScaleData
,dBnBiasData
,savedMean
, andsavedInvVariance
. For more information, refer tocudnnDeriveBNTensorDescriptor()
.The dimensions for this tensor descriptor are dependent on normalization mode.
Note:The data type of this tensor descriptor must be
float
for FP16 and FP32 input tensors anddouble
for FP64 input tensors.For more information, refer to
cudnnTensorDescriptor_t
.
 Input. Pointer in the device memory for the batch normalization scale parameter (in the Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift paper, the quantity scale is referred to as gamma).
 Input. Pointers in the device memory for the batch normalization bias parameter (in the Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift paper, bias is referred to as beta). This parameter is used only when activation should be performed.

Outputs. Pointers in the device memory for the gradients of
bnScaleData
andbnBiasData
, respectively. 
Input. Epsilon value used in batch normalization formula. Its value should be equal to or greater than the value defined for
CUDNN_BN_MIN_EPSILON
incudnn.h
. The same epsilon value should be used in forward and backward functions. 
Inputs. Optional cache parameters containing saved intermediate results computed during the forward pass. For this to work correctly, the layer's
x
andbnScaleData
,bnBiasData
data has to remain unchanged until this backward function is called. Note that both these parameters can beNULL
but only at the same time. It is recommended to use this cache since the memory overhead is relatively small. 
Input. Descriptor for the activation operation. When the
bnOps
input is set to eitherCUDNN_BATCHNORM_OPS_BN_ACTIVATION
orCUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION
then this activation is used, otherwise the user may passNULL
. 
Input. Pointer to the GPU workspace. If
workspace
isNULL
andworkSpaceSizeInBytes
of zero is passed in, then this API will function exactly like the nonextended functioncudnnBatchNormalizationBackward()
.  Input. The size of the workspace. It must be large enough to trigger the fast NHWC semipersistent kernel by this function.

Input. Pointer to the GPU workspace for the
reserveSpace
. 
Input. The size of the
reserveSpace
. It must be equal or larger than the amount required bycudnnGetBatchNormalizationTrainingExReserveSpaceSize()
.
Supported configurations
This function supports the following combinations of data types for various descriptors.
Data Type Configurations  xDesc 
bnScaleBiasMeanVarDesc 
alpha , beta 
yDesc 

PSEUDO_HALF_CONFIG 
CUDNN_DATA_HALF 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_HALF 
FLOAT_CONFIG 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
DOUBLE_CONFIG 
CUDNN_DATA_DOUBLE 
CUDNN_DATA_DOUBLE 
CUDNN_DATA_DOUBLE 
CUDNN_DATA_DOUBLE 
PSEUDO_BFLOAT16_CONFIG 
CUDNN_DATA_BFLOAT16 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_BFLOAT16 
Returns
 The computation was performed successfully.
 The function does not support the provided configuration.

At least one of the following conditions are met:
 Any of the pointers
alphaDataDiff
,betaDataDiff
,alphaParamDiff
,betaParamDiff
,x
,dy
,dx
,bnScale
,resultBnScaleDiff
, andresultBnBiasDiff
isNULL
.  The number of
xDesc
,yDesc
, ordxDesc
tensor descriptor dimensions is not within the range of[4,5]
(only 4D and 5D tensors are supported). dBnScaleBiasDesc
dimensions not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for peractivation mode. Exactly one of
savedMean
,savedInvVariance
pointers isNULL
. epsilon
value is less thanCUDNN_BN_MIN_EPSILON
. Dimensions or data types mismatch for any pair of
xDesc
,dyDesc
, ordxDesc
.
 Any of the pointers
4.1.4. cudnnBatchNormalizationForwardTraining()
This function performs the forward batch normalization layer computation for the training phase. This layer is based on the Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift paper.
cudnnStatus_t cudnnBatchNormalizationForwardTraining(
cudnnHandle_t handle,
cudnnBatchNormMode_t mode,
const void *alpha,
const void *beta,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const cudnnTensorDescriptor_t yDesc,
void *y,
const cudnnTensorDescriptor_t bnScaleBiasMeanVarDesc,
const void *bnScale,
const void *bnBias,
double exponentialAverageFactor,
void *resultRunningMean,
void *resultRunningVariance,
double epsilon,
void *resultSaveMean,
void *resultSaveInvVariance)
Only 4D and 5D tensors are supported.
The epsilon value has to be the same during training, backpropagation, and inference.
For the inference phase, use cudnnBatchNormalizationForwardInference.
Higher performance can be obtained when HWpacked tensors are used for both x and y.
Refer to cudnnDeriveBNTensorDescriptor()
for the secondary tensor descriptor generation for the parameters used in this function.
Parameters

handle

Handle to a previously created cuDNN library descriptor. For more information, refer to
cudnnHandle_t
. 
mode

Mode of operation (spatial or peractivation). For more information, refer to
cudnnBatchNormMode_t
. 
alpha
,beta

Inputs. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.

xDesc
,yDesc

Tensor descriptors and pointers in device memory for the layer's
x
andy
data. For more information, refer tocudnnTensorDescriptor_t
. 
*x

Input. Data pointer to GPU memory associated with the tensor descriptor
xDesc
, for the layer’sx
input data. 
*y

Input. Data pointer to GPU memory associated with the tensor descriptor
yDesc
, for they
output of the batch normalization layer. 
bnScaleBiasMeanVarDesc

Shared tensor descriptor
desc
for the secondary tensor that was derived bycudnnDeriveBNTensorDescriptor()
. The dimensions for this tensor descriptor are dependent on the normalization mode. 
bnScale
,bnBias

Inputs. Pointers in device memory for the batch normalization scale and bias parameters (in the Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift paper, bias is referred to as beta and scale as gamma). Note that
bnBias
parameter can replace the previous layer's bias parameter for improved efficiency. 
exponentialAverageFactor

Input. Factor used in the moving average computation as follows:
runningMean = runningMean*(1factor) + newMean*factor
Use afactor=1/(1+n)
atN
th call to the function to get the Cumulative Moving Average (CMA) behavior, for example:CMA[n] = (x[1]+...+x[n])/n
For example:CMA[n+1] = (n*CMA[n]+x[n+1])/(n+1) = ((n+1)*CMA[n]CMA[n])/(n+1) + x[n+1]/(n+1) = CMA[n]*(11/(n+1))+x[n+1]*1/(n+1) = CMA[n]*(1factor) + x(n+1)*factor

resultRunningMean
,resultRunningVariance

Inputs/Outputs. Running mean and variance tensors (these have the same descriptor as the bias and scale). Both of these pointers can be
NULL
but only at the same time. The value stored inresultRunningVariance
(or passed as an input in inference mode) is the sample variance and is the moving average ofvariance[x]
where the variance is computed either over batch or spatial+batch dimensions depending on the mode. If these pointers are notNULL
, the tensors should be initialized to some reasonable values or to 0. 
epsilon

Input. Epsilon value used in the batch normalization formula. Its value should be equal to or greater than the value defined for
CUDNN_BN_MIN_EPSILON
incudnn.h
. The sameepsilon
value should be used in forward and backward functions. 
resultSaveMean
,resultSaveInvVariance

Outputs. Optional cache to save intermediate results computed during the forward pass. These buffers can be used to speed up the backward pass when supplied to the
cudnnBatchNormalizationBackward()
function. The intermediate results stored inresultSaveMean
andresultSaveInvVariance
buffers should not be used directly by the user. Depending on the batch normalization mode, the results stored inresultSaveInvVariance
may vary. For the cache to work correctly, the input layer data must remain unchanged until the backward function is called. Note that both parameters can beNULL
but only at the same time. In such a case, intermediate statistics will not be saved, andcudnnBatchNormalizationBackward()
will have to recompute them. It is recommended to use this cache as the memory overhead is relatively small because these tensors have a much lower product of dimensions than the data tensors.
Supported configurations
This function supports the following combinations of data types for various descriptors.
Data Type Configurations  xDesc 
bnScaleBiasMeanVarDesc 
alpha , beta 
yDesc 

PSEUDO_HALF_CONFIG 
CUDNN_DATA_HALF 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_HALF 
FLOAT_CONFIG 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
DOUBLE_CONFIG 
CUDNN_DATA_DOUBLE 
CUDNN_DATA_DOUBLE 
CUDNN_DATA_DOUBLE 
CUDNN_DATA_DOUBLE 
PSEUDO_BFLOAT16_CONFIG 
CUDNN_DATA_BFLOAT16 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_BFLOAT16 
Returns
 The computation was performed successfully.
 The function does not support the provided configuration.

At least one of the following conditions are met:
 One of the pointers
alpha
,beta
,x
,y
,bnScale
, andbnBias
isNULL
.  The number of
xDesc
oryDesc
tensor descriptor dimensions is not within the range of[4,5]
(only 4D and 5D tensors are supported). bnScaleBiasMeanVarDesc
dimensions are not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for peractivation mode. Exactly one of
resultSaveMean
,resultSaveInvVariance
pointers areNULL
.  Exactly one of
resultRunningMean
,resultRunningInvVariance
pointers areNULL
. epsilon
value is less thanCUDNN_BN_MIN_EPSILON
. Dimensions or data types mismatch for
xDesc
oryDesc
.
 One of the pointers
4.1.5. cudnnBatchNormalizationForwardTrainingEx()
This function is an extension of the cudnnBatchNormalizationForwardTraining()
for performing the forward batch normalization layer computation.
cudnnStatus_t cudnnBatchNormalizationForwardTrainingEx(
cudnnHandle_t handle,
cudnnBatchNormMode_t mode,
cudnnBatchNormOps_t bnOps,
const void *alpha,
const void *beta,
const cudnnTensorDescriptor_t xDesc,
const void *xData,
const cudnnTensorDescriptor_t zDesc,
const void *zData,
const cudnnTensorDescriptor_t yDesc,
void *yData,
const cudnnTensorDescriptor_t bnScaleBiasMeanVarDesc,
const void *bnScaleData,
const void *bnBiasData,
double exponentialAverageFactor,
void *resultRunningMeanData,
void *resultRunningVarianceData,
double epsilon,
void *saveMean,
void *saveInvVariance,
const cudnnActivationDescriptor_t activationDesc,
void *workspace,
size_t workSpaceSizeInBytes
void *reserveSpace
size_t reserveSpaceSizeInBytes);
This API will trigger the new semipersistent NHWC kernel when the following conditions are true:
 All tensors, namely,
x
,y
,dz
,dy
anddx
must be NHWCfully packed and must be of the typeCUDNN_DATA_HALF
.  The input parameter
mode
must be set toCUDNN_BATCHNORM_SPATIAL_PERSISTENT
. workspace
is notNULL
. Before cuDNN version 8.2.0, the tensor
C
dimension should always be a multiple of 4. After 8.2.0, the tensorC
dimension should be a multiple of 4 only whenbnOps
isCUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION
. workSpaceSizeInBytes
is equal to or larger than the amount required bycudnnGetBatchNormalizationForwardTrainingExWorkspaceSize()
.reserveSpaceSizeInBytes
is equal to or larger than the amount required bycudnnGetBatchNormalizationTrainingExReserveSpaceSize()
. The content in
reserveSpace
stored bycudnnBatchNormalizationForwardTrainingEx()
must be preserved.
If workspace
is NULL
and workSpaceSizeInBytes
of zero is passed in, this API will function exactly like the nonextended function cudnnBatchNormalizationForwardTraining()
.
This workspace is not required to be clean. Moreover, the workspace does not have to remain unchanged between the forward and backward pass, as it is not used for passing any information.
This extended function can accept a *workspace
pointer to the GPU workspace, and workSpaceSizeInBytes
, the size of the workspace, from the user.
The bnOps
input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by elementwise addition and then activation.
Only 4D and 5D tensors are supported. The epsilon
value has to be the same during the training, the backpropagation, and the inference.
When the tensor layout is NCHW, higher performance can be obtained when HWpacked tensors are used for x
, dy
, and dx
.
Parameters

Handle to a previously created cuDNN library descriptor. For more information, refer to
cudnnHandle_t
. 
Mode of operation (spatial or peractivation). For more information, refer to
cudnnBatchNormMode_t
. 
Input. Mode of operation for the fast NHWC kernel. For more information, refer to
cudnnBatchNormOps_t
. This input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by elementwise addition and then activation. 
Inputs. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.

Tensor descriptors and pointers in device memory for the layer's input
x
and outputy
, and for the optionalz
tensor input for residual addition to the result of the batch normalization operation, prior to the activation. The optionalzDes
and*zData
descriptors are only used whenbnOps
isCUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION
, otherwise users may passNULL
. When in use,z
should have exactly the same dimension asx
and the final outputy
. For more information, refer tocudnnTensorDescriptor_t
. 
Shared tensor descriptor
desc
for the secondary tensor that was derived bycudnnDeriveBNTensorDescriptor()
. The dimensions for this tensor descriptor are dependent on the normalization mode. 
Inputs. Pointers in device memory for the batch normalization scale and bias parameters (in the Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift paper, bias is referred to as beta and scale as gamma). Note that
bnBiasData
parameter can replace the previous layer's bias parameter for improved efficiency. 
Input. Factor used in the moving average computation as follows:
runningMean = runningMean*(1factor) + newMean*factor
factor=1/(1+n)
atN
th call to the function to get the Cumulative Moving Average (CMA) behavior, for example:CMA[n] = (x[1]+...+x[n])/n
For example:CMA[n+1] = (n*CMA[n]+x[n+1])/(n+1) = ((n+1)*CMA[n]CMA[n])/(n+1) + x[n+1]/(n+1) = CMA[n]*(11/(n+1))+x[n+1]*1/(n+1) = CMA[n]*(1factor) + x(n+1)*factor

Inputs/Outputs. Pointers to the running mean and running variance data. Both these pointers can be
NULL
but only at the same time. The value stored inresultRunningVarianceData
(or passed as an input in inference mode) is the sample variance and is the moving average ofvariance[x]
where the variance is computed either over batch or spatial+batch dimensions depending on the mode. If these pointers are notNULL
, the tensors should be initialized to some reasonable values or to0
. 
Input. Epsilon value used in the batch normalization formula. Its value should be equal to or greater than the value defined for
CUDNN_BN_MIN_EPSILON
incudnn.h
. The sameepsilon
value should be used in forward and backward functions. 
Outputs. Optional cache parameters containing saved intermediate results computed during the forward pass. For this to work correctly, the layer's
x
andbnScaleData
,bnBiasData
data has to remain unchanged until this backward function is called. Note that both these parameters can beNULL
but only at the same time. It is recommended to use this cache since the memory overhead is relatively small. 
Input. The tensor descriptor for the activation operation. When the
bnOps
input is set to eitherCUDNN_BATCHNORM_OPS_BN_ACTIVATION
orCUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION
then this activation is used, otherwise user may passNULL
. 
Inputs.
*workspace
is a pointer to the GPU workspace, andworkSpaceSizeInBytes
is the size of the workspace. When*workspace
is notNULL
and*workSpaceSizeInBytes
is large enough, and the tensor layout is NHWC and the data type configuration is supported, then this function will trigger a new semipersistent NHWC kernel for batch normalization. The workspace is not required to be clean. Also, the workspace does not need to remain unchanged between the forward and backward passes. 
Input. Pointer to the GPU workspace for the
reserveSpace
. 
Input. The size of the
reserveSpace
. Must be equal or larger than the amount required bycudnnGetBatchNormalizationTrainingExReserveSpaceSize()
.
Supported configurations
This function supports the following combinations of data types for various descriptors.
Data Type Configurations  xDesc 
bnScaleBiasMeanVarDesc 
alpha , beta 
yDesc 

PSEUDO_HALF_CONFIG 
CUDNN_DATA_HALF 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_HALF 
FLOAT_CONFIG 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
DOUBLE_CONFIG 
CUDNN_DATA_DOUBLE 
CUDNN_DATA_DOUBLE 
CUDNN_DATA_DOUBLE 
CUDNN_DATA_DOUBLE 
PSEUDO_BFLOAT16_CONFIG 
CUDNN_DATA_BFLOAT16 
CUDNN_DATA_FLOAT 
CUDNN_DATA_FLOAT 
CUDNN_DATA_BFLOAT16 
Returns

CUDNN_STATUS_SUCCESS
 The computation was performed successfully.

CUDNN_STATUS_NOT_SUPPORTED
 The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:
 One of the pointers
alpha
,beta
,x
,y
,bnScaleData
, andbnBiasData
isNULL
.  The number of
xDesc
oryDesc
tensor descriptor dimensions is not within the[4,5]
range (only 4D and 5D tensors are supported). bnScaleBiasMeanVarDesc
dimensions are not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for peractivation mode. Exactly one of
saveMean
,saveInvVariance
pointers areNULL
.  Exactly one of
resultRunningMeanData
,resultRunningInvVarianceData
pointers areNULL
. epsilon
value is less thanCUDNN_BN_MIN_EPSILON
. Dimensions or data types mismatch for
xDesc
andyDesc
.
 One of the pointers
4.1.6. cudnnDivisiveNormalizationBackward()
This function performs the backward DivisiveNormalization
layer computation.
cudnnStatus_t cudnnDivisiveNormalizationBackward(
cudnnHandle_t handle,
cudnnLRNDescriptor_t normDesc,
cudnnDivNormMode_t mode,
const void *alpha,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const void *means,
const void *dy,
void *temp,
void *temp2,
const void *beta,
const cudnnTensorDescriptor_t dxDesc,
void *dx,
void *dMeans)
Supported tensor formats are NCHW for 4D and NCDHW for 5D with any nonoverlapping nonnegative strides. Only 4D and 5D tensors are supported.
Parameters
 Input. Handle to a previously created cuDNN library descriptor.

Input. Handle to a previously initialized LRN parameter descriptor (this descriptor is used for both LRN and
DivisiveNormalization
layers). 
Input.
DivisiveNormalization
layer mode of operation. Currently onlyCUDNN_DIVNORM_PRECOMPUTED_MEANS
is implemented. Normalization is performed using the means input tensor that is expected to be precomputed by the user. 
Input. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, refer to Scaling Parameters.

Input. Tensor descriptor and pointers in device memory for the layer's x and means data. Note that the
means
tensor is expected to be precomputed by the user. It can also contain any valid values (not required to be actualmeans
, and can be for instance a result of a convolution with a Gaussian kernel). 
Input. Tensor pointer in device memory for the layer's
dy
cumulative loss differential data (error backpropagation). 
Workspace. Temporary tensors in device memory. These are used for computing intermediate values during the backward pass. These tensors do not have to be preserved from forward to backward pass. Both use
xDesc
as a descriptor. 
Input. Tensor descriptor for
dx
anddMeans
. 
Output. Tensor pointers (in device memory) for the layers resulting in cumulative gradients
dx
anddMeans
(dLoss/dx
anddLoss/dMeans
). Both share the same descriptor.
Returns
 The computation was performed successfully.

At least one of the following conditions are met:
 One of the tensor pointers
x
,dx
,temp
,tmep2
, anddy
isNULL
.  Number of any of the input or output tensor dimensions is not within the
[4,5]
range.  Either alpha or beta pointer is
NULL
.  A mismatch in dimensions between
xDesc
anddxDesc
.  LRN descriptor parameters are outside of their valid ranges.
 Any of the tensor strides is negative.
 One of the tensor pointers
 The function does not support the provided configuration, for example, any of the input and output tensor strides mismatch (for the same dimension) is a nonsupported configuration.
4.1.7. cudnnDropoutBackward()
This function performs backward dropout operation over dy
returning results in dx
. If during forward dropout operation value from x
was propagated to y
then during backward operation value from dy
will be propagated to dx
, otherwise, dx
value will be set to 0
.
cudnnStatus_t cudnnDropoutBackward(
cudnnHandle_t handle,
const cudnnDropoutDescriptor_t dropoutDesc,
const cudnnTensorDescriptor_t dydesc,
const void *dy,
const cudnnTensorDescriptor_t dxdesc,
void *dx,
void *reserveSpace,
size_t reserveSpaceSizeInBytes)
Better performance is obtained for fully packed tensors.
Parameters
 Input. Handle to a previously created cuDNN context.
 Input. Previously created dropout descriptor object.
 Input. Handle to a previously initialized tensor descriptor.

Input. Pointer to data of the tensor described by the
dyDesc
descriptor.  Input. Handle to a previously initialized tensor descriptor.

Output. Pointer to data of the tensor described by the
dxDesc
descriptor. 
Input. Pointer to userallocated GPU memory used by this function. It is expected that
reserveSpace
was populated during a call tocudnnDropoutForward
and has not been changed.  Input. Specifies the size in bytes of the provided memory for the reserve space.
Returns
 The call was successful.
 The function does not support the provided configuration.

At least one of the following conditions are met:
 The number of elements of input tensor and output tensors differ.
 The
datatype
of the input tensor and output tensors differs.  The strides of the input tensor and output tensors differ and inplace operation is used (i.e.,
x
andy
pointers are equal).  The provided
reserveSpaceSizeInBytes
is less than the value returned bycudnnDropoutGetReserveSpaceSize
. cudnnSetDropoutDescriptor
has not been called ondropoutDesc
with the nonNULL
states
argument.
 The function failed to launch on the GPU.
4.1.8. cudnnGetBatchNormalizationBackwardExWorkspaceSize()
This function returns the amount of GPU memory workspace the user should allocate to be able to call cudnnGetBatchNormalizationBackwardExWorkspaceSize()
function for the specified bnOps
input setting. The workspace allocated will then be passed to the function cudnnGetBatchNormalizationBackwardExWorkspaceSize()
.
cudnnStatus_t cudnnGetBatchNormalizationBackwardExWorkspaceSize(
cudnnHandle_t handle,
cudnnBatchNormMode_t mode,
cudnnBatchNormOps_t bnOps,
const cudnnTensorDescriptor_t xDesc,
const cudnnTensorDescriptor_t yDesc,
const cudnnTensorDescriptor_t dyDesc,
const cudnnTensorDescriptor_t dzDesc,
const cudnnTensorDescriptor_t dxDesc,
const cudnnTensorDescriptor_t dBnScaleBiasDesc,
const cudnnActivationDescriptor_t activationDesc,
size_t *sizeInBytes);
Parameters

Input. Handle to a previously created cuDNN library descriptor. For more information, refer to
cudnnHandle_t
. 
Input. Mode of operation (spatial or peractivation). For more information, refer to
cudnnBatchNormMode_t
. 
Input. Mode of operation for the fast NHWC kernel. For more information, refer to
cudnnBatchNormOps_t
. This input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by elementwise addition and then activation. 
Tensor descriptors and pointers in the device memory for the layer's
x
data, back propagated differentialdy
(inputs), the optionaly
input data, the optionaldz
output, and thedx
output, which is the resulting differential with respect tox
. For more information, refer tocudnnTensorDescriptor_t
. 
Input. Shared tensor descriptor for the following six tensors:
bnScaleData
,bnBiasData
,dBnScaleData
,dBnBiasData
,savedMean
, andsavedInvVariance
. This is the shared tensor descriptor desc for the secondary tensor that was derived bycudnnDeriveBNTensorDescriptor()
. The dimensions for this tensor descriptor are dependent on normalization mode. Note that the data type of this tensor descriptor must befloat
for FP16 and FP32 input tensors, anddouble
for FP64 input tensors. 
Input. Descriptor for the activation operation. When the
bnOps
input is set to eitherCUDNN_BATCHNORM_OPS_BN_ACTIVATION
orCUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION
, then this activation is used, otherwise user may passNULL
. 
Output. Amount of GPU memory required for the workspace, as determined by this function, to be able to execute the
cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize()
function with the specifiedbnOps
input setting.
Returns
 The computation was performed successfully.
 The function does not support the provided configuration.

At least one of the following conditions are met:
 Number of
xDesc
,yDesc
ordxDesc
tensor descriptor dimensions is not within the range of[4,5]
(only 4D and 5D tensors are supported). dBnScaleBiasDesc
dimensions not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for peractivation mode. Dimensions or data types mismatch for any pair of
xDesc
,dyDesc
, ordxDesc
.
 Number of
4.1.9. cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize()
This function returns the amount of GPU memory workspace the user should allocate to be able to call cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize()
function for the specified bnOps
input setting. The workspace allocated should then be passed by the user to the function cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize()
.
cudnnStatus_t cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize(
cudnnHandle_t handle,
cudnnBatchNormMode_t mode,
cudnnBatchNormOps_t bnOps,
const cudnnTensorDescriptor_t xDesc,
const cudnnTensorDescriptor_t zDesc,
const cudnnTensorDescriptor_t yDesc,
const cudnnTensorDescriptor_t bnScaleBiasMeanVarDesc,
const cudnnActivationDescriptor_t activationDesc,
size_t *sizeInBytes);
Parameters

Input. Handle to a previously created cuDNN library descriptor. For more information, refer to
cudnnHandle_t
. 
Input. Mode of operation (spatial or peractivation). For more information, refer to
cudnnBatchNormMode_t
. 
Input. Mode of operation for the fast NHWC kernel. For more information, refer to
cudnnBatchNormOps_t
. This input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by elementwise addition and then activation. 
Tensor descriptors and pointers in the device memory for the layer's
x
data, the optionalz
input data, and they
output.zDesc
is only needed whenbnOps
isCUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION
, otherwise the user may passNULL
. For more information, refer tocudnnTensorDescriptor_t
. 
Input. Shared tensor descriptor for the following six tensors:
bnScaleData, bnBiasData, dBnScaleData, dBnBiasData, savedMean,
andsavedInvVariance
. This is the shared tensor descriptor desc for the secondary tensor that was derived bycudnnDeriveBNTensorDescriptor()
. The dimensions for this tensor descriptor are dependent on normalization mode. Note that the data type of this tensor descriptor must befloat
for FP16 and FP32 input tensors, anddouble
for FP64 input tensors. 
Input. Descriptor for the activation operation. When the
bnOps
input is set to eitherCUDNN_BATCHNORM_OPS_BN_ACTIVATION
orCUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION
then this activation is used, otherwise the user may passNULL
. 
Output. Amount of GPU memory required for the workspace, as determined by this function, to be able to execute the
cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize()
function with the specifiedbnOps
input setting.
Returns

CUDNN_STATUS_SUCCESS
 The computation was performed successfully.

CUDNN_STATUS_NOT_SUPPORTED
 The function does not support the provided configuration.

CUDNN_STATUS_BAD_PARAM

At least one of the following conditions are met:
 Number of
xDesc
,yDesc
ordxDesc
tensor descriptor dimensions is not within the range of[4,5]
(only 4D and 5D tensors are supported). dBnScaleBiasDesc
dimensions not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for peractivation mode. Dimensions or data types mismatch for
xDesc
oryDesc
.
 Number of
4.1.10. cudnnGetBatchNormalizationTrainingExReserveSpaceSize()
This function returns the amount of reserve GPU memory workspace the user should allocate for the batch normalization operation, for the specified bnOps
input setting. In contrast to the workspace
, the reserved space should be preserved between the forward and backward calls, and the data should not be altered.
cudnnStatus_t cudnnGetBatchNormalizationTrainingExReserveSpaceSize(
cudnnHandle_t handle,
cudnnBatchNormMode_t mode,
cudnnBatchNormOps_t bnOps,
const cudnnActivationDescriptor_t activationDesc,
const cudnnTensorDescriptor_t xDesc,
size_t *sizeInBytes);
Parameters

Input. Handle to a previously created cuDNN library descriptor. For more information, refer to
cudnnHandle_t
. 
Input. Mode of operation (spatial or peractivation). For more information, refer to
cudnnBatchNormMode_t
. 
Input. Mode of operation for the fast NHWC kernel. For more information, refer to
cudnnBatchNormOps_t
. This input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by elementwise addition and then activation. 
Tensor descriptors for the layer's
x
data. For more information, refer tocudnnTensorDescriptor_t
. 
Input. Descriptor for the activation operation. When the
bnOps
input is set to eitherCUDNN_BATCHNORM_OPS_BN_ACTIVATION
orCUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION
then this activation is used, otherwise user may passNULL
.  Output. Amount of GPU memory reserved.
Returns
 The computation was performed successfully.
 The function does not support the provided configuration.

At least one of the following conditions are met:
 The
xDesc
tensor descriptor dimension is not within the[4,5]
range (only 4D and 5D tensors are supported).
 The
4.1.11. cudnnGetNormalizationBackwardWorkspaceSize()
This function returns the amount of GPU memory workspace the user should allocate to be able to call cudnnNormalizationBackward()
function for the specified normOps
and algo
input setting. The workspace allocated will then be passed to the function cudnnNormalizationBackward()
.
cudnnStatus_t
cudnnGetNormalizationBackwardWorkspaceSize(cudnnHandle_t handle,
cudnnNormMode_t mode,
cudnnNormOps_t normOps,
cudnnNormAlgo_t algo,
const cudnnTensorDescriptor_t xDesc,
const cudnnTensorDescriptor_t yDesc,
const cudnnTensorDescriptor_t dyDesc,
const cudnnTensorDescriptor_t dzDesc,
const cudnnTensorDescriptor_t dxDesc,
const cudnnTensorDescriptor_t dNormScaleBiasDesc,
const cudnnActivationDescriptor_t activationDesc,
const cudnnTensorDescriptor_t normMeanVarDesc,
size_t *sizeInBytes,
int groupCnt);
Parameters

Input. Handle to a previously created cuDNN library descriptor. For more information, refer to
cudnnHandle_t
. 
Input. Mode of operation (perchannel or peractivation). For more information, refer to
cudnnNormMode_t
. 
Input. Mode of postoperative. Currently
CUDNN_NORM_OPS_NORM_ACTIVATION
andCUDNN_NORM_OPS_NORM_ADD_ACTIVATION
are only supported in the NHWC layout. For more information, refer tocudnnNormOps_t
. This input can be used to set this function to perform either only the normalization, or normalization followed by activation, or normalization followed by elementwise addition and then activation. 
Input. Algorithm to be performed. For more information, refer to
cudnnNormAlgo_t
. 
Tensor descriptors and pointers in the device memory for the layer's
x
data, back propagated differentialdy
(inputs), the optionaly
input data, the optionaldz
output, and thedx
output, which is the resulting differential with respect tox
. For more information, refer tocudnnTensorDescriptor_t
. 
Input. Shared tensor descriptor for the following four tensors:
normScaleData
,normBiasData
,dNormScaleData
,dNormBiasData
. The dimensions for this tensor descriptor are dependent on normalization mode. Note that the data type of this tensor descriptor must be float for FP16 and FP32 input tensors, and double for FP64 input tensors. 
Input. Descriptor for the activation operation. When the
normOps
input is set to eitherCUDNN_NORM_OPS_NORM_ACTIVATION
orCUDNN_NORM_OPS_NORM_ADD_ACTIVATION
, then this activation is used, otherwise the user may passNULL
. 
Input. Shared tensor descriptor for the following tensors:
savedMean
andsavedInvVariance
. The dimensions for this tensor descriptor are dependent on normalization mode. Note that the data type of this tensor descriptor must be float for FP16 and FP32 input tensors, and double for FP64 input tensors. 
Output. Amount of GPU memory required for the workspace, as determined by this function, to be able to execute the
cudnnGetNormalizationForwardTrainingWorkspaceSize()
function with the specifiednormOps
input setting. 
Input. The number of grouped convolutions. Currently, only
1
is supported.
Returns
 The computation was performed successfully.
 The function does not support the provided configuration.

At least one of the following conditions are met:
 Number of
xDesc
,yDesc
ordxDesc
tensor descriptor dimensions is not within the range of [4,5] (only 4D and 5D tensors are supported). dNormScaleBiasDesc
dimensions not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for perchannel, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for peractivation mode. Dimensions or data types mismatch for any pair of
xDesc
,dyDesc
, ordxDesc
.
 Number of
4.1.12. cudnnGetNormalizationForwardTrainingWorkspaceSize()
This function returns the amount of GPU memory workspace the user should allocate to be able to call cudnnNormalizationForwardTraining()
function for the specified normOps
and algo
input setting. The workspace allocated should then be passed by the user to the function cudnnNormalizationForwardTraining()
.
cudnnStatus_t
cudnnGetNormalizationForwardTrainingWorkspaceSize(cudnnHandle_t handle,
cudnnNormMode_t mode,
cudnnNormOps_t normOps,
cudnnNormAlgo_t algo,
const cudnnTensorDescriptor_t xDesc,
const cudnnTensorDescriptor_t zDesc,
const cudnnTensorDescriptor_t yDesc,
const cudnnTensorDescriptor_t normScaleBiasDesc,
const cudnnActivationDescriptor_t activationDesc,
const cudnnTensorDescriptor_t normMeanVarDesc,
size_t *sizeInBytes,
int groupCnt);
Parameters

Input. Handle to a previously created cuDNN library descriptor. For more information, refer to
cudnnHandle_t
. 
Input. Mode of operation (perchannel or peractivation). For more information, refer to
cudnnNormMode_t
. 
Input. Mode of postoperative. Currently
CUDNN_NORM_OPS_NORM_ACTIVATION
andCUDNN_NORM_OPS_NORM_ADD_ACTIVATION
are only supported in the NHWC layout. For more information, refer tocudnnNormOps_t
. This input can be used to set this function to perform either only the normalization, or normalization followed by activation, or normalization followed by elementwise addition and then activation. 
Input. Algorithm to be performed. For more information, refer to
cudnnNormAlgo_t
. 
Tensor descriptors and pointers in the device memory for the layer's
x
data, the optionalz
input data, and they
output.zDesc
is only needed whennormOps
isCUDNN_NORM_OPS_NORM_ADD_ACTIVATION
, otherwise the user may passNULL
. For more information, refer tocudnnTensorDescriptor_t
. 
Input. Shared tensor descriptor for the following tensors:
normScaleData
andnormBiasData
. The dimensions for this tensor descriptor are dependent on normalization mode. Note that the data type of this tensor descriptor must be float for FP16 and FP32 input tensors, and double for FP64 input tensors. 
Input. Descriptor for the activation operation. When the
normOps
input is set to eitherCUDNN_NORM_OPS_NORM_ACTIVATION
orCUDNN_NORM_OPS_NORM_ADD_ACTIVATION
, then this activation is used, otherwise the user may passNULL
. 
Input. Shared tensor descriptor for the following tensors:
savedMean
andsavedInvVariance
. The dimensions for this tensor descriptor are dependent on normalization mode. Note that the data type of this tensor descriptor must be float for FP16 and FP32 input tensors, and double for FP64 input tensors. 
Output. Amount of GPU memory required for the workspace, as determined by this function, to be able to execute the
cudnnGetNormalizationForwardTrainingWorkspaceSize()
function with the specifiednormOps
input setting. 
Input. The number of grouped convolutions. Currently, only
1
is supported.
Returns
 The computation was performed successfully.
 The function does not support the provided configuration.

At least one of the following conditions are met:
 Number of
xDesc
,yDesc
orzDesc
tensor descriptor dimensions is not within the range of [4,5] (only 4D and 5D tensors are supported). normScaleBiasDesc
dimensions not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for perchannel, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for peractivation mode. Dimensions or data types mismatch for
xDesc
oryDesc
.
 Number of
4.1.13. cudnnGetNormalizationTrainingReserveSpaceSize()
This function returns the amount of reserve GPU memory workspace the user should allocate for the normalization operation, for the specified normOps
input setting. In contrast to the workspace, the reserved space should be preserved between the forward and backward calls, and the data should not be altered.
cudnnStatus_t
cudnnGetNormalizationTrainingReserveSpaceSize(cudnnHandle_t handle,
cudnnNormMode_t mode,
cudnnNormOps_t normOps,
cudnnNormAlgo_t algo,
const cudnnActivationDescriptor_t activationDesc,
const cudnnTensorDescriptor_t xDesc,
size_t *sizeInBytes,
int groupCnt);
Parameters

Input. Handle to a previously created cuDNN library descriptor. For more information, refer to
cudnnHandle_t
. 
Input. Mode of operation (perchannel or peractivation). For more information, refer to
cudnnNormMode_t
. 
Input. Mode of postoperative. Currently
CUDNN_NORM_OPS_NORM_ACTIVATION
andCUDNN_NORM_OPS_NORM_ADD_ACTIVATION
are only supported in the NHWC layout. For more information, refer tocudnnNormOps_t
. This input can be used to set this function to perform either only the normalization, or normalization followed by activation, or normalization followed by elementwise addition and then activation. 
Input. Algorithm to be performed. For more information, refer to
cudnnNormAlgo_t
. 
Tensor descriptors for the layer's
x
data. For more information, refer tocudnnTensorDescriptor_t
. 
Input. Descriptor for the activation operation. When the
normOps
input is set to eitherCUDNN_NORM_OPS_NORM_ACTIVATION
orCUDNN_NORM_OPS_NORM_ADD_ACTIVATION
then this activation is used, otherwise the user may passNULL
.  Output. Amount of GPU memory reserved.

Input. The number of grouped convolutions. Currently, only
1
is supported.
Returns
 The computation was performed successfully.
 The function does not support the provided configuration.

At least one of the following conditions are met:
 The
xDesc
tensor descriptor dimension is not within the [4,5] range (only 4D and 5D tensors are supported).
 The
4.1.14. cudnnLRNCrossChannelBackward()
This function performs the backward LRN layer computation.
cudnnStatus_t cudnnLRNCrossChannelBackward(
cudnnHandle_t handle,
cudnnLRNDescriptor_t normDesc,
cudnnLRNMode_t lrnMode,
const void *alpha,
const cudnnTensorDescriptor_t yDesc,
const void *y,
const cudnnTensorDescriptor_t dyDesc,
const void *dy,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const void *beta,
const cudnnTensorDescriptor_t dxDesc,
void *dx)
Supported formats are: positivestrided
, NCHW and NHWC for 4D x
and y
, and only NCDHW DHWpacked for 5D (for both x
and y
). Only nonoverlapping 4D and 5D tensors are supported. NCHW layout is preferred for performance.
Parameters
 Input. Handle to a previously created cuDNN library descriptor.
 Input. Handle to a previously initialized LRN parameter descriptor.

Input. LRN layer mode of operation. Currently, only
CUDNN_LRN_CROSS_CHANNEL_DIM1
is implemented. Normalization is performed along the tensor'sdimA[1]
. 
Input. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue
For more information, refer to