## Abstract

This is the API Reference documentation for the NVIDIA cuDNN version 8.7.0 library. This API Reference lists the datatyes and functions per library. Specifically, this reference consists of a cuDNN datatype reference section that describes the types of enums and a cuDNN API reference section that describes all routines in the cuDNN library API. The cuDNN API is a context-based API that allows for easy multithreading and (optional) interoperability with CUDA streams.

For previously released cuDNN developer documentation, refer to the NVIDIA cuDNN Archives.

## 1. Introduction

NVIDIA® CUDA® Deep Neural Network (cuDNN) library offers a context-based API that allows for easy multithreading and (optional) interoperability with CUDA streams. This API Reference lists the datatyes and functions per library. Specifically, this reference consists of a cuDNN datatype reference section that describes the types of enums and a cuDNN API reference section that describes all routines in the cuDNN library API.

The cuDNN library as well as this API document has been split into the following libraries:
• cudnn_ops_infer - This entity contains the routines related to cuDNN context creation and destruction, tensor descriptor management, tensor utility routines, and the inference portion of common ML algorithms such as batch normalization, softmax, dropout, etc.
• cudnn_ops_train - This entity contains common training routines and algorithms, such as batch normalization, softmax, dropout, etc. The cudnn_ops_train library depends on cudnn_ops_infer.
• cudnn_cnn_infer - This entity contains all routines related to convolutional neural networks needed at inference time. The cudnn_cnn_infer library depends on cudnn_ops_infer.
• cudnn_cnn_train - This entity contains all routines related to convolutional neural networks needed during training time. The cudnn_cnn_train library depends on cudnn_ops_infer, cudnn_ops_train, and cudnn_cnn_infer.
• cudnn_adv_infer - This entity contains all other features and algorithms. This includes RNNs, CTC loss, and multi-head attention. The cudnn_adv_infer library depends on cudnn_ops_infer.
• cudnnBackend* - Introduced in cuDNN version 8.x, this entity contains a list of valid cuDNN backend descriptor types, a list of valid attributes, a subset of valid attribute values, and a full description of each backend descriptor type and their attributes.
• cudnn - This is an optional shim layer between the application layer and the cuDNN code. This layer opportunistically opens the correct library for the API at runtime.

## 2. Added, Deprecated, And Removed API Functions

### 2.1. API Changes For cuDNN 8.7.0

The following tables show which API functions were added, deprecated, and removed for the cuDNN 8.7.0.
Table 1. API functions and data types that were added in cuDNN 8.7.0
Backend descriptor types
cudnnRngDistribution_t
CUDNN_BACKEND_OPERATION_RNG_DESCRIPTOR
CUDNN_BACKEND_RNG_DESCRIPTOR

### 2.2. API Changes For cuDNN 8.5.0

The following tables show which API functions were added, deprecated, and removed for the cuDNN 8.5.0.

### 2.3. API Changes For cuDNN 8.4.0

The following tables show which API functions were added, deprecated, and removed for the cuDNN 8.4.0.

### 2.4. API Changes For cuDNN 8.3.0

The following tables show which API functions were added, deprecated, and removed for the cuDNN 8.3.0.
Table 4. API functions and data types that were added in cuDNN 8.3.0
Backend descriptor types
CUDNN_BACKEND_OPERATION_RESAMPLE_BWD_DESCRIPTOR
CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR
CUDNN_BACKEND_RESAMPLE_DESCRIPTOR

### 2.5. API Changes For cuDNN 8.2.0

The following tables show which API functions were added, deprecated, and removed for the cuDNN 8.2.0.
Table 5. API functions and data types that were added in cuDNN 8.2.0
New functions
cudnnGetActivationDescriptorSwishBeta()
cudnnSetActivationDescriptorSwishBeta()

### 2.6. API Changes For cuDNN 8.1.0

The following tables show which API functions were added, deprecated, and removed for the cuDNN 8.1.0.
Table 6. API functions and data types that were added in cuDNN 8.1.0
Backend descriptor types
CUDNN_BACKEND_MATMUL_DESCRIPTOR
CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR

### 2.8. API Changes For cuDNN 8.0.2

The following tables show which API functions were added, deprecated, and removed for the cuDNN 8.0.2.
Table 8. API functions and data types that were added in cuDNN 8.0.2
New functions and data types
cudnnRNNBackwardData_v8()
cudnnRNNBackwardWeights_v8()

### 2.9. API Changes For cuDNN 8.0.0 Preview

The following tables show which API functions were added, deprecated, and removed for the cuDNN 8.0.0 Preview Release.
Table 9. API functions and data types that were added in cuDNN 8.0.0 Preview
New functions and data types
cudnnBackendAttributeName_t
cudnnBackendAttributeType_t
cudnnBackendCreateDescriptor()
cudnnBackendDescriptor_t
cudnnBackendDescriptorType_t
cudnnBackendDestroyDescriptor()
cudnnBackendExecute()
cudnnBackendFinalize()
cudnnBackendGetAttribute()
cudnnBackendHeurMode_t
cudnnBackendInitialize()
cudnnBackendKnobType_t
cudnnBackendLayoutType_t
cudnnBackendNumericalNote_t
cudnnBackendSetAttribute()
cudnnBuildRNNDynamic()
cudnnCTCLoss_v8()
cudnnDeriveNormTensorDescriptor()
cudnnForwardMode_t
cudnnGenStatsMode_t
cudnnGetCTCLossDescriptor_v8()
cudnnGetCTCLossDescriptorEx()
cudnnGetCTCLossWorkspaceSize_v8
cudnnGetFilterSizeInBytes()
cudnnGetNormalizationBackwardWorkspaceSize()
cudnnGetNormalizationForwardTrainingWorkspaceSize()
cudnnGetNormalizationTrainingReserveSpaceSize()
cudnnGetRNNDescriptor_v8()
cudnnGetRNNMatrixMathType()
cudnnGetRNNTempSpaceSizes()
cudnnGetRNNWeightParams()
cudnnGetRNNWeightSpaceSize()
cudnnLRNDescriptor_t
cudnnNormAlgo_t
cudnnNormalizationBackward()
cudnnNormalizationForwardInference()
cudnnNormalizationForwardTraining()
cudnnNormMode_t
cudnnNormOps_t
cudnnOpsInferVersionCheck()
cudnnOpsTrainVersionCheck()
cudnnPointwiseMode_t
cudnnRNNForward()
cudnnRNNGetClip_v8()
cudnnRNNSetClip_v8()
cudnnSetCTCLossDescriptor_v8()
cudnnSetRNNDescriptor_v8()
cudnnSeverity_t
For our deprecation policy, refer to the Backward Compatibility And Deprecation Policy section in the cuDNN Developer Guide.
Table 10. API functions and data types that were deprecated in cuDNN 8.0.0 Preview
Deprecated functions and data types Replaced with
cudnnCopyAlgorithmDescriptor()
cudnnCreateAlgorithmDescriptor()
cudnnCreatePersistentRNNPlan() cudnnBuildRNNDynamic()
cudnnDestroyAlgorithmDescriptor()
cudnnDestroyPersistentRNNPlan()
cudnnFindRNNBackwardDataAlgorithmEx()
cudnnFindRNNBackwardWeightsAlgorithmEx()
cudnnFindRNNForwardInferenceAlgorithmEx()
cudnnFindRNNForwardTrainingAlgorithmEx()
cudnnGetAlgorithmDescriptor()
cudnnGetAlgorithmPerformance()
cudnnGetAlgorithmSpaceSize()
cudnnGetRNNBackwardDataAlgorithmMaxCount()
cudnnGetRNNBackwardWeightsAlgorithmMaxCount()
• cudnnGetRNNDescriptor_v6()
• cudnnGetRNNMatrixMathType()
• cudnnGetRNNBiasMode()
• cudnnGetRNNProjectionLayers()
cudnnGetRNNDescriptor_v8()
cudnnGetRNNForwardInferenceAlgorithmMaxCount()
cudnnGetRNNForwardTrainingAlgorithmMaxCount()
• cudnnGetRNNLinLayerBiasParams()
• cudnnGetRNNLinLayerMatrixParams()
cudnnGetRNNWeightParams()
cudnnGetRNNParamsSize() cudnnGetRNNWeightSpaceSize()
• cudnnGetRNNWorkspaceSize()
• cudnnGetRNNTrainingReserveSize()
cudnnGetRNNTempSpaceSizes()
cudnnPersistentRNNPlan_t
cudnnRestoreAlgorithm()
• cudnnRNNBackwardData()
• cudnnRNNBackwardDataEx()
cudnnRNNBackwardData_v8()
• cudnnRNNBackwardWeights()
• cudnnRNNBackwardWeightsEx()
cudnnRNNBackwardWeights_v8()
• cudnnRNNForwardInference()
• cudnnRNNForwardInferenceEx()
• cudnnRNNForwardTraining()
• cudnnRNNForwardTrainingEx()
cudnnRNNForward()
cudnnRNNGetClip() cudnnRNNGetClip_v8()
cudnnRNNSetClip() cudnnRNNSetClip_v8()
cudnnSaveAlgorithm()
cudnnSetAlgorithmDescriptor()
cudnnSetAlgorithmPerformance()
cudnnSetPersistentRNNPlan()
cudnnSetRNNAlgorithmDescriptor()
• cudnnSetRNNBiasMode()
• cudnnSetRNNDescriptor_v6()
• cudnnSetRNNMatrixMathType()
• cudnnSetRNNProjectionLayers()
cudnnSetRNNDescriptor_v8()
Table 11. API functions and data types that were removed in cuDNN 8.0.0 Preview
Removed functions and data types
cudnnConvolutionBwdDataPreference_t
cudnnConvolutionBwdFilterPreference_t
cudnnConvolutionFwdPreference_t
cudnnGetConvolutionBackwardDataAlgorithm()
cudnnGetConvolutionBackwardFilterAlgorithm()
cudnnGetConvolutionForwardAlgorithm()
cudnnGetRNNDescriptor()
cudnnSetRNNDescriptor()

## 3. cudnn_ops_infer.so Library

### 3.1.1.1. cudnnActivationDescriptor_t

cudnnActivationDescriptor_t is a pointer to an opaque structure holding the description of an activation operation. cudnnCreateActivationDescriptor() is used to create one instance, and cudnnSetActivationDescriptor() must be used to initialize this instance.

### 3.1.1.2. cudnnCTCLossDescriptor_t

cudnnCTCLossDescriptor_t is a pointer to an opaque structure holding the description of a CTC loss operation. cudnnCreateCTCLossDescriptor() is used to create one instance, cudnnSetCTCLossDescriptor() is used to initialize this instance, and cudnnDestroyCTCLossDescriptor() is used to destroy this instance.

### 3.1.1.3. cudnnDropoutDescriptor_t

cudnnDropoutDescriptor_t is a pointer to an opaque structure holding the description of a dropout operation. cudnnCreateDropoutDescriptor() is used to create one instance, cudnnSetDropoutDescriptor() is used to initialize this instance, cudnnDestroyDropoutDescriptor() is used to destroy this instance, cudnnGetDropoutDescriptor() is used to query fields of a previously initialized instance, cudnnRestoreDropoutDescriptor() is used to restore an instance to a previously saved off state.

### 3.1.1.4. cudnnFilterDescriptor_t

cudnnFilterDescriptor_t is a pointer to an opaque structure holding the description of a filter dataset. cudnnCreateFilterDescriptor() is used to create one instance, and cudnnSetFilter4dDescriptor() or cudnnSetFilterNdDescriptor() must be used to initialize this instance.

### 3.1.1.5. cudnnHandle_t

cudnnHandle_t is a pointer to an opaque structure holding the cuDNN library context. The cuDNN library context must be created using cudnnCreate() and the returned handle must be passed to all subsequent library function calls. The context should be destroyed at the end using cudnnDestroy(). The context is associated with only one GPU device, the current device at the time of the call to cudnnCreate(). However, multiple contexts can be created on the same GPU device.

### 3.1.1.6. cudnnLRNDescriptor_t

cudnnLRNDescriptor_t is a pointer to an opaque structure holding the parameters of a local response normalization. cudnnCreateLRNDescriptor() is used to create one instance, and the routine cudnnSetLRNDescriptor() must be used to initialize this instance.

### 3.1.1.7. cudnnOpTensorDescriptor_t

cudnnOpTensorDescriptor_t is a pointer to an opaque structure holding the description of a Tensor Core operation, used as a parameter to cudnnOpTensor(). cudnnCreateOpTensorDescriptor() is used to create one instance, and cudnnSetOpTensorDescriptor() must be used to initialize this instance.

### 3.1.1.8. cudnnPoolingDescriptor_t

cudnnPoolingDescriptor_t is a pointer to an opaque structure holding the description of a pooling operation. cudnnCreatePoolingDescriptor() is used to create one instance, and cudnnSetPoolingNdDescriptor() or cudnnSetPooling2dDescriptor() must be used to initialize this instance.

### 3.1.1.9. cudnnReduceTensorDescriptor_t

cudnnReduceTensorDescriptor_t is a pointer to an opaque structure holding the description of a tensor reduction operation, used as a parameter to cudnnReduceTensor(). cudnnCreateReduceTensorDescriptor() is used to create one instance, and cudnnSetReduceTensorDescriptor() must be used to initialize this instance.

### 3.1.1.10. cudnnSpatialTransformerDescriptor_t

cudnnSpatialTransformerDescriptor_t is a pointer to an opaque structure holding the description of a spatial transformation operation. cudnnCreateSpatialTransformerDescriptor() is used to create one instance, cudnnSetSpatialTransformerNdDescriptor() is used to initialize this instance, and cudnnDestroySpatialTransformerDescriptor() is used to destroy this instance.

### 3.1.1.11. cudnnTensorDescriptor_t

cudnnTensorDescriptor_t is a pointer to an opaque structure holding the description of a generic n-D dataset. cudnnCreateTensorDescriptor() is used to create one instance, and one of the routines cudnnSetTensorNdDescriptor(), cudnnSetTensor4dDescriptor() or cudnnSetTensor4dDescriptorEx() must be used to initialize this instance.

### 3.1.1.12. cudnnTensorTransformDescriptor_t

cudnnTensorTransformDescriptor_t is an opaque structure containing the description of the tensor transform. Use the cudnnCreateTensorTransformDescriptor() function to create an instance of this descriptor, and cudnnDestroyTensorTransformDescriptor() function to destroy a previously created instance.

### 3.1.2.1. cudnnActivationMode_t

cudnnActivationMode_t is an enumerated type used to select the neuron activation function used in cudnnActivationForward(), cudnnActivationBackward(), and cudnnConvolutionBiasActivationForward().

##### Values
CUDNN_ACTIVATION_SIGMOID

Selects the sigmoid function.

CUDNN_ACTIVATION_RELU

Selects the rectified linear function.

CUDNN_ACTIVATION_TANH

Selects the hyperbolic tangent function.

CUDNN_ACTIVATION_CLIPPED_RELU

Selects the clipped rectified linear function.

CUDNN_ACTIVATION_ELU

Selects the exponential linear function.

CUDNN_ACTIVATION_IDENTITY

Selects the identity function, intended for bypassing the activation step in cudnnConvolutionBiasActivationForward(). (The cudnnConvolutionBiasActivationForward() function must use CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_​PRECOMP_GEMM.) Does not work with cudnnActivationForward() or cudnnActivationBackward().

CUDNN_ACTIVATION_SWISH
Selects the swish function.

### 3.1.2.2. cudnnAlgorithm_t

This function has been deprecated in cuDNN 8.0.

### 3.1.2.3. cudnnBatchNormMode_t

cudnnBatchNormMode_t is an enumerated type used to specify the mode of operation in cudnnBatchNormalizationForwardInference(), cudnnBatchNormalizationForwardTraining(), cudnnBatchNormalizationBackward() and cudnnDeriveBNTensorDescriptor() routines.

##### Values
CUDNN_BATCHNORM_PER_ACTIVATION

Normalization is performed per-activation. This mode is intended to be used after the non-convolutional network layers. In this mode, the tensor dimensions of bnBias and bnScale and the parameters used in the cudnnBatchNormalization* functions are 1xCxHxW.

CUDNN_BATCHNORM_SPATIAL

Normalization is performed over N+spatial dimensions. This mode is intended for use after convolutional layers (where spatial invariance is desired). In this mode the bnBias and bnScale tensor dimensions are 1xCx1x1.

CUDNN_BATCHNORM_SPATIAL_PERSISTENT

This mode is similar to CUDNN_BATCHNORM_SPATIAL but it can be faster for some tasks.

An optimized path may be selected for CUDNN_DATA_FLOAT and CUDNN_DATA_HALF types, compute capability 6.0 or higher for the following two batch normalization API calls: cudnnBatchNormalizationForwardTraining(), and cudnnBatchNormalizationBackward(). In the case of cudnnBatchNormalizationBackward(), the savedMean and savedInvVariance arguments should not be NULL.

The rest of this section applies to NCHW mode only:

This mode may use a scaled atomic integer reduction that is deterministic but imposes more restrictions on the input data range. When a numerical overflow occurs, the algorithm may produce NaN-s or Inf-s (infinity) in output buffers.

When Inf-s/NaN-s are present in the input data, the output in this mode is the same as from a pure floating-point implementation.

For finite but very large input values, the algorithm may encounter overflows more frequently due to a lower dynamic range and emit Inf-s/NaN-s while CUDNN_BATCHNORM_SPATIAL will produce finite results. The user can invoke cudnnQueryRuntimeError() to check if a numerical overflow occurred in this mode.

### 3.1.2.4. cudnnBatchNormOps_t

cudnnBatchNormOps_t is an enumerated type used to specify the mode of operation in cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize(), cudnnBatchNormalizationForwardTrainingEx(), cudnnGetBatchNormalizationBackwardExWorkspaceSize(), cudnnBatchNormalizationBackwardEx(), and cudnnGetBatchNormalizationTrainingExReserveSpaceSize() functions.

##### Values
CUDNN_BATCHNORM_OPS_BN

Only batch normalization is performed, per-activation.

CUDNN_BATCHNORM_OPS_BN_ACTIVATION

First, the batch normalization is performed, and then the activation is performed.

Performs the batch normalization, then element-wise addition, followed by the activation operation.

### 3.1.2.5. cudnnCTCLossAlgo_t

cudnnCTCLossAlgo_t is an enumerated type that exposes the different algorithms available to execute the CTC loss operation.

##### Values
CUDNN_CTC_LOSS_ALGO_DETERMINISTIC

Results are guaranteed to be reproducible.

CUDNN_CTC_LOSS_ALGO_NON_DETERMINISTIC

Results are not guaranteed to be reproducible.

### 3.1.2.6. cudnnDataType_t

cudnnDataType_t is an enumerated type indicating the data type to which a tensor descriptor or filter descriptor refers.

##### Values
CUDNN_DATA_FLOAT

The data is a 32-bit single-precision floating-point (float).

CUDNN_DATA_DOUBLE

The data is a 64-bit double-precision floating-point (double).

CUDNN_DATA_HALF

The data is a 16-bit floating-point.

CUDNN_DATA_INT8

The data is an 8-bit signed integer.

CUDNN_DATA_INT32

The data is a 32-bit signed integer.

CUDNN_DATA_INT8x4

The data is 32-bit elements each composed of 4 8-bit signed integers. This data type is only supported with the tensor format CUDNN_TENSOR_NCHW_VECT_C.

CUDNN_DATA_UINT8
The data is an 8-bit unsigned integer.
CUDNN_DATA_UINT8x4
The data is 32-bit elements each composed of 4 8-bit unsigned integers. This data type is only supported with the tensor format CUDNN_TENSOR_NCHW_VECT_C.
CUDNN_DATA_INT8x32

The data is 32-element vectors, each element being an 8-bit signed integer. This data type is only supported with the tensor format CUDNN_TENSOR_NCHW_VECT_C. Moreover, this data type can only be used with algo 1, meaning, CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_​PRECOMP_GEMM. For more information, refer to cudnnConvolutionFwdAlgo_t.

CUDNN_DATA_BFLOAT16
The data is a 16-bit quantity, with 7 mantissa bits, 8 exponent bits, and 1 sign bit.
CUDNN_DATA_INT64
The data is a 64-bit signed integer.
CUDNN_DATA_BOOLEAN
The data is a boolean (bool).

Note that for type CUDNN_TYPE_BOOLEAN, elements are expected to be “packed”: that is, one byte contains 8 elements of type CUDNN_TYPE_BOOLEAN. Further, within each byte, elements are indexed from the least significant bit to the most significant bit. For example, a 1 dimensional tensor of 8 elements containing 01001111 has value 1 for elements 0 through 3, 0 for elements 4 and 5, 1 for element 6 and 0 for element 7.

Tensors with more than 8 elements simply use more bytes, where the order is also from least significant to most significant byte. Note, CUDA is little-endian, meaning that the least significant byte has the lower memory address address. For example, in the case of 16 elements, 01001111 11111100 has value 1 for elements 0 through 3, 0 for elements 4 and 5, 1 for element 6 and 0 for element 7, value 0 for elements 8 and 9, 1 for elements 10 through 15.

CUDNN_DATA_FP8_E4M3
The data is an 8-bit quantity, with 3 mantissa bits, 4 exponent bits, and 1 sign bit.
CUDNN_DATA_FP8_E5M2
The data is an 8-bit quantity, with 2 mantissa bits, 5 exponent bits, and 1 sign bit.
CUDNN_DATA_FAST_FLOAT_FOR_FP8
The data type is a higher throughput but lower precision compute type (compared to CUDNN_DATA_FLOAT) used for FP8 tensor core operations

### 3.1.2.7. cudnnDeterminism_t

cudnnDeterminism_t is an enumerated type used to indicate if the computed results are deterministic (reproducible). For more information, refer to Reproducibility (determinism) in the cuDNN Developer Guide.

##### Values
CUDNN_NON_DETERMINISTIC

Results are not guaranteed to be reproducible.

CUDNN_DETERMINISTIC

Results are guaranteed to be reproducible.

### 3.1.2.8. cudnnDivNormMode_t

cudnnDivNormMode_t is an enumerated type used to specify the mode of operation in cudnnDivisiveNormalizationForward() and cudnnDivisiveNormalizationBackward().

##### Values
CUDNN_DIVNORM_PRECOMPUTED_MEANS
The means tensor data pointer is expected to contain means or other kernel convolution values precomputed by the user. The means pointer can also be NULL, in that case, it's considered to be filled with zeroes. This is equivalent to spatial LRN.
Note: In the backward pass, the means are treated as independent inputs and the gradient over means is computed independently. In this mode, to yield a net gradient over the entire LCN computational graph, the destDiffMeans result should be backpropagated through the user's means layer (which can be implemented using average pooling) and added to the destDiffData tensor produced by cudnnDivisiveNormalizationBackward().

### 3.1.2.9. cudnnErrQueryMode_t

cudnnErrQueryMode_t is an enumerated type passed to cudnnQueryRuntimeError() to select the remote kernel error query mode.

##### Values
CUDNN_ERRQUERY_RAWCODE

Read the error storage location regardless of the kernel completion status.

CUDNN_ERRQUERY_NONBLOCKING

Report if all tasks in the user stream of the cuDNN handle were completed. If that is the case, report the remote kernel error code.

CUDNN_ERRQUERY_BLOCKING

Wait for all tasks to complete in the user stream before reporting the remote kernel error code.

### 3.1.2.10. cudnnFoldingDirection_t

cudnnFoldingDirection_t is an enumerated type used to select the folding direction. For more information, refer to cudnnTensorTransformDescriptor_t.

##### Data Member
CUDNN_TRANSFORM_FOLD = 0U

Selects folding.

CUDNN_TRANSFORM_UNFOLD = 1U

Selects unfolding.

### 3.1.2.11. cudnnIndicesType_t

cudnnIndicesType_t is an enumerated type used to indicate the data type for the indices to be computed by the cudnnReduceTensor() routine. This enumerated type is used as a field for the cudnnReduceTensorDescriptor_t descriptor.

##### Values
CUDNN_32BIT_INDICES

Compute unsigned int indices.

CUDNN_64BIT_INDICES

Compute unsigned long indices.

CUDNN_16BIT_INDICES

Compute unsigned short indices.

CUDNN_8BIT_INDICES

Compute unsigned char indices.

### 3.1.2.12. cudnnLRNMode_t

cudnnLRNMode_t is an enumerated type used to specify the mode of operation in cudnnLRNCrossChannelForward() and cudnnLRNCrossChannelBackward().

##### Values
CUDNN_LRN_CROSS_CHANNEL_DIM1
LRN computation is performed across the tensor's dimension dimA[1].

### 3.1.2.13. cudnnMathType_t

cudnnMathType_t is an enumerated type used to indicate if the use of Tensor Core operations is permitted in a given library routine.

##### Values
CUDNN_DEFAULT_MATH

Tensor Core operations are not used on pre-NVIDIA A100 GPU devices. On A100 GPU architecture devices, Tensor Core TF32 operation is permitted.

CUDNN_TENSOR_OP_MATH

The use of Tensor Core operations is permitted but will not actively perform datatype down conversion on tensors in order to utilize Tensor Cores.

CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION

The use of Tensor Core operations is permitted and will actively perform datatype down conversion on tensors in order to utilize Tensor Cores.

CUDNN_FMA_MATH

Restricted to only kernels that use FMA instructions.

On pre-NVIDIA A100 GPU devices, CUDNN_DEFAULT_MATH and CUDNN_FMA_MATH have the same behavior: Tensor Core kernels will not be selected. With NVIDIA Ampere Architecture and CUDA toolkit 11, CUDNN_DEFAULT_MATH permits TF32 Tensor Core operation and CUDNN_FMA_MATH does not. The TF32 behavior for CUDNN_DEFAULT_MATH and the other Tensor Core math types can be explicitly disabled by the environment variable NVIDIA_TF32_OVERRIDE=0.

### 3.1.2.14. cudnnNanPropagation_t

cudnnNanPropagation_t is an enumerated type used to indicate if a given routine should propagate Nan numbers. This enumerated type is used as a field for the cudnnActivationDescriptor_t descriptor and cudnnPoolingDescriptor_t descriptor.

##### Values
CUDNN_NOT_PROPAGATE_NAN

Nan numbers are not propagated.

CUDNN_PROPAGATE_NAN

Nan numbers are propagated.

### 3.1.2.15. cudnnNormAlgo_t

cudnnNormAlgo_t is an enumerated type used to specify the algorithm to execute the normalization operation.

##### Values
CUDNN_NORM_ALGO_STANDARD

Standard normalization is performed.

CUDNN_NORM_ALGO_PERSIST

This mode is similar to CUDNN_NORM_ALGO_STANDARD, however it only supports CUDNN_NORM_PER_CHANNEL and can be faster for some tasks.

An optimized path may be selected for CUDNN_DATA_FLOAT and CUDNN_DATA_HALF types, compute capability 6.0 or higher for the following two normalization API calls: cudnnNormalizationForwardTraining() and cudnnNormalizationBackward(). In the case of cudnnNormalizationBackward(), the savedMean and savedInvVariance arguments should not be NULL.

The rest of this section applies to NCHW mode only: This mode may use a scaled atomic integer reduction that is deterministic but imposes more restrictions on the input data range. When a numerical overflow occurs, the algorithm may produce NaN-s or Inf-s (infinity) in output buffers.

When Inf-s/NaN-s are present in the input data, the output in this mode is the same as from a pure floating-point implementation.

For finite but very large input values, the algorithm may encounter overflows more frequently due to a lower dynamic range and emit Inf-s/NaN-s while CUDNN_NORM_ALGO_STANDARD will produce finite results. The user can invoke cudnnQueryRuntimeError() to check if a numerical overflow occurred in this mode.

### 3.1.2.16. cudnnNormMode_t

cudnnNormMode_t is an enumerated type used to specify the mode of operation in cudnnNormalizationForwardInference(), cudnnNormalizationForwardTraining(), cudnnBatchNormalizationBackward(), cudnnGetNormalizationForwardTrainingWorkspaceSize(), cudnnGetNormalizationBackwardWorkspaceSize(), and cudnnGetNormalizationTrainingReserveSpaceSize() routines.

##### Values
CUDNN_NORM_PER_ACTIVATION

Normalization is performed per-activation. This mode is intended to be used after the non-convolutional network layers. In this mode, the tensor dimensions of normBias and normScale and the parameters used in the cudnnNormalization* functions are 1xCxHxW.

CUDNN_NORM_PER_CHANNEL

Normalization is performed per-channel over N+spatial dimensions. This mode is intended for use after convolutional layers (where spatial invariance is desired). In this mode, the normBias and normScale tensor dimensions are 1xCx1x1.

### 3.1.2.17. cudnnNormOps_t

cudnnNormOps_t is an enumerated type used to specify the mode of operation in cudnnGetNormalizationForwardTrainingWorkspaceSize(), cudnnNormalizationForwardTraining(), cudnnGetNormalizationBackwardWorkspaceSize(), cudnnNormalizationBackward(), and cudnnGetNormalizationTrainingReserveSpaceSize() functions.

##### Values
CUDNN_NORM_OPS_NORM

Only normalization is performed.

CUDNN_NORM_OPS_NORM_ACTIVATION

First, the normalization is performed, then the activation is performed.

Performs the normalization, then element-wise addition, followed by the activation operation.

### 3.1.2.18. cudnnOpTensorOp_t

cudnnOpTensorOp_t is an enumerated type used to indicate the Tensor Core operation to be used by the cudnnOpTensor() routine. This enumerated type is used as a field for the cudnnOpTensorDescriptor_t descriptor.

##### Values

The operation to be performed is addition.

CUDNN_OP_TENSOR_MUL

The operation to be performed is multiplication.

CUDNN_OP_TENSOR_MIN

The operation to be performed is a minimum comparison.

CUDNN_OP_TENSOR_MAX

The operation to be performed is a maximum comparison.

CUDNN_OP_TENSOR_SQRT

The operation to be performed is square root, performed on only the A tensor.

CUDNN_OP_TENSOR_NOT

The operation to be performed is negation, performed on only the A tensor.

### 3.1.2.19. cudnnPoolingMode_t

cudnnPoolingMode_t is an enumerated type passed to cudnnSetPooling2dDescriptor() to select the pooling method to be used by cudnnPoolingForward() and cudnnPoolingBackward().

##### Values
CUDNN_POOLING_MAX

The maximum value inside the pooling window is used.

Values inside the pooling window are averaged. The number of elements used to calculate the average includes spatial locations falling in the padding region.

Values inside the pooling window are averaged. The number of elements used to calculate the average excludes spatial locations falling in the padding region.

CUDNN_POOLING_MAX_DETERMINISTIC

The maximum value inside the pooling window is used. The algorithm used is deterministic.

### 3.1.2.20. cudnnReduceTensorIndices_t

cudnnReduceTensorIndices_t is an enumerated type used to indicate whether indices are to be computed by the cudnnReduceTensor() routine. This enumerated type is used as a field for the cudnnReduceTensorDescriptor_t descriptor.

##### Values
CUDNN_REDUCE_TENSOR_NO_INDICES

Do not compute indices.

CUDNN_REDUCE_TENSOR_FLATTENED_INDICES

Compute indices. The resulting indices are relative, and flattened.

### 3.1.2.21. cudnnReduceTensorOp_t

cudnnReduceTensorOp_t is an enumerated type used to indicate the Tensor Core operation to be used by the cudnnReduceTensor() routine. This enumerated type is used as a field for the cudnnReduceTensorDescriptor_t descriptor.

##### Values

The operation to be performed is addition.

CUDNN_REDUCE_TENSOR_MUL

The operation to be performed is multiplication.

CUDNN_REDUCE_TENSOR_MIN

The operation to be performed is a minimum comparison.

CUDNN_REDUCE_TENSOR_MAX

The operation to be performed is a maximum comparison.

CUDNN_REDUCE_TENSOR_AMAX

The operation to be performed is a maximum comparison of absolute values.

CUDNN_REDUCE_TENSOR_AVG

The operation to be performed is averaging.

CUDNN_REDUCE_TENSOR_NORM1

The operation to be performed is addition of absolute values.

CUDNN_REDUCE_TENSOR_NORM2

The operation to be performed is a square root of the sum of squares.

CUDNN_REDUCE_TENSOR_MUL_NO_ZEROS

The operation to be performed is multiplication, not including elements of value zero.

### 3.1.2.22. cudnnRNNAlgo_t

cudnnRNNAlgo_t is an enumerated type used to specify the algorithm used in the cudnnRNNForwardInference(), cudnnRNNForwardTraining(), cudnnRNNBackwardData() and cudnnRNNBackwardWeights() routines.

##### Values
CUDNN_RNN_ALGO_STANDARD
Each RNN layer is executed as a sequence of operations. This algorithm is expected to have robust performance across a wide range of network parameters.
CUDNN_RNN_ALGO_PERSIST_STATIC

The recurrent parts of the network are executed using a persistent kernel approach. This method is expected to be fast when the first dimension of the input tensor is small (meaning, a small minibatch).

CUDNN_RNN_ALGO_PERSIST_STATIC is only supported on devices with compute capability >= 6.0.

CUDNN_RNN_ALGO_PERSIST_DYNAMIC

The recurrent parts of the network are executed using a persistent kernel approach. This method is expected to be fast when the first dimension of the input tensor is small (meaning, a small minibatch). When using CUDNN_RNN_ALGO_PERSIST_DYNAMIC persistent kernels are prepared at runtime and are able to optimize using the specific parameters of the network and active GPU. As such, when using CUDNN_RNN_ALGO_PERSIST_DYNAMIC a one-time plan preparation stage must be executed. These plans can then be reused in repeated calls with the same model parameters.

The limits on the maximum number of hidden units supported when using CUDNN_RNN_ALGO_PERSIST_DYNAMIC are significantly higher than the limits when using CUDNN_RNN_ALGO_PERSIST_STATIC, however throughput is likely to significantly reduce when exceeding the maximums supported by CUDNN_RNN_ALGO_PERSIST_STATIC. In this regime, this method will still outperform CUDNN_RNN_ALGO_STANDARD for some cases.

CUDNN_RNN_ALGO_PERSIST_DYNAMIC is only supported on devices with compute capability >= 6.0 on Linux machines.

### 3.1.2.23. cudnnSamplerType_t

cudnnSamplerType_t is an enumerated type passed to cudnnSetSpatialTransformerNdDescriptor() to select the sampler type to be used by cudnnSpatialTfSamplerForward() and cudnnSpatialTfSamplerBackward().

##### Values
CUDNN_SAMPLER_BILINEAR
Selects the bilinear sampler.

### 3.1.2.24. cudnnSeverity_t

cudnnSeverity_t is an enumerated type passed to the customized callback function for logging that users may set. This enumerate describes the severity level of the item, so the customized logging call back may react differently.

##### Values
CUDNN_SEV_FATAL

This value indicates a fatal error emitted by cuDNN.

CUDNN_SEV_ERROR

This value indicates a normal error emitted by cuDNN.

CUDNN_SEV_WARNING

This value indicates a warning emitted by cuDNN.

CUDNN_SEV_INFO

This value indicates a piece of information (for example, API log) emitted by cuDNN.

### 3.1.2.25. cudnnSoftmaxAlgorithm_t

cudnnSoftmaxAlgorithm_t is used to select an implementation of the softmax function used in cudnnSoftmaxForward() and cudnnSoftmaxBackward().

##### Values
CUDNN_SOFTMAX_FAST

This implementation applies the straightforward softmax operation.

CUDNN_SOFTMAX_ACCURATE

This implementation scales each point of the softmax input domain by its maximum value to avoid potential floating point overflows in the softmax evaluation.

CUDNN_SOFTMAX_LOG

This entry performs the log softmax operation, avoiding overflows by scaling each point in the input domain as in CUDNN_SOFTMAX_ACCURATE.

### 3.1.2.26. cudnnSoftmaxMode_t

cudnnSoftmaxMode_t is used to select over which data the cudnnSoftmaxForward() and cudnnSoftmaxBackward() are computing their results.

##### Values
CUDNN_SOFTMAX_MODE_INSTANCE

The softmax operation is computed per image (N) across the dimensions C,H,W.

CUDNN_SOFTMAX_MODE_CHANNEL

The softmax operation is computed per spatial location (H,W) per image (N) across dimension C.

### 3.1.2.27. cudnnStatus_t

cudnnStatus_t is an enumerated type used for function status returns. All cuDNN library functions return their status, which can be one of the following values:
##### Values
CUDNN_STATUS_SUCCESS
The operation was completed successfully.
CUDNN_STATUS_NOT_INITIALIZED
The cuDNN library was not initialized properly. This error is usually returned when a call to cudnnCreate() fails or when cudnnCreate() has not been called prior to calling another cuDNN routine. In the former case, it is usually due to an error in the CUDA Runtime API called by cudnnCreate() or by an error in the hardware setup.
CUDNN_STATUS_ALLOC_FAILED

Resource allocation failed inside the cuDNN library. This is usually caused by an internal cudaMalloc() failure.

To correct, prior to the function call, deallocate previously allocated memory as much as possible.

An incorrect value or parameter was passed to the function.

To correct, ensure that all the parameters being passed have valid values.

CUDNN_STATUS_ARCH_MISMATCH

The function requires a feature absent from the current GPU device. Note that cuDNN only supports devices with compute capabilities greater than or equal to 3.0.

To correct, compile and run the application on a device with appropriate compute capability.

CUDNN_STATUS_MAPPING_ERROR

An access to GPU memory space failed, which is usually caused by a failure to bind a texture.

To correct, prior to the function call, unbind any previously bound textures.

Otherwise, this may indicate an internal error/bug in the library.

CUDNN_STATUS_EXECUTION_FAILED

The GPU program failed to execute. This is usually caused by a failure to launch some cuDNN kernel on the GPU, which can occur for multiple reasons.

To correct, check that the hardware, an appropriate version of the driver, and the cuDNN library are correctly installed.

Otherwise, this may indicate an internal error/bug in the library.

CUDNN_STATUS_INTERNAL_ERROR
An internal cuDNN operation failed.
CUDNN_STATUS_NOT_SUPPORTED
The functionality requested is not presently supported by cuDNN.
The functionality requested requires some license and an error was detected when trying to check the current licensing. This error can happen if the license is not present or is expired or if the environment variable NVIDIA_LICENSE_FILE is not set properly.
CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING
A runtime library required by cuDNN cannot be found in the predefined search paths. These libraries are libcuda.so (nvcuda.dll) and libnvrtc.so (nvrtc64_<Major Release Version><Minor Release Version>_0.dll and nvrtc-builtins64_<Major Release Version><Minor Release Version>.dll).
CUDNN_STATUS_RUNTIME_IN_PROGRESS
Some tasks in the user stream are not completed.
CUDNN_STATUS_RUNTIME_FP_OVERFLOW
Numerical overflow occurred during the GPU kernel execution.

### 3.1.2.28. cudnnTensorFormat_t

cudnnTensorFormat_t is an enumerated type used by cudnnSetTensor4dDescriptor() to create a tensor with a pre-defined layout. For a detailed explanation of how these tensors are arranged in memory, refer to Data Layout Formats in the cuDNN Developer Guide.

##### Values
CUDNN_TENSOR_NCHW

This tensor format specifies that the data is laid out in the following order: batch size, feature maps, rows, columns. The strides are implicitly defined in such a way that the data are contiguous in memory with no padding between images, feature maps, rows, and columns; the columns are the inner dimension and the images are the outermost dimension.

CUDNN_TENSOR_NHWC

This tensor format specifies that the data is laid out in the following order: batch size, rows, columns, feature maps. The strides are implicitly defined in such a way that the data are contiguous in memory with no padding between images, rows, columns, and feature maps; the feature maps are the inner dimension and the images are the outermost dimension.

CUDNN_TENSOR_NCHW_VECT_C

This tensor format specifies that the data is laid out in the following order: batch size, feature maps, rows, columns. However, each element of the tensor is a vector of multiple feature maps. The length of the vector is carried by the data type of the tensor. The strides are implicitly defined in such a way that the data are contiguous in memory with no padding between images, feature maps, rows, and columns; the columns are the inner dimension and the images are the outermost dimension. This format is only supported with tensor data types CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32, and CUDNN_DATA_UINT8x4.

The CUDNN_TENSOR_NCHW_VECT_C can also be interpreted in the following way: The NCHW INT8x32 format is really N x (C/32) x H x W x 32 (32 Cs for every W), just as the NCHW INT8x4 format is N x (C/4) x H x W x 4 (4 Cs for every W). Hence, the VECT_C name - each W is a vector (4 or 32) of Cs.

### 3.2.1. cudnnActivationForward()

cudnnStatus_t cudnnActivationForward(
cudnnHandle_t handle,
cudnnActivationDescriptor_t     activationDesc,
const void                     *alpha,
const cudnnTensorDescriptor_t   xDesc,
const void                     *x,
const void                     *beta,
const cudnnTensorDescriptor_t   yDesc,
void                           *y)
This routine applies a specified neuron activation function element-wise over each input value.
Note:
• In-place operation is allowed for this routine; meaning, xData and yData pointers may be equal. However, this requires xDesc and yDesc descriptors to be identical (particularly, the strides of the input and output must match for an in-place operation to be allowed).
• All tensor formats are supported for 4 and 5 dimensions, however, the best performance is obtained when the strides of xDesc and yDesc are equal and HW-packed. For more than 5 dimensions the tensors must have their spatial dimensions packed.

#### Parameters

handle

Input. Handle to a previously created cuDNN context. For more information, refer to cudnnHandle_t.

activationDesc

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

xDesc

Input. Handle to the previously initialized input tensor descriptor. For more information, refer to cudnnTensorDescriptor_t.

x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc.

yDesc

Input. Handle to the previously initialized output tensor descriptor.

y

Output. Data pointer to GPU memory associated with the output tensor descriptor yDesc.

#### Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

At least one of the following conditions are met:

• The parameter mode has an invalid enumerant value.
• The dimensions n, c, h, w of the input tensor and output tensor differ.
• The datatype of the input tensor and output tensor differs.
• The strides nStride, cStride, hStride, wStride of the input tensor and output tensor differ and in-place operation is used (meaning, x and y pointers are equal).
CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

cudnnStatus_t cudnnAddTensor(
cudnnHandle_t                     handle,
const void                       *alpha,
const void                       *A,
const void                       *beta,
const cudnnTensorDescriptor_t     cDesc,
void                             *C)

This function adds the scaled values of a bias tensor to another tensor. Each dimension of the bias tensor A must match the corresponding dimension of the destination tensor C or must be equal to 1. In the latter case, the same value from the bias tensor for those dimensions will be used to blend into the C tensor.

Note: Only 4D and 5D tensors are supported. Beyond these dimensions, this routine is not supported.

#### Parameters

handle

Input. Handle to a previously created cuDNN context. For more information, refer to cudnnHandle_t.

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the source value with the prior value in the destination tensor as follows:
dstValue = alpha[0]*srcValue + beta[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

Input. Handle to a previously initialized tensor descriptor. For more information, refer to cudnnTensorDescriptor_t.

A

Input. Pointer to data of the tensor described by the aDesc descriptor.

cDesc

Input. Handle to a previously initialized tensor descriptor.

C

Input/Output. Pointer to data of the tensor described by the cDesc descriptor.

#### Returns

CUDNN_STATUS_SUCCESS

The function executed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

The dimensions of the bias tensor refer to an amount of data that is incompatible with the output tensor dimensions or the dataType of the two tensor descriptors are different.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 3.2.3. cudnnBatchNormalizationForwardInference()

 cudnnStatus_t cudnnBatchNormalizationForwardInference(
cudnnHandle_t                    handle,
cudnnBatchNormMode_t             mode,
const void                      *alpha,
const void                      *beta,
const cudnnTensorDescriptor_t    xDesc,
const void                      *x,
const cudnnTensorDescriptor_t    yDesc,
void                            *y,
const cudnnTensorDescriptor_t    bnScaleBiasMeanVarDesc,
const void                      *bnScale,
const void                      *bnBias,
const void                      *estimatedMean,
const void                      *estimatedVariance,
double                           epsilon)
This function performs the forward batch normalization layer computation for the inference phase. This layer is based on the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe, C. Szegedy, 2015.
Note:
• Only 4D and 5D tensors are supported.
• The input transformation performed by this function is defined as:
y = beta*y + alpha *[bnBias + (bnScale * (x-estimatedMean)/sqrt(epsilon + estimatedVariance)]
• The epsilon value has to be the same during training, backpropagation and inference.
• For the training phase, refer to cudnnBatchNormalizationForwardTraining().
• Higher performance can be obtained when HW-packed tensors are used for all of x and dx.
For more information, refer to cudnnDeriveBNTensorDescriptor() for the secondary tensor descriptor generation for the parameters used in this function.

#### Parameters

handle

Input. Handle to a previously created cuDNN library descriptor. For more information, refer to cudnnHandle_t.

mode

Input. Mode of operation (spatial or per-activation). For more information, refer to cudnnBatchNormMode_t.

alpha, beta
Inputs. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

xDesc, yDesc

Input. Handles to the previously initialized tensor descriptors.

*x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc, for the layer’s x input data.

*y

Input/Output. Data pointer to GPU memory associated with the tensor descriptor yDesc, for the youtput of the batch normalization layer.

bnScaleBiasMeanVarDesc, bnScale, bnBias

Inputs. Tensor descriptors and pointers in device memory for the batch normalization scale and bias parameters (in the original paper bias is referred to as beta and scale as gamma).

estimatedMean, estimatedVariance

Inputs. Mean and variance tensors (these have the same descriptor as the bias and scale). The resultRunningMean and resultRunningVariance, accumulated during the training phase from the cudnnBatchNormalizationForwardTraining() call, should be passed as inputs here.

epsilon

Input. Epsilon value used in the batch normalization formula. Its value should be equal to or greater than the value defined for CUDNN_BN_MIN_EPSILON in cudnn.h.

#### Supported configurations

This function supports the following combinations of data types for various descriptors.

Table 12. Supported configurations
Data Type Configurations xDesc bnScaleBiasMeanVarDesc alpha, beta yDesc
INT8_CONFIG CUDNN_DATA_INT8 CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_INT8
PSEUDO_HALF_CONFIG CUDNN_DATA_HALF CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_HALF
FLOAT_CONFIG CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT
DOUBLE_CONFIG CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE
BFLOAT16_CONFIG CUDNN_DATA_BFLOAT16 CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_BFLOAT16

#### Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

At least one of the following conditions are met:

• One of the pointers alpha, beta, x, y, bnScale, bnBias, estimatedMean, estimatedInvVariance is NULL.
• The number of xDesc or yDesc tensor descriptor dimensions is not within the range of [4,5] (only 4D and 5D tensors are supported.)
• bnScaleBiasMeanVarDesc dimensions are not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for per-activation mode.
• epsilon value is less than CUDNN_BN_MIN_EPSILON.
• Dimensions or data types mismatch for xDesc, yDesc.

### 3.2.4. cudnnCopyAlgorithmDescriptor()

This function has been deprecated in cuDNN 8.0.

### 3.2.5. cudnnCreate()

cudnnStatus_t cudnnCreate(cudnnHandle_t *handle)

This function initializes the cuDNN library and creates a handle to an opaque structure holding the cuDNN library context. It allocates hardware resources on the host and device and must be called prior to making any other cuDNN library calls.

The cuDNN library handle is tied to the current CUDA device (context). To use the library on multiple devices, one cuDNN handle needs to be created for each device.

For a given device, multiple cuDNN handles with different configurations (for example, different current CUDA streams) may be created. Because cudnnCreate() allocates some internal resources, the release of those resources by calling cudnnDestroy() will implicitly call cudaDeviceSynchronize; therefore, the recommended best practice is to call cudnnCreate/cudnnDestroy outside of performance-critical code paths.

For multithreaded applications that use the same device from different threads, the recommended programming model is to create one (or a few, as is convenient) cuDNN handle(s) per thread and use that cuDNN handle for the entire life of the thread.

#### Parameters

handle

Output. Pointer to pointer where to store the address to the allocated cuDNN handle. For more information, refer to cudnnHandle_t.

#### Returns

Invalid (NULL) input pointer supplied.

CUDNN_STATUS_NOT_INITIALIZED

No compatible GPU found, CUDA driver not installed or disabled, CUDA runtime API initialization failed.

CUDNN_STATUS_ARCH_MISMATCH

NVIDIA GPU architecture is too old.

CUDNN_STATUS_ALLOC_FAILED

Host memory allocation failed.

CUDNN_STATUS_INTERNAL_ERROR

CUDA resource allocation failed.

cuDNN license validation failed (only when the feature is enabled).

CUDNN_STATUS_SUCCESS

cuDNN handle was created successfully.

### 3.2.6. cudnnCreateActivationDescriptor()

cudnnStatus_t cudnnCreateActivationDescriptor(
cudnnActivationDescriptor_t   *activationDesc)

This function creates an activation descriptor object by allocating the memory needed to hold its opaque structure. For more information, refer to cudnnActivationDescriptor_t.

#### Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 3.2.7. cudnnCreateAlgorithmDescriptor()

This function has been deprecated in cuDNN 8.0.

cudnnStatus_t cudnnCreateAlgorithmDescriptor(
cudnnAlgorithmDescriptor_t *algoDesc)

This function creates an algorithm descriptor object by allocating the memory needed to hold its opaque structure.

#### Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 3.2.8. cudnnCreateAlgorithmPerformance()

cudnnStatus_t cudnnCreateAlgorithmPerformance(
cudnnAlgorithmPerformance_t *algoPerf,
int                         numberToCreate)

This function creates multiple algorithm performance objects by allocating the memory needed to hold their opaque structures.

#### Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 3.2.9. cudnnCreateDropoutDescriptor()

cudnnStatus_t cudnnCreateDropoutDescriptor(
cudnnDropoutDescriptor_t    *dropoutDesc)

This function creates a generic dropout descriptor object by allocating the memory needed to hold its opaque structure. For more information, refer to cudnnDropoutDescriptor_t.

#### Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 3.2.10. cudnnCreateFilterDescriptor()

cudnnStatus_t cudnnCreateFilterDescriptor(
cudnnFilterDescriptor_t *filterDesc)

This function creates a filter descriptor object by allocating the memory needed to hold its opaque structure. For more information, refer to cudnnFilterDescriptor_t.

#### Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 3.2.11. cudnnCreateLRNDescriptor()

cudnnStatus_t cudnnCreateLRNDescriptor(
cudnnLRNDescriptor_t    *poolingDesc)

This function allocates the memory needed to hold the data needed for LRN and DivisiveNormalization layers operation and returns a descriptor used with subsequent layer forward and backward calls.

#### Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### cudnnCreateOpTensorDescriptor()

cudnnStatus_t cudnnCreateOpTensorDescriptor(
cudnnOpTensorDescriptor_t* 	opTensorDesc)

This function creates a tensor pointwise math descriptor. For more information, refer to cudnnOpTensorDescriptor_t.

#### Parameters

opTensorDesc

Output. Pointer to the structure holding the description of the tensor pointwise math such as add, multiply, and more.

#### Returns

CUDNN_STATUS_SUCCESS

The function returned successfully.

Tensor pointwise math descriptor passed to the function is invalid.

CUDNN_STATUS_ALLOC_FAILED

Memory allocation for this tensor pointwise math descriptor failed.

### 3.2.13. cudnnCreatePoolingDescriptor()

cudnnStatus_t cudnnCreatePoolingDescriptor(
cudnnPoolingDescriptor_t    *poolingDesc)

This function creates a pooling descriptor object by allocating the memory needed to hold its opaque structure.

#### Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 3.2.14. cudnnCreateReduceTensorDescriptor()

cudnnStatus_t cudnnCreateReduceTensorDescriptor(
cudnnReduceTensorDescriptor_t*	reduceTensorDesc)

This function creates a reduced tensor descriptor object by allocating the memory needed to hold its opaque structure.

#### Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

reduceTensorDesc is a NULL pointer.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 3.2.15. cudnnCreateSpatialTransformerDescriptor()

cudnnStatus_t cudnnCreateSpatialTransformerDescriptor(
cudnnSpatialTransformerDescriptor_t *stDesc)

This function creates a generic spatial transformer descriptor object by allocating the memory needed to hold its opaque structure.

#### Returns

CUDNN_STATUS_SUCCESS

The object was created successfully.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

### 3.2.16. cudnnCreateTensorDescriptor()

cudnnStatus_t cudnnCreateTensorDescriptor(
cudnnTensorDescriptor_t *tensorDesc)

This function creates a generic tensor descriptor object by allocating the memory needed to hold its opaque structure. The data is initialized to all zeros.

#### Parameters

tensorDesc

Output. Pointer to pointer where the address to the allocated tensor descriptor object should be stored.

#### Returns

Invalid input argument.

CUDNN_STATUS_ALLOC_FAILED

The resources could not be allocated.

CUDNN_STATUS_SUCCESS

The object was created successfully.

### 3.2.17. cudnnCreateTensorTransformDescriptor()

cudnnStatus_t cudnnCreateTensorTransformDescriptor(
cudnnTensorTransformDescriptor_t *transformDesc);


This function creates a tensor transform descriptor object by allocating the memory needed to hold its opaque structure. The tensor data is initialized to be all zero. Use the cudnnSetTensorTransformDescriptor() function to initialize the descriptor created by this function.

#### Parameters

transformDesc
Output. A pointer to an uninitialized tensor transform descriptor.

#### Returns

CUDNN_STATUS_SUCCESS
The descriptor object was created successfully.
The transformDesc is NULL.
CUDNN_STATUS_ALLOC_FAILED
The memory allocation failed.

### 3.2.18. cudnnDeriveBNTensorDescriptor()

cudnnStatus_t cudnnDeriveBNTensorDescriptor(
cudnnTensorDescriptor_t         derivedBnDesc,
const cudnnTensorDescriptor_t   xDesc,
cudnnBatchNormMode_t            mode)

This function derives a secondary tensor descriptor for the batch normalization scale, invVariance, bnBias, and bnScale subtensors from the layer's x data descriptor.

Use the tensor descriptor produced by this function as the bnScaleBiasMeanVarDesc parameter for the cudnnBatchNormalizationForwardInference() and cudnnBatchNormalizationForwardTraining() functions, and as the bnScaleBiasDiffDesc parameter in the cudnnBatchNormalizationBackward() function.

The resulting dimensions will be:
• 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for BATCHNORM_MODE_SPATIAL
• 1xCxHxW for 4D and 1xCxDxHxW for 5D for BATCHNORM_MODE_PER_ACTIVATION mode
For HALF input data type the resulting tensor descriptor will have a FLOAT type. For other data types, it will have the same type as the input data.
Note:
• Only 4D and 5D tensors are supported.
• The derivedBnDesc should be first created using cudnnCreateTensorDescriptor().
• xDesc is the descriptor for the layer's x data and has to be set up with proper dimensions prior to calling this function.

#### Parameters

derivedBnDesc

Output. Handle to a previously created tensor descriptor.

xDesc

Input. Handle to a previously created and initialized layer's x data descriptor.

mode

Input. Batch normalization layer mode of operation.

#### Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

Invalid Batch Normalization mode.

### 3.2.19. cudnnDeriveNormTensorDescriptor()

cudnnStatus_t CUDNNWINAPI
cudnnDeriveNormTensorDescriptor(cudnnTensorDescriptor_t derivedNormScaleBiasDesc,
cudnnTensorDescriptor_t derivedNormMeanVarDesc,
const cudnnTensorDescriptor_t xDesc,
cudnnNormMode_t mode,
int groupCnt)


This function derives tensor descriptors for the normalization mean, invariance, normBias, and normScale subtensors from the layer's x data descriptor and norm mode. normalization, mean, and invariance share the same descriptor while bias and scale share the same descriptor.

Use the tensor descriptor produced by this function as the normScaleBiasDesc or normMeanVarDesc parameter for the cudnnNormalizationForwardInference() and cudnnNormalizationForwardTraining() functions, and as the dNormScaleBiasDesc and normMeanVarDesc parameters in the cudnnNormalizationBackward() function.

The resulting dimensions will be:
• 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for CUDNN_NORM_PER_ACTIVATION
• 1xCxHxW for 4D and 1xCxDxHxW for 5D for CUDNN_NORM_PER_CHANNEL mode
For HALF input data type the resulting tensor descriptor will have a FLOAT type. For other data types, it will have the same type as the input data.
• Only 4D and 5D tensors are supported.
• The derivedNormScaleBiasDesc and derivedNormMeanVarDesc should be created first using cudnnCreateTensorDescriptor().
• xDesc is the descriptor for the layer's x data and has to be set up with proper dimensions prior to calling this function.

#### Parameters

derivedNormScaleBiasDesc

Output. Handle to a previously created tensor descriptor.

derivedNormMeanVarDesc

Output. Handle to a previously created tensor descriptor.

xDesc

Input. Handle to a previously created and initialized layer's x data descriptor.

mode

Input. The normalization layer mode of operation.

groupCnt
Input. The number of grouped convolutions. Currently, only 1 is supported.

#### Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

Invalid Batch Normalization mode.

### 3.2.20. cudnnDestroy()

cudnnStatus_t cudnnDestroy(cudnnHandle_t handle)

This function releases the resources used by the cuDNN handle. This function is usually the last call made to cuDNN with a particular handle. Because cudnnCreate() allocates internal resources, the release of those resources by calling cudnnDestroy() will implicitly call cudaDeviceSynchronize; therefore, the recommended best practice is to call cudnnCreate/cudnnDestroy outside of performance-critical code paths.

#### Parameters

handle

Input. The cuDNN handle to be destroyed.

#### Returns

CUDNN_STATUS_SUCCESS

The cuDNN context destruction was successful.

Invalid (NULL) pointer supplied.

### 3.2.21. cudnnDestroyActivationDescriptor()

cudnnStatus_t cudnnDestroyActivationDescriptor(
cudnnActivationDescriptor_t activationDesc)

This function destroys a previously created activation descriptor object.

#### Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 3.2.22. cudnnDestroyAlgorithmDescriptor()

This function has been deprecated in cuDNN 8.0.

cudnnStatus_t cudnnDestroyAlgorithmDescriptor(
cudnnActivationDescriptor_t algorithmDesc)

This function destroys a previously created algorithm descriptor object.

#### Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 3.2.23. cudnnDestroyAlgorithmPerformance()

cudnnStatus_t cudnnDestroyAlgorithmPerformance(
cudnnAlgorithmPerformance_t     algoPerf)

This function destroys a previously created algorithm descriptor object.

#### Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 3.2.24. cudnnDestroyDropoutDescriptor()

cudnnStatus_t cudnnDestroyDropoutDescriptor(
cudnnDropoutDescriptor_t dropoutDesc)

This function destroys a previously created dropout descriptor object.

#### Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 3.2.25. cudnnDestroyFilterDescriptor()

cudnnStatus_t cudnnDestroyFilterDescriptor(
cudnnFilterDescriptor_t filterDesc)

This function destroys a filter object.

#### Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 3.2.26. cudnnDestroyLRNDescriptor()

cudnnStatus_t cudnnDestroyLRNDescriptor(
cudnnLRNDescriptor_t lrnDesc)

This function destroys a previously created LRN descriptor object.

#### Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 3.2.27. cudnnDestroyOpTensorDescriptor()

cudnnStatus_t cudnnDestroyOpTensorDescriptor(
cudnnOpTensorDescriptor_t   opTensorDesc)

This function deletes a tensor pointwise math descriptor object.

#### Parameters

opTensorDesc

Input. Pointer to the structure holding the description of the tensor pointwise math to be deleted.

#### Returns

CUDNN_STATUS_SUCCESS

The function returned successfully.

### 3.2.28. cudnnDestroyPoolingDescriptor()

cudnnStatus_t cudnnDestroyPoolingDescriptor(
cudnnPoolingDescriptor_t poolingDesc)

This function destroys a previously created pooling descriptor object.

#### Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 3.2.29. cudnnDestroyReduceTensorDescriptor()

cudnnStatus_t cudnnDestroyReduceTensorDescriptor(
cudnnReduceTensorDescriptor_t   tensorDesc)

This function destroys a previously created reduce tensor descriptor object. When the input pointer is NULL, this function performs no destroy operation.

#### Parameters

tensorDesc

Input. Pointer to the reduce tensor descriptor object to be destroyed.

#### Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 3.2.30. cudnnDestroySpatialTransformerDescriptor()

cudnnStatus_t cudnnDestroySpatialTransformerDescriptor(
cudnnSpatialTransformerDescriptor_t stDesc)

This function destroys a previously created spatial transformer descriptor object.

#### Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 3.2.31. cudnnDestroyTensorDescriptor()

cudnnStatus_t cudnnDestroyTensorDescriptor(cudnnTensorDescriptor_t tensorDesc)

This function destroys a previously created tensor descriptor object. When the input pointer is NULL, this function performs no destroy operation.

#### Parameters

tensorDesc

Input. Pointer to the tensor descriptor object to be destroyed.

#### Returns

CUDNN_STATUS_SUCCESS

The object was destroyed successfully.

### 3.2.32. cudnnDestroyTensorTransformDescriptor()

cudnnStatus_t cudnnDestroyTensorTransformDescriptor(
cudnnTensorTransformDescriptor_t transformDesc);


Destroys a previously created tensor transform descriptor.

#### Parameters

transformDesc
Input. The tensor transform descriptor to be destroyed.

#### Returns

CUDNN_STATUS_SUCCESS
The descriptor was destroyed successfully.

### 3.2.33. cudnnDivisiveNormalizationForward()

cudnnStatus_t cudnnDivisiveNormalizationForward(
cudnnHandle_t                    handle,
cudnnLRNDescriptor_t             normDesc,
cudnnDivNormMode_t               mode,
const void                      *alpha,
const cudnnTensorDescriptor_t    xDesc,
const void                      *x,
const void                      *means,
void                            *temp,
void                            *temp2,
const void                      *beta,
const cudnnTensorDescriptor_t    yDesc,
void                            *y)
This function performs the forward spatial DivisiveNormalization layer computation. It divides every value in a layer by the standard deviation of its spatial neighbors as described in What is the Best Multi-Stage Architecture for Object Recognition, Jarrett 2009, Local Contrast Normalization Layer section. Note that DivisiveNormalization only implements the x/max(c, sigma_x) portion of the computation, where sigma_x is the variance over the spatial neighborhood of x. The full LCN (Local Contrastive Normalization) computation can be implemented as a two-step process:
x_m = x-mean(x);
y = x_m/max(c, sigma(x_m));


The x-mean(x) which is often referred to as "subtractive normalization" portion of the computation can be implemented using cuDNN average pooling layer followed by a call to addTensor.

Note: Supported tensor formats are NCHW for 4D and NCDHW for 5D with any non-overlapping non-negative strides. Only 4D and 5D tensors are supported.

#### Parameters

handle

Input. Handle to a previously created cuDNN library descriptor.

normDesc

Input. Handle to a previously initialized LRN parameter descriptor. This descriptor is used for both LRN and DivisiveNormalization layers.

divNormMode

Input. DivisiveNormalization layer mode of operation. Currently only CUDNN_DIVNORM_PRECOMPUTED_MEANS is implemented. Normalization is performed using the means input tensor that is expected to be precomputed by the user.

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

xDesc, yDesc

Input. Tensor descriptor objects for the input and output tensors. Note that xDesc is shared between x, means, temp, and temp2 tensors.

x

Input. Input tensor data pointer in device memory.

means

Input. Input means tensor data pointer in device memory. Note that this tensor can be NULL (in that case its values are assumed to be zero during the computation). This tensor also doesn't have to contain means, these can be any values, a frequently used variation is a result of convolution with a normalized positive kernel (such as Gaussian).

temp, temp2

Workspace. Temporary tensors in device memory. These are used for computing intermediate values during the forward pass. These tensors do not have to be preserved as inputs from forward to the backward pass. Both use xDesc as their descriptor.

y

Output. Pointer in device memory to a tensor for the result of the forward DivisiveNormalization computation.

#### Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

At least one of the following conditions are met:

• One of the tensor pointers x, y, temp, temp2 is NULL.
• Number of input tensor or output tensor dimensions is outside of [4,5] range.
• A mismatch in dimensions between any two of the input or output tensors.
• For in-place computation when pointers x == y, a mismatch in strides between the input data and output data tensors.
• Alpha or beta pointer is NULL.
• LRN descriptor parameters are outside of their valid ranges.
• Any of the tensor strides are negative.
CUDNN_STATUS_UNSUPPORTED

The function does not support the provided configuration, for example, any of the input and output tensor strides mismatch (for the same dimension) is a non-supported configuration.

### 3.2.34. cudnnDropoutForward()

cudnnStatus_t cudnnDropoutForward(
cudnnHandle_t                       handle,
const cudnnDropoutDescriptor_t      dropoutDesc,
const cudnnTensorDescriptor_t       xdesc,
const void                         *x,
const cudnnTensorDescriptor_t       ydesc,
void                               *y,
void                               *reserveSpace,
size_t                              reserveSpaceSizeInBytes)

This function performs forward dropout operation over x returning results in y. If dropout was used as a parameter to cudnnSetDropoutDescriptor(), the approximate dropout fraction of x values will be replaced by a 0, and the rest will be scaled by 1/(1-dropout). This function should not be running concurrently with another cudnnDropoutForward() function using the same states.

Note:
• Better performance is obtained for fully packed tensors.
• This function should not be called during inference.

#### Parameters

handle

Input. Handle to a previously created cuDNN context.

dropoutDesc

Input. Previously created dropout descriptor object.

xDesc

Input. Handle to a previously initialized tensor descriptor.

x

Input. Pointer to data of the tensor described by the xDesc descriptor.

yDesc

Input. Handle to a previously initialized tensor descriptor.

y

Output. Pointer to data of the tensor described by the yDesc descriptor.

reserveSpace

Output. Pointer to user-allocated GPU memory used by this function. It is expected that the contents of reserveSpace does not change between cudnnDropoutForward() and cudnnDropoutBackward() calls.

reserveSpaceSizeInBytes

Input. Specifies the size in bytes of the provided memory for the reserve space.

#### Returns

CUDNN_STATUS_SUCCESS

The call was successful.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

At least one of the following conditions are met:

• The number of elements of input tensor and output tensors differ.
• The datatype of the input tensor and output tensors differs.
• The strides of the input tensor and output tensors differ and in-place operation is used (meaning, x and y pointers are equal).
• The provided reserveSpaceSizeInBytes is less than the value returned by cudnnDropoutGetReserveSpaceSize().
• cudnnSetDropoutDescriptor() has not been called on dropoutDesc with the non-NULLstates argument.
CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 3.2.35. cudnnDropoutGetReserveSpaceSize()

cudnnStatus_t cudnnDropoutGetReserveSpaceSize(
cudnnTensorDescriptor_t     xDesc,
size_t                     *sizeInBytes)

This function is used to query the amount of reserve needed to run dropout with the input dimensions given by xDesc. The same reserve space is expected to be passed to cudnnDropoutForward() and cudnnDropoutBackward(), and its contents is expected to remain unchanged between cudnnDropoutForward() and cudnnDropoutBackward() calls.

#### Parameters

xDesc

Input. Handle to a previously initialized tensor descriptor, describing input to a dropout operation.

sizeInBytes

Output. Amount of GPU memory needed as reserve space to be able to run dropout with an input tensor descriptor specified by xDesc.

#### Returns

CUDNN_STATUS_SUCCESS

The query was successful.

### 3.2.36. cudnnDropoutGetStatesSize()

cudnnStatus_t cudnnDropoutGetStatesSize(
cudnnHandle_t       handle,
size_t             *sizeInBytes)

This function is used to query the amount of space required to store the states of the random number generators used by cudnnDropoutForward() function.

#### Parameters

handle

Input. Handle to a previously created cuDNN context.

sizeInBytes

Output. Amount of GPU memory needed to store random generator states.

#### Returns

CUDNN_STATUS_SUCCESS

The query was successful.

### 3.2.37. cudnnGetActivationDescriptor()

cudnnStatus_t cudnnGetActivationDescriptor(
const cudnnActivationDescriptor_t   activationDesc,
cudnnActivationMode_t              *mode,
cudnnNanPropagation_t              *reluNanOpt,
double                             *coef)

This function queries a previously initialized generic activation descriptor object.

#### Parameters

activationDesc

Input. Handle to a previously created activation descriptor.

mode

Output. Enumerant to specify the activation mode.

reluNanOpt

Output. Enumerant to specify the Nan propagation mode.

coef

Output. Floating point number to specify the clipping threshold when the activation mode is set to CUDNN_ACTIVATION_CLIPPED_RELU or to specify the alpha coefficient when the activation mode is set to CUDNN_ACTIVATION_ELU.

#### Returns

CUDNN_STATUS_SUCCESS

The object was queried successfully.

### 3.2.38. cudnnGetActivationDescriptorSwishBeta()

cudnnStatus_t
cudnnGetActivationDescriptorSwishBeta(cudnnActivationDescriptor_t
activationDesc, double* swish_beta)


This function queries the current beta parameter set for SWISH activation.

#### Parameters

activationDesc

Input. Handle to a previously created activation descriptor.

swish_beta

Output. Pointer to a double value that will receive the currently configured SWISH beta parameter.

#### Returns

CUDNN_STATUS_SUCCESS

The beta parameter was queried successfully.

At least one of activationDesc or swish_beta were NULL.

### 3.2.39. cudnnGetAlgorithmDescriptor()

This function has been deprecated in cuDNN 8.0.

cudnnStatus_t cudnnGetAlgorithmDescriptor(
const cudnnAlgorithmDescriptor_t    algoDesc,
cudnnAlgorithm_t                    *algorithm)

This function queries a previously initialized generic algorithm descriptor object.

#### Parameters

algorithmDesc

Input. Handle to a previously created algorithm descriptor.

algorithm

Input. Struct to specify the algorithm.

#### Returns

CUDNN_STATUS_SUCCESS

The object was queried successfully.

### 3.2.40. cudnnGetAlgorithmPerformance()

This function has been deprecated in cuDNN 8.0.

cudnnStatus_t cudnnGetAlgorithmPerformance(
const cudnnAlgorithmPerformance_t   algoPerf,
cudnnAlgorithmDescriptor_t*         algoDesc,
cudnnStatus_t*                      status,
float*                              time,
size_t*                             memory)

This function queries a previously initialized generic algorithm performance object.

#### Parameters

algoPerf

Input/Output. Handle to a previously created algorithm performance object.

algoDesc

Output. The algorithm descriptor which the performance results describe.

status

Output. The cuDNN status returned from running the algoDesc algorithm.

timecoef

Output. The GPU time spent running the algoDesc algorithm.

memory

Output. The GPU memory needed to run the algoDesc algorithm.

#### Returns

CUDNN_STATUS_SUCCESS

The object was queried successfully.

### 3.2.41. cudnnGetAlgorithmSpaceSize()

This function has been deprecated in cuDNN 8.0.

cudnnStatus_t cudnnGetAlgorithmSpaceSize(
cudnnHandle_t               handle,
cudnnAlgorithmDescriptor_t  algoDesc,
size_t*                     algoSpaceSizeInBytes)

This function queries for the amount of host memory needed to call cudnnSaveAlgorithm(), much like the “get workspace size” function query for the amount of device memory needed.

#### Parameters

handle

Input. Handle to a previously created cuDNN context.

algoDesc

Input. A previously created algorithm descriptor.

algoSpaceSizeInBytes

Output. Amount of host memory needed as a workspace to be able to save the metadata from the specified algoDesc.

#### Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

At least one of the arguments is NULL.

### 3.2.42. cudnnGetCallback()

cudnnStatus_t cudnnGetCallback(
void                **udata,
cudnnCallback_t     fptr)

This function queries the internal states of cuDNN error reporting functionality.

#### Parameters

Output. Pointer to the address where the current internal error reporting message bit mask will be outputted.

udata

Output. Pointer to the address where the current internally stored udata address will be stored.

fptr

Output. Pointer to the address where the current internally stored callback function pointer will be stored. When the built-in default callback function is used, NULL will be outputted.

#### Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

If any of the input parameters are NULL.

### 3.2.43. cudnnGetCudartVersion()

size_t cudnnGetCudartVersion()

The same version of a given cuDNN library can be compiled against different CUDA toolkit versions. This routine returns the CUDA toolkit version that the currently used cuDNN library has been compiled against.

### 3.2.44. cudnnGetDropoutDescriptor()

cudnnStatus_t cudnnGetDropoutDescriptor(
cudnnDropoutDescriptor_t    dropoutDesc,
cudnnHandle_t               handle,
float                      *dropout,
void                       **states,
unsigned long long         *seed)

This function queries the fields of a previously initialized dropout descriptor.

#### Parameters

dropoutDesc

Input. Previously initialized dropout descriptor.

handle

Input. Handle to a previously created cuDNN context.

dropout

Output. The probability with which the value from input is set to 0 during the dropout layer.

states

Output. Pointer to user-allocated GPU memory that holds random number generator states.

seed

Output. Seed used to initialize random number generator states.

#### Returns

CUDNN_STATUS_SUCCESS

The call was successful.

One or more of the arguments was an invalid pointer.

### 3.2.45. cudnnGetErrorString()

const char * cudnnGetErrorString(cudnnStatus_t status)

This function converts the cuDNN status code to a NULL terminated (ASCIIZ) static string. For example, when the input argument is CUDNN_STATUS_SUCCESS, the returned string is CUDNN_STATUS_SUCCESS. When an invalid status value is passed to the function, the returned string is CUDNN_UNKNOWN_STATUS.

#### Parameters

status

Input. cuDNN enumerant status code.

#### Returns

Pointer to a static, NULL terminated string with the status name.

### 3.2.46. cudnnGetFilter4dDescriptor()

cudnnStatus_t cudnnGetFilter4dDescriptor(
const cudnnFilterDescriptor_t     filterDesc,
cudnnDataType_t            *dataType,
cudnnTensorFormat_t        *format,
int                        *k,
int                        *c,
int                        *h,
int                        *w)

This function queries the parameters of the previously initialized filter descriptor object.

#### Parameters

filterDesc

Input. Handle to a previously created filter descriptor.

datatype

Output. Data type.

format

Output. Type of format.

k

Output. Number of output feature maps.

c

Output. Number of input feature maps.

h

Output. Height of each filter.

w

Output. Width of each filter.

#### Returns

CUDNN_STATUS_SUCCESS

The object was set successfully.

### 3.2.47. cudnnGetFilterNdDescriptor()

cudnnStatus_t cudnnGetFilterNdDescriptor(
const cudnnFilterDescriptor_t   wDesc,
int                             nbDimsRequested,
cudnnDataType_t                *dataType,
cudnnTensorFormat_t            *format,
int                            *nbDims,
int                             filterDimA[])

This function queries a previously initialized filter descriptor object.

#### Parameters

wDesc

Input. Handle to a previously initialized filter descriptor.

nbDimsRequested

Input. Dimension of the expected filter descriptor. It is also the minimum size of the arrays filterDimA in order to be able to hold the results

datatype

Output. Data type.

format

Output. Type of format.

nbDims

Output. Actual dimension of the filter.

filterDimA

Output. Array of dimensions of at least nbDimsRequested that will be filled with the filter parameters from the provided filter descriptor.

#### Returns

CUDNN_STATUS_SUCCESS

The object was set successfully.

The parameter nbDimsRequested is negative.

### 3.2.48. cudnnGetFilterSizeInBytes()

cudnnStatus_t
cudnnGetFilterSizeInBytes(const cudnnFilterDescriptor_t filterDesc, size_t *size) ;


This function returns the size of the filter tensor in memory with respect to the given descriptor. It can be used to know the amount of GPU memory to be allocated to hold that filter tensor.

#### Parameters

filterDesc

Input. handle to a previously initialized filter descriptor.

size

Output. size in bytes needed to hold the tensor in GPU memory.

#### Returns

CUDNN_STATUS_SUCCESS

filterDesc is valid.

filerDesc is invald.

### 3.2.49. cudnnGetLRNDescriptor()

cudnnStatus_t cudnnGetLRNDescriptor(
cudnnLRNDescriptor_t    normDesc,
unsigned               *lrnN,
double                 *lrnAlpha,
double                 *lrnBeta,
double                 *lrnK)

This function retrieves values stored in the previously initialized LRN descriptor object.

#### Parameters

normDesc

Input. Handle to a previously created LRN descriptor.

lrnN, lrnAlpha, lrnBeta, lrnK

Output. Pointers to receive values of parameters stored in the descriptor object. For more information, refer to cudnnSetLRNDescriptor(). Any of these pointers can be NULL (no value is returned for the corresponding parameter).

#### Returns

CUDNN_STATUS_SUCCESS

Function completed successfully.

### 3.2.50. cudnnGetOpTensorDescriptor()

cudnnStatus_t cudnnGetOpTensorDescriptor(
const cudnnOpTensorDescriptor_t opTensorDesc,
cudnnOpTensorOp_t               *opTensorOp,
cudnnDataType_t                 *opTensorCompType,
cudnnNanPropagation_t           *opTensorNanOpt)

This function returns the configuration of the passed tensor pointwise math descriptor.

#### Parameters

opTensorDesc

Input. Tensor pointwise math descriptor passed to get the configuration from.

opTensorOp

Output. Pointer to the tensor pointwise math operation type, associated with this tensor pointwise math descriptor.

opTensorCompType

Output. Pointer to the cuDNN data-type associated with this tensor pointwise math descriptor.

opTensorNanOpt

Output. Pointer to the NAN propagation option associated with this tensor pointwise math descriptor.

#### Returns

CUDNN_STATUS_SUCCESS

The function returned successfully.

Input tensor pointwise math descriptor passed is invalid.

### 3.2.51. cudnnGetPooling2dDescriptor()

cudnnStatus_t cudnnGetPooling2dDescriptor(
const cudnnPoolingDescriptor_t      poolingDesc,
cudnnPoolingMode_t                 *mode,
cudnnNanPropagation_t              *maxpoolingNanOpt,
int                                *windowHeight,
int                                *windowWidth,
int                                *verticalStride,
int                                *horizontalStride)

This function queries a previously created 2D pooling descriptor object.

#### Parameters

poolingDesc

Input. Handle to a previously created pooling descriptor.

mode

Output. Enumerant to specify the pooling mode.

maxpoolingNanOpt

Output. Enumerant to specify the Nan propagation mode.

windowHeight

Output. Height of the pooling window.

windowWidth

Output. Width of the pooling window.

verticalStride

Output. Pooling vertical stride.

horizontalStride

Output. Pooling horizontal stride.

#### Returns

CUDNN_STATUS_SUCCESS

The object was set successfully.

### 3.2.52. cudnnGetPooling2dForwardOutputDim()

cudnnStatus_t cudnnGetPooling2dForwardOutputDim(
const cudnnPoolingDescriptor_t      poolingDesc,
const cudnnTensorDescriptor_t       inputDesc,
int                                *outN,
int                                *outC,
int                                *outH,
int                                *outW)

This function provides the output dimensions of a tensor after 2d pooling has been applied.

Each dimension h and w of the output images is computed as follows:
    outputDim = 1 + (inputDim + 2*padding - windowDim)/poolingStride;


#### Parameters

poolingDesc

Input. Handle to a previously initialized pooling descriptor.

inputDesc

Input. Handle to the previously initialized input tensor descriptor.

outN

Output. Number of images in the output.

outC

Output. Number of channels in the output.

outH

Output. Height of images in the output.

outW

Output. Width of images in the output.

#### Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

At least one of the following conditions are met:

• poolingDesc has not been initialized.
• poolingDesc or inputDesc has an invalid number of dimensions (2 and 4 respectively are required).

### 3.2.53. cudnnGetPoolingNdDescriptor()

cudnnStatus_t cudnnGetPoolingNdDescriptor(
const cudnnPoolingDescriptor_t      poolingDesc,
int                                 nbDimsRequested,
cudnnPoolingMode_t                 *mode,
cudnnNanPropagation_t              *maxpoolingNanOpt,
int                                *nbDims,
int                                 windowDimA[],
int                                 strideA[])

This function queries a previously initialized generic pooling descriptor object.

#### Parameters

poolingDesc

Input. Handle to a previously created pooling descriptor.

nbDimsRequested

Input. Dimension of the expected pooling descriptor. It is also the minimum size of the arrays windowDimA, paddingA, and strideA in order to be able to hold the results.

mode

Output. Enumerant to specify the pooling mode.

maxpoolingNanOpt

Output. Enumerant to specify the Nan propagation mode.

nbDims

Output. Actual dimension of the pooling descriptor.

windowDimA

Output. Array of dimension of at least nbDimsRequested that will be filled with the window parameters from the provided pooling descriptor.

Output. Array of dimension of at least nbDimsRequested that will be filled with the padding parameters from the provided pooling descriptor.

strideA

Output. Array of dimension at least nbDimsRequested that will be filled with the stride parameters from the provided pooling descriptor.

#### Returns

CUDNN_STATUS_SUCCESS

The object was queried successfully.

CUDNN_STATUS_NOT_SUPPORTED

The parameter nbDimsRequested is greater than CUDNN_DIM_MAX.

### 3.2.54. cudnnGetPoolingNdForwardOutputDim()

cudnnStatus_t cudnnGetPoolingNdForwardOutputDim(
const cudnnPoolingDescriptor_t  poolingDesc,
const cudnnTensorDescriptor_t   inputDesc,
int                             nbDims,
int                             outDimA[])

This function provides the output dimensions of a tensor after Nd pooling has been applied.

Each dimension of the (nbDims-2)-D images of the output tensor is computed as follows:
    outputDim = 1 + (inputDim + 2*padding - windowDim)/poolingStride;


#### Parameters

poolingDesc

Input. Handle to a previously initialized pooling descriptor.

inputDesc

Input. Handle to the previously initialized input tensor descriptor.

nbDims

Input. Number of dimensions in which pooling is to be applied.

outDimA

Output. Array of nbDims output dimensions.

#### Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

At least one of the following conditions are met:

• poolingDesc has not been initialized.
• The value of nbDims is inconsistent with the dimensionality of poolingDesc and inputDesc.

### 3.2.55. cudnnGetProperty()

cudnnStatus_t cudnnGetProperty(
libraryPropertyType     type,
int                    *value)

This function writes a specific part of the cuDNN library version number into the provided host storage.

#### Parameters

type

Input. Enumerant type that instructs the function to report the numerical value of the cuDNN major version, minor version, or the patch level.

value

Output. Host pointer where the version information should be written.

#### Returns

CUDNN_STATUS_INVALID_VALUE

Invalid value of the type argument.

CUDNN_STATUS_SUCCESS

Version information was stored successfully at the provided address.

### 3.2.56. cudnnGetReduceTensorDescriptor()

cudnnStatus_t cudnnGetReduceTensorDescriptor(
const cudnnReduceTensorDescriptor_t reduceTensorDesc,
cudnnReduceTensorOp_t               *reduceTensorOp,
cudnnDataType_t                     *reduceTensorCompType,
cudnnNanPropagation_t               *reduceTensorNanOpt,
cudnnReduceTensorIndices_t          *reduceTensorIndices,
cudnnIndicesType_t                  *reduceTensorIndicesType)

This function queries a previously initialized reduce tensor descriptor object.

#### Parameters

reduceTensorDesc

Input. Pointer to a previously initialized reduce tensor descriptor object.

reduceTensorOp

Output. Enumerant to specify the reduce tensor operation.

reduceTensorCompType

Output. Enumerant to specify the computation datatype of the reduction.

reduceTensorNanOpt

Output. Enumerant to specify the Nan propagation mode.

reduceTensorIndices

Output. Enumerant to specify the reduced tensor indices.

reduceTensorIndicesType

Output. Enumerant to specify the reduce tensor indices type.

#### Returns

CUDNN_STATUS_SUCCESS

The object was queried successfully.

reduceTensorDesc is NULL.

### 3.2.57. cudnnGetReductionIndicesSize()

cudnnStatus_t cudnnGetReductionIndicesSize(
cudnnHandle_t                       handle,
const cudnnReduceTensorDescriptor_t reduceDesc,
const cudnnTensorDescriptor_t       cDesc,
size_t                              *sizeInBytes)

This is a helper function to return the minimum size of the index space to be passed to the reduction given the input and output tensors.

#### Parameters

handle

Input. Handle to a previously created cuDNN library descriptor.

reduceDesc

Input. Pointer to a previously initialized reduce tensor descriptor object.

Input. Pointer to the input tensor descriptor.

cDesc

Input. Pointer to the output tensor descriptor.

sizeInBytes

Output. Minimum size of the index space to be passed to the reduction.

#### Returns

CUDNN_STATUS_SUCCESS

The index space size is returned successfully.

### 3.2.58. cudnnGetReductionWorkspaceSize()

cudnnStatus_t cudnnGetReductionWorkspaceSize(
cudnnHandle_t                       handle,
const cudnnReduceTensorDescriptor_t reduceDesc,
const cudnnTensorDescriptor_t       cDesc,
size_t                              *sizeInBytes)

This is a helper function to return the minimum size of the workspace to be passed to the reduction given the input and output tensors.

#### Parameters

handle

Input. Handle to a previously created cuDNN library descriptor.

reduceDesc

Input. Pointer to a previously initialized reduce tensor descriptor object.

Input. Pointer to the input tensor descriptor.

cDesc

Input. Pointer to the output tensor descriptor.

sizeInBytes

Output. Minimum size of the index space to be passed to the reduction.

#### Returns

CUDNN_STATUS_SUCCESS

The workspace size is returned successfully.

### 3.2.59. cudnnGetStream()

cudnnStatus_t cudnnGetStream(
cudnnHandle_t   handle,
cudaStream_t   *streamId)

This function retrieves the user CUDA stream programmed in the cuDNN handle. When the user's CUDA stream is not set in the cuDNN handle, this function reports the null-stream.

#### Parameters

handle

Input. Pointer to the cuDNN handle.

streamID

Output. Pointer where the current CUDA stream from the cuDNN handle should be stored.

#### Returns

Invalid (NULL) handle.

CUDNN_STATUS_SUCCESS

The stream identifier was retrieved successfully.

### 3.2.60. cudnnGetTensor4dDescriptor()

cudnnStatus_t cudnnGetTensor4dDescriptor(
const cudnnTensorDescriptor_t  tensorDesc,
cudnnDataType_t         *dataType,
int                     *n,
int                     *c,
int                     *h,
int                     *w,
int                     *nStride,
int                     *cStride,
int                     *hStride,
int                     *wStride)

This function queries the parameters of the previously initialized tensor4D descriptor object.

#### Parameters

tensorDesc

Input. Handle to a previously initialized tensor descriptor.

datatype

Output. Data type.

n

Output. Number of images.

c

Output. Number of feature maps per image.

h

Output. Height of each feature map.

w

Output. Width of each feature map.

nStride

Output. Stride between two consecutive images.

cStride

Output. Stride between two consecutive feature maps.

hStride

Output. Stride between two consecutive rows.

wStride

Output. Stride between two consecutive columns.

#### Returns

CUDNN_STATUS_SUCCESS

The operation succeeded.

### 3.2.61. cudnnGetTensorNdDescriptor()

cudnnStatus_t cudnnGetTensorNdDescriptor(
const cudnnTensorDescriptor_t   tensorDesc,
int                             nbDimsRequested,
cudnnDataType_t                *dataType,
int                            *nbDims,
int                             dimA[],
int                             strideA[])

This function retrieves values stored in a previously initialized tensor descriptor object.

#### Parameters

tensorDesc

Input. Handle to a previously initialized tensor descriptor.

nbDimsRequested

Input. Number of dimensions to extract from a given tensor descriptor. It is also the minimum size of the arrays dimA and strideA. If this number is greater than the resulting nbDims[0], only nbDims[0] dimensions will be returned.

datatype

Output. Data type.

nbDims

Output. Actual number of dimensions of the tensor will be returned in nbDims[0].

dimA

Output. Array of dimensions of at least nbDimsRequested that will be filled with the dimensions from the provided tensor descriptor.

strideA

Output. Array of dimensions of at least nbDimsRequested that will be filled with the strides from the provided tensor descriptor.

#### Returns

CUDNN_STATUS_SUCCESS

The results were returned successfully.

Either tensorDesc or nbDims pointer is NULL.

### 3.2.62. cudnnGetTensorSizeInBytes()

cudnnStatus_t cudnnGetTensorSizeInBytes(
const cudnnTensorDescriptor_t   tensorDesc,
size_t                         *size)

This function returns the size of the tensor in memory in respect to the given descriptor. This function can be used to know the amount of GPU memory to be allocated to hold that tensor.

#### Parameters

tensorDesc

Input. Handle to a previously initialized tensor descriptor.

size

Output. Size in bytes needed to hold the tensor in GPU memory.

#### Returns

CUDNN_STATUS_SUCCESS

The results were returned successfully.

### 3.2.63. cudnnGetTensorTransformDescriptor()

cudnnStatus_t cudnnGetTensorTransformDescriptor(
cudnnTensorTransformDescriptor_t transformDesc,
uint32_t nbDimsRequested,
cudnnTensorFormat_t *destFormat,
uint32_t foldA[],
cudnnFoldingDirection_t *direction);


This function returns the values stored in a previously initialized tensor transform descriptor.

#### Parameters

transformDesc
Input. A previously initialized tensor transform descriptor.
nbDimsRequested
Input. The number of dimensions to consider. For more information, refer to Tensor Descriptor in the cuDNN Developer Guide.
destFormat
Output. The transform format that will be returned.
Output. An array filled with the amount of padding to add before each dimension. The dimension of this padBeforeA[] parameter is equal to nbDimsRequested.
Output. An array filled with the amount of padding to add after each dimension. The dimension of this padBeforeA[] parameter is equal to nbDimsRequested.
foldA[]
Output. An array that was filled with the folding parameters for each spatial dimension. The dimension of this foldA[] array is nbDimsRequested-2.
direction
Output. The setting that selects folding or unfolding. For more information, refer to cudnnFoldingDirection_t.

#### Returns

CUDNN_STATUS_SUCCESS
The results were obtained successfully.
If transformDesc is NULL or if nbDimsRequested is less than 3 or greater than CUDNN_DIM_MAX.

### 3.2.64. cudnnGetVersion()

size_t cudnnGetVersion()

This function returns the version number of the cuDNN library. It returns the CUDNN_VERSION defined present in the cudnn.h header file. Starting with release R2, the routine can be used to identify dynamically the current cuDNN library used by the application. The defined CUDNN_VERSION can be used to have the same application linked against different cuDNN versions using conditional compilation statements.

### 3.2.65. cudnnInitTransformDest()

cudnnStatus_t cudnnInitTransformDest(
const cudnnTensorTransformDescriptor_t transformDesc,
const cudnnTensorDescriptor_t srcDesc,
cudnnTensorDescriptor_t destDesc,
size_t *destSizeInBytes);

This function initializes and returns a destination tensor descriptor destDesc for tensor transform operations. The initialization is done with the desired parameters described in the transform descriptor cudnnTensorDescriptor_t.
Note: The returned tensor descriptor will be packed.

#### Parameters

transformDesc
Input. Handle to a previously initialized tensor transform descriptor.
srcDesc
Input. Handle to a previously initialized tensor descriptor.
destDesc
Output. Handle of the tensor descriptor that will be initialized and returned.
destSizeInBytes
Output. A pointer to hold the size, in bytes, of the new tensor.

#### Returns

CUDNN_STATUS_SUCCESS
The tensor descriptor was initialized successfully.
If either srcDesc or destDesc is NULL, or if the tensor descriptor’s nbDims is incorrect. For more information, refer to Tensor Descriptor in the cuDNN Developer Guide.
CUDNN_STATUS_NOT_SUPPORTED
If the provided configuration is not 4D.
CUDNN_STATUS_EXECUTION_FAILED
Function failed to launch on the GPU.

### 3.2.66. cudnnLRNCrossChannelForward()

cudnnStatus_t cudnnLRNCrossChannelForward(
cudnnHandle_t                    handle,
cudnnLRNDescriptor_t             normDesc,
cudnnLRNMode_t                   lrnMode,
const void                      *alpha,
const cudnnTensorDescriptor_t    xDesc,
const void                      *x,
const void                      *beta,
const cudnnTensorDescriptor_t    yDesc,
void                            *y)

This function performs the forward LRN layer computation.

Note: Supported formats are: positive-strided, NCHW and NHWC for 4D x and y, and only NCDHW DHW-packed for 5D (for both x and y). Only non-overlapping 4D and 5D tensors are supported. NCHW layout is preferred for performance.

#### Parameters

handle

Input. Handle to a previously created cuDNN library descriptor.

normDesc

Input. Handle to a previously initialized LRN parameter descriptor.

lrnMode

Input. LRN layer mode of operation. Currently only CUDNN_LRN_CROSS_CHANNEL_DIM1 is implemented. Normalization is performed along the tensor's dimA[1].

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

xDesc, yDesc

Input. Tensor descriptor objects for the input and output tensors.

x

Input. Input tensor data pointer in device memory.

y

Output. Output tensor data pointer in device memory.

#### Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

At least one of the following conditions are met:

• One of the tensor pointers x, y is NULL.
• Number of input tensor dimensions is 2 or less.
• LRN descriptor parameters are outside of their valid ranges.
• One of the tensor parameters is 5D but is not in NCDHW DHW-packed format.
CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration. See the following for some examples of non-supported configurations:

• Any of the input tensor datatypes is not the same as any of the output tensor datatype.
• x and y tensor dimensions mismatch.
• Any tensor parameters strides are negative.

### 3.2.67. cudnnNormalizationForwardInference()

cudnnStatus_t
cudnnNormalizationForwardInference(cudnnHandle_t handle,
cudnnNormMode_t mode,
cudnnNormOps_t normOps,
cudnnNormAlgo_t algo,
const void *alpha,
const void *beta,
const cudnnTensorDescriptor_t xDesc,
const void *x,
const cudnnTensorDescriptor_t normScaleBiasDesc,
const void *normScale,
const void *normBias,
const cudnnTensorDescriptor_t normMeanVarDesc,
const void *estimatedMean,
const void *estimatedVariance,
const cudnnTensorDescriptor_t zDesc,
const void *z,
cudnnActivationDescriptor_t activationDesc,
const cudnnTensorDescriptor_t yDesc,
void *y,
double epsilon,
int groupCnt);
This function performs the forward normalization layer computation for the inference phase. Per-channel normalization layer is based on the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe, C. Szegedy, 2015.
Note:
• Only 4D and 5D tensors are supported.
• The input transformation performed by this function is defined as:
y = beta*y + alpha *[normBias + (normScale * (x-estimatedMean)/sqrt(epsilon + estimatedVariance)]
• The epsilon value has to be the same during training, backpropagation, and inference.
• For the training phase, refer to cudnnNormalizationForwardTraining().
• Higher performance can be obtained when HW-packed tensors are used for all of x and y.

#### Parameters

handle

Input. Handle to a previously created cuDNN library descriptor. For more information, refer to cudnnHandle_t.

mode

Input. Mode of operation (per-channel or per-activation). For more information, refer to cudnnNormMode_t.

normOps

Input. Mode of post-operative. Currently, CUDNN_NORM_OPS_NORM_ACTIVATION and CUDNN_NORM_OPS_NORM_ADD_ACTIVATION are not supported.

algo

alpha, beta
Inputs. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

xDesc, yDesc

Input. Handles to the previously initialized tensor descriptors.

*x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc, for the layer’s x input data.

*y

Output. Data pointer to GPU memory associated with the tensor descriptor yDesc, for the y output of the normalization layer.

zDesc, *z

Input. Tensor descriptors and pointers in device memory for residual addition to the result of the normalization operation, prior to the activation. zDesc and *z are optional and are only used when normOps is CUDNN_NORM_OPS_NORM_ADD_ACTIVATION, otherwise users may pass NULL. When in use, z should have exactly the same dimension as x and the final output y. For more information, refer to cudnnTensorDescriptor_t.

Since normOps is only supported for CUDNN_NORM_OPS_NORM, we can set these to NULL for now.

normScaleBiasDesc, normScale, normBias

Inputs. Tensor descriptors and pointers in device memory for the normalization scale and bias parameters (in the original paper bias is referred to as beta and scale as gamma).

normMeanVarDesc, estimatedMean, estimatedVariance

Inputs. Mean and variance tensors and their tensor descriptors. The estimatedMean and estimatedVariance inputs, accumulated during the training phase from the cudnnNormalizationForwardTraining() call, should be passed as inputs here.

activationDesc

Input. Descriptor for the activation operation. When the normOps input is set to either CUDNN_NORM_OPS_NORM_ACTIVATION or CUDNN_NORM_OPS_NORM_ADD_ACTIVATION then this activation is used, otherwise the user may pass NULL. Since normOps is only supported for CUDNN_NORM_OPS_NORM, we can set these to NULL for now.

epsilon

Input. Epsilon value used in the normalization formula. Its value should be equal to or greater than zero.

groupCnt
Input. The number of grouped convolutions. Currently, only 1 is supported.

#### Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_NOT_SUPPORTED

A compute or data type other than what is supported was chosen, or an unknown algorithm type was chosen.

At least one of the following conditions are met:

• One of the pointers alpha, beta, x, y, normScale, normBias, estimatedMean, and estimatedInvVariance is NULL.
• The number of xDesc or yDesc tensor descriptor dimensions is not within the range of [4,5] (only 4D and 5D tensors are supported).
• normScaleBiasDesc and normMeanVarDesc dimensions are not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for per-channel, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for per-activation mode.
• epsilon value is less than zero.
• Dimensions or data types mismatch for xDesc and yDesc.
CUDNN_STATUS_NOT_SUPPORTED

A compute or data type other than FLOAT was chosen, or an unknown algorithm type was chosen.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 3.2.68. cudnnOpsInferVersionCheck()

cudnnStatus_t cudnnOpsInferVersionCheck(void)

This function is the first of a series of corresponding functions that check for consistent library versions among DLL files for different modules.

#### Returns

CUDNN_STATUS_SUCCESS

The version of this DLL file is consistent with cuDNN DLLs on which it depends.

CUDNN_STATUS_VERSION_MISMATCH

The version of this DLL file does not match that of a cuDNN DLLs on which it depends.

### 3.2.69. cudnnOpTensor()

cudnnStatus_t cudnnOpTensor(
cudnnHandle_t                     handle,
const cudnnOpTensorDescriptor_t   opTensorDesc,
const void                       *alpha1,
const void                       *A,
const void                       *alpha2,
const cudnnTensorDescriptor_t     bDesc,
const void                       *B,
const void                       *beta,
const cudnnTensorDescriptor_t     cDesc,
void                             *C)

This function implements the equation C = op(alpha1[0] * A, alpha2[0] * B) + beta[0] * C, given the tensors A, B, and C and the scaling factors alpha1, alpha2, and beta. The op to use is indicated by the descriptor cudnnOpTensorDescriptor_t, meaning, the type of opTensorDesc. Currently-supported ops are listed by the cudnnOpTensorOp_t enum.

The following restrictions on the input and destination tensors apply:
• Each dimension of the input tensor A must match the corresponding dimension of the destination tensor C, and each dimension of the input tensor B must match the corresponding dimension of the destination tensor C or must be equal to 1. In the latter case, the same value from the input tensor B for those dimensions will be used to blend into the C tensor.
• The data types of the input tensors A and B, and the destination tensor C, must satisfy Table 13.
Table 13. Supported Datatypes
opTensorCompType in opTensorDesc A B C (destination)
FLOAT FLOAT FLOAT FLOAT
FLOAT INT8 INT8 FLOAT
FLOAT HALF HALF FLOAT
FLOAT BFLOAT16 BFLOAT16 FLOAT
DOUBLE DOUBLE DOUBLE DOUBLE
FLOAT FLOAT FLOAT HALF
FLOAT HALF HALF HALF
FLOAT INT8 INT8 INT8
FLOAT FLOAT FLOAT INT8
FLOAT FLOAT FLOAT BFLOAT16
FLOAT BFLOAT16 BFLOAT16 BFLOAT16
Note:CUDNN_TENSOR_NCHW_VECT_C is not supported as input tensor format. All tensors up to dimension five (5) are supported. This routine does not support tensor formats beyond these dimensions.

#### Parameters

handle

Input. Handle to a previously created cuDNN context.

opTensorDesc

Input. Handle to a previously initialized op tensor descriptor.

alpha1, alpha2, beta
Input. Pointers to scaling factors (in host memory) used to blend the source value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

Input. Handle to a previously initialized tensor descriptor.

A, B

Input. Pointer to data of the tensors described by the aDesc and bDesc descriptors, respectively.

C

Input/Output. Pointer to data of the tensor described by the cDesc descriptor.

#### Returns

CUDNN_STATUS_SUCCESS

The function executed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration. See the following for some examples of non-supported configurations:

• The dimensions of the bias tensor and the output tensor dimensions are above 5.
• opTensorCompType is not set as stated above.

The data type of the destination tensor C is unrecognized, or the restrictions on the input and destination tensors, stated above, are not met.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 3.2.70. cudnnPoolingForward()

cudnnStatus_t cudnnPoolingForward(
cudnnHandle_t                    handle,
const cudnnPoolingDescriptor_t   poolingDesc,
const void                      *alpha,
const cudnnTensorDescriptor_t    xDesc,
const void                      *x,
const void                      *beta,
const cudnnTensorDescriptor_t    yDesc,
void                            *y)

This function computes pooling of input values (meaning, the maximum or average of several adjacent values) to produce an output with smaller height and/or width.

Note:
• All tensor formats are supported, best performance is expected when using HW-packed tensors. Only 2 and 3 spatial dimensions are allowed. Vectorized tensors are only supported if they have 2 spatial dimensions.
• The dimensions of the output tensor yDesc can be smaller or bigger than the dimensions advised by the routine cudnnGetPooling2dForwardOutputDim() or cudnnGetPoolingNdForwardOutputDim().
• For average pooling, the compute type is float even for integer input and output data type. Output round is nearest-even and clamp to the most negative or most positive value of type if out of range.

#### Parameters

handle

Input. Handle to a previously created cuDNN context.

poolingDesc

Input. Handle to a previously initialized pooling descriptor.

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

xDesc

Input. Handle to the previously initialized input tensor descriptor. Must be of type FLOAT, DOUBLE, HALF, INT8, INT8x4, INT8x32, or BFLOAT16. For more information, refer to cudnnDataType_t.

x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc.

yDesc

Input. Handle to the previously initialized output tensor descriptor. Must be of type FLOAT, DOUBLE, HALF, INT8, INT8x4, INT8x32, or BFLOAT16. For more information, refer to cudnnDataType_t.

y

Output. Data pointer to GPU memory associated with the output tensor descriptor yDesc.

#### Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

At least one of the following conditions are met:

• The dimensions n, c of the input tensor and output tensors differ.
• The datatype of the input tensor and output tensors differs.
CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 3.2.71. cudnnQueryRuntimeError()

cudnnStatus_t cudnnQueryRuntimeError(
cudnnHandle_t            handle,
cudnnStatus_t           *rstatus,
cudnnErrQueryMode_t      mode,
cudnnRuntimeTag_t       *tag)

cuDNN library functions perform extensive input argument checking before launching GPU kernels. The last step is to verify that the GPU kernel actually started. When a kernel fails to start, CUDNN_STATUS_EXECUTION_FAILED is returned by the corresponding API call. Typically, after a GPU kernel starts, no runtime checks are performed by the kernel itself - numerical results are simply written to output buffers.

When the CUDNN_BATCHNORM_SPATIAL_PERSISTENT mode is selected in cudnnBatchNormalizationForwardTraining() or cudnnBatchNormalizationBackward(), the algorithm may encounter numerical overflows where CUDNN_BATCHNORM_SPATIAL performs just fine albeit at a slower speed. The user can invoke cudnnQueryRuntimeError() to make sure numerical overflows did not occur during the kernel execution. Those issues are reported by the kernel that performs computations.

cudnnQueryRuntimeError() can be used in polling and blocking software control flows. There are two polling modes (CUDNN_ERRQUERY_RAWCODE and CUDNN_ERRQUERY_NONBLOCKING) and one blocking mode CUDNN_ERRQUERY_BLOCKING.

CUDNN_ERRQUERY_RAWCODE reads the error storage location regardless of the kernel completion status. The kernel might not even start and the error storage (allocated per cuDNN handle) might be used by an earlier call.

CUDNN_ERRQUERY_NONBLOCKING checks if all tasks in the user stream are completed. The cudnnQueryRuntimeError() function will return immediately and report CUDNN_STATUS_RUNTIME_IN_PROGRESS in rstatus if some tasks in the user stream are pending. Otherwise, the function will copy the remote kernel error code to rstatus.

In the blocking mode (CUDNN_ERRQUERY_BLOCKING), the function waits for all tasks to drain in the user stream before reporting the remote kernel error code. The blocking flavor can be further adjusted by calling cudaSetDeviceFlags with the cudaDeviceScheduleSpin, cudaDeviceScheduleYield, or cudaDeviceScheduleBlockingSync flag.

CUDNN_ERRQUERY_NONBLOCKING and CUDNN_ERRQUERY_BLOCKING modes should not be used when the user stream is changed in the cuDNN handle, meaning, cudnnSetStream() is invoked between functions that report runtime kernel errors and the cudnnQueryRuntimeError() function.

The remote error status reported in rstatus can be set to: CUDNN_STATUS_SUCCESS, CUDNN_STATUS_RUNTIME_IN_PROGRESS, or CUDNN_STATUS_RUNTIME_FP_OVERFLOW. The remote kernel error is automatically cleared by cudnnQueryRuntimeError().

Note: The cudnnQueryRuntimeError() function should be used in conjunction with cudnnBatchNormalizationForwardTraining() and cudnnBatchNormalizationBackward() when the cudnnBatchNormMode_t argument is CUDNN_BATCHNORM_SPATIAL_PERSISTENT.

#### Parameters

handle

Input. Handle to a previously created cuDNN context.

rstatus

Output. Pointer to the user's error code storage.

mode

Input. Remote error query mode.

tag

Input/Output. Currently, this argument should be NULL.

#### Returns

CUDNN_STATUS_SUCCESS

No errors detected (rstatus holds a valid value).

Invalid input argument.

CUDNN_STATUS_INTERNAL_ERROR

A stream blocking synchronization or a non-blocking stream query failed.

CUDNN_STATUS_MAPPING_ERROR

The device cannot access zero-copy memory to report kernel errors.

### 3.2.72. cudnnReduceTensor()

cudnnStatus_t cudnnReduceTensor(
cudnnHandle_t                           handle,
const cudnnReduceTensorDescriptor_t     reduceTensorDesc,
void                                   *indices,
size_t                                  indicesSizeInBytes,
void                                   *workspace,
size_t                                  workspaceSizeInBytes,
const void                             *alpha,
const void                             *A,
const void                             *beta,
const cudnnTensorDescriptor_t           cDesc,
void                                   *C)

This function reduces tensor A by implementing the equation C = alpha * reduce op ( A ) + beta * C, given tensors A and C and scaling factors alpha and beta. The reduction op to use is indicated by the descriptor reduceTensorDesc. Currently-supported ops are listed by the cudnnReduceTensorOp_t enum.

Each dimension of the output tensor C must match the corresponding dimension of the input tensor A or must be equal to 1. The dimensions equal to 1 indicate the dimensions of A to be reduced.

The implementation will generate indices for the min and max ops only, as indicated by the cudnnReduceTensorIndices_t enum of the reduceTensorDesc. Requesting indices for the other reduction ops results in an error. The data type of the indices is indicated by the cudnnIndicesType_t enum; currently only the 32-bit (unsigned int) type is supported.

The indices returned by the implementation are not absolute indices but relative to the dimensions being reduced. The indices are also flattened, meaning, not coordinate tuples.

The data types of the tensors A and C must match if of type double. In this case, alpha and beta and the computation enum of reduceTensorDesc are all assumed to be of type double.

The HALF and INT8 data types may be mixed with the FLOAT data types. In these cases, the computation enum of reduceTensorDesc is required to be of type FLOAT.
Note:

Up to dimension 8, all tensor formats are supported. Beyond those dimensions, this routine is not supported.

#### Parameters

handle

Input. Handle to a previously created cuDNN context.

reduceTensorDesc

Input. Handle to a previously initialized reduce tensor descriptor.

indices

Output. Handle to a previously allocated space for writing indices.

indicesSizeInBytes

Input. Size of the above previously allocated space.

workspace

Input. Handle to a previously allocated space for the reduction implementation.

workspaceSizeInBytes

Input. Size of the above previously allocated space.

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the source value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

Input. Handle to a previously initialized tensor descriptor.

A

Input. Pointer to data of the tensor described by the aDesc descriptor.

C

Input/Output. Pointer to data of the tensor described by the cDesc descriptor.

#### Returns

CUDNN_STATUS_SUCCESS

The function executed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration. See the following for some examples of non-supported configurations:

• The dimensions of the input tensor and the output tensor are above 8.
• reduceTensorCompType is not set as stated above.

The corresponding dimensions of the input and output tensors all match, or the conditions in the above paragraphs are unmet.

CUDNN_INVALID_VALUE

The allocations for the indices or workspace are insufficient.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 3.2.73. cudnnRestoreAlgorithm()

This function has been deprecated in cuDNN 8.0.

cudnnStatus_t cudnnRestoreAlgorithm(
cudnnHandle_t               handle,
void*                       algoSpace,
size_t                      algoSpaceSizeInBytes,
cudnnAlgorithmDescriptor_t  algoDesc)

This function reads algorithm metadata from the host memory space provided by the user in algoSpace, allowing the user to use the results of RNN finds from previous cuDNN sessions.

#### Parameters

handle

Input. Handle to a previously created cuDNN context.

algoDesc

Input. A previously created algorithm descriptor.

algoSpace

Input. Pointer to the host memory to be read.

algoSpaceSizeInBytes

Input. Amount of host memory needed as a workspace to be able to hold the metadata from the specified algoDesc.

#### Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_NOT_SUPPORTED

The metadata is from a different cuDNN version.

At least one of the following conditions is met:

• One of the arguments is NULL.

### 3.2.74. cudnnRestoreDropoutDescriptor()

cudnnStatus_t cudnnRestoreDropoutDescriptor(
cudnnDropoutDescriptor_t dropoutDesc,
cudnnHandle_t            handle,
float                    dropout,
void                    *states,
size_t                   stateSizeInBytes,
unsigned long long       seed)

This function restores a dropout descriptor to a previously saved-off state.

#### Parameters

dropoutDesc

Input/Output. Previously created dropout descriptor.

handle

Input. Handle to a previously created cuDNN context.

dropout

Input. Probability with which the value from an input tensor is set to 0 when performing dropout.

states

Input. Pointer to GPU memory that holds random number generator states initialized by a prior call to cudnnSetDropoutDescriptor().

stateSizeInBytes

Input. Size in bytes of buffer holding random number generator states.

seed

Input. Seed used in prior calls to cudnnSetDropoutDescriptor() that initialized states buffer. Using a different seed from this has no effect. A change of seed, and subsequent update to random number generator states can be achieved by calling cudnnSetDropoutDescriptor().

#### Returns

CUDNN_STATUS_SUCCESS

The call was successful.

CUDNN_STATUS_INVALID_VALUE

States buffer size (as indicated in stateSizeInBytes) is too small.

### 3.2.75. cudnnSaveAlgorithm()

This function has been deprecated in cuDNN 8.0.

cudnnStatus_t cudnnSaveAlgorithm(
cudnnHandle_t          		handle,
cudnnAlgorithmDescriptor_t 	algoDesc,
void*                  		algoSpace
size_t                 		algoSpaceSizeInBytes)

This function writes algorithm metadata into the host memory space provided by the user in algoSpace, allowing the user to preserve the results of RNN finds after cuDNN exits.

#### Parameters

handle

Input. Handle to a previously created cuDNN context.

algoDesc

Input. A previously created algorithm descriptor.

algoSpace

Input. Pointer to the host memory to be written.

algoSpaceSizeInBytes

Input. Amount of host memory needed as a workspace to be able to save the metadata from the specified algoDesc.

#### Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

At least one of the following conditions is met:

• One of the arguments is NULL.
• algoSpaceSizeInBytes is too small.

### 3.2.76. cudnnScaleTensor()

cudnnStatus_t cudnnScaleTensor(
cudnnHandle_t                   handle,
const cudnnTensorDescriptor_t   yDesc,
void                           *y,
const void                     *alpha)

This function scales all the elements of a tensor by a given factor.

#### Parameters

handle

Input. Handle to a previously created cuDNN context.

yDesc

Input. Handle to a previously initialized tensor descriptor.

y

Input/Output. Pointer to data of the tensor described by the yDesc descriptor.

alpha

Input. Pointer in the host memory to a single value that all elements of the tensor will be scaled with. For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

#### Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

One of the provided pointers is nil.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 3.2.77. cudnnSetActivationDescriptor()

cudnnStatus_t cudnnSetActivationDescriptor(
cudnnActivationDescriptor_t         activationDesc,
cudnnActivationMode_t               mode,
cudnnNanPropagation_t               reluNanOpt,
double                              coef)

This function initializes a previously created generic activation descriptor object.

#### Parameters

activationDesc

Input/Output. Handle to a previously created activation descriptor.

mode

Input. Enumerant to specify the activation mode.

reluNanOpt

Input. Enumerant to specify the Nan propagation mode.

coef

Input. Floating point number. When the activation mode (refer to cudnnActivationMode_t) is set to CUDNN_ACTIVATION_CLIPPED_RELU, this input specifies the clipping threshold; and when the activation mode is set to CUDNN_ACTIVATION_RELU, this input specifies the upper bound.

#### Returns

CUDNN_STATUS_SUCCESS

The object was set successfully.

mode or reluNanOpt has an invalid enumerant value.

### 3.2.78. cudnnSetActivationDescriptorSwishBeta()

cudnnStatus_t cudnnSetActivationDescriptorSwishBeta(cudnnActivationDescriptor_t activationDesc, double swish_beta)

This function sets the beta parameter of the SWISH activation function to swish_beta.

#### Parameters

activationDesc

Input/Output. Handle to a previously created activation descriptor.

swish_beta

Input. The value to set the SWISH activations' beta parameter to.

#### Returns

CUDNN_STATUS_SUCCESS

The value was set successfully.

The activation descriptor is a NULL pointer.

### 3.2.79. cudnnSetAlgorithmDescriptor()

This function has been deprecated in cuDNN 8.0.

cudnnStatus_t cudnnSetAlgorithmDescriptor(
cudnnAlgorithmDescriptor_t      algorithmDesc,
cudnnAlgorithm_t                algorithm)

This function initializes a previously created generic algorithm descriptor object.

#### Parameters

algorithmDesc

Input/Output. Handle to a previously created algorithm descriptor.

algorithm

Input. Struct to specify the algorithm.

#### Returns

CUDNN_STATUS_SUCCESS

The object was set successfully.

### 3.2.80. cudnnSetAlgorithmPerformance()

This function has been deprecated in cuDNN 8.0.

cudnnStatus_t cudnnSetAlgorithmPerformance(
cudnnAlgorithmPerformance_t	    algoPerf,
cudnnAlgorithmDescriptor_t      algoDesc,
cudnnStatus_t                   status,
float                           time,
size_t                          memory)

This function initializes a previously created generic algorithm performance object.

#### Parameters

algoPerf

Input/Output. Handle to a previously created algorithm performance object.

algoDesc

Input. The algorithm descriptor which the performance results describe.

status

Input. The cuDNN status returned from running the algoDesc algorithm.

time

Input. The GPU time spent running the algoDesc algorithm.

memory

Input. The GPU memory needed to run the algoDesc algorithm.

#### Returns

CUDNN_STATUS_SUCCESS

The object was set successfully.

mode or reluNanOpt has an invalid enumerate value.

### 3.2.81. cudnnSetCallback()

cudnnStatus_t cudnnSetCallback(
void                *udata,
cudnnCallback_t     fptr)

This function sets the internal states of cuDNN error reporting functionality.

#### Parameters

Input. An unsigned integer. The four least significant bits (LSBs) of this unsigned integer are used for switching on and off the different levels of error reporting messages. This applies for both the default callbacks, and for the customized callbacks. The bit position is in correspondence with the enum of cudnnSeverity_t. The user may utilize the predefined macros CUDNN_SEV_ERROR_EN, CUDNN_SEV_WARNING_EN, and CUDNN_SEV_INFO_EN to form the bit mask. When a bit is set to 1, the corresponding message channel is enabled.

For example, when bit 3 is set to 1, the API logging is enabled. Currently, only the log output of level CUDNN_SEV_INFO is functional; the others are not yet implemented. When used for turning on and off the logging with the default callback, the user may pass NULL to udata and fptr. In addition, the environment variable CUDNN_LOGDEST_DBG must be set. For more information, refer to Backward Compatibility and Deprecation Policy in the cuDNN Developer Guide.
• CUDNN_SEV_INFO_EN= 0b1000 (functional).
• CUDNN_SEV_ERROR_EN= 0b0010 (not yet functional).
• CUDNN_SEV_WARNING_EN= 0b0100 (not yet functional).

The output of CUDNN_SEV_FATAL is always enabled and cannot be disabled.

udata

Input. A pointer provided by the user. This pointer will be passed to the user’s custom logging callback function. The data it points to will not be read, nor be changed by cuDNN. This pointer may be used in many ways, such as in a mutex or in a communication socket for the user’s callback function for logging. If the user is utilizing the default callback function, or doesn’t want to use this input in the customized callback function, they may pass in NULL.

fptr
Input. A pointer to a user-supplied callback function. When NULL is passed to this pointer, then cuDNN switches back to the built-in default callback function. The user-supplied callback function prototype must be similar to the following (also defined in the header file):
void customizedLoggingCallback (cudnnSeverity_t sev, void *udata, const cudnnDebug_t *dbg, const char *msg);
• The structure cudnnDebug_t is defined in the header file. It provides the metadata, such as time, time since start, stream ID, process and thread ID, that the user may choose to print or store in their customized callback.
• The variable msg is the logging message generated by cuDNN. Each line of this message is terminated by \0, and the end of the message is terminated by \0\0. Users may select what is necessary to show in the log, and may reformat the string.

#### Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

### 3.2.82. cudnnSetDropoutDescriptor()

cudnnStatus_t cudnnSetDropoutDescriptor(
cudnnDropoutDescriptor_t    dropoutDesc,
cudnnHandle_t               handle,
float                       dropout,
void                       *states,
size_t                      stateSizeInBytes,
unsigned long long          seed)

This function initializes a previously created dropout descriptor object. If the states argument is equal to NULL, then the random number generator states won't be initialized, and only the dropout value will be set. The user is expected not to change the memory pointed at by states for the duration of the computation.

When the states argument is not NULL, a cuRAND initialization kernel is invoked by cudnnSetDropoutDescriptor(). This kernel requires a substantial amount of GPU memory for the stack. Memory is released when the kernel finishes. The CUDNN_STATUS_ALLOC_FAILED status is returned when no sufficient free memory is available for the GPU stack.

#### Parameters

dropoutDesc

Input/Output. Previously created dropout descriptor object.

handle

Input. Handle to a previously created cuDNN context.

dropout

Input. The probability with which the value from input is set to zero during the dropout layer.

states

Output. Pointer to user-allocated GPU memory that will hold random number generator states.

stateSizeInBytes

Input. Specifies the size in bytes of the provided memory for the states.

seed

Input. Seed used to initialize random number generator states.

#### Returns

CUDNN_STATUS_SUCCESS

The call was successful.

CUDNN_STATUS_INVALID_VALUE

The sizeInBytes argument is less than the value returned by cudnnDropoutGetStatesSize().

CUDNN_STATUS_ALLOC_FAILED

The function failed to temporarily extend the GPU stack.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

CUDNN_STATUS_INTERNAL_ERROR

Internally used CUDA functions returned an error status.

### 3.2.83. cudnnSetFilter4dDescriptor()

cudnnStatus_t cudnnSetFilter4dDescriptor(
cudnnFilterDescriptor_t    filterDesc,
cudnnDataType_t            dataType,
cudnnTensorFormat_t        format,
int                        k,
int                        c,
int                        h,
int                        w)

This function initializes a previously created filter descriptor object into a 4D filter. The layout of the filters must be contiguous in memory.

Tensor format CUDNN_TENSOR_NHWC has limited support in cudnnConvolutionForward(), cudnnConvolutionBackwardData(), and cudnnConvolutionBackwardFilter().

#### Parameters

filterDesc

Input/Output. Handle to a previously created filter descriptor.

datatype

Input. Data type.

format
Input.Type of the filter layout format. If this input is set to CUDNN_TENSOR_NCHW, which is one of the enumerant values allowed by cudnnTensorFormat_t descriptor, then the layout of the filter is in the form of KCRS, where:
• K represents the number of output feature maps
• C is the number of input feature maps
• R is the number of rows per filter
• S is the number of columns per filter

If this input is set to CUDNN_TENSOR_NHWC, then the layout of the filter is in the form of KRSC. For more information, refer to cudnnTensorFormat_t.

k

Input. Number of output feature maps.

c

Input. Number of input feature maps.

h

Input. Height of each filter.

w

Input. Width of each filter.

#### Returns

CUDNN_STATUS_SUCCESS

The object was set successfully.

At least one of the parameters k, c, h, w is negative or dataType or format has an invalid enumerant value.

### 3.2.84. cudnnSetFilterNdDescriptor()

cudnnStatus_t cudnnSetFilterNdDescriptor(
cudnnFilterDescriptor_t filterDesc,
cudnnDataType_t         dataType,
cudnnTensorFormat_t     format,
int                     nbDims,
const int               filterDimA[])

This function initializes a previously created filter descriptor object. The layout of the filters must be contiguous in memory.

The tensor format CUDNN_TENSOR_NHWC has limited support in cudnnConvolutionForward(), cudnnConvolutionBackwardData(), and cudnnConvolutionBackwardFilter().

#### Parameters

filterDesc

Input/Output. Handle to a previously created filter descriptor.

datatype

Input. Data type.

format

Input.Type of the filter layout format. If this input is set to CUDNN_TENSOR_NCHW, which is one of the enumerant values allowed by cudnnTensorFormat_t descriptor, then the layout of the filter is as follows:

• For N=4, a 4D filter descriptor, the filter layout is in the form of KCRS:
• K represents the number of output feature maps
• C is the number of input feature maps
• R is the number of rows per filter
• S is the number of columns per filter
• For N=3, a 3D filter descriptor, the number S (number of columns per filter) is omitted.
• For N=5 and greater, the layout of the higher dimensions immediately follows RS.

On the other hand, if this input is set to CUDNN_TENSOR_NHWC, then the layout of the filter is as follows:

• For N=4, a 4D filter descriptor, the filter layout is in the form of KRSC.
• For N=3, a 3D filter descriptor, the number S (number of columns per filter) is omitted and the layout of C immediately follows R.
• For N=5 and greater, the layout of the higher dimensions are inserted between S and C. For more information, refer to cudnnTensorFormat_t.
nbDims

Input. Dimension of the filter.

filterDimA

Input. Array of dimension nbDims containing the size of the filter for each dimension.

#### Returns

CUDNN_STATUS_SUCCESS

The object was set successfully.

At least one of the elements of the array filterDimA is negative or dataType or format has an invalid enumerant value.

CUDNN_STATUS_NOT_SUPPORTED

The parameter nbDims exceeds CUDNN_DIM_MAX.

### 3.2.85. cudnnSetLRNDescriptor()

cudnnStatus_t cudnnSetLRNDescriptor(
cudnnLRNDescriptor_t   normDesc,
unsigned               lrnN,
double                 lrnAlpha,
double                 lrnBeta,
double                 lrnK)

This function initializes a previously created LRN descriptor object.

Note:
• Macros CUDNN_LRN_MIN_N, CUDNN_LRN_MAX_N, CUDNN_LRN_MIN_K, CUDNN_LRN_MIN_BETA defined in cudnn.h specify valid ranges for parameters.
• Values of double parameters will be cast down to the tensor datatype during computation.

#### Parameters

normDesc

Output. Handle to a previously created LRN descriptor.

lrnN

Input. Normalization window width in elements. The LRN layer uses a window [center-lookBehind, center+lookAhead], where lookBehind = floor( (lrnN-1)/2 ), lookAhead = lrnN-lookBehind-1. So for n=10, the window is [k-4...k...k+5] with a total of 10 samples. For the DivisiveNormalization layer, the window has the same extent as above in all spatial dimensions (dimA[2], dimA[3], dimA[4]). By default, lrnN is set to 5 in cudnnCreateLRNDescriptor().

lrnAlpha

Input. Value of the alpha variance scaling parameter in the normalization formula. Inside the library code, this value is divided by the window width for LRN and by (window width)^#spatialDimensions for DivisiveNormalization. By default, this value is set to 1e-4 in cudnnCreateLRNDescriptor().

lrnBeta

Input. Value of the beta power parameter in the normalization formula. By default, this value is set to 0.75 in cudnnCreateLRNDescriptor().

lrnK

Input. Value of the k parameter in the normalization formula. By default, this value is set to 2.0.

#### Returns

CUDNN_STATUS_SUCCESS

The object was set successfully.

One of the input parameters was out of valid range as described above.

### 3.2.86. cudnnSetOpTensorDescriptor()

cudnnStatus_t cudnnSetOpTensorDescriptor(
cudnnOpTensorDescriptor_t   opTensorDesc,
cudnnOpTensorOp_t           opTensorOp,
cudnnDataType_t             opTensorCompType,
cudnnNanPropagation_t       opTensorNanOpt)

This function initializes a tensor pointwise math descriptor.

#### Parameters

opTensorDesc

Output. Pointer to the structure holding the description of the tensor pointwise math descriptor.

opTensorOp

Input. Tensor pointwise math operation for this tensor pointwise math descriptor.

opTensorCompType

Input. Computation datatype for this tensor pointwise math descriptor.

opTensorNanOpt

Input. NAN propagation policy.

#### Returns

CUDNN_STATUS_SUCCESS

The function returned successfully.

At least one of the input parameters passed is invalid.

### 3.2.87. cudnnSetPooling2dDescriptor()

cudnnStatus_t cudnnSetPooling2dDescriptor(
cudnnPoolingDescriptor_t    poolingDesc,
cudnnPoolingMode_t          mode,
cudnnNanPropagation_t       maxpoolingNanOpt,
int                         windowHeight,
int                         windowWidth,
int                         verticalStride,
int                         horizontalStride)

This function initializes a previously created generic pooling descriptor object into a 2D description.

#### Parameters

poolingDesc

Input/Output. Handle to a previously created pooling descriptor.

mode

Input. Enumerant to specify the pooling mode.

maxpoolingNanOpt

Input. Enumerant to specify the Nan propagation mode.

windowHeight

Input. Height of the pooling window.

windowWidth

Input. Width of the pooling window.

verticalStride

Input. Pooling vertical stride.

horizontalStride

Input. Pooling horizontal stride.

#### Returns

CUDNN_STATUS_SUCCESS

The object was set successfully.

At least one of the parameters windowHeight, windowWidth, verticalStride, horizontalStride is negative or mode or maxpoolingNanOpt has an invalid enumerate value.

### 3.2.88. cudnnSetPoolingNdDescriptor()

cudnnStatus_t cudnnSetPoolingNdDescriptor(
cudnnPoolingDescriptor_t     poolingDesc,
const cudnnPoolingMode_t     mode,
const cudnnNanPropagation_t  maxpoolingNanOpt,
int                          nbDims,
const int                    windowDimA[],
const int                    strideA[])

This function initializes a previously created generic pooling descriptor object.

#### Parameters

poolingDesc

Input/Output. Handle to a previously created pooling descriptor.

mode

Input. Enumerant to specify the pooling mode.

maxpoolingNanOpt

Input. Enumerant to specify the Nan propagation mode.

nbDims

Input. Dimension of the pooling operation. Must be greater than zero.

windowDimA

Input. Array of dimension nbDims containing the window size for each dimension. The value of array elements must be greater than zero.

Input. Array of dimension nbDims containing the padding size for each dimension. Negative padding is allowed.

strideA

Input. Array of dimension nbDims containing the striding size for each dimension. The value of array elements must be greater than zero (meaning, negative striding size is not allowed).

#### Returns

CUDNN_STATUS_SUCCESS

The object was initialized successfully.

CUDNN_STATUS_NOT_SUPPORTED

If (nbDims > CUDNN_DIM_MAX-2).

Either nbDims, or at least one of the elements of the arrays windowDimA or strideA is negative, or mode or maxpoolingNanOpt has an invalid enumerate value.

### 3.2.89. cudnnSetReduceTensorDescriptor()

cudnnStatus_t cudnnSetReduceTensorDescriptor(
cudnnReduceTensorDescriptor_t   reduceTensorDesc,
cudnnReduceTensorOp_t           reduceTensorOp,
cudnnDataType_t                 reduceTensorCompType,
cudnnNanPropagation_t           reduceTensorNanOpt,
cudnnReduceTensorIndices_t      reduceTensorIndices,
cudnnIndicesType_t              reduceTensorIndicesType)

This function initializes a previously created reduce tensor descriptor object.

#### Parameters

reduceTensorDesc

Input/Output. Handle to a previously created reduce tensor descriptor.

reduceTensorOp

Input. Enumerant to specify the reduce tensor operation.

reduceTensorCompType

Input. Enumerant to specify the computation datatype of the reduction.

reduceTensorNanOpt

Input. Enumerant to specify the Nan propagation mode.

reduceTensorIndices

Input. Enumerant to specify the reduced tensor indices.

reduceTensorIndicesType

Input. Enumerant to specify the reduce tensor indices type.

#### Returns

CUDNN_STATUS_SUCCESS

The object was set successfully.

reduceTensorDesc is NULL (reduceTensorOp, reduceTensorCompType, reduceTensorNanOpt, reduceTensorIndices or reduceTensorIndicesType has an invalid enumerant value).

### 3.2.90. cudnnSetSpatialTransformerNdDescriptor()

cudnnStatus_t cudnnSetSpatialTransformerNdDescriptor(
cudnnSpatialTransformerDescriptor_t     stDesc,
cudnnSamplerType_t                      samplerType,
cudnnDataType_t                         dataType,
const int                               nbDims,
const int                               dimA[])

This function initializes a previously created generic spatial transformer descriptor object.

#### Parameters

stDesc

Input/Output. Previously created spatial transformer descriptor object.

samplerType

Input. Enumerant to specify the sampler type.

dataType

Input. Data type.

nbDims

Input. Dimension of the transformed tensor.

dimA

Input. Array of dimension nbDims containing the size of the transformed tensor for every dimension.

#### Returns

CUDNN_STATUS_SUCCESS

The call was successful.

At least one of the following conditions are met:

• Either stDesc or dimA is NULL.
• Either dataType or samplerType has an invalid enumerant value

### 3.2.91. cudnnSetStream()

cudnnStatus_t cudnnSetStream(
cudnnHandle_t   handle,
cudaStream_t    streamId)

This function sets the user's CUDA stream in the cuDNN handle. The new stream will be used to launch cuDNN GPU kernels or to synchronize to this stream when cuDNN kernels are launched in the internal streams. If the cuDNN library stream is not set, all kernels use the default (NULL) stream. Setting the user stream in the cuDNN handle guarantees the issue-order execution of cuDNN calls and other GPU kernels launched in the same stream.

With CUDA 11.x or later, internal streams have the same priority as the stream set by the last call to this function. In CUDA graph capture mode, CUDA 11.8 or later is required in order for the stream priorities to match.

#### Parameters

handle

Input. Pointer to the cuDNN handle.

streamID

Input. New CUDA stream to be written to the cuDNN handle.

#### Returns

Invalid (NULL) handle.

CUDNN_STATUS_MAPPING_ERROR

Mismatch between the user stream and the cuDNN handle context.

CUDNN_STATUS_SUCCESS

The new stream was set successfully.

### 3.2.92. cudnnSetTensor()

cudnnStatus_t cudnnSetTensor(
cudnnHandle_t                   handle,
const cudnnTensorDescriptor_t   yDesc,
void                           *y,
const void                     *valuePtr)

This function sets all the elements of a tensor to a given value.

#### Parameters

handle

Input. Handle to a previously created cuDNN context.

yDesc

Input. Handle to a previously initialized tensor descriptor.

y

Input/Output. Pointer to data of the tensor described by the yDesc descriptor.

valuePtr

Input. Pointer in host memory to a single value. All elements of the y tensor will be set to value[0]. The data type of the element in value[0] has to match the data type of tensor y.

#### Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

One of the provided pointers is nil.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 3.2.93. cudnnSetTensor4dDescriptor()

cudnnStatus_t cudnnSetTensor4dDescriptor(
cudnnTensorDescriptor_t tensorDesc,
cudnnTensorFormat_t     format,
cudnnDataType_t         dataType,
int                     n,
int                     c,
int                     h,
int                     w)

This function initializes a previously created generic tensor descriptor object into a 4D tensor. The strides of the four dimensions are inferred from the format parameter and set in such a way that the data is contiguous in memory with no padding between dimensions.

Note: The total size of a tensor including the potential padding between dimensions is limited to 2 Giga-elements of type datatype.

#### Parameters

tensorDesc

Input/Output. Handle to a previously created tensor descriptor.

format

Input. Type of format.

datatype

Input. Data type.

n

Input. Number of images.

c

Input. Number of feature maps per image.

h

Input. Height of each feature map.

w

Input. Width of each feature map.

#### Returns

CUDNN_STATUS_SUCCESS

The object was set successfully.

At least one of the parameters n, c, h, w was negative or format has an invalid enumerant value or dataType has an invalid enumerant value.

CUDNN_STATUS_NOT_SUPPORTED

The total size of the tensor descriptor exceeds the maximum limit of 2 Giga-elements.

### 3.2.94. cudnnSetTensor4dDescriptorEx()

cudnnStatus_t cudnnSetTensor4dDescriptorEx(
cudnnTensorDescriptor_t     tensorDesc,
cudnnDataType_t             dataType,
int                         n,
int                         c,
int                         h,
int                         w,
int                         nStride,
int                         cStride,
int                         hStride,
int                         wStride)

This function initializes a previously created generic tensor descriptor object into a 4D tensor, similarly to cudnnSetTensor4dDescriptor() but with the strides explicitly passed as parameters. This can be used to lay out the 4D tensor in any order or simply to define gaps between dimensions.

Note:
• At present, some cuDNN routines have limited support for strides. Those routines will return CUDNN_STATUS_NOT_SUPPORTED if a 4D tensor object with an unsupported stride is used. cudnnTransformTensor() can be used to convert the data to a supported layout.
• The total size of a tensor including the potential padding between dimensions is limited to 2 Giga-elements of type datatype.

#### Parameters

tensorDesc

Input/Output. Handle to a previously created tensor descriptor.

datatype

Input. Data type.

n

Input. Number of images.

c

Input. Number of feature maps per image.

h

Input. Height of each feature map.

w

Input. Width of each feature map.

nStride

Input. Stride between two consecutive images.

cStride

Input. Stride between two consecutive feature maps.

hStride

Input. Stride between two consecutive rows.

wStride

Input. Stride between two consecutive columns.

#### Returns

CUDNN_STATUS_SUCCESS

The object was set successfully.

At least one of the parameters n, c, h, w or nStride, cStride, hStride, wStride is negative or dataType has an invalid enumerant value.

CUDNN_STATUS_NOT_SUPPORTED

The total size of the tensor descriptor exceeds the maximum limit of 2 Giga-elements.

### 3.2.95. cudnnSetTensorNdDescriptor()

cudnnStatus_t cudnnSetTensorNdDescriptor(
cudnnTensorDescriptor_t tensorDesc,
cudnnDataType_t         dataType,
int                     nbDims,
const int               dimA[],
const int               strideA[])

This function initializes a previously created generic tensor descriptor object.

Note: The total size of a tensor including the potential padding between dimensions is limited to 2 Giga-elements of type datatype. Tensors are restricted to having at least 4 dimensions, and at most CUDNN_DIM_MAX dimensions (defined in cudnn.h). When working with lower dimensional data, it is recommended that the user create a 4D tensor, and set the size along unused dimensions to 1.

#### Parameters

tensorDesc

Input/Output. Handle to a previously created tensor descriptor.

datatype

Input. Data type.

nbDims
Input. Dimension of the tensor.
Note: Do not use 2 dimensions. Due to historical reasons, the minimum number of dimensions in the filter descriptor is three. For more information, refer to cudnnGetRNNLinLayerBiasParams().
dimA

Input. Array of dimension nbDims that contain the size of the tensor for every dimension. The size along unused dimensions should be set to 1. By convention, the ordering of dimensions in the array follows the format - [N, C, D, H, W], with W occupying the smallest index in the array.

strideA

Input. Array of dimension nbDims that contain the stride of the tensor for every dimension. By convention, the ordering of the strides in the array follows the format - [Nstride, Cstride, Dstride, Hstride, Wstride], with Wstride occupying the smallest index in the array.

#### Returns

CUDNN_STATUS_SUCCESS

The object was set successfully.

At least one of the elements of the array dimA was negative or zero, or dataType has an invalid enumerant value.

CUDNN_STATUS_NOT_SUPPORTED

The parameter nbDims is outside the range [4, CUDNN_DIM_MAX], or the total size of the tensor descriptor exceeds the maximum limit of 2 Giga-elements.

### 3.2.96. cudnnSetTensorNdDescriptorEx()

cudnnStatus_t cudnnSetTensorNdDescriptorEx(
cudnnTensorDescriptor_t tensorDesc,
cudnnTensorFormat_t     format,
cudnnDataType_t         dataType,
int                     nbDims,
const int               dimA[])

This function initializes an n-D tensor descriptor.

#### Parameters

tensorDesc

Output. Pointer to the tensor descriptor struct to be initialized.

format

Input. Tensor format.

dataType

Input. Tensor data type.

nbDims
Input. Dimension of the tensor.
Note: Do not use 2 dimensions. Due to historical reasons, the minimum number of dimensions in the filter descriptor is three. For more information, refer to cudnnGetRNNLinLayerBiasParams().
dimA

Input. Array containing the size of each dimension.

#### Returns

CUDNN_STATUS_SUCCESS

The function was successful.

Tensor descriptor was not allocated properly; or input parameters are not set correctly.

CUDNN_STATUS_NOT_SUPPORTED

Dimension size requested is larger than maximum dimension size supported.

### 3.2.97. cudnnSetTensorTransformDescriptor()

cudnnStatus_t cudnnSetTensorTransformDescriptor(
cudnnTensorTransformDescriptor_t transformDesc,
const uint32_t nbDims,
const cudnnTensorFormat_t destFormat,
const uint32_t foldA[],
const cudnnFoldingDirection_t direction);


This function initializes a tensor transform descriptor that was previously created using the cudnnCreateTensorTransformDescriptor() function.

#### Parameters

transformDesc

Output. The tensor transform descriptor to be initialized.

nbDims

Input. The dimensionality of the transform operands. Must be greater than 2. For more information, refer to Tensor Descriptor in the cuDNN Developer Guide.

destFormat

Input. The desired destination format.

Input. An array that contains the amount of padding that should be added before each dimension. Set to NULL for no padding.

Input. An array that contains the amount of padding that should be added after each dimension. Set to NULL for no padding.

foldA[]

Input. An array that contains the folding parameters for each spatial dimension (dimensions 2 and up). Set to NULL for no folding.

direction

Input. Selects folding or unfolding. This input has no effect when folding parameters are all <= 1. For more information, refer to cudnnFoldingDirection_t.

#### Returns

CUDNN_STATUS_SUCCESS

The function was launched successfully.

The parameter transformDesc is NULL, or if direction is invalid, or nbDims is <= 2.

CUDNN_STATUS_NOT_SUPPORTED

If the dimension size requested is larger than maximum dimension size supported (meaning, one of the nbDims is larger than CUDNN_DIM_MAX), or if destFromat is something other than NCHW or NHWC.

### 3.2.98. cudnnSoftmaxForward()

cudnnStatus_t cudnnSoftmaxForward(
cudnnHandle_t                    handle,
cudnnSoftmaxAlgorithm_t          algorithm,
cudnnSoftmaxMode_t               mode,
const void                      *alpha,
const cudnnTensorDescriptor_t    xDesc,
const void                      *x,
const void                      *beta,
const cudnnTensorDescriptor_t    yDesc,
void                            *y)
This routine computes the softmax function.
Note: All tensor formats are supported for all modes and algorithms with 4 and 5D tensors. Performance is expected to be highest with NCHW fully-packed tensors. For more than 5 dimensions tensors must be packed in their spatial dimensions.

#### Data Types

This function supports the following data types:
• CUDNN_DATA_FLOAT
• CUDNN_DATA_DOUBLE
• CUDNN_DATA_HALF
• CUDNN_DATA_BFLOAT16
• CUDNN_DATA_INT8

#### Parameters

handle

Input. Handle to a previously created cuDNN context.

algorithm

Input. Enumerant to specify the softmax algorithm.

mode

Input. Enumerant to specify the softmax mode.

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

xDesc

Input. Handle to the previously initialized input tensor descriptor.

x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc.

yDesc

Input. Handle to the previously initialized output tensor descriptor.

y

Output. Data pointer to GPU memory associated with the output tensor descriptor yDesc.

#### Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

At least one of the following conditions are met:

• The dimensions n, c, h, w of the input tensor and output tensors differ.
• The datatype of the input tensor and output tensors differ.
• The parameters algorithm or mode have an invalid enumerant value.
CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 3.2.99. cudnnSpatialTfGridGeneratorForward()

cudnnStatus_t cudnnSpatialTfGridGeneratorForward(
cudnnHandle_t                               handle,
const cudnnSpatialTransformerDescriptor_t   stDesc,
const void                                 *theta,
void                                       *grid)

This function generates a grid of coordinates in the input tensor corresponding to each pixel from the output tensor.

Note: Only 2d transformation is supported.

#### Parameters

handle

Input. Handle to a previously created cuDNN context.

stDesc

Input. Previously created spatial transformer descriptor object.

theta

Input. Affine transformation matrix. It should be of size n*2*3 for a 2d transformation, where n is the number of images specified in stDesc.

grid

Output. A grid of coordinates. It is of size n*h*w*2 for a 2d transformation, where n, h, w is specified in stDesc. In the 4th dimension, the first coordinate is x, and the second coordinate is y.

#### Returns

CUDNN_STATUS_SUCCESS

The call was successful.

At least one of the following conditions are met:

• handle is NULL.
• One of the parameters grid or theta is NULL.
CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration. See the following for some examples of non-supported configurations:

• The dimension of the transformed tensor specified in stDesc > 4.
CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 3.2.100. cudnnSpatialTfSamplerForward()

cudnnStatus_t cudnnSpatialTfSamplerForward(
cudnnHandle_t                              handle,
const cudnnSpatialTransformerDescriptor_t  stDesc,
const void                                *alpha,
const cudnnTensorDescriptor_t              xDesc,
const void                                *x,
const void                                *grid,
const void                                *beta,
cudnnTensorDescriptor_t                    yDesc,
void                                      *y)

This function performs a sampler operation and generates the output tensor using the grid given by the grid generator.

Note: Only 2d transformation is supported.

#### Parameters

handle

Input. Handle to a previously created cuDNN context.

stDesc

Input. Previously created spatial transformer descriptor object.

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the source value with prior value in the destination tensor as follows:
dstValue = alpha[0]*srcValue + beta[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

xDesc

Input. Handle to the previously initialized input tensor descriptor.

x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc.

grid

Input. A grid of coordinates generated by cudnnSpatialTfGridGeneratorForward().

yDesc

Input. Handle to the previously initialized output tensor descriptor.

y

Output. Data pointer to GPU memory associated with the output tensor descriptor yDesc.

#### Returns

CUDNN_STATUS_SUCCESS

The call was successful.

At least one of the following conditions are met:

• handle is NULL.
• One of the parameters x, y or grid is NULL.
CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration. See the following for some examples of non-supported configurations:

• The dimension of transformed tensor > 4.
CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 3.2.101. cudnnTransformFilter()

cudnnStatus_t cudnnTransformFilter(
cudnnHandle_t handle,
const cudnnTensorTransformDescriptor_t transDesc,
const void *alpha,
const cudnnFilterDescriptor_t srcDesc,
const void *srcData,
const void *beta,
const cudnnFilterDescriptor_t destDesc,
void *destData);

This function converts the filter between different formats, data types, or dimensions based on the described transformation. It can be used to convert a filter with an unsupported layout format to a filter with a supported layout format.

This function copies the scaled data from the input filter srcDesc to the output tensor destDesc with a different layout. If the filter descriptors of srcDesc and destDesc have different dimensions, they must be consistent with folding and padding amount and order specified in transDesc.

The srcDesc and destDesc tensors must not overlap in any way (that is, tensors cannot be transformed in place).
Note: When performing a folding transform or a zero-padding transform, the scaling factors (alpha, beta) should be set to (1, 0). However, unfolding transforms support any (alpha, beta) values. This function is thread safe.

#### Parameters

handle

Input. Handle to a previously created cuDNN context. For more information, refer to cudnnHandle_t.

transDesc

Input. A descriptor containing the details of the requested filter transformation. For more information, refer to cudnnTensorTransformDescriptor_t.

alpha, beta

Input. Pointers, in the host memory, to the scaling factors used to scale the data in the input tensor srcDesc. beta is used to scale the destination tensor, while alpha is used to scale the source tensor. For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

The beta scaling value is not honored in the folding and zero-padding cases. Unfolding supports any (alpha, beta).

srcDesc, destDesc

Input. Handles to the previously initiated filter descriptors. srcDesc and destDesc must not overlap. For more information, refer to cudnnTensorDescriptor_t.

srcData

Input. Pointers, in the host memory, to the data of the tensor described by srcDesc.

destData

Output. Pointers, in the host memory, to the data of the tensor described by destDesc.

#### Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

A parameter is uninitialized or initialized incorrectly, or the number of dimensions is different between srcDesc and destDesc.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration. Also, in the folding and padding paths, any value other than A=1 and B=0 will result in a CUDNN_STATUS_NOT_SUPPORTED.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 3.2.102. cudnnTransformTensor()

cudnnStatus_t cudnnTransformTensor(
cudnnHandle_t                  handle,
const void                    *alpha,
const cudnnTensorDescriptor_t  xDesc,
const void                    *x,
const void                    *beta,
const cudnnTensorDescriptor_t  yDesc,
void                          *y)

This function copies the scaled data from one tensor to another tensor with a different layout. Those descriptors need to have the same dimensions but not necessarily the same strides. The input and output tensors must not overlap in any way (meaning, tensors cannot be transformed in place). This function can be used to convert a tensor with an unsupported format to a supported one.

#### Parameters

handle

Input. Handle to a previously created cuDNN context.

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the source value with prior value in the destination tensor as follows:
dstValue = alpha[0]*srcValue + beta[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

xDesc

Input. Handle to a previously initialized tensor descriptor. For more information, refer to cudnnTensorDescriptor_t.

x

Input. Pointer to data of the tensor described by the xDesc descriptor.

yDesc

Input. Handle to a previously initialized tensor descriptor. For more information, refer to cudnnTensorDescriptor_t.

y

Output. Pointer to data of the tensor described by the yDesc descriptor.

#### Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

The dimensions n, c, h, w or the dataType of the two tensor descriptors are different.

CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 3.2.103. cudnnTransformTensorEx()

cudnnStatus_t cudnnTransformTensorEx(
cudnnHandle_t handle,
const cudnnTensorTransformDescriptor_t transDesc,
const void *alpha,
const cudnnTensorDescriptor_t srcDesc,
const void *srcData,
const void *beta,
const cudnnTensorDescriptor_t destDesc,
void *destData);


This function converts the tensor layouts between different formats. It can be used to convert a tensor with an unsupported layout format to a tensor with a supported layout format.

This function copies the scaled data from the input tensor srcDesc to the output tensor destDesc with a different layout. The tensor descriptors of srcDesc and destDesc should have the same dimensions but need not have the same strides.

The srcDesc and destDesc tensors must not overlap in any way (that is, tensors cannot be transformed in place).
Note: When performing a folding transform or a zero-padding transform, the scaling factors (alpha,beta) should be set to (1, 0). However, unfolding transforms support any (alpha,beta) values. This function is thread safe.

#### Parameters

handle

Input. Handle to a previously created cuDNN context. For more information, refer to cudnnHandle_t.

transDesc

Input. A descriptor containing the details of the requested tensor transformation. For more information, refer to cudnnTensorTransformDescriptor_t.

alpha, beta

Input. Pointers, in the host memory, to the scaling factors used to scale the data in the input tensor srcDesc. beta is used to scale the destination tensor, while alpha is used to scale the source tensor. For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

The beta scaling value is not honored in the folding and zero-padding cases. Unfolding supports any (alpha, beta).

srcDesc, destDesc

Input. Handles to the previously initiated tensor descriptors. srcDesc and destDesc must not overlap. For more information, refer to cudnnTensorDescriptor_t.

srcData

Input. Pointers, in the host memory, to the data of the tensor described by srcDesc.

destData

Output. Pointers, in the host memory, to the data of the tensor described by destDesc.

#### Returns

CUDNN_STATUS_SUCCESS

The function was launched successfully.

A parameter is uninitialized or initialized incorrectly, or the number of dimensions is different between srcDesc and destDesc.

CUDNN_STATUS_NOT_SUPPORTED

Function does not support the provided configuration. Also, in the folding and padding paths, any value other than A=1 and B=0 will result in a CUDNN_STATUS_NOT_SUPPORTED.

CUDNN_STATUS_EXECUTION_FAILED

Function failed to launch on the GPU.

## 4. cudnn_ops_train.so Library

### 4.1.1. cudnnActivationBackward()

cudnnStatus_t cudnnActivationBackward(
cudnnHandle_t                    handle,
cudnnActivationDescriptor_t      activationDesc,
const void                      *alpha,
const cudnnTensorDescriptor_t    yDesc,
const void                      *y,
const cudnnTensorDescriptor_t    dyDesc,
const void                      *dy,
const cudnnTensorDescriptor_t    xDesc,
const void                      *x,
const void                      *beta,
const cudnnTensorDescriptor_t    dxDesc,
void                            *dx)
This routine computes the gradient of a neuron activation function.
Note:
• In-place operation is allowed for this routine; meaning dy and dx pointers may be equal. However, this requires the corresponding tensor descriptors to be identical (particularly, the strides of the input and output must match for an in-place operation to be allowed).
• All tensor formats are supported for 4 and 5 dimensions, however, the best performance is obtained when the strides of yDesc and xDesc are equal and HW-packed. For more than 5 dimensions the tensors must have their spatial dimensions packed.

#### Parameters

handle

Input. Handle to a previously created cuDNN context. For more information, refer to cudnnHandle_t.

activationDesc

alpha, beta
Input. Pointers to scaling factors (in host memory) used to blend the computation result with prior value in the output layer as follows:
dstValue = alpha[0]*result + beta[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

yDesc

Input. Handle to the previously initialized input tensor descriptor. For more information, refer to cudnnTensorDescriptor_t.

y

Input. Data pointer to GPU memory associated with the tensor descriptor yDesc.

dyDesc

Input. Handle to the previously initialized input differential tensor descriptor.

dy

Input. Data pointer to GPU memory associated with the tensor descriptor dyDesc.

xDesc

Input. Handle to the previously initialized output tensor descriptor.

x

Input. Data pointer to GPU memory associated with the output tensor descriptor xDesc.

dxDesc

Input. Handle to the previously initialized output differential tensor descriptor.

dx

Output. Data pointer to GPU memory associated with the output tensor descriptor dxDesc.

#### Returns

CUDNN_STATUS_SUCCESS

The function launched successfully.

At least one of the following conditions are met:

• The strides nStride, cStride, hStride, wStride of the input differential tensor and output differential tensor differ and in-place operation is used.
CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration. See the following for some examples of non-supported configurations:

• The dimensions n, c, h, w of the input tensor and output tensor differ.
• The datatype of the input tensor and output tensor differs.
• The strides nStride, cStride, hStride, wStride of the input tensor and the input differential tensor differ.
• The strides nStride, cStride, hStride, wStride of the output tensor and the output differential tensor differ.
CUDNN_STATUS_EXECUTION_FAILED

The function failed to launch on the GPU.

### 4.1.2. cudnnBatchNormalizationBackward()

cudnnStatus_t cudnnBatchNormalizationBackward(
cudnnHandle_t                    handle,
cudnnBatchNormMode_t             mode,
const void                      *alphaParamDiff,
const void                      *betaParamDiff,
const cudnnTensorDescriptor_t    xDesc,
const void                      *x,
const cudnnTensorDescriptor_t    dyDesc,
const void                      *dy,
const cudnnTensorDescriptor_t    dxDesc,
void                            *dx,
const cudnnTensorDescriptor_t    bnScaleBiasDiffDesc,
const void                      *bnScale,
void                            *resultBnScaleDiff,
void                            *resultBnBiasDiff,
double                           epsilon,
const void                      *savedMean,
const void                      *savedInvVariance)
This function performs the backward batch normalization layer computation. This layer is based on the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe, C. Szegedy, 2015. .
Note:
• Only 4D and 5D tensors are supported.
• The epsilon value has to be the same during training, backpropagation, and inference.
• Higher performance can be obtained when HW-packed tensors are used for all of x, dy, dx.

For more information, refer to cudnnDeriveBNTensorDescriptor() for the secondary tensor descriptor generation for the parameters used in this function.

#### Parameters

handle

Input. Handle to a previously created cuDNN library descriptor. For more information, refer to cudnnHandle_t.

mode

Input. Mode of operation (spatial or per-activation). For more information, refer to cudnnBatchNormMode_t.

Inputs. Pointers to scaling factors (in host memory) used to blend the gradient output dx with a prior value in the destination tensor as follows:
dstValue = alphaDataDiff[0]*resultValue + betaDataDiff[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

*alphaParamDiff, *betaParamDiff
Inputs. Pointers to scaling factors (in host memory) used to blend the gradient outputs resultBnScaleDiff and resultBnBiasDiff with prior values in the destination tensor as follows:
dstValue = alphaParamDiff[0]*resultValue + betaParamDiff[0]*priorDstValue

xDesc, dxDesc, dyDesc

Inputs. Handles to the previously initialized tensor descriptors.

*x

Inputs. Data pointer to GPU memory associated with the tensor descriptor xDesc, for the layer’s x data.

*dy

Inputs. Data pointer to GPU memory associated with the tensor descriptor dyDesc, for the backpropagated differential dy input.

*dx

Inputs/Outputs. Data pointer to GPU memory associated with the tensor descriptor dxDesc, for the resulting differential output with respect to x.

bnScaleBiasDiffDesc
Input. Shared tensor descriptor for the following five tensors: bnScale, resultBnScaleDiff, resultBnBiasDiff, savedMean, savedInvVariance. The dimensions for this tensor descriptor are dependent on normalization mode. For more information, refer to cudnnDeriveBNTensorDescriptor().
Note: The data type of this tensor descriptor must be float for FP16 and FP32 input tensors, and double for FP64 input tensors.
*bnScale
Input. Pointer in the device memory for the batch normalization scale parameter (in the original paper the quantity scale is referred to as gamma).
Note: The bnBias parameter is not needed for this layer's computation.
resultBnScaleDiff, resultBnBiasDiff
Outputs. Pointers in device memory for the resulting scale and bias differentials computed by this routine. Note that these scale and bias gradients are weight gradients specific to this batch normalization operation, and by definition are not backpropagated.
epsilon

Input. Epsilon value used in batch normalization formula. Its value should be equal to or greater than the value defined for CUDNN_BN_MIN_EPSILON in cudnn.h. The same epsilon value should be used in forward and backward functions.

*savedMean, *savedInvVariance
Inputs. Optional cache parameters containing saved intermediate results that were computed during the forward pass. For this to work correctly, the layer's x and bnScale data have to remain unchanged until this backward function is called.
Note: Both these parameters can be NULL but only at the same time. It is recommended to use this cache since the memory overhead is relatively small.

#### Supported configurations

This function supports the following combinations of data types for various descriptors.
Table 14. Supported configurations
Data Type Configurations xDesc bnScaleBiasMeanVarDesc alpha, beta yDesc
PSEUDO_HALF_CONFIG CUDNN_DATA_HALF CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_HALF
FLOAT_CONFIG CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT
DOUBLE_CONFIG CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE
PSEUDO_BFLOAT16_CONFIG CUDNN_DATA_BFLOAT16 CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_BFLOAT16

#### Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

At least one of the following conditions are met:

• Any of the pointers alpha, beta, x, dy, dx, bnScale, resultBnScaleDiff, resultBnBiasDiff is NULL.
• The number of xDesc or yDesc or dxDesc tensor descriptor dimensions is not within the range of [4,5] (only 4D and 5D tensors are supported).
• bnScaleBiasDiffDesc dimensions are not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for per-activation mode.
• Exactly one of savedMean, savedInvVariance pointers is NULL.
• epsilon value is less than CUDNN_BN_MIN_EPSILON.
• Dimensions or data types mismatch for any pair of xDesc, dyDesc, dxDesc.

### 4.1.3. cudnnBatchNormalizationBackwardEx()

cudnnStatus_t cudnnBatchNormalizationBackwardEx (
cudnnHandle_t                       handle,
cudnnBatchNormMode_t                mode,
cudnnBatchNormOps_t                 bnOps,
const void                          *alphaParamDiff,
const void                          *betaParamDiff,
const cudnnTensorDescriptor_t       xDesc,
const void                          *xData,
const cudnnTensorDescriptor_t       yDesc,
const void                          *yData,
const cudnnTensorDescriptor_t       dyDesc,
const void                          *dyData,
const cudnnTensorDescriptor_t       dzDesc,
void                                *dzData,
const cudnnTensorDescriptor_t       dxDesc,
void                                *dxData,
const cudnnTensorDescriptor_t       dBnScaleBiasDesc,
const void                          *bnScaleData,
const void                          *bnBiasData,
void                                *dBnScaleData,
void                                *dBnBiasData,
double                              epsilon,
const void                          *savedMean,
const void                          *savedInvVariance,
const cudnnActivationDescriptor_t   activationDesc,
void                                *workspace,
size_t                              workSpaceSizeInBytes
void                                *reserveSpace
size_t                              reserveSpaceSizeInBytes);
This function is an extension of the cudnnBatchNormalizationBackward() for performing the backward batch normalization layer computation with a fast NHWC semi-persistent kernel. This API will trigger the new semi-persistent NHWC kernel when the following conditions are true:
• All tensors, namely, x, y, dz, dy, dx must be NHWC-fully packed, and must be of the type CUDNN_DATA_HALF.
• The input parameter mode must be set to CUDNN_BATCHNORM_SPATIAL_PERSISTENT.
• workspace is not NULL.
• Before cuDNN version 8.2.0, the tensor C dimension should always be a multiple of 4. After 8.2.0, the tensor C dimension should be a multiple of 4 only when bnOps is CUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION.
• workSpaceSizeInBytes is equal to or larger than the amount required by cudnnGetBatchNormalizationBackwardExWorkspaceSize().
• reserveSpaceSizeInBytes is equal to or larger than the amount required by cudnnGetBatchNormalizationTrainingExReserveSpaceSize().
• The content in reserveSpace stored by cudnnBatchNormalizationForwardTrainingEx() must be preserved.

If workspace is NULL and workSpaceSizeInBytes of zero is passed in, this API will function exactly like the non-extended function cudnnBatchNormalizationBackward.

This workspace is not required to be clean. Moreover, the workspace does not have to remain unchanged between the forward and backward pass, as it is not used for passing any information.

This extended function can accept a *workspace pointer to the GPU workspace, and workSpaceSizeInBytes, the size of the workspace, from the user.

The bnOps input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by element-wise addition and then activation.

Only 4D and 5D tensors are supported. The epsilon value has to be the same during the training, the backpropagation, and the inference.

When the tensor layout is NCHW, higher performance can be obtained when HW-packed tensors are used for x, dy, dx.

#### Parameters

handle

Input. Handle to a previously created cuDNN library descriptor. For more information, refer to cudnnHandle_t.

mode

Input. Mode of operation (spatial or per-activation). For more information, refer to cudnnBatchNormMode_t.

bnOps
Input. Mode of operation. Currently, CUDNN_BATCHNORM_OPS_BN_ACTIVATION and CUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION are only supported in the NHWC layout. For more information, refer to cudnnBatchNormOps_t. This input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by element-wise addition and then activation.
Inputs. Pointers to scaling factors (in host memory) used to blend the gradient output dx with a prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

*alphaParamDiff, *betaParamDiff
Inputs. Pointers to scaling factors (in host memory) used to blend the gradient outputs dBnScaleData and dBnBiasData with prior values in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

xDesc, *x, yDesc, *yData, dyDesc, *dyData

Inputs. Tensor descriptors and pointers in the device memory for the layer's x data, backpropagated gradient input dy, the original forward output y data. yDesc and yData are not needed if bnOps is set to CUDNN_BATCHNORM_OPS_BN, users may pass NULL. For more information, refer to cudnnTensorDescriptor_t.

dzDesc, dxDesc
Inputs. Tensor descriptors and pointers in the device memory for the computed gradient output dz, and dx. dzDesc is not needed when bnOps is CUDNN_BATCHNORM_OPS_BN or CUDNN_BATCHNORM_OPS_BN_ACTIVATION, users may pass NULL. For more information, refer to cudnnTensorDescriptor_t.
*dzData, *dxData
Outputs. Tensor descriptors and pointers in the device memory for the computed gradient output dz, and dx. *dzData is not needed when bnOps is CUDNN_BATCHNORM_OPS_BN or CUDNN_BATCHNORM_OPS_BN_ACTIVATION, users may pass NULL. For more information, refer to cudnnTensorDescriptor_t.
dBnScaleBiasDesc

Input. Shared tensor descriptor for the following six tensors: bnScaleData, bnBiasData, dBnScaleData, dBnBiasData, savedMean, and savedInvVariance. For more information, refer to cudnnDeriveBNTensorDescriptor().

The dimensions for this tensor descriptor are dependent on normalization mode.
Note: The data type of this tensor descriptor must be float for FP16 and FP32 input tensors and double for FP64 input tensors.
*bnScaleData

Input. Pointer in the device memory for the batch normalization scale parameter (in the original paper the quantity scale is referred to as gamma).

*bnBiasData
Input. Pointers in the device memory for the batch normalization bias parameter (in the original paper bias is referred to as beta). This parameter is used only when activation should be performed.
*dBnScaleData, *dBnBiasData
Outputs. Pointers in the device memory for the gradients of bnScaleData and bnBiasData, respectively.
epsilon

Input. Epsilon value used in batch normalization formula. Its value should be equal to or greater than the value defined for CUDNN_BN_MIN_EPSILON in cudnn.h. The same epsilon value should be used in forward and backward functions.

*savedMean, *savedInvVariance
Inputs. Optional cache parameters containing saved intermediate results computed during the forward pass. For this to work correctly, the layer's x and bnScaleData, bnBiasData data has to remain unchanged until this backward function is called. Note that both these parameters can be NULL but only at the same time. It is recommended to use this cache since the memory overhead is relatively small.
activationDesc
Input. Descriptor for the activation operation. When the bnOps input is set to either CUDNN_BATCHNORM_OPS_BN_ACTIVATION or CUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION then this activation is used, otherwise user may pass NULL.
workspace
Input. Pointer to the GPU workspace. If workspace is NULL and workSpaceSizeInBytes of zero is passed in, then this API will function exactly like the non-extended function cudnnBatchNormalizationBackward().
workSpaceSizeInBytes
Input. The size of the workspace. It must be large enough to trigger the fast NHWC semi-persistent kernel by this function.
*reserveSpace
Input. Pointer to the GPU workspace for the reserveSpace.
reserveSpaceSizeInBytes
Input. The size of the reserveSpace. It must be equal or larger than the amount required by cudnnGetBatchNormalizationTrainingExReserveSpaceSize().

#### Supported configurations

This function supports the following combinations of data types for various descriptors.
Table 15. Supported configurations
Data Type Configurations xDesc bnScaleBiasMeanVarDesc alpha, beta yDesc
PSEUDO_HALF_CONFIG CUDNN_DATA_HALF CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_HALF
FLOAT_CONFIG CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT
DOUBLE_CONFIG CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE
PSEUDO_BFLOAT16_CONFIG CUDNN_DATA_BFLOAT16 CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_BFLOAT16

#### Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

At least one of the following conditions are met:

• The number of xDesc or yDesc or dxDesc tensor descriptor dimensions is not within the range of [4,5] (only 4D and 5D tensors are supported).
• dBnScaleBiasDesc dimensions not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for per-activation mode.
• Exactly one of savedMean, savedInvVariance pointers is NULL.
• epsilon value is less than CUDNN_BN_MIN_EPSILON.
• Dimensions or data types mismatch for any pair of xDesc, dyDesc, dxDesc.

### 4.1.4. cudnnBatchNormalizationForwardTraining()

 cudnnStatus_t cudnnBatchNormalizationForwardTraining(
cudnnHandle_t                    handle,
cudnnBatchNormMode_t             mode,
const void                      *alpha,
const void                      *beta,
const cudnnTensorDescriptor_t    xDesc,
const void                      *x,
const cudnnTensorDescriptor_t    yDesc,
void                            *y,
const cudnnTensorDescriptor_t    bnScaleBiasMeanVarDesc,
const void                      *bnScale,
const void                      *bnBias,
double                           exponentialAverageFactor,
void                            *resultRunningMean,
void                            *resultRunningVariance,
double                           epsilon,
void                            *resultSaveMean,
void                            *resultSaveInvVariance)
This function performs the forward batch normalization layer computation for the training phase. This layer is based on the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe, C. Szegedy, 2015.
Note:
• Only 4D and 5D tensors are supported.
• The epsilon value has to be the same during training, backpropagation, and inference.
• For the inference phase, use cudnnBatchNormalizationForwardInference.
• Higher performance can be obtained when HW-packed tensors are used for both x and y.
Refer to cudnnDeriveBNTensorDescriptor() for the secondary tensor descriptor generation for the parameters used in this function.

#### Parameters

handle

Handle to a previously created cuDNN library descriptor. For more information, refer to cudnnHandle_t.

mode

Mode of operation (spatial or per-activation). For more information, refer to cudnnBatchNormMode_t.

alpha, beta
Inputs. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

xDesc, yDesc

Tensor descriptors and pointers in device memory for the layer's x and y data. For more information, refer to cudnnTensorDescriptor_t.

*x

Input. Data pointer to GPU memory associated with the tensor descriptor xDesc, for the layer’s x input data.

*y

Input. Data pointer to GPU memory associated with the tensor descriptor yDesc, for the youtput of the batch normalization layer.

bnScaleBiasMeanVarDesc

Shared tensor descriptor desc for the secondary tensor that was derived by cudnnDeriveBNTensorDescriptor(). The dimensions for this tensor descriptor are dependent on the normalization mode.

bnScale, bnBias
Inputs. Pointers in device memory for the batch normalization scale and bias parameters (in the original paper bias is referred to as beta and scale as gamma). Note that bnBias parameter can replace the previous layer's bias parameter for improved efficiency.
exponentialAverageFactor
Input. Factor used in the moving average computation as follows:
runningMean = runningMean*(1-factor) + newMean*factor
Use a factor=1/(1+n) at N-th call to the function to get Cumulative Moving Average (CMA) behavior such that:
CMA[n] = (x[1]+...+x[n])/n

This is proved below:

CMA[n+1] = (n*CMA[n]+x[n+1])/(n+1)
= ((n+1)*CMA[n]-CMA[n])/(n+1) + x[n+1]/(n+1)
= CMA[n]*(1-1/(n+1))+x[n+1]*1/(n+1)
= CMA[n]*(1-factor) + x(n+1)*factor

resultRunningMean, resultRunningVariance

Inputs/Outputs. Running mean and variance tensors (these have the same descriptor as the bias and scale). Both of these pointers can be NULL but only at the same time. The value stored in resultRunningVariance (or passed as an input in inference mode) is the sample variance and is the moving average of variance[x] where the variance is computed either over batch or spatial+batch dimensions depending on the mode. If these pointers are not NULL, the tensors should be initialized to some reasonable values or to 0.

epsilon

Input. Epsilon value used in the batch normalization formula. Its value should be equal to or greater than the value defined for CUDNN_BN_MIN_EPSILON in cudnn.h. The same epsilon value should be used in forward and backward functions.

resultSaveMean, resultSaveInvVariance
Outputs. Optional cache to save intermediate results computed during the forward pass. These buffers can be used to speed up the backward pass when supplied to the cudnnBatchNormalizationBackward() function. The intermediate results stored in resultSaveMean and resultSaveInvVariance buffers should not be used directly by the user. Depending on the batch normalization mode, the results stored in resultSaveInvVariance may vary. For the cache to work correctly, the input layer data must remain unchanged until the backward function is called. Note that both parameters can be NULL but only at the same time. In such a case, intermediate statistics will not be saved, and cudnnBatchNormalizationBackward() will have to re-compute them. It is recommended to use this cache as the memory overhead is relatively small because these tensors have a much lower product of dimensions than the data tensors.

#### Supported configurations

This function supports the following combinations of data types for various descriptors.
Table 16. Supported configurations
Data Type Configurations xDesc bnScaleBiasMeanVarDesc alpha, beta yDesc
PSEUDO_HALF_CONFIG CUDNN_DATA_HALF CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_HALF
FLOAT_CONFIG CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT
DOUBLE_CONFIG CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE
PSEUDO_BFLOAT16_CONFIG CUDNN_DATA_BFLOAT16 CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_BFLOAT16

#### Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.

At least one of the following conditions are met:

• One of the pointers alpha, beta, x, y, bnScale, bnBias is NULL.
• The number of xDesc or yDesc tensor descriptor dimensions is not within the range of [4,5] (only 4D and 5D tensors are supported).
• bnScaleBiasMeanVarDesc dimensions are not 1xCx1x1 for 4D and 1xCx1x1x1 for 5D for spatial, and are not 1xCxHxW for 4D and 1xCxDxHxW for 5D for per-activation mode.
• Exactly one of resultSaveMean, resultSaveInvVariance pointers are NULL.
• Exactly one of resultRunningMean, resultRunningInvVariance pointers are NULL.
• epsilon value is less than CUDNN_BN_MIN_EPSILON.
• Dimensions or data types mismatch for xDesc, yDesc.

### 4.1.5. cudnnBatchNormalizationForwardTrainingEx()

 cudnnStatus_t cudnnBatchNormalizationForwardTrainingEx(
cudnnHandle_t                       handle,
cudnnBatchNormMode_t                mode,
cudnnBatchNormOps_t                 bnOps,
const void                          *alpha,
const void                          *beta,
const cudnnTensorDescriptor_t       xDesc,
const void                          *xData,
const cudnnTensorDescriptor_t       zDesc,
const void                          *zData,
const cudnnTensorDescriptor_t       yDesc,
void                                *yData,
const cudnnTensorDescriptor_t       bnScaleBiasMeanVarDesc,
const void                          *bnScaleData,
const void                          *bnBiasData,
double                              exponentialAverageFactor,
void                                *resultRunningMeanData,
void                                *resultRunningVarianceData,
double                              epsilon,
void                                *saveMean,
void                                *saveInvVariance,
const cudnnActivationDescriptor_t   activationDesc,
void                                *workspace,
size_t                              workSpaceSizeInBytes
void                                *reserveSpace
size_t                              reserveSpaceSizeInBytes);

This function is an extension of the cudnnBatchNormalizationForwardTraining() for performing the forward batch normalization layer computation.

This API will trigger the new semi-persistent NHWC kernel when the following conditions are true:
• All tensors, namely, x, y, dz, dy, dx must be NHWC-fully packed and must be of the type CUDNN_DATA_HALF.
• The input parameter mode must be set to CUDNN_BATCHNORM_SPATIAL_PERSISTENT.
• workspace is not NULL.
• Before cuDNN version 8.2.0, the tensor C dimension should always be a multiple of 4. After 8.2.0, the tensor C dimension should be a multiple of 4 only when bnOps is CUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION.
• workSpaceSizeInBytes is equal to or larger than the amount required by cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize().
• reserveSpaceSizeInBytes is equal to or larger than the amount required by cudnnGetBatchNormalizationTrainingExReserveSpaceSize().
• The content in reserveSpace stored by cudnnBatchNormalizationForwardTrainingEx() must be preserved.

If workspace is NULL and workSpaceSizeInBytes of zero is passed in, this API will function exactly like the non-extended function cudnnBatchNormalizationForwardTraining().

This workspace is not required to be clean. Moreover, the workspace does not have to remain unchanged between the forward and backward pass, as it is not used for passing any information.

This extended function can accept a *workspace pointer to the GPU workspace, and workSpaceSizeInBytes, the size of the workspace, from the user.

The bnOps input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by element-wise addition and then activation.

Only 4D and 5D tensors are supported. The epsilon value has to be the same during the training, the backpropagation, and the inference.

When the tensor layout is NCHW, higher performance can be obtained when HW-packed tensors are used for x, dy, dx.

#### Parameters

handle

Handle to a previously created cuDNN library descriptor. For more information, refer to cudnnHandle_t.

mode

Mode of operation (spatial or per-activation). For more information, refer to cudnnBatchNormMode_t.

bnOps
Input. Mode of operation for the fast NHWC kernel. For more information, refer to cudnnBatchNormOps_t. This input can be used to set this function to perform either only the batch normalization, or batch normalization followed by activation, or batch normalization followed by element-wise addition and then activation.
*alpha, *beta
Inputs. Pointers to scaling factors (in host memory) used to blend the layer output value with prior value in the destination tensor as follows:
dstValue = alpha[0]*resultValue + beta[0]*priorDstValue

For more information, refer to Scaling Parameters in the cuDNN Developer Guide.

xDesc, *xData, zDesc, *zData, yDesc, *yData

Tensor descriptors and pointers in device memory for the layer's input x and output y, and for the optional z tensor input for residual addition to the result of the batch normalization operation, prior to the activation. The optional zDes and *zData descriptors are only used when bnOps is CUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION, otherwise users may pass NULL. When in use, z should have exactly the same dimension as x and the final output y. For more information, refer to cudnnTensorDescriptor_t.

bnScaleBiasMeanVarDesc

Shared tensor descriptor desc for the secondary tensor that was derived by cudnnDeriveBNTensorDescriptor(). The dimensions for this tensor descriptor are dependent on the normalization mode.

*bnScaleData, *bnBiasData
Inputs. Pointers in device memory for the batch normalization scale and bias parameters (in the original paper, bias is referred to as beta and scale as gamma). Note that bnBiasData parameter can replace the previous layer's bias parameter for improved efficiency.
exponentialAverageFactor
Input. Factor used in the moving average computation as follows:
runningMean = runningMean*(1-factor) + newMean*factor
Use a factor=1/(1+n) at N-th call to the function to get Cumulative Moving Average (CMA) behavior such that:
CMA[n] = (x[1]+...+x[n])/n

This is proved below:

Writing
CMA[n+1] = (n*CMA[n]+x[n+1])/(n+1)
= ((n+1)*CMA[n]-CMA[n])/(n+1) + x[n+1]/(n+1)
= CMA[n]*(1-1/(n+1))+x[n+1]*1/(n+1)
= CMA[n]*(1-factor) + x(n+1)*factor

*resultRunningMeanData, *resultRunningVarianceData

Inputs/Outputs. Pointers to the running mean and running variance data. Both these pointers can be NULL but only at the same time. The value stored in resultRunningVarianceData (or passed as an input in inference mode) is the sample variance and is the moving average of variance[x] where the variance is computed either over batch or spatial+batch dimensions depending on the mode. If these pointers are not NULL, the tensors should be initialized to some reasonable values or to 0.

epsilon

Input. Epsilon value used in the batch normalization formula. Its value should be equal to or greater than the value defined for CUDNN_BN_MIN_EPSILON in cudnn.h. The same epsilon value should be used in forward and backward functions.

*saveMean, *saveInvVariance
Outputs. Optional cache parameters containing saved intermediate results computed during the forward pass. For this to work correctly, the layer's x and bnScaleData, bnBiasData data has to remain unchanged until this backward function is called. Note that both these parameters can be NULL but only at the same time. It is recommended to use this cache since the memory overhead is relatively small.
activationDesc
Input. The tensor descriptor for the activation operation. When the bnOps input is set to either CUDNN_BATCHNORM_OPS_BN_ACTIVATION or CUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION then this activation is used, otherwise user may pass NULL.
*workspace, workSpaceSizeInBytes
Inputs. *workspace is a pointer to the GPU workspace, and workSpaceSizeInBytes is the size of the workspace. When *workspace is not NULL and *workSpaceSizeInBytes is large enough, and the tensor layout is NHWC and the data type configuration is supported, then this function will trigger a new semi-persistent NHWC kernel for batch normalization. The workspace is not required to be clean. Also, the workspace does not need to remain unchanged between the forward and backward passes.
*reserveSpace
Input. Pointer to the GPU workspace for the reserveSpace.
reserveSpaceSizeInBytes
Input. The size of the reserveSpace. Must be equal or larger than the amount required by cudnnGetBatchNormalizationTrainingExReserveSpaceSize().

#### Supported configurations

This function supports the following combinations of data types for various descriptors.
Table 17. Supported configurations
Data Type Configurations xDesc bnScaleBiasMeanVarDesc alpha, beta yDesc
PSEUDO_HALF_CONFIG CUDNN_DATA_HALF CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_HALF
FLOAT_CONFIG CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT
DOUBLE_CONFIG CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE CUDNN_DATA_DOUBLE
PSEUDO_BFLOAT16_CONFIG CUDNN_DATA_BFLOAT16 CUDNN_DATA_FLOAT CUDNN_DATA_FLOAT CUDNN_DATA_BFLOAT16

#### Returns

CUDNN_STATUS_SUCCESS

The computation was performed successfully.

CUDNN_STATUS_NOT_SUPPORTED

The function does not support the provided configuration.