cuDNN Release 8.x.x

cuDNN Release 8.0.5

These are the cuDNN 8.0.5 release notes. This release includes fixes from the previous cuDNN v8.0.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • RNN now supports zero-length sequences within the batch when the RNN data layout is CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED or CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED. For more information, see cudnnSetRNNDataDescriptor(). (See the first sketch after this list.)
  • Users can now set the environment variable CUDNN_CONV_WSCAP_DBG to a value in MiB to limit the workspace size returned by cudnnGetConvolutionForwardWorkspaceSize(), cudnnGetConvolutionBackwardDataWorkspaceSize(), and cudnnGetConvolutionBackwardFilterWorkspaceSize(). Limiting the workspace might result in performance loss. (See the second sketch after this list.)
  • Significant performance improvements were made for RTX 3090 for many models on many configurations.
  • Performance improvements were made:
    • For EfficientNet when run using NHWC FP16 Tensor Core configurations on V100 and A100 GPU architectures.
    • For PilotNet, AH-Net, MobileNet V3 on V100 and A100 GPU architectures.
    • For various 3-D convolution cases on RTX 8000.
  • Support for the 3D NDHWC layout was added in cudnnConvolutionBackwardFilter().
  • Added instructions for installing cuDNN using the Package Manager for Ubuntu and RHEL users. For step-by-step instructions, see Package Manager Installation in the cuDNN Installation Guide.
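
The zero-length sequence support above can be illustrated with a short sketch. This is a hedged example, not taken from the release: the sizes (maxSeqLength=5, batchSize=3, vectorSize=16) and the helper name make_rnn_data_desc are illustrative only.

    #include <cudnn.h>

    /* Sketch: build an unpacked RNN data descriptor for a batch of three
       sequences in which the second sequence has length zero. All sizes
       here are illustrative, not prescribed by cuDNN. */
    cudnnStatus_t make_rnn_data_desc(cudnnRNNDataDescriptor_t *dataDesc)
    {
        cudnnStatus_t st = cudnnCreateRNNDataDescriptor(dataDesc);
        if (st != CUDNN_STATUS_SUCCESS) return st;

        int seqLengthArray[3] = {5, 0, 3};  /* zero-length sequence in the batch */
        float paddingFill = 0.0f;           /* value written to padded positions */

        return cudnnSetRNNDataDescriptor(*dataDesc,
                                         CUDNN_DATA_FLOAT,
                                         CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED,
                                         /*maxSeqLength=*/5,
                                         /*batchSize=*/3,
                                         /*vectorSize=*/16,
                                         seqLengthArray,
                                         &paddingFill);
    }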
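
The workspace cap described above is driven by an environment variable, so the simplest form is export CUDNN_CONV_WSCAP_DBG=1024 in the shell before launching the application. A minimal in-process sketch, assuming a POSIX system (setenv) and an arbitrary 1024 MiB cap:

    #include <stdlib.h>
    #include <cudnn.h>

    /* Sketch: cap the workspace reported by the cudnnGetConvolution*WorkspaceSize
       queries at 1024 MiB (the value is an arbitrary example). The variable must
       be set before cuDNN reads it. */
    int main(void)
    {
        setenv("CUDNN_CONV_WSCAP_DBG", "1024", /*overwrite=*/1);

        cudnnHandle_t handle;
        cudnnCreate(&handle);
        /* ... workspace size queries issued here observe the cap ... */
        cudnnDestroy(&handle);
        return 0;
    }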

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.x.x.

Fixed Issues

The following issues have been fixed in this release:
  • cudnnBackendFinalize(descriptor), where descriptor is of type CUDNN_BACKEND_ENGINE_DESCRIPTOR() or CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR(), might result in a hang if the operation graph contains a backward filter operation and the user links against libcudnn.so (cudnn64.dll on Windows). This issue has been fixed in this release.
  • A call to cudnnConvolutionBiasActivationForward() could result in a memory leak in release 8.0.1. This issue has been fixed in this release.
  • The performance regression on the U-Net Industrial network on Volta for certain batch sizes has been fixed.
  • cudnnRNN*() with LSTM mode could produce incorrect results on the cy outputs on all GPUs when clipping was enabled. This issue existed in previous cuDNN releases since version 7.2.1 and has been fixed in this release.
  • cudnnRNNForward*() with LSTM mode could produce incorrect results when clipping was enabled and CUDNN_RNN_ALGO_PERSIST_STATIC was used. This issue existed in previous cuDNN releases since version 7.2.1 and has been fixed in this release.
  • In previous cuDNN versions, cudnnRNNBackwardData() or cudnnRNNBackwardDataEx() may produce non-deterministic outputs when running configurations such as hiddenSize=128 or less, LSTM cell type, and FP32 with CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION. This issue has been fixed in this release.
  • Compared to cuDNN 7.6, there was a known ~6% performance regression on Inception V3 and ResNet-50 models when run using NHWC FP16 configurations on various Turing and Titan V architectures. This issue has been fixed in this release.
  • Compared to cuDNN v8.0.3, there was a known ~18% performance regression on the U-Net Industrial model when run using NCHW TF32 configurations on V100 and A100 GPU architectures. This issue has been fixed in this release.
  • Updated: November 25, 2020

    When calling cudnnConvolutionBiasActivationForward() with INT8x4 or INT8x32 I/O tensors, the call could return CUDNN_STATUS_BAD_PARAM in 8.0.4. This issue has been fixed in this release.

Known Issues

  • When using cudnnRNN* APIs with problem sizes (input size, hidden size) that are not multiples of 16 for FP16 tensors or multiples of 8 for FP32 tensors, users encountered a return status of CUDNN_STATUS_EXECUTION_FAILED in cuDNN built against CUDA 11.0. This issue has been fixed in cuDNN built against CUDA 11.1.
  • The ResNet-50 native FP32 inference issues have been fixed on Volta and Turing. A few performance regressions remain on the NVIDIA Ampere GPU architecture.
  • cudnnAddTensor() does not support all broadcastable tensor shapes even though the cuDNN documentation says otherwise.
  • Users have reported that in RNN training with a non-zero dropout rate, the output of cudnnRNNBackwardWeights() may be non-deterministic when the RNN network is unidirectional. We are still investigating this issue.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the output dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Compared to cuDNN 8.0.0 Preview, there is a known ~12% performance regression on VGG16 when run on Nano and TX2.
  • Compared to cuDNN 8.0.4, there is a known ~6% performance regression on ONNX-WaveGlow when run on Titan RTX.
  • Compared to cuDNN 7.6, there is a significant performance regression on Darknet when run on Nano.

Limitations

  • Samples must be installed in a writable location, otherwise the samples can crash.
  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section of the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by the two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a buffer of the larger size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. Users can eliminate the non-deterministic behavior of the cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2. (See the first sketch after this list.)

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory, while the second creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at least 4-byte alignment, including for INT8 data, in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run performantly. As always, cuDNN recommends aligning tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
  • On K80 GPUs, when cudnnConvolutionForward() is used with the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types, a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory for the temporary buffers. Further, it will introduce an additional round trip to memory, which might noticeably impact performance. (See the second sketch after this list.)
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based load instructions for performance improvements, particularly on older hardware architectures. Users could opt out of textures using the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who wish to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 (2^29).
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph capture, users should call the sub-library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub-library before starting graph capture. (See the third sketch after this list.)
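
First sketch (referenced from the RNN determinism limitation above): forcing a single cuBLAS buffer size from application code. This is a hedged example; the equivalent shell form is export CUBLAS_WORKSPACE_CONFIG=:4096:2 before launch, and POSIX setenv() is assumed.

    #include <stdlib.h>
    #include <cudnn.h>

    /* Sketch: pin cuBLAS to one buffer size (two 4 MB buffers) so cuDNN RNN
       and multi-head attention calls behave deterministically in multi-stream
       setups. The variable must be set before cuBLAS initializes. */
    int main(void)
    {
        setenv("CUBLAS_WORKSPACE_CONFIG", ":4096:2", /*overwrite=*/1);

        cudnnHandle_t handle;
        cudnnCreate(&handle);
        /* ... RNN / multi-head attention work goes here ... */
        cudnnDestroy(&handle);
        return 0;
    }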
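
Second sketch (referenced from the integer-type limitation above): the convert/run/convert-back pattern via cudnnTransformTensor(). This is a hedged fragment; the descriptors and device buffers (int8Desc, floatDesc, d_int8, d_float) are assumed to be created elsewhere with identical dimensions, and the helper name is illustrative.

    #include <cudnn.h>

    /* Sketch: step 1 of the pattern, widening INT8 data to FLOAT. Step 2
       (run the desired API on the FLOAT tensors) and step 3 (transform the
       results back) follow the same call shape and are omitted. */
    cudnnStatus_t widen_int8_to_float(cudnnHandle_t handle,
                                      cudnnTensorDescriptor_t int8Desc,
                                      const void *d_int8,
                                      cudnnTensorDescriptor_t floatDesc,
                                      void *d_float)
    {
        const float alpha = 1.0f, beta = 0.0f;  /* y = alpha * x + beta * y */
        return cudnnTransformTensor(handle, &alpha, int8Desc, d_int8,
                                    &beta, floatDesc, d_float);
    }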
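
Third sketch (referenced from the graph-capture limitation above): preloading a sub-library's kernels through its version-check entry point before capture begins. Hedged example; the inference calls inside the capture region are elided.

    #include <cuda_runtime.h>
    #include <cudnn.h>

    /* Sketch: call the sub-library version check so its kernels are loaded
       eagerly, then open the CUDA graph capture region. */
    cudnnStatus_t capture_inference(cudaStream_t stream, cudaGraph_t *graph)
    {
        cudnnStatus_t st = cudnnOpsInferVersionCheck();
        if (st != CUDNN_STATUS_SUCCESS) return st;

        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        /* ... enqueue cuDNN inference calls on `stream` ... */
        cudaStreamEndCapture(stream, graph);
        return CUDNN_STATUS_SUCCESS;
    }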

cuDNN Release 8.0.4

These are the cuDNN 8.0.4 release notes. This release includes fixes from the previous cuDNN v8.0.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

The following features and enhancements have been added to this release:
GA102 support with improved convolution performance
Now includes convolution heuristics targeting the NVIDIA GA102 GPU. (not applicable for Jetson platforms)
RNN API v8 sample
The new RNN sample illustrating the usage of the new RNN version 8 API has been added. The sample's workflow consists of several routines that create RNN descriptors, create RNN data descriptors, set up the weight space, and run the compute routines. The sample takes several input parameters which can set up different RNN configurations and input data specifications (data type, cell mode, bias mode, etc.).
RNN functional and performance improvements
ARM Server Base System Architecture (SBSA)
Added support for ARM SBSA for Linux.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.x.x.

Limitations

  • Samples must be installed in a writable location, otherwise the samples can crash.
  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN 8.0.4 library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section of the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by the two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a buffer of the larger size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. Users can eliminate the non-deterministic behavior of the cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory, while the second creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at least 4-byte alignment, including for INT8 data, in the cuDNN library.
  • Some computational options in cuDNN 8.0.4 now require increased alignment on tensors in order to run performantly. As always, cuDNN recommends aligning tensors to 128-bit boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.4 compared to cuDNN v7.6.
  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.4 users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
  • On K80 GPUs, when cudnnConvolutionForward() is used with the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types, a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory which might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based load instructions for performance improvements, particularly on older hardware architectures. Users could opt out of textures using the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who wish to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 (2^29).

Deprecated Features

The following features are deprecated in cuDNN 8.0.4:
  • Support for Ubuntu 18.04 ppc64le builds will be dropped post cuDNN 8.0.4.

Fixed Issues

  • cudnnConvolutionBackwardFilter() and cudnnGetConvolutionBackwardFilterWorkspaceSize() could result in a segmentation fault in multi-threaded usage due to a race condition. This issue has been fixed in this release.
  • The libfreeimage.a library in the RHEL 8 ppc64le RPM package was for the wrong architecture. This issue has been fixed in this release.
  • In previous cuDNN versions, cudnnRNNBackwardData() or cudnnRNNBackwardDataEx() may return CUDNN_STATUS_INTERNAL_ERROR, NaN-s, or non-deterministic finite values when CUDNN_RNN_ALGO_PERSIST_STATIC was selected. These issues occurred mainly on smaller GPUs, such as Turing with 30 or 36 SMs and smaller hiddenSize values. Most of those issues have been fixed in this release. However, configurations such as hiddenSize=128, LSTM, FP32 with CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION may still output non-deterministic results.
  • There was an issue when upgrading cuDNN using the RPM and Debian packages in version 8.0.3. This issue has been fixed in this release.
  • The ResNet-50 native FP32 inference issues have been fixed on Volta and Turing. A few performance regressions remain on the NVIDIA Ampere GPU architecture.
  • cuDNN exhibited performance regressions for GoogLeNet and U-Net on V100. This issue has been fixed in this release.
  • cuDNN exhibited performance regressions for VGG16 on GA100. This issue has been fixed in this release.
  • The performance regressions seen in Tacotron2 and WaveGlow on the Turing architecture have been fixed.
  • The performance regressions in the FastPitch network seen on the Volta and Turing architectures have been fixed.
  • The cuDNN API unconditionally triggered CUDA context initialization, causing unnecessary host-side performance overhead. This issue was introduced in cuDNN version 8.0.2 and has been fixed in this release.
  • Some ResNet-50 and SSD mixed precision inference use cases had performance regressions compared to cuDNN 7.6 on V100, and V-Net 3D models had performance regressions on Turing-based architectures. These issues have been fixed in this release.
  • Previous cuDNN 8 releases exhibited performance regressions when compared to version 7.6, for some important convolutional networks on the Pascal GPU architecture. In particular, the performance regressions of ResNet-50 seen previously on Pascal with cuDNN versions 8.0.3 and earlier, are fixed with this release.
  • cudnnConvolutionBiasActivationForward() could result in incorrect results when the alpha2 value is zero and the device buffer zData contains NaN. This issue has been fixed in this release.
  • When using cudnnRNN*Ex() APIs, if the layout of RNN data was CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED or CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED, and the batch size was larger than 6144 on Volta or NVIDIA Ampere A100 GPUs, or larger than 4096 on Turing GPUs, CUDNN_STATUS_EXECUTION_FAILED would be returned. This issue has been fixed in this release; cuDNN now supports arbitrary batch sizes.
  • When upgrading from cuDNN 8.0.2 to 8.0.3 through the Debian or RPM package, users had to manually uninstall the old libcudnn8-doc package before installing libcudnn8-samples_*.deb/rpm; otherwise, a file conflict could happen. This has been fixed and is no longer the case in the 8.0.4 release.
  • Performance regressions on Turing, Volta and Pascal architectures for True Half convolutions have been resolved.
  • When using cudnnRNN* APIs with problem sizes (input size, hidden size) that are not multiples of 16 for FP16 tensors or multiples of 8 for FP32 tensors, users encountered a return status of CUDNN_STATUS_EXECUTION_FAILED in cuDNN built against CUDA 11.0. This issue has been fixed with cuDNN built against CUDA 11.1.

Known Issues

  • When using cudnnRNN* APIs with problem sizes (input size, hidden size) that are not multiples of 16 for FP16 tensors or multiples of 8 for FP32 tensors, users may encounter a return status of CUDNN_STATUS_EXECUTION_FAILED. This issue affects the earlier cuDNN 8.0.1 Preview and cuDNN 8.0.2 releases built against CUDA 11.0.
  • There is a known minor performance regression on small batch sizes for ResNet-50 native FP32 inference that exists on the NVIDIA Ampere GPU architecture.

cuDNN Release 8.0.3

These are the cuDNN 8.0.3 release notes. This release includes fixes from the previous cuDNN v8.0.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

cuDNN Backend API
Documentation for the cuDNN Backend API has been included in this release. Users specify the computational case, set up an execution plan for it, and execute the computation via numerous descriptors. The typical use pattern for a descriptor with attributes consists of the following sequence of API calls:
  1. cudnnBackendCreateDescriptor() creates a descriptor of a specified type.
  2. cudnnBackendSetAttribute() sets the values of a settable attribute for the descriptor. All required attributes must be set before the next step.
  3. cudnnBackendFinalize() finalizes the descriptor.
  4. cudnnBackendGetAttribute() gets the values of an attribute from a finalized descriptor.

For more information, refer to the cuDNN Backend API section in the cuDNN API Reference.
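
A hedged sketch of that four-step pattern, applied to a backend tensor descriptor (the dimensions, strides, UID, alignment, and helper name below are illustrative values, not requirements; error checks on the set calls are elided for brevity):

    #include <cudnn.h>

    /* Sketch: create a backend tensor descriptor, set its required
       attributes, then finalize it for use in an operation graph. */
    cudnnStatus_t build_tensor(cudnnBackendDescriptor_t *tensor)
    {
        cudnnStatus_t st =
            cudnnBackendCreateDescriptor(CUDNN_BACKEND_TENSOR_DESCRIPTOR, tensor);
        if (st != CUDNN_STATUS_SUCCESS) return st;

        cudnnDataType_t dtype = CUDNN_DATA_FLOAT;
        int64_t dims[4]    = {1, 32, 16, 16};     /* NCHW */
        int64_t strides[4] = {8192, 256, 16, 1};  /* packed NCHW strides */
        int64_t uid = 'x';                        /* unique tensor id */
        int64_t alignment = 16;                   /* byte alignment */

        cudnnBackendSetAttribute(*tensor, CUDNN_ATTR_TENSOR_DATA_TYPE,
                                 CUDNN_TYPE_DATA_TYPE, 1, &dtype);
        cudnnBackendSetAttribute(*tensor, CUDNN_ATTR_TENSOR_DIMENSIONS,
                                 CUDNN_TYPE_INT64, 4, dims);
        cudnnBackendSetAttribute(*tensor, CUDNN_ATTR_TENSOR_STRIDES,
                                 CUDNN_TYPE_INT64, 4, strides);
        cudnnBackendSetAttribute(*tensor, CUDNN_ATTR_TENSOR_UNIQUE_ID,
                                 CUDNN_TYPE_INT64, 1, &uid);
        cudnnBackendSetAttribute(*tensor, CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT,
                                 CUDNN_TYPE_INT64, 1, &alignment);

        return cudnnBackendFinalize(*tensor);  /* all required attributes set */
    }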

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.x.x.

Limitations

  • Samples must be installed in a writable location, otherwise the samples can crash.
  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN 8.0.3 library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section of the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by the two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a buffer of the larger size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. Users can eliminate the non-deterministic behavior of the cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory, while the second creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at least 4-byte alignment, including for INT8 data, in the cuDNN library.
  • Some computational options in cuDNN 8.0.3 now require increased alignment on tensors in order to run performantly. As always, cuDNN recommends aligning tensors to 128-bit boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.3 compared to cuDNN v7.6.
  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.3 users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
  • On K80 GPUs, when cudnnConvolutionForward() is used with the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types, a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory which might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based load instructions for performance improvements, particularly on older hardware architectures. Users could opt out of textures using the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who wish to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 (2^29).

Fixed Issues

  • For cudnnConvolutionBackwardFilter(), the 3D convolution documentation table entries for wDesc: _NCHW with _ALGO_1 and FFT_TILING had incorrect data fields. This has been fixed in this release.
  • In prior versions of cuDNN, cudnnPoolingForward() with pooling mode CUDNN_POOLING_MAX might return incorrect result when one of the spatial dimensions has negative padding and the output tensor is larger than the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim(). This issue has been fixed in this release.
  • In cudnnPoolingForward() with average pooling, when the output tensor data is of INT8 type, it was possible for some pixel results to be off by 1. Note that cudnnPoolingForward() rounds to the nearest even integer. This issue has been fixed in this release.
  • The performance of cudnnConvolutionBiasActivationForward() for INT8x4 use cases on Volta and Turing, INT8x32 use cases on Turing, and FP32 and pseudo-FP16 use cases on Volta, Turing, and the NVIDIA Ampere GPU architecture has been improved.
  • We have updated our public headers to fully reflect the documented dependencies between the six sub-libraries.
  • There were libcudnn_ops/cnn/adv_infer/train_static.a binaries in the cuDNN Debian and tgz packages. Users were advised not to link against those and link against libcudnn_static.a instead. Those binaries have been removed from the release packages.
  • On Volta and Pascal architectures, performance regressions were present for various TRUE_HALF convolutions. This has been fixed in this release.
  • In prior versions of cuDNN, the cudnnGetConvolution*Algorithm_v7() functions returned a workspace size for algo1 that was inconsistent with the result of the corresponding cudnnGet*WorkspaceSize() calls when the math type of the convolution descriptor was set to CUDNN_FMA_MATH. This issue has been fixed in this release.
  • The new RNN APIs: cudnnRNNForward(), cudnnRNNBackwardData_v8(), and cudnnRNNBackwardWeights_v8() were available as a preview in the cuDNN 8.0.2 release. They no longer hold preview status.
  • When using cudnnRNN*Ex() APIs, if the user planned to use CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED or CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED as the layout of the RNN data descriptors, the user had to call cudnnSetRNNPaddingMode() to set the mode to CUDNN_RNN_PADDED_IO_ENABLED after initializing an RNN descriptor but before calling cudnnGetRNNWorkspaceSize(). Not doing this would result in CUDNN_STATUS_EXECUTION_FAILED. We have added internal checks that return CUDNN_STATUS_BAD_PARAM to prevent hitting EXECUTION_FAILED. (A sketch of the required call order follows this list.)
  • When cudnnBatchNormalizationForwardTrainingEx() is called with NHWC tensors with pseudo-half configuration, under rare occasions the kernel would produce incorrect results, including possible NaNs in the results. This has been fixed in this release. This issue affects earlier releases since 7.4.1.
  • Fused convolution-scale-bias-activation with per-channel α1 and α2 scaling gave incorrect results when the reorder type in the convolution descriptor was set to CUDNN_NO_REORDER. This was an issue in cuDNN version 8.0.2 and has been fixed in this release.
  • On NVIDIA Ampere GA100, cudnnConvolutionBackwardData() for Tensor Core enabled problems with half input and output could, in rare cases, produce incorrect results; the same could happen for users of cudnnBackendExecute() using the engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX 57 for backward data. This has been fixed in this release. (not applicable for Jetson platforms)
  • There was a performance regression in MaskRCNN inference with automatic mixed precision on V100. This has been fixed in this release.
  • Two-dimensional forward convolutions using algo1 could segfault when the filter size is large. For example, we have observed this issue when the filter width and height are greater than or equal to 363. This has been fixed in this release.
  • For some 3D spatial non-Tensor-Core convolutions on Maxwell, Pascal, Volta, and Turing architectures, cudnnBackwardFilter() can return incorrect results when the convolution width padding exceeds the value (filterWidth - 1)/2. Likewise, users of cudnnBackendExecute() can experience the same issue when using the engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX 32 for backward filter. The issue affecting cudnnBackwardFilter() has been fixed in this release. With cudnnBackendFinalize(), an engine descriptor with CUDNN_ATTR_ENGINE_GLOBAL_INDEX 32 and a backward filter operation that satisfies the above condition will return CUDNN_STATUS_NOT_SUPPORTED.
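
Sketch of the required call order noted above for padded unpacked RNN layouts. This is a hedged example: rnnDesc and xDesc are assumed to have been initialized elsewhere, and the helper name is illustrative.

    #include <cudnn.h>

    /* Sketch: enable padded I/O on an already-initialized RNN descriptor
       before querying the workspace size. */
    cudnnStatus_t query_padded_workspace(cudnnHandle_t handle,
                                         cudnnRNNDescriptor_t rnnDesc,
                                         int seqLength,
                                         const cudnnTensorDescriptor_t *xDesc,
                                         size_t *sizeInBytes)
    {
        cudnnStatus_t st =
            cudnnSetRNNPaddingMode(rnnDesc, CUDNN_RNN_PADDED_IO_ENABLED);
        if (st != CUDNN_STATUS_SUCCESS) return st;
        return cudnnGetRNNWorkspaceSize(handle, rnnDesc, seqLength,
                                        xDesc, sizeInBytes);
    }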

Known Issues

  • Occasionally, inaccurate results were observed in outputs of the cudnnRNNBackwardWeights() and cudnnRNNBackwardWeightsEx() functions when the RNN cell type was GRU and the NVIDIA Ampere GPU architecture was used with FP32 I/O and mathType of CUDNN_DEFAULT_MATH or CUDNN_TENSOR_OP_MATH. Users may switch to CUDNN_FMA_MATH as a temporary workaround. This issue is being investigated.

  • cudnnRNN*() with LSTM mode may produce inaccurate results on the cy outputs when clipping is enabled on all GPUs. This issue exists in previous cuDNN releases as well.

  • On Volta and Pascal architectures, performance regressions may be present for various TRUE_HALF convolutions.

  • When using cudnnRNN* APIs with problem sizes (input size, hidden size) that are not multiples of 16 for FP16 tensors or multiples of 8 for FP32 tensors, users may encounter a return status of CUDNN_STATUS_EXECUTION_FAILED. This issue also affects the earlier cuDNN 8.0.1 Preview and cuDNN 8.0.2 releases.
  • Some ResNet-50 and SSD mixed precision inference use-cases may have performance regressions compared to cuDNN 7.6 on V100. V-Net 3D models might have performance regressions on Turing based architectures.

  • When using cudnnRNN*Ex() APIs, if the user used CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED or CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED as the layout of the RNN data descriptors, and if the batch size is larger than 6144 on Volta or NVIDIA Ampere A100 GPUs, or larger than 4096 on Turing GPUs, CUDNN_STATUS_EXECUTION_FAILED would be returned.

  • Documentation of the Backend API is not complete. The CUDNN_BACKEND_OPERATION_GEN_STATS_DESCRIPTOR and CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR descriptor types will be documented in a future release.

  • The conv_sample_v8.0 sample is not included in the Debian and RPM packages. This will be fixed in a future release.

  • The libfreeimage.a library in the RHEL 8 ppc64le RPM is for the wrong architecture. This will be fixed in a future release.

  • When the user is upgrading from cuDNN 8.0.2 to 8.0.3 through the Debian or RPM package, before installing libcudnn8-samples_*.deb/rpm, users should manually uninstall the old libcudnn8-doc package, otherwise a file conflict may happen.

cuDNN Release 8.0.2

These are the cuDNN 8.0.2 release notes for the first GA release of cuDNN 8.x. This release includes fixes from the previous cuDNN v8.0.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

cuDNN 8.0.1 Preview and 8.0.0 Preview

The key features mentioned in cuDNN 8.0.1 Preview and 8.0.0 Preview are now GA quality in this release.

Added new API functions to the documentation

cudnnRNNBackwardData_v8() and cudnnRNNBackwardWeights_v8() are now documented in the cudnn_adv_train.so Library. For a list of functions and data types that were added in this release, see API Changes For cuDNN 8.0.2.

TF32 performance
  • TF32 performance for 3D convolutions and deconvolutions is significantly better, up to 3.9x, compared to cuDNN 8.0.1.
  • TF32 performance for grouped convolutions on A100 improved by up to 1.5x compared to cuDNN 8.0.1 on ResNeXt convolution layers, and by up to 3x compared to V100 with cuDNN v7.6. (not applicable for Jetson platforms)

The above performance improvements were measured using only cuDNN operations. The observed performance improvements will depend on a number of factors, such as non-cuDNN operations, kernel run time, and model architecture type.
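
For readers who want to control whether a convolution may use TF32 from code, the following is a hedged sketch using the convolution math type: in cuDNN 8 on NVIDIA Ampere GPUs, CUDNN_DEFAULT_MATH permits TF32 Tensor Core use, while CUDNN_FMA_MATH restricts the computation to FP32 FMA instructions. The helper name is illustrative.

    #include <cudnn.h>

    /* Sketch: opt a convolution descriptor in or out of TF32 Tensor Cores. */
    cudnnStatus_t configure_math(cudnnConvolutionDescriptor_t convDesc,
                                 int allowTF32)
    {
        return cudnnSetConvolutionMathType(
            convDesc, allowTF32 ? CUDNN_DEFAULT_MATH : CUDNN_FMA_MATH);
    }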

Performance improvements

This release includes performance improvements on all architectures for 2D and 3D grouped convolutions compared with version 7.6. Additionally, we improved kernel selection heuristics on several known Deep Learning GitHub Examples (also known as model scripts).

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.x.x.

Limitations

  • Samples must be installed in a writable location, otherwise the samples can crash.

  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN 8.0.2 library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section of the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by the two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a buffer of the larger size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. Users can eliminate the non-deterministic behavior of the cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory, while the second creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at least 4-byte alignment, including for INT8 data, in the cuDNN library.

  • Some computational options in cuDNN 8.0.2 now require increased alignment on tensors in order to run performantly. As always, cuDNN recommends aligning tensors to 128-bit boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.2 compared to cuDNN v7.6.

  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.2 users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.

  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.

  • On K80 GPUs, when cudnnConvolutionForward() is used with the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types, a silent error might occur when the output width Q is 1 and both height and width padding are zero.

  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory which might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.

  • In prior versions of cuDNN, some convolution algorithms could use texture-based load instructions for performance improvements, particularly on older hardware architectures. Users could opt out of textures using the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who wish to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.

Fixed Issues

The following issues have been fixed in this release:
  • The implementation of cudnnLRNCrossChannelBackward() for even-sized normalization windows was incorrect in all previous releases. This issue has been fixed in this release.

  • There isn't a dedicated API to query the supported or most performant algo for cudnnConvolutionBiasActivationForward() in cuDNN, and it is not recommended to query it via cudnnGetConvolutionForwardAlgorithm_v7(). Instead, we recommend using the cuDNN version 8 backend API. The number of supported engines can be queried from an operation graph descriptor using the attribute CUDNN_ATTR_OPERATIONGRAPH_ENGINE_GLOBAL_COUNT via cudnnBackendGetAttribute(). (See the sketch after this list.)

  • A memcheck error may have occurred on cuDNN version 7.x builds when calling cudnnConvolutionBackwardFilter() on Volta or Turing GPUs. This issue has been fixed in this release.

  • Various convolutions which exhibited sub-optimal performance on GA100 GPU are now achieving ideal performance. (not applicable for Jetson platforms)

  • cudnnCnnTrainVersionCheck() and cudnnCnnInferVersionCheck() were missing in past releases. This issue has been fixed in this release.

  • Documentation of the new RNN APIs and deprecations was not complete. The cudnnRNNBackwardData_v8() and cudnnRNNBackwardWeights_v8() functions have been added in this release.

  • cuDNN 8.0.1 built for Windows with CUDA 11.0 RC had reduced performance on 2D, 3D, and grouped convolutions compared to Linux. This issue has been fixed in this release. (not applicable for Jetson platforms)

  • There was a known issue in cuDNN 8.0.1 when linking statically to cuDNN and using the library's 3D algo1 backward filter convolutions. Users would see the library emit an internal error or incorrectly state that a shared library was missing. This issue has been fixed in this release.

  • When using an RPM file on RedHat for installation, upgrading from cuDNN v7 to cuDNN v8 directly or indirectly via TensorRT 7.1.3 would cause installation errors. This issue has been fixed in this release.

  • The implementation of cudnnLRNCrossChannelBackward() was inconsistent with the implementation of cudnnLRNCrossChannelForward() and returned incorrect results when the normalization window was even. This issue has been fixed in this release.

  • RNN APIs in cuDNN v8.0.1, compiled with CUDA 11.0, used an incorrect default down-conversion on GPUs with CUDA SM version SM80 (NVIDIA Ampere GPU family) when the supplied input data and weights have the CUDNN_DATA_FLOAT type and the cudnnMathType_t set via cudnnSetRNNMatrixMathType() is CUDNN_DEFAULT_MATH or CUDNN_TENSOR_OP_MATH. Instead of using the default TF32 computation when Tensor Cores are used, a down-conversion to FP16 (half precision) was performed, the same as in the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION mode. This introduced a lower dynamic range of intermediate data but possibly faster execution. To disable the automatic down-conversion of CUDNN_DATA_FLOAT weights and data in RNN APIs, the user needed to set the environment variable NVIDIA_TF32_OVERRIDE to 0 (note that this would have disabled the use of TF32 in the entire library, which might have a performance impact on CNNs that are not affected by this issue). Another workaround was to assign the CUDNN_FMA_MATH mode to the cudnnMathType_t argument in cudnnSetRNNMatrixMathType(). Due to this, the A100 GPU TF32 feature was not accessible for RNNs in cuDNN v8.0.1. This issue has been fixed in this release. (not applicable for Jetson platforms)

  • cuDNN convolution APIs could return CUDNN_STATUS_EXECUTION_FAILED when the number of input or output channels equals or exceeds 2,097,152. This issue existed in all cuDNN 8.0.x releases and has been fixed in this release.

  • Since version 8.0.0 Preview, cudnnConvolutionForward(), cudnnConvolutionBackwardData(), and cudnnConvolutionBackwardFilter() erroneously returned CUDNN_STATUS_INTERNAL_ERROR when the workspace size argument value was less than the required workspace size as returned by their respective cudnnGet*WorkspaceSize() APIs. This issue has been fixed and CUDNN_STATUS_BAD_PARAM is returned as documented.
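
Sketch referenced from the algo-query note above: counting the engines available for an operation graph via the backend API. This is a hedged example; opGraph is assumed to be a finalized operation graph descriptor, and the helper name is illustrative.

    #include <cudnn.h>

    /* Sketch: query how many engines can execute the given operation graph. */
    cudnnStatus_t count_engines(cudnnBackendDescriptor_t opGraph,
                                int64_t *engineCount)
    {
        int64_t returned = 0;  /* number of elements actually written */
        return cudnnBackendGetAttribute(opGraph,
                                        CUDNN_ATTR_OPERATIONGRAPH_ENGINE_GLOBAL_COUNT,
                                        CUDNN_TYPE_INT64,
                                        /*requestedElementCount=*/1,
                                        &returned, engineCount);
    }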

Known Issues

  • In this release, the performance of cudnnConvolutionBiasActivationForward() for true-half use cases on Pascal, and INT8x4 use cases on Volta and Turing, is still lower compared to version 7.6. In addition, FP32 and pseudo-FP16 performance on Volta, Turing, and the NVIDIA Ampere GPU architecture is still not fully optimized.

  • The new RNN APIs: cudnnRNNForward(), cudnnRNNBackwardData_v8(), and cudnnRNNBackwardWeights_v8() are available as a preview in the cuDNN 8.0.2 release.

  • Occasionally, inaccurate results were observed in outputs of the cudnnRNNBackwardWeights() and cudnnRNNBackwardWeightsEx() functions when the RNN cell type was GRU and the NVIDIA Ampere GPU architecture was used with FP32 I/O and mathType of CUDNN_DEFAULT_MATH or CUDNN_TENSOR_OP_MATH. Users may switch to CUDNN_FMA_MATH as a temporary workaround. This issue is being investigated.

  • cudnnRNN*() with LSTM mode may produce inaccurate results on the cy outputs when clipping is enabled on all GPUs. This issue exists in previous cuDNN releases as well.

  • On Volta and Pascal architectures, performance regressions may be present for TRUE_HALF convolution backward filter.

  • When using cudnnRNN*Ex() APIs, if the user uses CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED or CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED as the layout of the RNN data descriptors, and if the batch size is larger than 6144 on Volta or NVIDIA Ampere A100 GPUs, or larger than 4096 on Turing GPUs, CUDNN_STATUS_EXECUTION_FAILED may be returned.

  • Currently, there are libcudnn_ops/cnn/adv_infer/train_static.a binaries in the cuDNN Debian and tgz packages. Users are advised not to link against those and link against libcudnn_static.a instead. Those binaries will be removed from the release packages in the next release.

  • When using cudnnRNN*Ex() APIs, if the user plans to use CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED or CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED as the layout of the RNN data descriptors, the user should call cudnnSetRNNPaddingMode() to set the mode to CUDNN_RNN_PADDED_IO_ENABLED after initializing an RNNDescriptor but before calling cudnnGetRNNWorkspaceSize(). Not doing this may result in CUDNN_STATUS_EXECUTION_FAILED.

  • Updated: August 24, 2020

    Fused convolution-scale-bias-activation with per-channel α1 and α2 scaling gives incorrect results when the reorder type in the convolution descriptor is set to CUDNN_NO_REORDER.

  • Updated: August 24, 2020

    When using cudnnRNN* APIs with problem sizes (input size, hidden size) that are not multiples of 16 for FP16 tensors or multiples of 8 for FP32 tensors, users may encounter a return status of CUDNN_STATUS_EXECUTION_FAILED.

  • Updated: August 24, 2020

    For some 3D spatial non-Tensor-Core convolutions on Maxwell, Pascal, Volta, and Turing architectures, cudnnBackwardFilter() can return incorrect results when the convolution width padding exceeds the value (filterWidth - 1)/2. Likewise, users of cudnnBackendExecute() can experience the same issue when using the engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX 32 for backward filter. The issue affecting cudnnBackwardFilter() has been fixed in this release. With cudnnBackendFinalize(), an engine descriptor with CUDNN_ATTR_ENGINE_GLOBAL_INDEX 32 and a backward filter operation that satisfies the above condition will return CUDNN_STATUS_NOT_SUPPORTED.

cuDNN Release 8.0.1 Preview

Attention: This is the cuDNN 8.0.1 Preview release. This Preview release is for early testing and feedback, therefore, for production use of cuDNN, continue to use cuDNN 7.6.5. This release is subject to change based on ongoing performance tuning and functional testing. For feedback on the new backend API and deprecations, email cudnn@nvidia.com.

These release notes are applicable to JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

  • Added new kernels to improve the performance of fusion.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.0.1.

Limitations

  • Samples must be installed in a writable location, otherwise the samples can crash.

  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN 8.0.1 library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section of the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by the two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a buffer of the larger size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. Users can eliminate the non-deterministic behavior of the cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory, while the second creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • Some data types are not widely supported by all cuDNN APIs. For example, CUDNN_DATA_INT8x4 is not supported by many functions. In such cases, support is available by using cudnnTransformTensor() to transform the tensors from the desired type to a type supported by the API. For example, a user is able to transform input tensors from CUDNN_DATA_INT8x4 to CUDNN_DATA_INT8, run the desired API, and then transform output tensors from CUDNN_DATA_INT8 to CUDNN_DATA_INT8x4. Note that this transformation will incur an extra round trip to memory.

  • The tensor pointers and the filter pointers require at least 4-byte alignment, including for INT8 data, in the cuDNN library.

  • Some computational options in cuDNN 8.0.1 now require increased alignment on tensors in order to run performantly. As always, cuDNN recommends aligning tensors to 128-bit boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.1 compared to cuDNN v7.6.

  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.1 users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.

  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.

  • On K80 GPUs, when cudnnConvolutionForward() is used with the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types, a silent error might occur when the output width Q is 1 and both height and width padding are zero.

Fixed Issues

The following issues have been fixed in this release:

  • The descriptions of the dimA and strideA parameters in cudnnSetTensorNdDescriptor() did not document the tensor layout. The documentation has been updated to include this information.

  • cuDNN 8.0.0 Preview did not work with GA10x NVIDIA Ampere GPU architectures. This has been fixed in 8.0.1 Preview.

  • cuDNN 8.0.0 Preview removed a restriction on convolution backward filter for output filter with odd products of dimensions (N*C*D*H*W) for a kernel in algo0 for pre-Volta GPUs. This can potentially lead to an illegal memory access error. This restriction is restored in cuDNN 8.0.1 Preview. cuDNN will use a kernel that does not have this restriction for this computation case.

  • Fixed performance issues for pre-Volta architectures for convolutions (except when the compute type is half).

  • Mitigated the performance regression to less than 10% end-to-end.

Known Issues

  • On pre-Volta, there are significant performance issues on convolution layers when the compute type is half.

  • Sub-optimal performance is present in this release for all INT8 convolutions for all GPUs.

  • The performance of cudnnConvolutionBiasActivationForward() is slower than v7.6 in most cases. This is being actively worked on and performance optimizations will be available in the upcoming releases.

  • There are some peer-to-peer documentation links that are broken within the cuDNN API Reference. These links will be fixed in the next release.

  • cudnnCnnTrainVersionCheck() and cudnnCnnInferVersionCheck() are missing in this release and will be added in the GA release.

  • Documentation of RNN new APIs and deprecations is not complete. The cudnnRNNBackwardData_v8() and cudnnRNNBackwardWeights_v8() functions will be implemented in the next release.

  • cuDNN 8.0.1 Preview build with Windows and CUDA 11.0 RC has reduced performance on 2D, 3D, and grouped convolutions compared to Linux.

  • There is a known issue in cuDNN 8.0.1 when linking statically to cuDNN and using the library's 3D algo1 backward filter convolutions. Users will see the library emit an internal error or incorrectly state that a shared library is missing. This is a bug that will be fixed in a future release.

  • When using an RPM file on RedHat for installation, installing cuDNN v8 directly or via TensorRT 7.1.3 will enable users to build their application with cuDNN v8. However, in order for the user to compile an application with cuDNN v7 after cuDNN v8 is installed, the user will need to perform the following steps:
    1. Issue sudo mv /usr/include/cudnn.h /usr/include/cudnn_v8.h.
    2. Issue sudo ln -s /etc/alternatives/libcudnn /usr/include/cudnn.h.
    3. Switch to cuDNN v7 by issuing sudo update-alternatives --config libcudnn and choose cuDNN v7 from the list.

    Steps 1 and 2 are required for the user to be able to switch between v7 and v8 installations. After steps 1 and 2 are performed once, step 3 can be used repeatedly and the user can choose the appropriate cuDNN version to work with. For more information, refer to the Installing From An RPM File and Upgrading From v7 To v8 sections in the cuDNN Installation Guide.

  • When the FFT tiled algo (that is, CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING in forward convolution or CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING for backward data) is used for 3D convolution, an intermittent silent failure might happen due to an incorrect stream being used for kernel execution. In some cases, this might manifest as undefined values in the output.

  • The implementation of cudnnLRNCrossChannelBackward() is inconsistent with the implementation of cudnnLRNCrossChannelForward() and returns incorrect results when the normalization window is even. This will be fixed in a future release.

  • RNN APIs in cuDNN v8.0.1, compiled with CUDA 11.0, use an incorrect default down-conversion on GPUs with CUDA SM version SM80 (NVIDIA Ampere GPU family) when the supplied input data and weights have the CUDNN_DATA_FLOAT type and the cudnnMathType_t set via cudnnSetRNNMatrixMathType() is CUDNN_DEFAULT_MATH or CUDNN_TENSOR_OP_MATH. Instead of using the default TF32 computation when Tensor Cores are used, a down-conversion to FP16 (half precision) is performed, the same as in the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION mode. This introduces a lower dynamic range of intermediate data but possibly faster execution. To disable the automatic down-conversion of CUDNN_DATA_FLOAT weights and data in RNN APIs, set the environment variable NVIDIA_TF32_OVERRIDE to 0 (note that this disables the use of TF32 in the entire library, which might have a performance impact on CNNs that are not affected by this issue). Another workaround is to assign the CUDNN_FMA_MATH mode to the cudnnMathType_t argument in cudnnSetRNNMatrixMathType(). Because of this issue, the A100 TF32 feature is not accessible for RNNs in cuDNN v8.0.1.
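
    A minimal sketch of the second workaround (assuming an RNN descriptor created and configured elsewhere; the helper name is illustrative):

      // Hedged sketch: force FMA math on an RNN descriptor so FP32 data is not
      // down-converted to FP16. CUDNN_FMA_MATH requests pure-FP32 FMA kernels,
      // so Tensor Cores (and the down-conversion) are not used.
      #include <cudnn.h>

      void disableDownConversion(cudnnRNNDescriptor_t rnnDesc) {
          cudnnSetRNNMatrixMathType(rnnDesc, CUDNN_FMA_MATH);
      }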

  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32, or CUDNN_DATA_INT32). Floating-point types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to convert between the desired type and a supported type. The steps, with a sketch after the list, are:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory which might noticeably impact performance.
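
    A minimal sketch of these steps (assuming both descriptors describe the same tensor shape, the FLOAT buffer is preallocated, and the intermediate API call is elided; names are illustrative):

      // Hedged sketch: convert an INT8 tensor to FLOAT with
      // cudnnTransformTensor(), run the desired API on the FLOAT copy, then
      // convert the result back the same way with the descriptor roles swapped.
      #include <cudnn.h>

      void int8ViaFloat(cudnnHandle_t handle,
                        const cudnnTensorDescriptor_t int8Desc, const void* int8In,
                        const cudnnTensorDescriptor_t floatDesc, void* floatBuf) {
          const float alpha = 1.0f, beta = 0.0f;
          // Step 1: INT8 -> FLOAT.
          cudnnTransformTensor(handle, &alpha, int8Desc, int8In,
                               &beta, floatDesc, floatBuf);
          // Step 2: run the desired cuDNN API with floatDesc/floatBuf and a
          // CUDNN_DATA_FLOAT output tensor (not shown).
          // Step 3: convert the FLOAT output back to the desired type with
          // another cudnnTransformTensor() call.
      }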
  • Updated: August 24, 2020

    cuDNN convolution APIs may return CUDNN_STATUS_EXECUTION_FAILED when the number of input or output channels equals or exceeds 2097152.

  • Updated: August 24, 2020

    When using cudnnRNN* APIs with problem sizes (input size, hidden size) that are not multiples of 16 for FP16 tensors or multiples of 8 for FP32 tensors, users may encounter a return status of CUDNN_STATUS_EXECUTION_FAILED.

cuDNN Release 8.0.0 Preview

Attention: This is the cuDNN 8.0.0 Preview release. This Preview release is for early testing and feedback; for production use of cuDNN, continue to use cuDNN 7.6.5. This release is subject to change based on ongoing performance tuning and functional testing. For feedback on the new backend API and deprecations, email cudnn@nvidia.com.
These release notes are applicable to JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).
Note: cuDNN 8.0.0 passed GA quality testing and validation for TensorRT and JetPack users.

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

The following features and enhancements have been added to this release:

cuDNN library
  • The cuDNN library has been split into the following libraries:
    • cudnn_ops_infer - This entity contains the routines related to cuDNN context creation and destruction, tensor descriptor management, tensor utility routines, and the inference portion of common machine learning algorithms such as batch normalization, softmax, dropout, etc.

    • cudnn_ops_train - This entity contains common training routines and algorithms, such as batch normalization, softmax, dropout, etc. The cudnn_ops_train library depends on cudnn_ops_infer.

    • cudnn_cnn_infer - This entity contains all routines related to convolutional neural networks needed at inference time. The cudnn_cnn_infer library depends on cudnn_ops_infer.

    • cudnn_cnn_train - This entity contains all routines related to convolutional neural networks needed during training time. The cudnn_cnn_train library depends on cudnn_ops_infer, cudnn_ops_train, and cudnn_cnn_infer.

    • cudnn_adv_infer - This entity contains all other features and algorithms. This includes RNNs, CTC loss, and multi-head attention. The cudnn_adv_infer library depends on cudnn_ops_infer.

    • cudnn_adv_train - This entity contains all the training counterparts of cudnn_adv_infer. The cudnn_adv_train library depends on cudnn_ops_infer, cudnn_ops_train, and cudnn_adv_infer.

    • cudnn - This is an optional shim layer between the application layer and the cuDNN code. This layer opportunistically opens the correct library for the API at runtime.

  • cuDNN does not support mixing sub-library versions. If there is a mismatch in the cuDNN version numbers in the cuDNN sub-library header files, the build will fail. The versions need to match on the major number and minor number, as well as the patch level.

  • The cuDNN sub-libraries must be installed under a single directory.

Multiple dynamic libraries
In order to link against a subset of cuDNN, you need to know which subset of the API you are using and then link against the appropriate cuDNN sub-components. The cuDNN sub-components are as follows:
  • cudnn_ops_infer.so
  • cudnn_ops_train.so
  • cudnn_cnn_infer.so
  • cudnn_cnn_train.so
  • cudnn_adv_infer.so
  • cudnn_adv_train.so
cuDNN linking options
There are two different linking options:
  • Linking against individual sub-libraries: Users who link against individual sub-libraries must be able to identify the API exposed by each cuDNN sub-library. Users also need to know the hierarchy of the different cuDNN sub-libraries. Each .so or .a needs to be specified explicitly in the user’s linking command, as well as any external dependencies cuDNN requires. For more information, refer to the Limitations section below.

  • Linking against the full cuDNN (compatibility option): This allows users to link with -lcudnn. libcudnn.so is provided as a shim layer that opens the appropriate cuDNN sub-library for any particular cuDNN API call. libcudnn.a is largely unchanged; it is a single statically linked file containing all of cuDNN.

cuDNN loading options
For users who want a smaller memory footprint, there are two ways of loading the library.
  • Cherry-pick loading: Each sub-library is loaded only when accessed. This will cause the first reference to that sub-library to take a long time but will ensure the user isn’t loading more libraries than they need.

  • All access loading: All available cuDNN sub-libraries are loaded early during runtime.

New API functions

For a list of functions and data types that were added in this release, see API Changes For cuDNN 8.0.0.

General Support of CUDA Graph Capture
CUDA Graphs are now supported for all functions in this release, with the following restrictions:
  • CUDA Toolkit 10.2 or higher is required.
  • cuDNN 8.0.0 graphs are captured via the CUDA graph-capture APIs.
  • Any non-default use of textures by users of cuDNN needs to be disabled prior to capture.

cuDNN 8.0.0 does not at this time offer API support to add operations to an existing CUDA graph directly; however, the captured graph may be added to an existing graph through the existing CUDA Graphs API.

Regarding texture usage, cuDNN 8.0.0 by default will not enable texture usage; expert users may enable texture usage where allowed, but that usage will prevent a successful CUDA Graph capture until disabled. In order for cuDNN 8.0.0 to be graph-capture compatible library-wide, the cuDNN 8.0.0 CTC API was updated as described elsewhere.

The usual restrictions for CUDA Graphs apply in addition to the restrictions listed here.
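
A minimal sketch of graph capture with cuDNN (error handling and the convolution setup are elided; runConvolution stands in for any cuDNN call sequence issued on the captured stream):

    // Hedged sketch: capture a cuDNN call into a CUDA graph via the CUDA
    // stream-capture APIs, then instantiate and launch the captured graph.
    #include <cuda_runtime.h>
    #include <cudnn.h>

    void captureAndLaunch(cudnnHandle_t handle, cudaStream_t stream,
                          void (*runConvolution)(cudnnHandle_t)) {
        cudnnSetStream(handle, stream);  // cuDNN work must target the captured stream

        cudaGraph_t graph;
        cudaGraphExec_t graphExec;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        runConvolution(handle);          // e.g. cudnnConvolutionForward(...)
        cudaStreamEndCapture(stream, &graph);

        cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
        cudaGraphLaunch(graphExec, stream);  // replay the captured work
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(graphExec);
        cudaGraphDestroy(graph);
    }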

New APIs for convolution

A new set of API functions provides a brand new approach to cuDNN with more fine-grained control of performance, numerical properties, and so on, for convolution. Using this API, users directly access various engines that compute convolution forward propagation, backward data, backward filter, and generic support for fusion, starting with limited support in this cuDNN 8.0.0 release and expanding in follow-up releases. Each engine has performance tuning knobs such as GEMM tiling and split-K. Users can use this API to fine-tune their network by querying cuDNN’s heuristics, or doing their own, to find the optimal engine configuration with which cuDNN computes each network layer.
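
A minimal sketch of the heuristics query through the backend API (assuming an already-finalized operation graph descriptor; descriptor cleanup and error checks are elided):

    // Hedged sketch: ask cuDNN's heuristics for the top-ranked engine
    // configuration for an operation graph. The returned configuration can
    // then be used to finalize an execution plan.
    #include <cudnn.h>

    cudnnBackendDescriptor_t bestEngineConfig(cudnnBackendDescriptor_t opGraph) {
        cudnnBackendDescriptor_t heur, engCfg;
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR, &heur);
        cudnnBackendSetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_OPERATION_GRAPH,
                                 CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &opGraph);
        cudnnBackendHeurMode_t mode = CUDNN_HEUR_MODE_INSTANT;
        cudnnBackendSetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_MODE,
                                 CUDNN_TYPE_HEUR_MODE, 1, &mode);
        cudnnBackendFinalize(heur);

        // Retrieve the first (highest-ranked) engine configuration.
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &engCfg);
        int64_t returned = 0;
        cudnnBackendGetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
                                 CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &returned, &engCfg);
        return engCfg;
    }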

NVIDIA Ampere GPU architecture support (not applicable for Jetson platforms)
  • Added support for A100 GPU based on NVIDIA Ampere architecture.
  • cuDNN 8.0.0 delivers significant performance improvements on A100 GPUs compared to Volta V100 with cuDNN 7.6.
  • Added support for Tensor Float 32 (TF32) for 1D and 2D convolutions. Full support for TF32, including grouped convolutions and 3D convolutions as well as further performance tuning, will come in future releases; see the sketch after this list for selecting the math type.
  • Increased performance for the legacy Tensor Cores (mixed precision) for 1D, 2D, 3D, and grouped convolutions.
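
As referenced in the TF32 item above, a minimal sketch of selecting the convolution math type (assuming an existing convolution descriptor; the helper name is illustrative):

    // Hedged sketch: under CUDNN_DEFAULT_MATH, FP32 convolutions are eligible
    // for TF32 Tensor Core kernels on Ampere; CUDNN_FMA_MATH keeps the
    // computation in classic FP32 FMAs.
    #include <cudnn.h>

    void chooseMath(cudnnConvolutionDescriptor_t convDesc, bool allowTF32) {
        cudnnSetConvolutionMathType(
            convDesc, allowTF32 ? CUDNN_DEFAULT_MATH : CUDNN_FMA_MATH);
    }
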
Turing and Volta architecture improvements
  • New kernels for Tensor Cores and a heuristics update for 1D convolution result in performance improvements for speech networks such as Jasper, Tacotron2, and WaveGlow, in addition to support for grouped 1D convolution (QuartzNet).
  • Added support for 3D convolutions in NHWC and improved heuristics and kernels for Tensor Cores in NCHW, resulting in performance improvements for VNet, UNet-Medical, and UNet-Industrial. FP16 3D convolutions are supported as well.
  • Better utilization of Tensor Cores and heuristics for grouped convolutions result in improvements for ResNext.
  • More tuning for vision networks like ResNet-50 ([MXNet] [PyTorch] [TensorFlow]) and SSD ([PyTorch] [TensorFlow]) with new updated heuristics.
Operation fusion

Operation fusion can be achieved via the backend API. The general workflow is similar to running unfused operations, except that instead of creating a single-operation operation graph, the user may specify a multi-operation operation graph. For more information, see Operation Fusion Via The Backend API in the cuDNN Developer Guide.
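
A minimal sketch of building a two-operation graph (assuming the two operation descriptors have been created and finalized elsewhere; error checks are elided):

    // Hedged sketch: populate an operation graph with an array of operations
    // instead of a single one; finalization checks that the fusion is supported.
    #include <cudnn.h>

    void buildFusedGraph(cudnnHandle_t handle,
                         cudnnBackendDescriptor_t convOp,
                         cudnnBackendDescriptor_t pointwiseOp) {
        cudnnBackendDescriptor_t opGraph;
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR,
                                     &opGraph);
        cudnnBackendDescriptor_t ops[2] = {convOp, pointwiseOp};
        cudnnBackendSetAttribute(opGraph, CUDNN_ATTR_OPERATIONGRAPH_OPS,
                                 CUDNN_TYPE_BACKEND_DESCRIPTOR, 2, ops);
        cudnnBackendSetAttribute(opGraph, CUDNN_ATTR_OPERATIONGRAPH_HANDLE,
                                 CUDNN_TYPE_HANDLE, 1, &handle);
        cudnnBackendFinalize(opGraph);
    }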

Depthwise convolution extension

We’ve extended the fprop and dgrad NHWC depthwise kernels to support more filter size/stride combinations, such as 5x5/1x1, 5x5/2x2, 7x7/1x1, and 7x7/2x2 (in addition to the existing 1x1/1x1, 3x3/1x1, and 3x3/2x2), with good performance.
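
A minimal sketch of expressing a depthwise convolution (assuming an existing convolution descriptor; the 5x5 filter with 2x2 stride, one of the newly supported combinations, is shown):

    // Hedged sketch: a depthwise convolution is expressed by setting the group
    // count equal to the channel count, so each input channel is convolved
    // with its own filter.
    #include <cudnn.h>

    void makeDepthwise(cudnnConvolutionDescriptor_t convDesc, int channels) {
        cudnnSetConvolution2dDescriptor(convDesc,
                                        2, 2,  // pad h, w
                                        2, 2,  // stride h, w
                                        1, 1,  // dilation h, w
                                        CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
        cudnnSetConvolutionGroupCount(convDesc, channels);  // groups == C
    }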

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.0.0.

Limitations

  • Samples must be installed in a writable location, otherwise the samples can crash.

  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN 8.0.0 library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section in the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by the two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.
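
    A minimal sketch of the workaround (POSIX setenv() is assumed; the variable must be set before the cuBLAS/cuDNN handles are created):

      // Hedged sketch: pin cuBLAS to a single workspace size so that kernel
      // selection, and therefore numerical results, is repeatable.
      #include <cstdlib>

      void forceDeterministicCublasWorkspace() {
          // Two 4 MB buffers only; ":16:8" (eight 16 KB buffers) also works.
          setenv("CUBLAS_WORKSPACE_CONFIG", ":4096:2", /*overwrite=*/1);
      }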

  • Some data types are not widely supported by all cuDNN APIs. For example, CUDNN_DATA_INT8x4 is not supported by many functions. In such cases, support is available by using cudnnTransformTensor() to transform the tensors from the desired type to a type supported by the API. For example, a user can transform input tensors from CUDNN_DATA_INT8x4 to CUDNN_DATA_INT8, run the desired API, and then transform output tensors from CUDNN_DATA_INT8 to CUDNN_DATA_INT8x4. Note that this transformation incurs an extra round trip to memory.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.

  • Some computational options in cuDNN 8.0.0 now require increased alignment on tensors in order to run performantly. As always, cuDNN recommends aligning tensors to 128-bit boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.0 compared to cuDNN v7.6.
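
    A minimal sketch of checking the recommended 128-bit (16-byte) alignment (pointers returned by cudaMalloc() are already at least 256-byte aligned; the check matters for pointers offset into larger allocations):

      #include <cstdint>

      // Hedged sketch: true if p satisfies the 128-bit alignment recommendation.
      bool isAligned128(const void* p) {
          return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
      }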

  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy of the different algorithms might differ. cuDNN 8.0.0 users can query the numerical notes of the algorithms programmatically through the backend API. There are cases where algo0 and algo1 have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.

Deprecated Features

The following features are deprecated in cuDNN 8.0.0:
  • Support for Ubuntu 14.04 has been deprecated in this release. Upgrade to 16.04 or 18.04 for continued support.

  • Support for Mac OS X has been deprecated in this release. Operating systems that are currently supported are Linux and Windows.

  • cuDNN version 8 introduces a new API deprecation policy to enable a faster pace of innovation. A streamlined, two-step, deprecation policy will be used for all API changes starting with cuDNN version 8. For details about this new deprecation policy, see Backward Compatibility And Deprecation Policy in the cuDNN Developer Guide.

  • Some APIs have been removed or deprecated. For a list of removed and deprecated APIs, see API Changes For cuDNN 8.0.0.

Fixed Issues

The following issues have been fixed in this release:

  • cudnnDestroy() did not destroy everything that cudnnCreate() created; calling cudnnDestroy() after cudnnCreate() leaked about 1.6 MB of host memory in some tests. This issue has been fixed in cuDNN 8.0.0.

  • Starting in cuDNN 7.6.1, when using the experimental multi-head attention API, it is possible that the forward and backward paths produce different results for the BERT model, when the batch size is greater than one and/or the number of heads is greater than one. This issue has been fixed in cuDNN 8.0.0.

  • The description of cudnnSetCTCLossDescriptorEx() is not clear. This issue has been fixed in cuDNN 8.0.0.

  • Documentation affecting 1x1 convolution functions is not clear, for example cudnnFindConvolutionBackwardDataAlgorithm(). This issue has been fixed in cuDNN 8.0.0.

  • cuDNN forward convolution with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM does not propagate NaNs in weights. This issue has been fixed in cuDNN 8.0.0.

  • Mathematical definitions of the operations in cuDNN were not documented. Full mathematical descriptions are now included for the convolution functions.

  • The functions cudnnGetConvolutionForwardAlgorithm_v7() and cudnnGetConvolutionForwardWorkspaceSize() may return CUDNN_STATUS_SUCCESS while the execution of the same convolution returns CUDNN_STATUS_NOT_SUPPORTED. Similar issues may also happen for cudnnConvolutionBackwardData() and cudnnConvolutionBackwardFilter(). This issue is present in the cuDNN 7.2.2 library and later versions. This has been fixed in cuDNN 8.0.0.

  • Algorithms returned by cudnnGetConvolution*Algorithm() may, in some limited use cases, fail to execute when they are actually run. This is a cuDNN library-wide issue and applies for convolution forward, convolution backward data, and convolution backward filter operations. This issue is also present in versions prior to cuDNN 8.0.0 EA.

  • cuDNN did not support CUDA graphs: launching a CUDA graph constructed via a stream capture that included a cudnnConvolutionForward() operation could produce a cudaErrorLaunchFailure error. CUDA graph capture is now supported (see General Support of CUDA Graph Capture above), and the user can proceed.

  • There was a known performance drop in 3D convolutions for some cases on Turing GPUs since cuDNN 7.4.2. This has been fixed on T4. (not applicable for Jetson platforms)

  • There were rare cases where cudnnConvolution* would return CUDNN_STATUS_NOT_SUPPORTED while cudnn*GetWorkspaceSize() returned success for the same algorithm. This has been fixed in cuDNN 8.0.0.

  • In previous versions of cuDNN, CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM did not propagate NaN values in some cases. This is fixed in the current release. Users desiring the old behavior can configure ReLU activation and set the floor to be -Inf.

  • The multiHeadAttention sample code was added to the cuDNN 7.6.3 release. The sample code includes a simple NumPy/Autograd reference model of the multi-head attention block that computes the forward response and all derivatives. The test code demonstrates how to use the multi-head attention API, access attention weights, and sequence data.

  • Updated: July 22, 2020

    In version 7.6.x, cudnnConvolutionBackwardData() with PSEUDO_HALF_CONFIG with CUDNN_TENSOR_OP_MATH or FLOAT_CONFIG with CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION returns incorrect results in 3D convolution when the filter size of the w dimension is 1 and padding of the w dimension is 0. This issue has been fixed in this release.

Known Issues

  • Performance regressions on V100 are observed in this release on SSD inference use cases if not using TensorRT.

  • There are significant performance regressions on pre-Volta GPUs and some Turing GPUs based on the TU102 architecture. This performance regression is not applicable to T4, JetPack, and Tegra.

  • Sub-optimal performance is present in this release for all INT8 convolutions for all GPUs.

  • The performance of cudnnConvolutionBiasActivationForward() is slower than v7.6 in most cases. This is being actively worked on and performance optimizations will be available in the upcoming releases.

  • On K80 GPUs when cudnnConvolutionForward() is used with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types a silent error might occur.

  • There are some peer-to-peer documentation links that are broken within the cuDNN API Reference. These links will be fixed in the next release.

  • cudnnCnnTrainVersionCheck() and cudnnCnnInferVersionCheck() are missing in this release and will be added in the GA release.

  • Documentation of the new RNN APIs and deprecations is not complete. The cudnnRNNBackwardData_v8() and cudnnRNNBackwardWeights_v8() functions will be implemented in the next release.

  • cuDNN 8.0.0 Preview will not work with GA10x NVIDIA Ampere GPU architectures. This will be fixed in the next release.

  • cuDNN 8.0.0 Preview build with Windows and CUDA 11.0 RC has reduced performance on 2D, 3D, and grouped convolutions compared to Linux.

  • Updated: June 12, 2020

    There is a known issue in cuDNN 8.0.0 when linking statically to cuDNN and using the library's 3D algo1 backward filter convolutions. Users will see the library emit an internal error or incorrectly state that a shared library is missing. This is a bug that will be fixed in a future release.

  • Updated: June 25, 2019
    When using an RPM file on RedHat for installation, installing cuDNN v8 directly or via TensorRT 7.1.3 will enable users to build their application with cuDNN v8. However, in order for the user to compile an application with cuDNN v7 after cuDNN v8 is installed, the user will need to perform the following steps:
    1. Issue sudo mv /usr/include/cudnn.h /usr/include/cudnn_v8.h.
    2. Issue sudo ln -s /etc/alternatives/libcudnn /usr/include/cudnn.h.
    3. Switch to cuDNN v7 by issuing sudo update-alternatives --config libcudnn and choose cuDNN v7 from the list.

    Steps 1 and 2 are required for the user to be able to switch between v7 and v8 installations. After steps 1 and 2 are performed once, step 3 can be used repeatedly and the user can choose the appropriate cuDNN version to work with. For more information, refer to the Installing From An RPM File and Upgrading From v7 To v8 sections in the cuDNN Installation Guide.

  • Updated: July 22, 2020

    cudnnConvolutionForward(), cudnnConvolutionBackwardData(), and cudnnConvolutionBackwardFilter() erroneously return CUDNN_STATUS_INTERNAL_ERROR when the workspace size argument value is less than the required workspace size as returned by their respective cudnnGet*WorkspaceSize() APIs.

  • Updated: August 24, 2020

    cuDNN convolution APIs may return CUDNN_STATUS_EXECUTION_FAILED when the number of input or output channels equals or exceeds 2097152.