cuDNN Release 7.6.0

These are the release notes for cuDNN 7.6.0. This release includes fixes from the previous cuDNN 7.x.x releases as well as the following additional changes.

Key Features and Enhancements

The following features and enhancements have been added to this release:

  • A new API is introduced for fused ops, which can accelerate many use cases in ResNet-like networks. With this new API it is now possible to execute various fused operations, such as applying a per-channel scale and bias, performing activation, computing a convolution, and generating batch-normalization statistics. Below is a list of the supported datatypes and functions in this API:
    Datatypes:
    • cudnnFusedOpsVariantParamPack_t
    • cudnnFusedOpsConstParamPack_t
    • cudnnFusedOpsPlan_t
    • cudnnFusedOps_t
    • cudnnFusedOpsConstParamLabel_t
    • cudnnFusedOpsPointerPlaceHolder_t
    • cudnnFusedOpsVariantParamLabel_t
    Functions:
    • cudnnCreateFusedOpsConstParamPack
    • cudnnDestroyFusedOpsConstParamPack
    • cudnnSetFusedOpsConstParamPackAttribute
    • cudnnGetFusedOpsConstParamPackAttribute
    • cudnnCreateFusedOpsVariantParamPack
    • cudnnDestroyFusedOpsVariantParamPack
    • cudnnSetFusedOpsVariantParamPackAttribute
    • cudnnGetFusedOpsVariantParamPackAttribute
    • cudnnCreateFusedOpsPlan
    • cudnnDestroyFusedOpsPlan
    • cudnnMakeFusedOpsPlan
    • cudnnFusedOpsExecute
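    A minimal sketch of the fused-ops call sequence, using the functions listed above. This is illustrative only: it assumes a valid cudnnHandle_t and pre-built descriptors; the op selector CUDNN_FUSED_SCALE_BIAS_ACTIVATION_CONV_BNSTATS and the attribute labels CUDNN_PARAM_XDESC / CUDNN_PTR_XDATA are shown as assumptions about the enum values; all error checking is omitted.

    ```c
    /* Sketch only: not a complete program; requires cuDNN 7.6.0+. */
    #include <cudnn.h>

    void run_fused_op(cudnnHandle_t handle,
                      cudnnTensorDescriptor_t xDesc, void *xData)
    {
        cudnnFusedOpsConstParamPack_t   constPack;
        cudnnFusedOpsVariantParamPack_t varPack;
        cudnnFusedOpsPlan_t             plan;
        cudnnFusedOps_t op = CUDNN_FUSED_SCALE_BIAS_ACTIVATION_CONV_BNSTATS;
        size_t workspaceSize = 0;

        /* 1. Describe the invariant parts of the fused op (descriptors, modes). */
        cudnnCreateFusedOpsConstParamPack(&constPack, op);
        cudnnSetFusedOpsConstParamPackAttribute(constPack, CUDNN_PARAM_XDESC, xDesc);

        /* 2. Build an execution plan and query the required workspace size. */
        cudnnCreateFusedOpsPlan(&plan, op);
        cudnnMakeFusedOpsPlan(handle, plan, constPack, &workspaceSize);

        /* 3. Bind the per-call device pointers and execute. */
        cudnnCreateFusedOpsVariantParamPack(&varPack, op);
        cudnnSetFusedOpsVariantParamPackAttribute(varPack, CUDNN_PTR_XDATA, xData);
        cudnnFusedOpsExecute(handle, plan, varPack);

        /* 4. Clean up. */
        cudnnDestroyFusedOpsVariantParamPack(varPack);
        cudnnDestroyFusedOpsPlan(plan);
        cudnnDestroyFusedOpsConstParamPack(constPack);
    }
    ```

    The key design point is the split between the const param pack, which is set once and describes the operation, and the variant param pack, which is cheaply rebound per call with fresh device pointers.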
  • Improved the performance of grouped convolution layers in ResNeXt-50, for cudnnConvolutionBackwardData() in the configuration below:
    • On NVIDIA Volta (compute capability 7.0)
    • Algorithm is CUDNN_CONVOLUTION_BWD_DATA_ALGO_1
    • Stride of 1
    • Math type is CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION
    • Tensor format for filter is NHWC
    • Inputs and outputs are in FP16 and computation is in FP32
  • A new API is introduced to reduce inference time. With this new API it is now possible to perform the filter layout transformation once, ahead of time, instead of on every call, which in turn reduces inference time. Below is a list of the supported datatype and functions in this API:
    Datatype:
    • cudnnReorderType_t
    Functions:
    • cudnnReorderFilterAndBias
    • cudnnSetConvolutionReorderType
    • cudnnGetConvolutionReorderType
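    As a rough sketch of how this API might be used (assuming an INT8x32 convolution configured elsewhere; the descriptor and buffer names are placeholders, and error checking is omitted):

    ```c
    /* Sketch only: not a complete program; requires cuDNN 7.6.0+. */
    #include <cudnn.h>

    void prepare_reordered_filter(cudnnHandle_t handle,
                                  cudnnConvolutionDescriptor_t convDesc,
                                  cudnnFilterDescriptor_t filterDesc,
                                  const void *filterData,
                                  void *reorderedFilterData)
    {
        /* One-time reordering of the filter into the layout the kernel
           expects (bias reordering is skipped in this sketch). */
        cudnnReorderFilterAndBias(handle, filterDesc, CUDNN_DEFAULT_REORDER,
                                  filterData, reorderedFilterData,
                                  0 /* reorderBias */, NULL, NULL);

        /* Tell cuDNN the filter data is already reordered, so the layout
           transformation is not repeated on every inference call. */
        cudnnSetConvolutionReorderType(convDesc, CUDNN_NO_REORDER);
    }
    ```

    After this one-time setup, subsequent convolution calls can use reorderedFilterData directly, amortizing the transformation cost across all inference calls.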
  • Performance is enhanced (by selecting a faster kernel) on NVIDIA T4 cards for INT8x4 and INT8x32.

Fixed Issues

The following issues have been fixed in this release:

  • In cuDNN 7.5.0 and cuDNN 7.5.1, a bug in the cudnnRNNBackwardData() function affected thread synchronization when the function was used with the CUDNN_RNN_ALGO_PERSIST_STATIC algorithm. The effect was limited to the first iteration of the loop, and only to some code paths. This is fixed in cuDNN 7.6.0.

Known Issues

The following issues and limitations exist in this release:

  • The cudnnConvolutionBackwardData() function for CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 fails with CUDNN_STATUS_NOT_SUPPORTED when the input size is large.
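    Until this is resolved, one possible workaround is to check the return status and fall back to a different algorithm. A sketch (the argument names are placeholders, error handling is elided, and note that the two algorithms may require different workspace sizes, which should be re-queried before the fallback call):

    ```c
    /* Sketch only: assumes handle, descriptors, device buffers, and
       workspace were set up elsewhere. */
    cudnnStatus_t status = cudnnConvolutionBackwardData(
        handle, alpha, wDesc, w, dyDesc, dy, convDesc,
        CUDNN_CONVOLUTION_BWD_DATA_ALGO_0, workspace, workspaceSize,
        beta, dxDesc, dx);
    if (status == CUDNN_STATUS_NOT_SUPPORTED) {
        /* Retry with ALGO_1, which is not subject to this known issue. */
        status = cudnnConvolutionBackwardData(
            handle, alpha, wDesc, w, dyDesc, dy, convDesc,
            CUDNN_CONVOLUTION_BWD_DATA_ALGO_1, workspace, workspaceSize,
            beta, dxDesc, dx);
    }
    ```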
  • A general known issue for the cuDNN library: the tensor pointers and the filter pointers require a minimum of 4-byte alignment, including for FP16 or INT8 data.
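    The alignment requirement can be verified before handing pointers to cuDNN. A minimal, self-contained sketch (the helper name is ours, not part of the cuDNN API):

    ```c
    #include <stdint.h>
    #include <stdio.h>

    /* Returns 1 if `ptr` meets the given power-of-two alignment, else 0. */
    static int is_aligned(const void *ptr, size_t alignment) {
        return ((uintptr_t)ptr & (uintptr_t)(alignment - 1)) == 0;
    }

    int main(void) {
        /* Hypothetical FP16 buffer: each element is 2 bytes, so starting a
           tensor at an odd element offset breaks the 4-byte requirement. */
        _Alignas(16) unsigned char buf[64];
        printf("%d\n", is_aligned(buf, 4));     /* base pointer: aligned */
        printf("%d\n", is_aligned(buf + 2, 4)); /* +1 FP16 element: misaligned */
        return 0;
    }
    ```

    In practice this means a sub-tensor view that begins at an odd FP16 (or non-multiple-of-4 INT8) element offset into a buffer is not a valid cuDNN pointer, even though the underlying allocation is aligned.
    
    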
  • On RHEL7 only, the /usr/src/cudnn_samples_v7/samples_common.mk file is missing, which prevents compiling the cuDNN samples. The workaround is to copy the contents below into a text file named “samples_common.mk” and place it in the “/usr/src/cudnn_samples_v7/” directory, so that the /usr/src/cudnn_samples_v7/samples_common.mk file exists.
    # Setting SMS for all samples
    # architecture
    
    ifneq ($(TARGET_ARCH), ppc64le)
    CUDA_VERSION := $(shell cat $(CUDA_PATH)/include/cuda.h |grep "define CUDA_VERSION" |awk '{print $$3}')
    else
    CUDA_VERSION := $(shell cat $(CUDA_PATH)/targets/ppc64le-linux/include/cuda.h |grep "define CUDA_VERSION" |awk '{print $$3}')
    endif
    
    #Link against cublasLt for CUDA 10.1 and up.
    CUBLASLT:=false
    ifeq ($(shell test $(CUDA_VERSION) -ge 10010; echo $$?),0)
    CUBLASLT:=true
    endif
    $(info Linking against cublasLt = $(CUBLASLT))
    
    ifeq ($(CUDA_VERSION),8000)
    SMS_VOLTA =
    else
    ifneq ($(TARGET_ARCH), ppc64le)
    ifeq ($(CUDA_VERSION), $(filter $(CUDA_VERSION), 9000 9010 9020))
    SMS_VOLTA ?= 70
    else
    ifeq ($(TARGET_OS), darwin)
    SMS_VOLTA ?= 70
    else
    SMS_VOLTA ?= 70 72 75
    endif #ifneq ($(TARGET_OS), darwin)
    endif #ifeq ($(CUDA_VERSION), $(filter $(CUDA_VERSION), 9000 9010 9020))
    else
    SMS_VOLTA ?= 70
    endif #ifneq ($(TARGET_ARCH), ppc64le)
    endif #ifeq ($(CUDA_VERSION),8000 )
    SMS ?= 30 35 50 53 60 61 62 $(SMS_VOLTA)