NVIDIA cuDNN is a GPU-accelerated library of primitives for deep neural networks. It provides highly tuned implementations of routines that arise frequently in DNN applications, such as convolution, pooling, softmax, and activation functions.
cuDNN's convolution routines aim for performance competitive with the fastest GEMM-based (matrix-multiply-based) implementations of such routines, while using significantly less memory.
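To give a rough sense of where the memory cost of a GEMM-based convolution can come from, the sketch below counts the elements in the lowered input matrix. It assumes the common "im2col" lowering, in which every filter-sized input patch is copied into a column of a temporary matrix. The text above does not name this scheme, so treat it as an illustrative assumption; the function names and the example shape are hypothetical.

```python
def conv_output_size(size, kernel, pad=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * pad - kernel) // stride + 1


def im2col_elements(n, c, h, w, r, s, pad=0, stride=1):
    """Element count of the lowered matrix under an assumed im2col scheme.

    The matrix has one row per (channel, filter row, filter column)
    and one column per (image, output row, output column).
    """
    p = conv_output_size(h, r, pad, stride)
    q = conv_output_size(w, s, pad, stride)
    return (c * r * s) * (n * p * q)


# Hypothetical layer: 1 image, 64 channels, 56x56 pixels, 3x3 filter, pad 1.
input_elements = 1 * 64 * 56 * 56
lowered_elements = im2col_elements(1, 64, 56, 56, 3, 3, pad=1)
print(input_elements, lowered_elements, lowered_elements // input_elements)
# prints: 200704 1806336 9
```

Under these assumptions the temporary matrix is about nine times the size of the input (the filter area, 3 x 3), which is the kind of overhead an implementation that avoids materializing the lowered matrix does not pay.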
cuDNN features customizable data layouts supporting flexible dimension ordering, striding, and subregions for the 4D tensors used as inputs and outputs in all of its routines. This flexibility allows easy integration into any neural network implementation and avoids the input/output transposition steps sometimes necessary with GEMM-based convolutions.
cuDNN offers a context-based API that allows for easy multi-threading and (optional) interoperability with CUDA streams.