cuDNN Overview
cuDNN's convolution routines aim for performance competitive with the fastest GEMM-based (matrix-multiply-based) implementations of those routines, while using significantly less memory.
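To make the memory comparison concrete, here is a rough sketch of how much scratch space one common GEMM-based approach needs. It assumes the input is lowered into an explicit matrix (the technique often called im2col) with unit stride and no padding. The function names are hypothetical and are not part of cuDNN, and the sketch says nothing about how cuDNN itself is implemented.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration only; none of these are cuDNN functions.

// Output extent along one spatial dimension, assuming unit stride, no padding,
// and a filter no larger than the input.
inline std::size_t conv_output_extent(std::size_t in, std::size_t filter) {
    assert(filter >= 1 && filter <= in);
    return in - filter + 1;
}

// Number of elements in an N x C x H x W input tensor.
inline std::size_t input_elements(std::size_t N, std::size_t C,
                                  std::size_t H, std::size_t W) {
    return N * C * H * W;
}

// Number of elements in the lowered (im2col) matrix for an R x S filter:
// one row per (channel, filter row, filter column) triple and
// one column per output position in every image.
inline std::size_t im2col_elements(std::size_t N, std::size_t C,
                                   std::size_t H, std::size_t W,
                                   std::size_t R, std::size_t S) {
    const std::size_t P = conv_output_extent(H, R);
    const std::size_t Q = conv_output_extent(W, S);
    return (C * R * S) * (N * P * Q);
}
```

Under these assumptions, a single 3-channel 5x5 image has 75 input elements, while its lowered matrix for a 3x3 filter has 243 elements, because overlapping filter windows duplicate input values.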
cuDNN features customizable data layouts, supporting flexible dimension ordering, striding, and subregions for the 4D tensors used as inputs and outputs in all of its routines. This flexibility allows easy integration into any neural network implementation and avoids the input/output transposition steps that GEMM-based convolutions sometimes require.
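The idea behind stride-based layouts can be sketched without any library calls. The example below is a minimal illustration: the `TensorView4D` type and its helper functions are hypothetical names and are not part of the cuDNN API. It shows how one set of logical (n, c, h, w) indices can address memory in different dimension orderings, and how a subregion can be described by shifting a base offset while reusing the parent's strides.

```cpp
#include <array>
#include <cstddef>

// Hypothetical type for illustration only; not a cuDNN structure.
// Extents and strides are given in elements (not bytes), in (n, c, h, w) order.
struct TensorView4D {
    std::array<std::size_t, 4> dims;
    std::array<std::size_t, 4> strides;
    std::size_t base;  // element offset of the view's origin
};

// Linear element offset of logical index (n, c, h, w).
inline std::size_t offset(const TensorView4D& t, std::size_t n, std::size_t c,
                          std::size_t h, std::size_t w) {
    return t.base + n * t.strides[0] + c * t.strides[1] +
           h * t.strides[2] + w * t.strides[3];
}

// Densely packed layout with w varying fastest (often written NCHW).
inline TensorView4D make_nchw(std::size_t N, std::size_t C,
                              std::size_t H, std::size_t W) {
    return TensorView4D{{N, C, H, W}, {C * H * W, H * W, W, 1}, 0};
}

// Densely packed layout with c varying fastest (often written NHWC).
inline TensorView4D make_nhwc(std::size_t N, std::size_t C,
                              std::size_t H, std::size_t W) {
    return TensorView4D{{N, C, H, W}, {H * W * C, 1, W * C, C}, 0};
}

// Spatial subregion of size hh x ww starting at (h0, w0).
// No data is copied: the strides are reused and only the base offset moves.
inline TensorView4D crop(const TensorView4D& t, std::size_t h0, std::size_t w0,
                         std::size_t hh, std::size_t ww) {
    return TensorView4D{{t.dims[0], t.dims[1], hh, ww}, t.strides,
                        t.base + h0 * t.strides[2] + w0 * t.strides[3]};
}
```

Because the caller supplies the strides, code written against the logical (n, c, h, w) indices works for either ordering and for cropped views alike, which is why no transposition or copy is needed to change layout.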
cuDNN offers a context-based API that simplifies multithreading and supports optional interoperability with CUDA streams.