The NVIDIA® CUDA® Toolkit provides a comprehensive development environment for C and C++ developers building GPU-accelerated applications. With the CUDA Toolkit, you can develop, optimize, and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms, and HPC supercomputers. The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, and a runtime library for deploying your application.

Using built-in capabilities for distributing computations across multi-GPU configurations, scientists and researchers can develop applications that scale from single-GPU workstations to cloud installations with thousands of GPUs.

Documentation Center
These documents provide information regarding the current NVIDIA CUDA release.
CUDA Math Libraries
The cuBLAS library is an implementation of Basic Linear Algebra Subprograms (BLAS) on the NVIDIA CUDA runtime. It enables the user to access the computational resources of NVIDIA GPUs.
The NVIDIA CUDA Fast Fourier Transform (cuFFT) library consists of two components: cuFFT and cuFFTW. The cuFFT library provides high-performance FFT computation on NVIDIA GPUs, and the cuFFTW library is a porting tool that lets applications written for the Fastest Fourier Transform in the West (FFTW) run on NVIDIA GPUs.
The cuFFT Device Extensions (cuFFTDx) library enables you to perform Fast Fourier Transform (FFT) calculations inside your CUDA kernel. Fusing FFT with other operations can decrease the latency and improve the performance of your application.
The NVIDIA CUDA Random Number Generation (cuRAND) library provides an API for simple and efficient generation of high-quality pseudorandom and quasirandom numbers.
The cuSOLVER library is a high-level package based on the cuBLAS and cuSPARSE libraries. It provides Linear Algebra Package (LAPACK)-like features such as common matrix factorization and triangular solve routines for dense matrices.
The cuSPARSE library contains a set of basic linear algebra subroutines used for handling sparse matrices. It’s implemented on the NVIDIA CUDA runtime and is designed to be called from C and C++.
The cuTENSOR library is a first-of-its-kind, GPU-accelerated tensor linear algebra library, providing high-performance tensor contraction, reduction, and element-wise operations. cuTENSOR is used to accelerate applications in the areas of deep learning training and inference, computer vision, quantum chemistry, and computational physics.
NVIDIA Performance Primitives (NPP) is a library of functions for performing CUDA-accelerated 2D image and signal processing. This library is widely applicable for developers in these areas and is written to maximize flexibility while maintaining high performance.
The nvJPEG Library provides high-performance, GPU-accelerated JPEG encoding and decoding functionality. This library is intended for image formats commonly used in deep learning and hyperscale multimedia applications.
The nvJPEG2000 library provides high-performance, GPU-accelerated JPEG2000 decoding functionality. This library is intended for JPEG2000 formatted images commonly used in deep learning, medical imaging, remote sensing, and digital cinema applications.
The nvTIFF library accelerates the decoding and encoding of TIFF images compressed with LZW on NVIDIA GPUs. The library is built on the CUDA® platform and is supported on Volta and later GPU architectures.
NVIDIA cuSOLVERMp is a high-performance, distributed-memory, GPU-accelerated library that provides tools for solving dense linear systems and eigenvalue problems.
NVIDIA GPUDirect Storage (GDS) enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU. This direct path increases system bandwidth and decreases the latency and utilization load on the CPU.

These archives provide access to previously released CUDA documentation versions.
