NVIDIA CUDA

The NVIDIA CUDA Toolkit provides a comprehensive development environment for C and C++ developers building GPU-accelerated applications. With the CUDA Toolkit, you can develop, optimize, and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms and HPC supercomputers. The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, and a runtime library to deploy your application.


Using built-in capabilities for distributing computations across multi-GPU configurations, scientists and researchers can develop applications that scale from single GPU workstations to cloud installations with thousands of GPUs.
The cuBLAS library is an implementation of Basic Linear Algebra Subprograms (BLAS) on the NVIDIA CUDA runtime. It enables the user to access the computational resources of NVIDIA GPUs.
The NVIDIA CUDA Fast Fourier Transform (cuFFT) library consists of two components: cuFFT and cuFFTW. The cuFFT library provides high performance on NVIDIA GPUs, and the cuFFTW library is a porting tool to use the Fastest Fourier Transform in the West (FFTW) on NVIDIA GPUs.
The cuFFT Device Extensions (cuFFTDx) library enables you to perform Fast Fourier Transform (FFT) calculations inside your CUDA kernel. Fusing FFT with other operations can decrease the latency and improve the performance of your application.
The NVIDIA CUDA Random Number Generation (cuRAND) library provides an API for simple and efficient generation of high-quality pseudorandom and quasirandom numbers.
The cuSPARSE library contains a set of basic linear algebra subroutines used for handling sparse matrices. It’s implemented on the NVIDIA CUDA runtime and is designed to be called from C and C++.
The nvCOMP library provides fast lossless data compression and decompression using a GPU. It features generic compression interfaces to enable developers to use high-performance GPU compressors in their applications.
The cuTENSOR library is a first-of-its-kind, GPU-accelerated tensor linear algebra library, providing high-performance tensor contraction, reduction, and element-wise operations. cuTENSOR is used to accelerate applications in the areas of deep learning training and inference, computer vision, quantum chemistry, and computational physics.
NVIDIA Performance Primitives (NPP) is a library of functions for performing CUDA-accelerated 2D image and signal processing. This library is widely applicable for developers in these areas and is written to maximize flexibility while maintaining high performance.
The nvJPEG Library provides high-performance, GPU-accelerated JPEG encoding and decoding functionality. This library is intended for image formats commonly used in deep learning and hyperscale multimedia applications.
The nvJPEG2000 library provides high-performance, GPU-accelerated JPEG2000 decoding functionality. This library is intended for JPEG2000 formatted images commonly used in deep learning, medical imaging, remote sensing, and digital cinema applications.
The nvTIFF library accelerates the decoding and encoding of TIFF images compressed with LZW on NVIDIA GPUs. The library is built on the CUDA ® platform and is supported on Volta+ GPU architectures.
The cuSOLVER library is a high-level package based on cuBLAS and cuSPARSE libraries. It provides Linear Algebra Package (LAPACK)-like features such as common matrix factorization and triangular solve routines for dense matrices.
The cuPQC library enables you to execute Post-Quantum Cryptography (PQC) algorithms directly within your CUDA kernels. Fusing PQC operations with other calculations can reduce the latecy and improve the performance of your application.
The cuBLAS Device Extensions (cuBLASDx) library enables you to perform selected linear algebra functions known from cuBLAS inside your CUDA kernel. This is currently limited only to General Matrix Multiplication (GEMM). Fusing linear algebra routines with other operations can decrease the latency and improve the overall performance of your application.
NVIDIA cuBLASMp is a high-performance, multi-process, GPU-accelerated library for distributed basic dense linear algebra.
NVIDIA cuDSS (Preview) is a library of GPU-accelerated linear solvers with sparse matrices. It provides algorithms for solving linear systems of the following type: AX=B with a sparse matrix A, right-hand side B and unknown solution X (could be a matrix or a vector). The cuDSS functionality allows flexibility in matrix properties and solver configuration, as well as execution parameters like CUDA streams.
cuEquivariance is a Python library designed to facilitate the construction of high-performance equivariant neural networks using segmented tensor products. cuEquivariance provides a comprehensive API for describing segmented tensor products and optimized CUDA kernels for their execution. Additionally, cuEquivariance offers bindings for both PyTorch and JAX, ensuring broad compatibility and ease of integration.
NVIDIA cuQuantum SDK is a high-performance library for quantum information science and beyond.
The nvImageCodec is a library of accelerated codecs with unified interface. It is designed as a framework for extension modules which delivers codec plugins.
nvmath-python is a Python library to enable cutting edge performance, productivity, and interoperability within the Python computational ecosystem through NVIDIA’s high-performance math libraries.
NVIDIA cuSOLVERMp is a high-performance, distributed-memory, GPU-accelerated library that provides tools for solving dense linear systems and eigenvalue problems.
The cuSPARSELt library provides high-performance, structured, matrix-dense matrix multiplication functionality. cuSPARSELt allows users to exploit the computational resources of the latest NVIDIA GPUs.
NVIDIA GPUDirect Storage (GDS) enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU. This direct path increases system bandwidth and decreases the latency and utilization load on the CPU.
Find archived online documentation for CUDA Toolkit. These archives provide access to previously released CUDA documentation versions.