This guide provides a detailed discussion of the CUDA programming model and programming interface. It then describes the hardware
implementation and provides guidance on how to achieve maximum performance. The appendixes list all CUDA-enabled devices and
provide a detailed description of all extensions to the C language, listings of supported mathematical functions, the C++ features
supported in host and device code, details on texture fetching, and technical specifications of various devices. They conclude
by introducing the low-level driver API.
This guide presents established
parallelization and optimization techniques and explains coding
metaphors and idioms that can greatly simplify programming for
CUDA-capable GPU architectures. The intent is to provide guidelines for
obtaining the best performance from NVIDIA GPUs using the CUDA
This application note is intended to help developers ensure that their NVIDIA CUDA applications will run effectively on GPUs
based on the NVIDIA Kepler Architecture. This document provides guidance to ensure that your software applications are compatible
Kepler is NVIDIA's next-generation architecture for CUDA compute applications. Applications that follow the best practices
for the Fermi architecture should typically see speedups on the Kepler architecture without any code changes. This guide summarizes
the ways that an application can be fine-tuned to gain additional speedups by leveraging Kepler architectural features.
This guide provides detailed instructions on the
use of PTX, a low-level parallel thread execution virtual machine and
instruction set architecture (ISA). PTX exposes the GPU as a
data-parallel computing device.
This document shows how to inline PTX (parallel
thread execution) assembly language statements into CUDA code. It
describes the available assembler statement parameters and constraints,
and lists some pitfalls that you may
The CUBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA CUDA runtime. It allows
the user to access the computational resources of the NVIDIA Graphics Processing Unit (GPU), but does not auto-parallelize across
NVIDIA NPP is a library of functions for performing CUDA-accelerated
processing. The initial set of functionality in the library focuses on
imaging and video processing and is widely applicable for developers in
these areas. NPP will evolve over time to encompass more of the
compute-heavy tasks in a variety of problem domains. The NPP library is
written to maximize flexibility while maintaining high performance.
This document contains a complete listing of the code samples that are
included with the NVIDIA CUDA Toolkit. It describes each code sample,
lists the minimum GPU specification, and provides links to the source
code and white papers if available.
This document is a reference guide on the use of the CUDA compiler driver nvcc. Instead of being a specific CUDA compilation
driver, nvcc mimics the behavior of the GNU compiler gcc, accepting a range of conventional compiler options, such as for
defining macros and include/library paths, and for steering the compilation process.
CUDA-GDB is the NVIDIA tool for debugging CUDA applications running on Linux and Mac. It provides developers with a mechanism
for debugging CUDA applications running on actual hardware. CUDA-GDB is an extension to the x86-64 port of GDB, the GNU Project debugger.
CUDA-MEMCHECK is a suite of runtime tools capable of precisely detecting
out-of-bounds and misaligned memory access errors, checking device
allocation leaks, reporting hardware errors, and identifying shared memory data
A number of issues related to floating-point accuracy and compliance are
a frequent source of confusion on both CPUs and GPUs. The purpose of this
white paper is to discuss the most common issues related to NVIDIA GPUs
and to supplement the documentation in the CUDA C Programming Guide.
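One well-known source of such confusion is that floating-point addition is not associative, so a parallel reduction that sums values in a different order than a serial loop can return a different result. The following host-only C++ sketch illustrates the effect. It is not taken from the white paper; the function names `sum_left` and `sum_right` are hypothetical, and it assumes IEEE 754 single precision with no fast-math style reassociation by the compiler.

```cpp
// Illustration only: the same three values summed in two different orders.
// Function names are hypothetical and not part of any NVIDIA API.

// Evaluates (a + b) + c, the order a serial left-to-right loop would use.
float sum_left(float a, float b, float c) {
    return (a + b) + c;
}

// Evaluates a + (b + c), an order a tree-style reduction might use.
float sum_right(float a, float b, float c) {
    return a + (b + c);
}
```

Under those assumptions, calling both functions with `1e20f, -1e20f, 1.0f` gives 1.0f from `sum_left` and 0.0f from `sum_right`: in the second ordering the 1.0f is absorbed when it is added to -1e20f, because it is far smaller than the spacing between representable values at that magnitude.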
In this white paper we show how to use the
CUSPARSE and CUBLAS libraries to achieve a 2x speedup over CPU in the
incomplete-LU and Cholesky preconditioned iterative methods. We focus on
the Bi-Conjugate Gradient Stabilized and Conjugate Gradient iterative
methods, which can be used to solve large sparse nonsymmetric and
symmetric positive definite linear systems, respectively. We also
comment on the parallel sparse triangular solve, which is an essential
building block in these algorithms.
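For orientation, the Conjugate Gradient iteration itself can be sketched in a few lines of host-only C++. This is a minimal illustration under stated assumptions, not the white paper's implementation: it is unpreconditioned, runs on the CPU, stores the matrix densely, and does not call CUSPARSE or CUBLAS. The names `Vec`, `Mat`, `dot`, `matvec`, and `conjugate_gradient` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense, row by row; a real solver would use a sparse format

// Inner product of two vectors of equal length.
static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Dense matrix-vector product y = A * x.
static Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i) y[i] = dot(A[i], x);
    return y;
}

// Unpreconditioned Conjugate Gradient for a symmetric positive definite A.
// Starts from x = 0 and stops when the residual norm drops below tol
// or after max_iter iterations.
Vec conjugate_gradient(const Mat& A, const Vec& b,
                       double tol = 1e-10, int max_iter = 1000) {
    assert(A.size() == b.size());
    Vec x(b.size(), 0.0);
    Vec r = b;              // residual b - A*x, with x = 0
    Vec p = r;              // initial search direction
    double rr = dot(r, r);
    for (int k = 0; k < max_iter && std::sqrt(rr) > tol; ++k) {
        Vec Ap = matvec(A, p);
        double alpha = rr / dot(p, Ap);          // step length along p
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;               // makes the new direction A-conjugate
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return x;
}
```

As a usage example, solving the 2x2 system with rows (4, 1) and (1, 3) and right-hand side (1, 2) yields approximately (1/11, 7/11). In a preconditioned variant, each iteration additionally applies the preconditioner to the residual, which for incomplete-LU and Cholesky factors amounts to the sparse triangular solves mentioned above.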
NVVM IR is a compiler IR (intermediate
representation) based on the LLVM IR. The NVVM IR is designed to
represent GPU compute kernels (for example, CUDA kernels). High-level
language front-ends, like the CUDA C compiler front-end, can generate
RDMA for GPUDirect is a tool for Kepler-class GPUs and CUDA 5.0
that enables a direct path for communication between the GPU and a peer
device on the PCI Express bus when the devices share the same upstream
root complex, using standard features of PCI Express. This document
introduces the technology and describes the steps necessary to enable an
RDMA for GPUDirect connection to NVIDIA GPUs within the Linux device