Contents
- 1. Introduction
- 2. Programming Model
- 3. Programming Interface
  - 3.1. Compilation with NVCC
  - 3.2. CUDA Runtime
    - 3.2.1. Initialization
    - 3.2.2. Device Memory
    - 3.2.3. Device Memory L2 Access Management
      - 3.2.3.1. L2 Cache Set-Aside for Persisting Accesses
      - 3.2.3.2. L2 Policy for Persisting Accesses
      - 3.2.3.3. L2 Access Properties
      - 3.2.3.4. L2 Persistence Example
      - 3.2.3.5. Reset L2 Access to Normal
      - 3.2.3.6. Manage Utilization of L2 set-aside cache
      - 3.2.3.7. Query L2 cache Properties
      - 3.2.3.8. Control L2 Cache Set-Aside Size for Persisting Memory Access
    - 3.2.4. Shared Memory
    - 3.2.5. Distributed Shared Memory
    - 3.2.6. Page-Locked Host Memory
    - 3.2.7. Memory Synchronization Domains
    - 3.2.8. Asynchronous Concurrent Execution
      - 3.2.8.1. Concurrent Execution between Host and Device
      - 3.2.8.2. Concurrent Kernel Execution
      - 3.2.8.3. Overlap of Data Transfer and Kernel Execution
      - 3.2.8.4. Concurrent Data Transfers
      - 3.2.8.5. Streams
      - 3.2.8.6. Programmatic Dependent Launch and Synchronization
      - 3.2.8.7. CUDA Graphs
        - 3.2.8.7.1. Graph Structure
        - 3.2.8.7.2. Creating a Graph Using Graph APIs
        - 3.2.8.7.3. Creating a Graph Using Stream Capture
        - 3.2.8.7.4. CUDA User Objects
        - 3.2.8.7.5. Updating Instantiated Graphs
        - 3.2.8.7.6. Using Graph APIs
        - 3.2.8.7.7. Device Graph Launch
        - 3.2.8.7.8. Conditional Graph Nodes
      - 3.2.8.8. Events
      - 3.2.8.9. Synchronous Calls
    - 3.2.9. Multi-Device System
    - 3.2.10. Unified Virtual Address Space
    - 3.2.11. Interprocess Communication
    - 3.2.12. Error Checking
    - 3.2.13. Call Stack
    - 3.2.14. Texture and Surface Memory
    - 3.2.15. Graphics Interoperability
    - 3.2.16. External Resource Interoperability
      - 3.2.16.1. Vulkan Interoperability
        - 3.2.16.1.1. Matching device UUIDs
        - 3.2.16.1.2. Importing Memory Objects
        - 3.2.16.1.3. Mapping Buffers onto Imported Memory Objects
        - 3.2.16.1.4. Mapping Mipmapped Arrays onto Imported Memory Objects
        - 3.2.16.1.5. Importing Synchronization Objects
        - 3.2.16.1.6. Signaling/Waiting on Imported Synchronization Objects
      - 3.2.16.2. OpenGL Interoperability
      - 3.2.16.3. Direct3D 12 Interoperability
        - 3.2.16.3.1. Matching Device LUIDs
        - 3.2.16.3.2. Importing Memory Objects
        - 3.2.16.3.3. Mapping Buffers onto Imported Memory Objects
        - 3.2.16.3.4. Mapping Mipmapped Arrays onto Imported Memory Objects
        - 3.2.16.3.5. Importing Synchronization Objects
        - 3.2.16.3.6. Signaling/Waiting on Imported Synchronization Objects
      - 3.2.16.4. Direct3D 11 Interoperability
        - 3.2.16.4.1. Matching Device LUIDs
        - 3.2.16.4.2. Importing Memory Objects
        - 3.2.16.4.3. Mapping Buffers onto Imported Memory Objects
        - 3.2.16.4.4. Mapping Mipmapped Arrays onto Imported Memory Objects
        - 3.2.16.4.5. Importing Synchronization Objects
        - 3.2.16.4.6. Signaling/Waiting on Imported Synchronization Objects
      - 3.2.16.5. NVIDIA Software Communication Interface Interoperability (NVSCI)
  - 3.3. Versioning and Compatibility
  - 3.4. Compute Modes
  - 3.5. Mode Switches
  - 3.6. Tesla Compute Cluster Mode for Windows
- 4. Hardware Implementation
- 5. Performance Guidelines
- 6. CUDA-Enabled GPUs
- 7. C++ Language Extensions
  - 7.1. Function Execution Space Specifiers
  - 7.2. Variable Memory Space Specifiers
  - 7.3. Built-in Vector Types
  - 7.4. Built-in Variables
  - 7.5. Memory Fence Functions
  - 7.6. Synchronization Functions
  - 7.7. Mathematical Functions
  - 7.8. Texture Functions
    - 7.8.1. Texture Object API
      - 7.8.1.1. tex1Dfetch()
      - 7.8.1.2. tex1D()
      - 7.8.1.3. tex1DLod()
      - 7.8.1.4. tex1DGrad()
      - 7.8.1.5. tex2D()
      - 7.8.1.6. tex2D() for sparse CUDA arrays
      - 7.8.1.7. tex2Dgather()
      - 7.8.1.8. tex2Dgather() for sparse CUDA arrays
      - 7.8.1.9. tex2DGrad()
      - 7.8.1.10. tex2DGrad() for sparse CUDA arrays
      - 7.8.1.11. tex2DLod()
      - 7.8.1.12. tex2DLod() for sparse CUDA arrays
      - 7.8.1.13. tex3D()
      - 7.8.1.14. tex3D() for sparse CUDA arrays
      - 7.8.1.15. tex3DLod()
      - 7.8.1.16. tex3DLod() for sparse CUDA arrays
      - 7.8.1.17. tex3DGrad()
      - 7.8.1.18. tex3DGrad() for sparse CUDA arrays
      - 7.8.1.19. tex1DLayered()
      - 7.8.1.20. tex1DLayeredLod()
      - 7.8.1.21. tex1DLayeredGrad()
      - 7.8.1.22. tex2DLayered()
      - 7.8.1.23. tex2DLayered() for Sparse CUDA Arrays
      - 7.8.1.24. tex2DLayeredLod()
      - 7.8.1.25. tex2DLayeredLod() for sparse CUDA arrays
      - 7.8.1.26. tex2DLayeredGrad()
      - 7.8.1.27. tex2DLayeredGrad() for sparse CUDA arrays
      - 7.8.1.28. texCubemap()
      - 7.8.1.29. texCubemapGrad()
      - 7.8.1.30. texCubemapLod()
      - 7.8.1.31. texCubemapLayered()
      - 7.8.1.32. texCubemapLayeredGrad()
      - 7.8.1.33. texCubemapLayeredLod()
  - 7.9. Surface Functions
    - 7.9.1. Surface Object API
      - 7.9.1.1. surf1Dread()
      - 7.9.1.2. surf1Dwrite()
      - 7.9.1.3. surf2Dread()
      - 7.9.1.4. surf2Dwrite()
      - 7.9.1.5. surf3Dread()
      - 7.9.1.6. surf3Dwrite()
      - 7.9.1.7. surf1DLayeredread()
      - 7.9.1.8. surf1DLayeredwrite()
      - 7.9.1.9. surf2DLayeredread()
      - 7.9.1.10. surf2DLayeredwrite()
      - 7.9.1.11. surfCubemapread()
      - 7.9.1.12. surfCubemapwrite()
      - 7.9.1.13. surfCubemapLayeredread()
      - 7.9.1.14. surfCubemapLayeredwrite()
  - 7.10. Read-Only Data Cache Load Function
  - 7.11. Load Functions Using Cache Hints
  - 7.12. Store Functions Using Cache Hints
  - 7.13. Time Function
  - 7.14. Atomic Functions
  - 7.15. Address Space Predicate Functions
  - 7.16. Address Space Conversion Functions
  - 7.17. Alloca Function
  - 7.18. Compiler Optimization Hint Functions
  - 7.19. Warp Vote Functions
  - 7.20. Warp Match Functions
  - 7.21. Warp Reduce Functions
  - 7.22. Warp Shuffle Functions
  - 7.23. Nanosleep Function
  - 7.24. Warp Matrix Functions
  - 7.25. DPX
  - 7.26. Asynchronous Barrier
    - 7.26.1. Simple Synchronization Pattern
    - 7.26.2. Temporal Splitting and Five Stages of Synchronization
    - 7.26.3. Bootstrap Initialization, Expected Arrival Count, and Participation
    - 7.26.4. A Barrier’s Phase: Arrival, Countdown, Completion, and Reset
    - 7.26.5. Spatial Partitioning (also known as Warp Specialization)
    - 7.26.6. Early Exit (Dropping out of Participation)
    - 7.26.7. Completion Function
    - 7.26.8. Memory Barrier Primitives Interface
  - 7.27. Asynchronous Data Copies
  - 7.28. Asynchronous Data Copies using cuda::pipeline
  - 7.29. Asynchronous Data Copies using the Tensor Memory Accelerator (TMA)
  - 7.30. Encoding a Tensor Map on Device
  - 7.31. Profiler Counter Function
  - 7.32. Assertion
  - 7.33. Trap function
  - 7.34. Breakpoint Function
  - 7.35. Formatted Output
  - 7.36. Dynamic Global Memory Allocation and Operations
  - 7.37. Execution Configuration
  - 7.38. Launch Bounds
  - 7.39. Maximum Number of Registers per Thread
  - 7.40. #pragma unroll
  - 7.41. SIMD Video Instructions
  - 7.42. Diagnostic Pragmas
- 8. Cooperative Groups
  - 8.1. Introduction
  - 8.2. What’s New in Cooperative Groups
  - 8.3. Programming Model Concept
  - 8.4. Group Types
  - 8.5. Group Partitioning
  - 8.6. Group Collectives
  - 8.7. Grid Synchronization
  - 8.8. Multi-Device Synchronization
- 9. CUDA Dynamic Parallelism
  - 9.1. Introduction
  - 9.2. Execution Environment and Memory Model
  - 9.3. Programming Interface
    - 9.3.1. CUDA C++ Reference
    - 9.3.2. Device-side Launch from PTX
    - 9.3.3. Toolkit Support for Dynamic Parallelism
  - 9.4. Programming Guidelines
  - 9.5. CDP2 vs CDP1
  - 9.6. Legacy CUDA Dynamic Parallelism (CDP1)
    - 9.6.1. Execution Environment and Memory Model (CDP1)
    - 9.6.2. Programming Interface (CDP1)
      - 9.6.2.1. CUDA C++ Reference (CDP1)
      - 9.6.2.2. Device-side Launch from PTX (CDP1)
      - 9.6.2.3. Toolkit Support for Dynamic Parallelism (CDP1)
    - 9.6.3. Programming Guidelines (CDP1)
- 10. Virtual Memory Management
- 11. Stream Ordered Memory Allocator
  - 11.1. Introduction
  - 11.2. Query for Support
  - 11.3. API Fundamentals (cudaMallocAsync and cudaFreeAsync)
  - 11.4. Memory Pools and the cudaMemPool_t
  - 11.5. Default/Implicit Pools
  - 11.6. Explicit Pools
  - 11.7. Physical Page Caching Behavior
  - 11.8. Resource Usage Statistics
  - 11.9. Memory Reuse Policies
  - 11.10. Device Accessibility for Multi-GPU Support
  - 11.11. IPC Memory Pools
  - 11.12. Synchronization API Actions
  - 11.13. Addendums
- 12. Graph Memory Nodes
- 13. Mathematical Functions
- 14. C++ Language Support
  - 14.1. C++11 Language Features
  - 14.2. C++14 Language Features
  - 14.3. C++17 Language Features
  - 14.4. C++20 Language Features
  - 14.5. Restrictions
    - 14.5.1. Host Compiler Extensions
    - 14.5.2. Preprocessor Symbols
    - 14.5.3. Qualifiers
    - 14.5.4. Pointers
    - 14.5.5. Operators
    - 14.5.6. Run Time Type Information (RTTI)
    - 14.5.7. Exception Handling
    - 14.5.8. Standard Library
    - 14.5.9. Namespace Reservations
    - 14.5.10. Functions
      - 14.5.10.1. External Linkage
      - 14.5.10.2. Implicitly-declared and explicitly-defaulted functions
      - 14.5.10.3. Function Parameters
      - 14.5.10.4. Static Variables within Function
      - 14.5.10.5. Function Pointers
      - 14.5.10.6. Function Recursion
      - 14.5.10.7. Friend Functions
      - 14.5.10.8. Operator Function
      - 14.5.10.9. Allocation and Deallocation Functions
    - 14.5.11. Classes
    - 14.5.12. Templates
    - 14.5.13. Trigraphs and Digraphs
    - 14.5.14. Const-qualified variables
    - 14.5.15. Long Double
    - 14.5.16. Deprecation Annotation
    - 14.5.17. Noreturn Annotation
    - 14.5.18. [[likely]] / [[unlikely]] Standard Attributes
    - 14.5.19. const and pure GNU Attributes
    - 14.5.20. __nv_pure__ Attribute
    - 14.5.21. Intel Host Compiler Specific
    - 14.5.22. C++11 Features
      - 14.5.22.1. Lambda Expressions
      - 14.5.22.2. std::initializer_list
      - 14.5.22.3. Rvalue references
      - 14.5.22.4. Constexpr functions and function templates
      - 14.5.22.5. Constexpr variables
      - 14.5.22.6. Inline namespaces
      - 14.5.22.7. thread_local
      - 14.5.22.8. __global__ functions and function templates
      - 14.5.22.9. __managed__ and __shared__ variables
      - 14.5.22.10. Defaulted functions
    - 14.5.23. C++14 Features
    - 14.5.24. C++17 Features
    - 14.5.25. C++20 Features
  - 14.6. Polymorphic Function Wrappers
  - 14.7. Extended Lambdas
  - 14.8. Code Samples
- 15. Texture Fetching
- 16. Compute Capabilities
- 17. Driver API
  - 17.1. Context
  - 17.2. Module
  - 17.3. Kernel Execution
  - 17.4. Interoperability between Runtime and Driver APIs
  - 17.5. Driver Entry Point Access
    - 17.5.1. Introduction
    - 17.5.2. Driver Function Typedefs
    - 17.5.3. Driver Function Retrieval
    - 17.5.4. Potential Implications with cuGetProcAddress
      - 17.5.4.1. Implications with cuGetProcAddress vs Implicit Linking
      - 17.5.4.2. Compile Time vs Runtime Version Usage in cuGetProcAddress
      - 17.5.4.3. API Version Bumps with Explicit Version Checks
      - 17.5.4.4. Issues with Runtime API Usage
      - 17.5.4.5. Issues with Runtime API and Dynamic Versioning
      - 17.5.4.6. Issues with Runtime API allowing CUDA Version
      - 17.5.4.7. Implications to API/ABI
    - 17.5.5. Determining cuGetProcAddress Failure Reasons
- 18. CUDA Environment Variables
- 19. Unified Memory Programming
  - 19.1. Unified Memory Introduction
    - 19.1.1. System Requirements for Unified Memory
    - 19.1.2. Programming Model
      - 19.1.2.1. Allocation APIs for System-Allocated Memory
      - 19.1.2.2. Allocation API for CUDA Managed Memory: cudaMallocManaged()
      - 19.1.2.3. Global-Scope Managed Variables Using __managed__
      - 19.1.2.4. Difference between Unified Memory and Mapped Memory
      - 19.1.2.5. Pointer Attributes
      - 19.1.2.6. Runtime detection of Unified Memory Support Level
      - 19.1.2.7. GPU Memory Oversubscription
      - 19.1.2.8. Performance Hints
  - 19.2. Unified memory on devices with full CUDA Unified Memory support
    - 19.2.1. System-Allocated Memory: in-depth examples
    - 19.2.2. Performance Tuning
  - 19.3. Unified memory on devices without full CUDA Unified Memory support
    - 19.3.1. Unified memory on devices with only CUDA Managed Memory support
    - 19.3.2. Unified memory on Windows or devices with compute capability 5.x
      - 19.3.2.1. Data Migration and Coherency
      - 19.3.2.2. GPU Memory Oversubscription
      - 19.3.2.3. Multi-GPU
      - 19.3.2.4. Coherency and Concurrency
        - 19.3.2.4.1. GPU Exclusive Access To Managed Memory
        - 19.3.2.4.2. Explicit Synchronization and Logical GPU Activity
        - 19.3.2.4.3. Managing Data Visibility and Concurrent CPU + GPU Access with Streams
        - 19.3.2.4.4. Stream Association Examples
        - 19.3.2.4.5. Stream Attach With Multithreaded Host Programs
        - 19.3.2.4.6. Advanced Topic: Modular Programs and Data Access Constraints
        - 19.3.2.4.7. Memcpy()/Memset() Behavior With Stream-associated Unified Memory
- 20. Lazy Loading
- 21. Notices