Contents
- 1. Introduction
- 2. Programming Guide
- 2.1. CUDA Fortran Host and Device Code
- 2.2. CUDA Fortran Kernels
- 2.3. Thread Blocks
- 2.4. Memory Hierarchy
- 2.5. Subroutine / Function Qualifiers
- 2.6. Variable Qualifiers
- 2.7. Datatypes in Device Subprograms
- 2.8. Predefined Variables in Device Subprograms
- 2.9. Execution Configuration
- 2.10. Asynchronous Concurrent Execution
- 2.11. Kernel Loop Directive
- 2.12. Using Fortran Modules
- 2.13. CUDA Fortran Conditional Compilation
- 2.14. Building a CUDA Fortran Program
- 2.15. Managed and Unified Memory Options and Interoperability
- 3. Reference
- 3.1. New Subroutine and Function Attributes
- 3.2. Variable Attributes
- 3.3. Allocating Device Memory, Pinned Memory, and Managed Memory
- 3.3.1. Allocating Device Memory
- 3.3.2. Allocating Device Memory Using Runtime Routines
- 3.3.3. Allocate Pinned Memory
- 3.3.4. Allocating Managed Memory
- 3.3.5. Allocating Managed Memory Using Runtime Routines
- 3.3.6. Allocating Device Memory Asynchronously
- 3.3.7. Allocating Device Memory Asynchronously Using Runtime Routines
- 3.3.8. Controlling Device Data is Managed
- 3.4. Data transfer between host and device memory
- 3.5. Invoking a kernel subroutine
- 3.6. Device code
- 3.6.1. Datatypes Allowed
- 3.6.2. Built-in Variables
- 3.6.3. Fortran Intrinsics
- 3.6.4. Synchronization Functions
- 3.6.5. Warp-Vote Operations
- 3.6.6. Load and Store Functions Using Cache Hints
- 3.6.7. Load and Store Functions Using Bulk TMA Operations
- 3.6.8. Atomic Functions
- 3.6.9. Fortran I/O
- 3.6.10. PRINT Example
- 3.6.11. Shuffle Functions
- 3.6.12. Restrictions
- 3.7. Host code
- 3.8. Fortran Device Modules
- 3.9. Fortran Host Modules
- 3.9.1. Overloaded Fortran Reduction Intrinsics in GPU_REDUCTIONS and CUDAFOR
- 3.9.2. Fortran Sorting Subroutines Module
- 3.9.3. Overloaded Fortran Reduction Intrinsics in CUTENSOREX
- 3.9.4. Overloaded Fortran Array Intrinsics in CUTENSOREX
- 3.9.4.1. Fortran MERGE Intrinsic Function
- 3.9.4.2. Fortran PACK Intrinsic Function
- 3.9.4.3. Fortran PACKLOC Function
- 3.9.4.4. Fortran UNPACK Intrinsic Function
- 3.9.4.5. Fortran COUNT_PREFIX Intrinsic Function
- 3.9.4.6. Fortran SUM_PREFIX Intrinsic Function
- 3.9.4.7. Fortran RESHAPE Intrinsic Function
- 3.9.4.8. Fortran TRANSPOSE Intrinsic Function
- 3.9.4.9. Fortran SPREAD Intrinsic Function
- 3.9.4.10. Fortran MATMUL Intrinsic Function
- 3.9.4.11. Fortran DOT_PRODUCT Intrinsic Function
- 3.9.4.12. Fortran RANDOM_NUMBER Intrinsic Function
- 3.9.5. Other CUDA Library Host Modules
- 4. Runtime APIs
- 4.1. Initialization
- 4.2. Device Management
- 4.2.1. cudaChooseDevice
- 4.2.2. cudaDeviceGetAttribute
- 4.2.3. cudaDeviceGetCacheConfig
- 4.2.4. cudaDeviceGetLimit
- 4.2.5. cudaDeviceGetSharedMemConfig
- 4.2.6. cudaDeviceGetStreamPriorityRange
- 4.2.7. cudaDeviceReset
- 4.2.8. cudaDeviceSetCacheConfig
- 4.2.9. cudaDeviceSetLimit
- 4.2.10. cudaDeviceSetSharedMemConfig
- 4.2.11. cudaDeviceSynchronize
- 4.2.12. cudaGetDevice
- 4.2.13. cudaGetDeviceCount
- 4.2.14. cudaGetDeviceProperties
- 4.2.15. cudaSetDevice
- 4.2.16. cudaSetDeviceFlags
- 4.2.17. cudaSetValidDevices
- 4.3. Thread Management
- 4.4. Error Handling
- 4.5. Stream Management
- 4.5.1. cudaforGetDefaultStream
- 4.5.2. cudaforSetDefaultStream
- 4.5.3. cudaStreamAttachMemAsync
- 4.5.4. cudaStreamCreate
- 4.5.5. cudaStreamCreateWithFlags
- 4.5.6. cudaStreamCreateWithPriority
- 4.5.7. cudaStreamDestroy
- 4.5.8. cudaStreamGetPriority
- 4.5.9. cudaStreamQuery
- 4.5.10. cudaStreamSynchronize
- 4.5.11. cudaStreamWaitEvent
- 4.6. Event Management
- 4.7. Execution Control
- 4.8. Occupancy
- 4.9. Memory Management
- 4.9.1. cudaFree
- 4.9.2. cudaFreeArray
- 4.9.3. cudaFreeAsync
- 4.9.4. cudaFreeHost
- 4.9.5. cudaGetSymbolAddress
- 4.9.6. cudaGetSymbolSize
- 4.9.7. cudaHostAlloc
- 4.9.8. cudaHostGetDevicePointer
- 4.9.9. cudaHostGetFlags
- 4.9.10. cudaHostRegister
- 4.9.11. cudaHostUnregister
- 4.9.12. cudaMalloc
- 4.9.13. cudaMallocArray
- 4.9.14. cudaMallocAsync
- 4.9.15. cudaMallocManaged
- 4.9.16. cudaMallocPitch
- 4.9.17. cudaMalloc3D
- 4.9.18. cudaMalloc3DArray
- 4.9.19. cudaMemAdvise
- 4.9.20. cudaMemcpy
- 4.9.21. cudaMemcpyArrayToArray
- 4.9.22. cudaMemcpyAsync
- 4.9.23. cudaMemcpyFromArray
- 4.9.24. cudaMemcpyFromSymbol
- 4.9.25. cudaMemcpyFromSymbolAsync
- 4.9.26. cudaMemcpyPeer
- 4.9.27. cudaMemcpyPeerAsync
- 4.9.28. cudaMemcpyToArray
- 4.9.29. cudaMemcpyToSymbol
- 4.9.30. cudaMemcpyToSymbolAsync
- 4.9.31. cudaMemcpy2D
- 4.9.32. cudaMemcpy2DArrayToArray
- 4.9.33. cudaMemcpy2DAsync
- 4.9.34. cudaMemcpy2DFromArray
- 4.9.35. cudaMemcpy2DToArray
- 4.9.36. cudaMemcpy3D
- 4.9.37. cudaMemcpy3DAsync
- 4.9.38. cudaMemGetInfo
- 4.9.39. cudaMemPrefetchAsync
- 4.9.40. cudaMemset
- 4.9.41. cudaMemsetAsync
- 4.9.42. cudaMemset2D
- 4.9.43. cudaMemset3D
- 4.10. Unified Addressing and Peer Device Memory Access
- 4.11. Version Management
- 4.12. Profiling Management
- 4.13. CUDA Graph Management
- 5. Examples