Contents
- 1. Getting Started
- 2. Use Command-line Options
- 3. Multicore CPU Optimization
- 3.1. Overview of Optimization
- 3.2. Getting Started with Optimization
- 3.3. Local and Global Optimization
- 3.4. Loop Unrolling using -Munroll
- 3.5. Vectorization using -Mvect
- 3.6. Interprocedural Analysis and Optimization using -Mipa
- 3.6.1. Building a Program Without IPA – Single Step
- 3.6.2. Building a Program Without IPA – Several Steps
- 3.6.3. Building a Program Without IPA Using Make
- 3.6.4. Building a Program with IPA
- 3.6.5. Building a Program with IPA – Single Step
- 3.6.6. Building a Program with IPA – Several Steps
- 3.6.7. Building a Program with IPA Using Make
- 3.6.8. Questions about IPA
- 4. Using Function Inlining
- 5. Using GPUs
- 5.1. Overview
- 5.2. Terminology
- 5.3. Execution Model
- 5.4. Memory Model
- 5.5. Fortran Pointers in Device Code
- 5.6. Calling Routines in a Compute Kernel
- 5.7. Supported Processors and GPUs
- 5.8. CUDA Versions
- 5.9. Compute Capability
- 5.10. PTX JIT Compilation
- 6. Using OpenACC
- 6.1. OpenACC Programming Model
- 6.2. Compiling an OpenACC Program
- 6.3. OpenACC for Multicore CPUs
- 6.4. OpenACC with CUDA Unified Memory
- 6.5. OpenACC Error Handling
- 6.6. OpenACC and CUDA Graphs
- 6.7. Host and Device Trip Count Options
- 6.8. Environment Variables
- 6.9. Profiling Accelerator Kernels
- 6.10. OpenACC Runtime Libraries
- 6.11. Supported Intrinsics
- 7. Using OpenMP
- 7.1. Environment Variables
- 7.2. Fallback Mode
- 7.3. Loop
- 7.4. OpenMP Subset
- 7.5. Using metadirective
- 7.6. Mapping target Constructs to CUDA Streams
- 7.7. Noncontiguous Array Sections
- 7.8. OpenMP with CUDA Unified Memory
- 7.9. Multiple Device Support
- 7.10. Interoperability with CUDA
- 7.11. Interoperability with Other OpenMP Compilers
- 7.12. GNU STL
- 8. Using Stdpar
- 8.1. GPU Memory Modes
- 8.2. Stdpar C++
- 8.2.1. Introduction to Stdpar C++
- 8.2.2. NVC++ Compiler Parallel Algorithms Support
- 8.2.3. Stdpar C++ Simple Example
- 8.2.4. OpenACC Implementation of Parallel Algorithms
- 8.2.5. Coding Guidelines for GPU-Accelerating Parallel Algorithms
- 8.2.6. NVC++ Experimental Features
- 8.2.7. Stdpar C++ Larger Example: LULESH
- 8.2.8. Interoperability with OpenACC
- 8.2.9. Getting Started with Parallel Algorithms for GPUs
- 8.3. Stdpar Fortran
- 9. PCAST
- 10. Using MPI
- 11. Creating and Using Libraries
- 11.1. Using builtin Math Functions in C++ and C
- 11.2. Using System Library Routines
- 11.3. Creating and Using Shared Object Files on Linux
- 11.4. Using LIB3F
- 11.5. LAPACK, BLAS and FFTs
- 11.6. Linking with ScaLAPACK
- 11.7. The C++ Standard Template Library
- 11.8. NVIDIA Performance Libraries (NVPL)
- 11.9. Linking with the nvmalloc Library
- 12. Environment Variables
- 13. Distributing Files – Deployment
- 14. Inter-language Calling
- 15. Programming Considerations for 64-Bit Environments
- 15.1. Data Types in the 64-Bit Environment
- 15.2. Large Static Data in Linux
- 15.3. Large Dynamically Allocated Data
- 15.4. 64-Bit Array Indexing
- 15.5. Compiler Options for 64-bit Programming
- 15.6. Practical Limitations of Large Array Programming
- 15.7. Medium Memory Model and Large Array in C
- 15.8. Medium Memory Model and Large Array in Fortran
- 15.9. Large Array and Small Memory Model in Fortran
- 16. C++ and C Inline Assembly and Intrinsics