Preface

This guide is part of a set of manuals that describe how to use the NVIDIA HPC Fortran, C++ and C compilers. These compilers include the NVFORTRAN, NVC++ and NVC compilers. They work in conjunction with an assembler, linker, libraries and header files on your target system, and include a CUDA toolchain, libraries and header files for GPU computing. You can use the NVIDIA HPC compilers to develop, optimize and parallelize applications for NVIDIA GPUs and x86-64, OpenPOWER and Arm Server multicore CPUs.

The NVIDIA HPC Compilers User’s Guide provides operating instructions for the NVIDIA HPC compilers command-level development environment. The NVIDIA HPC Compilers Reference Manual contains details concerning the NVIDIA compilers' interpretation of the Fortran, C++ and C language standards, implementation of language extensions, and command-level compilation. Users are expected to have previous experience with or knowledge of the Fortran, C++ and C programming languages. These guides do not teach the Fortran, C++ or C programming languages.

Audience Description

This manual is intended for scientists and engineers using the NVIDIA HPC compilers. To use these compilers, you should be aware of the role of high-level languages, such as Fortran, C++ and C as well as parallel programming models such as CUDA, OpenACC and OpenMP in the software development process, and you should have some level of understanding of programming. The NVIDIA HPC compilers are available on a variety of NVIDIA GPUs and x86-64, OpenPOWER and Arm CPU-based platforms and operating systems. You need to be familiar with the basic commands available on your system.

Compatibility and Conformance to Standards

Your system needs to be running a properly installed and configured version of the NVIDIA HPC compilers. For information on installing NVIDIA HPC compilers, refer to the Release Notes and Installation Guide included with your software.

For further information, refer to the following:

  • American National Standard Programming Language FORTRAN, ANSI X3. -1978 (1978).
  • ISO/IEC 1539-1 : 1991, Information technology – Programming Languages – Fortran, Geneva, 1991 (Fortran 90).
  • ISO/IEC 1539-1 : 1997, Information technology – Programming Languages – Fortran, Geneva, 1997 (Fortran 95).
  • ISO/IEC 1539-1 : 2004, Information technology – Programming Languages – Fortran, Geneva, 2004 (Fortran 2003).
  • ISO/IEC 1539-1 : 2010, Information technology – Programming Languages – Fortran, Geneva, 2010 (Fortran 2008).
  • ISO/IEC 1539-1 : 2018, Information technology – Programming Languages – Fortran, Geneva, 2018 (Fortran 2018).
  • Fortran 95 Handbook Complete ISO/ANSI Reference, Adams et al, The MIT Press, Cambridge, Mass, 1997.
  • The Fortran 2003 Handbook, Adams et al, Springer, 2009.
  • OpenACC Application Program Interface, Version 2.7, November 2018, http://www.openacc.org.
  • OpenMP Application Program Interface, Version 5.0, November 2018, http://www.openmp.org.
  • Programming in VAX Fortran, Version 4.0, Digital Equipment Corporation (September, 1984).
  • IBM VS Fortran, IBM Corporation, Rev. GC26-4119.
  • Military Standard, Fortran, DOD Supplement to American National Standard Programming Language Fortran, ANSI x.3-1978, MIL-STD-1753 (November 9, 1978).
  • American National Standard Programming Language C, ANSI X3.159-1989.
  • ISO/IEC 9899:1990, Information technology – Programming Languages – C, Geneva, 1990 (C90).
  • ISO/IEC 9899:1999, Information technology – Programming Languages – C, Geneva, 1999 (C99).
  • ISO/IEC 9899:2011, Information Technology – Programming Languages – C, Geneva, 2011 (C11).
  • ISO/IEC 14882:2011, Information Technology – Programming Languages – C++, Geneva, 2011 (C++11).
  • ISO/IEC 14882:2014, Information Technology – Programming Languages – C++, Geneva, 2014 (C++14).
  • ISO/IEC 14882:2017, Information Technology – Programming Languages – C++, Geneva, 2017 (C++17).

Organization

This guide contains the essential information on how to use the NVIDIA HPC compilers and is divided into these sections:

Getting Started provides an introduction to the NVIDIA HPC compilers and describes their use and overall features.

Use Command-line Options provides an overview of the command-line options as well as task-related lists of options.

Multicore CPU Optimization describes multicore CPU optimizations and related compiler options.

Using Function Inlining describes how to use function inlining and shows how to create an inline library.

Using OpenMP describes how to use OpenMP for multicore CPU programming.

Using OpenACC describes how to use an NVIDIA GPU and gives an introduction to using OpenACC.

Using Stdpar describes how to use C++/Fortran Standard Language Parallelism for programming an NVIDIA GPU or multicore CPU.

PCAST describes how to use the Parallel Compiler Assisted Testing features of the HPC Compilers.

Using MPI describes how to use MPI with the NVIDIA HPC compilers.

Creating and Using Libraries discusses NVIDIA HPC compiler support libraries, shared object files, and environment variables that affect the behavior of the compilers.

Environment Variables describes the environment variables that affect the behavior of the NVIDIA HPC compilers.

Distributing Files – Deployment describes the deployment of your files once you have built, debugged and compiled them successfully.

Inter-language Calling provides examples showing how to place C language calls in a Fortran program and Fortran language calls in a C program.

Programming Considerations for 64-Bit Environments discusses issues of which programmers should be aware when targeting 64-bit processors.

C++ and C Inline Assembly and Intrinsics describes how to use inline assembly code in C++ and C programs, as well as how to use intrinsic functions that map directly to assembly machine instructions.

Hardware and Software Constraints

This guide describes versions of the NVIDIA HPC compilers that target NVIDIA GPUs and x86-64, OpenPOWER and Arm CPUs. Details concerning environment-specific values and defaults and system-specific features or limitations are presented in the release notes delivered with the NVIDIA HPC compilers.

Conventions

This guide uses the following conventions:

italic
is used for emphasis.
Constant Width
is used for filenames, directories, arguments, options, examples, and for language statements in the text, including assembly language statements.
Bold
is used for commands.
[ item1 ]
in general, square brackets indicate optional items. In this case item1 is optional. In the context of p/t-sets, square brackets are required to specify a p/t-set.
{ item2 | item 3 }
braces indicate that a selection is required. In this case, you must select either item2 or item3.
filename ...
ellipsis indicate a repetition. Zero or more of the preceding item may occur. In this example, multiple filenames are allowed.
FORTRAN
Fortran language statements are shown in the text of this guide using a reduced fixed point size.
C++ and C
C++ and C language statements are shown in the test of this guide using a reduced fixed point size.

Terms

A number of terms related to systems, processors, compilers and tools are used throughout this guide. For example:

accelerator FMA -mcmodel=medium shared library
AVX host -mcmodel=small SIMD
CUDA hyperthreading (HT) MPI SSE
device large arrays MPICH static linking
driver linux86-64 NUMA x86-64
DWARF LLVM OpenPOWER Arm
dynamic library multicore ppc64le Aarch64

The following table lists the NVIDIA HPC compilers and their corresponding commands:

Table 1. NVIDIA HPC Compilers and Commands
Compiler or Tool Language or Function Command
NVFORTRAN ISO/ANSI Fortran 2003 nvfortran
NVC++ ISO/ANSI C++17 with GNU compatibility nvc++
NVC ISO/ANSI C11 nvc

In general, the designation NVFORTRAN is used to refer to the NVIDIA Fortran compiler, and nvfortran is used to refer to the command that invokes the compiler. A similar convention is used for each of the NVIDIA HPC compilers.

For simplicity, examples of command-line invocation of the compilers generally reference the nvfortran command, and most source code examples are written in Fortran. Use of NVC⁠+⁠+ and NVC is consistent with NVFORTRAN, though there are command-line options and features of these compilers that do not apply to NVFORTRAN, and vice versa.

There are a wide variety of x86-64 CPUs in use. Most of these CPUs are forward-compatible, but not backward-compatible, meaning that code compiled to target a given processor will not necessarily execute correctly on a previous-generation processor.

A table listing the processor options that NVIDIA HPC compilers support is available in the Release Notes. The table also includes the features utilized by the compilers that distinguish them from a compatibility standpoint.

In this manual, the convention is to use "x86-64" to specify the group of CPUs that are x86-compatible, 64-bit enabled, and run a 64-bit operating system. x86-64 processors can differ in terms of their support for various prefetch, SSE and AVX instructions. Where such distinctions are important with respect to a given compiler option or feature, it is explicitly noted in this manual.

1. Getting Started

This section describes how to use the NVIDIA HPC compilers.

1.1. Overview

The command used to invoke a compiler, such as the nvfortran command, is called a compiler driver. The compiler driver controls the following phases of compilation: preprocessing, compiling, assembling, and linking. Once a file is compiled and an executable file is produced, you can execute, debug, or profile the program on your system.

In general, using an NVIDIA HPC compiler involves three steps:

  1. Produce program source code in a file containing a .f extension or another appropriate extension, as described in Input Files. This program may be one that you have written or one that you are modifying.
  2. Compile the program using the appropriate compiler command.
  3. Execute, debug, or profile the executable file on your system.

You might also want to deploy your application, though this is not a required step.

The NVIDIA HPC compilers allow many variations on these general program development steps. These variations include the following:

  • Stop the compilation after preprocessing, compiling or assembling to save and examine intermediate results.
  • Provide options to the driver that control compiler optimization or that specify various features or limitations.
  • Include as input intermediate files such as preprocessor output, compiler output, or assembler output.

1.2. Creating an Example

Let's look at a simple example of using the NVIDIA Fortran compiler to create, compile, and execute a program that prints:

hello
  1. Create your program. For this example, suppose you enter the following simple Fortran program in the file hello.f:
    print *, "hello"
    end 
  2. Compile the program. When you created your program, you called it hello.f. In this example, we compile it from a shell command prompt using the default nvfortran driver option. Use the following syntax:
    $ nvfortran hello.f 

    By default, the executable output is placed in the file a.out. However, you can specify an output file name by using the -⁠o option.

    To place the executable output in the file hello, use this command:

    $ nvfortran -o hello hello.f 
  3. Execute the program. To execute the resulting hello program, simply type the filename at the command prompt and press the Return or Enter key on your keyboard:
    $ hello
    Below is the expected output:
    hello

1.3. Invoking the Command-level NVIDIA HPC Compilers

To translate and link a Fortran, C, or C⁠+⁠+ program, the nvfortran, nvc and nvc⁠+⁠+ commands do the following:

  1. Preprocess the source text file.
  2. Check the syntax of the source text.
  3. Generate an assembly language file.
  4. Pass control to the subsequent assembly and linking steps.

1.3.1. Command-line Syntax

The compiler command-line syntax, using nvfortran as an example, is:

nvfortran [options] [path]filename [...]
Where:
options
is one or more command-line options, all of which are described in detail in Use Command-line Options.
path
is the pathname to the directory containing the file named by filename. If you do not specify the path for a filename, the compiler uses the current directory. You must specify the path separately for each filename not in the current directory.
filename
is the name of a source file, preprocessed source file, assembly-language file, object file, or library to be processed by the compilation system. You can specify more than one [path]filename.

1.3.2. Command-line Options

The command-line options control various aspects of the compilation process. For a complete alphabetical listing and a description of all the command-line options, refer to Use Command-Line Options.

The following list provides important information about proper use of command-line options.
  • Command-line options and their arguments are case sensitive.
  • The compiler drivers recognize characters preceded by a hyphen (-⁠) as command-line options. For example, the -⁠Mlist option specifies that the compiler creates a listing file.

    Note: The convention for the text of this manual is to show command-line options using a dash instead of a hyphen; for example, you see -⁠Mlist.
  • The order of options and the filename is flexible. That is, you can place options before and after the filename argument on the command line. However, the placement of some options is significant, such as the -⁠l option, in which the order of the filenames determines the search order.

    Note: If two or more options contradict each other, the last one in the command line takes precedence.
  • You may write linker options into a text file prefixed with the '@' symbol, e.g. @file, and pass that file to the compiler as an option. The contents of @file are passed to the linker.

    $ echo "foo.o bar.o" > ./option_file.rsp                                          
    $ nvc++ @./option_files.rsp
                                            

    The above will pass "foo.o bar.o" to the compiler as linker arguments.

1.4. Filename Conventions

The NVIDIA HPC compilers use the filenames that you specify on the command line to find and to create input and output files. This section describes the input and output filename conventions for the phases of the compilation process.

1.4.1. Input Files

You can specify assembly-language files, preprocessed source files, Fortran/C/C++ source files, object files, and libraries as inputs on the command line. The compiler driver determines the type of each input file by examining the filename extensions.

The drivers use the following conventions:

filename.f
indicates a Fortran source file.
filename.F
indicates a Fortran source file that can contain macros and preprocessor directives (to be preprocessed).
filename.FOR
indicates a Fortran source file that can contain macros and preprocessor directives (to be preprocessed).
filename.F90
indicates a Fortran 90/95 source file that can contain macros and preprocessor directives (to be preprocessed).
filename.F95
indicates a Fortran 90/95 source file that can contain macros and preprocessor directives (to be preprocessed).
filename.f90
indicates a Fortran 90/95 source file that is in freeform format.
filename.f95
indicates a Fortran 90/95 source file that is in freeform format.
filename.cuf
indicates a Fortran 90/95 source file in free format with CUDA Fortran extensions.
filename.CUF
indicates a Fortran 90/95 source file in free format with CUDA Fortran extensions and that can contain macros and preprocessor directives (to be preprocessed).
filename.c
indicates a C source file that can contain macros and preprocessor directives (to be preprocessed).
filename.C
indicates a C++ source file that can contain macros and preprocessor directives (to be preprocessed).
filename.i
indicates a preprocessed C or C++ source file.
filename.cc
indicates a C++ source file that can contain macros and preprocessor directives (to be preprocessed).
filename.cpp
indicates a C++ source file that can contain macros and preprocessor directives (to be preprocessed).
filename.s
indicates an assembly-language file.
filename.o
(Linux) indicates an object file.
filename.a
(Linux) indicates a library of object files.
filename.so
(Linux only) indicates a library of shared object files.

The driver passes files with .s extensions to the assembler and files with .o, .so and .a extensions to the linker. Input files with unrecognized extensions, or no extension, are also passed to the linker.

Files with a .F (Capital F) or .FOR suffix are first preprocessed by the Fortran compilers and the output is passed to the compilation phase. The Fortran preprocessor functions like cpp for C programs, but is built in to the Fortran compilers rather than implemented through an invocation of cpp. This design ensures consistency in the preprocessing step regardless of the type or revision of operating system under which you are compiling.

Any input files not needed for a particular phase of processing are not processed. For example, if on the command line you specify an assembly-language file (filename.s) and the -⁠S option to stop before the assembly phase, the compiler takes no action on the assembly language file. Processing stops after compilation and the assembler does not run. In this scenario, the compilation must have been completed in a previous pass which created the .s file. For a complete description of the -⁠S option, refer to Output Files.

In addition to specifying primary input files on the command line, code within other files can be compiled as part of include files using the INCLUDE statement in a Fortran source file or the preprocessor #include directive in Fortran source files that use a .F extension or C++ and C source files.

When linking a program with a library, the linker extracts only those library components that the program needs. The compiler drivers link in several libraries by default. For more information about libraries, refer to Create and Use Libraries.

1.4.2. Output Files

By default, an executable output file produced by one of the NVIDIA HPC compilers is placed in the file a.out. As the Hello example shows, you can use the -⁠o option to specify the output file name.

If you use option -⁠F (Fortran only), -⁠P (C/C++ only), -⁠S or -⁠c, the compiler produces a file containing the output of the last completed phase for each input file, as specified by the option supplied.

The output file is a preprocessed source file, an assembly-language file, or an unlinked object file respectively. Similarly, the -⁠E option does not produce a file, but displays the preprocessed source file on the standard output. Using any of these options, the -⁠o option is valid only if you specify a single input file. If no errors occur during processing, you can use the files created by these options as input to a future invocation of any of the NVIDIA compiler drivers.

The following table lists the stop-after options and the output files that the compilers create when you use these options. It also indicates the accepted input files.

Table 2. Option Descriptions
Option Stop After Input Output
-E preprocessing Source files preprocessed file to standard out
-F preprocessing Source files. This option is not valid for nvc or nvc⁠+⁠+. preprocessed file (.f)
-P preprocessing Source files. This option is not valid for nvfortran. preprocessed file (.i)
-S compilation Source files or preprocessed files assembly-language file (.s)
-c assembly Source files, or preprocessed files, or assembly-language files unlinked object file (.o or .obj)
none linking Source files, or preprocessed files, assembly-language files, object files, or libraries executable file (a.out)

If you specify multiple input files or do not specify an object filename, the compiler uses the input filenames to derive corresponding default output filenames of the following form, where filename is the input filename without its extension:

filename.f
indicates a preprocessed file, if you compiled a Fortran file using the -⁠F option.
filename.i
indicates a preprocessed file, if you compiled using the -⁠P option.
filename.lst
indicates a listing file from the -⁠Mlist option.
filename.o or filename.obj
indicates a object file from the -⁠c option.
filename.s
indicates an assembly-language file from the -⁠S option.
Note: Unless you specify otherwise, the destination directory for any output file is the current working directory. If the file exists in the destination directory, the compiler overwrites it.

The following example demonstrates the use of output filename extensions.

$ nvfortran -c proto.f proto1.F 

This produces the output files proto.o and proto1.o, which are binary object files. Prior to compilation, the file proto1.F is preprocessed because it has a .F filename extension.

1.5. Fortran, C++ and C Data Types

The NVIDIA Fortran, C++ and C compilers recognize scalar and aggregate data types. A scalar data type holds a single value, such as the integer value 42 or the real value 112.6. An aggregate data type consists of one or more scalar data type objects, such as an array of integer values.

1.6. Platform-specific considerations

The NVIDIA HPC Compilers are supported on x86-64, OpenPOWER and 64-bit Arm multicore CPUs running Linux.

1.6.1. Using the NVIDIA HPC Compilers on Linux

Linux Header Files

The Linux system header files contain many GNU gcc extensions. The NVIDIA HPC C++ and C compilers support many of these extensions and can compile most programs that the GNU compilers can compile. A few header files not interoperable with the NVIDIA compilers have been rewritten.

If you are using the NVIDIA HPC C++ or C compilers, please make sure that the supplied versions of these include files are found before the system versions. This hierarchy happens by default unless you explicitly add a -⁠I option that references one of the system include directories.

1.7. Site-Specific Customization of the Compilers

If you are using the NVIDIA HPC Compilers and want all your users to have access to specific libraries or other files, there are special files that allow you to customize the compilers for your site.

1.7.1. Use siterc Files

The NVIDIA HPC Compiler command-level drivers utilize a file named siterc to enable site-specific customization of the behavior of the NVIDIA compilers. The siterc file is located in the bin subdirectory of the NVIDIA HPC Compilers installation directory. Using siterc, you can control how the compiler drivers invoke the various components in the compilation tool chain.

1.7.2. Using User rc Files

In addition to the siterc file, user rc files can reside in a given user's home directory, as specified by the user's HOME environment variable. You can use these files to control the respective NVIDIA HPC Compilers. All of these files are optional.

On Linux, these files are named .mynvfortranrc,.mynvcrc, and .mynvc++rc.

The following examples show how you can use these rc files to tailor a given installation for a particular purpose on Linux_x86_64 targets. The process is similar with obvious substitutions for ppc64le and aarch64 targets.

Table 3. Examples of Using siterc and User rc Files
To do this... Add the line shown to the indicated file(s)
Make available to all linux compilations the libraries found in /⁠opt/newlibs/64 set SITELIB=/opt/newlibs/64; to /opt/nv/Linux_x86_64/24.11/compilers/bin/siterc
Add to all linux compilations a new library path: /opt/local/fast append SITELIB=/opt/local/fast; to /opt/nv/Linux_x86_64/24.11/compilers/bin/siterc
With linux compilations, change -⁠Mmpi to link in /opt/mympi/64/libmpix.a set MPILIBDIR=/opt/mympi/64;set MPILIBNAME=mpix; to /opt/nv/Linux_x86_64/24.11/compilers/bin/siterc
Build a Fortran executable for linux that resolves shared objects in the relative directory ./REDIST set RPATH=./REDIST; to ~/.mynvfortranrc

1.8. Common Development Tasks

Now that you have a brief introduction to the compiler, let's look at some common development tasks that you might wish to perform.

  • When you compile code you can specify a number of options on the command line that define specific characteristics related to how the program is compiled and linked, typically enhancing or overriding the default behavior of the compiler. For a list of the most common command line options and information on all the command line options, refer to Use Command-line Options.
  • Code optimization for multicore CPUs allows the compiler to organize your code for efficient execution. While possibly increasing compilation time and making the code more difficult to debug, these techniques typically produce code that runs significantly faster than code that does not use them. For more information on optimization refer to Multicore CPU Optimization.
  • Function inlining, a special type of optimization, replaces a call to a function or a subroutine with the body of the function or subroutine. This process can speed up execution by eliminating parameter passing and the function or subroutine call and return overhead. In addition, function inlining allows the compiler to optimize the function with the rest of the code. However, function inlining may also result in much larger code size with no increase in execution speed. For more information on function inlining, refer to Using Function Inlining.
  • A library is a collection of functions or subprograms used to develop software. Libraries contain "helper" code and data, which provide services to independent programs, allowing code and data to be shared and changed in a modular fashion. The functions and programs in a library are grouped for ease of use and linking. When creating your programs, it is often useful to incorporate standard libraries or proprietary ones. For more information on this topic, refer to Creating and Using Libraries.
  • Environment variables define a set of dynamic values that can affect the way running processes behave on a computer. It is often useful to use these variables to set and pass information that alters the default behavior of the NVIDIA HPC Compilers and the executables which they generate. For more information on these variables, refer to Environment Variables.
  • Deployment, though possibly an infrequent task, can present some unique issues related to concerns of porting the code to other systems. Deployment, in this context, involves distribution of a specific file or set of files that are already compiled and configured. The distribution must occur in such a way that the application executes accurately on another system which may not be configured exactly the same as the system on which the code was created. For more information on what you might need to know to successfully deploy your code, refer to Distributing Files – Deployment.
  • An intrinsic is a function available in a given language whose implementation is handled specially by the compiler. Intrinsics make using processor-specific enhancements easier because they provide a C++ and C language interface to assembly instructions. In doing so, the compiler manages details that the user would normally have to be concerned with, such as register names, register allocations, and memory locations of data.

2. Use Command-line Options

A command line option allows you to control specific behavior when a program is compiled and linked. This section describes the syntax for properly using command-line options and provides a brief overview of a few of the more common options.

2.1. Command-line Option Overview

Before looking at all the command-line options, first become familiar with the syntax for these options. There are a large number of options available to you, yet most users only use a few of them. So, start simple and progress into using the more advanced options.

By default, the NVIDIA HPC Compilers generate code that is optimized for the type of processor on which compilation is performed, the compilation host. Before adding options to your command-line, review Help with Command-line Options and Frequently-used Options.

2.1.1. Command-line Options Syntax

On a command-line, options need to be preceded by a hyphen (-⁠). If the compiler does not recognize an option, you get an unknown switch error. The error can be downgraded to a warning by adding the -noswitcherror option.

This document uses the following notation when describing options:

[item]

Square brackets indicate that the enclosed item is optional.

{item | item}

Braces indicate that you must select one and only one of the enclosed items. A vertical bar (|) separates the choices.

...

Horizontal ellipses indicate that zero or more instances of the preceding item are valid.

2.1.2. Command-line Suboptions

Some options accept several suboptions. You can specify these suboptions either by using the full option statement multiple times or by using a comma-separated list for the suboptions.

The following two command lines are equivalent:

nvfortran -Mvect=simd -Mvect=noaltcode
nvfortran -Mvect=simd,noaltcode

2.1.3. Command-line Conflicting Options

Some options have an opposite or negated counterpart. For example, both -⁠Mvect and -⁠Mnovect are available. -⁠Mvect enables vectorization and -⁠Mnovect disables it. If you used both of these commands on a command line, they would conflict.

Note: When you use conflicting options on a command line, the last encountered option takes precedence over any previous one.

The conflicting options rule is important for a number of reasons.

  • Some options, such as -⁠fast, include other options. Therefore, it is possible for you to be unaware that you have conflicting options.
  • You can use this rule to create makefiles that apply specific flags to a set of files, as shown in the following example.

Example: Makefiles with Options

In this makefile fragment, CCFLAGS uses vectorization. CCNOVECTFLAGS uses the flags defined for CCFLAGS but disables vectorization.

CCFLAGS=c -Mvect=simd
CCNOVECTFLAGS=$(CCFLAGS) -Mnovect

2.2. Help with Command-line Options

If you are just getting started with the NVIDIA HPC Compilers, it is helpful to know which options are available, when to use them, and which options most users find effective.

Using -help

The -⁠help option is useful because it provides information about all options supported by a given compiler.

You can use -⁠help in one of three ways:

  • Use -⁠help with no parameters to obtain a list of all the available options with a brief one-line description of each.
  • Add a parameter to -⁠help to restrict the output to information about a specific option. The syntax for this usage is:

    -help <command line option>

    Suppose you use the following command to restrict the output to information about the -⁠fast option:

    $ nvfortran -help -fast

    The output you see is similar to:

    -fast Common optimizations; includes -O2 -Munroll=c:1 -Mnoframe -Mlre

    In the following example, we add the -⁠help parameter to restrict the output to information about the help command. The usage information for -⁠help shows how groups of options can be listed or examined according to function.

    $ nvfortran -help -help
            -help[=groups|asm|debug|language|linker|opt|other|overall|phase|prepro|
             suffix|switch|target|variable]
  • Add a parameter to -⁠help to restrict the output to a specific set of options or to a building process. The syntax for this usage is this:

    -help=<subgroup>

2.3. Getting Started with Performance

This section provides a quick overview of a few of the command-line options that are useful in improving multicore CPU performance.

2.3.1. Using -fast

The NVIDIA HPC Compilers implement a wide range of options that allow users a fine degree of control on each optimization phase. When it comes to optimization of code, the quickest way to start is to use the option -⁠fast. These options create a generally optimal set of flags. They incorporate optimization options to enable use of vector streaming SIMD instructions for 64-bit targets. They enable vectorization with SIMD instructions, cache alignment, and flush to zero mode.

Note: The contents of the -⁠fast option are host-dependent. Further, you should use these options on both compile and link command lines.

The following table shows the typical -⁠fast options.

Table 4. Typical -⁠fast Options
Use this option... To do this...
-O2 Specifies a code optimization level of 2.
-Munroll=c:1 Unrolls loops, executing multiple instances of the original loop during each iteration.
-Mnoframe Do not generate code to set up a stack frame. Note: With this option, a stack trace does not work.
-Mlre Enable loop-carried redundancy elimination.
-Mpre Enable partial redundancy elimination

On most modern CPUs the -⁠fast also includes the options shown in this table:

Table 5. Additional -⁠fast Options
Use this option... To do this...
-Mvect=simd Generates packed SIMD instructions.
-Mcache_align Aligns long objects on cache-line boundaries.
-Mflushz Sets flush-to-zero mode.
-M[no]vect Controls automatic vector pipelining.

To see the specific behavior of -⁠fast for your target, use the following command:

$ nvfortran -help -fast

2.3.2. Other Performance-Related Options

While -⁠fast is designed to be the quickest route to best performance, it is limited to routine boundaries. Depending on the nature and writing style of the source code, the compiler often can perform further optimization by knowing the global context of usage of a given routine. For instance, determining the possible value range of actual parameters of a routine could enable a loop to be vectorized; similarly, determining static occurrence of calls helps to decide which routine is beneficial to inline.

These types of global optimizations are under control of Interprocedural Analysis (IPA) in NVIDIA HPC Compilers. Option -⁠Mipa enables Interprocedural Analysis. -⁠Mipa=fast is the recommended option to get best performances for global optimization. You can also add the suboption inline to enable automatic global inlining across files. You might consider using -⁠Mipa=fast,inline. This option for interprocedural analysis and global optimization can improve performance.

For more information on optimization, refer to Multicore CPU Optimization. For specific information about these options, refer to the ‘Optimization Controls’ section of the HPC Compilers Reference Guide.

2.4. Frequently-used Options

In addition to overall performance, there are a number of other options that many users find useful when getting started. The following table provides a brief summary of these options.

Table 6. Commonly Used Command-Line Options
Use this option... To do this...
-⁠acc Enable parallelization using OpenACC directives. By default the compilers will parallelize and offload OpenACC regions to an NVIDIA GPU. Use -⁠acc=multicore to parallelize OpenACC regions for execution on all the cores of a multicore CPU.
-⁠fast This option creates a generally optimal set of flags for targets that support SIMD capability. It incorporates optimization options to enable use of vector streaming SIMD instructions, cache alignment and flushz.
-⁠g Instructs the compiler to include symbolic debugging information in the object module; sets the optimization level to zero unless a -⁠O option is present on the command line. Conversely, to prevent the generation of DWARF information, use the -⁠Mnodwarf option.
-⁠gopt Instructs the compiler to include symbolic debugging information in the object file, and to generate optimized code identical to that generated when -⁠g is not specified.
-⁠gpu Control the type of GPU for which code is generated, the version of CUDA to be targeted, and several other aspects of GPU code generation.
-⁠help Provides information about available options.
-⁠mcmodel=medium Enables medium=model code generation for 64-bit targets, which is useful when the data space of the program exceeds 4GB.
-⁠mp Enable parallelization using OpenMP directives. By default the compilers will parallelize OpenMP regions for execution on all the cores of a multicore CPU. Use -⁠mp=gpu to parallelize OpenMP regions for offload to an NVIDIA GPU.
-⁠Mconcur Instructs the compiler to enable auto-concurrentization of loops. If specified, the compiler uses multiple CPU cores to execute loops that it determines to be parallelizable; thus, loop iterations are split to execute optimally in a multithreaded execution context.
-⁠Minfo Instructs the compiler to produce information on standard error.
-⁠Minline Enables function inlining.
-⁠Mipa=fast,inline Enables interprocedural analysis and optimization. Also enables automatic procedure inlining.
-⁠Mkeepasm Keeps the generated assembly files.
-⁠Munroll Invokes the loop unroller to unroll loops, executing multiple instances of the loop during each iteration. This also sets the optimization level to 2 if the level is set to less than 2, or if no -⁠O or -⁠g options are supplied.
-⁠M[no]vect Enables [Disables] the code vectorizer.
--⁠[no_]exceptions Removes exception handling from user code. For C⁠+⁠+, declares that the functions in this file generate no C⁠+⁠+ exceptions, allowing more optimal code generation.
-⁠o Names the output file.
-⁠O <level> Specifies code optimization level where <level> is 0, 1, 2, 3, or 4.
-⁠stdpar Enable parallelization and offloading of Standard C⁠+⁠+ and Fortran parallel constructs to NVIDIA GPUs; default is -⁠stdpar=gpu.
-⁠tp <target> Specify a CPU target other than the compilation host CPU.
-⁠Wl, <option> Compiler driver passes the specified options to the linker.

2.5. Floating-point Subnormal

Starting with the 22.7 release of the NV HPC SDK the default setting of how floating-point denormal (IEEE 754 terminology "subnormal") values are processed at runtime across both x86_64 and aarch64 processors has been changed to be more consistent.

Denormal values can be both operands to, and results of, floating-point operations. The x86_64 ISA differentiate between the two categories, operands and results, and use the terminology "daz" denormals are zeros for operands, and "flushz" flush to zero for results. The Arm V8 ISA as defined can differentiate between the two categories, but currently the processors that NV HPC SDK support only have a single setting for both operands and results and is defined as "fz" in the floating-point status and control register.

The NV HPC SDK C, C++, and Fortran compilers have command line switches -⁠M[no]daz and -⁠M[no]flushz, which when specified for the C/C++ main function or the Fortran main program affect how denormals are handled by the processor at runtime. The values of these two command line switches are passed to the runtime library to configure the floating-point status and control register at program startup.

NV HPC SDK supports x86_64 processors from both Intel and AMD, and ArmV8.1 and later processors. The following table summarizes the default settings of the -⁠Mdaz and -⁠Mflushz command line switches pre and post the 22.7 release.

Table 7. Default settings of -⁠Mdaz and -⁠Mflushz
  Pre 22.7 defaults 22.7 defaults
Intel

-Mdaz

-Mnoflushz

-Mdaz

-Mflushz

AMD

-Mnodaz

-Mnoflushz

-Mdaz

-Mflushz

Arm processors

-Mnodaz

-Mdaz

With the NV HPC SDK 22.7 release, the default handling of denormals operands and results is to treat them as zero, as if the main function/program were compiled with -⁠Mdaz-⁠Mflushz. Consequently, these changes can potentially affect applications that are dependent on subnormal values being non-zero.

Along with the change to the default treatment of denormal values, users now have the ability to configure the floating-point status and control register through the NVCOMPILER_FPU_STATE environment variable - effectively overriding how the program was originally compiled. For further information, see the description of the NVCOMPILER_FPU_STATE environment variable.

3. Multicore CPU Optimization

Source code that is readable, maintainable, and produces correct results is not always organized for efficient execution. Normally, the first step in the program development process involves producing code that executes and produces the correct results. This first step usually involves compiling without much worry about optimization. After code is compiled and debugged, code optimization and parallelization become an issue.

Invoking one of the NVIDIA HPC Compiler commands with certain options instructs the compiler to generate optimized code. Optimization is not always performed since it increases compilation time and may make debugging difficult. However, optimization produces more efficient code that usually runs significantly faster than code that is not optimized.

The compilers optimize code according to the specified optimization level. You can use a number of options to specify the optimization levels, including -⁠O, -⁠Mvect, -⁠Mipa and -⁠Mconcur. In addition, you can use several of the -⁠M<nvflag> switches to control specific types of optimization.

This chapter describes the overall effect of the optimization options supported by the NVIDIA HPC Compilers, and basic usage of several options.

3.1. Overview of Optimization

In general, optimization involves using transformations and replacements that generate more efficient code. This is done by the compiler and involves replacements that are independent of the particular target processor's architecture as well as replacements that take advantage of the x86-64OpenPOWER architecture, instruction set and registers.

For discussion purposes, we categorize optimization:

3.1.1. Local Optimization

A basic block is a sequence of statements in which the flow of control enters at the beginning and leaves at the end without the possibility of branching, except at the end. Local optimization is performed on a block-by-block basis within a program’s basic blocks.

The NVIDIA HPC Compilers perform many types of local optimization including: algebraic identity removal, constant folding, common sub-expression elimination, redundant load and store elimination, scheduling, strength reduction, and peephole optimizations.

3.1.2. Global Optimization

This optimization is performed on a subprogram/function over all its basic blocks. The optimizer performs control-flow and data-flow analysis for an entire program unit. All loops, including those formed by ad hoc branches such as IFs or GOTOs, are detected and optimized.

Global optimization includes: constant propagation, copy propagation, dead store elimination, global register allocation, invariant code motion, and induction variable elimination.

3.1.3. Loop Optimization: Unrolling, Vectorization and Parallelization

The performance of certain classes of loops may be improved through vectorization or unrolling options. Vectorization transforms loops to improve memory access performance and make use of packed SSEvector instructions which perform the same operation on multiple data items concurrently. Unrolling replicates the body of loops to reduce loop branching overhead and provide better opportunities for local optimization, vectorization and scheduling of instructions. Performance for loops on systems with multiple processors may also improve using the parallelization features of the NVIDIA HPC Compilers.

3.1.4. Interprocedural Analysis (IPA) and Optimization

Interprocedural analysis (IPA) allows use of information across function call boundaries to perform optimizations that would otherwise be unavailable. For example, if the actual argument to a function is in fact a constant in the caller, it may be possible to propagate that constant into the callee and perform optimizations that are not valid if the dummy argument is treated as a variable. A wide range of optimizations are enabled or improved by using IPA, including but not limited to data alignment optimizations, argument removal, constant propagation, pointer disambiguation, pure function detection, F90/F95 array shape propagation, data placement, empty function removal, automatic function inlining, inlining of functions from pre-compiled libraries, and interprocedural optimization of functions from pre-compiled libraries.

3.1.5. Function Inlining

This optimization allows a call to a function to be replaced by a copy of the body of that function. This optimization will sometimes speed up execution by eliminating the function call and return overhead. Function inlining may also create opportunities for other types of optimization. Function inlining is not always beneficial. When used improperly it may increase code size and generate less efficient code.

3.2. Getting Started with Optimization

The first concern should be getting the program to execute and produce correct results. To get the program running, start by compiling and linking without optimization. Add -⁠O0 to the compile line to select no optimization; or add -⁠g to debug the program easily and isolate any coding errors exposed during porting.

To get started quickly with optimization, a good set of options to use with any of the NVIDIA HPC compilers is -⁠fast. For example:

$ nvfortran -fast -Mipa=fast,inline prog.f

For all of the NVIDIA HPC Fortran, C++ and C compilers, the -⁠fast -⁠Mipa=fast,inline options generally produce code that is well-optimized without the possibility of significant slowdowns due to pathological cases.

  • The-⁠fast option is an aggregate option that includes a number of individual NVIDIA compiler options; which compiler options are included depends on the target for which compilation is performed.
  • The -⁠Mipa=fast,inline option invokes interprocedural analysis (IPA), including several IPA suboptions. The inline suboption enables automatic inlining with IPA. If you do not wish to use automatic inlining, you can compile with -⁠Mipa=fast and use several IPA suboptions without inlining.

These aggregate options incorporate a generally optimal set of flags for targets that support SIMD capability, including vectorization with SIMD instructions, cache alignment, and flushz.

The following table shows the typical -⁠fast options.

Table 8. Typical -⁠fast Options
Use this option... To do this...
-⁠O2 Specifies a code optimization level of 2 and -⁠Mvect=SIMD.
-⁠Munroll=c:1 Unrolls loops, executing multiple instances of the original loop during each iteration.
-⁠Mnoframe Indicates to not generate code to set up a stack frame. Note With this option, a stack trace does not work.
-⁠Mlre Indicates loop-carried redundancy elimination.
-⁠Mautoinline Enables automatic function inlining in C & C++.
-⁠Mpre Indicates partial redundancy elimination

On modern multicore CPUs the -⁠fast also typically includes the options shown in the following table:

Table 9. Additional -⁠fast Options
Use this option... To do this...
-⁠Mvect=simd Generates packed SSE and AVX instructions.
-⁠Mcache_align Aligns long objects on cache-line boundaries.
-⁠Mflushz Sets flush-to-zero mode.

By experimenting with individual compiler options on a file-by-file basis, further significant performance gains can sometimes be realized. However, depending on the coding style, individual optimizations can sometimes cause slowdowns, and must be used carefully to ensure performance improvements.

There are other useful command line options related to optimization and parallelization, such as -⁠help, -⁠Minfo, -⁠Mneginfo, -⁠dryrun, and -⁠v.

3.2.1. -help

As described in Help with Command-Line Options, you can see a specification of any command-line option by invoking any of the NVIDIA HPC Compilers with -⁠help in combination with the option in question, without specifying any input files.

For example, you might want information on -⁠O:

$ nvfortran -help -O

The resulting output is similar to this:

-O Set opt level. All -O1 optimizations plus traditional scheduling and
  global scalar optimizations performed

Or you can see the full functionality of -⁠help itself, which can return information on either an individual option or groups of options:

$ nvfortran -help -help

The resulting output is similar to this:

-help[=groups|asm|debug|language|linker|opt|other|overall|
phase|prepro|suffix|switch|target|variable]
Show compiler switches

3.2.2. -Minfo

You can use the -⁠Minfo option to display compile-time optimization listings. When this option is used, the NVIDIA HPC Compilers issue informational messages to standard error (stderr) as compilation proceeds. From these messages, you can determine which loops are optimized using unrolling, SIMD vectorization, parallelization, GPU offloading, interprocedural optimizations and various miscellaneous optimizations. You can also see where and whether functions are inlined.

3.2.3. -Mneginfo

You can use the -⁠Mneginfo option to display informational messages to standard error (stderr) that explain why certain optimizations are inhibited.

3.2.4. -dryrun

The -⁠dryrun option can be useful as a diagnostic tool if you need to see the steps used by the compiler driver to preprocess, compile, assemble and link in the presence of a given set of command line inputs. When you specify the -⁠dryrun option, these steps are printed to standard error (stderr) but are not actually performed. For example, you can use this option to inspect the default and user-specified libraries that are searched during the link phase, and the order in which they are searched by the linker.

3.2.5. -v

The -⁠v option is similar to -⁠dryrun, except each compilation step is performed and not simply printed.

3.3. Local and Global Optimization

This section describes local and global optimization.

3.3.1. -Msafeptr

The -⁠Msafeptr option can significantly improve performance of C++ and C programs in which there is known to be no pointer aliasing. For obvious reasons, this command-line option must be used carefully. There are a number of suboptions for -⁠Msafeptr:

  • -Msafeptr=all – All pointers are safe. Equivalent to the default setting: -⁠Msafeptr.
  • -Msafeptr=arg – Function formal argument pointers are safe. Equivalent to -⁠Msafeptr=dummy.
  • -Msafeptr=global – Global pointers are safe.
  • -Msafeptr=local – Local pointers are safe. Equivalent to -⁠Msafeptr=auto.
  • -Msafeptr=static – Static local pointers are safe.

If your C++ or C program has pointer aliasing and you also want automating inlining, then compiling with -⁠Mipa=fast or -⁠Mipa=fast,inline includes pointer aliasing optimizations. IPA may be able to optimize some of the alias references in your program and leave intact those that cannot be safely optimizied.

3.3.2. -O

Using the NVIDIA HPC Compiler commands with the -⁠O<level> option (the capital O is for Optimize), you can specify any integer level from 0 to 4.

-O0

Level zero specifies no optimization. A basic block is generated for each language statement. At this level, the compiler generates a basic block for each statement.

Performance will almost always be slowest using this optimization level. This level is useful for the initial execution of a program. It is also useful for debugging, since there is a direct correlation between the program text and the code generated. To enable debugging, include -⁠g on your compile line.

-O1

Level one specifies local optimization. Scheduling of basic blocks is performed. Register allocation is performed.

Local optimization is a good choice when the code is very irregular, such as code that contains many short statements containing IF statements and does not contain loops (DO or DO WHILE statements ). Although this case rarely occurs, for certain types of code, this optimization level may perform better than level-two (-⁠O2).

-O

When no level is specified, level two global optimizations are performed, including traditional scalar optimizations, induction recognition, and loop invariant motion. No SIMD vectorization is enabled.

-O2

Level two specifies global optimization. This level performs all level-one local optimization as well as level two global optimization described in -⁠O. In addition, more advanced optimizations such as SIMD code generation, cache alignment, and partial redundancy elimination are enabled.

-O3

Level three specifies aggressive global optimization. This level performs all level-one and level-two optimizations and enables more aggressive hoisting and scalar replacement optimizations that may or may not be profitable.

-O4

Level four performs all level-one, level-two, and level-three optimizations and enables hoisting of guarded invariant floating point expressions.

Types of Optimizations

The NVIDIA HPC Compilers perform many different types of local optimizations, including but not limited to:

  • Algebraic identity removal
  • Constant folding
  • Common subexpression elimination
  • Local register optimization
  • Peephole optimizations
  • Redundant load and store elimination
  • Strength reductions

Level-two optimization (-⁠O2 or -⁠O) specifies global optimization. The -⁠fast option generally specifies global optimization; however, the -⁠fast switch varies from release to release, depending on a reasonable selection of switches for any one particular release. The -⁠O or -⁠O2 level performs all level-one local optimizations as well as global optimizations. Control flow analysis is applied and global registers are allocated for all functions and subroutines. Loop regions are given special consideration. This optimization level is a good choice when the program contains loops, the loops are short, and the structure of the code is regular.

The NVIDIA HPC Compilers perform many different types of global optimizations, including but not limited to:

  • Branch to branch elimination
  • Constant propagation
  • Copy propagation
  • Dead store elimination
  • Global register allocation
  • Induction variable elimination
  • Invariant code motion

You can explicitly select the optimization level on the command line. For example, the following command line specifies level-two optimization which results in global optimization:

$ nvfortran -O2 prog.f

The default optimization level changes depending on which options you select on the command line. For example, when you select the -⁠g debugging option, the default optimization level is set to level-zero (-⁠O0). However, if you need to debug optimized code, you can use the -⁠gopt option to generate debug information without perturbing optimization. For a description of the default levels, refer to Default Optimization Levels.

The -⁠fast option includes -⁠O2 on all targets. If you want to override the default for -⁠fast with -⁠O3 while maintaining all other elements of -⁠fast, simply compile as follows:

$ nvfortran -fast -O3 prog.f

3.4. Loop Unrolling using -Munroll

This optimization unrolls loops, which reduces branch overhead, and can improve execution speed by creating better opportunities for instruction scheduling. A loop with a constant count may be completely unrolled or partially unrolled. A loop with a non-constant count may also be unrolled. A candidate loop must be an innermost loop containing one to four blocks of code.

The following example shows the use of the -⁠Munroll option:

$ nvfortran -Munroll prog.f

The -⁠Munroll option is included as part of -⁠fast on all targets. The loop unroller expands the contents of a loop and reduces the number of times a loop is executed. Branching overhead is reduced when a loop is unrolled two or more times, since each iteration of the unrolled loop corresponds to two or more iterations of the original loop; the number of branch instructions executed is proportionately reduced. When a loop is unrolled completely, the loop’s branch overhead is eliminated altogether.

Loop unrolling may be beneficial for the instruction scheduler. When a loop is completely unrolled or unrolled two or more times, opportunities for improved scheduling may be presented. The code generator can take advantage of more possibilities for instruction grouping or filling instruction delays found within the loop.

Examples Showing Effect of Unrolling

The following side-by-side examples show the effect of code unrolling on a segment that computes a dot product.

Note: This example is only meant to represent how the compiler can transform the loop; it is not meant to imply that the programmer needs to manually change code. In fact, manually unrolling your code can sometimes inhibit the compiler’s analysis and optimization.
Table 10. Example of Effect of Code Unrolling
Dot Product Code Unrolled Dot Product Code
 REAL*4 A(100), B(100), Z
  INTEGER I
  DO I=1, 100
    Z = Z + A(i) * B(i)
  END DO
 END
 REAL*4 A(100), B(100), Z
  INTEGER I
  DO I=1, 100, 2
    Z = Z + A(i) * B(i)
    Z = Z + A(i+1) * B(i+1)
  END DO
 END

Using the -⁠Minfo option, the compiler informs you when a loop is being unrolled. For example, a message similar to the following, indicating the line number, and the number of times the code is unrolled, displays when a loop is unrolled:

dot:
  5, Loop unrolled 5 times

Using the c:<m> and n:<m> sub-options to -⁠Munroll, or using -⁠Mnounroll, you can control whether and how loops are unrolled on a file-by-file basis. For more information on -⁠Munroll, refer to Use Command-line Options.

3.5. Vectorization using -Mvect

The -⁠Mvect option is included as part of -⁠fast on all multicore CPU targets. If your program contains computationally-intensive loops, the -⁠Mvect option may be helpful. If in addition you specify -⁠Minfo, and your code contains loops that can be vectorized, the compiler reports relevant information on the optimizations applied.

When an NVIDIA HPC Compiler command is invoked with the -⁠Mvect option, the vectorizer scans code searching for loops that are candidates for high-⁠level transformations such as loop distribution, loop interchange, cache tiling, and idiom recognition (replacement of a recognizable code sequence, such as a reduction loop, with optimized code sequences or function calls). When the vectorizer finds vectorization opportunities, it internally rearranges or replaces sections of loops (the vectorizer changes the code generated; your source code’s loops are not altered). In addition to performing these loop transformations, the vectorizer produces extensive data dependence information for use by other phases of compilation and detects opportunities to use vector or packed SIMD instructions on processors where these are supported.

The -⁠Mvect option can speed up code which contains well-behaved countable loops which operate on large floating point arrays in Fortran and their C++ and C counterparts. However, it is possible that some codes will show a decrease in performance when compiled with the -⁠Mvect option due to the generation of conditionally executed code segments, inability to determine data alignment, and other code generation factors. For this reason, it is recommended that you check carefully whether particular program units or loops show improved performance when compiled with this option enabled.

3.5.1. Vectorization Sub-options

The vectorizer performs high-level loop transformations on countable loops. A loop is countable if the number of iterations is set only before loop execution and cannot be modified during loop execution. Some of the vectorizer transformations can be controlled by arguments to the -⁠Mvect command line option. The following sections describe the arguments that affect the operation of the vectorizer. In addition, some of these vectorizer operations can be controlled from within code using directives and pragmas.

The vectorizer performs the following operations:

  • Loop interchange
  • Loop splitting
  • Loop fusion
  • Generation of SIMD instructions on CPUs where these are supported
  • Generation of prefetch instructions on processors where these are supported
  • Loop iteration peeling to maximize vector alignment
  • Alternate code generation

The following table lists and briefly describes some of the -⁠Mvect suboptions.

Table 11. -Mvect Suboptions
Use this option ... To instruct the vectorizer to do this ...
-Mvect=altcode Generate appropriate code for vectorized loops.
-Mvect=[no]assoc Perform[disable] associativity conversions that can change the results of a computation due to a round-off error. For example, a typical optimization is to change one arithmetic operation to another arithmetic operation that is mathematically correct, but can be computationally different and generate faster code. This option is provided to enable or disable this transformation, since a round-off error for such associativity conversions may produce unacceptable results.
-Mvect=fuse Enable loop fusion.
-Mvect=gather Enable vectorization of indirect array references.
-Mvect=idiom Enable idiom recognition.
-Mvect=levels:<n> Set the maximum next level of loops to optimize.
-Mvect=nocond Disable vectorization of loops with conditions.
-Mvect=partial Enable partial loop vectorization via inner loop distribution.
-Mvect=prefetch Automatically generate prefetch instructions when vectorizable loops are encountered, even in cases where SSESIMD instructions are not generated.
-Mvect=short Enable short vector operations.
-Mvect=simd Automatically generate packed SSE (Streaming SIMD Extensions)SIMD, and prefetch instructions when vectorizable loops are encountered. SIMD instructions, first introduced on Pentium III and AthlonXP processors, operate on single-precision floating-point data.
-Mvect=sizelimit:n Limit the size of vectorized loops.
-Mvect=sse Equivalent to -⁠Mvect=simd.
-Mvect=uniform Perform consistent optimizations in both vectorized and residual loops. Be aware that this may affect the performance of the residual loop.
Note: Inserting no in front of an option disables the option. For example, to disable the generation of SIMD instructions, compile with -⁠Mvect=nosimd.

3.5.2. Vectorization Example Using SIMD Instructions

One of the most important vectorization options is -Mvect=simd. When you use this option, the compiler automatically generates SIMD vector instructions, where possible, when targeting processors on which these instructions are supported. This process can improve performance by several factors compared with the equivalent scalar code. All of the NVIDIA HPC Fortran, C++ and C compilers support this capability.

In the program in Vector operation using SIMD instructions, the vectorizer recognizes the vector operation in subroutine 'loop' when either compiler switch -⁠Mvect=simd or -⁠fast is used. This example shows the compilation, informational messages, and runtime results using SIMD instructions on an Intel Core i7 7800X Skylake system, along with issues that affect SIMD performance.

Loops vectorized using SIMD instructions operate much more efficiently when processing vectors that are aligned to a cache-line boundary. You can cause unconstrained data objects of size 16 bytes or greater to be cache-aligned by compiling with the -⁠Mcache_align switch. An unconstrained data object is a data object that is not a common block member and not a member of an aggregate data structure.

Note: For stack-based local variables to be properly aligned, the main program or function must be compiled with -⁠Mcache_align.

The -⁠Mcache_align switch has no effect on the alignment of Fortran allocatable or automatic arrays. If you have arrays that are constrained, such as vectors that are members of Fortran common blocks, you must specifically pad your data structures to ensure proper cache alignment. You can use -⁠Mcache_align for only the beginning address of each common block to be cache-aligned.

The following examples show the results of compiling the sample code in Vector operation using SIMD instructions both with and without the option -⁠Mvect=simd .

Vector operation using SIMD instructions

program vector_op
 parameter (N = 9999)
 real*4 x(N), y(N), z(N), W(N)
 do i = 1, n
   y(i) = i
   z(i) = 2*i
   w(i) = 4*i
 enddo
 do j = 1, 200000
   call loop(x,y,z,w,1.0e0,N)
 enddo
 print *, x(1),x(771),x(3618),x(6498),x(9999)
end
subroutine loop(a,b,c,d,s,n)
 integer i, n
 real*4 a(n), b(n), c(n), d(n),s
 do i = 1, n
   a(i) = b(i) + c(i) - s * d(i)
 enddo
end

Assume the preceding program is compiled as follows, where -Mvect=nosimd disables SIMD vectorization:

% nvfortran -fast -Mvect=nosimd -Minfo vadd.f -Mfree -o vadd
vector_op:
      4, Loop unrolled 16 times
         Generated 1 prefetches in scalar loop
      9, Loop not vectorized/parallelized: contains call
loop:
     18, Loop unrolled 8 times
         FMA (fused multiply-add) instruction(s) generated 

The following output shows a sample result if the generated executable is run and timed on an Intel Core i7 7800X Skylake system:

$ /bin/time vadd
   -1.000000       -771.0000       -3618.000       -6498.000
   -9999.000
0.99user 0.01system 0:01.18elapsed 84%CPU (0avgtext+0avgdata 3120maxresident)k
7736inputs+0outputs (4major+834minor)pagefaults 0swaps
$ /bin/time vadd
   -1.000000       -771.0000       -3618.000       -6498.000
   -9999.000
2.31user 0.00system 0:02.57elapsed 89%CPU (0avgtext+0avgdata 6976maxresident)k
8192inputs+0outputs (4major+149minor)pagefaults 0swaps

Now, recompile with vectorization enabled, and you see results similar to these:

% nvfortran -fast -Minfo vadd.f -Mfree -o vadd
vector_op:
      4, Loop not vectorized: may not be beneficial
         Unrolled inner loop 8 times
         Residual loop unrolled 7 times (completely unrolled)
         Generated 1 prefetches in scalar loop
      9, Loop not vectorized/parallelized: contains call
loop:
     18, Generated 2 alternate versions of the loop
         Generated vector simd code for the loop
         Generated 3 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 3 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 3 prefetch instructions for the loop
         FMA (fused multiply-add) instruction(s) generated 

Notice the informational messages for the loop at line 18. The first line of the message indicates that two alternate versions of the loop were generated. The loop count and alignments of the arrays determine which of these versions is executed. The next several lines indicate the loop was vectorized and that prefetch instructions have been generated for three loads to minimize latency of data transfers from main memory.

Executing again, you should see results similar to the following:

$ /bin/time vadd-simd
   -1.000000       -771.0000       -3618.000       -6498.000
   -9999.000
0.27user 0.00system 0:00.29elapsed 93%CPU (0avgtext+0avgdata 3124maxresident)k
0inputs+0outputs (0major+838minor)pagefaults 0swaps
$ /bin/time vadd-simd
   -1.000000       -771.0000       -3618.000       -6498.000
   -9999.000
0.62user 0.00system 0:00.65elapsed 95%CPU (0avgtext+0avgdata 6976maxresident)k
0inputs+0outputs (0major+151minor)pagefaults 0swaps

The SIMD result is 3.7 times faster than the equivalent non-SIMD version of the program.

Speed-up realized by a given loop or program can vary widely based on a number of factors:

  • When the vectors of data are resident in the data cache, performance improvement using SIMD instructions is most effective.
  • If data is aligned properly, performance will be better in general than when using SIMD operations on unaligned data.
  • If the compiler can guarantee that data is aligned properly, even more efficient sequences of SIMD instructions can be generated.
  • The efficiency of loops that operate on single-precision data can be higher. SIMD instructions can operate on four single-precision elements concurrently, but only two double-precision elements.
Note: Compiling with -⁠Mvect=simd can result in numerical differences from the executables generated with less optimization. Certain vectorizable operations, for example dot products, are sensitive to order of operations and the associative transformations necessary to enable vectorization (or parallelization).

3.6. Interprocedural Analysis and Optimization using -Mipa

The NVIDIA HPC Fortran, C++ and C compilers use interprocedural analysis (IPA) that results in minimal changes to makefiles and the standard edit-build-run application development cycle. Other than adding -⁠Mipa to the command line, no other changes are required. For reference and background, the process of building a program without IPA is described later in this section, followed by the minor modifications required to use IPA with the NVIDIA compilers. While the NVC compiler is used here to show how IPA works, similar capabilities apply to each of the NVIDIA HPC Fortran, C++ and C compilers.

3.6.1. Building a Program Without IPA – Single Step

Using the nvc command-level compiler driver, multiple source files can be compiled and linked into a single executable with one command. The following example compiles and links three source files:

% nvc -o a.out file1.c file2.c file3.c

In actuality, the nvc driver executes several steps to produce the assembly code and object files corresponding to each source file, and subsequently to link the object files together into a single executable file. This command is roughly equivalent to the following commands performed individually:

% nvc -S -o file1.s file1.c
% as -o file1.o file1.s
% nvc -S -o file2.s file2.c
% as -o file2.o file2.s
% nvc -S -o file3.s file3.c
% as -o file3.o file3.s
% nvc -o a.out file1.o file2.o file3.o

If any of the three source files is edited, the executable can be rebuilt with the same command line:

% nvc -o a.out file1.c file2.c file3.c
Note: This always works as intended, but has the side-effect of recompiling all of the source files, even if only one has changed. For applications with a large number of source files, this can be time-consuming and inefficient.

3.6.2. Building a Program Without IPA – Several Steps

It is also possible to use individual nvc commands to compile each source file into a corresponding object file, and one to link the resulting object files into an executable:

% nvc -c file1.c
% nvc -c file2.c
% nvc -c file3.c
% nvc -o a.out file1.o file2.o file3.o

The nvc driver invokes the compiler and assembler as required to process each source file, and invokes the linker for the final link command. If you modify one of the source files, the executable can be rebuilt by compiling just that file and then relinking:

% nvc -c file1.c
% nvc -o a.out file1.o file2.o file3.o

3.6.3. Building a Program Without IPA Using Make

The program compilation and linking process can be simplified greatly using the make utility on systems where it is supported. Suppose you create a makefile containing the following lines:

a.out: file1.o file2.o file3.o
 nvc $(OPT) -o a.out file1.o file2.o file3.o
file1.o: file1.c
 nvc $(OPT) -c file1.c
file2.o: file2.c
 nvc $(OPT) -c file2.c
file3.o: file3.c
nvc $(OPT) -c file3.c

It is then possible to type a single make command:

% make

The make utility determines which object files are out of date with respect to their corresponding source files, and invokes the compiler to recompile only those source files and to relink the executable. If you subsequently edit one or more source files, the executable can be rebuilt with the minimum number of recompilations using the same single make command.

3.6.4. Building a Program with IPA

Interprocedural analysis and optimization (IPA) by the NVIDIA HPC Compilers alters the standard and make utility command-level interfaces as little as possible. IPA occurs in three phases:

  • Collection: Create a summary of each function or procedure, collecting the useful information for interprocedural optimizations. This is done during the compile step if the -⁠Mipa switch is present on the command line; summary information is collected and stored in the object file.
  • Propagation: Process all the object files to propagate the interprocedural summary information across function and file boundaries. This is done during the link step, when all the object files are combined, if the -⁠Mipa switch is present on the link command line.
  • Recompile/Optimization: Recompile each of the object files with the propagated interprocedural information, producing a specialized object file. This process is also performed during the link step when the -⁠Mipa switch is present on the link command line.

When linking with -⁠Mipa, the NVIDIA HPC Compilers automatically regenerate IPA-optimized versions of each object file, essentially recompiling each file. If there are IPA-optimized objects from a previous build, the compilers will minimize the recompile time by reusing those objects if they are still valid. They will still be valid if the IPA-optimized object is newer than the original object file, and the propagated IPA information for that file has not changed since it was optimized.

After each object file has been recompiled, the regular linker is invoked to build the application with the IPA-optimized object files. The IPA-optimized object files are saved in the same directory as the original object files, for use in subsequent program builds.

3.6.5. Building a Program with IPA – Single Step

By adding the -⁠Mipa command line switch, several source files can be compiled and linked with interprocedural optimizations with one command:

% nvc -Mipa=fast -o a.out file1.c file2.c file3.c

Just like compiling without -⁠Mipa, the driver executes several steps to produce the assembly and object files to create the executable:

% nvc -Mipa=fast -S -o file1.s file1.c
% as -o file1.o file1.s
% nvc -Mipa=fast -S -o file2.s file2.c
% as -o file2.o file2.s
% nvc -Mipa=fast -S -o file3.s file3.c
% as -o file3.o file3.s
% nvc -Mipa=fast -o a.out file1.o file2.o file3.o

In the last step, an IPA linker is invoked to read all the IPA summary information and perform the interprocedural propagation. The IPA linker reinvokes the compiler on each of the object files to recompile them with interprocedural information. This creates three new objects with mangled names:

file1_ipa5_a.out.oo.o, file2_ipa5_a.out.oo.o, file3_ipa5_a.out.oo.o

The system linker is then invoked to link these IPA-optimized objects into the final executable. Later, if one of the three source files is edited, the executable can be rebuilt with the same command line:

% nvc -Mipa=fast -o a.out file1.c file2.c file3.c

This works, but again has the side-effect of compiling each source file, and recompiling each object file at link time.

3.6.6. Building a Program with IPA – Several Steps

Just by adding the -⁠Mipa command-line switch, it is possible to use individual nvc commands to compile each source file, followed by a command to link the resulting object files into an executable:

% nvc -Mipa=fast -c file1.c
% nvc -Mipa=fast -c file2.c
% nvc -Mipa=fast -c file3.c
% nvc -Mipa=fast -o a.out file1.o file2.o file3.o

The nvc driver invokes the compiler and assembler as required to process each source file, and invokes the IPA linker for the final link command. If you modify one of the source files, the executable can be rebuilt by compiling just that file and then relinking:

% nvc -Mipa=fast -c file1.c
% nvc -Mipa=fast -o a.out file1.o file2.o file3.o

When the IPA linker is invoked, it will determine that the IPA-optimized object for file1.o (file1_ipa5_a.out.oo.o) is stale, since it is older than the object file1.o; and hence it needs to be rebuilt, and reinvokes the compiler to generate it. In addition, depending on the nature of the changes to the source file file1.c, the interprocedural optimizations previously performed for file2 and file3 may now be inaccurate. For instance, IPA may have propagated a constant argument value in a call from a function in file1.c to a function in file2.c; if the value of the argument has changed, any optimizations based on that constant value are invalid. The IPA linker determines which, if any, of the previously created IPA-optimized objects need to be regenerated; and, as appropriate, reinvokes the compiler to regenerate them. Only those objects that are stale or which have new or different IPA information are regenerated. This approach saves compile time.

3.6.7. Building a Program with IPA Using Make

As shown earlier, programs can be built with IPA using the make utility. Just add the command-line switch -⁠Mipa, as shown here:

OPT=-Mipa=fast
a.out: file1.o file2.o file3.o
 nvc $(OPT) -o a.out file1.o file2.o file3.o
file1.o: file1.c
 nvc $(OPT) -c file1.c
file2.o: file2.c
 nvc $(OPT) -c file2.c
file3.o: file3.c
 nvc $(OPT) -c file3.c

Using the single make command invokes the compiler to generate any of the object files that are out-of-date, then invokes nvc to link the objects into the executable. At link time, nvc calls the IPA linker to regenerate any stale or invalid IPA-optimized objects.

% make

3.6.8. Questions about IPA

Question: Why is the object file so large?

Answer: An object file created with -⁠Mipa contains several additional sections. One is the summary information used to drive the interprocedural analysis. In addition, the object file contains the compiler internal representation of the source file, so the file can be recompiled at link time with interprocedural optimizations. There may be additional information when inlining is enabled. The total size of the object file may be 5-10 times its original size. The extra sections are not added to the final executable.

Question: What if I compile with -⁠Mipa and link without -⁠Mipa?

Answer: The NVIDIA HPC Compilers generate a legal object file, even when the source file is compiled with -⁠Mipa. If you compile with -⁠Mipa and link without -⁠Mipa, the linker is invoked on the original object files. A legal executable is generated. While this executable does not have the benefit of interprocedural optimizations, any other optimizations do apply.

Question: What if I compile without -⁠Mipa and link with -⁠Mipa?

Answer: At link time, the IPA linker must have summary information about all the functions or routines used in the program. This information is created only when a file is compiled with -⁠Mipa. If you compile a file without -⁠Mipa and then try to get interprocedural optimizations by linking with -⁠Mipa, the IPA linker will issue a message that some routines have no IPA summary information, and will proceed to run the system linker using the original object files. If some files were compiled with -⁠Mipa and others were not, it will determine the safest approximation of the IPA summary information for those files not compiled with -⁠Mipa, and use that to recompile the other files using interprocedural optimizations.

Question: Can I build multiple applications in the same directory with -⁠Mipa?

Answer: Yes. Suppose you have three source files: main1.c, main2.c, and sub.c, where sub.c is shared between the two applications. Suppose you build the first application with -⁠Mipa, using this command:

% nvc -Mipa=fast -o app1 main1.c sub.c

The IPA linker creates two IPA-optimized object files and uses them to build the first application.

main1_ipa4_app1.oo sub_ipa4_app1.oo

Now suppose you build the second application using this command:

% nvc -Mipa=fast -o app2 main2.c sub.c

The IPA linker creates two more IPA-optimized object files:

main2_ipa4_app2.oo sub_ipa4_app2.oo
Note: There are now three object files for sub.c: the original sub.o, and two IPA-optimized objects, one for each application in which it appears.

Question: How is the mangled name for the IPA-optimized object files generated?

Answer: The mangled name has ‘_ipa’ appended, followed by the decimal number of the length of the executable file name, followed by an underscore and the executable file name itself. The suffix is changed to .oo so that linking *.o does not pull in the IPA-optimized objects. If the IPA linker determines that the file would not benefit from any interprocedural optimizations, it does not have to recompile the file at link time, and uses the original object.

Question: Can I use parallel make environments (e.g., pmake) with IPA?

Answer: No. IPA is not compatible with parallel make environments.

4. Using Function Inlining

Function inlining replaces a call to a function or a subroutine with the body of the function or subroutine. This can speed up execution by eliminating parameter passing and function/subroutine call and return overhead. It also allows the compiler to optimize the function with the rest of the code. Note that using function inlining indiscriminately can result in much larger code size and no increase in execution speed.

The NVIDIA HPC compilers provide two categories of inlining:

  • Automatic function inlining – In C++ and C, you can inline static functions with the inline keyword by using the -⁠Mautoinline option, which is included with -⁠fast.
  • Function inlining – You can inline functions which were extracted to the inline libraries in Fortran, C++ and C. There are two ways of enabling function inlining: with and without the lib suboption. For the latter, you create inline libraries, for example using the nvfortran compiler driver and the -⁠o and -⁠Mextract options.

There are important restrictions on inlining. Inlining only applies to certain types of functions. Refer to Restrictions on Inlining for more details on function inlining limitations.

This section describes how to use the following options related to function inlining:

  -Mautoinline
  -Mextract
  -Minline
  -Mnoinline
  -Mrecursive

4.1. Automatic function inlining in C++ and C

To enable automatic function inlining in C++ and C for static functions with the inline keyword, use the -⁠Mautoinline option (included in -⁠fast). Use -⁠Mnoautoinline to disable it.

These -⁠Mautoinline suboptions let you determine the selection criteria, where n loosely corresponds to the number of lines in the procedure:

maxsize:n
Automatically inline functions size n and less
totalsize:n
Limit automatic inlining to total size of n

4.2. Invoking Procedure Inlining

To invoke the procedure inliner, use the -⁠Minline option. If you do not specify an inline library, the compiler performs a special prepass on all source files named on the compiler command line before it compiles any of them. This pass extracts procedures that meet the requirements for inlining and puts them in a temporary inline library for use by the compilation pass.

Several -⁠Minline suboptions let you determine the selection criteria for procedures to be inlined. These suboptions include:

except:func
Inlines all eligible procedures except func, a procedure in the source text. You can use a comma-separated list to specify multiple procedure.
[name:]func
Inlines all procedures in the source text whose name matches func. You can use a comma-separated list to specify multiple procedures.
[maxsize:]n
A numeric option is assumed to be a size. Procedures of size n or less are inlined, where n loosely corresponds to the number of lines in the procedure. If both n and func are specified, then procedures matching the given name(s) or meeting the size requirements are inlined.
reshape
Fortran subprograms with array arguments are not inlined by default if the array shape does not match the shape in the caller. Use this option to override the default.
smallsize:n
Always inline procedures of size smaller than n regardless of other size limits.
totalsize:n
Stop inlining in a procedure when the procedure's total size with inlining reaches the n specified.
[lib:]file.ext
Instructs the inliner to inline the procedures within the library file file.ext. If no inline library is specified, procedures are extracted from a temporary library created during an extract prepass.
Tip: Create the library file using the -⁠Mextract option.

If you specify both a procedure name and a maxsize n, the compiler inlines procedures that match the procedure name or have n or fewer statements.

If a name is used without a keyword, then a name with a period is assumed to be an inline library and a name without a period is assumed to be a procedure name. If a number is used without a keyword, the number is assumed to be a size.

Inlining can be disabled with -⁠Mnoinline.

In the following example, the compiler inlines procedures with fewer than approximately 100 statements in the source file myprog.f and writes the executable code in the default output file a.out.

   $ nvfortran -Minline=maxsize:100 myprog.f 

4.3. Using an Inline Library

If you specify one or more inline libraries on the command line with the -Minline option, the compiler does not perform an initial extract pass. The compiler selects functions to inline from the specified inline library. If you also specify a size or function name, all functions in the inline library meeting the selection criteria are selected for inline expansion at points in the source text where they are called.

If you do not specify a function name or a size limitation for the -Minline option, the compiler tries to inline every function in the inline library that matches a function in the source text.

In the following example, the compiler inlines the function proc from the inline library lib.il and writes the executable code in the default output file a.out.

$ nvfortran -Minline=name:proc,lib:lib.il myprog.f

The following command line is equivalent to the preceding line, with the exception that in the following example does not use the keywords name: and lib:. You typically use keywords to avoid name conflicts when you use an inline library name that does not contain a period. Otherwise, without the keywords, a period informs the compiler that the file on the command line is an inline library.

$ nvfortran -Minline=proc,lib.il myprog.f

4.4. Creating an Inline Library

You can create or update an inline library using the -⁠Mextract command-line option. If you do not specify selection criteria with the -⁠Mextract option, the compiler attempts to extract all procedures.

Several -⁠Mextract options let you determine the selection criteria for creating or updating an inline library. These selection criteria include:

func
Extracts the procedure func. you can use a comma-separated list to specify multiple procedures.
[name:]func
Extracts the procedure whose name matches func, a procedure in the source text.
[size:]n
Limits the size of the extracted procedures to those with a statement count less than or equal to n, the specified size.
Note: The size n may not exactly equal the number of statements in a selected procedure; the size parameter is merely a rough gauge.
[lib:]ext.lib
Stores the extracted information in the library directory ext.lib.

If no inline library is specified, procedures are extracted to a temporary library created during an extract prepass for use during the compilation stage.

When you use the -⁠Mextract option, only the extract phase is performed; the compile and link phases are not performed. The output of an extract pass is a library of procedures available for inlining. This output is placed in the inline library file specified on the command line with the -⁠o filename specification. If the library file exists, new information is appended to it. If the file does not exist, it is created. You can use a command similar to the following:

$ nvfortran -Mextract=lib:lib.il myfunc.f

You can use the -⁠Minline option with the -⁠Mextract option. In this case, the extracted library of procedures can have other procedures inlined into the library. Using both options enables you to obtain more than one level of inlining. In this situation, if you do not specify a library with the -⁠Minline option, the inline process consists of two extract passes. The first pass is a hidden pass implied by the -⁠Minline option, during which the compiler extracts procedures and places them into a temporary library. The second pass uses the results of the first pass but puts its results into the library that you specify with the -⁠o option.

4.4.1. Working with Inline Libraries

An inline library is implemented as a directory with each inline function in the library stored as a file using an encoded form of the inlinable function.

A special file named TOC in the inline library directory serves as a table of contents for the inline library. This is a printable, ASCII file which you can examine to locate information about the library contents, such as names and sizes of functions, the source file from which they were extracted, the version number of the extractor which created the entry, and so on.

Libraries and their elements can be manipulated using ordinary system commands.

  • Inline libraries can be copied or renamed.
  • Elements of libraries can be deleted or copied from one library to another.
  • The ls or dir command can be used to determine the last-change date of a library entry.

4.4.2. Dependencies

When a library is created or updated using one of the NVIDIA HPC compilers, the last-change date of the library directory is updated. This allows a library to be listed as a dependence in a makefile and ensures that the necessary compilations are performed when a library is changed.

4.4.3. Updating Inline Libraries – Makefiles

If you use inline libraries you must be certain that they remain up-to-date with the source files into which they are inlined. One way to assure inline libraries are updated is to include them in a makefile.

The makefile fragment in the following example assumes the file utils.f contains a number of small functions used in the files parser.f and alloc.f.

This portion of the makefile:

  • Maintains the inline library utils.il.
  • Updates the library whenever you change utils.f or one of the include files it uses.
  • Compiles parser.f and alloc.f whenever you update the library.

Sample Makefile

SRC = mydir
FC = nvfortran
FFLAGS = -O2
main.o: $(SRC)/main.f $(SRC)/global.h
	$(FC) $(FFLAGS) -c $(SRC)/main.f
utils.o: $(SRC)/utils.f $(SRC)/global.h $(SRC)/utils.h
	$(FC) $(FFLAGS) -c $(SRC)/utils.f
utils.il: $(SRC)/utils.f $(SRC)/global.h $(SRC)/utils.h
	$(FC) $(FFLAGS) -Mextract=15 -o utils.il $(SRC)/utils.f
parser.o: $(SRC)/parser.f $(SRC)/global.h utils.il
	$(FC) $(FFLAGS) -Minline=utils.il -c $(SRC)/parser.f
alloc.o: $(SRC)/alloc.f $(SRC)/global.h utils.il
	$(FC) $(FFLAGS) -Minline=utils.il -c $(SRC)/alloc.f 
myprog: main.o utils.o parser.o alloc.o
	$(FC) -o myprog main.o utils.o parser.o alloc.o

4.5. Error Detection during Inlining

You can specify the -⁠Minfo=inline option to request inlining information from the compiler when you invoke the inliner. For example:

$ nvfortran -Minline=mylib.il -Minfo=inline myext.f

4.6. Examples

Assume the program dhry consists of a single source file dhry.f. The following command line builds an executable file for dhry in which proc7 is inlined wherever it is called:

$ nvfortran dhry.f -Minline=proc7

The following command lines build an executable file for dhry in which proc7 plus any functions of approximately 10 or fewer statements are inlined (one level only).

Note: The specified functions are inlined only if they are previously placed in the inline library, temp.il, during the extract phase.
$ nvfortran dhry.f -Mextract=lib:temp.il 
$ nvfortran dhry.f -Minline=10,proc7,temp.il

Using the same source file dhry.f, the following example builds an executable for dhry in which all functions of roughly ten or fewer statements are inlined. Two levels of inlining are performed. This means that if function A calls function B, and B calls C, and both B and C are inlinable, then the version of B which is inlined into A will have had C inlined into it.

$ nvfortran dhry.f -Minline=maxsize:10

4.7. Restrictions on Inlining

The following Fortran subprograms cannot be extracted:

  • Main or BLOCK DATA programs.
  • Subprograms containing alternate return, assigned GO TO, DATA, SAVE, or EQUIVALENCE statements.
  • Subprograms containing FORMAT statements.
  • Subprograms containing multiple entries.

A Fortran subprogram is not inlined if any of the following applies:

  • It is referenced in a statement function.
  • A common block mismatch exists; in other words, the caller must contain all common blocks specified in the callee, and elements of the common blocks must agree in name, order, and type (except that the caller's common block can have additional members appended to the end of the common block).
  • An argument mismatch exists; in other words, the number and type (size) of actual and formal parameters must be equal.
  • A name clash exists, such as a call to subroutine xyz in the extracted subprogram and a variable named xyz in the caller.

The following types of C and C++ functions cannot be inlined:

  • Functions which accept a variable number of arguments

Certain C/C++ functions can only be inlined into the file that contains their definition:

  • Static functions
  • Functions which call a static function
  • Functions which reference a static variable

5. Using GPUs

An NVIDIA GPU can be used as an accelerator to which a CPU can offload data and executable kernels to perform compute-intensive calculations. This section gives an overview of options for programming NVIDIA GPUs with NVIDIA's HPC Compilers and covers topics that affect GPU programming when using one or more of the GPU programming models.

5.1. Overview

With the NVIDIA HPC Compilers you can program NVIDIA GPUs using certain standard language constructs, OpenACC directives, OpenMP directives, or CUDA Fortran language extensions. GPU programming with standard language constructs or directives allows you to create high-level GPU-accelerated programs without the need to explicitly initialize the GPU, manage data or program transfers between the host and GPU, or initiate GPU startup and shutdown. Rather, all of these details are implicit in the programming model and are managed by the NVIDIA HPC SDK Fortran, C⁠+⁠+ and C compilers. GPU programming with CUDA extensions gives you access to all NVIDIA GPU features and full control over data management and offloading of compute-intensive loops and kernels.

The NVC⁠+⁠+ compiler supports automatic offload of C⁠+⁠+17 Parallel Algorithms invocations to NVIDIA GPUs under control of the -stdpar compiler option. See the Blog post Accelerating Standard C⁠+⁠+ with GPUs for details on using this feature. The NVFORTRAN compiler supports automatic offload to NVIDIA GPUs of certain Fortran array intrinsics and patterns of array syntax, including use of Volta and Ampere architecture Tensor Cores for appropriate intrinsics. See the Blog post Bringing Tensor Cores to Standard Fortran for details on using this feature.

The NVFORTRAN compiler supports CUDA programming in Fortran. See the NVIDIA CUDA Fortran Programming Guide for complete details on how to use CUDA Fortran. The NVCC compiler supports CUDA programming in C and C⁠+⁠+ in combination with a host C⁠+⁠+ compiler on your system. See the CUDA C⁠+⁠+ Programming Guide for an introduction and overview of how to use NVCC and CUDA C⁠+⁠+.

The NVFORTRAN, NVC⁠+⁠+ and NVC compilers all support directive-based programming of NVIDIA GPUs using OpenACC. OpenACC is an accelerator programming model that is portable across operating systems and various host CPUs and types of accelerators, including both NVIDIA GPUs and multicore CPUs. OpenACC directives allow a programmer to migrate applications incrementally to accelerator targets using standards-compliant Fortran, C⁠+⁠+ or C that remains completely portable to other compilers and systems. It allows the programmer to augment information available to the compilers, including specification of data local to an accelerator region, guidance on mapping of loops onto an accelerator, and similar performance-related details.

The NVFORTRAN, NVC⁠+⁠+, and NVC compilers support a subset of the OpenMP Application Program Interface for CPUs and GPUs. OpenMP applications properly structured for GPUs, meaning they expose massive parallelism and have relatively little or no synchronization in GPU-side code segments, should compile and execute with performance on par with or close to equivalent OpenACC. Codes that are not well-structured for GPUs may perform poorly but should execute correctly.

In user-directed accelerator programming the user specifies the regions of a host program to be targeted for offloading to an accelerator. The bulk of a user's program, as well as regions containing constructs that are not supported on the targeted accelerator, are executed on the host.

5.2. Terminology

Clear and consistent terminology is important in describing any programming model. This section provides definitions of the terms required for you to effectively use this section and the associated programming model.

Accelerator
a parallel processor, such as a GPU or a CPU running in multicore mode, to which a CPU can offload data and executable kernels to perform compute-intensive calculations.
Compute intensity
for a given loop, region, or program unit, the ratio of the number of arithmetic operations performed on computed data divided by the number of memory transfers required to move that data between two levels of a memory hierarchy.
Compute region
a structured block defined by a compute construct. A compute construct is a structured block containing loops which are compiled for the accelerator. A compute region may require device memory to be allocated and data to be copied from host to device upon region entry, and data to be copied from device to host memory and device memory deallocated upon exit. The dynamic range of a compute construct, including any code in procedures called from within the construct, is the compute region. In this release, compute regions may not contain other compute regions or data regions.
Construct
a structured block identified by the programmer or implicitly defined by the language. Certain actions may occur when program execution reaches the start and end of a construct, such as device memory allocation or data movement between the host and device memory. Loops in a compute construct are targeted for execution on the accelerator. The dynamic range of a construct including any code in procedures called from within the construct, is called a region.
CUDA
stands for Compute Unified Device Architecture; CUDA C++ and Fortran language extensions and API calls can be used to explicitly control and program an NVIDIA GPU.
Data region
a region defined by a data construct, or an implicit data region for a function or subroutine containing directives. Data regions typically require device memory to be allocated and data to be copied from host to device memory upon entry, and data to be copied from device to host memory and device memory deallocated upon exit. Data regions may contain other data regions and compute regions.
Device
a general reference to any type of accelerator.
Device memory
memory attached to an accelerator which is physically separate from the host memory.
Directive
in C, a #pragma, or in Fortran, a specially formatted comment statement that is interpreted by a compiler to augment information about or specify the behavior of the program.
DMA
Direct Memory Access, a method to move data between physically separate memories; this is typically performed by a DMA engine, separate from the host CPU, that can access the host physical memory as well as an IO device or GPU physical memory.
GPU
a Graphics Processing Unit; one type of accelerator device.
Host
the main CPU that in this context has an attached accelerator device. The host CPU controls the program regions and data loaded into and executed on the device.
Loop trip count
the number of times a particular loop executes.
Private data
with respect to an iterative loop, data which is used only during a particular loop iteration. With respect to a more general region of code, data which is used within the region but is not initialized prior to the region and is re-initialized prior to any use after the region.
Region
the dynamic range of a construct, including any procedures invoked from within the construct.
Structured block
in C++ or C, an executable statement, possibly compound, with a single entry at the top and a single exit at the bottom. In Fortran, a block of executable statements with a single entry at the top and a single exit at the bottom.
Vector operation
a single operation or sequence of operations applied uniformly to each element of an array.
Visible device copy
a copy of a variable, array, or subarray allocated in device memory, that is visible to the program unit being compiled.

5.3. Execution Model

The execution model targeted by the NVIDIA HPC Compilers is host-directed execution with an attached accelerator device, such as a GPU. The bulk of a user application executes on the host. Compute intensive regions are offloaded to the accelerator device under control of the host. The accelerator device executes kernels, which may be as simple as a tightly-nested loop, or as complex as a subroutine, depending on the accelerator hardware.

5.3.1. Host Functions

Even in accelerator-targeted regions, the host must orchestrate the execution; it

  • allocates memory on the accelerator device
  • initiates data transfer
  • sends the kernel code to the accelerator
  • passes kernel arguments
  • queues the kernel
  • waits for completion
  • transfers results back to the host
  • deallocates memory
Note: In most cases, the host can queue a sequence of kernels to be executed on the device, one after the other.

5.4. Memory Model

The most significant difference between a host-only program and a host+accelerator program is that the memory on the accelerator can be completely separate from host memory, which is the case on many GPUs. For example:

  • The host cannot read or write accelerator memory directly because it is not mapped into the virtual memory space of the host.
  • All data movement between host memory and accelerator memory must be performed by the host through runtime library calls that explicitly move data between the separate memories.
  • In general it is not valid for the compiler to assume the accelerator can read or write host memory directly. This is well-defined starting with the OpenACC 2.7 and OpenMP 5.0 specifications.

The systems with the latest GPUs provide a unified single address space between CPU and GPU for some or all memory regions, as detailed in the Managed and Unified Memory Modes subsection below. In these systems data can be accessed from host and accelerator subprograms without the need for explicit data movement.

The NVIDIA HPC Compilers support the following system memory modes:

Table 12. GPU Memory Modes
Memory Mode Description Compiler flags
Separate All data accessed in host and accelerator programs are in separate (CPU and GPU) memories. Data in the application need to be physically moved between CPU and GPU memory either by adding explicit annotations or by relying on a compiler to detect and migrate the data. -⁠gpu=mem:separate
Managed Dynamically allocated host data are placed in CUDA Managed Memory which is a unified single address space between host and accelerator programs and can therefore be accessed on device without explicit data movement. All other data (host, stack, or global data) remain in separate memory. -⁠gpu=mem:managed
Unified All host data are placed in a unified single address space between the host and accelerator subprograms; no explicit data movements are required. This mode is intended for targets with full CUDA Unified Memory capability and it may utilize CUDA Managed Memory for dynamic allocations. -⁠gpu=mem:unified

If the memory mode is not selected explicitly by passing one of the above -⁠gpu=mem:* options, the compiler selects a default memory mode. The default memory mode for Stdpar is explained in Using Stdpar. When Stdpar is not enabled, the default memory mode is Separate Memory. Memory modes may have specific semantics in each programming language and the compilers can sometimes implicitly determine the data movement that's required. More details can be found in the subsections of each programming model.

The following options -⁠gpu=[no]managed, -⁠gpu=[no]unified and -⁠gpu=pinned are deprecated but still accepted. Refer to Command-line Options Selecting Compiler Memory Modes for compatibility between the current and deprecated memory specific flags.

The compiler implicitly defines the following macros corresponding to the memory mode it compiles for:
  • When the code is compiled for Separate Memory Mode, the compiler defines __NVCOMPILER_GPU_SEPARATE_MEM macro.
  • When the code is compiled for Managed Memory Mode, the compiler defines __NVCOMPILER_GPU_MANAGED_MEM macro.
  • When the code is compiled for Unified Memory Mode, the compiler defines __NVCOMPILER_GPU_UNIFIED_MEM macro. If CUDA Managed Memory is utilised, the compiler defines additionally __NVCOMPILER_GPU_MANAGED_MEM.
When a binary is compiled for one memory mode it may need to be run on a system with specific memory capabilities as follows:
  • Applications compiled for Separate Memory Mode can run on any CUDA platforms.
  • Applications compiled for Managed Memory Mode must be run on platforms with CUDA Managed Memory or full CUDA Unified Memory capabilities.
  • Applications compiled for Unified Memory Mode must be run on platforms with full CUDA Unified Memory.
Note: Memory allocated in the accelerator subprogram can’t be accessed or deallocated from the host.

5.4.1. Separate Host and Accelerator Memory Considerations

The programmer must be aware of the potentially separate memories for many reasons, including but not limited to:

  • Memory bandwidth between host memory and accelerator memory determines the compute intensity required to effectively accelerate a given region of code.
  • Limited size of accelerator memory may prohibit offloading of regions of code that operate on very large amounts of data.

5.4.1.1. Accelerator Memory

On the accelerator side, current GPUs implement a weak memory model. In particular, they do not support memory coherence between threads unless those threads are parallel only at the synchronous level and the memory operations are separated by an explicit barrier. Otherwise, if one thread updates a memory location and another reads the same location, or two threads store a value to the same location, the hardware does not guarantee the results. While the results of running such a program might be inconsistent, it is not accurate to say that the results are incorrect. By definition, such programs are defined as being in error. While a compiler can detect some potential errors of this nature, it is nonetheless possible to write an accelerator region that produces inconsistent numerical results.

Stack data in accelerator subprograms are allocated per thread. Stack data from one thread are not accessible by the other threads.

5.4.1.2. Staging Memory Buffer

Memory transfers between the accelerator and host may not always be asynchronous with respect to the host, even if the chosen programming model (for instance, OpenACC) declares that. This limitation may be due to the specific GPU and host memory architectures.

In order to help the host program proceed while a memory transfer to or from the accelerator is underway, the NVIDIA HPC Compilers Runtime maintains a designated staging memory area, also known as a pinned buffer. This memory area is registered with the CUDA API, which makes it suitable for asynchronous memory transfers between the GPU and the host. When an asynchronous memory transfer is started, the data being transferred is staged through the pinned buffer. Multiple asynchronous operations on the same data can be issued - in that case, the runtime system will operate on the data staged in the pinned buffer, not on the original host memory. When the host program issues an explicit or implicit synchronization request, the data is moved from the pinned buffer to its destination transparently to the application.

The runtime has the discretion to enable or disable the pinned buffer depending on the host and GPU memory architecture. Also, the size of the pinned buffer is determined by the runtime system as appropriate. The user can control some of these decisions using environment variables at the start of the application. Please refer to Environment Variables Controlling Device Memory Management to learn more.

5.4.1.3. Cache Management

Some current GPUs have a software-managed cache, some have hardware-managed caches, and most have hardware caches that can be used only in certain situations and are limited to read-only data. In low-level programming models such as CUDA, it is up to the programmer to manage these caches. The OpenACC programming model provides directives the programmer can use as hints to the compiler for cache management.

5.4.1.4. Environment Variables Controlling Device Memory Management

This section summarizes the environment variables that NVIDIA HPC Compilers use to control device memory management.

The following table contains the environment variables that are currently supported and provides a brief description of each.

Table 13. Memory Management Environment Variables
Environment Variable Use
NVCOMPILER_ACC_BUFFERSIZE For NVIDIA CUDA devices, this defines the size of the pinned buffer used to transfer data between host and device.
NVCOMPILER_ACC_CUDA_CTX_SCHED For NVIDIA CUDA devices, sets flags to be used when creating a new CUDA context. By default, the CU_CTX_SCHED_YIELD flag is used. Please refer to the CUDA Toolkit Documentation for the detailed description of the cuCtxCreate function and the possible flag values.
NVCOMPILER_ACC_CUDA_HEAPSIZE For NVIDIA CUDA devices, sets the heap size limit for malloc() when called on device.
NVCOMPILER_ACC_CUDA_MAX_L2_FETCH_GRANULARITY For NVIDIA CUDA devices, sets the maximum L2 cache fetch granularity size in bytes. A correct value is an integer between 0 and 128.
NVCOMPILER_ACC_CUDA_MEMALLOCASYNC For NVIDIA CUDA devices, when set to a non-zero integer value, enables CUDA asynchronous memory allocations from the default CUDA memory pool as descibed in the CUDA Toolkit Documentation. By default, an internal NVIDIA HPC Runtime memory pool is used instead.
NVCOMPILER_ACC_CUDA_MEMALLOCASYNC_POOLSIZE For NVIDIA CUDA devices, sets the size of the default CUDA memory pool for asynchronous allocations if the NVCOMPILER_ACC_CUDA_MEMALLOCASYNC environment variable is also set to a non-zero integer value.
NVCOMPILER_ACC_CUDA_NOCOPY Disables the use of the pinned buffer when transferring user data between host and NVIDIA CUDA devices. When this variable is set to a non-zero integer value, user data will be transferred directly bypassing the pinned buffer. Asynchronous execution of such data transfers can be limited when this setting is in effect.
NVCOMPILER_ACC_CUDA_PIN For NVIDIA CUDA devices, enables host memory pinning at data directives. When host memory is pinned, data transfers to and from the device can be asynchronous, which can potentially improve program performance. A non-zero integer value enables this mechanism. A value of 2 or greater additionally disallows unpinning the host data after it is pinned. A value of 3 or greater also enables pinning the whole array referenced in a data directive (provided that the size of the array is known), rather than its subarray specified in the data directive. By default, host data referenced at data directives is not pinned unless directed by the compiler at compile-time; refer to Command-line Options Selecting Compiler Memory Modes for more information about the compile-time memory modes.
NVCOMPILER_ACC_CUDA_PINSIZE For NVIDIA CUDA devices, sets the host memory pinning granularity. If host memory pinning is enabled with the NVCOMPILER_ACC_CUDA_PIN environment variable, the runtime will attempt to use this setting to pin larger regions of memory at once, thus potentially lowering the cost of pinning memory when the program needs to pin multiple data regions separately. The maximum allowed value is 1 MB. By default, single byte pinning granularity is used.
NVCOMPILER_ACC_CUDA_PRINTFIFOSIZE For NVIDIA CUDA devices, sets the buffer size for formatted output calls on device. In particular, it controls the buffer size for the printf C function.
NVCOMPILER_ACC_CUDA_STACKSIZE For NVIDIA CUDA devices, sets the stack size limit for device threads.
NVCOMPILER_ACC_DEV_MEMORY For NVIDIA CUDA devices, when set to a valid non-zero size value, enables the use of a device memory pool and sets its size. By default, the device memory pool is not used.
NVCOMPILER_ACC_MEM_MANAGE For NVIDIA CUDA devices, when set to the integer value 0, disables the use of an internal device memory manager. By default, the device memory manager is enabled. It maintains a list of deallocated chunks of device memory in an attempt to efficiently reuse them for future allocations.

5.4.2. Managed and Unified Memory Modes

The NVIDIA HPC Compilers support interoperability with CUDA Unified Memory. This feature is available with the x86-64, OpenPOWER and Arm Server compilers. Unified memory provides a single address space for CPU and GPU; data movement between CPU and GPU memories is implicitly handled by the NVIDIA CUDA driver.

Whenever data is accessed on the CPU or the GPU, it could trigger a data transfer if the last time it was accessed was not on the same device. In some cases, page thrashing may occur and impact performance. An introduction to CUDA Unified Memory is available on Parallel Forall.

5.4.2.1. Managed Memory Mode

In Managed Memory Mode, all Fortran, C⁠+⁠+ and C explicit allocation statements (e.g. allocate, new, and malloc, respectively) in a program unit are replaced by equivalent CUDA managed data allocation calls that place the data in CUDA Managed Memory. The result is that OpenACC and OpenMP data clauses and directives are not needed to manage data movement. They are essentially ignored and can be omitted. For Stdpar this is the minimal required memory mode since there are no specific annotations for data used in the parallel region.

To enable Managed Memory Mode, add the option -⁠gpu=mem:managed to the compiler and linker command lines.

When a program allocates managed memory, it allocates host pinned memory as well as device memory thus making allocate and free operations somewhat more expensive and data transfers somewhat faster. A memory pool allocator is used to mitigate the overhead of the allocate and deallocate operations. More details can be found in Memory Pool Allocator.

Managed Memory Mode has the following limitations:

  • Use of managed memory applies only to dynamically-allocated data.
  • Given an allocatable aggregate with a member that points to local, global, or static data, compiling with -⁠gpu=mem:managed and attempting to access memory through that pointer from the compute kernel will cause a failure at runtime.
  • C⁠+⁠+ virtual functions are not supported.
  • The -⁠gpu=mem:managed compiler option must be used to compile the files in which variables (accessed from GPU) are allocated, even if there is no code to accelerate on the GPU in the source file.
  • When linking multiple translation units, the application must ensure that all data are deallocated using the scheme corresponding to their allocation. For example if the data are allocated in managed memory the deallocation must be performed using CUDA API calls for managed memory. More details and extra compiler support is detailed in Interception of Deallocations.

Managed Memory Mode has the following additional limitations when used with NVIDIA Kepler GPUs:

  • Data motion on Kepler GPUs is achieved through fast pinned asynchronous data transfers; from the program's perspective, however, the transfers are synchronous.
  • The NVIDIA HPC Compiler Runtime enforces synchronous execution of kernels when -⁠gpu=mem:managed is used on a system with a Kepler GPU. This situation may result in slower performance because of the extra synchronizations and decreased overlap between CPU and GPU.
  • The total amount of managed memory is limited to the amount of available device memory on Kepler GPUs.
Memory Allocations/Deallocations Automatically Changed to Managed Memory

When the compiler utilizes CUDA Managed Memory capability either with -⁠gpu=mem:managed or -⁠gpu=mem:unified, the following explicit allocations/deallocations are automatically changed into cudaMallocManaged/cudaFree-type allocations/deallocations:

  • For C++:
    • All calls to global operator new and operator delete that allocate or deallocate memory, such as:
      operator new(std::size_t size)
      operator new(std::size_t size, const std::nothrow_t &nothrow_value)
      operator new(std::size_t size, std::align_val_t align)
      operator new(std::size_t size, std::align_val_t align, const std::nothrow_t &nothrow_value)
      operator delete(void *p)
      operator delete(void *p, std::size_t size)
      operator delete(void *p, std::align_val_t align)
      operator delete(void *p, std::size_t size, std::align_val_t align)
      operator delete(void *p, const std::nothrow_t &nothrow_value)
      operator delete(void *p, std::align_val_t align, const std::nothrow_t &nothrow_value)
      
    • All the array forms of the above overloads.
    • All calls to malloc/free functions.
  • For C: all calls to malloc/free functions.
  • For Fortran:
    • All allocations of automatic arrays.
    • all allocate/deallocate statements with allocatable arrays or pointer variables.

5.4.2.2. Unified Memory Mode

In Unified Memory Mode, the requirements for the program are further relaxed compared to Managed Memory Mode. Specifically, not only is dynamically allocated system memory accessible on the GPU, but global and local memory are also accessible.

To enable this feature, add the option -⁠gpu=mem:unified to the compiler and linker command lines.

Programs compiled with -⁠gpu=mem:unified must be run on systems that support full CUDA Unified Memory capability. At this time, full CUDA Unified Memory is supported on NVIDIA Grace Hopper Superchip systems and Linux x86-64 systems running with the Heterogeneous Memory Management (HMM) feature enabled in the Linux kernel. Details about these platforms are available in the following blog posts on the NVIDIA website: Simplifying GPU Programming for HPC with NVIDIA Grace Hopper Superchip and Simplifying GPU Application Development with Heterogeneous Memory Management.

In Unified Memory Mode, the compiler assumes that any system memory is accessible on the GPU. Even so, the compiler may generate managed memory allocations for explicit data allocations when it considers them beneficial for program performance. If you would like to enforce or prohibit the use of managed memory for dynamic allocations pass -⁠gpu=mem:unified:[no]managedalloc to compilation and linking.

Unified Memory Mode has the following limitations:

  • Unified memory support for OpenACC, OpenMP and Stdpar Fortran is not mix-and-match; all object files containing OpenACC/OpenMP directives or Fortran DO CONCURRENT constructs must be compiled and linked with -⁠gpu=mem:unified to ensure correct execution.
  • C⁠+⁠+ virtual functions are not supported.
Transitioning to Unified Memory Mode

Applications transitioning to architectures that support Unified Memory Mode can be recompiled with -⁠gpu=mem:unified without any code modifications.

The programmer should be aware that in Unified Memory Mode, the whole program state becomes essentially shared between the CPU and the GPU. By implication, modifications to program variables made on the GPU are visible on the CPU. That is, the GPU does not operate on a copy of the data even if the program contains respective directives, but instead the GPU operates directly on the data in system memory. To understand the importance of this idea, consider the following OpenACC C program:

int x[N];
void foo() {
  #pragma acc enter data create(x[0:N])
  #pragma acc parallel loop
  for (int i = 0; i < N; i++) {
    x[i] = i;
  }
}
      

When compiled in Separate Memory Mode, in the foo() function a copy of the array x is created in GPU memory and initialized as written in the loop construct. When -⁠gpu=mem:unified is added, however, the compiler ignores the acc enter data construct, and the loop construct initializes the array x in system memory.

Another implication of which to be aware, asynchronous code execution on the GPU can introduce race conditions over access to program data. More details about code patterns to avoid when writing application sources for Unified Memory Mode can be found in the sections about specific programming models of this guide e.g. OpenACC, OpenMP, or CUDA Fortran.

5.4.3. Memory Pool Allocator

Dynamic memory allocations may be made using cudaMallocManaged(), a routine which has higher overhead than allocating non-managed memory using cudaMalloc(). The more calls to cudaMallocManaged(), the more significant the impact on performance.

To mitigate the overhead of cudaMallocManaged() or other CUDA allocation API calls, there is a pool allocator enabled by default in the presence of the -⁠gpu=mem:managed, -⁠gpu=mem:separate:pinnedalloc, or -⁠gpu=mem:unified compiler options. It can be disabled, or its behavior modified, using these environment variables:

Table 14. Pool Allocator Environment Variables
Environment Variable Use
NVCOMPILER_ACC_POOL_ALLOC Disable the pool allocator. The pool allocator is enabled by default; to disable it, set NVCOMPILER_ACC_POOL_ALLOC to 0.
NVCOMPILER_ACC_POOL_SIZE Set the of the pool. The default size is 1GB but other sizes (i.e., 2GB, 100MB, 500KB, etc.) can be used. The actual pool size is set such that the size is the nearest, smaller number in the Fibonacci series compared to the provided or default size. If necessary, the pool allocator will add more pools but only up to the NVCOMPILER_ACC_POOL_THRESHOLD value.
NVCOMPILER_ACC_POOL_ALLOC_MAXSIZE Set the maximum size for allocations. The default maximum size for allocations is 500MB but another size (i.e., 100KB, 10MB, 250MB, etc.) can be used as long as it is greater than or equal to 16B.
NVCOMPILER_ACC_POOL_ALLOC_MINSIZE Set the minimum size for allocation blocks. The default size is 128B but other sizes can be used. The size must be greater than or equal to 16B.
NVCOMPILER_ACC_POOL_THRESHOLD Set the percentage of total device memory that the pool allocator can occupy. Values from 0 to 100 are accepted. The default value is 50, corresponding to 50% of device memory.
Note: Note that where the size is specified if the unit suffix (B, KB, MB or GB) is ommited, the value is set by default in bytes.

5.4.4. Interception of Deallocations

While NVIDIA HPC Compilers facilitate the use of managed or pinned memory automatically, the application must ensure that memory is deallocated using the API which "matches" the API used to allocate said memory. For example, if cudaMallocManaged is used to allocate, then cudaFree must be used to deallocate; if cudaMallocHost is used for allocations, cudaFreeHost must be used for deallocations. Understanding this requirement is particularly important when third party or standard libraries are used; these libraries may have been compiled without any memory mode settings which sets up a situation where the deallocation routines in the libraries may not match the allocations made. When data is deallocated with an unmatching API call, the application may exhibit undefined behavior including crashing. To mitigate this issue, the compiler supports an interception mode in which calls to the standard deallocation function (e.g. free in C, delete in C++, or deallocate in Fortran) are inspected by the runtime and, if the memory is not detected as being system-allocated, the runtime replaces the standard deallocation function with the deallocation API corresponding to the allocation scheme in use. To activate this interception mode, use the -⁠gpu=interceptdeallocations compiler flag. The interception is enabled by default for Stdpar in the presence of managed memory allocations. To deactivate the interception use the -⁠gpu=nointerceptdeallocations compiler switch. This interception can incur extra runtime overhead.

5.4.5. Command-line Options Selecting Compiler Memory Modes

The following table maps the new memory model flags to their deprecated equivalents.

Table 15. Command-line Options Corresponding to Compiler Memory Modes
Current Flags Deprecated Flags Brief Description
-⁠gpu=mem:managed -⁠gpu=managed Managed Memory Mode
-⁠gpu=mem:managed -stdpar -⁠gpu=nounified -stdpar Managed Memory Mode
-⁠gpu=mem:unified -⁠gpu=unified Unified Memory Mode
-⁠gpu=mem:unified:managedalloc -⁠gpu=unified,managed Unified Memory Mode, all dynamically allocated data are implicitly in CUDA Managed Memory.
-⁠gpu=mem:unified:nomanagedalloc -⁠gpu=unified,nomanaged Unified Memory Mode, CUDA Managed Memory is not used implicitly.
-⁠gpu=mem:separate -⁠gpu=nomanaged Separate Memory Mode
-⁠gpu=mem:separate -⁠gpu=nounified Separate Memory Mode
-⁠gpu=mem:separate -⁠gpu=nomanaged,nounified Separate Memory Mode
-⁠gpu=mem:separate:pinnedalloc -⁠gpu=pinned Separate Memory Mode, dynamically allocated data are in CPU pinned memory implicitly.

5.5. Fortran pointers in device code

A Fortran pointer variable is implemented with a pointer and a descriptor, where the descriptor (often called a "dope vector") holds the array bounds and strides for each dimension, among other information, such as the size for each element and whether the pointer is associated. A Fortran scalar pointer has no bounds information, but does have a minimal descriptor. In Fortran, referring to the pointer variable always refers to the pointer target. There is no syntax to explicitly refer to the pointer and descriptor that implement the pointer variable.

Fortran allocatable arrays and variables are implemented much the same way as pointer arrays and variables. Much of the discussion below applies both to allocatables and pointers.

In OpenACC and OpenMP, when a pointer variable reference appears in a data clause, it's the pointer target that gets allocated or moved to device memory. The pointer and descriptor are neither allocated nor moved.

When a pointer variable is declared in a module declaration section and appears in an !$acc declare create() or !$omp declare target to() directive, then the pointer and descriptor are statically allocated in device memory. When the pointer variable appears in a data clause, the pointer target is allocated or copied to the device, and the pointer and descriptor are 'attached' to the device copy of the data. If the pointer target is already present in device memory, no new memory is allocated or copied, but the pointer and descriptor are still 'attached', making the pointer valid in device memory. An important side effect of adding declare create in the module declaration section is that when the program executes an 'allocate' statement for the pointer (or allocatable), memory is allocated in both CPU and device memory. This means the newly allocated data is already present in device memory. To get values from CPU to device memory or back, you'll have to use update directives.

When a pointer variable is used in an OpenACC or OpenMP compute construct, the compiler creates a private copy of the pointer and descriptor for each thread, unless the pointer variable was in a module as described above. The private pointer and descriptor will contain information about the device copy of the pointer target. In the compute construct, the pointer variables may be used pretty much as they can in host code outside a compute construct. However, there are some limitations. The program can do a pointer assignment to the pointer, changing the pointer, but that will only change the private pointer for that thread. The modified pointer in the compute construct will not change the corresponding pointer and descriptor in host memory.

5.6. Calling routines in a compute kernel

Using explicit interfaces is a common occurrence when writing Fortran applications. Here are some cases where doing so is required for GPU programming.

  • Explicit interfaces are required when using OpenACC routine bind or OpenMP declare variant.
  • Fortran do concurrent requires routines to be pure which creates the need for an explicit interface.

5.7. Supported Processors and GPUs

This NVIDIA HPC Compilers release supports x86-64, OpenPOWER and Arm Server CPUs. Cross-compilation across the different families of CPUs is not supported, but you can use the -⁠tp=<target> flag as documented in the man pages to specify a target processor within a family.

To direct the compilers to generate code for NVIDIA GPUs, use the -⁠acc flag to enable OpenACC directives, the -⁠mp=gpu flag to enable OpenMP directives, the -⁠stdpar flag for standard language parallelism, and the -cuda flag for CUDA Fortran. Use the -⁠gpu flag to select specific options for GPU code generation. You can then use the generated code on any supported system with CUDA installed that has a CUDA-enabled GeForce, Quadro, or Tesla card.

For more information on these flags as they relate to accelerator technology, refer to Compiling an OpenACC Program.

For a complete list of supported CUDA GPUs, refer to the NVIDIA website at: http://www.nvidia.com/object/cuda_learn_products.html

5.8. CUDA Versions

The NVIDIA HPC compilers use components from NVIDIA's CUDA Toolkit to build programs for execution on an NVIDIA GPU. The NVIDIA HPC SDK puts the CUDA Toolkit components into an HPC SDK installation sub-directory; the HPC SDK currently bundles two versions of recently-released Toolkits.

You can compile a program for an NVIDIA GPU on any system supported by the HPC compilers. You will be able to run that program only on a system with an NVIDIA GPU and an installed NVIDIA CUDA driver. NVIDIA HPC SDK products do not contain CUDA device drivers. You must download and install the appropriate CUDA Driver from NVIDIA.

The NVIDIA HPC SDK utility nvaccelinfo prints the driver version as its first line of output. You can use it to find out which version of the CUDA Driver is installed on your system.

The NVIDIA HPC SDK 24.11 includes components from the following versions of the CUDA Toolkit:
  • CUDA 11.8
  • CUDA 12.4

If you are compiling a program for GPU execution on a system without an installed CUDA driver, the compiler selects the version of the CUDA Toolkit to use based on the value of the DEFCUDAVERSION variable contained in a file called localrc which is created during installation of the HPC SDK.

If you are compiling a program for GPU execution on a system with an installed CUDA driver, the compiler detects the version of the CUDA driver and selects the appropriate CUDA Toolkit version to use from those bundled with the HPC SDK.

The compilers look for a CUDA Toolkit version in the /opt/nvidia/hpc_sdk/target/24.11/cuda directory that matches the version of the CUDA Driver installed on the system. If an exact match is not found, the compiler searches for the closest match. For CUDA Driver versions 11.2 through 11.8, the compiler will use the CUDA 11.8 Toolkit. For CUDA Driver versions 12.0 and later, the compiler will use the newest CUDA 12.x Toolkit.

You can change the compiler's default selection of CUDA Toolkit version using a compiler option. Add the cudaX.Y sub-option to -⁠gpu where X.Y denotes the CUDA version. Using a compiler option changes the CUDA Toolkit version for one invocation of the compiler. For example, to compile an OpenACC C file with the CUDA 11.8 Toolkit you would use:
nvc -acc -gpu=cuda11.8

5.9. Compute Capability

The compilers can generate code for NVIDIA GPU compute capabilities 3.5 through 8.6. The compilers construct a default list of compute capabilities that matches the compute capabilities supported by the GPUs found on the system used in compilation. If there are no GPUs detected, the compilers generate code for every supported compute capability.

You can override the default by specifying one or more compute capabilities using either command-line options or an rcfile.

To change the default with a command-line option, provide a comma-separated list of compute capabilities to the -⁠gpu option.

To change the default with an rcfile, set the DEFCOMPUTECAP value to a blank-separated list of compute capabilities in the siterc file located in your installation's bin directory:

set DEFCOMPUTECAP=60 70;

Alternatively, if you don't have permissions to change the siterc file, you can add the DEFCOMPUTECAP definition to a separate .mynvrc file in your home directory.

The generation of device code can be time consuming, so you may notice an increase in compile time as the number of compute capabilities increases.

5.10. PTX JIT Compilation

As of HPC SDK 22.9, support for PTX JIT compilation is enabled in all compilers for relocatable device code mode. This means that applications built with -gpu=rdc (that is, with relocatable device code enabled, which is the default mode) are forward-compatible with newer GPUs thanks to the embedded PTX code. The embedded PTX code is dynamically compiled when the application runs on a GPU architecture newer than the architecture specified at compile time.

The support for PTX JIT compilation is enabled automatically, which means that you do not need to change the compiler invocation command lines for your existing projects.

Use scenarios

  • As an example, you can compile your application targeting the Ampere GPU without having to worry about the Hopper GPU architecture. Once the application runs on a Hopper GPU, it will seamlessly use the embedded PTX code.
  • In CUDA Fortran, or with the CUDA Interoperability mode enabled, you can mix in object files compiled with the CUDA NVCC compiler containing PTX code. This PTX code from NVCC will be handled by the JIT compiler alongside the PTX code contained in object files produced by the HPC SDK compilers. When using the CUDA NVCC compiler, the relocatable device code generation must be enabled explicitly using the NVCC --relocatable-device-code true switch, as explained in the CUDA Compiler Driver guide. For information about CUDA Interoperability, please refer to https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html#openmp-interop-cuda. The CUDA Fortran Programming Guide is available here: https://docs.nvidia.com/hpc-sdk/compilers/cuda-fortran-prog-guide.

By default, the compiler will choose the compute capability that matches the GPU on the system where the code is being compiled. For code that is going to run on the system where it is compiled, we recommend letting the compiler set the compute capability.

When the default won’t work, we recommend compiling applications for a range of compute capabilities that the application is expected to run against, for example, using the -gpu=ccall compiler option. When running the application on a system that supports one of those compute capabilities, the CUDA driver minor version is allowed to be less than the version of the CUDA toolkit used at compile time, as covered in section CUDA Versions.

Performance considerations

PTX JIT compilation, when it occurs, can have a start-up overhead for the application. The JIT compiler keeps a cached copy of the produced device code, which reduces the overhead on subsequent runs. Please refer to the CUDA Programming Guide for detailed information about how the JIT compiler works.

Known limitations

In general, in order for PTX JIT compilation to work, the CUDA driver installed on the deployment system must be at least of the version that matches the CUDA toolkit used to compile the application. This requirement is stricter than those explained in section CUDA Versions.

For example, as explained in that section, the compilers will use the CUDA 11.8 toolkit that is shipped as part of the HPC SDK toolkit when the CUDA driver installed in the system is at least 11.2. However, while the CUDA 11.2 driver is commonly sufficient to run the application, it will not be able to compile the PTX code produced by the CUDA 11.8 toolkit. This means that any deployment system where the PTX JIT compilation is expected to be used must have at least the CUDA 11.8 driver installed. Please refer to the CUDA Compatibility guide for further information about the CUDA Driver compatibility with CUDA Toolkits.

When the application is expected to run on a newer GPU architecture than specified at compile time, we recommend having a CUDA driver installed on the deployment system matching the CUDA toolkit used to build the application. One way to achieve that is to use the NVHPC_CUDA_HOME environment variable at compile time to provide a specific CUDA toolkit.

Below are a few examples of how the PTX version incompatibility can be diagnosed and fixed. As a general rule, if the CUDA driver is unable to run the application due to incompatible PTX, the application will terminate with an error message indicating the cause. OpenACC and OpenMP applications will in most cases suggest compiler flags to target the current CUDA installation.

OpenACC

Consider this program that we will compile for Volta GPU and attempt to run on an Ampere GPU, on a system that has CUDA 11.5 installed:

#include <stdio.h>
#define N 1000
int array[N];
int main() {
#pragma acc parallel loop copy(array[0:N])
   for(int i = 0; i < N; i++) {
      array[i] = 3.0;
   }
   printf("Success!\n");
}
    
When we build the program, HPC SDK will choose the CUDA 11.8 toolkit that is included as the default. When we attempt to run it, it fails because code generated with 11.8 does not work with the 11.5 driver:
$ nvc -acc -gpu=cc70 app.c
$ ./a.out
Accelerator Fatal Error: This file was compiled: -acc=gpu -gpu=cc70
Rebuild this file with -gpu=cc80 to use NVIDIA Tesla GPU 0
 File: /tmp/app.c
 Function: main:3
 Line: 3
    
From the error message it follows that the system is unable to execute the Volta GPU instructions on the current system. The embedded Volta PTX could not be compiled, which implies a CUDA driver incompatibility. A way to fix this is to use the installed CUDA 11.5 toolkit at compile time:
$ export NVHPC_CUDA_HOME=/usr/local/cuda-11.5
$ nvc -acc -gpu=cc70 app.c
$ ./a.out
Success!
    

OpenMP

Likewise, an OpenMP program will compile but not run:
#include <stdio.h>
#define N 1000
int array[N];
int main() {
#pragma omp target loop
   for(int i = 0; i < N; i++) {
      array[i] = 0;

   }
   printf("Success!\n");
}
    
$ nvc -mp=gpu -gpu=cc70 app.c
$ ./a.out
Accelerator Fatal Error: Failed to find device function 'nvkernel_main_F1L3_2'! File was compiled with: -gpu=cc70
Rebuild this file with -gpu=cc80 to use NVIDIA Tesla GPU 0
 File: /tmp/app.c
 Function: main:3
 Line: 3
    

We can also fix it by having NVHPC_CUDA_HOME point at the matching CUDA toolkit location:

$ export NVHPC_CUDA_HOME=/usr/local/cuda-11.5
$ nvc -acc -gpu=cc70 app.c
$ ./a.out
Success!
    

C++

In contrast to OpenACC and OpenMP applications that simply terminate when PTX JIT encounters an insufficient CUDA driver version, C++ applications throw a system exception when there is a PTX incompatibility:

#include <vector>
#include <algorithm>
#include <execution>
#include <iostream>
#include <assert.h>
int main() {
  std::vector<int> x(1000, 0);
  x[1] = -20;
  auto result = std::count(std::execution::par, x.begin(), x.end(), -20);
  assert(result == 1);
  std::cout << "Success!" << std::endl;
}
    
$ nvc++ -stdpar -gpu=cc70 app.cpp
$ ./a.out
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  after reduction step 1: cudaErrorUnsupportedPtxVersion: the provided PTX was compiled with an unsupported toolchain.
Aborted (core dumped)
    

The exception message contains a direct reference to an incompatible PTX, which in turn implies an mismatch between the CUDA toolkit and the CUDA driver version.

We can fix it similarly by setting NVHPC_CUDA_HOME:

$ export NVHPC_CUDA_HOME=/usr/local/cuda-11.5
$ nvc++ -stdpar -gpu=cc70 app.cpp
$ ./a.out
Success!
    

6. Using OpenACC

This chapter gives an overview of directive-based OpenACC programming in which compiler directives are used to specify regions of code in Fortran, C and C++ programs to be offloaded from a host CPU to an NVIDIA GPU. For complete details on using OpenACC with NVIDIA GPUs, see the OpenACC Getting Started Guide.

6.1. OpenACC Programming Model

With the emergence of GPU architectures in high performance computing, programmers want the ability to program using a familiar, high level programming model that provides both high performance and portability to a wide range of computing architectures. OpenACC emerged in 2011 as a programming model that uses high-level compiler directives to expose parallelism in the code and parallelizing compilers to build the code for a variety of parallel accelerators.

This chapter will not attempt to describe OpenACC itself. For that, please refer to the OpenACC specification on the OpenACC www.openacc.org website. Here, we will discuss differences between the OpenACC specification and its implementation by the NVIDIA HPC Compilers.

Other resources to help you with your parallel programming including video tutorials, course materials, code samples, a best practices guide and more are available on the OpenACC website.

6.1.1. Levels of Parallelism

OpenACC supports three levels of parallelism:

  • an outer doall (fully parallel) loop level
  • a workgroup or threadblock (worker parallel) loop level
  • an inner synchronous (SIMD or vector) loop level

Each level can be multidimensional with 2 or 3 dimensions, but the domain must be strictly rectangular. The synchronous level may not be fully implemented with SIMD or vector operations, so explicit synchronization is supported and required across this level. No synchronization is supported between parallel threads across the doall level.

The OpenACC execution model on the device side exposes these levels of parallelism and the programmer is required to understand the difference between, for example, a fully parallel loop and a loop that is vectorizable but requires synchronization across iterations. All fully parallel loops can be scheduled for any of doall, workgroup or synchronous parallel execution, but by definition SIMD vector loops that require synchronization can only be scheduled for synchronous parallel execution.

6.1.2. Enable OpenACC Directives

NVIDIA HPC compilers enable OpenACC directives with the -acc and -gpu command line options. For more information on these options refer to Compiling an OpenACC Program.

_OPENACC macro

The _OPENACC macro name is defined to have a value yyyymm where yyyy is the year and mm is the month designation of the version of the OpenACC directives supported by the implementation. For example, the version for November, 2017 is 201711. All OpenACC compilers define this macro when OpenACC directives are enabled.

6.1.3. OpenACC Support

The NVIDIA HPC Compilers implement most features of OpenACC 2.7 as defined in The OpenACC Application Programming Interface, Version 2.7, November 2018, http://www.openacc.org, with the exception that the following OpenACC 2.7 features are not supported:

  • nested parallelism
  • declare link
  • enforcement of the cache clause restriction that all references to listed variables must lie within the region being cached
  • Subarrays and composite variables in reduction clauses
  • The self clause
  • The default clause on data constructs

6.1.4. OpenACC Extensions

The NVIDIA Fortran compiler supports an extension to the collapse clause on the loop construct. The OpenACC specification defines collapse:

collapse(n)

NVIDIA Fortran supports the use of the identifier force within collapse:

collapse(force:n)

Using collapse(force:n) instructs the compiler to enforce collapsing parallel loops that are not perfectly nested.

6.2. Compiling an OpenACC Program

Several compiler options are applicable specifically when working with OpenACC. These options include -acc, -gpu, and -Minfo.

6.2.1. -[no]acc

Enable [disable] OpenACC directives. The following suboptions may be used following an equals sign ("="), with multiple sub-options separated by commas:

gpu
OpenACC directives are compiled for GPU execution only.
host
Compile for serial execution on the host CPU.
multicore
Compile for parallel execution on the host CPU.
legacy
Suppress warnings about deprecated NVIDIA accelerator directives.
[no]autopar
Enable [disable] loop autoparallelization within acc parallel. The default is to autoparallelize, that is, to enable loop autoparallelization.
[no]routineseq
Compile every routine for the devicee. The default behavior is to not treat every routine as a seq directive.
strict
Instructs the compiler to issue warnings for non-OpenACC accelerator directives.
sync
Ignore async clauses
verystrict
Instructs the compiler to fail with an error for any non-OpenACC accelerator directive.
[no]wait
Wait for each device kernel to finish. Kernel launching is blocked by default unless the async clause is used.

Default

By default OpenACC directives are compiled for GPU and sequential CPU host execution (i.e. equivalent to explicitly setting -acc=gpu,host).

Usage

The following command-line requests that OpenACC directives be enabled and that an error be issued for any non-OpenACC accelerator directive.

$ nvfortran -acc=verystrict prog.f

Predefined Macros

The following macros corresponding to the target compiled for are added implicitly:
  • __NVCOMPILER_OPENACC_GPU when the OpenACC directives are compiled for GPU.
  • __NVCOMPILER_OPENACC_MULTICORE when the OpenACC directives are compiled for multicore CPU.
  • __NVCOMPILER_OPENACC_HOST when the OpenACC directives are compiled for serial execution on CPU.

6.2.2. -gpu

Used in combination with the -⁠acc, -⁠cuda, -⁠mp, and -⁠stdpar flags to specify options for GPU code generation. The following sub-options may be used following an equals sign ("="), with multiple sub-options separated by commas:

autocompare
Automatically compare CPU vs GPU results at execution time: implies redundant
ccXY
Generate code for a device with compute capability X.Y. Multiple compute capabilities can be specified, and one version will be generated for each. By default, the compiler will detect the compute capability for each installed GPU. Use -⁠help -⁠gpu to see the valid compute capabilities for your installation.
ccall
Generate code for all compute capabilities supported by this platform and by the selected or default CUDA Toolkit.
ccall-major
Compile for all major supported compute capabilities.
ccnative
Detects the visible GPUs on the system and generates codes for them. If no device is available, the compute capability matching NVCC's default will be used.
cudaX.Y
Use CUDA X.Y Toolkit compatibility, where installed
[no]debug
Enable [disable] debug information generation in device code
deepcopy
Enable full deep copy of aggregate data structures in OpenACC; Fortran only
fastmath
Use routines from the fast math library
[no]flushz
Enable [disable] flush-to-zero mode for floating point computations on the GPU
[no]fma
Generate [do not generate] fused multiply-add instructions; default at -⁠O3
[no]implicitsections
Change [Do not change] array element references in a data clause into an array section. In C⁠+⁠+, the implicitsections option will change update device(a[n]) to update device(a[0:n]). In Fortran, it will change enter data copyin(a(n)) to enter data copyin(a(:n)). The default behavior, noimplicitsections, can also be changed using rcfiles; for example, one could add set IMPLICITSECTIONS=0; to siterc or another rcfile.
[no]interceptdeallocations
Intercept [Do not intercept] calls to standard library memory deallocations (e.g. free) and call the corresponding CUDA memory deallocation version if address is in pinned or managed memory, regular version otherwise.
keep
Keep the kernel files (.cubin, .ptx, source)
[no]lineinfo
Enable [disable] GPU line information generation
loadcache:{L1|L2}
Choose what hardware level cache to use for global memory loads; options include the default, L1, or L2
[no]managed
Allocate [do not allocate] any dynamically allocated data in CUDA Managed memory. Use -⁠gpu=nomanaged with -⁠stdpar to prevent that flag's implicit use of -⁠gpu=managed when CUDA Managed memory capability is detected. This option is deprecated.
maxregcount:n
Specify the maximum number of registers to use on the GPU; leaving this blank indicates no limit
mem:{separate|managed|unified}
Select GPU memory mode for the generated binary. This controls CUDA memory capability to be utilised such as separate GPU memory only (separate), GPU Managed Memory for the dynamically allocated data (managed), or system memory aka full CUDA Unified Memory (unified). Use of Managed or Unified Memory facilitates simpler programming by eliminating the need to detect all data to be copied into and outside of the code region executing on the GPU.
pinned
Use CUDA Pinned Memory. This option is deprecated.
ptxinfo
Print PTX info
[no]rdc
Generate [do not generate] relocatable device code.
redundant
Redundant CPU/GPU execution
safecache
Allow variable-sized array sections in cache directives; compiler assumes they fit into CUDA shared memory
sm_XY
Generate code for a device with compute capability X.Y. Multiple compute capabilities can be specified, and one version will be generated for each. By default, the compiler will detect the compute capability for each installed GPU. Use -⁠help -⁠gpu to see the valid compute capabilities for your installation.
stacklimit:<l>nostacklimit
Sets the limit (l) of stack variables in a procedure or kernel, in KB. This option is deprecated.
[no]unified
Compile [do not compile] for CUDA Unified memory capability, where system memory is accessible from the GPU. This mode utilizes system and managed memory for dynamically allocated data unless explicit behavior is set through -⁠gpu=[no]managed. Use -⁠gpu=nounified with -⁠stdpar to prevent that flag's implicit use of -⁠gpu=unified when CUDA Unified memory capability is detected. This option must appear in both the compile and link lines. This option is deprecated.
[no]unroll
Enable [disable] automatic inner loop unrolling; default at -⁠O3
zeroinit
Initialize allocated device memory with zero

Usage

In the following example, the compiler generates code for NVIDIA GPUs with compute capabilities 6.0 and 7.0.

$ nvfortran -acc -gpu=cc60,cc70 myprog.f

The compiler automatically invokes the necessary software tools to create the kernel code and embeds the kernels in the object file.

To link in the appropriate GPU libraries, you must link an OpenACC program with the -⁠acc flag, and similarly for -⁠cuda, -⁠mp, or -⁠stdpar.

DWARF Debugging Formats

Use the -⁠g option to enable generation of full DWARF information on both the host and device; in the absence of other optimization flags, -⁠g sets the optimization level to zero. If a -⁠O option raises the optimization level to one or higher, only GPU line information is generated in device code even when -⁠g is specified. To enforce full DWARF generation for device code at optimization levels above zero, use the debug sub-option to -⁠gpu. Conversely, to prevent the generation of dwarf information for device code, use the nodebug sub-option to -⁠gpu. Both debug and nodebug can be used independently of -⁠g.

6.3. OpenACC for Multicore CPUs

The NVIDIA OpenACC compilers support the option -acc=multicore, to set the target accelerator for OpenACC programs to the host multicore CPU. This will compile OpenACC compute regions for parallel execution across the cores of the host processor or processors. The host multicore CPU will be treated as a shared-memory accelerator, so the data clauses (copy, copyin, copyout, create) will be ignored and no data copies will be executed.

By default, -acc=multicore will generate code that will use all the available cores of the processor. If the compute region specifies a value in the num_gangs clause, the minimum of the num_gangs value and the number of available cores will be used. At runtime, the number of cores can be limited by setting the environment variable ACC_NUM_CORES to a constant integer value. The number of cores can also be set with the void acc_set_num_cores(int numcores) runtime call. If an OpenACC compute construct appears lexically within an OpenMP parallel construct, the OpenACC compute region will generate sequential code. If an OpenACC compute region appears dynamically within an OpenMP region or another OpenACC compute region, the program may generate many more threads than there are cores, and may produce poor performance.

The -acc=multicore option differs from the -acc=host option in that -acc=host generates sequential host CPU code for the OpenACC compute regions.

6.4. OpenACC with CUDA Unified Memory

When developing OpenACC source for a target supporting CUDA Unified Memory, you can take advantage of a simplified approach to programming because there is no need for data clauses and directives, either in full or in part, depending on the exact memory capability the target supports and the compiler options used.

The discussion in this section assumes you have become familiar with the Separate, Managed, and Unified Memory Modes covered in the Memory Model and Managed and Unified Memory Modes sections.

In Managed Memory Mode, only dynamically-allocated data are implicitly managed by the CUDA runtime; OpenACC data clauses and directives are therefore not needed for movement of this "managed" data. Data clauses and directives are still required to handle static data (C static and extern variables, Fortran module, common block and save variables) and function local data.

In Unified Memory Mode, all data is managed by the CUDA runtime. Explicit data clauses and directives are no longer required to indicate which data should reside in GPU memory. All variables are accessible from the OpenACC compute regions executing on the GPU. The NVHPC compiler implementation closely adheres to the shared memory mode detailed in the OpenACC specification, meaning that copy, copyin, copyout, and create clauses will not result in any device allocation or data transfer. The device_resident clause is still honored as in discrete memory mode and results in an allocation of data only accessible from device code. Device memory can also be allocated or deallocated in OpenACC programs in Unified Memory Mode by using the acc_malloc or acc_free API calls.

Understanding Data Movement

In the absence of visible data clauses or directives, when the compiler encounters a compute construct it attempts to determine what data is required for correct execution of the region on the GPU. When the compiler is unable to determine the size and shape of data needing to be accessible on the device, it behaves as follows:

  • In Separate Memory Mode, the compiler emits an error requesting an explicit data clause be added to specify size/shape of the data to be copied.
  • In Managed Memory Mode (-⁠gpu=mem:managed), the compiler assumes the data is allocated in managed memory and thus is accessible from the device; if this assumption is wrong, if the data was defined globally or is located on the CPU stack, the program may fail at runtime.
  • In Unified Memory Mode (-⁠gpu=mem:unified), all data is accessible from the device making information about size and shape unnecessary.

Take the following example in C:

void set(int* ptr, int i, int j, int dim){
  int idx = i * dim + j;
  return ptr[idx] = someval(i, j);
}

void fill2d(int* ptr, int dim){
#pragma acc parallel loop
  for (int i = 0; i < dim; i++)
    for (int j = 0; j < dim; j++)
      set(ptr, i, j, dim);
}
      

In Separate Memory Mode, the only way to guarantee correctness for this example is to change the line with the acc directive as follows:

#pragma acc parallel loop create(ptr[0:dim*dim]) copyout(ptr[0:dim*dim]) 
      

This change explicitly instructs the OpenACC implementation about the precise data segment used within the parallel loop.

In Unified Memory Mode, that is, by compiling with -⁠acc -⁠gpu=mem:unified and executing on a platform with unified memory capability, the create and copyout clauses are not required.

The next example, in Fortran, illustrates how a global variable can be accessed in an OpenACC routine without requiring any explicit annotation.

module m
integer :: globmin = 1234
contains
subroutine findmin(a)
!$acc routine seq
  integer, intent(in)  :: a(:)
  integer :: i
  do i = 1, size(a)
    if (a(i) .lt. globmin) then
      globmin = a(i)
    endif
  end do
end subroutine
end module m
      

Compile the example above for Unified Memory Mode:

nvfortran -acc -gpu=mem:unified example.f90
      

The source does not need any OpenACC directives to access module variable globmin, to either read or update its value, in the routine invoked from CPU and GPU. Moreover, any access to globmin will be made to the same exact instance of the variable from CPU and GPU; its value is synchronized automatically. In Separate or Managed Memory Modes, such behavior can only be achieved with a combination of OpenACC declare and update directives in the source code.

In most cases, migrating existing OpenACC applications written for Separate Memory Mode should be a seamless process requiring no source changes. Some data access patterns, however, may lead to different results produced during application execution in Unified Memory Mode.

Applications which rely on having separate data copies in GPU memory to conduct temporary computations on the GPU -⁠-⁠ without maintaining data synchronization with the CPU -⁠-⁠ pose a challenge for migration to Unified Memory.

For the following Fortran example, the value of variable c after the last loop will differ depending on whether the example is compiled with or without -⁠gpu=mem:unified.

b(:) = ...
c = 0

!$acc kernels copyin(b) copyout(a)
!$acc loop
do i = 1, N
  b(i) = b(i) * i
end do
!$acc loop
do i = 1, N
  a(i) =  b(i) + i
end do
!$acc end kernels

do i = 1, N
  c = c + a(i) + b(i)
end do
      

Without Unified Memory, array b is copied into the GPU memory at the beginning of the OpenACC kernels region. It is then updated in the GPU memory and used to compute elements of array a. As instructed by the data clause copyin(b), b is not copied back to the CPU memory at the end of the kernels region and therefore its initial value is used in the computation of c. With -⁠acc -⁠gpu=mem:unified, the updated value of b in the first loop is automatically visible in the last loop leading to a different value of c at its end.

Implications of Asynchronous Execution

Additional complexities can arise when dealing with asynchronous execution, particularly when CPU-GPU shared data is accessed within async compute regions instead of using an independent data copy on GPU. The programmer should be especially careful about accessing local variables in asynchronous GPU code. Unless the GPU code execution is explicitly synchronized before the end of the scope in which local variables are defined, the GPU can access stale data thus resulting in undefined behavior. Consider the following OpenACC C example, where a local array is used to hold temporary data on the GPU:

void bar() {
  int x[N];
  #pragma acc enter data create(x[0:N]) async
  #pragma acc parallel loop async
  for (int i = 0; i < N; i++)
    x[i] = i;
  ...
  #pragma acc exit data delete(x[0:N]) async
}
      

When compiled for Separate Memory Mode, the bar() function creates a copy of the array x in GPU memory and initializes it as written in the loop construct. That copy is eventually deleted. In Unified Memory Mode, however, the compiler ignores the acc enter data and acc exit data directives, so the loop construct executed on the GPU accesses the array x in local CPU memory. Moreover, since all constructs in this example are made asynchronous, the access to x on the GPU leads to undefined behavior of the program because the variable x goes out of scope once the bar() function finishes.

Performance Considerations

In Unified Memory Mode, the OpenACC runtime may leverage data action information such as create/delete or copyin/copyout to communicate preferable data placement to the CUDA runtime by means of memory hint APIs as elaborated in the following blog post on the NVIDIA website: Simplifying GPU Application Development with Heterogeneous Memory Management. Such actions originate either from explicit data clauses in the source code or via implicit data movement generated by the compiler. This approach can minimize the amount of automatic data migration and may let a developer fine-tune application performance. For the C example above, while adding the data clauses create(ptr[0:dim*dim]) and copyout(ptr[0:dim*dim]) becomes optional with -⁠gpu=mem:unified, their uses in the OpenACC parallel loop directive may improve performance.

6.5. OpenACC Error Handling

The OpenACC specification provides a mechanism to allow you to intercept errors triggered during execution on a GPU and execute a specific routine in response before the program exits. For example, if an MPI process fails while allocating memory on the GPU, the application may want to call MPI_Abort to shut down all the other processes before the program exits. This section explains how to take advantage of this feature.

To intercept errors the application must give a callback routine to the OpenACC runtime. To provide the callback, the application calls acc_set_error_routine with a pointer to the callback routine.

The interface is the following, where err_msg contains a description of the error:

typedef void (*exitroutinetype)(char *err_msg);
extern void acc_set_error_routine(exitroutinetype callback_routine);

When the OpenACC runtime detects a runtime error, it will invoke the callback_routine.

Note: This feature is not the same as error recovery. If the callback routine returns to the application, the behavior is decidedly undefined.

Let's look at this feature in more depth using an example.

Take the MPI program below and run it with two processes. Process 0 tries to allocate a large array on the GPU, then sends a message to the second process to acknowledge the success of the operation. Process 1 waits for the acknowledgment and terminates upon receiving it.

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define N 2147483648

int main(int argc, char **argv)
{
  int rank, size;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int ack;
  if(rank == 0) {
    float *a = (float*) malloc(sizeof(float) * N);

#pragma acc enter data create(a[0:N])
#pragma acc parallel loop independent
    for(int i = 0; i < N; i++) {
      a[i] = i *0.5;
    }
#pragma acc exit data copyout(a[0:N])
    printf("I am process %d, I have initialized a vector of size %ld bytes on the GPU. Sending acknowledgment to process 1.", rank, N);
    ack = 1;
    MPI_Send(&ack, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if(rank == 1) {
    MPI_Recv(&ack, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("I am process %d, I have received the acknowledgment from process 0 that data in the GPU has been initialized.\n", rank, N);
    fflush(stdout);
  }

  // do some more work

  MPI_Finalize();

  return 0;
}

We compile the program with:

$ mpicc -acc -o error_handling_mpi error_handling_mpi.c

If we run this program with two MPI processes, the output will look like the following:

$ mpirun -n 2 ./error_handling_mpi
Out of memory allocating -8589934592 bytes of device memory
total/free CUDA memory: 11995578368/11919294464
Present table dump for device[1]:
NVIDIA Tesla GPU 0, compute capability 3.7, threadid=1
...empty...
call to cuMemAlloc returned error 2: Out of memory

-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated.

Process 0 failed while allocating memory on the GPU and terminated unexpectedly with an error. In this case mpirun was able to identify that one of the processes failed, so it shut down the remaining process and terminated the application. A simple two-process program like this is straightforward to debug. In a real world application though, with hundreds or thousands of processes, having a process exit prematurely may cause the application to hang indefinitely. Therefore it would be ideal to catch the failure of a process, control the termination of the other processes, and provide a useful error message.

We can use the OpenACC error handling feature to improve the previous program and correctly terminate the application in case of failure of an MPI process.

In the following sample code, we have added an error handling callback routine that will shut down the other processes if a process encounters an error while executing on the GPU. Process 0 tries to allocate a large array into the GPU and, if the operation is successful, process 0 will send an acknowledgment to process 1. Process 0 calls the OpenACC function acc_set_error_routine to set the function handle_gpu_errors as an error handling callback routine. This routine prints a message and calls MPI_Abort to shut down all the MPI processes. If process 0 successfully allocates the array on the GPU, process 1 will receive the acknowledgment. Otherwise, if process 0 fails, it will terminate itself and trigger the call to handle_gpu_errors. Process 1 is then terminated by the code executed in the callback routine.

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define N 2147483648


typedef void (*exitroutinetype)(char *err_msg);
extern void acc_set_error_routine(exitroutinetype callback_routine);

void handle_gpu_errors(char *err_msg) {
  printf("GPU Error: %s", err_msg);
  printf("Exiting...\n\n");
  MPI_Abort(MPI_COMM_WORLD, 1);
  exit(-1);
}


int main(int argc, char **argv)
{
  int rank, size;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int ack;
  if(rank == 0) {
    float *a = (float*) malloc(sizeof(float) * N);

    
    acc_set_error_routine(&handle_gpu_errors);
    

#pragma acc enter data create(a[0:N])
#pragma acc parallel loop independent
    for(int i = 0; i < N; i++) {
      a[i] = i *0.5;
    }
#pragma acc exit data copyout(a[0:N])
    printf("I am process %d, I have initialized a vector of size %ld bytes on the GPU. Sending acknowledgment to process 1.", rank, N);
    fflush(stdout);
    ack = 1;
    MPI_Send(&ack, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if(rank == 1) {
    MPI_Recv(&ack, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("I am process %d, I have received the acknowledgment from process 0 that data in the GPU has been initialized.\n", rank, N);
    fflush(stdout);
  }

  // more work

  MPI_Finalize();

  return 0;
}

Again, we compile the program with:

$ mpicc -acc -o error_handling_mpi error_handling_mpi.c

We run the program with two MPI processes and obtain the output below:

$ mpirun -n 2 ./error_handling_mpi
Out of memory allocating -8589934592 bytes of device memory
total/free CUDA memory: 11995578368/11919294464
Present table dump for device[1]:
NVIDIA Tesla GPU 0, compute capability 3.7, threadid=1
...empty...
GPU Error: call to cuMemAlloc returned error 2: Out of memory
Exiting...

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

This time the error on the GPU was intercepted by the application which managed it with the error handling callback routine. In this case the routine printed some information about the problem and called MPI_Abort to terminate the remaining processes and avoid any unexpected behavior from the application.

6.6. OpenACC and CUDA Graphs

NVIDIA provides an optimized model for work submission onto GPUs called CUDA Graphs. A graph is a series of operations, such as kernel launches and other stream-oriented tasks, connected by their dependencies. A graph can be defined once, "captured", then launched repeatedly. This has potential benefits in reducing launch latencies and other overheads associated with kernel setup.

A complete write-up explaining CUDA Graphs and the CUDA API for graph definition, instantiation, and execution can be found in Chapter 3 of the CUDA C Programming Guide. In OpenACC, we currently expose just the minimal set of operations to allow capture and replay of a graph containing OpenACC compute regions and data directives. The code executed between a "begin capture" call, accx_begin_capture_async(), and the "end capture" call, accx_end_capture_async(), is called the capture region.

The CUDA graph API captures (or records) all the device work between accx_begin_capture_async and accx_end_capture_async. The host code in the capture region will be executed once normally, with the exception that no device work is actually executed on the device. Instead, a graph object is created that can be used to replay the captured work multiple times.

Note: Graph capture is similar to a closure concept in many programming languages, like lambda-functions in C++. In lambda-function terms, CUDA graphs capture all the variables by value. That means that all the FIRSTPRIVATE scalars, array shapes, and those derived types, arrays and scalar addresses for data resident on the GPU, are baked into the graph object and cannot be altered. The device data behind the pointers, of course, can be updated by the graph execution normally, and updated by the host between replays.

It is important to understand both what can and cannot be captured within a CUDA Graph capture region:

  • Asynchronous data clauses including data create can be captured. The OpenACC runtime will use the stream-ordered cudaMallocAsync() call in the capture region for variables which need allocation in data clauses, an API call allowed in CUDA Graphs.
  • Asynchronous compute regions, preferably ACC parallel regions, can be captured. For ACC kernels regions, verify that no work is performed on the host. Host compute sections cannot be captured.
  • Asynchronous ACC update host (self) and update device directives can be captured. The host and device addresses which are captured must be valid during the graph replay/execution.
  • Since only the device work is captured and replayed, any data dependencies between the host and device inside the capture region are erroneous. For example, downloading data from the device, processing it on host and uploading it back to the device within the capture region is invalid.
  • Host code, even host code containing conditionals, can occur within a capture region. Note though that the path taken through the host code will be the path captured by the graph, i.e. the conditionals must likely be consistent during the replay for correct results. Host code which updates host variables, such as i=i+1 will not be captured in the graph, which might affect proper indexing into device-side arrays or other kernel arguments.
  • Similarly, device work initiated in host code loops can be captured in the CUDA Graph. The graph will not contain a notion of looping, just the sequence of device operations submitted to the device during the loop.
  • Subroutine and function calls within a capture region, which contain further compute regions or other work which runs on the device, are captured. Care must be taken that the device data addresses passed to the kernels are valid throughout graph execution, and don't come and go based on stack addresses or something similar.
  • Codes which double-buffer, or ping-pong between source and destination arrays that are input on odd iterations, and output on even iterations, can be accomodated by capturing two graphs: one per even iteration, one per odd iteration.
  • Many CUDA library calls, like cublas, etc. can occur in a captured region. Setup for the library calls, such as creating handles, and computing and allocating workspace requirements, should be done before the capture region.
  • Graph capturing is thread-safe with respect to each async queue. Host threads can independently capture graphs using different async queues.

The OpenACC API follows the basic portion of the CUDA Graph API fairly closely. The major difference is OpenACC includes the cudaGraphInstantiate() call as part of the end capture function.

From Fortran, the graph type is defined in the OpenACC module:

   type, bind(c) :: acc_graph_t
     type(c_ptr) :: graph
     type(c_ptr) :: graph_exec
   end type acc_graph_t

These subroutines are available in the OpenACC runtime. Here, pGraph is type(acc_graph_t) and async is just the asynchronous queue value:

   subroutine accx_async_begin_capture( async )
   subroutine accx_async_end_capture( async, pGraph )
   subroutine accx_graph_launch( pGraph, async )
   subroutine accx_graph_delete( pGraph )
   type(c_ptr) function accx_get_graph( pGraph )
   type(c_ptr) function accx_get_graph_exec( pGraph )

From C, the graph type is defined in OpenACC.h:

   typedef struct { void *graph; void *graph_exec; } acc_graph_t;

These void functions are available in the OpenACC runtime:

   extern void accx_async_begin_capture(long async);
   extern void accx_async_end_capture(long async, acc_graph_t *pgraph);
   extern void accx_graph_launch(acc_graph_t *pgraph, long async);
   extern void accx_graph_delete(acc_graph_t *pgraph);
   extern void *accx_get_graph(acc_graph_t *pgraph);
   extern void *accx_get_graph_exec(acc_graph_t *pgraph);

We will use a simple Fortran example code which demonstrates some of the modifications needed to use CUDA Graphs from OpenACC. The original serial code for a conjugate gradient iterative solver:

  subroutine RunCG(N, A, b, x, tol, max_iter)
    implicit none
    integer, intent(in) :: N, max_iter
    real(WP), intent(in) :: A(N, N), b(N), tol
    real(WP), intent(inout) :: x(N)

    real(WP) :: alpha, rr0, rr
    real(WP), allocatable :: Ax(:), r(:), p(:)
    integer :: it, i

    allocate(Ax(N), r(N), p(N))

    call symmatvec(N, N, A, x, Ax)
    do i = 1, N
      r(i) = b(i) - Ax(i)
      p(i) = r(i)
    enddo
    rr0 = dot(N, r, r)

    do it = 1, max_iter
      call symmatvec(N, N, A, p, Ax)
      alpha = rr0 / dot(N, p, Ax)

      do i = 1, N
        x(i) = x(i) + alpha * p(i)
        r(i) = r(i) - alpha * Ax(i)
      enddo

      rr = dot(N, r, r)

      print*, "Iteration ", it, " residual: ", sqrt(rr)
      if (sqrt(rr) <= tol) then
        deallocate(Ax, r, p)
        return
      endif
      do i = 1, N
        p(i) = r(i) + (rr / rr0) * p(i)
      enddo
      rr0 = rr
    enddo

    deallocate(Ax, r, p)

  end subroutine RunCG 

For this exercise we wish to put the do it = 1,max_iter work for each iteration into a CUDA graph. Step one is to port the code to OpenACC, keeping in mind that we want to use asynchronous queues. We annotate the dot function with OpenACC directives like this:

  function dot(N, x, y) result(r)
    integer, intent(in) :: N
    real(WP), intent(in) :: x(N), y(N)
    integer :: i
    real(WP) :: r

    r = 0.d0
    !$acc parallel loop present(x, y) reduction(+:r) async(1)
    do i = 1, N
      r = r + x(i) * y(i)
    enddo
    !$acc wait(1)
  end function dot 

We write the symmetric matrix multiply like this:

  subroutine symmatvec(M, N, AT, x, Ax)
    implicit none
    integer, intent(in) :: M, N
    real(WP), intent(in) :: AT(N, M), x(N)
    real(WP), intent(out) :: Ax(M)

    integer :: i, j
    real(WP) :: s

    ! Note: Since A is symmetric, we can use the "transpose"
    ! for better memory access here
    !$acc parallel loop gang present(AT, x, Ax) async(1)
    do i = 1, M
      s = 0.d0
      !$acc loop vector reduction(+:s)
      do j = 1, N
        s = s + AT(j,i) * x(j)
      end do
      Ax(i) = s
    end do
  end subroutine 

And now our main loop of the conjugate gradient solver looks like this:

    do it = 1, max_iter
      call symmatvec(N, N, A, p, Ax)
      alpha = rr0 / dot(N, p, Ax)

      !$acc parallel loop gang vector async(1)
      do i = 1, N
        x(i) = x(i) + alpha * p(i)
        r(i) = r(i) - alpha * Ax(i)
      enddo

      rr = dot(N, r, r)

      print*, "Iteration ", it, " residual: ", sqrt(rr)
      if (sqrt(rr) <= tol) exit

      !$acc parallel loop gang vector async(1)
      do i = 1, N
        p(i) = r(i) + (rr / rr0) * p(i)
      enddo
      rr0 = rr
    enddo 

Step 2 is to prepare the code for running under CUDA Graphs. There is a lot of host code executing in the main loop. While the dot() function runs on the GPU, the rest of the statement alpha = rr0 / dot(...) runs on the host. Similarly, the 2nd dot() call returns its value to the host. The print statement occurs on the host, as does the residual check. Finally, this iteration's value for rr is moved to rr0 in the last statement of the loop, on the host.

The dot product is tricky. We wish to compute the dot product on the GPU, and leave the result on the GPU, so the reduction variable must be present on the GPU. Here, we change the function call to a subroutine, and remove the initialization which is outside of the parallel region:

  subroutine dot(N, x, y, r)
    implicit none
    integer, intent(in) :: N
    real(WP), intent(in) :: x(N), y(N)
    integer :: i
    real(WP) :: r

    !$acc parallel loop present(x, y, r) reduction(+:r) async(1)
    do i = 1, N
      r = r + x(i) * y(i)
    enddo
  end subroutine dot 

We add one serial kernel to do some of the swapping between rr0 and rr, as well as zeroing out the scalar that will hold the dot product reduction, and move the print and check outside of the GPU capture region, replaced by a update host operation. The finished loop, complete with graph control, looks like this:

  do it = 1, max_iter
    if (it .eq. 1) then  ! First time capture
      call accx_async_begin_capture(1)

      call symmatvec(N, N, A, p, Ax)
      call dot(N, p, Ax, rden)

      !$acc serial async(1)
      rr0 = rr
      alpha = rr0 / rden
      rden = 0.0d0
      rr = 0.0d0
      !$acc end serial

      !$acc parallel loop gang vector async(1)
      do i = 1, N
        x(i) = x(i) + alpha * p(i)
        r(i) = r(i) - alpha * Ax(i)
      enddo

      call dot(N, r, r, rr)

      !$acc update host(rr) async(1)

      !$acc parallel loop gang vector async(1)
      do i = 1, N
        p(i) = r(i) + (rr / rr0) * p(i)
      enddo
      call accx_async_end_capture(1, graph)
    endif
    ! Always launch, then wait
    call accx_graph_launch(graph, 1)
    !$acc wait(1)

    rra(it) = rr
    if (sqrt(rr) <= tol) exit
  enddo 

Step 3 is to compile, run, and profile the result. No special compiler options are needed besides -acc=gpu. When running, you may be advised to set the NVCOMPILER_ACC_USE_GRAPH environment variable. This is currently necessary to properly set the OpenACC runtime for graph capture. Failure to abide by the guidelines above may result in wrong answers, which can be hard to debug. See the following sections on how to use environment variables to help. A common issue is that the pointers passed to the device kernels during graph playback will be the same every time. Make sure that is the case between iterations in the code without graph capture.

The Nsight Systems tool has very good support for profiling CUDA graphs. The timeline view will provide information on whether you have reduced the launch overhead gaps between the GPU kernels. Figure 1 shows a timeline of the iterations of the original OpenACC loop:

Figure 1. Nsight Systems Report1 Timeline
png for PDF.

Figure 2 shows a timeline of the iterations when using CUDA Graphs. When the size N is less than a few thousand, launch latency becomes a major contributor to the overall time and here we can see about a 2x speedup:

Figure 2. Nsight Systems Report2 Timeline
png for PDF.

You can see a more-detailed trace of the CUDA Graph components by adding the --cuda-graph-trace=node option to the nsys profile command.

The above loop demonstrates several of the guidelines outlined at the top of this section, namely, capturing compute regions, whether at the top level or in subprogram units, capturing data movement, and restructuring code regions to minimize or eliminate the host code within a capture region. And the minimal API to begin capture, end capture, then launch the captured graph.

6.7. Environment Variables

This section summarizes the environment variables that NVIDIA OpenACC supports. These environment variables are user-setable environment variables that control behavior of accelerator-enabled programs at execution. These environment variables must comply with these rules:

  • The names of the environment variables must be upper case.
  • The values of environment variables are case insensitive and may have leading and trailing white space.
  • The behavior is implementation-defined if the values of the environment variables change after the program has started, even if the program itself modifies the values.

The following table contains the environment variables that are currently supported and provides a brief description of each.

Table 16. Supported Environment Variables
Use this environment variable... To do this...
NVCOMPILER_ACC_CUDA_PROFSTOP Set to 1 (or any positive value) to tell the runtime environment to insert an 'atexit(cuProfilerStop)' call upon exit. This behavior may be desired in the case where a profile is incomplete or where a message is issued to call cudaProfilerStop().
NVCOMPILER_ACC_DEVICE_NUM Sets the default device number to use. NVCOMPILER_ACC_DEVICE_NUM. Specifies the default device number to use when executing accelerator regions. The value of this environment variable must be a nonnegative integer between zero and the number of devices attached to the host.
ACC_DEVICE_NUM Legacy name. Superseded by NVCOMPILER_ACC_DEVICE_NUM.
NVCOMPILER_ACC_DEVICE_TYPE Sets the default device type to use for OpenACC regions. NVCOMPILER_ACC_DEVICE_TYPE. Specifies which accelerator device to use when executing accelerator regions when the program has been compiled to use more than one different type of device. The value of this environment variable is implementation-defined, and in the NVIDIA OpenACC implementation may be the strings NVIDIA, MULTICORE or HOST
ACC_DEVICE_TYPE Legacy name. Superseded by NVCOMPILER_ACC_DEVICE_TYPE.
NVCOMPILER_ACC_GANGLIMIT For NVIDIA CUDA devices, this defines the maximum number of gangs (CUDA thread blocks) that will be launched by a kernel.
NVCOMPILER_ACC_NOTIFY With no argument, a debug message will be written to stderr for each kernel launch and/or data transfer. When set to an integer value, the value is used as a bit mask to print information about:

1: kernel launches

2: data transfers

4: region entry/exit

8: wait operations or synchronizations with the device

16: device memory allocates and deallocates

NVCOMPILER_ACC_PROFLIB Enables 3rd party tools interface using the new profiler dynamic library interface.
NVCOMPILER_ACC_SYNCHRONOUS Disables asynchronous launches and data movement.
NVCOMPILER_ACC_TIME Enables a lightweight profiler to measure data movement and accelerator kernel execution time and print a summary at the end of program execution.

6.8. Profiling Accelerator Kernels

Support for Profiler/Trace Tool Interface

The NVIDIA HPC Compilers support the OpenACC Profiler/Trace Tools Interface. This is the interface used by the NVIDIA profilers to collect performance measurements of OpenACC programs.

Using NVCOMPILER_ACC_TIME

Setting the environment variable NVCOMPILER_ACC_TIME to a nonzero value enables collection and printing of simple timing information about the accelerator regions and generated kernels.

Note: Turn off all CUDA Profilers (NVIDIA's Visual Profiler, NVPROF, CUDA_PROFILE, etc) when enabling NVCOMPILER_ACC_TIME, they use the same library to gather performance data and cannot be used concurently.

Accelerator Kernel Timing Data

bb04.f90
  s1
   15: region entered 1 times
     time(us): total=1490738 
                 init=1489138 region=1600
                 kernels=155 data=1445
     w/o init: total=1600 max=1600 
                 min=1600 avg=1600
   18: kernel launched 1 times
   time(us): total=155 max=155 min=155 avg=155

In this example, a number of things are occurring:

  • For each accelerator region, the file name bb04.f90 and subroutine or function name s1 is printed, with the line number of the accelerator region, which in the example is 15.
  • The library counts how many times the region is entered (1 in the example) and the microseconds spent in the region (in this example 1490738), which is split into initialization time (in this example 1489138) and execution time (in this example 1600).
  • The execution time is then divided into kernel execution time and data transfer time between the host and GPU.
  • For each kernel, the line number is given, (18 in the example), along with a count of kernel launches, and the total, maximum, minimum, and average time spent in the kernel, all of which are 155 in this example.

6.9. OpenACC Runtime Libraries

This section provides an overview of the user-callable functions and library routines that are available for use by programmers to query the accelerator features and to control behavior of accelerator-enabled programs at runtime.

Note: In Fortran, none of the OpenACC runtime library routines may be called from a PURE or ELEMENTAL procedure.

6.9.1. Runtime Library Definitions

There are separate runtime library files for Fortran, and for C++ and C.

C++ and C Runtime Library Files

In C++ and C, prototypes for the runtime library routines are available in a header file named accel.h. All the library routines are extern functions with ‘C’ linkage. This file defines:

  • The prototypes of all routines in this section.
  • Any data types used in those prototypes, including an enumeration type to describe types of accelerators.

Fortran Runtime Library Files

In Fortran, interface declarations are provided in a Fortran include file named accel_lib.h and in a Fortran module named accel_lib. These files define:

  • Interfaces for all routines in this section.
  • Integer parameters to define integer kinds for arguments to those routines.
  • Integer parameters to describe types of accelerators.

6.9.2. Runtime Library Routines

Table 17 lists and briefly describes the runtime library routines supported by the NVIDIA HPC Compilers in addition to the standard OpenACC runtine API routines.

Table 17. Accelerator Runtime Library Routines
This Runtime Library Routine... Does this...
acc_allocs Returns the number of arrays allocated in data or compute regions.
acc_bytesalloc Returns the total bytes allocated by data or compute regions.
acc_bytesin Returns the total bytes copied in to the accelerator by data or compute regions.
acc_bytesout Returns the total bytes copied out from the accelerator by data or compute regions.
acc_clear_freelists Clears lists of deallocated device memory chunks retained by the device memory manager for reuse. Please refer to the section Environment Variables Controlling Device Memory Management to learn more about device memory control.
acc_copyins Returns the number of arrays copied in to the accelerator by data or compute regions.
acc_copyouts Returns the number of arrays copied out from the accelerator by data or compute regions.
acc_disable_time Tells the runtime to stop profiling accelerator regions and kernels.
acc_enable_time Tells the runtime to start profiling accelerator regions and kernels, if it is not already doing so.
acc_exec_time Returns the number of microseconds spent on the accelerator executing kernels.
acc_frees Returns the number of arrays freed or deallocated in data or compute regions.
acc_get_device Returns the type of accelerator device used to run the next accelerator region, if one is selected.
acc_get_device_num Returns the number of the device being used to execute an accelerator region.
acc_get_free_memory Returns the total available free memory on the attached accelerator device.
acc_get_memory Returns the total memory on the attached accelerator device.
acc_get_num_devices Returns the number of accelerator devices of the given type attached to the host.
acc_kernels Returns the number of accelerator kernels launched since the start of the program.
acc_present_dump Summarizes all data present on the current device.
acc_present_dump_all Summarizes all data present on all devices.
acc_regions Returns the number of accelerator regions entered since the start of the program.
acc_total_time Returns the number of microseconds spent in accelerator compute regions and in moving data for accelerator data regions.

6.10. Supported Intrinsics

An intrinsic is a function available in a given language whose implementation is handled specifically by the compiler. Typically, an intrinsic substitutes a sequence of automatically-generated instructions for the original function call. Since the compiler has an intimate knowledge of the intrinsic function, it can better integrate it and optimize it for the situation.

Intrinsics make the use of processor-specific enhancements easier because they provide a language interface to assembly instructions. In doing so, the compiler manages things that the user would normally have to be concerned with, such as register names, register allocations, and memory locations of data.

This section contains an overview of the Fortran and C intrinsics that the accelerator supports.

6.10.1. Supported Fortran Intrinsics Summary Table

Table 18 is an alphabetical summary of the supported Fortran intrinsics that the accelerator supports. These functions are specific to Fortran 90/95 unless otherwise specified.

In most cases support is provided for all the data types for which the intrinsic is valid. When support is available for only certain data types, the middle column of the table specifies which ones, using the following codes:

I for integer S for single precision real C for single precision complex
  D for double precision real Z for double precision complex
Table 18. Supported Fortran Intrinsics
This intrinsic Return value
ABS I,S,D absolute value of the argument.
ACOS   arccosine of the specified argument.
AINT   truncation of the argument to a whole number.
ANINT   nearest whole number of the real argument.
ASIN   arcsine of the argument.
ATAN   arctangent of the argument.
ATAN2   angle in radians of the complex value first-argument + i*second-argument.
COS S,D,C,Z cosine of the argument.
COSH   hyperbolic cosine of the argument.
DBLE S,D conversion of the argument to double precision real.
DPROD   double precision product of two single precision arguments.
EXP S,D,C,Z natural exponential value of the argument.
IAND   result of logical AND of the two integer arguments.
IEOR   result of the boolean exclusive OR of the two integer arguments.
INT I,S,D conversion of the argument to integer type.
IOR   result of the boolean inclusive OR of the two integer arguments.
LOG S,D,C,Z base-e (natural logarithm) of the argument.
LOG10   base-10 logarithm of the argument.
MAX   maximum value of the arguments.
MIN   minimum value of the arguments.
MOD I remainder of the first argument divided by the second argument.
NINT   nearest integer of the real argument.
NOT   logical complement of the integer argument.
REAL I,S,D conversion of the argument to real.
SIGN   absolute value of first argument times the sign of second argument.
SIN S,D,C,Z sine of the argument.
SINH   hyperbolic sine of the argument.
SQRT S,D,C,Z square root of the argument.
TAN   tangent of the argument.
TANH   hyperbolic tangent of the argument.

6.10.2. Supported C Intrinsics Summary Table

This section contains two alphabetical summaries – one for double functions and a second for float functions. These lists contain only those C intrinsics that the accelerator supports.

Table 19. Supported C Intrinsic Double Functions
This intrinsic Return value
acos arccosine of the argument.
asin arcsine of the argument.
atan arctangent of the argument.
atan2 arctangent of y/x, where y is the first argument, x the second.
cos cosine of the argument.
cosh hyperbolic cosine of the argument.
exp exponential value of the argument.
fabs absolute value of the argument.
fmax maximum value of the two arguments
fmin minimum value of the two arguments
log natural logarithm of the argument.
log10 base-10 logarithm of the argument.
pow value of the first argument raised to the power of the second argument.
sin value of the sine of the argument.
sinh hyperbolic sine of the argument.
sqrt square root of the argument.
tan tangent of the argument.
tanh hyperbolic tangent of the argument.
Table 20. Supported C Intrinsic Float Functions
This intrinsic Return value
acosf arccosine of the argument.
asinf arcsine of the argument.
atanf arctangent of the argument.
atan2f arctangent of y/x, where y is the first argument, x the second.
cosf cosine of the argument.
coshf hyperbolic cosine of the argument.
expf exponential value of the argument.
fabsf absolute value of the argument.
logf natural logarithm of the argument.
log10f base-10 logarithm of the argument.
powf value of the first argument raised to the power of the second argument.
sinf value of the sine of the argument.
sinhf hyperbolic sine of the argument.
sqrtf square root of the argument.
tanf tangent of the argument.
tanhf hyperbolic tangent of the argument.

7. Using OpenMP

OpenMP is a specification for a set of compiler directives, an applications programming interface (API), and a set of environment variables that can be used to specify parallel execution in Fortran, C⁠+⁠+, and C programs. For general information about using OpenMP and to obtain a copy of the OpenMP specification, refer to the OpenMP organization's website.

The NVFORTRAN, NVC⁠+⁠+, and NVC compilers support a subset of the OpenMP Application Program Interface for CPUs and GPUs. In defining this subset, we have focused on OpenMP 5.0 features that will enable CPU and GPU targeting for OpenMP applications with a goal of encouraging programming practices that are portable and scalable. For features that are to be avoided, wherever possible, the directives and API calls related to those features are parsed and ignored to maximize portability. Where ignoring such features is not possible, or could result in ambiguous or incorrect execution, the compilers emit appropriate error messages at compile- or run-time.

OpenMP applications properly structured for GPUs, meaning they expose massive parallelism and have relatively little or no synchronization in GPU-side code segments, should compile and execute with performance on par with or close to equivalent OpenACC. Codes that are not well-structured for GPUs may perform poorly but should execute correctly.

Use the -⁠mp compiler switch to enable processing of OpenMP directives and pragmas. The most important sub-options to -⁠mp are the following:
  • gpu: OpenMP directives are compiled for GPU execution plus multicore CPU fallback; this feature is supported on NVIDIA V100 or later GPUs.
  • multicore: OpenMP directives are compiled for multicore CPU execution only; this sub-option is the default.

Predefined Macros

The following macros corresponding to the offload target compiled for are added implicitly:
  • __NVCOMPILER_OPENMP_GPU when OpenMP target directives are compiled for GPU.
  • __NVCOMPILER_OPENMP_MULTICORE when OpenMP target directives are compiled for multicore CPU.

7.1. Environment Variables

The OpenMP specification includes many environment variables related to program execution.

Thread affinity

One important environment variable is OMP_PROC_BIND. It controls the OpenMP CPU thread affinity policy. When thread affinity is disabled, the operating system is free to move threads between the available CPU cores. When thread affinity is enabled, each thread is bound to a subset of the available CPU cores. The environment variable OMP_PLACES can be used to specify how a subset of the available CPU cores is determined for each thread. When set to a valid value, this environment variable will enable thread affinity and override the default thread affinity policy.

Binding threads to certain CPU cores is often beneficial for application performance, because that can improve the CPU cache hit rate and limit memory transactions between different NUMA nodes. Therefore, it is important to consider enabling thread affinity for your application.

The default value of OMP_PROC_BIND is false. Thus, thread affinity is disabled by default. This is a conservative setting that allows certain classes of applications (such as OpenMP + MPI) to create multiple processes without taking special care of the thread affinity policy to avoid binding threads in different processes to the same CPU cores.

The following table explains the simplest possible values of OMP_PROC_BIND. For the comprehensive explanation of OMP_PROC_BIND and OMP_PLACES, please refer to the OpenMP specification.

Value Behavior
OMP_PROC_BIND=false Thread affinity is disabled unless OMP_PLACES is set to a valid value. When thread affinity is disabled, the operating system is free to assign threads to any available CPU core at any time of the application execution. This is the default value.
OMP_PROC_BIND=true Thread affinity is enabled. Unless OMP_PLACES is set, the implementation attempts to assign threads optimally to CPU cores to maximize the cache hit rate and minimize the number of memory transactions between NUMA nodes.

Device offload

Another important environment variable to understand is OMP_TARGET_OFFLOAD. Use this environment variable to affect the behavior of execution on host and device including host fallback. The following table explains the behavior determined by each of the values to which you can set this environment variable.

Value Behavior
OMP_TARGET_OFFLOAD=DEFAULT Try to execute on a GPU; if a supported GPU is not available, fallback to the host
OMP_TARGET_OFFLOAD=DISABLED Do not execute on the GPU even if one is available; execute on the host
OMP_TARGET_OFFLOAD=MANDATORY Execute on a GPU or terminate the program

Number of teams on device

When an application offloads an omp target teams construct to the GPU, the number of teams is calculated automatically unless the construct has a num_teams clause. The automatic setting of the number of teams can be limited to a maximum value provided by the OMP_NUM_TEAMS environment variable. The same maximum value can also be set by the application at run time with the function omp_set_num_teams.

Value Behavior
OMP_NUM_TEAMS=<positive_integer> Maximum number of teams on device

For the comprehensive explanation of OMP_NUM_TEAMS, please refer to the OpenMP specification.

Number of threads in teams

An omp target teams construct offloaded to the GPU creates a league of teams each consisting of a certain number of threads. The number of threads is the same for all teams in the league, and is calculated automatically unless the construct has a thread_limit clause.

The environment variable OMP_TEAMS_THREAD_LIMIT can be used to limit the maximum number of threads in teams. The same maximum value can be set by the application with the runtime function omp_set_teams_thread_limit.

For NVIDIA GPUs, we recommend using values that are multiples of 32 (which is the size of the GPU thread warp). That equally applies to the OMP_TEAMS_THREAD_LIMIT environment variable, the omp_set_teams_thread_limit function and the thread_limit clause. For any other value, the actual limit on the number of threads per team will likely be rounded down to the nearest multiple of 32. The same guidance applies to the num_threads clause as well.

Value Behavior
OMP_TEAMS_THREAD_LIMIT=<positive_integer> Maximum number of threads in teams

For the comprehensive explanation of OMP_TEAMS_THREAD_LIMIT, please refer to the OpenMP specification.

Forcing the number of device teams and threads

In certain situations, for instance for debugging or performance tuning, it may be desirable to specify an exact number of teams and threads on the GPU. While OpenMP offers a number of convenient ways to control that, e.g. the num_teams and thread_limit clauses, as well as the environment variables described above, they do not guarantee an exact teams and threads configuration.

The NVIDIA HPC OpenMP Runtime supports the NVCOMPILER_OMP_CUDA_GRID environment variable. When set, it requests the runtime to use the exact number of teams and threads per team when running OpenMP compute constructs on the GPU. Essentially, its effect is to use a specific CUDA grid configuration for any kernel, bypassing runtime and compiler guidance.

Value Behavior
NVCOMPILER_OMP_CUDA_GRID=<num_blocks>,<num_threads> The <num_blocks> and <num_threads> must be positive integers. They are used to form a CUDA grid when running GPU kernels associated with omp target compute constructs.

However, even with an exact CUDA grid specified, the runtime may still use a corrected configuration if that is necessary for a successful kernel launch.

Please refer to the CUDA C++ Programming Guide for the detailed explanation of how the CUDA kernel execution configurations work.

7.2. Fallback Mode

The HPC compilers support host fallback of OpenMP target regions when no GPU is present or OMP_TARGET_OFFLOAD is set to DISABLED. Execution should always be correct but the performance of the target region may not always be optimal when run on the host. OpenMP target regions prescriptively structured for optimal execution on GPUs may not perform well when run on the dissimilar architecture of the CPU. To provide performance portability between host and device, we recommend use of the loop construct.

firstprivates with nowait not supported for host execution

There is currently a limitation on the use of the nowait clause on target regions intended for execution on the host (-⁠mp or -⁠mp=gpu with OMP_TARGET_OFFLOAD=DISABLED). If the target region references variables having the firstprivate data-sharing attribute, their concurrent updates are not guaranteed to be safe. To work around this limitation, when running on the host, we recommend avoiding the nowait clause on such target regions or equivalently using the taskwait construct immediately following the region.

7.3. Loop

The HPC compilers support the loop construct with an extension to the default binding thread set mechanism specified by OpenMP in order to allow the compilers the freedom to analyze loops and dependencies to generate highly parallel code for CPU and GPU targets. In other words, the compilers map loop to either teams or to threads, as the compiler chooses, unless the user explicitly specifies otherwise. The mapping selected is specific to each target architecture even within the same executable (i.e., GPU offload and host fallback) thereby facilitating performance portability.

The shape of the parallelism offered by NVIDIA's GPUs, consisting of thread blocks and three dimensions of threads therein, differs from the multi-threaded vector parallelism of modern CPUs. The following table summarizes the OpenMP mapping to NVIDIA GPUs and multicore CPUs:

Construct CPU GPU
!$omp target starts offload
!$omp teams single team CUDA thread blocks in grid
!$omp parallel CPU threads CUDA threads within thread block
!$omp simd hint for vector instructions simdlen(1)

HPC programs need to leverage all available parallelism to achieve performance. The programmer can attempt to become an expert in the intricacies of each target architecture and use that knowledge to structure programs accordingly. This prescriptive model can be successful but tends to increase source code complexity and often requires restructuring for each new target architecture. Here's an example where a programmer explicitly requests the steps the compiler should take to map parallelism to two targets:

#ifdef TARGET_GPU
   #pragma omp target teams distribute reduction(max:error)
#else 
   #pragma omp parallel for reduction(max:error)
#endif
for( int j = 1; j < n-1; j++) {
#ifdef TARGET_GPU
   #pragma omp parallel for reduction(max:error) 
#endif
  for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
      error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
  }
}

An alternative is for the programmer to focus on exposing parallelism in a program and allowing a compiler to do the mapping onto the target architectures. The HPC compilers' implementation of loop supports this descriptive model. In this example, the programmer specifies the loop regions to be parallelized by the compiler and the compilers parallelize loop across teams and threads:

#pragma omp target teams loop reduction(max:error) 
for( int j = 1; j < n-1; j++) {
  #pragma omp loop reduction(max:error)
  for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
      error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
  }
}

The programmer's tuning tool with loop is the bind clause. The following table extends the previous mapping example:

Construct CPU GPU
!$omp loop bind(teams) threads CUDA thread blocks and threads
!$omp loop bind(parallel) threads CUDA threads
!$omp loop bind(thread) single thread (useful for vector instructions) single thread

Orphaned loop constructs within a single file are supported; a binding region of either parallel or thread must be specified with such loops via the bind clause. The compilers support loop regions containing procedure calls as long as the callee does not contain OpenMP directives.

Here are a few additional examples using loop. We also show examples of the type of information the compiler would provide when using the -⁠Minfo compiler option.

Use of loop in Fortran:

!$omp target teams loop 
do n1loc_blk = 1, n1loc_blksize
  do igp = 1, ngpown 
    do ig_blk = 1, ig_blksize 
      do ig = ig_blk, ncouls, ig_blksize
        do n1_loc = n1loc_blk, ntband_dist, n1loc_blksize
          !expensive computation codes           
        enddo 
      enddo 
    enddo 
  enddo 
enddo

$ nvfortran test.f90 -mp=gpu -Minfo=mp
42, !$omp target teams loop
  42, Generating "nvkernel_MAIN__F1L42_1" GPU kernel
      Generating Tesla code
    43, Loop parallelized across teams ! blockidx%x
    44, Loop run sequentially
    45, Loop run sequentially
    46, Loop run sequentially
    47, Loop parallelized across threads(128) ! threadidx%x
  42, Generating Multicore code
    43, Loop parallelized across threads

Use of loop, collapse, and bind:

!$omp target teams loop collapse(3)
do n1loc_blk = 1, n1loc_blksize
  do igp = 1, ngpown 
    do ig_blk = 1, ig_blksize 
      !$omp loop bind(parallel) collapse(2)
      do ig = ig_blk, ncouls, ig_blksize
        do n1_loc = n1loc_blk, ntband_dist, n1loc_blksize
          !expensive computation codes           
        enddo 
      enddo 
    enddo 
  enddo 
enddo

$ nvfortran test.f90 -mp=gpu -Minfo=mp

42, !$omp target teams loop
  42, Generating "nvkernel_MAIN__F1L42_1" GPU kernel
      Generating Tesla code
    43, Loop parallelized across teams collapse(3) ! blockidx%x
    44,   ! blockidx%x collapsed
    45,   ! blockidx%x collapsed
    47, Loop parallelized across threads(128) collapse(2) ! threadidx%x
    48,   ! threadidx%x collapsed
  42, Generating Multicore code
    43, Loop parallelized across threads

Use of loop, collapse, and bind(thread):

!$omp target teams loop collapse(3)
do n1loc_blk = 1, n1loc_blksize
  do igp = 1, ngpown 
    do ig_blk = 1, ig_blksize 
      !$omp loop bind(thread) collapse(2)
      do ig = ig_blk, ncouls, ig_blksize
        do n1_loc = n1loc_blk, ntband_dist, n1loc_blksize
          ! expensive computation codes           
        enddo 
      enddo 
    enddo 
  enddo 
enddo

$ nvfortran test.f90 -mp=gpu -Minfo=mp

42, !$omp target teams loop
  42, Generating "nvkernel_MAIN__F1L42_1" GPU kernel
      Generating Tesla code
    43, Loop parallelized across teams, threads(128) collapse(3) ! blockidx%x threadidx%x
    44,   ! blockidx%x threadidx%x collapsed
    45,   ! blockidx%x threadidx%x collapsed
    47, Loop run sequentially
    48,   collapsed
  42, Generating Multicore code
    43, Loop parallelized across threads

7.4. OpenMP Subset

This section contains the subset of OpenMP 5.0 features that the HPC compilers support. We have attempted to define this subset of features to be those that enable, where possible, OpenMP-for-GPU application performance that closely mirrors the success NVIDIA has seen with OpenACC. Almost every feature supported on NVIDIA GPUs is also supported on multicore CPUs, although the reverse is not true. Most constructs from OpenMP 3.1 and OpenMP 4.5 that apply to multicore CPUs are supported for CPU targets, and some features from OpenMP 5.0 are supported as well.

OpenMP target offload to NVIDIA GPUs is supported on NVIDIA V100 or later GPUs.

The section numbers below correspond to the section numbers in the OpenMP Application Programming Interface Version 5.0 November 2018 document.

2. Directives

2.3 Variant Directives

2.3.4 Metadirectives

The target_device/device context selector is supported with the kind(host|nohost|cpu|gpu) and arch(nvtpx|nvptx64) trait selectors. The arch trait property nvptx is an alias for nvptx64; any other arch trait properties are treated as not matching or are ignored. The isa selector is treated as not matching or is ignored; no support is provided to select a context based on NVIDIA GPU compute capability.

The implementation context selector is supported with the vendor(nvidia) trait selector.

The user context selector is supported with the condition(expression) trait selector including dynamic user traits.

The syntax begin/end metadirective is not supported.

2.3.5 Declare Variant Directive

The device context selector is supported with the kind(host|nohost|cpu|gpu) and arch(nvtpx|nvptx64) trait selectors. The arch trait property nvptx is an alias for nvptx64; any other arch trait properties are treated as not matching or are ignored. The isa selector is also treated as not matching or is ignored; no support is provided to select a context based on NVIDIA GPU compute capability.

The implementation context selector is supported with the vendor(nvidia) trait selector; all other implementation trait selectors are treated as not matching.

The syntax begin/end declare variant is supported for C/C⁠+⁠+.

2.4 Requires Directive

The requires directive has limited support. The requirement clauses unified_address and unified_shared_memory are accepted but have no effect. To activate OpenMP unified shared memory programming a command-line option needs to be passed in (refer to OpenMP with CUDA Unified Memory for more details).

2.5 Internal Control Variables

ICV support is as follows.
  • dyn-var, nthread-var, thread-limit-var, max-active-levels-var, active-levels-var, levels-var, run-sched-var, dyn-sched-var, and stacksize-var are supported
  • place-partition-var, bind-var, wait-policy-var, display-affinity-var, default-device-var, and target-offload-var are supported only on the CPU
  • affinity-format-var is supported only on the CPU; its value is immutable
  • max-task-priority-var, def-allocator-var are not supported
  • cancel-var is not supported; it always returns false

2.6 Parallel Construct

Support for parallel construct clauses is as follows.
  • The num_threads, default, private, firstprivate, and shared clauses are supported
  • The reduction clause is supported as described in 2.19.5
  • The if and copyin clauses are supported only for CPU targets; the compiler emits an error for GPU targets
  • The proc_bind clause is supported only for CPU targets; it is ignored for GPU targets
  • The allocate clause is ignored

2.7 Teams Construct

The teams construct is supported only when nested within a target construct that contains no statements, declarations, or directives outside the teams construct, or as a combined targetteams construct. The teams construct is supported for GPU targets. If the target construct falls back to CPU mode, the number of teams is one. Support for teams construct clauses is as follows.
  • The num_teams, thread_limit, default, private, and firstprivate clauses are supported
  • The reduction clause is supported as described in 2.19.5
  • The shared clause is supported for CPU targets and is supported for GPU targets in unified-memory mode
  • The allocate clause is ignored

2.8 Worksharing Constructs

2.8.1 Sections Construct

The sections construct is supported only for CPU targets; the compiler emits an error for GPU targets. Support for sections construct clauses is as follows.
  • The private and firstprivate clauses are supported
  • The reduction clause is supported as described in 2.19.5
  • The lastprivate clause is supported; the optional lastprivate modifier is not supported
  • The allocate clause is ignored

2.8.2 Single Construct

Support for single construct clauses is as follows.
  • The private, firstprivate, and nowait clauses are supported
  • The copyprivate clause is supported only for CPU targets; the compiler emits an error for GPU targets
  • The allocate clause is ignored

2.8.3 Workshare Construct

The workshare construct is supported in Fortran only for CPU targets; the compiler emits an error for GPU targets.

2.9 Loop-Related Constructs

2.9.2 Worksharing-Loop Construct (for/do)

Support for worksharing for and do construct clauses is as follows.
  • The private, firstprivate, and collapse clauses are supported
  • The reduction clause is supported as described in 2.19.5
  • The schedule clause is supported; the optional modifiers are not supported
  • The lastprivate clause is supported; the optional lastprivate modifier is not supported
  • The ordered clause is supported only for CPU targets; ordered(n) clause is not supported
  • The linear clause is not supported
  • The order(concurrent) clause is ignored
  • The allocate clause is ignored

2.9.3 SIMD Directives

The simd construct can be used to provide tuning hints for CPU targets; the simd construct is ignored for GPU targets. Support for simd construct clauses is as follows.
  • The reduction clause is supported as described in 2.19.5
  • The lastprivate clause is supported; the optional lastprivate modifier is not supported
  • The if, simdlen, and linear clauses are not supported
  • The safelen, aligned, nontemporal, and order(concurrent) clauses are ignored

The composite forsimd and dosimd constructs are supported for CPU targets; they are treated as for and do directives for GPU targets. Supported simd clauses are supported on the composite constructs for the CPU. Any simd clauses are ignored for GPU targets.

The declaresimd directive is ignored.

2.9.4 Distribute Directives

The distribute construct is supported within a teams construct. Support for distribute construct clauses is as follows:
  • The private, firstprivate, collapse, and dist_schedule(static [ ,chunksize]) clauses are supported
  • The lastprivate clause is not supported
  • The allocate clause is ignored

The distributesimd construct is treated as a distribute construct and is supported for GPU targets; valid supported distribute clauses are accepted; simd clauses are ignored. The distributesimd construct is not supported for CPU targets.

The distributeparallelfor or distributeparalleldo constructs are supported for GPU targets. Valid supported distribute and parallel and for or do clauses are accepted. The distributeparallelfor or distributeparalleldo constructs are not supported for CPU targets.

The distributeparallelforsimd or distributeparalleldosimd constructs are treated as distributeparallelfor or distributeparalleldo constructs and are supported for GPU targets. These are not supported for CPU targets.

2.9.5 Loop Construct

Support for loop construct clauses is as follows.
  • The private, bind, and collapse clauses are supported
  • The reduction clause is supported as described in 2.19.5
  • The order(concurrent) clause is assumed
  • The lastprivate clause is not supported

2.10 Tasking Constructs

2.10.1 Task Construct

The task construct is supported for CPU targets. The compiler emits an error when it encounters task within a target construct. Support for task construct clauses is as follows:
  • The if, final, default , private, firstprivate, and shared clauses are supported
  • The depend([dependmodifier,] dependtype : list) clause is supported as described in 2.17.11

2.10.4 Taskyield Construct

The taskyield construct is supported for CPU targets; it is ignored for GPU targets.

2.11 Memory Management Directives

The memory management allocators, memory management API routines, and memory management directives are not supported

2.12 Device Directives

2.12.1 Device Initialization

Depending on how the program is compiled and linked, device initialization may occur at the first target construct or API routine call, or may occur implicitly at program startup.

2.12.2 Target Data Construct

The targetdata construct is supported for GPU targets. Support for targetdata construct clauses is as follows.
  • The if, device, use_device_ptr, and use_device_addr clauses are supported
  • The map clause is supported as described in 2.19.7

2.12.3 Target Enter Data Construct

The targetenterdata construct is supported for GPU targets. Support for enterdata construct clauses is as follows.
  • The if, device, and nowait clauses are supported
  • The map clause is supported as described in 2.19.7.
  • The depend([dependmodifier,] dependtype : list) clause is supported as described in 2.17.11

2.12.4 Target Exit Data Construct

The targetexitdata construct is supported for GPU targets. Support for exitdata construct clauses is as follows.
  • The if, device, and nowait clauses are supported
  • The map clause is supported as described in 2.19.7.
  • The depend([dependmodifier,] dependtype : list) clause is supported as described in 2.17.11

2.12.5 Target Construct

The target construct is supported for GPU targets. If there is no GPU or GPU offload is otherwise disabled, execution falls back to CPU mode. Support for target construct clauses is as follows:
  • The if, private, firstprivate, is_device_ptr, and nowait clauses are supported
  • The device clause is supported without the device-modifier ancestor keyword
  • The map clause is supported as described in 2.19.7
  • The defaultmap clause is supported using OpenMP 5.0 semantics
  • The depend([dependmodifier,] dependtype : list) clause is supported as described in 2.17.11
  • The allocate and uses_allocate clauses are ignored

2.12.6 Target Update Construct

The targetupdate construct is supported for GPU targets. Support for targetupdate construct clauses is as follows.
  • The if, device, and nowait clauses are supported.
  • The to and from clauses are supported without mapper or mapid
  • The depend([dependmodifier,] dependtype : list) clause is supported as described in 2.17.11

Array sections are supported in to and from clauses, including noncontiguous array sections. Array section strides are not supported. If the array section is noncontiguous, the OpenMP runtime may have to use multiple host-to-device or device-to-host data transfer operations, which increases the overhead. If the host data is in host-pinned memory, then update data transfers with the nowait clause are asynchronous. This means the data transfer for a targetupdatetonowait may not occur immediately or synchronously with the program thread, and any changes to the data may affect the transfer, until a synchronizing operation is reached. Similarly, a targetupdatefromnowait may not occur immediately or synchronously with the program thread, and the downloaded data may not be available until a synchronizing operation is reached. If the host data is not in host-pinned memory, then update data transfers with the nowait clause may require that the data transfer operation use an intermediate pinned buffer managed by the OpenMP runtime library, and that a memory copy operation on the host between the program memory and the pinned buffer may be needed before starting or before finishing the transfer operation, which affects overhead and performance. To learn more about the pinned buffer, please refer to Staging Memory Buffer.

2.12.7 Declare Target Construct

The declaretarget construct is supported for GPU targets.
  • declare target ... end declare target is supported
  • declare target(list) is supported
  • The to(list) clause is supported
  • The device_type clause is supported for C/C⁠+⁠+
A function or procedure that is referenced in a function or procedure that appears in a declaretargetto clause (explicitly or implicitly) is treated as if its name had implicitly appeared in a declaretargetto clause.

2.13 Combined Constructs

Combined constructs are supported to the extent that the component constructs are themselves supported.

2.14 Clauses on Combined and Composite Constructs

Clauses on combined constructs are supported to the extent that the clauses are supported on the component constructs.

2.16 Master Construct

The master construct is supported for CPU and GPU targets.

2.17 Synchronization Constructs and Clauses

2.17.1 Critical Construct

The critical construct is supported only for CPU targets; the compiler emits an error for GPU targets.

2.17.2 Barrier Construct

The barrier construct is supported.

2.17.3 Implicit Barriers

Implicit barriers are implemented.

2.17.4 Implementation-Specific Barriers

There may be implementation-specific barriers, and they may be different for CPU targets than for GPU targets.

2.17.5 Taskwait Construct

The taskwait construct is supported only for CPU targets; it is ignored for GPU targets.
  • The depend([dependmodifier,] dependtype : list) clause is supported as described in 2.17.11

2.17.6 Taskgroup Construct

The taskgroup construct is supported only for CPU targets. It is ignored for GPU targets.

2.17.7 Atomic Construct

Support for atomic construct clauses is as follows.
  • The read, write, update, and capture clauses are supported.
  • The memory order clauses seq_cst, acq_rel, release, acquire, relaxed are not supported
  • The hint clause is ignored

2.17.8 Flush Construct

The flush construct is supported only for CPU targets.

2.17.9 Ordered Construct and Ordered Directive

The ordered block construct is supported only for CPU targets.

2.17.11 Depend Clause

The depend clause is supported on CPU targets. It is not supported on GPU targets. The dependence types in, out, and inout are supported. The dependence types mutexinoutset and depobj, dependence modifier iterator(iters), depend(source), and depend(sink:vector) are not supported.

2.19 Data Environment

2.19.2 Threadprivate Directive

The threadprivate directive is supported only for CPU targets. It is not supported for GPU targets; references to threadprivate variables in device code are not supported.

2.19.5 Reduction Clauses and Directives

The reduction clause is supported. The optional modifier is not supported.

2.19.6 Data Copying Clauses

The data copying copyin and copyprivate clauses are supported only for CPU targets; the compiler emits a compile-time error for GPU targets.

2.19.7 Data Mapping Attribute Rules, Clauses, and Directives

  • The map([[mapmod[,]...] maptype:] datalist) clause is supported. Of the map-type-modifiers, always is supported, close is ignored, and mapper(mapid) is not supported.
  • The defaultmap clause is supported using OpenMP 5.0 semantics.

2.20 Nesting of Regions

For constructs supported in this subset, restrictions on nesting of regions is observed. Additionally, nested parallel regions on CPU are not supported and nested teams or parallel regions in a target region are not supported.

Runtime Library Routines

3.2 Execution Environment Routines

The following execution environment runtime API routines are supported.

  • omp_set_num_threads, omp_get_num_threads, omp_get_max_threads, omp_get_thread_num, omp_get_thread_limit, omp_get_supported_active_levels, omp_set_max_active_levels, omp_get_max_active_levels, omp_get_level, omp_get_ancestor_thread_num, omp_get_team_size, omp_get_num_teams, omp_get_team_num, omp_is_initial_device

The following execution environment runtime API routines are supported only on the CPU.

  • omp_get_num_procs, omp_set_dynamic, omp_get_dynamic, omp_set_schedule, omp_get_schedule, omp_in_final, omp_get_proc_bind, omp_get_num_places, omp_get_affinity_format, omp_set_default_device, omp_get_default_device, omp_get_num_devices, omp_get_device_num, omp_get_initial_device

The following execution environment runtime API routines have limited support.

  • omp_get_cancellation, omp_get_nested; supported only on the CPU; the value returned is always false
  • omp_display_affinity, omp_capture_affinity; supported only on the CPU; the format specifier is ignored
  • omp_set_nested; supported only on the CPU, the value is ignored

The following execution environment runtime API routines are not supported.

  • omp_get_place_num_procs, omp_get_place_proc_ids, omp_get_place_num, omp_get_partition_num_places, omp_get_partition_place_nums, omp_set_affinity_format, omp_get_max_task_priority, omp_pause_resource, omp_pause_resource_all

3.3 Lock Routines

Lock runtime API routines are not supported on the GPU. The following lock runtime API routines are supported on the CPU.

  • omp_init_lock, omp_init_nest_lock, omp_destroy_lock, omp_destroy_nest_lock, omp_set_lock, omp_set_nest_lock, omp_unset_lock, omp_unset_nest_lock, omp_test_lock, omp_test_nest_lock

The following lock runtime API routines are not supported.

  • omp_init_lock_with_hint, omp_init_nest_lock_with_hint

3.4 Timing Routines

The following timing runtime API routines are supported.

  • omp_get_wtime, omp_get_wtick

3.6 Device Memory Routines

The following device memory routines are supported only on the CPU.

  • omp_target_is_present, omp_target_associate_ptr, omp_target_disassociate_ptr
  • omp_target_memcpy and omp_target_memcpy_rect are only supported when copying to and from the same device.

The following device memory routines are supported on the CPU; we extend OpenMP to support these in target regions on a GPU, but only allocation and deallocation on the same device is supported.

  • omp_target_alloc, omp_target_free

3.7 Memory Management Routines

The following memory management routines are supported.

  • omp_alloc, omp_free

The following memory management routines are not supported.

  • omp_init_allocator, omp_destroy_allocator, omp_set_default_allocator, omp_get_default_allocator

6 Environment Variables

The following environment variables have limited support.

  • OMP_SCHEDULE, OMP_NUM_THREADS, OMP_NUM_TEAMS, OMP_DYNAMIC, OMP_PROC_BIND, OMP_PLACES, OMP_STACKSIZE, OMP_WAIT_POLICY, OMP_MAX_ACTIVE_LEVELS, OMP_NESTED, OMP_THREAD_LIMIT, OMP_TEAMS_THREAD_LIMIT, OMP_DISPLAY_ENV, OMP_DISPLAY_AFFINITY, OMP_DEFAULT_DEVICE, and OMP_TARGET_OFFLOAD are supported on CPU.
  • OMP_CANCELLATION and OMP_MAX_TASK_PRIORITY are ignored.
  • OMP_AFFINITY_FORMAT, OMP_TOOL, OMP_TOOL_LIBRARIES, OMP_DEBUG, and OMP_ALLOCATOR are not supported

7.5. Using metadirective

This section contains limitations affecting metadirective along with a few guidelines for its use.

The Fortran compiler does not support variants leading to an OpenMP directive for which a corresponding end directive is required.

Nesting user conditions, while legal, may create situations that the HPC Compilers do not handle gracefully. To avoid potential problems, use device traits inside user conditions instead. The following example illustrates this best practice.

Avoid nesting dynamic user conditions like this:

#pragma omp metadirective \
  when( user={condition(use_offload)} : target teams distribute) \
  default( parallel for schedule(static) )
  for (i = 0; i < N; i++) {
    ...
#pragma omp metadirective \
  when( user={condition(use_offload)} : parallel for)
    for (j = 0; j < N; j++) {
      ...
    }
    ...
  }

Instead, use target_device and device traits within dynamic user conditions like this:

#pragma omp metadirective \
  when( target_device={kind(gpu)}, user={condition(use_offload)} : target teams distribute) \ 
  default( parallel for schedule(static) )
  for (i = 0; i < N; i++) {
    ...
#pragma omp metadirective \
  when( device={kind(gpu)} : parallel for)
    for (j = 0; j < N; j++) {
      ...
    }
    ...
  }

The HPC compilers do not support nesting metadirective inside a target construct applying to a syntactic block leading to a teams variant. Some examples:

The compilers will emit an error given the following code:

#pragma omp target map(to:v1,v2) map(from:v3)
{
#pragma omp metadirective \
when( device={arch("nvptx")} : teams distribute parallel for) \
default( parallel for)
  for (int i = 0; i < N; i++) {
    v3[i] = v1[i] * v2[i];
  }
}

The compilers will always match device={arch("nvptx")} given the following code:

#pragma omp target map(to:v1,v2) map(from:v3)
#pragma omp metadirective \
when( device={arch("nvptx")} : teams distribute parallel for) \
default( parallel for)
  for (int i = 0; i < N; i++) {
    v3[i] = v1[i] * v2[i];
  }

The compilers match device={"arch") for GPU code, and default for host fallback, given the following code:

#pragma omp target teams distribute map(to:v1,v2) map(from:v3)
for (...)
{
#pragma omp metadirective \
when( device={arch("nvptx")} : parallel for) \
default( simd )
  for (int i = 0; i < N; i++) {
    v3[i] = v1[i] * v2[i];
  }
}

7.6. Mapping target constructs to CUDA streams

An OpenMP target task generating construct is executed on the GPU in a CUDA stream. The following are target task generating constructs:

  • target enter data
  • target exit data
  • target update
  • target

This section explains how these target constructs are mapped to CUDA streams. The relationship with the OpenACC queues is also explained below.

Keep in mind that the target data construct does not generate a task and is not necessarily executed in a CUDA stream. It also cannot have the depend and nowait clauses, thus its behavior cannot be directly controlled by the user application. The rest of this section does not cover the behavior of the target data construct.

Any task-generating target construct can have depend and nowait clauses. The NVIDIA OpenMP Runtime takes these clauses as a guidance for how to map the construct to a specific CUDA stream. Below is a breakdown of how the clauses affect the mapping decisions.

'target' without 'depend', without 'nowait'

For these constructs, the per-thread default CUDA stream is normally used. The stream is unique for each host thread, so target regions created by different host threads will execute independently in different streams according to the CUDA rules described in CUDA Runtime API; see the rules in the "Per-thread default stream" section.

The OpenACC queue acc_async_sync is initially associated with the same per-thread default CUDA stream. The user is allowed to change the association by calling acc_set_cuda_stream(acc_async_sync, stream). This will change accordingly the stream used for target without nowait.

The CUDA stream handle can be directly obtained via the ompx_get_cuda_stream(int device, int nowait) function, with the nowait parameter set to 0. The per-thread default stream can be obtained with the CUDA handle CU_STREAM_PER_THREAD or cudaStreamPerThread.

Here is an example of how a custom CUDA stream can be used to substitute the default stream:

extern __global__ void kernel(int *data);

  CUstream stream;
  cuStreamCreate(&stream, CU_STREAM_DEFAULT);
  acc_set_cuda_stream(acc_async_sync, stream); 
#pragma omp target enter data map(to:data[:N])
#pragma omp target data use_device_ptr(data)
  kernel<<<N/32, 32, 0, stream>>>(data);
#pragma omp target teams distribute parallel for
  for (int i = 0; i < N; i++) {
    data[i]++;
  }
#pragma omp target exit data map(from:data[:N])

Note there is no explicit stream synchronization after the CUDA kernel is launched. The stream is synchronized automatically at the target constructs that follow.

'target' with 'depend', without 'nowait'

For this construct, the runtime will block the current thread until all dependencies listed in the depend clause are resolved. Then, the target construct will be executed in the default per-thread CUDA stream as described in the previous section (that is, as if there is no depend clause).

'target' with 'nowait', without 'depend'

By default, the runtime will select a CUDA stream for each new target nowait construct. The selected stream may be the same that was used for a prior target nowait construct. That is, there is no guarantee of uniqueness of the selected stream.

This is different from the OpenACC model that uses the same CUDA stream associated with the acc_async_noval queue for any asynchronous construct with the async clause without an argument. To change this behavior, the user can call the ompx_set_cuda_stream_auto(int enable) function with the enable parameter set to 0. In this case, the CUDA stream associated with the acc_async_noval OpenACC queue will be used for all OpenMP target nowait constructs. Another way to enable this behavior is to set the environment variable NVCOMPILER_OMP_AUTO_STREAMS to FALSE.

To access the stream used for the next target nowait construct, the user can call the ompx_get_cuda_stream(int device, int nowait) function, with the nowait parameter set to 1.

'target' with both 'depend' and 'nowait'

The decision on which CUDA stream to use in this case relies on previously scheduled target and host tasks sharing a subset of the dependencies listed in the depend clause:

  • If the target construct has only one dependency, which is of the type inout or out, and that dependency maps to a previously scheduled target depend(...) nowait construct, and the same device is used for both target constructs, then the CUDA stream which the previous target task was scheduled to will be used.
  • Otherwise, a CUDA stream will be selected for this target construct according to the stream selection policy.

Note that target constructs with a single in dependency can be scheduled on a newly selected CUDA stream. This is to allow parallel execution of multiple target nowait constructs that depend on data produced by another previously scheduled target nowait construct.

Here is a simplified example of how a target construct, a CUDA library function and a CUDA kernel can be executed on the GPU in the same stream asynchronously with respect to the host thread:

extern __global__ void kernel(int *data); 

cudaStream_t stream = (cudaStream_t)ompx_get_cuda_stream(omp_get_default_device(), 1);
cufftSetStream(cufft_plan, stream);

#pragma omp target enter data map(to:data[:N]) depend(inout:stream) nowait
#pragma omp target data use_device_ptr(data)
  {
    kernel<<<N/32, 32, 0, stream>>>(data);
    cufftExecC2C(cufft_plan, data, data, CUFFT_FORWARD);
  }
#pragma omp target teams distribute parallel for depend(inout:stream) nowait
  for (int i = 0; i < N; i++) {
    data[i]++;
  }
#pragma omp target exit data map(from:data[:N]) depend(inout:stream) nowait

Note that the stream variable holds the CUDA stream handle and also serves as the dependency for the target constructs. This dependency enforces the order of execution and also guarantees the target constructs are on the same stream that was returned from the ompx_get_cuda_stream function call.

NVIDIA OpenMP API to access and control CUDA streams

NVIDIA OpenMP Runtime provides the following API to access CUDA streams and to control their use.

void *ompx_get_cuda_stream(int device, int nowait);

This function returns the handle of the CUDA stream that will be used for the next target construct:

  • If the nowait parameter is set to 0, it returns the CUDA stream associated with the OpenACC queue acc_async_sync, which is initially mapped to the default per-thread CUDA stream;
  • Otherwise, it returns a CUDA stream which will be used for the next target nowait construct that cannot be mapped to an existing stream according to the rules for the depend clause.
void ompx_set_cuda_stream_auto(int enable);

This function sets the policy for how CUDA streams are selected for target nowait constructs:

  • If the enable parameter is set to a non-zero value, an internally selected CUDA stream will be used for each target nowait construct that follows. This is the default behavior;
  • Otherwise, the CUDA stream associated with the OpenACC queue acc_async_noval will be used for all target nowait constructs that follow. This becomes the default behavior if the environment variable NVCOMPILER_OMP_AUTO_STREAMS is set to FALSE.

The setting is done only for the host thread which calls this function.

7.7. Noncontiguous Array Sections

Array sections can be used in to and from clauses, including noncontiguous array sections. The noncontiguous array section must be specified in a single map clause; it cannot be split between multiple directives. Although this feature may become a part of a future OpenMP specification, at this time it is an NVIDIA HPC compilers extension.

7.8. OpenMP with CUDA Unified Memory

This section will focus on OpenMP unified shared memory programming, and assume users are familiar with Separate, Managed, and Unified Memory Modes explained in the Memory Model and Managed and Unified Memory Modes sections. OpenMP unified shared memory corresponds to Unified Memory Mode in NVHPC Compilers and it can be enabled with -⁠gpu=mem:unified flag. Source code with requires unified_shared_memory directive is accepted but requires -⁠gpu=mem:unified flag to activate Unified Memory Mode.

In Unified Memory Mode, map clauses on target constructs are optional. Additionally, declare target directives are optional for variables with static storage duration accessed inside functions to which such directive is applied. The OpenMP unified shared memory eases accelerator programming on the GPUs removing the need for data management and only requiring to express the parallelism in the compute regions.

In Unified Memory Mode, all data is managed by the CUDA runtime. Explicit data map clauses which manage the data movement across the host and devices become optional. All variables are accessible from the OpenMP offload compute regions executing on the GPU. The map clause with alloc, to,from, and tofrom type will not result in any device allocation or data transfer. The OpenMP runtime, however, may leverage such clauses to communicate preferable data placement to the CUDA runtime by means of memory hint APIs as elaborated in the following blog post on the NVIDIA website: Simplifying GPU Application Development with Heterogeneous Memory Management. Device memory can be allocated or deallocated in OpenMP programs in Unified Memory Mode by using the omp_target_alloc and omp_target_free API calls. Please, note that the memory allocated through omp_target_alloc cannot be accessed by the host.

Understanding Data Movement

When the compiler encounters a compute construct without visible target data directives or map clauses, it attempts to determine what data is required for correct execution of the region on the GPU. When the compiler is unable to determine the size and shape of data needing to be accessible on the device, it behaves as follows:

  • In Separate Memory Mode, the compiler may not be able to alert you to the need for an explicit data clause specifying size and/or shape of data being copied to/from the GPU. In this case, the default length of one may be used. This may cause illegal memory access errors at runtime on the GPU devices.
  • In Managed Memory Mode (-⁠gpu=mem:managed), the compiler assumes the data is allocated in managed memory and thus is accessible from the device; if this assumption is wrong, for example, if the data was defined globally or is located on the CPU stack, the program may fail at runtime.
  • In Unified Memory Mode (-⁠gpu=mem:unified), all data is accessible from the device making information about size and shape unnecessary.

Take the following example in C:

#pragma omp declare target
void set(int* ptr, int i, int j, int dim){
  int idx = i * dim + j;
  return ptr[idx] = someval(i, j);
}
#pragma omp end declare target

void fill2d(int* ptr, int dim){
#pragma omp target teams distribute parallel for
  for (int i = 0; i < dim; i++)
    for (int j = 0; j < dim; j++)
      set(ptr, i, j, dim);
}
      

In Separate Memory Mode, the only way to guarantee correctness for this example is to specify an array section in the target construct as follows:

#pragma omp target teams distribute parallel for map(from: ptr[0:dim*dim])
      

This change explicitly instructs the OpenMP implementation about the precise data segment used within the target for loop.

In Unified Memory Mode, the map clause is not required.

The next example, in Fortran, illustrates how a global variable can be accessed in an OpenMP routine without requiring any explicit annotation.

module m
integer :: globmin = 1234
contains
subroutine findmin(a)
!$omp declare target
  integer, intent(in)  :: a(:)
  integer :: i
  do i = 1, size(a)
    if (a(i) .lt. globmin) then
      globmin = a(i)
    endif
  end do
end subroutine
end module m
      

Compile the example above for Unified Memory Mode:

nvfortran -mp=gpu -gpu=mem:unified example.f90
      

The source does not need any OpenMP directives to access module variable globmin, to either read or update its value, in the routine invoked from CPU and GPU. Moreover, any access to globmin will be made to the same exact instance of the variable from CPU and GPU; its value is synchronized automatically. In Separate or Managed Memory Modes, such behavior can only be achieved with a combination of OpenMP declare target and target update directives in the source code.

Migrating existing OpenMP applications written for Separate Memory Mode should, in most cases, be a seamless process requiring no source changes. Some data access patterns, however, may lead to different results produced during application execution in Unified Memory Mode. Applications which rely on having separate data copies in GPU memory to conduct temporary computations on the GPU -⁠-⁠ without maintaining data synchronization with the CPU -⁠-⁠ pose a challenge for migration to unified memory. For the following Fortran example, the value of variable c after the last loop will differ depending on whether the example is compiled with or without -⁠gpu=mem:unified.

b(:) = ...
c = 0

!$omp target data map(to: b) map(from: a) 
!$omp target distribute teams parallel for
do i = 1, N
  b(i) = b(i) * i
end do
!$omp target distribute teams parallel for
do i = 1, N
  a(i) =  b(i) + i
end do
!$omp end target data

do i = 1, N
  c = c + a(i) + b(i)
end do
      

Without Unified Memory, array b is copied into the GPU memory at the beginning of the OpenMP target data region. It is then updated in the GPU memory and used to compute elements of array a. As instructed by the data clause map(to:b), b is not copied back to the CPU memory at the end of the target data region and therefore its initial value is used in the computation of c. With -⁠mp=gpu -⁠gpu=mem:unified, the updated value of b in the first loop is automatically visible in the last loop leading to a different value of c at its end.

Additional complications may arise from the asynchronous execution as the use of unified shared memory may require extra synchronizations to avoid data races.

7.9. Multiple Device Support

A program can use multiple devices on a single node.

This functionality is supported using the omp_set_default_device API call and the device() clause on the target constructs. Our experience is that most programs use MPI parallelism with each MPI rank selecting a single GPU to which to offload. Some programs assign multiple MPI ranks to each GPU, in order to keep the GPU fully occupied, though the fixed memory size of the GPU limits how effective this strategy can be. Similarly, other programs use OpenMP thread parallelism on the CPU, with each thread selecting a single GPU to which to offload.

7.10. Interoperability with CUDA

The HPC Compilers support interoperability of OpenMP and CUDA to the same extent they support CUDA interoperability with OpenACC.

If OpenMP and CUDA code coexist in the same program, the OpenMP runtime and the CUDA runtime use the same CUDA context on each GPU. To enable this coexistence, use the compilation and linking option -⁠cuda. CUDA-allocated data is available for use inside OpenMP target regions with the OpenMP analog is_device_ptr to OpenACC's deviceptr() clause.

OpenMP-allocated data is available for use inside CUDA kernels directly if the data was allocated with the omp_target_alloc() API call; if the OpenMP data was created with a target data map clause, it can be made available for use inside CUDA kernels using the target data use_device_addr() clause. Calling a CUDA device function inside an OpenMP target region is supported, as long as the CUDA function is a scalar function, that is, does not use CUDA shared memory or any inter-thread synchronization. Calling an OpenMP declare target function inside a CUDA kernel is supported as long as the declare target function has no OpenMP constructs or API calls.

7.11. Interoperability with Other OpenMP Compilers

OpenMP CPU-parallel object files compiled with NVIDIA's HPC compilers are interoperable with OpenMP CPU-parallel object files compiled by other compilers using the KMPC OpenMP runtime interface. Compilers supporting KMPC OpenMP include Intel and CLANG. The HPC compilers support a GNU OpenMP interface layer as well which provides OpenMP CPU-parallel interoperability with the GNU compilers.

For OpenMP GPU computation, there is no similar formal or informal standard library interface for launching GPU compute constructs or managing GPU memory. There is also no standard way to manage the device context in such a way as to interoperate between multiple offload libraries. The HPC compilers therefore do not support interoperability of device compute offload operations and similar operations generated with another compiler.

7.12. GNU STL

When using nvc⁠+⁠+ on Linux, the GNU STL is thread-safe to the extent listed in the GNU documentation as required by the C⁠+⁠+11 standard. If an STL thread-safe issue is suspected, the suspect code can be run sequentially inside of an OpenMP region using #pragma omp critical sections.

8. Using Stdpar

This chapter describes the NVIDIA HPC Compiler support for standard language parallelism, also known as Stdpar:

  • ISO C⁠+⁠+ standard library parallel algorithms with nvc⁠+⁠+
  • ISO Fortran do concurrent loop construct with nvfortran

Use the -⁠stdpar compiler option to enable parallel execution with standard parallelism. The sub-options to -⁠stdpar are the following:

  • gpu: compile for parallel execution on GPU; this sub-option is the default. This feature is supported on the NVIDIA Pascal architecture and newer.
  • multicore: compile for multicore CPU execution.
  • gpu,multicore: compile for parallel execution on GPU and CPU; if execution platform has any GPU, the code will be offloaded to run on GPU. Otherwise, fall back to multicore CPU execution.

NVC++ supports the additional -⁠stdpar sub-options detailed in Enabling Parallel Algorithms with the -⁠stdpar Option.

By default, the compiler auto-detects and generates GPU code for the type of GPU that is installed on the system on which the compiler is running. To generate code for a specific GPU architecture, which may be necessary when the application is compiled and run on different systems, add the -⁠gpu=ccXX command-line option. More details can be found in Compute Capability.

Predefined Macros

The following macros corresponding to the parallel execution target compiled for are added implicitly:
  • __NVCOMPILER_STDPAR_GPU for parallel execution on GPU.
  • __NVCOMPILER_STDPAR_MULTICORE for parallel execution on multicore CPU.

8.1. GPU Memory Modes

When compiling for GPU execution, Stdpar utilizes Managed and Unified Memory Modes for managing data accessed from the sequential code running on CPU and from the parallel code on GPU.

The compiler detects the memory capability of the system on which the compiler is running and uses that information to enable the correct memory mode as follows:

  • When compiled on the platform with full CUDA Unified Memory capability, -⁠stdpar implies -⁠gpu=mem:unified.
  • When compiled on the platform with CUDA Managed Memory capability only, -⁠stdpar implies -⁠gpu=mem:managed.

To compile code for a specific Memory Mode regardless of the memory capability of the system on which you are compiling, add the desired -⁠gpu=mem:unified or -⁠gpu=mem:managed option.

Stdpar with Separate Memory Mode can only be supported when the data are fully managed through features of other programming models e.g. OpenACC.

All restrictions on variables used on the GPU in standard language parallel code in Managed Memory Mode have been removed when using Unified Memory Mode.

If the compiler utilises CUDA Managed Memory automatically, the interception of deallocations is enabled implicitly at runtime. This is to prevent deallocating the data with unmatching API which may lead to undefined behavior. The interception incurs some runtime overhead and may be unnecessary if allocatations and deallocations for all data in the application are performed using the matching APIs. The interception can be disabled using dedicated command-line options detailed in Interception of Deallocations. More details about the memory modes supported by the NVIDIA HPC Compilers and dedicated command-line options can be found in Memory Model.

8.2. Stdpar C++

The NVIDIA HPC C⁠+⁠+ compiler, NVC⁠+⁠+, supports C⁠+⁠+ Standard Language Parallelism (Stdpar) for execution on NVIDIA GPUs and multicore CPUs. As mentioned previously, use the NVC⁠+⁠+ command-line option -⁠stdpar to enable GPU accelerated C⁠+⁠+ Parallel Algorithms. The following sections go into more detail about the NVC⁠+⁠+ support for the ISO C⁠+⁠+ Standard Library Parallel Algorithms.

8.2.1.  Introduction to Stdpar C++

The C⁠+⁠+17 Standard introduced higher-level parallelism features that allow users to request parallelization of Standard Library algorithms.

This higher-level parallelism is expressed by adding an execution policy as the first parameter to any algorithm that supports execution policies. Most of the existing Standard C++ algorithms were enhanced to support execution policies. C⁠+⁠+17 defined several new parallel algorithms, including the useful std::reduce and std::transform_reduce.

C⁠+⁠+17 defines three execution policies:

  • std::execution::seq: Sequential execution. No parallelism is allowed.
  • std::execution::par: Parallel execution on one or more threads.
  • std::execution::par_unseq: Parallel execution on one or more threads, with each thread possibly vectorized.

When you use an execution policy other than std::execution::seq, you are communicating two important things to the compiler:

  • You prefer but do not require that the algorithm be run in parallel. A conforming C⁠+⁠+17 implementation may ignore the hint and run the algorithm sequentially, but a performance-oriented implementation takes the hint and executes in parallel when possible and prudent.
  • The algorithm is safe to run in parallel. For the std::execution::par and std::execution::par_unseq policies, any user-provided code—such as iterators, lambdas, or function objects passed into the algorithm—must not introduce data races if run concurrently on separate threads. For the std::execution::par_unseq policy, any user-provided code must not introduce data races or deadlocks if multiple calls are interleaved on the same thread, which is what happens when a loop is vectorized. For more information about potential deadlocks, see the forward progress guarantees provided by the parallel policies or watch CppCon 2018: Bryce Adelstein Lelbach “The C++ Execution Model”.

The C⁠+⁠+ Standard grants compilers great freedom to choose if, when, and how to execute algorithms in parallel as long as the forward progress guarantees the user requests are honored. For example, std::execution::par_unseq may be implemented with vectorization and std::execution::par may be implemented with a CPU thread pool. It is also possible to execute parallel algorithms on a GPU, which is a good choice for invocations with sufficient parallelism to take advantage of the processing power and memory bandwidth of NVIDIA GPU processors.

8.2.2.  NVC++ Compiler Parallel Algorithms Support

NVC⁠+⁠+ supports C⁠+⁠+ Standard Language Parallelism with the parallel execution policies std::execution::par or std::execution::par_unseq for execution on GPUs or multicore CPUs.

Lambdas, including generic lambdas, are fully supported in parallel algorithm invocations. No language extensions or non-standard libraries are required to enable GPU acceleration. All data movement between host memory and GPU device memory is performed implicitly and automatically under the control of Managed and Unified Memory Modes.

It's straightforward to automatically GPU accelerate C⁠+⁠+ Parallel Algorithms with NVC⁠+⁠+. However, there are some restrictions and limitations you need to be aware of as explained below.

8.2.2.1.  Enabling Parallel Algorithms with the -⁠stdpar Option

GPU acceleration of C⁠+⁠+ Parallel Algorithms is enabled with the -⁠stdpar=gpu command-line option to NVC++. If -⁠stdpar=gpu is specified (or -⁠stdpar without an argument), almost all algorithms that use a parallel execution policy are compiled for offloading to run in parallel on an NVIDIA GPU:

  nvc++ -stdpar=gpu program.cpp -o program 
  nvc++ -stdpar program.cpp -o program 
In addition, the GPU acceleration sub-option can be further specialized using -⁠stdpar=gpu:acc. This option directs the compiler to use its OpenACC implementation to GPU-accelerate a subset of algorithm with a parallel execution policy:
  nvc++ -stdpar=gpu:acc program.cpp -o program 
More details about the OpenACC support of Stdpar C++ is provided in OpenACC Implementation of Parallel Algorithms.

Acceleration of C⁠+⁠+ Parallel Algorithms with multicore CPUs is enabled with the -⁠stdpar=multicore command-line option to NVC⁠+⁠+. If -⁠stdpar=multicore specified, almost all algorithms that use a parallel execution policy are compiled to run on a multicore CPU:

  nvc++ -stdpar=multicore program.cpp -o program 

When either -⁠stdpar=gpu,multicore or -⁠stdpar=gpu:acc,multicore command-line options are specified to NVC++, the parallel algorithms code is compiled for both GPU and multicore CPU. When the execution platform has any GPU the binary executes on the GPU and otherwise on the multicore CPU.

 nvc++ -stdpar=gpu,multicore program.cpp -o program 
 nvc++ -stdpar=gpu:acc,multicore program.cpp -o program 

8.2.3.  Stdpar C++ Simple Example

Here are a few simple examples to get a feel for how the C++ Parallel Algorithms work.

From the early days of C++, sorting items stored in an appropriate container has been relatively easy using a single call like the following:

  std::sort(employees.begin(), employees.end(),
            CompareByLastName()); 

Assuming the comparison class CompareByLastName is thread-safe, which is true for most comparison functions, parallelizing this sort is simple with C++ Parallel Algorithms. Include <execution> and add an execution policy to the function call:

  std:sort(std::execution::par,
           employees.begin(), employees.end(),
           CompareByLastName()); 

Calculating the sum of all the elements in a container is also simple with the std::accumulate algorithm. Prior to C++17, transforming the data in some way while taking the sum was somewhat awkward. For example, to compute the average age of your employees, you might write the following code:

  int ave_age = 
    std::accumulate(employees.begin(), employees.end(), 0,
                    [](int sum, const Employee& emp){ 
                        return sum + emp.age();
                    })
    / employees.size(); 

The std::transform_reduce algorithm introduced in C++17 makes it simple to parallelize this code. It also results in cleaner code by separating the reduction operation, in this case std::plus, from the transformation operation, in this case emp.age():

  int ave_age =
    std::transform_reduce(std::execution::par_unseq,
                          employees.begin(), employees.end(),
                          0, std::plus<int>(),
                          [](const Employee& emp){ 
                              return emp.age();
                          }) 
    / employees.size(); 

8.2.4.  OpenACC Implementation of Parallel Algorithms

NVC++ has an experimental GPU support for a subset of algorithms with parallel execution policies std::par and std::par_unseq accelerated through the OpenACC implementation. This feature, enabled with the -stdpar=gpu:acc option, may result in better application performance on the GPU and faster compilation speed.

The following subset of algorithms have OpenACC implementation support:

  • std::for_each
  • std::for_each_n
  • std::transform

The following algorithms have OpenACC implementation support for scalar data types and the standard std::plus reduction operation:

  • std::reduce
  • std::transform_reduce

The remainder of the parallel algorithms are parallelized using the default GPU implementation as if -stdpar=gpu was specified.

When the code is compiled for GPU with the OpenACC acceleration __NVCOMPILER_STDPAR_OPENACC_GPU macro is defined implicitly.

8.2.5.  Coding Guidelines for GPU-accelerating Parallel Algorithms

GPUs are not simply CPUs with more threads. To effectively take advantage of the massive parallelism and memory bandwidth available on GPUs, it is typical for GPU programming models to put some limitations on code executed on the GPU. The NVC⁠+⁠+ implementation of C⁠+⁠+ Parallel Algorithms is no exception in this regard. The sections which follow detail the limitations that apply in the current release.

8.2.5.1.  Parallel Algorithms and Device Function Annotations

Functions to be executed on the GPU within parallel algorithms do not need any __device__ annotations or other special markings to be compiled for GPU execution. The NVC++ compiler walks the call graph for each source file and automatically infers which functions must be compiled for GPU execution.

However, this only works when the compiler can see the function definition in the same source file where the function is called. This is true for most inline functions and template functions but may fail when functions are defined in a different source file or linked in from an external library. You need to be aware of this when formulating parallel algorithms invocations that you expect to be offloaded and accelerated on NVIDIA GPUs.

When calling an externally defined function from within a parallel algorithm region, such functions require some form of device annotations from other GPU programming models e.g. OpenACC routine directive (refer to External Device Function Annotations for more information).

8.2.5.2. Data Management in Parallel Algorithms

When offloading parallel algorithms to a GPU, it's essential to consider how data is accessed from the parallel region. Some GPUs may not access certain segments of the CPU's address space. Developers targeting platforms without unified shared memory or those seeking to optimize performance must be aware of these memory distinctions, as they may affect the folowing types of data accessed in parallel algorithm regions:
  • Pointer data passed into lambda functions within the parallel algorithm.
  • Data captured by reference in lambda functions or pointer data captured by value.
  • Variables with static storage duration referenced inside the parallel algorithm.
To avoid memory access violations, developers must ensure that all of the above data is accessible to the GPU before the parallel algorithm is executed.

Stdpar C⁠+⁠+ only supports Managed and Unified Memory Modes which allow data being accessed from CPU and GPU. Through support in both the CUDA device driver and the NVIDIA GPU hardware, the CUDA Unified Memory manager automatically moves some types of data based on usage.

Stdpar with Separate Memory Mode can only be supported when the data are fully managed through the OpenACC data directives, refer to Interoperability with OpenACC.

Since object-oriented design is fundamental to C++, special consideration must be given to composite data types with pointer or reference members. The data referenced or pointed to may not be stored contiguously within the composite data type. Moreover, such data might not even be allocated in the same memory segment as the composite type itself. As a result, when accessing both the composite data type and its referenced or pointed-to data from parallel algorithms, the developer must ensure that the member data is also made accessible to the GPU. These considerations should also be taken into account when standard library containers are used in the parallel algorithms as the containers frequently contain member pointers to their elements.

The discussion in this section assumes familiarity with the Managed and Unified Memory Modes covered in Memory Model and Managed and Unified Memory Modes. The code executing within the parallel algorithm is referred to as the accelerator subprogram. In contrast to the code executing outside of the parallel algorithm which is referred to as the host subprogram.

Managed Memory Mode

When Stdpar code is compiled with Managed Memory Mode (as default mode or by passing -⁠gpu=mem:managed) only data dynamically allocated on the heap in CPU code can be managed automatically. CPU and GPU automatic storage (stack memory) and static storage (global or static data) cannot be automatically managed. Likewise, data that is dynamically allocated in program units not compiled by nvc⁠+⁠+ with the -⁠stdpar option is not automatically managed by CUDA Unified Memory even though it is on the CPU heap. The compiler utilizes CUDA Managed Memory for dynamic allocations to make data accessible from CPU and GPU. As managed memory allocation calls can incur higher runtime overhead than standard allocator calls, the implementation uses memory pools for performance reasons by default as detailed in Memory Pool Allocator.

The Managed Memory Mode is intended for binaries run on targets with CUDA Managed Memory capability only. Any pointer that is dereferenced and any C⁠+⁠+ object that is referenced within a parallel algorithm invocation must refer to data on the CPU heap that is allocated in a program unit compiled by nvc⁠+⁠+ with -⁠stdpar. Dereferencing a pointer to a CPU stack or a global object will result in a memory violation in GPU code.

Unified Memory Mode

When Unified Memory is the default memory mode or is selected explicitly on the command line by passing -⁠gpu=mem:unified, there are no restrictions on variables accessed in the parallel algorithms. Therefore, all CPU data (either residing on stack, heap, or globally) are simply accessible in the parallel algorithm functions. Note that memory dynamically allocated in GPU code is only visible from GPU code and can never be accessed by the CPU regardless of the CUDA Unified Memory capability.

When compiling a binary for platforms with full CUDA Unified Memory capability, only those source files using features from the standard parallel algorithms library must be compiled by nvc⁠+⁠+ with the -⁠stdpar option. There is no requirement that the code dynamically allocating memory accessed on GPU is also compiled in such a way.

Unified Memory Mode may utilize CUDA Managed Memory for dynamic allocation, more details can be found in Managed and Unified Memory Modes.

Summary

The following table provides a key summary of important command-line options selecting memory modes and the impact of memory modes on the Stdpar features.

Table 21. Stdpar C++ Feature Differences for Memory Modes
Command-line options Dynamically allocated variables outside of parallel algorithm region Automatic or static storage variables outside of parallel algorithm region Dynamic allocator
No memory-specific flags passed, compiling on target with CUDA Managed Memory only Can be accessed within parallel region code Cannot be accessed within parallel algorithm code cudaMallocManaged
No memory-specific flags passed, compiling on target with full CUDA Unified Memory Can be accessed within parallel region code Can be accessed within parallel algorithm code cudaMallocManaged or system allocators: new/malloc (compiler picks the most suitable allocator)

-⁠gpu=mem:managed

Can be accessed within parallel region code Cannot be accessed within parallel algorithm code cudaMallocManaged

-⁠gpu=mem:unified

Can be accessed within parallel region code Can be accessed within parallel algorithm code cudaMallocManaged or system allocators: new/malloc (compiler picks the most suitable allocator)

-⁠gpu=mem:unified:managedalloc

Can be accessed within parallel region code Can be accessed within parallel algorithm code cudaMallocManaged

-⁠gpu=mem:unified:nomanagedalloc

Can be accessed within parallel region code Can be accessed within parallel algorithm code System allocators: new/malloc
Examples

For example, std::vector uses dynamically allocated memory, which is accessible from the GPU when using Stdpar. Iterating over the contents of a std::vector in a parallel algorithm works as expected when compiling with either -⁠gpu=mem:managed or -⁠gpu=mem:unified:

std::vector<int> v = ...;
std::sort(std::execution::par, 
          v.begin(), v.end()); // Okay, accesses heap memory.
      

On the other hand, std::array performs no dynamic allocations. Its contents are stored within the std::array object itself, which is often on a CPU stack. Iterating over the contents of a std::array will not work on systems with only CUDA Managed Memory support unless the std::array itself is allocated on the heap and the code is compiled with -⁠gpu=mem:managed:

std::array<int, 1024> a = ...;
std::sort(std::execution::par, 
           a.begin(), a.end()); // Fails on targets with CUDA Managed
                                // Memory capability only, array is on
                                // a CPU stack inaccessible from GPU.
                                // Works correctly on targets whith full 
                                // CUDA Unified Memory support.
      

The above example works as expected when run on a target supporting full CUDA Unified Memory capability.

When executing on targets with CUDA Managed Memory capability only, pay particular attention to lambda captures, especially capturing data objects by reference, which may contain non-obvious pointer dereferences:

void saxpy(float* x, float* y, int N, float a) {
   std::transform(std::execution::par_unseq, x, x + N, y, y,
                  [&](float xi, float yi){ return a * xi + yi; });
}
      

In the earlier example, the containing function parameter a is captured by reference. The code within the body of the lambda, which is running on the GPU, tries to access a, which is in the CPU stack memory. This attempt results in a memory violation and undefined behavior. In this case, the problem can easily be fixed by changing the lambda to capture by value:

void saxpy(float* x, float* y, int N, float a) {
   std::transform(std::execution::par_unseq, x, x + N, y, y,
                  [=](float xi, float yi){ return a * xi + yi; });
}
      

With this one-character change, the lambda makes a copy of a, which is then copied to the GPU, and there are no attempts to reference CPU stack memory from GPU code. Such code will run correctly without requiring modifications on targets with full CUDA Unified Memory capability.

If std::vector is accessed through a subscript operator from the device this would require such a vector object to be accessible from the parallel code executing on the GPU. This means that the std::vector needs to be allocated dynamically in order to make it accessible from the GPU when compiled for the systems with only CUDA Managed Memory support.
std::vector<int> v = ...;
std::for_each(std::execution::par, 
              idx.begin(), idx.end(), [&](auto i)
              {v[i] = 1;}); // Fails on targets with CUDA Managed
                            // Memory capability only, vector object is on
                            // a CPU stack inaccessible from GPU.
                            // Works correctly on targets with full 
                            // CUDA Unified Memory support.
        
An alternative approach to managing the content of the std::vector on systems with CUDA Managed Memory support only would be to obtain a pointer to its elements data region using data() member.
std::vector<int> v = ...;
int* vdataptr = v.data();
std::for_each(std::execution::par, 
              idx.begin(), idx.end(), [&](auto i)
              {vdataptr[i] = 1;}); // Works, vector elements are in heap
                                   // memory
        

Whether -⁠gpu=mem:unified is enabled by default or passed explicitly on the command line, parallel algorithms can access global variables and accesses to global variables from CPU and GPU are kept in sync. Extra care should be taken when accessing global variables within parallel algorithms, as simultaneous updates in different iterations running on the GPU can lead to data races. The following example illustrates the safe update of a global variable in the parallel algorithm since the update only occurs in one iteration.

int globvar = 123;
void foo() {
  auto r = std::views::iota(0, N);
  std::for_each(std::execution::par_unseq, r.begin(), r.end(),
                [](auto i) {
                       if (i == N - 1)
                         globvar += 345;
          });
  // globvar is equal to 468.
}
      

8.2.5.3.  Parallel Algorithms and Function Pointers

Functions compiled to run on either the CPU or the GPU must be compiled into two different versions, one with the CPU machine instructions and one with the GPU machine instructions.

In the current implementation, a function pointer either points to the CPU or the GPU version of the functions. This causes problems if you attempt to pass function pointers between CPU and GPU code. You might inadvertently pass a pointer to the CPU version of the function to GPU code. In the future, it may be possible to automatically and seamlessly support the use of function pointers across CPU and GPU code boundaries, but it is not supported in the current implementation.

Function pointers can’t be passed to Parallel Algorithms to be run on the GPU, and functions may not be called through a function pointer within GPU code. For example, the following code example won’t work correctly:

  void square(int& x) { x = x * x; }
  void square_all(std::vector<int>& v) {
    std::for_each(std::execution::par_unseq,
                  v.begin(), v.end(), &square);
  }

It passes a pointer to the CPU version of the function square to a parallel for_each algorithm invocation. When the algorithm is parallelized and offloaded to the GPU, the program fails to resolve the function pointer to the GPU version of square.

You can often solve this issue by using a function object, which is an object with a function call operator. The function object's call operator is resolved at compile time to the GPU version of the function, instead of being resolved at run time to the incorrect CPU version of the function as in the previous example. For example, the following code example works:

  struct squared {
    void operator()(int& x) const { x = x * x; }
  };
  void square_all(std::vector<int>& v) {
    std::for_each(std::execution::par_unseq,
                  v.begin(), v.end(), squared{});
  }

Another possible workaround is to change the function to a lambda, because a lambda is implemented as a nameless function object:

  void square_all(std::vector<int>& v) {
    std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                  [](int& x) { x = x * x; });
  }

If the function in question is too big to be converted to a function object or a lambda, then it should be possible to wrap the call to the function in a lambda:

  void compute(int& x) {
    // Assume lots and lots of code here.
  }
  void compute_all(std::vector<int>& v) {
    std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                  [](int& x) { compute(x); });
  }

No function pointers are used in this example.

The restriction on calling a function through a function pointer unfortunately means passing polymorphic objects from CPU code to GPU-accelerated Parallel Algorithms is not currently supported, as virtual tables are implemented using function pointers.

8.2.5.4.  Random Access Iterators

The C++ Standard requires that the iterators passed to most C++ Parallel Algorithms be forward iterators. However, C++ Parallel Algorithms on GPUs only works with random access iterators. Passing a forward iterator or a bidirectional iterator to a GPU/CPU-accelerated Parallel Algorithm results in a compilation error. Passing raw pointers or Standard Library random access iterators to the algorithms has the best performance, but most other random-access iterators work correctly.

8.2.5.5.  Interoperability with the C++ Standard Library

Large parts of the C++ Standard Library can be used with stdpar on GPUs.

  • std::atomic<T> objects within GPU code work provided that T is a four-byte or eight-byte integer type.
  • Math functions that operate on floating-point types—such as sin, cos, log, and most of the other functions declared in <cmath> —can be used in GPU code and resolve to the same implementations that are used in CUDA C++ programs.
  • std::complex, std::tuple, std::pair, std::optional, std::variant, and <type_traits>, are supported and work as expected in GPU code.

The parts of the C++ Standard Library that aren’t supported in GPU code include I/O functions and in general any function that accesses the CPU operating system. As a special case, basic printf calls can be used within GPU code and leverage the same implementation that is used in NVIDIA CUDA C++.

8.2.5.6.  No Exceptions in GPU Code

As with most other GPU programming models, throwing and catching C++ exceptions is not supported within Parallel Algorithm invocations that are offloaded to the GPU.

Unlike some other GPU programming models where try/catch blocks and throw expressions are compilation errors, exception code does compile but with non-standard behavior. Catch clauses are ignored, and throw expressions abort the GPU kernel if actually executed. Exceptions in CPU code work without restrictions.

8.2.6.  NVC++ Experimental Features

nvc++ experimental features are enabled with the --experimental–stdpar compiler flag. Experimental feature headers are exposed via the <experimental/...> namespaces and limited support for these features is available in older C++ versions. Table 1 lists all experimental features available and the minimum language version required to use them.

Table 22. Experimental features information
Feature Recommended Limited support Standard proposal Other notes
Multi-dimensional spans (mdspan) C++23 C++17 P0009 https://github.com/NVIDIA/libcudacxx
Slices of multi-dimensional spans (submdspan) C++23 C++17 P2630 https://github.com/NVIDIA/libcudacxx
Multi-dimensional arrays (mdarray) C++23 C++17 P1684 https://github.com/kokkos/mdspan
Senders and receivers C++23 C++20 P2300 https://github.com/NVIDIA/stdexec
Linear algebra C++23 C++17 P1673 https://github.com/kokkos/stdblas

8.2.6.1.  Multi-dimensional Spans

Multi-dimensional spans (std::mdspan) enable customizable multi-dimensional access to data. This feature was added to C++23 (see P0009 and follow-on papers). A Gentle Introduction to mdspan gives a tutorial. The reference mdspan implementation https://github.com/kokkos/mdspan also has many useful examples.

nvc++ provides an implementation available in the <experimental/mdspan> namespace that works with C++17 or newer. It enables applications that are not targeting the C++23 version of the standard to use mdspan.

nvc++ also provides the P0009R17 version of submdspan, which only works for the mdspan layouts in C++23; that is, it does not implement C++26 submdspan (P2630) yet.

C++23’s mdspan uses operator[] for array access. For example, if A is a rank-2 mdspan, and i and j are integers, then A[i, j] accesses the element of A at row i and column j. Before C++23, operator[] was only allowed to take one argument. C++23 changed the language to permit any number of arguments (zero or more). nvc++ does not support this new language feature. As a result, the implementation of mdspan provided by nvc++ permits use of operator() as a fall-back (e.g., A(i, j) instead of A[i, j]). Users may enable this fall-back manually, by defining the macro MDSPAN_USE_PAREN_OPERATOR to 1 before including any mdspan headers.

The following example (godbolt):

#include <experimental/mdspan>
#include <iostream>


namespace stdex = std::experimental;


int main() {
 std::array d{
   0, 5, 1,
   3, 8, 4,
   2, 7, 6,
 };


 stdex::mdspan m{d.data(), stdex::extents{3, 3}};
 static_assert(m.rank()==2, "Rank is two");


 for (std::size_t i = 0; i < m.extent(0); ++i)
   for (std::size_t j = 0; j < m.extent(1); ++j)
     std::cout << "m(" << i << ", " << j << ") == " << m(i, j) << "\n";


 return 0;
}

is compiled as follows

nvc++ -std=c++17 -o example example.cpp

and outputs

m(0, 0) == 0
m(0, 1) == 5
m(0, 2) == 1
m(1, 0) == 3
m(1, 1) == 8
m(1, 2) == 4
m(2, 0) == 2
m(2, 1) == 7
m(2, 2) == 6

8.2.6.2.  Senders and Receivers

P2300 - std::execution proposes a model of asynchronous programming for adoption into the C++26 Standard. For an introduction to this feature, see Design - user side section of the proposal. The NVIDIA implementation of Senders and receivers is open source and its repository contains many useful examples. nvc++ provides access to the NVIDIA implementation which works in C++20 or newer. Since the proposal is still evolving, our implementation is not stable. It is experimental in nature and will change to follow the proposal closely without any warning. The NVIDIA implementation is structured as follows:

Includes Namespace Description
<stdexec/...> ::stdexec Approved for C++ standard
<sexec/...> ::exec Generic additions and extensions
<nvexec/...> ::nvexec NVIDIA-specific extensions and customizations

The following example (godbolt) builds a task graph in which two different vectors, v0 and v1, are concurrently modified in bulk, using a CPU thread pool and a GPU stream context, respectively. This graph then transfers execution to the CPU thread pool, and adds both vectors into v2 on the CPU, returning the sum of all elements:

int main()
{
  // Declare a pool of 8 worker CPU threads:
  exec::static_thread_pool pool(8);


  // Declare a GPU stream context:
  nvexec::stream_context stream_ctx{};


  // Get a handle to the thread pool:
  auto cpu_sched = pool.get_scheduler();
  auto gpu_sched = stream_ctx.get_scheduler();


  // Declare three dynamic array with N elements
  std::size_t N = 5;
  std::vector<int> v0 {1, 1, 1, 1, 1};
  std::vector<int> v1 {2, 2, 2, 2, 2};
  std::vector<int> v2 {0, 0, 0, 0, 0};


  // Describe some work:
  auto work = stdexec::when_all(
    // Double v0 on the CPU
    stdexec::just()
      | exec::on(cpu_sched,
                 stdexec::bulk(N, [v0 = v0.data()](std::size_t i) {
                   v0[i] *= 2;
      })),
    // Triple v1 on the GPU
    stdexec::just()
      | exec::on(gpu_sched,
                 stdexec::bulk(N, [v1 = v1.data()](std::size_t i) {
                   v1[i] *= 3;
    }))
  )
  | stdexec::transfer(cpu_sched)
  // Add the two vectors into the output vector v2 = v0 + v1:
  | stdexec::bulk(N, [&](std::size_t i) { v2[i] = v0[i] + v1[i]; })
  | stdexec::then([&] {
    int r = 0;
    for (std::size_t i = 0; i < N; ++i) r += v2[i];
      return r;
  });
  auto [sum] = stdexec::sync_wait(work).value();
  // Print the results:
  std::printf("sum = %d\n", sum);
  for (int i = 0; i < N; ++i) {
    std::printf("v0[%d] = %d, v1[%d] = %d, v2[%d] = %d\n",
                i, v0[i], i, v1[i], i, v2[i]);
  }
  return 0;
}

is compiled as follows:

nvc++ --stdpar=gpu --experimental-stdpar -std=c++20 -o example example.cpp

and outputs:

sum = 40
v0[0] = 2, v1[0] = 6, v2[0] = 8
v0[1] = 2, v1[1] = 6, v2[1] = 8
v0[2] = 2, v1[2] = 6, v2[2] = 8
v0[3] = 2, v1[3] = 6, v2[3] = 8
v0[4] = 2, v1[4] = 6, v2[4] = 8

8.2.6.3.  Linear Algebra

P1673 - A free function linear algebra interface based on the BLAS proposes standardizing an idiomatic C++ interface based on std::mdspan for a subset of the Basic Linear Algebra Subroutines (BLAS) standard. For an introduction to this feature, see P1673 (C++ linear algebra library) background & motivation. There are many useful examples available in $HPCSDK_HOME/examples/stdpar/stdblas and in the repository of the reference implementation. A detailed documentation is available at $HPCSDK_HOME/compilers/include/experimental/__p1673_bits/README.md. nvc++ provides access to the NVIDIA implementation which works in C++17 or newer. Since the proposal is still evolving, our implementation is not stable. It is experimental in nature and will change to follow the proposal closely without any warning. To use the linear algebra library facilities, a suitable linear algebra library must be linked: cuBLAS for GPU execution via the -cudalib=cublas flag, and a CPU BLAS library for CPU execution. The HPC SDK bundles OpenBLAS which may be linked using the -lblas linker flag.

Execution BLAS library Architectures Compiler flags
Multicore OpenBLAS x86_64, aarch64, ppc64l -stdpar=multicore -lblas
GPU cuBLAS All -stdpar=gpu -cudalib=cublas

The following example (godbolt):

#include <experimental/mdspan>
#include <experimental/linalg>
#include <vector>
#include <array>


namespace stdex = std::experimental;


int main()
{
 constexpr size_t N = 4;
 constexpr size_t M = 2;


 std::vector<double> A_vec(N*M);
 std::vector<double> x_vec(M);
 std::array<double, N> y_vec(N);


 stdex::mdspan A(A_vec.data(), N, M);
 stdex::mdspan x(x_vec.data(), M);
 stdex::mdspan y(y_vec.data(), N);


 for(int i = 0; i < A.extent(0); ++i)
   for(int j = 0; j < A.extent(1); ++j)
     A(i,j) = 100.0 * i + j;


 for(int j = 0; j < x.extent(0); ++j) x(j) = 1.0 * j;
 for(int i = 0; i < y.extent(0); ++i) y(i) = -1.0 * i;


 stdex::linalg::matrix_vector_product(A, x, y); // y = A * x


 // y = 0.5 * y + 2 * A * x
 stdex::linalg::matrix_vector_product(std::execution::par,
 stdex::linalg::scaled(2.0, A), x,
 stdex::linalg::scaled(0.5, y), y);


 // Print the results:
 for (int i = 0; i < N; ++i) std::printf("y[%d] = %f\n", i, y(i));
 return 0;
}

is compiled as follows for GPU execution:

nvc++ -std=c++17 -stdpar=gpu -cudalib=cublas -o example example.cpp

And as follows for CPU execution:

nvc++ -std=c++17 -stdpar=multicore -o example example.cpp -lblas

and produces the same outputs in both cases:

y[0] = 2.500000
y[1] = 252.500000
y[2] = 502.500000
y[3] = 752.500000

8.2.7.  Stdpar C++ Larger Example: LULESH

The LULESH hydrodynamics mini-app was developed at Lawrence Livermore National Laboratory to stress test compilers and model performance of hydrodynamics applications. It is about 9,000 lines of C++ code, of which 2,800 lines are the core computation that should be parallelized.

We ported LULESH to C++ Parallel Algorithms and made the port available on LULESH's GitHub repository. To compile it, install the NVIDIA HPC SDK, check out the 2.0.2-dev branch of the LULESH repository, go to the correct directory, and run make.

git clone --branch 2.0.2-dev https://github.com/LLNL/LULESH.git
cd LULESH/stdpar/build
make run

While LULESH is too large to show the entire source code here, there are some key code sequences that demonstrate the use of stdpar.

The LULESH code has many loops with large bodies and no loop-carried dependencies, making them good candidates for parallelization. Most of these were easily converted into calls to std::for_each_n with the std::execution::par policy, where the body of the lambda passed to std::for_each_n is identical to the original loop body.

The function CalcMonotonicQRegionForElems is an example of this. The loop header written for OpenMP looks as follows:

  #pragma omp parallel for firstprivate(qlc_monoq, qqc_monoq, \
                    monoq_limiter_mult, monoq_max_slope, ptiny)
  for ( Index_t i = 0 ; i < domain.regElemSize(r); ++i ) {

This loop header in the C++ Parallel Algorithms version becomes the following:

  std::for_each_n(
    std::execution::par, counting_iterator(0), domain.regElemSize(r),
    [=, &domain](Index_t i) {

The loop body, which in this case is almost 200 lines long, becomes the body of the lambda but is otherwise unchanged from the OpenMP version.

In a number of places, an explicit for loop was changed to use C++ Parallel Algorithms that better express the intent of the code, such as the function CalcPressureForElems:

  #pragma omp parallel for firstprivate(length)
  for (Index_t i = 0; i < length ; ++i) {
    Real_t c1s = Real_t(2.0)/Real_t(3.0) ;
    bvc[i] = c1s * (compression[i] + Real_t(1.));
    pbvc[i] = c1s;
  }

This function was rewritten as as follows:

  constexpr Real_t cls = Real_t(2.0) / Real_t(3.0);
  std::transform(std::execution::par,
    compression, compression + length, bvc,
    [=](Real_t compression_i) {
      return cls * (compression_i + Real_t(1.0));
    });
  std::fill(std::execution::par, pbvc, pbvc + length, cls);

8.2.8.  Interoperability with OpenACC

A subset of OpenACC features can be used when compiling Stdpar code for GPUs. Such a subset is documented in this section. To activate OpenACC directives recognition with Stdpar code add -acc command line flag to nvc++.
nvc++ -stdpar -acc example.cpp
      
OpenACC functionality is detailed in the OpenACC specification and the NVHPC compiler specific differences are detailed in Using OpenACC of this guide.

Combining OpenACC features with Stdpar offers greater flexibility in how code is written. For instance, it allows external functions to be called from within parallel algorithms. Additionally, it provides opportunities for performance tuning, such as through explicit data management.

8.2.8.1. Data Management Directives

C⁠+⁠+ parallel algorithms can be offloaded to the GPU when the data accessed in such algorithms is managed through the OpenACC directives. With data fully managed through the OpenACC directives, Stdpar code can run with all GPU Memory Modes including Separate Memory Mode (compiled with -gpu=mem:separate).

The following data directives are supported:
  • OpenACC structured data construct directive
  • OpenACC unstructured enter/exit data directives
  • OpenACC host_data directive
  • OpenACC update directive

Only the data that are captured by reference or pointer-like data captured by values as well as pointer-like data passed as arguments in the parallel algorithm lambdas can be managed through OpenACC. Any non-pointer variables that are captured by value in the parallel algorithm lambda or non-pointer data passed in as lambda arguments are managed by the C++ implementation. A copy of such data is automatically created in the memory accessible from the GPU. For additional details refer to Data Management in Parallel Algorithms.

OpenACC data management can serve two main purposes:
  • Explicit Data Management: This is necessary for data that cannot be managed implicitly, such as on platforms without full CUDA Unified Memory support and when data is not allocated in the CUDA Managed Memory segment.
  • Performance Tuning: Even when data is located in the GPU-accessible memory, performance can be optimized via OpenACC features. Many OpenACC data directives and clauses provide hints to the CUDA device driver, which can improve implicit data management.
Data management strategies may differ depending on the specific goals being pursued. These differences are outlined where applicable.
General Rules
All directives, except host_data, can be used for data management tasks such as allocating memory in the GPU and copying data between the CPU and the GPU. These directives can be used to ensure that the data is present on the device during the execution of parallel algorithms. The host_data construct, on the other hand, is used for address translation between CPU and GPU address spaces when data is accessed in parallel algorithms.
int n = get_n();
T* in  = new T[nelem];
T* out = new T[nelem];
// Data captured by the lambda are managed explicitly with OpenACC
#pragma acc enter data copyin(n, in[0:nelem]) create(out[0:nelem])
#pragma acc host_data use_device(n, in, out)
{
  std::for_each(std::execution::par_unseq, r.begin(), r.end(),
                [&,in,out](auto i) {
                  out[i] = in[i] * n;
                });
}
#pragma acc exit data copyout(out[0:nelem])
        
In the above example all data accessed from std::for_each through the lambda capture are managed explicitly through the OpenACC data directives. Since the data inside the parallel algorithms are either captured by reference or capturing a pointer, the application code must ensure that such data is accessible from the GPU. To make non-GPU resident data accessible in the parallel region, such a region must be enclosed into the host_data construct region with all variables that are managed explicitly via OpenACC runtime listed in the use_device clause. The data need to be present (copied or created) at the time the host_data directive is encountered/executed at runtime and the data must also be present for the duration of parallel algorithm execution. The implications of the above are such that lambdas accessing variables enclosed in use_device regions can not be additionally invoked from the host code (from outside the parallel region executing on the GPU) because the variable addresses from the GPU obtained through host_data may not be accessible on the CPU.
Note: If the iterator in the above example would be a pointer type it would require explicit data management in addition to the data captured by the lambda.
If the example below is compiled for Separate Memory Mode (-gpu=mem:separate) calling fn from within a parallel std::for_each works fine but not from outside of any parallel algorithm function since the data resident on GPU would need to be accessed from the CPU.
int n = get_n();
T* in  = new T[nelem];
T* out = new T[nelem];
#pragma acc enter data copyin(n, in[0:nelem]) create(out[0:nelem])
#pragma acc host_data use_device(n, in, out)
{
  auto fn = [&,in,out](auto i) { out[i] = in[i] * n;};
  std::for_each(std::execution::par_unseq, r.begin(), r.end(), fn);
  // The following line would not be legal, fn accesses variables in GPU memory
  //std::for_each(r.begin(), r.end(), fn);
}
#pragma acc exit data copyout(out[0:nelem])
        
Note: The behavior of using use_device with non-pointer data type is such that all occurrences of non-pointer variables inside the host_data region are converted to using the addresses of the variable in the GPU address space before accessing that variable. This is essentially equivalent to translating original occurrences of such variable var into dvar = *acc_device(&var).
Composite Data Types

Composite data types with pointer members can also be managed explicitly but require explicit deep copy to work correctly including pointer attach/detach.

struct S {
  float *ptr;
}

int idx[N] = {/*...*/};
float arr[N];
S s{arr};
// Deep copying ptr member with OpenACC
#pragma acc enter data copyin(s.ptr[0:N])
#pragma acc enter data copyin(s, idx)
#pragma acc data attach(s.ptr)
#pragma acc host_data use_device(s, idx)
{
  std::for_each_n(std::execution::par, idx, N,
                  [&](int i) { s.ptr[i] += 5.0; });
}
#pragma acc exit data copyout(s.ptr[0:N])
#pragma acc exit data copyout(s)
        
When variable of struct S type in the above example is copied to the device, a deep copy is performed with the content pointed by S.ptr copied separately. The pointer attachment is used to ensure the address of the pointer is changed to the device memory equivalent before it is accessed from the GPU. Depending on the order of the copies, the pointer attach clause may not be required.
Note: In the above example the pointer-like iterator idx is managed through the OpenACC directives in addition to the data captured by the lambda.
Standard Containers
If the standard containers with non-contiguous storage must be used in host code with explicit data management to GPU memory, the only viable option is to access the raw data directly using the raw pointer to data (e.g. obtained via data() member of std::vector) unless the iterator over the data can be used.
std::vector<T> in(nelem);
std::vector<T> out(nelem);
T *inptr=in.data(),*outptr=out.data();
#pragma acc data copyin(inptr[0:nelem]) copyout(outptr[0:nelem])
#pragma acc host_data use_device(inptr,outptr)
{ 
  std::for_each(std::execution::par_unseq, r.begin(), r.end(),
                [=](auto i) {
                  outptr[i] = inptr[i];       
                });
}
        
In the above example vector elements are accessed through raw pointers to their elements obtained through vector::data() member, they are explicitly management through the OpenACC data clauses.
Static Storage Data
Global or static variables can be made accessible in the parallel algorithms using OpenACC data directives similarly to other variables.
int glob_arr[N] = {/*...*/};
void foo(){
#pragma acc data copy(glob_arr)
#pragma acc host_data use_device(glob_arr)
  {
    std::for_each_n(std::execution::par, glob_arr, N,
                    [](int &e) { e += 1; });
  }
}
        
In the above example the global array glob_arr is updated on the GPU with help of OpenACC data directives.
Member Functions
When the data members are managed inside the member functions the implicit object pointer this needs to be explicitly managed for correctness as accessing members is always done through the dereference of the object itself.
struct S {
  float *ptr;

  void update_member() {
#pragma acc data copy(ptr[0:N], this)
#pragma acc host_data use_device(ptr, this)
  {
      std::for_each(std::execution::par, ptr, ptr + N,
                    [=](float &e) { ptr[&e - ptr] += 5.0; });
   }
  }
};
        
GPU Memory Mode Related Differences

In Separate Memory Mode all data must be managed explicitly via extra device allocations and memcpy between the host and device and the address translations. This also applies to variables with automatic or static storage duration in Managed Memory Mode.

In Unified Memory Mode all data is automatically managed by the CUDA device driver. Additionally in Managed Memory Mode all dynamic allocations are managed by the CUDA device driver. Use of data clauses and directives can only propagate memory usage hints to the CUDA device driver which are used to improve the data management performance. More details can be found in Memory Model and OpenACC with CUDA Unified Memory .

All the data managed by the CUDA device driver can benefit from the simplified uses of the OpenACC features, particularly:
  • Use of host_data directive is not required since the host and device address of data in unified shared memory is identical.
  • Use of pointer attach or detach is not required since the host and device pointers in unified shared memory are identical.
The following example illustrates simplified data managment with only OpenACC data construct enclosing the std::for_each with Unified Memory Mode.
int n = get_n();
T* in = new T[nelem];
T* out = new T[nelem];
#pragma acc data copyin(in[0:nelem]) copyout(out[0:nelem])
{ 
  std::for_each(std::execution::par_unseq, r.begin(), r.end(),
                [&](auto i) {
                  out[i] = in[i] * n;       
                });
}
        
In the above example we leverage OpenACC explicit data management construct to indicate how data is used on GPU for the computation executed in std::for_each:
  • in is moved into the GPU memory;
  • out is moved from the GPU memory.
Both in and out are captured by reference and therefore their host address is used in the lambda of std::for_each. The scalar variable n is not managed. The use of host_data construct is not required.
When standard containers are used in data directives and clauses, the underlying data collection can be managed too. For example, in order to indicate that elements of the std::vector are accessed from the GPU the application code must first retrieve the pointer to the array elements using its data() member. Then such pointers can be used in the regular data directives.
std::vector<T> in(nelem);
std::vector<T> out(nelem);
T *inptr=in.data(), *outptr=out.data();
#pragma acc data copyin(inptr[0:nelem]) copyout(outptr[0:nelem])
{ 
  std::for_each(std::execution::par_unseq, r.begin(), r.end(),
                [&](auto i) {
                  out[i] = in[i];       
                });
}
        
The above example demonstrates the use of OpenACC data directives with a raw pointer to elements of std::vector which can improve memory performance for data in unified memory and the full deep copy of vector content using attach/detach is not required.
int n = get_n();
T* in = new T[nelem];
T* out = new T[nelem];
#pragma acc enter data copyin(n)
#pragma acc host_data use_device(n)
{
  std::for_each(std::execution::par_unseq, r.begin(), r.end(),
                [&, in, out](auto i) {
                  out[i] = in[i] * n;
                });
}
#pragma acc enter data delete(n)
        
In the above example, in and out are dynamically allocated and managed by CUDA device driver with Managed Memory Mode, n is on the stack and therefore managed explicitly via OpenACC directives.

8.2.8.2. External Device Function Annotations

Using OpenACC routine directive annotations allows calling external device functions.
// In file1.cpp
extern int foo();

void bar()
{ 
  std::for_each(std::execution::par_unseq, r.begin(), r.end(),
                [=](auto i) {
                  ou[i] = foo();       
                });
}
// In file2.cpp
#pragma acc routine
int foo(){
  return 4;
}
      
The above code can be compiled/linked as follows:
nvc++ -stdpar file1.cpp
nvc++ -acc file2.cpp
nvc++ -stdpar -acc file1.o file2.o
      

8.2.9.  Getting Started with Parallel Algorithms for GPUs

To get started, download and install the NVIDIA HPC SDK on your x86-64, OpenPOWER, or Arm CPU-based system running a supported version of Linux.

The NVIDIA HPC SDK is freely downloadable and includes a perpetual use license for all NVIDIA Registered Developers, including access to future release updates as they are issued. After you have the NVIDIA HPC SDK installed on your system, the nvc++ compiler is available under the /opt/nvidia/hpc_sdk directory structure.

  • To use the compilers including nvc++ on a Linux/x86-64 system, add the directory /opt/nvidia/hpc_sdk/Linux_x86_64/24.11/compilers/bin to your path.
  • On an OpenPOWER or Arm CPU-based system, replace Linux_x86_64 with Linux_ppc64le or Linux_aarch64, respectively.

8.2.9.1.  Supported NVIDIA GPUs

The NVC⁠+⁠+ compiler can automatically offload C⁠+⁠+ Parallel Algorithms to NVIDIA GPUs based on the Volta architecture or newer. These architectures include features -- such as independent thread scheduling and hardware optimizations for CUDA Unified Memory -- that were specifically designed to support high-performance, general-purpose parallel programming models like the C⁠+⁠+ Parallel Algorithms.

The NVC⁠+⁠+ compiler provides limited support for C⁠+⁠+ Parallel Algorithms on the Pascal architecture, which does not have the independent thread scheduling necessary to properly support the std::execution::par policy. When compiling for the Pascal architecture (-⁠gpu=cc60), NVC⁠+⁠+ compiles algorithms with the std::execution::par policy for serial execution on the CPU. Only algorithms with the std::execution::par_unseq policy will be scheduled to run on Pascal GPUs.

8.2.9.2.  Supported CUDA Versions

The NVC++ compiler is built on CUDA libraries and technologies and uses CUDA to accelerate C++ Parallel Algorithms on NVIDIA GPUs. A GPU-accelerated system on which NVC++-compiled applications are to be run must have a CUDA 11.2 or newer device driver installed.

The NVIDIA HPC SDK compilers ship with an integrated CUDA toolchain, header files, and libraries to use during compilation, so it is not necessary to have a CUDA Toolkit installed on the system.

When -⁠stdpar is specified, NVC++ compiles using the CUDA toolchain version that best matches the CUDA driver installed on the system on which compilation is performed. To compile using a different version of the CUDA toolchain, use the -⁠gpu=cudaX.Y option. For example, use the -⁠gpu=cuda11.8 option to specify that your program should be compiled for a CUDA 11.8 system using the CUDA 11.8 toolchain.

8.3. Stdpar Fortran

Fortran 2008 introduced the do concurrent (DC) loop construct signaling that loop iterations have no interdependencies. With -⁠stdpar such loop iterations will be executed in parallel on the GPU when -⁠stdpar (or -⁠stdpar=gpu) is passed to nvfortran or using CPU threads when -⁠stdpar=multicore is passed to nvfortran. More details can be found in the following blog post on the NVIDIA website: Accelerating Fortran DO CONCURRENT with GPUs and the NVIDIA HPC SDK.

8.3.1. Calling Routines in DO CONCURRENT on the GPU

When compiling for the GPU, calling routines in the body of do concurrent loop can be constrained. PURE routines can generally be called inside the do concurrent loop body. The compiler detects that such routines are to be compiled for the GPU target. External routines, however, can't be called from within the DC loop unless they are explicitly annotated with the OpenACC routine directive (refert to Interoperability with OpenACC) or CUDA device attribute (refer to Interoperability with CUDA Fortran).

The following example will compile successfully.
module m
contains
pure subroutine foo()
return
end subroutine
end module m

program dc
use m
implicit none
integer :: i

do concurrent (i=1:10)
  call foo()
enddo
end program
      
The following example, however, doesn't compile unless foo is either
  • annotated with !$acc routine,
  • or attributed with attributes(device) and compiled as Stdpar and CUDA Fortran.
program dc
implicit none
interface
  pure subroutine foo()
  end subroutine foo
end interface
integer :: i

do concurrent (i=1:10)
  call foo()
enddo
end program
      

8.3.2. GPU Data Management

If -⁠gpu=mem:managed is enabled by default or is explicitly passed on the command line, some data accesses in do concurrent loops are invalid. For example, accessing global variables in the routines called from the do concurrent loop does not perform expected value updates in the CPU code.

Additionally, there are rare instances where the compiler cannot accurately determine variable sizes for implicit data movements between CPU and GPU. As demonstrated in the following example, a is an assumed-size array, and its access region inside the DC construct cannot be determined at compile time because the element index positions are taken from another array b initialized outside of the routine. Such code does not update a as expected and may result in a memory violation and undefined behavior.

subroutine r(a, b)
  integer :: a(*)
  integer :: b(:)
  do concurrent (i = 1 : size(b))
    a(b(i)) = i 
  enddo
end subroutine
    

There are no limitations on the variable accessed in do concurrent loops described above when the code is compiled with -⁠gpu=mem:unified, whether this option is enabled by default or explicitly via an option on the command line.

8.3.3. Interoperability with OpenACC

OpenACC features can be used when compiling Stdpar code for GPUs. To activate OpenACC directives recognition with Stdpar code add -acc command line flag to nvfortran.
nvfortran -stdpar -acc example.f90
      
OpenACC functionality and interoperability with DO-CONCURRENT loop is detailed in the OpenACC specification and the NVIDIA HPC compiler specific differences are detailed in Using OpenACC of this guide.
Using OpenACC features can enhance functionality of DC-loop for example with the following:
  • Explicit data management to improve performance of CPU-GPU implicit data movements or even leverage separate memory compiling on the GPU when compiling with -gpu=mem:separate passed in.
  • Tuning DC-loop execution on the GPU e.g. GPU kernels launch configuration.
  • Executing DC-loops asynchronously.
  • Calling external routines from within DC-loops.
  • Atomic operations in DC-loops.

Examples

Some examples of using OpenACC directives with DC-loops are provided below.

The following example demonstrates how the data accessed inside the DC-loop are fully managed in the OpenACC data construct.
!$acc data copyin(b) copyout(a)
do concurrent (j=1:N)
  do i=1,K
    a(j,i) = b(j,i)
  end do
end do
!$acc end data
        
While in the above example the data construct is used for GPU data management, the same effect can be achieved with the use of data clauses on the compute construct enclosing DC-loop.
The following example shows how the scheduling of DC loop on the GPU is controlled through the clauses on the compute construct.
!$acc parallel loop num_gangs(50000) vector_length(32)
do concurrent (i=1:K,j=1:N)
  a(j,i) = real(j) 
end do
        
Use of OpenACC async clause on the compute constructs can be utilised to perform computations in DC-loop asynchronously.
!$acc parallel loop async
do concurrent (j=1:N)
  a(j) = j
end do

b = foo()

#pragma acc wait

c = sum(a) + b
        
In the previous example, array a is filled in with values asynchronously in DC-loop.

8.3.4. Interoperability with CUDA Fortran

CUDA Fortran features can also be used when compiling Stdpar code for GPUs. To recognize CUDA Fortran features in your source code, compile with the -cuda command line flag using nvfortran.

 nvfortran -stdpar -cuda example.f90 
Using CUDA Fortran extensions can enhance the functionality of a do concurrent (DC) loop and Stdpar program, for several cases:
  • Explicit data locality, accessing CUDA Fortran attributed arrays or other data with the device, managed, unified, or constant attributes from within DC-loops.
  • Tuning DC-loop execution on the GPU e.g. controlling the GPU kernels launch configuration.
  • Executing DC-loops asynchronously using a specific CUDA stream.
  • Calling external, user-defined CUDA device routines from within DC-loops.
  • Using CUDA Atomic operations in DC-loops, or other CUDA-specific device-side runtime library calls.
  • Inserting CUDA Runtime API calls for memory tuning hints outside of DC-loops.

Examples

Some examples of using CUDA Fortran features with DC-loops are provided below. The following example demonstrates how a DC-loop can access CUDA Fortran device data, run on a specific CUDA stream, call the CUDA Runtime API for creating a stream, and hide non-standard features behind the CUF sentinel for code portability.

 !@cuf use cudafor
 !@cuf integer(kind=cuda_stream_kind) :: istrm
       real, allocatable :: a(:,:), b(:,:)
 !@cuf attributes(device) :: a  ! A is device array only, not unified/managed
       . . .
 !@cuf istat = cudaStreamCreate(istrm)
       . . .
       a(:,:) = 0.0
       . . .
 !$cuf kernel do(1) <<< *, *, stream=istrm>>>
       do concurrent (j=1:N)
         do i=1,K
           a(j,i) = a(j,i) + 2.0 * b(j,i)
         end do
       end do 

This program demonstrates how to call low-level CUDA device functions from within a DC-loop. The function can be written in either CUDA Fortran or CUDA C++, depending on the interface. The CUDA C function must be compiled for relocatable device code. This can be used for accessing features in CUDA and NVIDIA GPUs not readily available in directive-based models or standard languages.

 module mcuda
   contains
     attributes(host,device) pure integer function std_dbg(itype)
     integer, value :: itype
     if (itype.eq.1) then
       std_dbg = threadIdx%x
     else if (itype.eq.2) then
       std_dbg = blockIdx%x
     else
       std_dbg = (blockIdx%x-1)*blockDim%x + threadIdx%x
     end if
     end function
 end module

 program test
 use mcuda
 integer, parameter :: N = 2000
 integer, allocatable :: a(:), b(:), c(:)
 allocate(a(N),b(N),c(N))

 do concurrent (j=1:N)
   a(j) = std_dbg(1)
   b(j) = std_dbg(2)
   c(j) = std_dbg(3)
 end do

 print *,a(1),a(N/2),a(N)
 print *,b(1),b(N/2),b(N)
 print *,c(1),c(N/2),c(N)
 end 

Many functions from the CUDA Fortran cudadevice module are available within do concurrent loops, not just atomics. This code snippet shows two uses:

 real :: tmp(4), x, y
 ...
 block; use cudadevice
 do concurrent (i=1:K,j=1:N)
   x = real(j) + a(i,j)
   y = atomicAdd(b(1,j), x)
 end do

 do concurrent (j=1:N)
   x = real(j)
   tmp(1:4) = __ldca(a(1:4,j))
   tmp(1:4) = tmp(1:4) + x
   call __stwt(b(1:4,j), tmp)
 end do
 end block 

9. PCAST

Parallel Compiler Assisted Software Testing (PCAST) is a set of API calls and compiler directives useful in testing program correctness. Numerical results produced by a program can diverge when parts of the program are mapped onto a GPU, when new or additional compiler options are used, or when changes are made to the program itself. PCAST can help you determine where these divergences begin, and pinpoint the changes that cause them. It is useful in other situations as well, including when using new libraries, determining whether parallel execution is safe, or porting programs from one ISA or type of processor to another.

9.1. Overview

PCAST Comparisons can be performed in two ways. The first saves the initial run's data into a file through the pcast_compare call or directive. Add the calls or directives to your application where you want intermediate results to be compared. Then, execute the program to save the "golden" results where the values are known to be correct. During subsequent runs of the program, the same pcast_compare calls or directives will compare the computed intermediate results to the saved "golden" results and report the differences.

The second approach works in conjunction with the NVIDIA OpenACC implementation to compare GPU computation against the same program running on a CPU. In this case, all compute constructs are performed redundantly, both on the CPU and GPU. GPU results are compared against the CPU results, and differences reported. This is essentially like the first case where the CPU-calculated values are treated as the "golden" results. GPU to CPU comparisons can be done implicitly at the end of data regions with the autocompare flag or explicitly after kernels with the acc_compare call or directive.

With the autocompare flag, OpenACC regions will run redundantly on the CPU and GPU. On an OpenACC region exit where data is to be downloaded from device to host, PCAST will compare the values calculated on the CPU with those calculated in the GPU. Comparisons done with autocompare or acc_compare are handled in memory and do not write results to an intermediate file.

The following table outlines the supported data types that can be used with PCAST. Short, integer, long, and half precision data types are not supported with ABS, REL, ULP, or IEEE options; only a bit-for-bit comparison is supported.

For floating-point types, PCAST can calculate absolute, relative, and unit-last-place differences. Absolute differences measures only the absolute value of the difference (subtraction) between two values, i.e. abs(A-B). Relative differences are calculated as a ratio between the difference of values, A-B, and the previous value A; abs((A-B)/A). Unit-least precision (Unit-last place) is a measure of the smallest distance between two values A and B. With the ULP option set, PCAST will report if the calculated ULP between two numbers is greater than some threshold.

Table 23. Supported Types for Tolerance Measurements
C/C++ Type Fortran Type ABS REL ULP IEEE
float real, real(4) Yes Yes Yes Yes
double double precision, real(8) Yes Yes Yes Yes
float _Complex complex, complex(4) Yes Yes Yes Yes
double _Complex complex(8) Yes Yes Yes Yes
- real(2) No No No No
(un)signed short integer(2) N/A N/A N/A N/A
(un)signed int integer, integer(4) N/A N/A N/A N/A
(un)signed long integer(8) N/A N/A N/A N/A

9.2. PCAST with a "Golden" File

The run-time call pcast_compare highlights differences between successive program runs. It has two modes of operation, depending on the presence of a data file named pcast_compare.dat by default. If the file does not exist, pcast_compare assumes this is the first "golden" run. It will create the file and fill it with the computed data at each call to pcast_compare. If the file exists, pcast_compare assumes it is a test run. It will read the file and compare the computed data with the saved data from the file. The default behavior is to consider the first 50 differences to be a reportable error, no matter how small.

By default, the pcast_compare.dat file is in the same directory as the executable. The behavior of pcast_compare, and other comparison parameters, can be changed at runtime with the PCAST_COMPARE environment variable discussed in the Environment Variables section.

The signature of pcast_compare for C++ and C is:

 void pcast_compare(void*, char*, size_t, char*, char*, char*, int);
    

The signature of pcast_compare for Fortran is:

 subroutine pcast_compare(a, datatype, len, varname, filename, funcname, lineno)
     type(*), dimension(..) :: a
     character(*) :: datatype, varname, filename, funcname
     integer(8),value :: len
     integer(4),value :: lineno
    

The call takes seven arguments:

  1. The address of the data to be saved or compared.
  2. A string containing the data type.
  3. The number of elements to compare.
  4. A string treated as the variable name.
  5. A string treated as the source file name.
  6. A string treated as the function name.
  7. An integer treated as a line number.

For example, the pcast_compare runtime call can be invoked like the following:

 pcast_compare(a, "float", N, "a", "pcast_compare03.c", "main", 1);
    
 call pcast_compare(a, 'real', n, 'a', 'pcast_compare1.f90', 'program', 9)
    

The caller should give meaningful names to the last four arguments. They can be anything, since they only serve to annotate the report. It is imperative that the identifiers are not modified between comparisons; comparisons must be called in the same order for each program run. If, for example, you are calling pcast_compare inside a loop, it is reasonable to set the last argument to be the loop index.

There also exists a directive form of the pcast_compare, which is functionally the same as the runtime call. It can be used at any point in the program to compare the current value of data to that recorded in the golden file, same as the runtime call. There are two benefits to using the directive over the API call:

  1. The directive syntax is much simpler than the API syntax. Most of what the compare call needs to output data to the user can be gleaned by the compiler at compile-time (The type, variable name, file name, function name, and line number).
     #pragma nvidia compare(a[0:n])
        
    as opposed to:
     pcast_compare(a, "float", N, "a", "pcast_compare03.c", "main", 1);
        
  2. The directive is only enabled when the -Mpcast flag is set, so the source need not be changed when testing is complete. Consider the following usage examples:
     #pragma nvidia compare(a[0:N]) // C++ and C
     !$nvf compare(a(1:N)) ! Fortran
        

The directive interface is given below in C++ or C style, and in Fortran. Note that for Fortran, var-list is a variable name, a subarray specification, an array element, or a composite variable member.

 #pragma nvidia compare (var-list) // C++ and C
 !$nvf compare (var-list) ! Fortran
    

Let's look at an example of

 #include <stdlib.h>
 #include <openacc.h>

 int main() {
     int size = 1000;
     int i, t;
     float *a1;
     float *a2;

     a1 = (float*)malloc(sizeof(float)*size);
     a2 = (float*)malloc(sizeof(float)*size);

     for (i = 0; i < size; i++) {
       a1[i] = 1.0f;
       a2[i] = 2.0f;
     }

     for (t = 0; t < 5; t++) {
     for(i = 0; i < size; i++) {
       a2[i] += a1[i];
     }
     pcast_compare(a2, "float", size, "a2", "example.c", "main", 23);
     }
     return 0;
 }
    

Compile the example using these compiler options:

 > nvc -fast -o a.out example.c
    

Compiling with redundant or autocompare options are not required to use pcast_compare. Once again, running the compiled executable using the options below, results in the following output:

 > PCAST_COMPARE=summary,rel=1 ./out.o
 datafile pcast_compare.dat created with 5 blocks, 5000 elements, 20000 bytes
 > PCAST_COMPARE=summary,rel=1 ./out.o
 datafile pcast_compare.dat compared with 5 blocks, 5000 elements, 20000 bytes
 no errors found
  relative tolerance = 0.100000, rel=1
    

Running the program for the first time, the data file "pcast_compare.dat" is created. Subsequent runs compare calculated data against this file. Use the PCAST_COMPARE environment variable to set the name of the file, or force the program to create a new file on the disk with PCAST_COMPARE=create.

The same example above can be written with the compare directive. Notice how much more concise the directive is to the update host and pcast_compare calls.

 #include <stdlib.h>
 #include <openacc.h>

 int main() {
     int size = 1000;
     int i, t;
     float *a1;
     float *a2;

     a1 = (float*)malloc(sizeof(float)*size);
     a2 = (float*)malloc(sizeof(float)*size);

     for (i = 0; i < size; i++) {
       a1[i] = 1.0f;
       a2[i] = 2.0f;
     }

     for (t = 0; t < 5; t++) {
     for(i = 0; i < size; i++) {
       a2[i] += a1[i];
     }
     #pragma nvidia compare(a2[0:size])
     }
     return 0;
 }
    

With the directive, you will want to add "-Mpcast" to the compilation line to enable the directive. Other than that, the output from this program is identical to the runtime example above.

9.3. PCAST with OpenACC

PCAST can also be used with the NVIDIA OpenACC implementation to compare GPU computation against the same program running on a CPU. In this case, all compute constructs are performed redundantly on both the CPU and GPU. The CPU results are considered to be the "golden master" copy which GPU results are compared against.

There are two ways to perform comparisons with GPU-calculated results. The first is with the explicit call or directive acc_compare. To use acc_compare, you must compile with -acc -gpu=redundant to force the CPU and GPU to compute results redundantly. Then, insert calls to acc_compare or put an acc compare directive at points where you want to compare the GPU-computed values against those computed by the CPU.

The second approach is to turn on autocompare mode by compiling with -acc -gpu=autocompare. In autocompare mode, PCAST will automatically perform a comparison at each point where data is moved from the device to the host. It does not require the programmer to add any additional directives or runtime calls; it's a convenient way to do all comparisons at the end of a data region. If there are multiple compute kernels within a data region, and you're only interested in one specific kernel, you should use the previously-mentioned acc_compare to target a specific kernel. Note that autocompare mode implies -gpu=redundant.

During redundant execution, the compiler will generate both CPU and GPU code for each compute construct. At runtime, both the CPU and GPU versions will execute redundantly, with the CPU code reading and modifying values in system memory and the GPU reading and modifying values in device memory. Insert calls to acc_compare() calls (or the equivalent acc compare directive) at points where you want to compare the GPU-computed values against CPU-computed values. PCAST treats the values generated by the CPU code as the "golden" values. It will compare those results against GPU values. Unlike pcast_compare, acc_compare does not write to an intermediary file; the comparisons are done in-memory.

acc_compare only has two arguments: a pointer to the data to be compared, hostptr, and the number of elements to compare, count. The type can be inferred in the OpenACC runtime, so it doesn't need to be specified. The C++ and C interface is given below:

 void acc_compare(void *, size_t);
    

And in Fortran:

 subroutine acc_compare(a)
 subroutine acc_compare(a, len)
     type(*), dimension(*) :: a
     integer(8), value :: len
    

You can call acc_compare on any variable or array that is present in device memory. You can also call acc_compare_all (no arguments) to compare all values that are present in device memory against the correponding values in host memory.

 void acc_compare_all()
    
 subroutine acc_compare_all()
    

Directive forms of the acc_compare calls exist. They work the same as the API calls and can be used in lieu of them. Similar to PCAST compare directives, acc compare directives are ignored when redundant or autocompare modes are not enabled on the compilation line.

The acc compare directive takes one or more arguments, or the 'all' clause (which corresponds to acc_compare_all(). The interfaces are given below in C++ or C, and Fortran respectively. Argument "var-list" can be a variable name, a sub-array specification, and array element, or a composite variable member.

 #pragma acc compare [ (var-list) | all ]
    
 $!acc compare [ (var-list) | all ]
    

For example:

 #pragma acc compare(a[0:N])
 #pragma acc compare all
 !$acc compare(a, b)
 !$acc compare(a(1:N))
 !$acc compare all
    

Consider the following OpenACC program that uses the acc_compare() API call and an acc compare directive. This Fortran example uses real*4 and real*8 arrays.

 program main
     use openacc
     implicit none
     parameter N = 1000
     integer :: i
     real :: a(N)
     real*4 :: b(N)
     real(4) :: c(N)
     double precision :: d(N)
     real*8 :: e(N)
     real(8) :: f(N)

     d = 1.0d0
     e = 0.1d0

     !$acc data copyout(a, b, c, f) copyin(d, e)

     !$acc parallel loop
     do i = 1,N
     a(i) = 1.0
     b(i) = 2.0
     c(i) = 0.0
     enddo
     !$acc end parallel

     !$acc compare(a(1:N), b(1:N), c(1:N))

     !$acc parallel loop
     do i = 1,N
     f(i) = d(i) * e(i)
     enddo
     !$acc end parallel

     !$acc compare(f)

     !$acc parallel loop
     do i = 1,N
     a(i) = 1.0
     b(i) = 1.0
     c(i) = 1.0
     enddo
     !$acc end parallel

     call acc_compare(a, N)
     call acc_compare(b, N)
     call acc_compare(c, N)

     !$acc parallel loop
     do i = 1,N
     f(i) = 1.0D0
     enddo
     !$acc end parallel

     call acc_compare_all()

     !$acc parallel loop
     do i = 1,N
     a(i) = 3.14;
     b(i) = 3.14;
     c(i) = 3.14;
     f(i) = 3.14d0;
     enddo
     !$acc end parallel

     ! In redundant mode, no comparison is performed here. In
     ! autocompare mode, a comparison is made for a, b, c, and f (but
     ! not e and d), since they are copied out of the data region.

     !$acc end data

     call verify(N, a, b, c, f)
 end program

 subroutine verify(N, a, b, c, f)
     integer, intent(in) :: N
     real, intent(in) :: a(N)
     real*4, intent(in) :: b(N)
     real(4), intent(in) :: c(N)
     real(8), intent(in) :: f(N)
     integer :: i, errcnt

     errcnt = 0
     do i=1,N
     if(abs(a(i) - 3.14e0) .gt. 1.0e-06) then
	errcnt = errcnt + 1
     endif
     end do
     do i=1,N
     if(abs(b(i) - 3.14e0) .gt. 1.0e-06) then
	errcnt = errcnt + 1
     endif
     end do
     do i=1,N
     if(abs(c(i) - 3.14e0) .gt. 1.0e-06) then
	errcnt = errcnt + 1
     endif
     end do
     do i=1,N
     if(abs(f(i) - 3.14d0) .gt. 1.0d-06) then
	errcnt = errcnt + 1
     endif
     end do

     if(errcnt /= 0) then
     write (*, *) "FAILED"
     else
     write (*, *) "PASSED"
     endif
 end subroutine verify
    

The program can be compiled with the following command:

 > nvfortran -fast -acc -gpu=redundant -Minfo=accel example.F90
 main:
      16, Generating copyout(a(:),b(:))
	 Generating copyin(e(:))
	 Generating copyout(f(:),c(:))
	 Generating copyin(d(:))
      18, Generating Tesla code
	 19, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
      26, Generating acc compare(c(:),b(:),a(:))
      28, Generating Tesla code
	 29, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
      34, Generating acc compare(f(:))
      36, Generating Tesla code
	 37, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
      48, Generating Tesla code
	 49, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
      56, Generating Tesla code
	 57, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    

Here, you can see where the acc compare directives are generated on lines 26 and 34. The program can be run with the following command:

 > ./a.out
  PASSED
    

As you can see, no PCAST output is generated when the comparisons match. We can get more information with the summary option:

 > PCAST_COMPARE=summary ./a.out
     PASSED
 compared 13 blocks, 13000 elements, 68000 bytes
 no errors found
     absolute tolerance = 0.00000000000000000e+00, abs=0
    

There are 13 blocks compared. Let's count the blocks in the compare calls.

 !$acc compare(a(1:N), b(1:N), c(1:N))
    

Compares three blocks, one each for a, b, and c.

 !$acc compare(f)
    

Compares one block for f.

 call acc_compare(a, N)
 call acc_compare(b, N)
 call acc_compare(c, N)
    

Each call compares one block for their respective array.

 call acc_compare_all()
    

Compares one block for each array present on the device (a, b, c, d, e, and f) for a total of 6 blocks.

If the same example is compiled with autocompare, we'll see four additional comparisons, since the four arrays that are copied out (with the copyout clause) are compared at the end of the data region.

 > nvfortran -fast -acc -gpu=autocompare example.F90
 > PCAST_COMPARE=summary ./a.out
 PASSED
 compared 17 blocks, 17000 elements, 88000 bytes
 no errors found
	absolute tolerance = 0.00000000000000000e+00, abs=0
    

9.4. Limitations

There are currently a few limitations with using PCAST that are worth keeping in mind.

  • Comparisons are not thread-safe. If you are using PCAST with multiple threads, ensure that only one thread is doing the comparisons. This is especially true if you are using PCAST with MPI. If you use pcast_compare with MPI, you must make sure that only one thread is writing to the comparison file. Or, use a script to set PCAST_COMPARE to encode the file name with the MPI rank.
  • Comparisons must be done with like types; you cannot compare one type with another. It is not possible to, for example, check for differing results after changing from double precision to single. Comparisons are limited to those present in table Table 23. Currently there is no support for structured or derived types.
  • The -gpu=mem:managed or -gpu=mem:unified options are incompatible with autocompare and acc_compare. Both the CPU and GPU need to calculate result separately and to do so they must have their own working memory spaces.
  • If you do any data movement on the device, you must account for it on the host. For example, if you are using CUDA-aware MPI or GPU-accelerated libraries that modify device data, then you must also make the host aware of the changes. In these cases it is helpful to use the host_data clause, which allows you to use device addresses within host code.

9.5. Environment Variables

Behavior of PCAST/Autocompare is controlled through the PCAST_COMPARE variable. Options can be specified in a comma-separated list: PCAST_COMPARE=<opt1>,<opt2>,...

If no options are specified, the default is to perform comparisons with abs=0. Comparison options are not mutually exclusive. PCAST can compare absolute differences with some n=3 and relative differences with a different threshold, e.g. n=5; PCAST_COMPARE=abs=3,rel=5,....

You can specify either an absolute or relative location to be used with the datafile option. The parent directory should be owned by the same user executing the comparisons and the datafile should have the appropriate read/write permissions set.

Table 24. PCAST_COMPARE Options
Option Description
abs=n Compare absolute difference; tolerate differences up to 10^(-n), only applicable to floating point types. Default value is 0
create Specifies that this is the run that will produce the reference file (pcast_compare only)
compare Specifies that the current run will be compared with a reference file (pcast_compare only)
datafile="name" Name of the file that data will be saved to, or compared against. If empty will use the default, 'pcast_compare.dat' (pcast_compare only)
disable Calls to pcast_compare, acc_compare, acc_compare_all, and directives (pcast compare, acc compare, and acc compare) all immediately return from the runtime with no effect. Note that this doesn't disable redundant execution; that will require a recompile.
ieee Compare IEEE NaN checks (only implemented for floats and doubles)
outputfile="name" Save comparison output to a specific file. Default behavior is to output to stderr
patch Patch errors (outside tolerance) with correct values
patchall Patch all differences (inside and outside tolerance) with correct values
rel=n Compare relative difference; tolerated differences up to 10^(-n), only applicable to floating point types. Default value is 0.
report=n Report up to n (default of 50) passes/fails
reportall Report all passes and fails (overrides limit set in report=n)
reportpass Report passes; respects limit set with report=n
silent Suppress output - overrides all other output options, including summary and verbose
stop Stop at first differences
summary Print summary of comparisons at end of run
ulp=n Compare Unit of Least Precision difference (only for floats and doubles)
verbose Outputs more details of comparison (including patches)
verboseautocompare Outputs verbose reporting of what and where the host is comparing (autocompare only)

10. Using MPI

MPI (the Message Passing Interface) is an industry-standard application programming interface designed for rapid data exchange between processors in a distributed-memory environment. MPI is computer software used in scalable computer systems that allows the processes of a parallel application to communicate with one another.

The NVIDIA HPC SDK includes a pre-compiled version of Open MPI. You can build using alternate versions of MPI with the -I, -L, and -l options.

This section describes how to use Open MPI with the NVIDIA HPC Compilers.

10.1. Using Open MPI on Linux

The NVIDIA HPC Compilers for Linux ship with a pre-compiled version of Open MPI that includes everything required to compile, execute and debug MPI programs using Open MPI.

To build an application using Open MPI, use the Open MPI compiler wrappers: mpicc, mpic⁠+⁠+ and mpifort. These wrappers automatically set up the compiler commands with the correct include file search paths, library directories, and link libraries.