Introduction
This document provides a reference for calling CUDA Library functions from NVIDIA Fortran. These libraries can be used from Fortran code using the OpenACC or OpenMP programming models, or from NVIDIA CUDA Fortran. Currently, the CUDA libraries for which NVIDIA provides pre-built interface modules, and which are documented here, are:
cuBLAS, an implementation of the BLAS.
cuFFT, a library of Fast Fourier Transform (FFT) routines.
cuRAND, a library for random number generation.
cuSPARSE, a library of linear algebra routines used with sparse matrices.
cuSOLVER, a library of equation solvers used with dense or other matrices.
cuTENSOR, a library for tensor primitive operations.
NCCL, a collective communications library.
NVSHMEM, a library implementation of OpenSHMEM on GPUs.
NVTX, an API for annotating application events, code ranges, and resources.
The OpenACC Application Program Interface is a collection of compiler directives and runtime routines that allows the programmer to specify loops and regions of code for offloading from a host CPU to an attached accelerator, such as a GPU. The OpenACC API was designed and is maintained by an industry consortium. See the OpenACC website for more information about the OpenACC API.
OpenMP is a specification for a set of compiler directives, an application programming interface (API), and a set of environment variables that can be used to specify parallel execution from Fortran (and other languages). The OpenMP target offload capabilities are similar in many respects to OpenACC. The methods for passing device arrays to library functions from host code differ only in syntax compared to those used in OpenACC. For general information about using OpenMP and to obtain a copy of the OpenMP specification, refer to the OpenMP organization's website.
CUDA Fortran is a small set of extensions to Fortran that supports and is built upon the CUDA computing architecture. CUDA Fortran includes a Fortran 2003 compiler and tool chain for programming NVIDIA GPUs using Fortran, and is an analog to NVIDIA’s CUDA C compiler. Compared to the NVIDIA Accelerator and OpenACC directives-based model and compilers, CUDA Fortran is a lower-level explicit programming model with substantial runtime library components that give expert programmers direct control of all aspects of GPGPU programming.
This document does not contain explanations or purposes of the library functions, nor does it contain details of the approach used in the CUDA implementation to target GPUs. For that information, please see the appropriate library document that comes with the NVIDIA CUDA Toolkit. This document does provide the Fortran module contents: derived types, enumerations, and interfaces, to make use of the libraries from Fortran rather than from C or C++.
Many of the examples used in this document are provided in the HPC compiler and tools distribution, along with Makefiles, and are stored in the yearly directory, such as 2020/examples/CUDA-Libraries.
Fortran Interfaces and Wrappers
Almost all of the function interfaces shown in this document make use of features from the Fortran 2003 iso_c_binding intrinsic module. This module provides a standard way of dealing with issues such as inter-language data types, capitalization, adding underscores to symbol names, and passing arguments by value.
Often, the iso_c_binding module enables Fortran programs containing properly written interfaces to call directly into the C library functions. In some cases, NVIDIA has written small wrappers around the C library function, to make the Fortran call site more “Fortran-like”, hiding some issues exposed in the C interfaces like handle management, host vs. device pointer management, or character and complex data type issues.
In a small number of cases, the C library may contain multiple entry points to handle different data types, perhaps an int in one function and a size_t in another, but with otherwise identical behavior. In these cases, NVIDIA may provide just one generic Fortran interface, which calls the appropriate C function under the hood.
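As a sketch of this pattern (using hypothetical library and entry-point names), a single generic Fortran interface can cover two such C entry points, with generic resolution selecting the specific based on the integer kind of the actual argument:
! Hypothetical example: two C entry points differing only in whether
! the length argument is a C int or a size_t.
interface libSetBufferSize
  integer(4) function libSetBufferSize_int(n) bind(C,name='libSetBufferSize')
    integer(4), value :: n   ! matches C int
  end function
  integer(4) function libSetBufferSize_64(n) bind(C,name='libSetBufferSize_64')
    integer(8), value :: n   ! matches C size_t
  end function
end interface
A call such as ierr = libSetBufferSize(n) then dispatches to whichever specific matches the kind of n.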
Using CUDA Libraries from OpenACC Host Code
All of the libraries covered in this document contain functions which are callable from OpenACC host code. Most functions take some arguments which are expected to be device pointers (the address of a variable in device global memory). There are several ways to pass device addresses in OpenACC.
If the call is lexically nested within an OpenACC data directive, the NVIDIA Fortran compiler, in the presence of an explicit interface such as those provided by the NVIDIA library modules, will default to passing the device pointer when required.
subroutine hostcall(a, b, n)
  use cublas
  real a(n), b(n)
  !$acc data copy(a, b)
  call cublasSswap(n, a, 1, b, 1)
  !$acc end data
  return
end
A Fortran interface is made explicit when you use the module that contains it, as in the line use cublas in the example above. If you look ahead to the actual interface for cublasSswap, you will see that the arrays a and b are declared with the CUDA Fortran device attribute, so they take only device addresses as arguments.
When using OpenACC, it is more portable and general to pass device pointers to subprograms using the host_data construct, since most implementations have no way to mark dummy arguments as device pointers. The host_data construct with the use_device clause makes the device addresses available in host code for passing to the subprogram.
use cufft
use openacc
. . .
!$acc data copyin(a), copyout(b,c)
ierr = cufftPlan2D(iplan1,m,n,CUFFT_C2C)
ierr = ierr + cufftSetStream(iplan1,acc_get_cuda_stream(acc_async_sync))
!$acc host_data use_device(a,b,c)
ierr = ierr + cufftExecC2C(iplan1,a,b,CUFFT_FORWARD)
ierr = ierr + cufftExecC2C(iplan1,b,c,CUFFT_INVERSE)
!$acc end host_data
! scale c
!$acc kernels
c = c / (m*n)
!$acc end kernels
!$acc end data
This code snippet also shows an example of sharing the stream that OpenACC and the cuFFT library use. Every library in this document has a function for setting the CUDA stream on which the library runs. Usually, when using OpenACC, you want the OpenACC kernels to run on the same stream as the library functions. In the case above, this guarantees that the kernel c = c / (m*n) does not start until the FFT operations complete. The function acc_get_cuda_stream and the definition of acc_async_sync are in the openacc module.
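Because the OpenMP target offload model differs here mainly in syntax, the data management in the example above might look like the following in OpenMP. This is only a sketch, assuming compilation with -mp=gpu and OpenMP 5.0 support for use_device_addr; sharing the CUDA stream with the OpenMP runtime is omitted, so a real code would need to synchronize the FFT stream before the scaling loop.
! Sketch: OpenMP analogue of the OpenACC cuFFT example above.
!$omp target data map(to: a) map(from: b, c)
ierr = cufftPlan2D(iplan1, m, n, CUFFT_C2C)
!$omp target data use_device_addr(a, b, c)
ierr = ierr + cufftExecC2C(iplan1, a, b, CUFFT_FORWARD)
ierr = ierr + cufftExecC2C(iplan1, b, c, CUFFT_INVERSE)
!$omp end target data
! scale c (after synchronizing with the FFT stream)
!$omp target teams loop collapse(2)
do j = 1, n
  do i = 1, m
    c(i,j) = c(i,j) / (m*n)
  end do
end do
!$omp end target data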
Using CUDA Libraries from OpenACC Device Code
Two libraries are currently available from within OpenACC compute regions. Certain functions in both the openacc_curand module and the nvshmem module are marked acc routine seq.
The cuRAND device library is all contained within CUDA header files. In device code, it is designed to return one or a small number of random numbers per thread. The thread’s random generators run independently of each other, and it is usually advised for performance reasons to give each thread a different seed, rather than a different offset.
program t
  use openacc_curand
  integer, parameter :: n = 500
  real a(n,n,4)
  type(curandStateXORWOW) :: h
  integer(8) :: seed, seq, offset
  a = 0.0
  !$acc parallel num_gangs(n) vector_length(n) copy(a)
  !$acc loop gang
  do j = 1, n
    !$acc loop vector private(h)
    do i = 1, n
      seed = 12345_8 + j*n*n + i*2
      seq = 0_8
      offset = 0_8
      call curand_init(seed, seq, offset, h)
      !$acc loop seq
      do k = 1, 4
        a(i,j,k) = curand_uniform(h)
      end do
    end do
  end do
  !$acc end parallel
  print *,maxval(a),minval(a),sum(a)/(n*n*4)
end
When using the openacc_curand module, since all the code is contained in CUDA header files, you do not need any additional libraries on the link line.
Using CUDA Libraries from CUDA Fortran Host Code
The predominant usage model for the library functions listed in this document is to call them from CUDA Fortran host code. CUDA Fortran allows some special capabilities in that the compiler is able to recognize the device and managed attributes when resolving generic interfaces. Device actual arguments can only match the interface's device dummy arguments; managed actual arguments, by precedence, match managed dummy arguments first, then device dummies, then host.
program testisamax ! link with -cudalib=cublas -lblas
  use cublas
  real*4 x(1000)
  real*4, device :: xd(1000)
  real*4, managed :: xm(1000)
  call random_number(x)
  ! Call host BLAS
  j = isamax(1000,x,1)
  xd = x
  ! Call cuBLAS
  k = isamax(1000,xd,1)
  print *,j.eq.k
  xm = x
  ! Also calls cuBLAS
  k = isamax(1000,xm,1)
  print *,j.eq.k
end
Using the cudafor module, the full set of CUDA functionality is available to programmers for managing CUDA events, streams, synchronization, and asynchronous behaviors. CUDA Fortran can be used in OpenMP programs, and the CUDA libraries in this document are thread-safe with respect to host CPU threads. Further examples are included in the Examples chapter.
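For instance, here is a minimal sketch of using CUDA events from the cudafor module to time a cuBLAS call on device data (the array contents and size are arbitrary):
program timedcall ! link with -cudalib=cublas
  use cublas
  use cudafor
  integer, parameter :: n = 100000
  type(cudaEvent) :: tstart, tstop
  real(4) :: time
  real(4), device :: xd(n)
  integer :: istat, k
  istat = cudaEventCreate(tstart)
  istat = cudaEventCreate(tstop)
  xd = 1.0
  istat = cudaEventRecord(tstart, 0)
  k = isamax(n, xd, 1)   ! resolves to cuBLAS for the device array
  istat = cudaEventRecord(tstop, 0)
  istat = cudaEventSynchronize(tstop)
  istat = cudaEventElapsedTime(time, tstart, tstop)
  print *, 'isamax returned', k, 'in', time, 'ms'
  istat = cudaEventDestroy(tstart)
  istat = cudaEventDestroy(tstop)
end program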
Using CUDA Libraries from CUDA Fortran Device Code
The cuRAND and NVSHMEM libraries have functions callable from CUDA Fortran device code, and their interfaces are accessed via the curand_device and nvshmem modules, respectively. The module interfaces are very similar to the modules used in OpenACC device code, but for CUDA Fortran, each subroutine and function is declared attributes([host,]device), and the subroutines and functions do not need to be marked acc routine seq.
module mrand
  use curand_device
  integer, parameter :: n = 500
contains
  attributes(global) subroutine randsub(a)
    real, device :: a(n,n,4)
    type(curandStateXORWOW) :: h
    integer(8) :: seed, seq, offset
    j = blockIdx%x; i = threadIdx%x
    seed = 12345_8 + j*n*n + i*2
    seq = 0_8
    offset = 0_8
    call curand_init(seed, seq, offset, h)
    do k = 1, 4
      a(i,j,k) = curand_uniform(h)
    end do
  end subroutine
end module

program t ! nvfortran t.cuf
  use mrand
  use cudafor ! recognize maxval, minval, sum w/managed
  real, managed :: a(n,n,4)
  a = 0.0
  call randsub<<<n,n>>>(a)
  print *,maxval(a),minval(a),sum(a)/(n*n*4)
end program
Pointer Modes in cuBLAS and cuSPARSE
Because the NVIDIA Fortran compiler can distinguish between host and device arguments, the NVIDIA modules for interfacing to cuBLAS and cuSPARSE handle pointer modes differently than CUDA C, which requires setting the mode explicitly for scalar arguments. Examples of scalar arguments which can reside either on the host or device are the alpha and beta scale factors to the *gemm functions.
Typically, when using the normal "non-_v2" interfaces in the cuBLAS and cuSPARSE modules, the runtime wrappers implicitly set and restore the library pointer mode behind the scenes. This adds a small but non-zero overhead to each call.
To avoid the implicit getting and setting of the pointer mode with every invocation of a library function, do the following:
For the BLAS, use the cublas_v2 module and the v2 entry points, such as cublasIsamax_v2. It is then the programmer's responsibility to properly set the pointer mode when needed. Examples of scalar arguments which do require setting the pointer mode are the alpha and beta scale factors passed to the *gemm routines, and the scalar results returned from the v2 versions of the *amax(), *amin(), *asum(), *rotg(), *rotmg(), *nrm2(), and *dot() functions. In the v2 interfaces shown in chapter 2, these scalar arguments carry the comment ! device or host variable. Examples of scalar arguments which do not require setting the pointer mode are increments, extents, and lengths such as incx, incy, n, lda, ldb, and ldc.
For the cuSPARSE library, each function listed in chapter 5 which contains scalar arguments with the comment ! device or host variable has a corresponding v2 interface, though it is not documented here. For instance, in addition to the interface named cusparseSaxpyi, there is another interface named cusparseSaxpyi_v2 with the exact same argument list, which calls into the cuSPARSE library directly and does not implicitly get or set the library pointer mode.
The CUDA default pointer mode is that the scalar arguments reside on the host. The NVIDIA runtime does not change that setting.
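For example, here is a minimal sketch of using the v2 entry points with a device-resident scalar result; it assumes the cublas_v2 module's handle management and pointer-mode functions described above:
program pmode ! link with -cudalib=cublas
  use cublas_v2
  use cudafor
  type(cublasHandle) :: h
  real(4), device :: xd(1000)
  integer(4), device :: kd   ! device-resident result
  integer(4) :: k, istat
  istat = cublasCreate(h)
  xd = 1.0
  ! The result is written to device memory, so switch the pointer mode.
  istat = cublasSetPointerMode(h, CUBLAS_POINTER_MODE_DEVICE)
  istat = cublasIsamax_v2(h, 1000, xd, 1, kd)
  istat = cublasSetPointerMode(h, CUBLAS_POINTER_MODE_HOST)
  k = kd   ! copy the result back to the host
  print *, k
  istat = cublasDestroy(h)
end program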
Writing Your Own CUDA Interfaces
Despite the large number of interfaces included in the modules described in this document, users will need, from time to time, to write their own interfaces to new libraries or to their own tuned CUDA code, perhaps written in C/C++. There are some standard techniques to use, and some non-standard NVIDIA extensions which can make creating working interfaces easier.
! cufftExecC2C
interface cufftExecC2C
  integer function cufftExecC2C( plan, idata, odata, direction ) &
      bind(C,name='cufftExecC2C')
    integer, value :: plan
    complex, device, dimension(*) :: idata, odata
    integer, value :: direction
  end function cufftExecC2C
end interface cufftExecC2C
This interface calls the C library function directly. You can deal with Fortran's capitalization issues by putting the properly capitalized C function name in the bind(C) attribute. If the C function expects input arguments passed by value, you can add the value attribute to the dummy declaration as well. A nice feature of Fortran is that the interface can change while the code at the call site may not have to; the compiler changes the details of the call to fit the interface.
Now suppose a user of this interface would like to call this function with REAL data (F77 code is notorious for mixing REAL and COMPLEX declarations). There are two ways to do this:
! cufftExecC2C
interface cufftExecC2C
  integer function cufftExecC2C( plan, idata, odata, direction ) &
      bind(C,name='cufftExecC2C')
    integer, value :: plan
    complex, device, dimension(*) :: idata, odata
    integer, value :: direction
  end function cufftExecC2C
  integer function cufftExecR2R( plan, idata, odata, direction ) &
      bind(C,name='cufftExecC2C')
    integer, value :: plan
    real, device, dimension(*) :: idata, odata
    integer, value :: direction
  end function cufftExecR2R
end interface cufftExecC2C
Here the C name hasn't changed. The compiler will now accept actual arguments corresponding to idata and odata that are declared REAL. A generic interface named cufftExecC2C is created. If you have problems debugging your generic interface, you can try calling the specific name, cufftExecR2R in this case, as a debugging aid to help diagnose the problem.
A commonly used extension which NVIDIA supports is ignore_tkr. A programmer can use it in an interface to instruct the compiler to ignore any combination of the type, kind, and rank during the interface-matching process. The previous example using ignore_tkr looks like this:
! cufftExecC2C
interface cufftExecC2C
  integer function cufftExecC2C( plan, idata, odata, direction ) &
      bind(C,name='cufftExecC2C')
    integer, value :: plan
    !dir$ ignore_tkr(tr) idata, (tr) odata
    complex, device, dimension(*) :: idata, odata
    integer, value :: direction
  end function cufftExecC2C
end interface cufftExecC2C
Now the compiler will ignore both the type and rank (F77 could also be sloppy in its handling of array dimensions) of idata and odata when matching the call site to the interface. An unfortunate side-effect is that the interface will now allow integer, logical, and character data for idata and odata. It is up to the implementor to determine if that is acceptable.
A final aid worth mentioning here, specific to NVIDIA, is ignore_tkr (d), which ignores the device attribute of an actual argument during interface matching.
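For instance, a sketch of an interface (with a hypothetical C wrapper name) whose first argument can be either a host or a device array:
! Hypothetical wrapper: (d) lets host or device actuals match.
interface
  subroutine scale_array(a, n) bind(C,name='scale_array_wrapper')
    !dir$ ignore_tkr (d) a
    real(4), device, dimension(*) :: a
    integer(4), value :: n
  end subroutine
end interface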
Of course, if you write a wrapper, a narrow strip of code between the Fortran call and your library function, you are not limited to the simple transformations that a compiler can do, such as those listed here. As mentioned earlier, many of the interfaces provided in the cuBLAS and cuSPARSE modules use wrappers.
A common request is a way for Fortran programmers to take advantage of the Thrust library. Explaining Thrust and C++ programming is outside the scope of this document, but this simple example shows how to use the excellent sort capabilities in Thrust:
// Filename: csort.cu
// nvcc -c -arch sm_35 csort.cu
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/sort.h>
extern "C" {
// Sort for integer arrays
void thrust_int_sort_wrapper(int *data, int N)
{
  thrust::device_ptr<int> dev_ptr(data);
  thrust::sort(dev_ptr, dev_ptr + N);
}
// Sort for float arrays
void thrust_float_sort_wrapper(float *data, int N)
{
  thrust::device_ptr<float> dev_ptr(data);
  thrust::sort(dev_ptr, dev_ptr + N);
}
// Sort for double arrays
void thrust_double_sort_wrapper(double *data, int N)
{
  thrust::device_ptr<double> dev_ptr(data);
  thrust::sort(dev_ptr, dev_ptr + N);
}
}
Setting up the interface to the sort subroutine in Fortran and calling it is simple:
program t
  interface sort
    subroutine sort_int(array, n) &
        bind(C,name='thrust_int_sort_wrapper')
      integer(4), device, dimension(*) :: array
      integer(4), value :: n
    end subroutine
  end interface
  integer(4), parameter :: n = 100
  integer(4), device :: a_d(n)
  integer(4) :: a_h(n)
  !$cuf kernel do
  do i = 1, n
    a_d(i) = 1 + mod(47*i,n)
  end do
  call sort(a_d, n)
  a_h = a_d
  nres = count(a_h .eq. (/(i,i=1,n)/))
  if (nres.eq.n) then
    print *,"test PASSED"
  else
    print *,"test FAILED"
  endif
end
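Since csort.cu above also defines float and double wrappers, the generic sort interface can be extended so that calls with real(4) or real(8) device arrays resolve to the matching Thrust sort:
interface sort
  subroutine sort_int(array, n) &
      bind(C,name='thrust_int_sort_wrapper')
    integer(4), device, dimension(*) :: array
    integer(4), value :: n
  end subroutine
  subroutine sort_float(array, n) &
      bind(C,name='thrust_float_sort_wrapper')
    real(4), device, dimension(*) :: array
    integer(4), value :: n
  end subroutine
  subroutine sort_double(array, n) &
      bind(C,name='thrust_double_sort_wrapper')
    real(8), device, dimension(*) :: array
    integer(4), value :: n
  end subroutine
end interface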
NVIDIA Fortran Compiler Options
The NVIDIA Fortran compiler driver is called nvfortran. General information on the compiler options which can be passed to nvfortran can be obtained by typing nvfortran -help. To enable targeting NVIDIA GPUs using OpenACC, use nvfortran -acc=gpu. To enable targeting NVIDIA GPUs using CUDA Fortran, use nvfortran -cuda. CUDA Fortran is also supported by the NVIDIA Fortran compilers when the filename uses the .cuf extension. Uppercase file extensions, .F90 or .CUF, for example, may also be used, in which case the program is processed by the preprocessor before being compiled.
Other options which are pertinent to the examples in this document are:
-cudalib[=cublas|cufft|cufftw|curand|cusolver|cusparse|cutensor|nvblas|nccl|nvshmem|nvlamath|nvtx]: this option adds the appropriate versions of the CUDA-optimized libraries to the link line. It handles static and dynamic linking, and platform (Linux, Windows) differences unobtrusively.
-gpu=cc70: this option compiles for compute capability 7.0. Certain library functionality may require a minimum compute capability of 6.0, 7.0, or higher.
-gpu=cudaX.Y: this option compiles and links with a particular CUDA Toolkit version. Certain library functionality may require a newer (or older, for deprecated functions) CUDA runtime version.
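For example, the cuFFT example from the OpenACC section of this chapter might be built with nvfortran -acc=gpu -cudalib=cufft, and the isamax example from the CUDA Fortran section with nvfortran -cuda -cudalib=cublas -lblas, where -lblas resolves the host BLAS call.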