NVIDIA Fortran CUDA Interfaces
Preface
This document describes the NVIDIA Fortran interfaces to cuBLAS, cuFFT, cuRAND, cuSPARSE, and other CUDA Libraries used in scientific and engineering applications built upon the CUDA computing architecture.
Intended Audience
This guide is intended for application programmers, scientists and engineers proficient in programming with the Fortran language. This guide assumes some familiarity with either CUDA Fortran or OpenACC.
Organization
The organization of this document is as follows:
- Introduction
contains a general introduction to Fortran interfaces, OpenACC, CUDA Fortran, and CUDA Library functions
- BLAS Runtime Library
describes the Fortran interfaces to the various cuBLAS libraries
- FFT Runtime Library APIs
describes the module types, definitions and Fortran interfaces to the cuFFT library
- Random Number Runtime APIs
describes the Fortran interfaces to the host and device cuRAND libraries
- Sparse Matrix Runtime APIs
describes the module types, definitions and Fortran interfaces to the cuSPARSE Library
- Matrix Solver Runtime APIs
describes the module types, definitions and Fortran interfaces to the cuSOLVER Library
- Tensor Primitives Runtime APIs
describes the module types, definitions and Fortran interfaces to the cuTENSOR Library
- NVIDIA Collective Communications Library APIs
describes the module types, definitions and Fortran interfaces to the NCCL Library
- NVSHMEM Communication Library APIs
describes the module types, definitions and Fortran interfaces to the NVSHMEM Library
- NVTX Profiling Library APIs
describes the module types, definitions and Fortran interfaces to the NVTX API and Library
- Examples
provides sample code and an explanation of each of the simple examples.
Conventions
This guide uses the following conventions:
- italic
is used for emphasis.
Constant Width
is used for filenames, directories, arguments, options, examples, and for language statements in the text, including assembly language statements.
- Bold
is used for commands.
- [ item1 ]
in general, square brackets indicate optional items. In this case item1 is optional. In the context of p/t-sets, square brackets are required to specify a p/t-set.
- { item2 | item 3 }
braces indicate that a selection is required. In this case, you must select either item2 or item3.
- filename …
ellipsis indicate a repetition. Zero or more of the preceding item may occur. In this example, multiple filenames are allowed.
FORTRAN
Fortran language statements are shown in the text of this guide using a reduced fixed point size.
C++ and C
C++ and C language statements are shown in the test of this guide using a reduced fixed point size.
Terminology
If there are terms in this guide with which you are unfamiliar, see the NVIDIA HPC glossary.
Related Publications
The following documents contain additional information related to OpenACC and CUDA Fortran programming, CUDA, and the CUDA Libraries.
ISO/IEC 1539-1:1997, Information Technology – Programming Languages – FORTRAN, Geneva, 1997 (Fortran 95).
NVIDIA CUDA Programming Guides, NVIDIA. Available online at docs.nvidia.com/cuda.
NVIDIA HPC Compiler User’s Guide, Release 2024. Available online at docs.nvidia.com/hpc-sdk.
1. Introduction
This document provides a reference for calling CUDA Library functions from NVIDIA Fortran. It can be used from Fortran code using the OpenACC or OpenMP programming models, or from NVIDIA CUDA Fortran. Currently, the CUDA libraries which NVIDIA provides pre-built interface modules for, and which are documented here, are:
cuBLAS, an implementation of the BLAS.
cuFFT, a library of Fast Fourier Transform (FFT) routines.
cuRAND, a library for random number generation.
cuSPARSE, a library of linear algebra routines used with sparse matrices.
cuSOLVER, a library of equation solvers used with dense or other matrices.
cuTENSOR, a library for tensor primitive operations.
NCCL, a collective communications librarys.
NVSHMEM, a library implementation of OpenSHMEM on GPUs.
NVTX, an API for annotating application events, code ranges, and resources.
The OpenACC Application Program Interface is a collection of compiler directives and runtime routines that allows the programmer to specify loops and regions of code for offloading from a host CPU to an attached accelerator, such as a GPU. The OpenACC API was designed and is maintained by an industry consortium. See the OpenACC website for more information about the OpenACC API.
OpenMP is a specification for a set of compiler directives, an applications programming interface (API), and a set of environment variables that can be used to specify parallel execution from Fortran (and other languages). The OpenMP target offload capabilities are similar in many respects to OpenACC. The methods for passing device arrays to library functions from host code differ only in syntax compared to those used in OpenACC. For general information about using OpenMP and to obtain a copy of the OpenMP specification, refer to the OpenMP organization’s website.
CUDA Fortran is a small set of extensions to Fortran that supports and is built upon the CUDA computing architecture. CUDA Fortran includes a Fortran 2003 compiler and tool chain for programming NVIDIA GPUs using Fortran, and is an analog to NVIDIA’s CUDA C compiler. Compared to the NVIDIA Accelerator and OpenACC directives-based model and compilers, CUDA Fortran is a lower-level explicit programming model with substantial runtime library components that give expert programmers direct control of all aspects of GPGPU programming.
This document does not contain explanations or purposes of the library functions, nor does it contain details of the approach used in the CUDA implementation to target GPUs. For that information, please see the appropriate library document that comes with the NVIDIA CUDA Toolkit. This document does provide the Fortran module contents: derived types, enumerations, and interfaces, to make use of the libraries from Fortran rather than from C or C++.
Many of the examples used in this document are provided in the HPC compiler and tools distribution, along with Makefiles, and are stored in the yearly directory, such as 2020/examples/CUDA-Libraries.
1.1. Fortran Interfaces and Wrappers
Almost all of the function interfaces shown in this document make use of features from the Fortran 2003 iso_c_binding intrinsic module. This module provides a standard way for dealing with isues such as inter-language data types, capitalization, adding underscores to symbol names, or passing arguments by value.
Often, the iso_c_binding module enables Fortran programs containing properly written interfaces to call directly into the C library functions. In some cases, NVIDIA has written small wrappers around the C library function, to make the Fortran call site more “Fortran-like”, hiding some issues exposed in the C interfaces like handle management, host vs. device pointer management, or character and complex data type issues.
In a small number of cases, the C Library may contain multiple entry points to handle different data types, perhaps an int in one function and a size_t in another, otherwise the functions are identical. In these cases, NVIDIA may provide just one generic Fortran interface, and will call the appropriate C function under the hood.
1.2. Using CUDA Libraries from OpenACC Host Code
All of the libraries covered in this document contain functions which are callable from OpenACC host code. Most functions take some arguments which are expected to be device pointers (the address of a variable in device global memory). There are several ways to do that in OpenACC.
If the call is lexically nested within an OpenACC data directive, the NVIDIA Fortran compiler, in the presence of an explicit interface such as those provided by the NVIDIA library modules, will default to passing the device pointer when required.
subroutine hostcall(a, b, n)
use cublas
real a(n), b(n)
!$acc data copy(a, b)
call cublasSswap(n, a, 1, b, 1)
!$acc end data
return
end
A Fortran interface is made explicit when you use the module that contains it, as in the line use cublas in the example above. If you look ahead to the actual interface for cublasSswap, you will see that the arrays a and b are declared with the CUDA Fortran device attribute, so they take only device addresses as arguments.
It is more acceptable and general when using OpenACC to pass device pointers to subprograms by using the host_data clause as most implementations don’t have a way to mark arguments as device pointers. The host_data construct with the use_device clause makes the device addresses available in host code for passing to the subprogram.
use cufft
use openacc
. . .
!$acc data copyin(a), copyout(b,c)
ierr = cufftPlan2D(iplan1,m,n,CUFFT_C2C)
ierr = ierr + cufftSetStream(iplan1,acc_get_cuda_stream(acc_async_sync))
!$acc host_data use_device(a,b,c)
ierr = ierr + cufftExecC2C(iplan1,a,b,CUFFT_FORWARD)
ierr = ierr + cufftExecC2C(iplan1,b,c,CUFFT_INVERSE)
!$acc end host_data
! scale c
!$acc kernels
c = c / (m*n)
!$acc end kernels
!$acc end data
This code snippet also shows an example of sharing the stream that OpenACC and the cuFFT library use. Every library in this document has a function for setting the CUDA stream which the library runs on. Usually, when using OpenACC, you want the OpenACC kernels to run on the same stream as the library functions. In the case above, this guarantees that the kernel c = c / (m*n)
does not start until the FFT operations complete. The function acc_get_cuda_stream and the definition for acc_async_sync are in the openacc module.
1.3. Using CUDA Libraries from OpenACC Device Code
Two libraries are currently available from within OpenACC compute regions. Certain functions in both the openacc_curand module and the nvshmem module are marked acc routine seq
.
The cuRAND device library is all contained within CUDA header files. In device code, it is designed to return one or a small number of random numbers per thread. The thread’s random generators run independently of each other, and it is usually advised for performance reasons to give each thread a different seed, rather than a different offset.
program t
use openacc_curand
integer, parameter :: n = 500
real a(n,n,4)
type(curandStateXORWOW) :: h
integer(8) :: seed, seq, offset
a = 0.0
!$acc parallel num_gangs(n) vector_length(n) copy(a)
!$acc loop gang
do j = 1, n
!$acc loop vector private(h)
do i = 1, n
seed = 12345_8 + j*n*n + i*2
seq = 0_8
offset = 0_8
call curand_init(seed, seq, offset, h)
!$acc loop seq
do k = 1, 4
a(i,j,k) = curand_uniform(h)
end do
end do
end do
!$acc end parallel
print *,maxval(a),minval(a),sum(a)/(n*n*4)
end
When using the openacc_curand module, since all the code is contained in CUDA header files, you do not need any additional libraries on the link line.
1.4. Using CUDA Libraries from CUDA Fortran Host Code
The predominant usage model for the library functions listed in this document is to call them from CUDA Host code. CUDA Fortran allows some special capabilities in that the compiler is able to recognize the device and managed attribute in resolving generic interfaces. Device actual arguments can only match the interface’s device dummy arguments; managed actual arguments, by precedence, match managed dummy arguments first, then device dummies, then host.
program testisamax ! link with -cudalib=cublas -lblas
use cublas
real*4 x(1000)
real*4, device :: xd(1000)
real*4, managed :: xm(1000)
call random_number(x)
! Call host BLAS
j = isamax(1000,x,1)
xd = x
! Call cuBLAS
k = isamax(1000,xd,1)
print *,j.eq.k
xm = x
! Also calls cuBLAS
k = isamax(1000,xm,1)
print *,j.eq.k
end
Using the cudafor
module, the full set of CUDA functionality is available to programmers for managing CUDA events, streams, synchronization, and asynchronous behaviors. CUDA Fortran can be used in OpenMP programs, and the CUDA Libraries in this document are thread safe with respect to host CPU threads. Further examples are included in chapter Examples.
1.5. Using CUDA Libraries from CUDA Fortran Device Code
The cuRAND and NVSHMEM libraries have functions callable from CUDA Fortran device code, and their interfaces are accessed via the curand_device and nvshmem modules, respectively. The module interfaces are very similar to the modules used in OpenACC device code, but for CUDA Fortran, each subroutine and function is declared attributes([host,]device), and the subroutines and functions do not need to be marked as acc routine seq
.
module mrand
use curand_device
integer, parameter :: n = 500
contains
attributes(global) subroutine randsub(a)
real, device :: a(n,n,4)
type(curandStateXORWOW) :: h
integer(8) :: seed, seq, offset
j = blockIdx%x; i = threadIdx%x
seed = 12345_8 + j*n*n + i*2
seq = 0_8
offset = 0_8
call curand_init(seed, seq, offset, h)
do k = 1, 4
a(i,j,k) = curand_uniform(h)
end do
end subroutine
end module
program t ! nvfortran t.cuf
use mrand
use cudafor ! recognize maxval, minval, sum w/managed
real, managed :: a(n,n,4)
a = 0.0
call randsub<<<n,n>>>(a)
print *,maxval(a),minval(a),sum(a)/(n*n*4)
end program
1.6. Pointer Modes in cuBLAS and cuSPARSE
Because the NVIDIA Fortran compiler can distinguish between host and device arguments, the NVIDIA modules for interfacing to cuBLAS and cuSPARSE handle pointer modes differently than CUDA C, which requires setting the mode explicitly for scalar arguments. Examples of scalar arguments which can reside either on the host or device are the alpha and beta scale factors to the *gemm functions.
Typically, when using the normal “non-_v2” interfaces in the cuBLAS and cuSPARSE modules, the runtime wrappers will implicitly add the setting and restoring of the library pointer modes behind the scenes. This adds some negligible but non-zero overhead to the calls.
To avoid the implicit getting and setting of the pointer mode with every invocation of a library function do the following:
For the BLAS, use the
cublas_v2
module, and the v2 entry points, such ascublasIsamax_v2
. It is the programmer’s responsibility to properly set the pointer mode when needed. Examples of scalar arguments which do require setting the pointer mode are the alpha and beta scale factors passed to the *gemm routines, and the scalar results returned from the v2 versions of the *amax(), *amin(), *asum(), *rotg(), *rotmg(), *nrm2(), and *dot() functions. In the v2 interfaces shown in the chapter 2, these scalar arguments will have the comment! device or host variable
. Examples of scalar arguments which do not require setting the pointer mode are increments, extents, and lengths such as incx, incy, n, lda, ldb, and ldc.For the cuSPARSE library, each function listed in chapter 5 which contains scalar arguments with the comment
! device or host variable
has a corresponding v2 interface, though it is not documented here. For instance, in addition to the interface namedcusparseSaxpyi
, there is another interface namedcusparseSaxpyi_v2
with the exact same argument list which calls into the cuSPARSE library directly and will not implicitly get or set the library pointer mode.
The CUDA default pointer mode is that the scalar arguments reside on the host. The NVIDIA runtime does not change that setting.
1.7. Writing Your Own CUDA Interfaces
Despite the large number of interfaces included in the modules described in this document, users will have the need from time-to-time to write their own interfaces to new libraries or their own tuned CUDA, perhaps written in C/C++. There are some standard techniques to use, and some non-standard NVIDIA extensions which can make creating working interfaces easier.
! cufftExecC2C
interface cufftExecC2C
integer function cufftExecC2C( plan, idata, odata, direction ) &
bind(C,name='cufftExecC2C')
integer, value :: plan
complex, device, dimension(*) :: idata, odata
integer, value :: direction
end function cufftExecC2C
end interface cufftExecC2C
This interface calls the C library function directly. You can deal with Fortran’s capitalization issues by putting the properly capitalized C function in the bind(C)
attribute. If the C function expects input arguments passed by value, you can add the value
attribute to the dummy declaration as well. A nice feature of Fortran is that the interface can change, but the code at the call site may not have to. The compiler changes the details of the call to fit the interface.
Now suppose a user of this interface would like to call this function with REAL data (F77 code is notorious for mixing REAL and COMPLEX declarations). There are two ways to do this:
! cufftExecC2C
interface cufftExecC2C
integer function cufftExecC2C( plan, idata, odata, direction ) &
bind(C,name='cufftExecC2C')
integer, value :: plan
complex, device, dimension(*) :: idata, odata
integer, value :: direction
end function cufftExecC2C
integer function cufftExecR2R( plan, idata, odata, direction ) &
bind(C,name='cufftExecC2C')
integer, value :: plan
real, device, dimension(*) :: idata, odata
integer, value :: direction
end function cufftExecR2R
end interface cufftExecC2C
Here the C name hasn’t changed. The compiler will now accept actual arguments corresponding to idata and odata that are declared REAL. A generic interface is created named cufftExecC2C
. If you have problems debugging your generic interface, as a debugging aid you can try calling the specific name, cufftExecR2R
in this case, to help diagnose the problem.
A commonly used extension which is supported by NVIDIA is ignore_tkr
. A programmer can use it in an interface to instruct the compiler to ignore any combination of the type, kind, and rank during the interface matching process. The previous example using ignore_tkr
looks like this:
! cufftExecC2C
interface cufftExecC2C
integer function cufftExecC2C( plan, idata, odata, direction ) &
bind(C,name='cufftExecC2C')
integer, value :: plan
!dir$ ignore_tkr(tr) idata, (tr) odata
complex, device, dimension(*) :: idata, odata
integer, value :: direction
end function cufftExecC2C
end interface cufftExecC2C
Now the compiler will ignore both the type and rank (F77 could also be sloppy in its handling of array dimensions) of idata and odata when matching the call site to the interface. An unfortunate side-effect is that the interface will now allow integer, logical, and character data for idata and odata. It is up to the implementor to determine if that is acceptable.
A final aid, specific to NVIDIA, worth mentioning here is ignore_tkr (d)
, which ignores the device attribute of an actual argument during interface matching.
Of course, if you write a wrapper, a narrow strip of code between the Fortran call and your library function, you are not limited by the simple transormations that a compiler can do, such as those listed here. As mentioned earlier, many of the interfaces provided in the cuBLAS and cuSPARSE modules use wrappers.
A common request is a way for Fortran programmers to take advantage of the thrust library. Explaining thrust and C++ programming is outside of the scope of this document, but this simple example can show how to take advantage of the excellent sort capabilities in thrust:
// Filename: csort.cu
// nvcc -c -arch sm_35 csort.cu
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/sort.h>
extern "C" {
//Sort for integer arrays
void thrust_int_sort_wrapper( int *data, int N)
{
thrust::device_ptr <int> dev_ptr(data);
thrust::sort(dev_ptr, dev_ptr+N);
}
//Sort for float arrays
void thrust_float_sort_wrapper( float *data, int N)
{
thrust::device_ptr <float> dev_ptr(data);
thrust::sort(dev_ptr, dev_ptr+N);
}
//Sort for double arrays
void thrust_double_sort_wrapper( double *data, int N)
{
thrust::device_ptr <double> dev_ptr(data);
thrust::sort(dev_ptr, dev_ptr+N);
}
}
Set up interface to the sort subroutine in Fortran and calls are simple:
program t
interface sort
subroutine sort_int(array, n) &
bind(C,name='thrust_int_sort_wrapper')
integer(4), device, dimension(*) :: array
integer(4), value :: n
end subroutine
end interface
integer(4), parameter :: n = 100
integer(4), device :: a_d(n)
integer(4) :: a_h(n)
!$cuf kernel do
do i = 1, n
a_d(i) = 1 + mod(47*i,n)
end do
call sort(a_d, n)
a_h = a_d
nres = count(a_h .eq. (/(i,i=1,n)/))
if (nres.eq.n) then
print *,"test PASSED"
else
print *,"test FAILED"
endif
end
1.8. NVIDIA Fortran Compiler Options
The NVIDIA Fortran compiler driver is called nvfortran. General information on the compiler options which can be passed to nvfortran can be obtained by typing nvfortran -help. To enable targeting NVIDIA GPUs using OpenACC, use nvfortran -acc=gpu. To enable targeting NVIDIA GPUs using CUDA Fortran, use nvfortran -cuda. CUDA Fortran is also supported by the NVIDIA Fortran compilers when the filename uses the .cuf extension. Uppercase file extensions, .F90 or .CUF, for example, may also be used, in which case the program is processed by the preprocessor before being compiled.
Other options which are pertinent to the examples in this document are:
-cudalib[=cublas|cufft|cufftw|curand|cusolver|cusparse|cutensor|nvblas|nccl|nvshmem|nvlamath|nvtx]: this option adds the appropriate versions of the CUDA-optimized libraries to the link line. It handles static and dynamic linking, and platform (Linux, Windows) differences unobtrusively.
-gpu=cc70: this option compiles for compute capability 7.0. Certain library functionality may require minimum compute capability of 6.0, 7.0, or higher.
-gpu=cudaX.Y: this option compiles and links with a particular CUDA Toolkit version. Certain library functionality may require a newer (or older, for deprecated functions) CUDA runtime version.
2. BLAS Runtime APIs
This section describes the Fortran interfaces to the CUDA BLAS libraries. There are currently four separate collections of function entry points which are commonly referred to as the cuBLAS:
The original CUDA implementation of the BLAS routines, referred to as the legacy API, which are callable from the host and expect and operate on device data.
The newer “v2” CUDA implementation of the BLAS routines, plus some extensions for batched operations. These are also callable from the host and operate on device data. In Fortran terms, these entry points have been changed from subroutines to functions which return status.
The cuBLAS XT library which can target multiple GPUs using only host-resident data.
The cuBLAS MP library which can target multiple GPUs using distributed device data, similar to the ScaLAPACK PBLAS functions. The cublasMp and cusolverMp libraries are built, in part, upon a communications library named CAL, which is documented in another section of this document.
NVIDIA currently ships with four Fortran modules which programmers can use to call into this cuBLAS functionality:
cublas, which provides interfaces to into the main cublas library. Both the legacy and v2 names are supported. In this module, the cublas names (such as cublasSaxpy) use the legacy calling conventions. Interfaces to a host BLAS library (for instance libblas.a in the NVIDIA distribution) are also included in the cublas module. These interfaces are exposed by adding the line
use cublas
to your program unit.
cublas_v2, which is similar to the cublas module in most ways except the cublas names (such as cublasSaxpy) use the v2 calling conventions. For instance, instead of a subroutine, cublasSaxpy is a function which takes a handle as the first argument and returns an integer containing the status of the call. These interfaces are exposed by adding the line
use cublas_v2
to your program unit.
cublasxt, which interfaces directly to the cublasXT API. These interfaces are exposed by adding the line
use cublasxt
to your program unit.
cublasmp, which provides interfaces into the cublasMp API. These interfaces are exposed by adding the line
use cublasMp
to your program unit.
The v2 routines are integer functions that return an error status code; they return a value of CUBLAS_STATUS_SUCCESS if the call was successful, or other cuBLAS status return value if there was an error.
Documented interfaces to the traditional BLAS names in the subsequent sections, which contain the comment ! device or host variable
should not be confused with the pointer mode issue from section 1.6. The traditional BLAS names are overloaded generic names in the cublas
module. For instance, in this interface
subroutine scopy(n, x, incx, y, incy)
integer :: n
real(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
The arrays x and y can either both be device arrays, in which case cublasScopy
is called via the generic interface, or they can both be host arrays, in which case scopy
from the host BLAS library is called. Using CUDA Fortran managed data as actual arguments to scopy
poses an interesting case, and calling cublasScopy
is chosen by default. If you wish to call the host library version of scopy with managed data, don’t expose the generic scopy interface at the call site.
Unless a specific kind is provided, in the following interfaces the plain integer type implies integer(4) and the plain real type implies real(4).
2.1. CUBLAS Definitions and Helper Functions
This section contains definitions and data types used in the cuBLAS library and interfaces to the cuBLAS Helper Functions.
The cublas module contains the following derived type definitions:
TYPE cublasHandle
TYPE(C_PTR) :: handle
END TYPE
The cuBLAS module contains the following enumerations:
enum, bind(c)
enumerator :: CUBLAS_STATUS_SUCCESS =0
enumerator :: CUBLAS_STATUS_NOT_INITIALIZED =1
enumerator :: CUBLAS_STATUS_ALLOC_FAILED =3
enumerator :: CUBLAS_STATUS_INVALID_VALUE =7
enumerator :: CUBLAS_STATUS_ARCH_MISMATCH =8
enumerator :: CUBLAS_STATUS_MAPPING_ERROR =11
enumerator :: CUBLAS_STATUS_EXECUTION_FAILED=13
enumerator :: CUBLAS_STATUS_INTERNAL_ERROR =14
end enum
enum, bind(c)
enumerator :: CUBLAS_FILL_MODE_LOWER=0
enumerator :: CUBLAS_FILL_MODE_UPPER=1
end enum
enum, bind(c)
enumerator :: CUBLAS_DIAG_NON_UNIT=0
enumerator :: CUBLAS_DIAG_UNIT=1
end enum
enum, bind(c)
enumerator :: CUBLAS_SIDE_LEFT =0
enumerator :: CUBLAS_SIDE_RIGHT=1
end enum
enum, bind(c)
enumerator :: CUBLAS_OP_N=0
enumerator :: CUBLAS_OP_T=1
enumerator :: CUBLAS_OP_C=2
end enum
enum, bind(c)
enumerator :: CUBLAS_POINTER_MODE_HOST = 0
enumerator :: CUBLAS_POINTER_MODE_DEVICE = 1
end enum
2.1.1. cublasCreate
This function initializes the CUBLAS library and creates a handle to an opaque structure holding the CUBLAS library context. It allocates hardware resources on the host and device and must be called prior to making any other CUBLAS library calls. The CUBLAS library context is tied to the current CUDA device. To use the library on multiple devices, one CUBLAS handle needs to be created for each device. Furthermore, for a given device, multiple CUBLAS handles with different configuration can be created. Because cublasCreate allocates some internal resources and the release of those resources by calling cublasDestroy will implicitly call cublasDeviceSynchronize, it is recommended to minimize the number of cublasCreate/cublasDestroy occurences. For multi-threaded applications that use the same device from different threads, the recommended programming model is to create one CUBLAS handle per thread and use that CUBLAS handle for the entire life of the thread.
integer(4) function cublasCreate(handle)
type(cublasHandle) :: handle
2.1.2. cublasDestroy
This function releases hardware resources used by the CUBLAS library. This function is usually the last call with a particular handle to the CUBLAS library. Because cublasCreate allocates some internal resources and the release of those resources by calling cublasDestroy will implicitly call cublasDeviceSynchronize, it is recommended to minimize the number of cublasCreate/cublasDestroy occurences.
integer(4) function cublasDestroy(handle)
type(cublasHandle) :: handle
2.1.3. cublasGetVersion
This function returns the version number of the cuBLAS library.
integer(4) function cublasGetVersion(handle, version)
type(cublasHandle) :: handle
integer(4) :: version
2.1.4. cublasSetStream
This function sets the cuBLAS library stream, which will be used to execute all subsequent calls to the cuBLAS library functions. If the cuBLAS library stream is not set, all kernels use the default NULL stream. In particular, this routine can be used to change the stream between kernel launches and then to reset the cuBLAS library stream back to NULL.
integer(4) function cublasSetStream(handle, stream)
type(cublasHandle) :: handle
integer(kind=cuda_stream_kind()) :: stream
2.1.5. cublasGetStream
This function gets the cuBLAS library stream, which is being used to execute all calls to the cuBLAS library functions. If the cuBLAS library stream is not set, all kernels use the default NULL stream.
integer(4) function cublasGetStream(handle, stream)
type(cublasHandle) :: handle
integer(kind=cuda_stream_kind()) :: stream
2.1.6. cublasGetStatusName
This function returns the cuBLAS status name associated with a given status value.
character(128) function cublasGetStatusName(ierr)
integer(4) :: ierr
2.1.7. cublasGetStatusString
This function returns the cuBLAS status string associated with a given status value.
character(128) function cublasGetStatusString(ierr)
integer(4) :: ierr
2.1.8. cublasGetPointerMode
This function obtains the pointer mode used by the cuBLAS library. In the cublas
module, the pointer mode is set and reset on a call-by-call basis depending on the whether the device attribute is set on scalar actual arguments. See section 1.6 for a discussion of pointer modes.
integer(4) function cublasGetPointerMode(handle, mode)
type(cublasHandle) :: handle
integer(4) :: mode
2.1.9. cublasSetPointerMode
This function sets the pointer mode used by the cuBLAS library. When using the cublas
module, the pointer mode is set on a call-by-call basis depending on the whether the device attribute is set on scalar actual arguments. When using the cublas_v2
module with v2 interfaces, it is the programmer’s responsibility to make calls to cublasSetPointerMode
so scalar arguments are handled correctly by the library. See section 1.6 for a discussion of pointer modes.
integer(4) function cublasSetPointerMode(handle, mode)
type(cublasHandle) :: handle
integer(4) :: mode
2.1.10. cublasGetAtomicsMode
This function obtains the atomics mode used by the cuBLAS library.
integer(4) function cublasGetAtomicsMode(handle, mode)
type(cublasHandle) :: handle
integer(4) :: mode
2.1.11. cublasSetAtomicsMode
This function sets the atomics mode used by the cuBLAS library. Some routines in the cuBLAS library have alternate implementations that use atomics to accumulate results. These alternate implementations may run faster but may also generate results which are not identical from one run to the other. The default is to not allow atomics in cuBLAS functions.
integer(4) function cublasSetAtomicsMode(handle, mode)
type(cublasHandle) :: handle
integer(4) :: mode
2.1.12. cublasGetMathMode
This function obtains the math mode used by the cuBLAS library.
integer(4) function cublasGetMathMode(handle, mode)
type(cublasHandle) :: handle
integer(4) :: mode
2.1.13. cublasSetMathMode
This function sets the math mode used by the cuBLAS library. Some routines in the cuBLAS library allow you to choose the compute precision used to generate results. These alternate approaches may run faster but may also generate different, less accurate results.
integer(4) function cublasSetMathMode(handle, mode)
type(cublasHandle) :: handle
integer(4) :: mode
2.1.14. cublasGetSmCountTarget
This function obtains the SM count target used by the cuBLAS library.
integer(4) function cublasGetSmCountTarget(handle, counttarget)
type(cublasHandle) :: handle
integer(4) :: counttarget
2.1.15. cublasSetSmCountTarget
This function sets the SM count target used by the cuBLAS library.
integer(4) function cublasSetSmCountTarget(handle, counttarget)
type(cublasHandle) :: handle
integer(4) :: counttarget
2.1.16. cublasGetHandle
This function gets the cuBLAS handle currently in use by a thread. The CUDA Fortran runtime keeps track of a CPU thread’s current handle, if you are either using the legacy BLAS API, or do not wish to pass the handle through to low-level functions or subroutines manually.
type(cublashandle) function cublasGetHandle()
integer(4) function cublasGetHandle(handle)
type(cublasHandle) :: handle
2.1.17. cublasSetVector
This function copies n elements from a vector x in host memory space to a vector y in GPU memory space. It is assumed that each element requires storage of elemSize bytes. In CUDA Fortran, the type of vector x and y is overloaded to take any data type, but the size of the data type must still be specified in bytes. This functionality can also be implemented using cudaMemcpy or array assignment statements.
integer(4) function cublassetvector(n, elemsize, x, incx, y, incy)
integer :: n, elemsize, incx, incy
integer*1, dimension(*) :: x
integer*1, device, dimension(*) :: y
2.1.18. cublasGetVector
This function copies n elements from a vector x in GPU memory space to a vector y in host memory space. It is assumed that each element requires storage of elemSize bytes. In CUDA Fortran, the type of vector x and y is overloaded to take any data type, but the size of the data type must still be specified in bytes. This functionality can also be implemented using cudaMemcpy or array assignment statements.
integer(4) function cublasgetvector(n, elemsize, x, incx, y, incy)
integer :: n, elemsize, incx, incy
integer*1, device, dimension(*) :: x
integer*1, dimension(*) :: y
2.1.19. cublasSetMatrix
This function copies a tile of rows x cols elements from a matrix A in host memory space to a matrix B in GPU memory space. It is assumed that each element requires storage of elemSize bytes. In CUDA Fortran, the type of Matrix A and B is overloaded to take any data type, but the size of the data type must still be specified in bytes. This functionality can also be implemented using cudaMemcpy, cudaMemcpy2D, or array assignment statements.
integer(4) function cublassetmatrix(rows, cols, elemsize, a, lda, b, ldb)
integer :: rows, cols, elemsize, lda, ldb
integer*1, dimension(lda, *) :: a
integer*1, device, dimension(ldb, *) :: b
2.1.20. cublasGetMatrix
This function copies a tile of rows x cols elements from a matrix A in GPU memory space to a matrix B in host memory space. It is assumed that each element requires storage of elemSize bytes. In CUDA Fortran, the type of Matrix A and B is overloaded to take any data type, but the size of the data type must still be specified in bytes. This functionality can also be implemented using cudaMemcpy, cudaMemcpy2D, or array assignment statements.
integer(4) function cublasgetmatrix(rows, cols, elemsize, a, lda, b, ldb)
integer :: rows, cols, elemsize, lda, ldb
integer*1, device, dimension(lda, *) :: a
integer*1, dimension(ldb, *) :: b
2.1.21. cublasSetVectorAsync
This function copies n elements from a vector x in host memory space to a vector y in GPU memory space, asynchronously, on the given CUDA stream. It is assumed that each element requires storage of elemSize bytes. In CUDA Fortran, the type of vector x and y is overloaded to take any data type, but the size of the data type must still be specified in bytes. This functionality can also be implemented using cudaMemcpyAsync.
integer(4) function cublassetvectorasync(n, elemsize, x, incx, y, incy, stream)
integer :: n, elemsize, incx, incy
integer*1, dimension(*) :: x
integer*1, device, dimension(*) :: y
integer(kind=cuda_stream_kind()) :: stream
2.1.22. cublasGetVectorAsync
This function copies n elements from a vector x in host memory space to a vector y in GPU memory space, asynchronously, on the given CUDA stream. It is assumed that each element requires storage of elemSize bytes. In CUDA Fortran, the type of vector x and y is overloaded to take any data type, but the size of the data type must still be specified in bytes. This functionality can also be implemented using cudaMemcpyAsync.
integer(4) function cublasgetvectorasync(n, elemsize, x, incx, y, incy, stream)
integer :: n, elemsize, incx, incy
integer*1, device, dimension(*) :: x
integer*1, dimension(*) :: y
integer(kind=cuda_stream_kind()) :: stream
2.1.23. cublasSetMatrixAsync
This function copies a tile of rows x cols elements from a matrix A in host memory space to a matrix B in GPU memory space, asynchronously using the specified stream. It is assumed that each element requires storage of elemSize bytes. In CUDA Fortran, the type of Matrix A and B is overloaded to take any data type, but the size of the data type must still be specified in bytes. This functionality can also be implemented using cudaMemcpyAsync or cudaMemcpy2DAsync.
integer(4) function cublassetmatrixasync(rows, cols, elemsize, a, lda, b, ldb, stream)
integer :: rows, cols, elemsize, lda, ldb
integer*1, dimension(lda, *) :: a
integer*1, device, dimension(ldb, *) :: b
integer(kind=cuda_stream_kind()) :: stream
2.1.24. cublasGetMatrixAsync
This function copies a tile of rows x cols elements from a matrix A in GPU memory space to a matrix B in host memory space, asynchronously, using the specified stream. It is assumed that each element requires storage of elemSize bytes. In CUDA Fortran, the type of Matrix A and B is overloaded to take any data type, but the size of the data type must still be specified in bytes. This functionality can also be implemented using cudaMemcpyAsync or cudaMemcpy2DAsync.
integer(4) function cublasgetmatrixasync(rows, cols, elemsize, a, lda, b, ldb, stream)
integer :: rows, cols, elemsize, lda, ldb
integer*1, device, dimension(lda, *) :: a
integer*1, dimension(ldb, *) :: b
integer(kind=cuda_stream_kind()) :: stream
2.2. Single Precision Functions and Subroutines
This section contains interfaces to the single precision BLAS and cuBLAS functions and subroutines.
2.2.1. isamax
ISAMAX finds the index of the element having the maximum absolute value.
integer(4) function isamax(n, x, incx)
integer :: n
real(4), device, dimension(*) :: x ! device or host variable
integer :: incx
integer(4) function cublasIsamax(n, x, incx)
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
integer(4) function cublasIsamax_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
2.2.2. isamin
ISAMIN finds the index of the element having the minimum absolute value.
integer(4) function isamin(n, x, incx)
integer :: n
real(4), device, dimension(*) :: x ! device or host variable
integer :: incx
integer(4) function cublasIsamin(n, x, incx)
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
integer(4) function cublasIsamin_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
2.2.3. sasum
SASUM takes the sum of the absolute values.
real(4) function sasum(n, x, incx)
integer :: n
real(4), device, dimension(*) :: x ! device or host variable
integer :: incx
real(4) function cublasSasum(n, x, incx)
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
integer(4) function cublasSasum_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
real(4), device :: res ! device or host variable
2.2.4. saxpy
SAXPY constant times a vector plus a vector.
subroutine saxpy(n, a, x, incx, y, incy)
integer :: n
real(4), device :: a ! device or host variable
real(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasSaxpy(n, a, x, incx, y, incy)
integer :: n
real(4), device :: a ! device or host variable
real(4), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasSaxpy_v2(h, n, a, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(4), device :: a ! device or host variable
real(4), device, dimension(*) :: x, y
integer :: incx, incy
2.2.5. scopy
SCOPY copies a vector, x, to a vector, y.
subroutine scopy(n, x, incx, y, incy)
integer :: n
real(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasScopy(n, x, incx, y, incy)
integer :: n
real(4), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasScopy_v2(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x, y
integer :: incx, incy
2.2.6. sdot
SDOT forms the dot product of two vectors.
real(4) function sdot(n, x, incx, y, incy)
integer :: n
real(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
real(4) function cublasSdot(n, x, incx, y, incy)
integer :: n
real(4), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasSdot_v2(h, n, x, incx, y, incy, res)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x, y
integer :: incx, incy
real(4), device :: res ! device or host variable
2.2.7. snrm2
SNRM2 returns the euclidean norm of a vector via the function name, so that SNRM2 := sqrt( x’*x ).
real(4) function snrm2(n, x, incx)
integer :: n
real(4), device, dimension(*) :: x ! device or host variable
integer :: incx
real(4) function cublasSnrm2(n, x, incx)
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
integer(4) function cublasSnrm2_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
real(4), device :: res ! device or host variable
2.2.8. srot
SROT applies a plane rotation.
subroutine srot(n, x, incx, y, incy, sc, ss)
integer :: n
real(4), device :: sc, ss ! device or host variable
real(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasSrot(n, x, incx, y, incy, sc, ss)
integer :: n
real(4), device :: sc, ss ! device or host variable
real(4), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasSrot_v2(h, n, x, incx, y, incy, sc, ss)
type(cublasHandle) :: h
integer :: n
real(4), device :: sc, ss ! device or host variable
real(4), device, dimension(*) :: x, y
integer :: incx, incy
2.2.9. srotg
SROTG constructs a Givens plane rotation.
subroutine srotg(sa, sb, sc, ss)
real(4), device :: sa, sb, sc, ss ! device or host variable
subroutine cublasSrotg(sa, sb, sc, ss)
real(4), device :: sa, sb, sc, ss ! device or host variable
integer(4) function cublasSrotg_v2(h, sa, sb, sc, ss)
type(cublasHandle) :: h
real(4), device :: sa, sb, sc, ss ! device or host variable
2.2.10. srotm
SROTM applies the modified Givens transformation, H, to the 2 by N matrix (SX**T) , where **T indicates transpose. The elements of SX are in (SX**T) SX(LX+I*INCX), I = 0 to N-1, where LX = 1 if INCX .GE. 0, ELSE LX = (-INCX)*N, and similarly for SY using LY and INCY. With SPARAM(1)=SFLAG, H has one of the following forms.. SFLAG=-1.E0 SFLAG=0.E0 SFLAG=1.E0 SFLAG=-2.E0 (SH11 SH12) (1.E0 SH12) (SH11 1.E0) (1.E0 0.E0) H=( ) ( ) ( ) ( ) (SH21 SH22), (SH21 1.E0), (-1.E0 SH22), (0.E0 1.E0). See SROTMG for a description of data storage in SPARAM.
subroutine srotm(n, x, incx, y, incy, param)
integer :: n
real(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasSrotm(n, x, incx, y, incy, param)
integer :: n
real(4), device, dimension(*) :: x, y
integer :: incx, incy
real(4), device :: param(*) ! device or host variable
integer(4) function cublasSrotm_v2(h, n, x, incx, y, incy, param)
type(cublasHandle) :: h
integer :: n
real(4), device :: param(*) ! device or host variable
real(4), device, dimension(*) :: x, y
integer :: incx, incy
2.2.11. srotmg
SROTMG constructs the modified Givens transformation matrix H which zeros the second component of the 2-vector (SQRT(SD1)*SX1,SQRT(SD2)*SY2)**T. With SPARAM(1)=SFLAG, H has one of the following forms..SFLAG=-1.E0 SFLAG=0.E0 SFLAG=1.E0 SFLAG=-2.E0 (SH11 SH12) (1.E0 SH12) (SH11 1.E0) (1.E0 0.E0) H=( ) ( ) ( ) ( ) (SH21 SH22), (SH21 1.E0), (-1.E0 SH22), (0.E0 1.E0).
Locations 2-4 of SPARAM contain SH11,SH21,SH12, and SH22 respectively. (Values of 1.E0, -1.E0, or 0.E0 implied by the value of SPARAM(1) are not stored in SPARAM.)
subroutine srotmg(d1, d2, x1, y1, param)
real(4), device :: d1, d2, x1, y1, param(*) ! device or host variable
subroutine cublasSrotmg(d1, d2, x1, y1, param)
real(4), device :: d1, d2, x1, y1, param(*) ! device or host variable
integer(4) function cublasSrotmg_v2(h, d1, d2, x1, y1, param)
type(cublasHandle) :: h
real(4), device :: d1, d2, x1, y1, param(*) ! device or host variable
2.2.12. sscal
SSCAL scales a vector by a constant.
subroutine sscal(n, a, x, incx)
integer :: n
real(4), device :: a ! device or host variable
real(4), device, dimension(*) :: x ! device or host variable
integer :: incx
subroutine cublasSscal(n, a, x, incx)
integer :: n
real(4), device :: a ! device or host variable
real(4), device, dimension(*) :: x
integer :: incx
integer(4) function cublasSscal_v2(h, n, a, x, incx)
type(cublasHandle) :: h
integer :: n
real(4), device :: a ! device or host variable
real(4), device, dimension(*) :: x
integer :: incx
2.2.13. sswap
SSWAP interchanges two vectors.
subroutine sswap(n, x, incx, y, incy)
integer :: n
real(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasSswap(n, x, incx, y, incy)
integer :: n
real(4), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasSswap_v2(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x, y
integer :: incx, incy
2.2.14. sgbmv
SGBMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m by n band matrix, with kl sub-diagonals and ku super-diagonals.
subroutine sgbmv(t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, kl, ku, lda, incx, incy
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x, y ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSgbmv(t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, kl, ku, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSgbmv_v2(h, t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, kl, ku, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
2.2.15. sgemv
SGEMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m by n matrix.
subroutine sgemv(t, m, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x, y ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSgemv(t, m, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSgemv_v2(h, t, m, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
2.2.16. sger
SGER performs the rank 1 operation A := alpha*x*y**T + A, where alpha is a scalar, x is an m element vector, y is an n element vector and A is an m by n matrix.
subroutine sger(m, n, alpha, x, incx, y, incy, a, lda)
integer :: m, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x, y ! device or host variable
real(4), device :: alpha ! device or host variable
subroutine cublasSger(m, n, alpha, x, incx, y, incy, a, lda)
integer :: m, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha ! device or host variable
integer(4) function cublasSger_v2(h, m, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: m, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha ! device or host variable
2.2.17. ssbmv
SSBMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n symmetric band matrix, with k super-diagonals.
subroutine ssbmv(t, n, k, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: k, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x, y ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSsbmv(t, n, k, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: k, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSsbmv_v2(h, t, n, k, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: k, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
2.2.18. sspmv
SSPMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n symmetric matrix, supplied in packed form.
subroutine sspmv(t, n, alpha, a, x, incx, beta, y, incy)
character*1 :: t
integer :: n, incx, incy
real(4), device, dimension(*) :: a, x, y ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSspmv(t, n, alpha, a, x, incx, beta, y, incy)
character*1 :: t
integer :: n, incx, incy
real(4), device, dimension(*) :: a, x, y
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSspmv_v2(h, t, n, alpha, a, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy
real(4), device, dimension(*) :: a, x, y
real(4), device :: alpha, beta ! device or host variable
2.2.19. sspr
SSPR performs the symmetric rank 1 operation A := alpha*x*x**T + A, where alpha is a real scalar, x is an n element vector and A is an n by n symmetric matrix, supplied in packed form.
subroutine sspr(t, n, alpha, x, incx, a)
character*1 :: t
integer :: n, incx
real(4), device, dimension(*) :: a, x ! device or host variable
real(4), device :: alpha ! device or host variable
subroutine cublasSspr(t, n, alpha, x, incx, a)
character*1 :: t
integer :: n, incx
real(4), device, dimension(*) :: a, x
real(4), device :: alpha ! device or host variable
integer(4) function cublasSspr_v2(h, t, n, alpha, x, incx, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx
real(4), device, dimension(*) :: a, x
real(4), device :: alpha ! device or host variable
2.2.20. sspr2
SSPR2 performs the symmetric rank 2 operation A := alpha*x*y**T + alpha*y*x**T + A, where alpha is a scalar, x and y are n element vectors and A is an n by n symmetric matrix, supplied in packed form.
subroutine sspr2(t, n, alpha, x, incx, y, incy, a)
character*1 :: t
integer :: n, incx, incy
real(4), device, dimension(*) :: a, x, y ! device or host variable
real(4), device :: alpha ! device or host variable
subroutine cublasSspr2(t, n, alpha, x, incx, y, incy, a)
character*1 :: t
integer :: n, incx, incy
real(4), device, dimension(*) :: a, x, y
real(4), device :: alpha ! device or host variable
integer(4) function cublasSspr2_v2(h, t, n, alpha, x, incx, y, incy, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy
real(4), device, dimension(*) :: a, x, y
real(4), device :: alpha ! device or host variable
2.2.21. ssymv
SSYMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n symmetric matrix.
subroutine ssymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, lda, incx, incy
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x, y ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSsymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSsymv_v2(h, uplo, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
2.2.22. ssyr
SSYR performs the symmetric rank 1 operation A := alpha*x*x**T + A, where alpha is a real scalar, x is an n element vector and A is an n by n symmetric matrix.
subroutine ssyr(t, n, alpha, x, incx, a, lda)
character*1 :: t
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x ! device or host variable
real(4), device :: alpha ! device or host variable
subroutine cublasSsyr(t, n, alpha, x, incx, a, lda)
character*1 :: t
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
real(4), device :: alpha ! device or host variable
integer(4) function cublasSsyr_v2(h, t, n, alpha, x, incx, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
real(4), device :: alpha ! device or host variable
2.2.23. ssyr2
SSYR2 performs the symmetric rank 2 operation A := alpha*x*y**T + alpha*y*x**T + A, where alpha is a scalar, x and y are n element vectors and A is an n by n symmetric matrix.
subroutine ssyr2(t, n, alpha, x, incx, y, incy, a, lda)
character*1 :: t
integer :: n, incx, incy, lda
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x, y ! device or host variable
real(4), device :: alpha ! device or host variable
subroutine cublasSsyr2(t, n, alpha, x, incx, y, incy, a, lda)
character*1 :: t
integer :: n, incx, incy, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha ! device or host variable
integer(4) function cublasSsyr2_v2(h, t, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha ! device or host variable
2.2.24. stbmv
STBMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular band matrix, with ( k + 1 ) diagonals.
subroutine stbmv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x ! device or host variable
subroutine cublasStbmv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
integer(4) function cublasStbmv_v2(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
2.2.25. stbsv
STBSV solves one of the systems of equations A*x = b, or A**T*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular band matrix, with ( k + 1 ) diagonals. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine stbsv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x ! device or host variable
subroutine cublasStbsv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
integer(4) function cublasStbsv_v2(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
2.2.26. stpmv
STPMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular matrix, supplied in packed form.
subroutine stpmv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
real(4), device, dimension(*) :: a, x ! device or host variable
subroutine cublasStpmv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
real(4), device, dimension(*) :: a, x
integer(4) function cublasStpmv_v2(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
real(4), device, dimension(*) :: a, x
2.2.27. stpsv
STPSV solves one of the systems of equations A*x = b, or A**T*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular matrix, supplied in packed form. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine stpsv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
real(4), device, dimension(*) :: a, x ! device or host variable
subroutine cublasStpsv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
real(4), device, dimension(*) :: a, x
integer(4) function cublasStpsv_v2(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
real(4), device, dimension(*) :: a, x
2.2.28. strmv
STRMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular matrix.
subroutine strmv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x ! device or host variable
subroutine cublasStrmv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
integer(4) function cublasStrmv_v2(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
2.2.29. strsv
STRSV solves one of the systems of equations A*x = b, or A**T*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular matrix. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine strsv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x ! device or host variable
subroutine cublasStrsv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
integer(4) function cublasStrsv_v2(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
2.2.30. sgemm
SGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
subroutine sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: transa, transb
integer :: m, n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(ldb, *) :: b ! device or host variable
real(4), device, dimension(ldc, *) :: c ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: transa, transb
integer :: m, n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSgemm_v2(h, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: transa, transb
integer :: m, n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
2.2.31. ssymm
SSYMM performs one of the matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is a symmetric matrix and B and C are m by n matrices.
subroutine ssymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: side, uplo
integer :: m, n, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(ldb, *) :: b ! device or host variable
real(4), device, dimension(ldc, *) :: c ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSsymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: side, uplo
integer :: m, n, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSsymm_v2(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo
integer :: m, n, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
2.2.32. ssyrk
SSYRK performs one of the symmetric rank k operations C := alpha*A*A**T + beta*C, or C := alpha*A**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A is an n by k matrix in the first case and a k by n matrix in the second case.
subroutine ssyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldc
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(ldc, *) :: c ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSsyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSsyrk_v2(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
2.2.33. ssyr2k
SSYR2K performs one of the symmetric rank 2k operations C := alpha*A*B**T + alpha*B*A**T + beta*C, or C := alpha*A**T*B + alpha*B**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.
subroutine ssyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(ldb, *) :: b ! device or host variable
real(4), device, dimension(ldc, *) :: c ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSsyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSsyr2k_v2(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
2.2.34. ssyrkx
SSYRKX performs a variation of the symmetric rank k update C := alpha*A*B**T + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix stored in lower or upper mode, and A and B are n by k matrices. This routine can be used when B is in such a way that the result is guaranteed to be symmetric. See the CUBLAS documentation for more details.
subroutine ssyrkx(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(ldb, *) :: b ! device or host variable
real(4), device, dimension(ldc, *) :: c ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSsyrkx(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSsyrkx_v2(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
2.2.35. strmm
STRMM performs one of the matrix-matrix operations B := alpha*op( A )*B, or B := alpha*B*op( A ), where alpha is a scalar, B is an m by n matrix, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T.
subroutine strmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(ldb, *) :: b ! device or host variable
real(4), device :: alpha ! device or host variable
subroutine cublasStrmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device :: alpha ! device or host variable
integer(4) function cublasStrmm_v2(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha ! device or host variable
2.2.36. strsm
STRSM solves one of the matrix equations op( A )*X = alpha*B, or X*op( A ) = alpha*B, where alpha is a scalar, X and B are m by n matrices, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T. The matrix X is overwritten on B.
subroutine strsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(ldb, *) :: b ! device or host variable
real(4), device :: alpha ! device or host variable
subroutine cublasStrsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device :: alpha ! device or host variable
integer(4) function cublasStrsm_v2(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device :: alpha ! device or host variable
2.2.37. cublasSgemvBatched
SGEMV performs a batch of the matrix-vector operations Y := alpha*op( A ) * X + beta*Y, where op( A ) is one of op( A ) = A or op( A ) = A**T, alpha and beta are scalars, A is an m by n matrix, and X and Y are vectors.
integer(4) function cublasSgemvBatched(h, trans, m, n, alpha, &
Aarray, lda, xarray, incx, beta, yarray, incy, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n
real(4), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: xarray(*)
integer :: incx
real(4), device :: beta ! device or host variable
type(c_devptr), device :: yarray(*)
integer :: incy
integer :: batchCount
integer(4) function cublasSgemvBatched_v2(h, trans, m, n, alpha, &
Aarray, lda, xarray, incx, beta, yarray, incy, batchCount)
type(cublasHandle) :: h
integer :: trans
integer :: m, n
real(4), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: xarray(*)
integer :: incx
real(4), device :: beta ! device or host variable
type(c_devptr), device :: yarray(*)
integer :: incy
integer :: batchCount
2.2.38. cublasSgemmBatched
SGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
integer(4) function cublasSgemmBatched(h, transa, transb, m, n, k, alpha, Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
type(cublasHandle) :: h
integer :: transa ! integer or character(1) variable
integer :: transb ! integer or character(1) variable
integer :: m, n, k
real(4), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Barray(*)
integer :: ldb
real(4), device :: beta ! device or host variable
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: batchCount
integer(4) function cublasSgemmBatched_v2(h, transa, transb, m, n, k, alpha, &
Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
type(cublasHandle) :: h
integer :: transa
integer :: transb
integer :: m, n, k
real(4), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Barray(*)
integer :: ldb
real(4), device :: beta ! device or host variable
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: batchCount
2.2.39. cublasSgelsBatched
SGELS solves overdetermined or underdetermined real linear systems involving an M-by-N matrix A, or its transpose, using a QR or LQ factorization of A. It is assumed that A has full rank. The following options are provided: 1. If TRANS = ‘N’ and m >= n: find the least squares solution of an overdetermined system, i.e., solve the least squares problem minimize || B - A*X ||. 2. If TRANS = ‘N’ and m < n: find the minimum norm solution of an underdetermined system A * X = B. 3. If TRANS = ‘T’ and m >= n: find the minimum norm solution of an undetermined system A**T * X = B. 4. If TRANS = ‘T’ and m < n: find the least squares solution of an overdetermined system, i.e., solve the least squares problem minimize || B - A**T * X ||. Several right hand side vectors b and solution vectors x can be handled in a single call; they are stored as the columns of the M-by-NRHS right hand side matrix B and the N-by-NRHS solution matrix X.
integer(4) function cublasSgelsBatched(h, trans, m, n, nrhs, Aarray, lda, Carray, ldc, info, devinfo, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n, nrhs
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: info(*)
integer, device :: devinfo(*)
integer :: batchCount
2.2.40. cublasSgeqrfBatched
SGEQRF computes a QR factorization of a real M-by-N matrix A: A = Q * R.
integer(4) function cublasSgeqrfBatched(h, m, n, Aarray, lda, Tau, info, batchCount)
type(cublasHandle) :: h
integer :: m, n
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Tau(*)
integer :: info(*)
integer :: batchCount
2.2.41. cublasSgetrfBatched
SGETRF computes an LU factorization of a general M-by-N matrix A using partial pivoting with row interchanges. The factorization has the form A = P * L * U where P is a permutation matrix, L is lower triangular with unit diagonal elements (lower trapezoidal if m > n), and U is upper triangular (upper trapezoidal if m < n). This is the right-looking Level 3 BLAS version of the algorithm.
integer(4) function cublasSgetrfBatched(h, n, Aarray, lda, ipvt, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
integer, device :: info(*)
integer :: batchCount
2.2.42. cublasSgetriBatched
SGETRI computes the inverse of a matrix using the LU factorization computed by SGETRF. This method inverts U and then computes inv(A) by solving the system inv(A)*L = inv(U) for inv(A).
integer(4) function cublasSgetriBatched(h, n, Aarray, lda, ipvt, Carray, ldc, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
type(c_devptr), device :: Carray(*)
integer :: ldc
integer, device :: info(*)
integer :: batchCount
2.2.43. cublasSgetrsBatched
SGETRS solves a system of linear equations A * X = B or A**T * X = B with a general N-by-N matrix A using the LU factorization computed by SGETRF.
integer(4) function cublasSgetrsBatched(h, trans, n, nrhs, Aarray, lda, ipvt, Barray, ldb, info, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: n, nrhs
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
type(c_devptr), device :: Barray(*)
integer :: ldb
integer :: info(*)
integer :: batchCount
2.2.44. cublasSmatinvBatched
cublasSmatinvBatched is a short cut of cublasSgetrfBatched plus cublasSgetriBatched. However it only works if n is less than 32. If not, the user has to go through cublasSgetrfBatched and cublasSgetriBatched.
integer(4) function cublasSmatinvBatched(h, n, Aarray, lda, Ainv, lda_inv, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Ainv(*)
integer :: lda_inv
integer, device :: info(*)
integer :: batchCount
2.2.45. cublasStrsmBatched
STRSM solves one of the matrix equations op( A )*X = alpha*B, or X*op( A ) = alpha*B, where alpha is a scalar, X and B are m by n matrices, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T. The matrix X is overwritten on B.
integer(4) function cublasStrsmBatched( h, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb, batchCount)
type(cublasHandle) :: h
integer :: side ! integer or character(1) variable
integer :: uplo ! integer or character(1) variable
integer :: trans ! integer or character(1) variable
integer :: diag ! integer or character(1) variable
integer :: m, n
real(4), device :: alpha ! device or host variable
type(c_devptr), device :: A(*)
integer :: lda
type(c_devptr), device :: B(*)
integer :: ldb
integer :: batchCount
integer(4) function cublasStrsmBatched_v2( h, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb, batchCount)
type(cublasHandle) :: h
integer :: side
integer :: uplo
integer :: trans
integer :: diag
integer :: m, n
real(4), device :: alpha ! device or host variable
type(c_devptr), device :: A(*)
integer :: lda
type(c_devptr), device :: B(*)
integer :: ldb
integer :: batchCount
2.2.46. cublasSgemvStridedBatched
SGEMV performs a batch of the matrix-vector operations Y := alpha*op( A ) * X + beta*Y, where op( A ) is one of op( A ) = A or op( A ) = A**T, alpha and beta are scalars, A is an m by n matrix, and X and Y are vectors.
integer(4) function cublasSgemvStridedBatched(h, trans, m, n, alpha, &
A, lda, strideA, X, incx, strideX, beta, Y, incy, strideY, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n
real(4), device :: alpha ! device or host variable
real(4), device :: A(lda,*)
integer :: lda
integer(8) :: strideA
real(4), device :: X(*)
integer :: incx
integer(8) :: strideX
real(4), device :: beta ! device or host variable
real(4), device :: Y(*)
integer :: incy
integer(8) :: strideY
integer :: batchCount
integer(4) function cublasSgemvStridedBatched_v2(h, trans, m, n, alpha, &
A, lda, strideA, X, incx, strideX, beta, Y, incy, strideY, batchCount)
type(cublasHandle) :: h
integer :: trans
integer :: m, n
real(4), device :: alpha ! device or host variable
real(4), device :: A(lda,*)
integer :: lda
integer(8) :: strideA
real(4), device :: X(*)
integer :: incx
integer(8) :: strideX
real(4), device :: beta ! device or host variable
real(4), device :: Y(*)
integer :: incy
integer(8) :: strideY
integer :: batchCount
2.2.47. cublasSgemmStridedBatched
SGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
integer(4) function cublasSgemmStridedBatched(h, transa, transb, m, n, k, alpha, Aarray, lda, strideA, Barray, ldb, strideB, beta, Carray, ldc, strideC, batchCount)
type(cublasHandle) :: h
integer :: transa ! integer or character(1) variable
integer :: transb ! integer or character(1) variable
integer :: m, n, k
real(4), device :: alpha ! device or host variable
real(4), device :: Aarray(*)
integer :: lda
integer :: strideA
real(4), device :: Barray(*)
integer :: ldb
integer :: strideB
real(4), device :: beta ! device or host variable
real(4), device :: Carray(*)
integer :: ldc
integer :: strideC
integer :: batchCount
integer(4) function cublasSgemmStridedBatched_v2(h, transa, transb, m, n, k, alpha, &
Aarray, lda, strideA, Barray, ldb, strideB, beta, Carray, ldc, strideC, batchCount)
type(cublasHandle) :: h
integer :: transa
integer :: transb
integer :: m, n, k
real(4), device :: alpha ! device or host variable
real(4), device :: Aarray(*)
integer :: lda
integer :: strideA
real(4), device :: Barray(*)
integer :: ldb
integer :: strideB
real(4), device :: beta ! device or host variable
real(4), device :: Carray(*)
integer :: ldc
integer :: strideC
integer :: batchCount
2.3. Double Precision Functions and Subroutines
This section contains interfaces to the double precision BLAS and cuBLAS functions and subroutines.
2.3.1. idamax
IDAMAX finds the the index of the element having the maximum absolute value.
integer(4) function idamax(n, x, incx)
integer :: n
real(8), device, dimension(*) :: x ! device or host variable
integer :: incx
integer(4) function cublasIdamax(n, x, incx)
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
integer(4) function cublasIdamax_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
2.3.2. idamin
IDAMIN finds the index of the element having the minimum absolute value.
integer(4) function idamin(n, x, incx)
integer :: n
real(8), device, dimension(*) :: x ! device or host variable
integer :: incx
integer(4) function cublasIdamin(n, x, incx)
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
integer(4) function cublasIdamin_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
2.3.3. dasum
DASUM takes the sum of the absolute values.
real(8) function dasum(n, x, incx)
integer :: n
real(8), device, dimension(*) :: x ! device or host variable
integer :: incx
real(8) function cublasDasum(n, x, incx)
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
integer(4) function cublasDasum_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
real(8), device :: res ! device or host variable
2.3.4. daxpy
DAXPY constant times a vector plus a vector.
subroutine daxpy(n, a, x, incx, y, incy)
integer :: n
real(8), device :: a ! device or host variable
real(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasDaxpy(n, a, x, incx, y, incy)
integer :: n
real(8), device :: a ! device or host variable
real(8), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasDaxpy_v2(h, n, a, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(8), device :: a ! device or host variable
real(8), device, dimension(*) :: x, y
integer :: incx, incy
2.3.5. dcopy
DCOPY copies a vector, x, to a vector, y.
subroutine dcopy(n, x, incx, y, incy)
integer :: n
real(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasDcopy(n, x, incx, y, incy)
integer :: n
real(8), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasDcopy_v2(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x, y
integer :: incx, incy
2.3.6. ddot
DDOT forms the dot product of two vectors.
real(8) function ddot(n, x, incx, y, incy)
integer :: n
real(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
real(8) function cublasDdot(n, x, incx, y, incy)
integer :: n
real(8), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasDdot_v2(h, n, x, incx, y, incy, res)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x, y
integer :: incx, incy
real(8), device :: res ! device or host variable
2.3.7. dnrm2
DNRM2 returns the euclidean norm of a vector via the function name, so that DNRM2 := sqrt( x’*x )
real(8) function dnrm2(n, x, incx)
integer :: n
real(8), device, dimension(*) :: x ! device or host variable
integer :: incx
real(8) function cublasDnrm2(n, x, incx)
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
integer(4) function cublasDnrm2_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
real(8), device :: res ! device or host variable
2.3.8. drot
DROT applies a plane rotation.
subroutine drot(n, x, incx, y, incy, sc, ss)
integer :: n
real(8), device :: sc, ss ! device or host variable
real(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasDrot(n, x, incx, y, incy, sc, ss)
integer :: n
real(8), device :: sc, ss ! device or host variable
real(8), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasDrot_v2(h, n, x, incx, y, incy, sc, ss)
type(cublasHandle) :: h
integer :: n
real(8), device :: sc, ss ! device or host variable
real(8), device, dimension(*) :: x, y
integer :: incx, incy
2.3.9. drotg
DROTG constructs a Givens plane rotation.
subroutine drotg(sa, sb, sc, ss)
real(8), device :: sa, sb, sc, ss ! device or host variable
subroutine cublasDrotg(sa, sb, sc, ss)
real(8), device :: sa, sb, sc, ss ! device or host variable
integer(4) function cublasDrotg_v2(h, sa, sb, sc, ss)
type(cublasHandle) :: h
real(8), device :: sa, sb, sc, ss ! device or host variable
2.3.10. drotm
DROTM applies the modified Givens transformation, H, to the 2 by N matrix (DX**T) , where **T indicates transpose. The elements of DX are in (DX**T) DX(LX+I*INCX), I = 0 to N-1, where LX = 1 if INCX .GE. 0, ELSE LX = (-INCX)*N, and similarly for DY using LY and INCY. With DPARAM(1)=DFLAG, H has one of the following forms.. DFLAG=-1.D0 DFLAG=0.D0 DFLAG=1.D0 DFLAG=-2.D0 (DH11 DH12) (1.D0 DH12) (DH11 1.D0) (1.D0 0.D0) H=( ) ( ) ( ) ( ) (DH21 DH22), (DH21 1.D0), (-1.D0 DH22), (0.D0 1.D0). See DROTMG for a description of data storage in DPARAM.
subroutine drotm(n, x, incx, y, incy, param)
integer :: n
real(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasDrotm(n, x, incx, y, incy, param)
integer :: n
real(8), device, dimension(*) :: x, y
integer :: incx, incy
real(8), device :: param(*) ! device or host variable
integer(4) function cublasDrotm_v2(h, n, x, incx, y, incy, param)
type(cublasHandle) :: h
integer :: n
real(8), device :: param(*) ! device or host variable
real(8), device, dimension(*) :: x, y
integer :: incx, incy
2.3.11. drotmg
DROTMG constructs the modified Givens transformation matrix H which zeros the second component of the 2-vector (SQRT(DD1)*DX1,SQRT(DD2)*DY2)**T. With DPARAM(1)=DFLAG, H has one of the following forms.. DFLAG=-1.D0 DFLAG=0.D0 DFLAG=1.D0 DFLAG=-2.D0 (DH11 DH12) (1.D0 DH12) (DH11 1.D0) (1.D0 0.D0) H=( ) ( ) ( ) ( ) (DH21 DH22), (DH21 1.D0), (-1.D0 DH22), (0.D0 1.D0). Locations 2-4 of DPARAM contain DH11, DH21, DH12, and DH22 respectively. (Values of 1.D0, -1.D0, of 0.D0 implied by the value of DPARAM(1) are not stored in DPARAM.)
subroutine drotmg(d1, d2, x1, y1, param)
real(8), device :: d1, d2, x1, y1, param(*) ! device or host variable
subroutine cublasDrotmg(d1, d2, x1, y1, param)
real(8), device :: d1, d2, x1, y1, param(*) ! device or host variable
integer(4) function cublasDrotmg_v2(h, d1, d2, x1, y1, param)
type(cublasHandle) :: h
real(8), device :: d1, d2, x1, y1, param(*) ! device or host variable
2.3.12. dscal
DSCAL scales a vector by a constant.
subroutine dscal(n, a, x, incx)
integer :: n
real(8), device :: a ! device or host variable
real(8), device, dimension(*) :: x ! device or host variable
integer :: incx
subroutine cublasDscal(n, a, x, incx)
integer :: n
real(8), device :: a ! device or host variable
real(8), device, dimension(*) :: x
integer :: incx
integer(4) function cublasDscal_v2(h, n, a, x, incx)
type(cublasHandle) :: h
integer :: n
real(8), device :: a ! device or host variable
real(8), device, dimension(*) :: x
integer :: incx
2.3.13. dswap
interchanges two vectors.
subroutine dswap(n, x, incx, y, incy)
integer :: n
real(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasDswap(n, x, incx, y, incy)
integer :: n
real(8), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasDswap_v2(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x, y
integer :: incx, incy
2.3.14. dgbmv
DGBMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m by n band matrix, with kl sub-diagonals and ku super-diagonals.
subroutine dgbmv(t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, kl, ku, lda, incx, incy
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x, y ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDgbmv(t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, kl, ku, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDgbmv_v2(h, t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, kl, ku, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
2.3.15. dgemv
DGEMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m by n matrix.
subroutine dgemv(t, m, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x, y ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDgemv(t, m, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDgemv_v2(h, t, m, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
2.3.16. dger
DGER performs the rank 1 operation A := alpha*x*y**T + A, where alpha is a scalar, x is an m element vector, y is an n element vector and A is an m by n matrix.
subroutine dger(m, n, alpha, x, incx, y, incy, a, lda)
integer :: m, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x, y ! device or host variable
real(8), device :: alpha ! device or host variable
subroutine cublasDger(m, n, alpha, x, incx, y, incy, a, lda)
integer :: m, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha ! device or host variable
integer(4) function cublasDger_v2(h, m, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: m, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha ! device or host variable
2.3.17. dsbmv
DSBMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n symmetric band matrix, with k super-diagonals.
subroutine dsbmv(t, n, k, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: k, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x, y ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDsbmv(t, n, k, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: k, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDsbmv_v2(h, t, n, k, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: k, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
2.3.18. dspmv
DSPMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n symmetric matrix, supplied in packed form.
subroutine dspmv(t, n, alpha, a, x, incx, beta, y, incy)
character*1 :: t
integer :: n, incx, incy
real(8), device, dimension(*) :: a, x, y ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDspmv(t, n, alpha, a, x, incx, beta, y, incy)
character*1 :: t
integer :: n, incx, incy
real(8), device, dimension(*) :: a, x, y
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDspmv_v2(h, t, n, alpha, a, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy
real(8), device, dimension(*) :: a, x, y
real(8), device :: alpha, beta ! device or host variable
2.3.19. dspr
DSPR performs the symmetric rank 1 operation A := alpha*x*x**T + A, where alpha is a real scalar, x is an n element vector and A is an n by n symmetric matrix, supplied in packed form.
subroutine dspr(t, n, alpha, x, incx, a)
character*1 :: t
integer :: n, incx
real(8), device, dimension(*) :: a, x ! device or host variable
real(8), device :: alpha ! device or host variable
subroutine cublasDspr(t, n, alpha, x, incx, a)
character*1 :: t
integer :: n, incx
real(8), device, dimension(*) :: a, x
real(8), device :: alpha ! device or host variable
integer(4) function cublasDspr_v2(h, t, n, alpha, x, incx, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx
real(8), device, dimension(*) :: a, x
real(8), device :: alpha ! device or host variable
2.3.20. dspr2
DSPR2 performs the symmetric rank 2 operation A := alpha*x*y**T + alpha*y*x**T + A, where alpha is a scalar, x and y are n element vectors and A is an n by n symmetric matrix, supplied in packed form.
subroutine dspr2(t, n, alpha, x, incx, y, incy, a)
character*1 :: t
integer :: n, incx, incy
real(8), device, dimension(*) :: a, x, y ! device or host variable
real(8), device :: alpha ! device or host variable
subroutine cublasDspr2(t, n, alpha, x, incx, y, incy, a)
character*1 :: t
integer :: n, incx, incy
real(8), device, dimension(*) :: a, x, y
real(8), device :: alpha ! device or host variable
integer(4) function cublasDspr2_v2(h, t, n, alpha, x, incx, y, incy, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy
real(8), device, dimension(*) :: a, x, y
real(8), device :: alpha ! device or host variable
2.3.21. dsymv
DSYMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n symmetric matrix.
subroutine dsymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, lda, incx, incy
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x, y ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDsymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDsymv_v2(h, uplo, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
2.3.22. dsyr
DSYR performs the symmetric rank 1 operation A := alpha*x*x**T + A, where alpha is a real scalar, x is an n element vector and A is an n by n symmetric matrix.
subroutine dsyr(t, n, alpha, x, incx, a, lda)
character*1 :: t
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x ! device or host variable
real(8), device :: alpha ! device or host variable
subroutine cublasDsyr(t, n, alpha, x, incx, a, lda)
character*1 :: t
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
real(8), device :: alpha ! device or host variable
integer(4) function cublasDsyr_v2(h, t, n, alpha, x, incx, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
real(8), device :: alpha ! device or host variable
2.3.23. dsyr2
DSYR2 performs the symmetric rank 2 operation A := alpha*x*y**T + alpha*y*x**T + A, where alpha is a scalar, x and y are n element vectors and A is an n by n symmetric matrix.
subroutine dsyr2(t, n, alpha, x, incx, y, incy, a, lda)
character*1 :: t
integer :: n, incx, incy, lda
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x, y ! device or host variable
real(8), device :: alpha ! device or host variable
subroutine cublasDsyr2(t, n, alpha, x, incx, y, incy, a, lda)
character*1 :: t
integer :: n, incx, incy, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha ! device or host variable
integer(4) function cublasDsyr2_v2(h, t, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha ! device or host variable
2.3.24. dtbmv
DTBMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular band matrix, with ( k + 1 ) diagonals.
subroutine dtbmv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x ! device or host variable
subroutine cublasDtbmv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
integer(4) function cublasDtbmv_v2(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
2.3.25. dtbsv
DTBSV solves one of the systems of equations A*x = b, or A**T*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular band matrix, with ( k + 1 ) diagonals. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine dtbsv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x ! device or host variable
subroutine cublasDtbsv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
integer(4) function cublasDtbsv_v2(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
2.3.26. dtpmv
DTPMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular matrix, supplied in packed form.
subroutine dtpmv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
real(8), device, dimension(*) :: a, x ! device or host variable
subroutine cublasDtpmv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
real(8), device, dimension(*) :: a, x
integer(4) function cublasDtpmv_v2(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
real(8), device, dimension(*) :: a, x
2.3.27. dtpsv
DTPSV solves one of the systems of equations A*x = b, or A**T*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular matrix, supplied in packed form. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine dtpsv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
real(8), device, dimension(*) :: a, x ! device or host variable
subroutine cublasDtpsv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
real(8), device, dimension(*) :: a, x
integer(4) function cublasDtpsv_v2(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
real(8), device, dimension(*) :: a, x
2.3.28. dtrmv
DTRMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular matrix.
subroutine dtrmv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x ! device or host variable
subroutine cublasDtrmv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
integer(4) function cublasDtrmv_v2(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
2.3.29. dtrsv
DTRSV solves one of the systems of equations A*x = b, or A**T*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular matrix. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine dtrsv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x ! device or host variable
subroutine cublasDtrsv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
integer(4) function cublasDtrsv_v2(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
2.3.30. dgemm
DGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
subroutine dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: transa, transb
integer :: m, n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(ldb, *) :: b ! device or host variable
real(8), device, dimension(ldc, *) :: c ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: transa, transb
integer :: m, n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDgemm_v2(h, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: transa, transb
integer :: m, n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
2.3.31. dsymm
DSYMM performs one of the matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is a symmetric matrix and B and C are m by n matrices.
subroutine dsymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: side, uplo
integer :: m, n, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(ldb, *) :: b ! device or host variable
real(8), device, dimension(ldc, *) :: c ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDsymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: side, uplo
integer :: m, n, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDsymm_v2(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo
integer :: m, n, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
2.3.32. dsyrk
DSYRK performs one of the symmetric rank k operations C := alpha*A*A**T + beta*C, or C := alpha*A**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A is an n by k matrix in the first case and a k by n matrix in the second case.
subroutine dsyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldc
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(ldc, *) :: c ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDsyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDsyrk_v2(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
2.3.33. dsyr2k
DSYR2K performs one of the symmetric rank 2k operations C := alpha*A*B**T + alpha*B*A**T + beta*C, or C := alpha*A**T*B + alpha*B**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.
subroutine dsyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(ldb, *) :: b ! device or host variable
real(8), device, dimension(ldc, *) :: c ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDsyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDsyr2k_v2(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
2.3.34. dsyrkx
DSYRKX performs a variation of the symmetric rank k update C := alpha*A*B**T + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix stored in lower or upper mode, and A and B are n by k matrices. This routine can be used when B is in such a way that the result is guaranteed to be symmetric. See the CUBLAS documentation for more details.
subroutine dsyrkx(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(ldb, *) :: b ! device or host variable
real(8), device, dimension(ldc, *) :: c ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDsyrkx(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDsyrkx_v2(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
2.3.35. dtrmm
DTRMM performs one of the matrix-matrix operations B := alpha*op( A )*B, or B := alpha*B*op( A ), where alpha is a scalar, B is an m by n matrix, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T.
subroutine dtrmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(ldb, *) :: b ! device or host variable
real(8), device :: alpha ! device or host variable
subroutine cublasDtrmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device :: alpha ! device or host variable
integer(4) function cublasDtrmm_v2(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha ! device or host variable
2.3.36. dtrsm
DTRSM solves one of the matrix equations op( A )*X = alpha*B, or X*op( A ) = alpha*B, where alpha is a scalar, X and B are m by n matrices, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T. The matrix X is overwritten on B.
subroutine dtrsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(ldb, *) :: b ! device or host variable
real(8), device :: alpha ! device or host variable
subroutine cublasDtrsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device :: alpha ! device or host variable
integer(4) function cublasDtrsm_v2(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device :: alpha ! device or host variable
2.3.37. cublasDgemvBatched
DGEMV performs a batch of the matrix-vector operations Y := alpha*op( A ) * X + beta*Y, where op( A ) is one of op( A ) = A or op( A ) = A**T, alpha and beta are scalars, A is an m by n matrix, and X and Y are vectors.
integer(4) function cublasDgemvBatched(h, trans, m, n, alpha, &
Aarray, lda, xarray, incx, beta, yarray, incy, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n
real(8), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: xarray(*)
integer :: incx
real(8), device :: beta ! device or host variable
type(c_devptr), device :: yarray(*)
integer :: incy
integer :: batchCount
integer(4) function cublasDgemvBatched_v2(h, trans, m, n, alpha, &
Aarray, lda, xarray, incx, beta, yarray, incy, batchCount)
type(cublasHandle) :: h
integer :: trans
integer :: m, n
real(8), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: xarray(*)
integer :: incx
real(8), device :: beta ! device or host variable
type(c_devptr), device :: yarray(*)
integer :: incy
integer :: batchCount
2.3.38. cublasDgemmBatched
DGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
integer(4) function cublasDgemmBatched(h, transa, transb, m, n, k, alpha, Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
type(cublasHandle) :: h
integer :: transa ! integer or character(1) variable
integer :: transb ! integer or character(1) variable
integer :: m, n, k
real(8), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Barray(*)
integer :: ldb
real(8), device :: beta ! device or host variable
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: batchCount
integer(4) function cublasDgemmBatched_v2(h, transa, transb, m, n, k, alpha, &
Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
type(cublasHandle) :: h
integer :: transa
integer :: transb
integer :: m, n, k
real(8), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Barray(*)
integer :: ldb
real(8), device :: beta ! device or host variable
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: batchCount
2.3.39. cublasDgelsBatched
DGELS solves overdetermined or underdetermined real linear systems involving an M-by-N matrix A, or its transpose, using a QR or LQ factorization of A. It is assumed that A has full rank. The following options are provided: 1. If TRANS = ‘N’ and m >= n: find the least squares solution of an overdetermined system, i.e., solve the least squares problem minimize || B - A*X ||. 2. If TRANS = ‘N’ and m < n: find the minimum norm solution of an underdetermined system A * X = B. 3. If TRANS = ‘T’ and m >= n: find the minimum norm solution of an undetermined system A**T * X = B. 4. If TRANS = ‘T’ and m < n: find the least squares solution of an overdetermined system, i.e., solve the least squares problem minimize || B - A**T * X ||. Several right hand side vectors b and solution vectors x can be handled in a single call; they are stored as the columns of the M-by-NRHS right hand side matrix B and the N-by-NRHS solution matrix X.
integer(4) function cublasDgelsBatched(h, trans, m, n, nrhs, Aarray, lda, Carray, ldc, info, devinfo, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n, nrhs
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: info(*)
integer, device :: devinfo(*)
integer :: batchCount
2.3.40. cublasDgeqrfBatched
DGEQRF computes a QR factorization of a real M-by-N matrix A: A = Q * R.
integer(4) function cublasDgeqrfBatched(h, m, n, Aarray, lda, Tau, info, batchCount)
type(cublasHandle) :: h
integer :: m, n
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Tau(*)
integer :: info(*)
integer :: batchCount
2.3.41. cublasDgetrfBatched
DGETRF computes an LU factorization of a general M-by-N matrix A using partial pivoting with row interchanges. The factorization has the form A = P * L * U where P is a permutation matrix, L is lower triangular with unit diagonal elements (lower trapezoidal if m > n), and U is upper triangular (upper trapezoidal if m < n). This is the right-looking Level 3 BLAS version of the algorithm.
integer(4) function cublasDgetrfBatched(h, n, Aarray, lda, ipvt, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
integer, device :: info(*)
integer :: batchCount
2.3.42. cublasDgetriBatched
DGETRI computes the inverse of a matrix using the LU factorization computed by DGETRF. This method inverts U and then computes inv(A) by solving the system inv(A)*L = inv(U) for inv(A).
integer(4) function cublasDgetriBatched(h, n, Aarray, lda, ipvt, Carray, ldc, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
type(c_devptr), device :: Carray(*)
integer :: ldc
integer, device :: info(*)
integer :: batchCount
2.3.43. cublasDgetrsBatched
DGETRS solves a system of linear equations A * X = B or A**T * X = B with a general N-by-N matrix A using the LU factorization computed by DGETRF.
integer(4) function cublasDgetrsBatched(h, trans, n, nrhs, Aarray, lda, ipvt, Barray, ldb, info, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: n, nrhs
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
type(c_devptr), device :: Barray(*)
integer :: ldb
integer :: info(*)
integer :: batchCount
2.3.44. cublasDmatinvBatched
cublasDmatinvBatched is a short cut of cublasDgetrfBatched plus cublasDgetriBatched. However it only works if n is less than 32. If not, the user has to go through cublasDgetrfBatched and cublasDgetriBatched.
integer(4) function cublasDmatinvBatched(h, n, Aarray, lda, Ainv, lda_inv, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Ainv(*)
integer :: lda_inv
integer, device :: info(*)
integer :: batchCount
2.3.45. cublasDtrsmBatched
DTRSM solves one of the matrix equations op( A )*X = alpha*B, or X*op( A ) = alpha*B, where alpha is a scalar, X and B are m by n matrices, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T. The matrix X is overwritten on B.
integer(4) function cublasDtrsmBatched( h, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb, batchCount)
type(cublasHandle) :: h
integer :: side ! integer or character(1) variable
integer :: uplo ! integer or character(1) variable
integer :: trans ! integer or character(1) variable
integer :: diag ! integer or character(1) variable
integer :: m, n
real(8), device :: alpha ! device or host variable
type(c_devptr), device :: A(*)
integer :: lda
type(c_devptr), device :: B(*)
integer :: ldb
integer :: batchCount
integer(4) function cublasDtrsmBatched_v2( h, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb, batchCount)
type(cublasHandle) :: h
integer :: side
integer :: uplo
integer :: trans
integer :: diag
integer :: m, n
real(8), device :: alpha ! device or host variable
type(c_devptr), device :: A(*)
integer :: lda
type(c_devptr), device :: B(*)
integer :: ldb
integer :: batchCount
2.3.46. cublasDgemvStridedBatched
DGEMV performs a batch of the matrix-vector operations Y := alpha*op( A ) * X + beta*Y, where op( A ) is one of op( A ) = A or op( A ) = A**T, alpha and beta are scalars, A is an m by n matrix, and X and Y are vectors.
integer(4) function cublasDgemvStridedBatched(h, trans, m, n, alpha, &
A, lda, strideA, X, incx, strideX, beta, Y, incy, strideY, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n
real(8), device :: alpha ! device or host variable
real(8), device :: A(lda,*)
integer :: lda
integer(8) :: strideA
real(8), device :: X(*)
integer ::