CUBLAS - CUDA Toolkit v5.5 - Last updated July 19, 2013

## 1. Introduction

The CUBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA® CUDA™ runtime. It allows the user to access the computational resources of the NVIDIA Graphics Processing Unit (GPU), but it does not auto-parallelize across multiple GPUs.

To use the CUBLAS library, the application must allocate the required matrices and vectors in the GPU memory space, fill them with data, call the sequence of desired CUBLAS functions, and then copy the results from the GPU memory space back to the host. The CUBLAS library also provides helper functions for writing data to and retrieving data from the GPU.

### 1.1. Data layout

For maximum compatibility with existing Fortran environments, the CUBLAS library uses column-major storage and 1-based indexing. Since C and C++ use row-major storage, applications written in these languages cannot use the native array semantics for two-dimensional arrays. Instead, macros or inline functions should be defined to implement matrices on top of one-dimensional arrays. For Fortran code ported to C in mechanical fashion, one may choose to retain 1-based indexing to avoid the need to transform loops. In this case, the array index of a matrix element in row “i” and column “j” can be computed via the following macro

`#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))`

Here, ld refers to the leading dimension of the matrix, which in the case of column-major storage is the number of rows of the allocated matrix (even if only a submatrix of it is being used). For natively written C and C++ code, one would most likely choose 0-based indexing, in which case the array index of a matrix element in row “i” and column “j” can be computed via the following macro

`#define IDX2C(i,j,ld) (((j)*(ld))+(i))`

### 1.2. New and Legacy CUBLAS API

Starting with version 4.0, the CUBLAS Library provides a new updated API, in addition to the existing legacy API. This section discusses why a new API is provided, the advantages of using it, and its differences from the existing legacy API.

The new CUBLAS library API can be used by including the header file “cublas_v2.h”. It has the following features that the legacy CUBLAS API does not have:

• the handle to the CUBLAS library context is initialized using the function cublasCreate() and is explicitly passed to every subsequent library function call. This allows the user to have more control over the library setup when using multiple host threads and multiple GPUs. This also allows the CUBLAS APIs to be reentrant.
• the scalars $\alpha$ and $\beta$ can be passed by reference on the host or the device, instead of only being allowed to be passed by value on the host. This change allows library functions to execute asynchronously using streams even when $\alpha$ and $\beta$ are generated by a previous kernel.
• when a library routine returns a scalar result, it can be returned by reference on the host or the device, instead of only by value on the host. This change allows library routines to be called asynchronously when the scalar result is generated and returned by reference on the device, resulting in maximum parallelism.
• the error status cublasStatus_t is returned by all CUBLAS library function calls. This change facilitates debugging and simplifies software development. Note that cublasStatus was renamed cublasStatus_t to be more consistent with other types in the CUBLAS library.
• the cublasAlloc() and cublasFree() functions have been deprecated. This change removes these unnecessary wrappers around cudaMalloc() and cudaFree(), respectively.
• the function cublasSetKernelStream() was renamed cublasSetStream() to be more consistent with the other CUDA libraries.

The legacy CUBLAS API, explained in more detail in Appendix A, can be used by including the header file “cublas.h”. Since the legacy API is identical to the previously released CUBLAS library API, existing applications will work out of the box and automatically use this legacy API without any source code changes. In general, new applications should not use the legacy CUBLAS API, and existing applications should convert to using the new API if they require sophisticated and optimal stream parallelism or call CUBLAS routines concurrently from multiple threads. For the rest of the document, the new CUBLAS Library API will simply be referred to as the CUBLAS Library API.

As mentioned earlier, the interfaces to the legacy and the new CUBLAS library APIs are the header files “cublas.h” and “cublas_v2.h”, respectively. In addition, applications using the CUBLAS library need to link against the DSO cublas.so (Linux), the DLL cublas.dll (Windows), or the dynamic library cublas.dylib (Mac OS X). Note: the same dynamic library implements both the new and legacy CUBLAS APIs.

### 1.3. Example code

For sample code references please see the two examples below. They show an application written in C using the CUBLAS library API with two indexing styles (Example 1. "Application Using C and CUBLAS: 1-based indexing" and Example 2. "Application Using C and CUBLAS: 0-based Indexing").

```
//Example 1. Application Using C and CUBLAS: 1-based indexing
//-----------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"
#define M 6
#define N 5
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

/* Scales the remainder of row p by alpha and the remainder of column q
   by beta, starting at element (p,q) of the n-column matrix m. */
static __inline__ void modify (cublasHandle_t handle, float *m, int ldm,
                               int n, int p, int q, float alpha, float beta){
    cublasSscal (handle, n-q+1, &alpha, &m[IDX2F(p,q,ldm)], ldm);
    cublasSscal (handle, ldm-p+1, &beta, &m[IDX2F(p,q,ldm)], 1);
}

int main (void){
    cudaError_t cudaStat;
    cublasStatus_t stat;
    cublasHandle_t handle;
    int i, j;
    float *devPtrA;
    float *a = 0;
    a = (float *)malloc (M * N * sizeof (*a));
    if (!a) {
        printf ("host memory allocation failed\n");
        return EXIT_FAILURE;
    }
    for (j = 1; j <= N; j++) {
        for (i = 1; i <= M; i++) {
            a[IDX2F(i,j,M)] = (float)((i-1) * M + j);
        }
    }
    cudaStat = cudaMalloc ((void**)&devPtrA, M*N*sizeof(*a));
    if (cudaStat != cudaSuccess) {
        printf ("device memory allocation failed\n");
        free (a);
        return EXIT_FAILURE;
    }
    stat = cublasCreate(&handle);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("CUBLAS initialization failed\n");
        cudaFree (devPtrA);
        free (a);
        return EXIT_FAILURE;
    }
    stat = cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data upload failed\n");
        cudaFree (devPtrA);
        cublasDestroy (handle);
        free (a);
        return EXIT_FAILURE;
    }
    modify (handle, devPtrA, M, N, 2, 3, 16.0f, 12.0f);
    stat = cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data download failed\n");
        cudaFree (devPtrA);
        cublasDestroy (handle);
        free (a);
        return EXIT_FAILURE;
    }
    cudaFree (devPtrA);
    cublasDestroy (handle);
    for (j = 1; j <= N; j++) {
        for (i = 1; i <= M; i++) {
            printf ("%7.0f", a[IDX2F(i,j,M)]);
        }
        printf ("\n");
    }
    free (a);
    return EXIT_SUCCESS;
}
```
```
//Example 2. Application Using C and CUBLAS: 0-based indexing
//-----------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"
#define M 6
#define N 5
#define IDX2C(i,j,ld) (((j)*(ld))+(i))

/* Scales the remainder of row p by alpha and the remainder of column q
   by beta, starting at element (p,q) of the n-column matrix m. */
static __inline__ void modify (cublasHandle_t handle, float *m, int ldm,
                               int n, int p, int q, float alpha, float beta){
    cublasSscal (handle, n-q, &alpha, &m[IDX2C(p,q,ldm)], ldm);
    cublasSscal (handle, ldm-p, &beta, &m[IDX2C(p,q,ldm)], 1);
}

int main (void){
    cudaError_t cudaStat;
    cublasStatus_t stat;
    cublasHandle_t handle;
    int i, j;
    float *devPtrA;
    float *a = 0;
    a = (float *)malloc (M * N * sizeof (*a));
    if (!a) {
        printf ("host memory allocation failed\n");
        return EXIT_FAILURE;
    }
    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            a[IDX2C(i,j,M)] = (float)(i * M + j + 1);
        }
    }
    cudaStat = cudaMalloc ((void**)&devPtrA, M*N*sizeof(*a));
    if (cudaStat != cudaSuccess) {
        printf ("device memory allocation failed\n");
        free (a);
        return EXIT_FAILURE;
    }
    stat = cublasCreate(&handle);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("CUBLAS initialization failed\n");
        cudaFree (devPtrA);
        free (a);
        return EXIT_FAILURE;
    }
    stat = cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data upload failed\n");
        cudaFree (devPtrA);
        cublasDestroy (handle);
        free (a);
        return EXIT_FAILURE;
    }
    modify (handle, devPtrA, M, N, 1, 2, 16.0f, 12.0f);
    stat = cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data download failed\n");
        cudaFree (devPtrA);
        cublasDestroy (handle);
        free (a);
        return EXIT_FAILURE;
    }
    cudaFree (devPtrA);
    cublasDestroy (handle);
    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            printf ("%7.0f", a[IDX2C(i,j,M)]);
        }
        printf ("\n");
    }
    free (a);
    return EXIT_SUCCESS;
}
```

## 2. Using the CUBLAS API

This section describes how to use the CUBLAS library API. It does not contain a detailed reference for all API datatypes and functions; those are provided in subsequent chapters. The legacy CUBLAS API is also not covered in this section; it is handled in Appendix A.

### 2.1. Error status

All CUBLAS library function calls return the error status cublasStatus_t.

### 2.2. CUBLAS context

The application must initialize the handle to the CUBLAS library context by calling the cublasCreate() function. Then, the handle is explicitly passed to every subsequent library function call. Once the application finishes using the library, it must call the function cublasDestroy() to release the resources associated with the CUBLAS library context.

This approach allows the user to explicitly control the library setup when using multiple host threads and multiple GPUs. For example, the application can use cudaSetDevice() to associate different devices with different host threads, and in each of those host threads it can initialize a unique handle to the CUBLAS library context, which will use the particular device associated with that host thread. Then, the CUBLAS library function calls made with different handles will automatically dispatch the computation to different devices.

The device associated with a particular CUBLAS context is assumed to remain unchanged between the corresponding cublasCreate() and cublasDestroy() calls. In order for the CUBLAS library to use a different device in the same host thread, the application must set the new device to be used by calling cudaSetDevice() and then create another CUBLAS context, which will be associated with the new device, by calling cublasCreate().

### 2.3. Thread Safety

The library is thread safe and its functions can be called from multiple host threads, even with the same handle.

### 2.4. Scalar Parameters

In the CUBLAS API, the scalar parameters $\alpha$ and $\beta$ can be passed by reference on the host or the device, instead of only by value on the host.

Also, the few functions that return a scalar result, such as amax(), amin(), asum(), rotg(), rotmg(), dot() and nrm2(), return the resulting value by reference on the host or the device. Notice that even though these functions return immediately, the scalar result, like matrix and vector results, is ready only when execution of the routine on the GPU completes. This requires proper synchronization before reading the result from the host.

These changes allow the library functions to execute completely asynchronously using streams even when $\alpha$ and $\beta$ are generated by a previous kernel. For example, this situation can arise when iterative methods for solution of linear systems and eigenvalue problems are implemented using the CUBLAS library.

### 2.5. Parallelism with Streams

If the application uses the results computed by multiple independent tasks, CUDA™ streams can be used to overlap the computation performed in these tasks.

The application can conceptually associate a stream with each task. In order to achieve the overlap of computation between the tasks, the user should create CUDA™ streams using the function cudaStreamCreate() and set the stream to be used by each individual CUBLAS library routine by calling cublasSetStream() just before calling the actual CUBLAS routine. Then, the computation performed in separate streams will be overlapped automatically when possible on the GPU. This approach is especially useful when the computation performed by a single task is relatively small and is not enough to fill the GPU with work.

We recommend using the new CUBLAS API with scalar parameters and results passed by reference in the device memory to achieve maximum overlap of the computation when using streams.

A particular application of streams, batching of multiple small kernels, is described below.

### 2.6. Batching Kernels

In this section we will explain how to use streams to batch the execution of small kernels. For instance, suppose that we have an application where we need to make many small independent matrix-matrix multiplications with dense matrices.

It is clear that even with millions of small independent matrices we will not be able to achieve the same GFLOPS rate as with one large matrix. For example, a single large $n×n$ matrix-matrix multiplication performs ${n}^{3}$ operations on ${n}^{2}$ input data, while 1024 small $\frac{n}{32}×\frac{n}{32}$ matrix-matrix multiplications perform $1024{\left(\frac{n}{32}\right)}^{3}=\frac{{n}^{3}}{32}$ operations on the same amount of input data. However, it is also clear that we can achieve significantly better performance with many small independent matrices than with a single small matrix.

Current GPU architectures allow multiple kernels to execute simultaneously. Hence, in order to batch the execution of independent kernels, we can run each of them in a separate stream. In particular, in the above example we could create 1024 CUDA™ streams using the function cudaStreamCreate(), then preface each call to cublas<t>gemm() with a call to cublasSetStream() using a different stream for each matrix-matrix multiplication. This will ensure that, when possible, the different computations will be executed concurrently. Although the user can create many streams, in practice it is not possible to have more than 16 kernels executing concurrently.

### 2.7. Cache configuration

On some devices, L1 cache and shared memory use the same hardware resources. The cache configuration can be set directly with the CUDA Runtime function cudaDeviceSetCacheConfig. The cache configuration can also be set specifically for some functions using the routine cudaFuncSetCacheConfig. Please refer to the CUDA Runtime API documentation for details about the cache configuration settings.

Because switching from one configuration to another can affect kernel concurrency, the CUBLAS Library does not set any cache configuration preference and relies on the current setting. However, some CUBLAS routines, especially Level-3 routines, rely heavily on shared memory. Thus the cache preference setting might adversely affect their performance.

### 2.8. Device API Library

Starting with release 5.0, the CUDA Toolkit provides a static CUBLAS library, cublas_device.a, that contains device routines with the same API as the regular CUBLAS library. Those routines internally use the Dynamic Parallelism feature to launch kernels from within, and are thus only available on devices with compute capability 3.5 or higher.

In order to use those library routines from the device the user must include the header file “cublas_v2.h” corresponding to the new CUBLAS API and link against the static CUBLAS library cublas_device.a.

Those device CUBLAS library routines are called from the device in exactly the same way they are called from the host, with the following exceptions:

• The legacy CUBLAS API is not supported on the device.
• The pointer mode is not supported on the device; in other words, scalar input and output parameters must be allocated in device memory.

Furthermore, the input and output scalar parameters must be allocated and released using the cudaMalloc and cudaFree routines from the host, or the malloc and free routines from the device; in other words, they cannot be passed by reference from local memory to the routines.

## 3. CUBLAS Datatypes Reference

### 3.1. cublasHandle_t

The cublasHandle_t type is a pointer type to an opaque structure holding the CUBLAS library context. The CUBLAS library context must be initialized using cublasCreate() and the returned handle must be passed to all subsequent library function calls. The context should be destroyed at the end using cublasDestroy().

### 3.2. cublasStatus_t

The cublasStatus_t type is used for function status returns. All CUBLAS library functions return their status, which can have the following values.

| Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | The operation completed successfully. |
| CUBLAS_STATUS_NOT_INITIALIZED | The CUBLAS library was not initialized. This is usually caused by the lack of a prior cublasCreate() call, an error in the CUDA Runtime API called by the CUBLAS routine, or an error in the hardware setup. To correct: call cublasCreate() prior to the function call; and check that the hardware, an appropriate version of the driver, and the CUBLAS library are correctly installed. |
| CUBLAS_STATUS_ALLOC_FAILED | Resource allocation failed inside the CUBLAS library. This is usually caused by a cudaMalloc() failure. To correct: prior to the function call, deallocate previously allocated memory as much as possible. |
| CUBLAS_STATUS_INVALID_VALUE | An unsupported value or parameter was passed to the function (a negative vector size, for example). To correct: ensure that all the parameters being passed have valid values. |
| CUBLAS_STATUS_ARCH_MISMATCH | The function requires a feature absent from the device architecture; usually caused by the lack of support for double precision. To correct: compile and run the application on a device with appropriate compute capability, which is 1.3 for double precision. |
| CUBLAS_STATUS_MAPPING_ERROR | An access to GPU memory space failed, which is usually caused by a failure to bind a texture. To correct: prior to the function call, unbind any previously bound textures. |
| CUBLAS_STATUS_EXECUTION_FAILED | The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can have multiple causes. To correct: check that the hardware, an appropriate version of the driver, and the CUBLAS library are correctly installed. |
| CUBLAS_STATUS_INTERNAL_ERROR | An internal CUBLAS operation failed. This error is usually caused by a cudaMemcpyAsync() failure. To correct: check that the hardware, an appropriate version of the driver, and the CUBLAS library are correctly installed. Also, check that the memory passed as a parameter to the routine is not being deallocated prior to the routine’s completion. |

### 3.3. cublasOperation_t

The cublasOperation_t type indicates which operation needs to be performed with the dense matrix. Its values correspond to Fortran characters ‘N’ or ‘n’ (non-transpose), ‘T’ or ‘t’ (transpose) and ‘C’ or ‘c’ (conjugate transpose) that are often used as parameters to legacy BLAS implementations.

| Value | Meaning |
| --- | --- |
| CUBLAS_OP_N | the non-transpose operation is selected |
| CUBLAS_OP_T | the transpose operation is selected |
| CUBLAS_OP_C | the conjugate transpose operation is selected |

### 3.4. cublasFillMode_t

The cublasFillMode_t type indicates which part (lower or upper) of the dense matrix was filled and consequently should be used by the function. Its values correspond to Fortran characters ‘L’ or ‘l’ (lower) and ‘U’ or ‘u’ (upper) that are often used as parameters to legacy BLAS implementations.

| Value | Meaning |
| --- | --- |
| CUBLAS_FILL_MODE_LOWER | the lower part of the matrix is filled |
| CUBLAS_FILL_MODE_UPPER | the upper part of the matrix is filled |

### 3.5. cublasDiagType_t

The cublasDiagType_t type indicates whether the main diagonal of the dense matrix is unity and consequently should not be touched or modified by the function. Its values correspond to Fortran characters ‘N’ or ‘n’ (non-unit) and ‘U’ or ‘u’ (unit) that are often used as parameters to legacy BLAS implementations.

| Value | Meaning |
| --- | --- |
| CUBLAS_DIAG_NON_UNIT | the matrix diagonal has non-unit elements |
| CUBLAS_DIAG_UNIT | the matrix diagonal has unit elements |

### 3.6. cublasSideMode_t

The cublasSideMode_t type indicates whether the dense matrix is on the left or right side in the matrix equation solved by a particular function. Its values correspond to Fortran characters ‘L’ or ‘l’ (left) and ‘R’ or ‘r’ (right) that are often used as parameters to legacy BLAS implementations.

| Value | Meaning |
| --- | --- |
| CUBLAS_SIDE_LEFT | the matrix is on the left side in the equation |
| CUBLAS_SIDE_RIGHT | the matrix is on the right side in the equation |

### 3.7. cublasPointerMode_t

The cublasPointerMode_t type indicates whether the scalar values are passed by reference on the host or device. It is important to point out that if several scalar values are present in the function call, all of them must conform to the same single pointer mode. The pointer mode can be set and retrieved using cublasSetPointerMode() and cublasGetPointerMode() routines, respectively.

| Value | Meaning |
| --- | --- |
| CUBLAS_POINTER_MODE_HOST | the scalars are passed by reference on the host |
| CUBLAS_POINTER_MODE_DEVICE | the scalars are passed by reference on the device |

### 3.8. cublasAtomicsMode_t

The cublasAtomicsMode_t type indicates whether CUBLAS routines that have an alternate implementation using atomics are allowed to use it. The atomics mode can be set and queried using the cublasSetAtomicsMode() and cublasGetAtomicsMode() routines, respectively.

| Value | Meaning |
| --- | --- |
| CUBLAS_ATOMICS_NOT_ALLOWED | the usage of atomics is not allowed |
| CUBLAS_ATOMICS_ALLOWED | the usage of atomics is allowed |

## 4. CUBLAS Helper Function Reference

### 4.1. cublasCreate()

```cublasStatus_t
cublasCreate(cublasHandle_t *handle)```

This function initializes the CUBLAS library and creates a handle to an opaque structure holding the CUBLAS library context. It allocates hardware resources on the host and device and must be called prior to making any other CUBLAS library calls.

| Return Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the initialization succeeded |
| CUBLAS_STATUS_NOT_INITIALIZED | the CUDA™ Runtime initialization failed |
| CUBLAS_STATUS_ALLOC_FAILED | the resources could not be allocated |

### 4.2. cublasDestroy()

```cublasStatus_t
cublasDestroy(cublasHandle_t handle)```

This function releases hardware resources used by the CUBLAS library. The release of GPU resources may be deferred until the application exits. This function is usually the last call with a particular handle to the CUBLAS library.

| Return Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the shut down succeeded |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |

### 4.3. cublasGetVersion()

```cublasStatus_t
cublasGetVersion(cublasHandle_t handle, int *version)```

This function returns the version number of the CUBLAS library.

| Return Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |

### 4.4. cublasSetStream()

```cublasStatus_t
cublasSetStream(cublasHandle_t handle, cudaStream_t streamId)```

This function sets the CUBLAS library stream, which will be used to execute all subsequent calls to the CUBLAS library functions. If the CUBLAS library stream is not set, all kernels use the default NULL stream. In particular, this routine can be used to change the stream between kernel launches and then to reset the CUBLAS library stream back to NULL.

| Return Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the stream was set successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |

### 4.5. cublasGetStream()

```cublasStatus_t
cublasGetStream(cublasHandle_t handle, cudaStream_t *streamId)```

This function gets the CUBLAS library stream, which is being used to execute all calls to the CUBLAS library functions. If the CUBLAS library stream is not set, all kernels use the default NULL stream.

| Return Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the stream was returned successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |

### 4.6. cublasGetPointerMode()

```cublasStatus_t
cublasGetPointerMode(cublasHandle_t handle, cublasPointerMode_t *mode)```

This function obtains the pointer mode used by the CUBLAS library. Please see the section on the cublasPointerMode_t type for more details.

| Return Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the pointer mode was obtained successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |

### 4.7. cublasSetPointerMode()

```cublasStatus_t
cublasSetPointerMode(cublasHandle_t handle, cublasPointerMode_t mode)```

This function sets the pointer mode used by the CUBLAS library. The default is for the values to be passed by reference on the host. Please see the section on the cublasPointerMode_t type for more details.

| Return Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the pointer mode was set successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |

### 4.8. cublasSetVector()

```cublasStatus_t
cublasSetVector(int n, int elemSize,
const void *x, int incx, void *y, int incy)```

This function copies n elements from a vector x in host memory space to a vector y in GPU memory space. Elements in both vectors are assumed to have a size of elemSize bytes. The storage spacing between consecutive elements is given by incx for the source vector x and incy for the destination vector y.

In general, y points to an object, or part of an object, that was allocated via cublasAlloc(). Since column-major format for two-dimensional matrices is assumed, if a vector is part of a matrix, a vector increment equal to 1 accesses a (partial) column of that matrix. Similarly, using an increment equal to the leading dimension of the matrix results in accesses to a (partial) row of that matrix.

| Return Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters incx, incy or elemSize <= 0 |
| CUBLAS_STATUS_MAPPING_ERROR | there was an error accessing GPU memory |

### 4.9. cublasGetVector()

```cublasStatus_t
cublasGetVector(int n, int elemSize,
const void *x, int incx, void *y, int incy)```

This function copies n elements from a vector x in GPU memory space to a vector y in host memory space. Elements in both vectors are assumed to have a size of elemSize bytes. The storage spacing between consecutive elements is given by incx for the source vector x and incy for the destination vector y.

In general, x points to an object, or part of an object, that was allocated via cublasAlloc(). Since column-major format for two-dimensional matrices is assumed, if a vector is part of a matrix, a vector increment equal to 1 accesses a (partial) column of that matrix. Similarly, using an increment equal to the leading dimension of the matrix results in accesses to a (partial) row of that matrix.

| Return Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters incx, incy or elemSize <= 0 |
| CUBLAS_STATUS_MAPPING_ERROR | there was an error accessing GPU memory |

### 4.10. cublasSetMatrix()

```cublasStatus_t
cublasSetMatrix(int rows, int cols, int elemSize,
const void *A, int lda, void *B, int ldb)```

This function copies a tile of rows x cols elements from a matrix A in host memory space to a matrix B in GPU memory space. It is assumed that each element requires storage of elemSize bytes and that both matrices are stored in column-major format, with the leading dimension of the source matrix A and destination matrix B given in lda and ldb, respectively. The leading dimension indicates the number of rows of the allocated matrix, even if only a submatrix of it is being used. In general, B is a device pointer that points to an object, or part of an object, that was allocated in GPU memory space via cublasAlloc().

| Return Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters rows, cols < 0 or elemSize, lda, ldb <= 0 |
| CUBLAS_STATUS_MAPPING_ERROR | there was an error accessing GPU memory |

### 4.11. cublasGetMatrix()

```cublasStatus_t
cublasGetMatrix(int rows, int cols, int elemSize,
const void *A, int lda, void *B, int ldb)```

This function copies a tile of rows x cols elements from a matrix A in GPU memory space to a matrix B in host memory space. It is assumed that each element requires storage of elemSize bytes and that both matrices are stored in column-major format, with the leading dimension of the source matrix A and destination matrix B given in lda and ldb, respectively. The leading dimension indicates the number of rows of the allocated matrix, even if only a submatrix of it is being used. In general, A is a device pointer that points to an object, or part of an object, that was allocated in GPU memory space via cublasAlloc().

| Return Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters rows, cols < 0 or elemSize, lda, ldb <= 0 |
| CUBLAS_STATUS_MAPPING_ERROR | there was an error accessing GPU memory |

### 4.12. cublasSetVectorAsync()

```cublasStatus_t
cublasSetVectorAsync(int n, int elemSize, const void *hostPtr, int incx,
void *devicePtr, int incy, cudaStream_t stream)```

This function has the same functionality as cublasSetVector(), with the exception that the data transfer is done asynchronously (with respect to the host) using the given CUDA™ stream parameter.

| Return Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters incx, incy or elemSize <= 0 |
| CUBLAS_STATUS_MAPPING_ERROR | there was an error accessing GPU memory |

### 4.13. cublasGetVectorAsync()

```cublasStatus_t
cublasGetVectorAsync(int n, int elemSize, const void *devicePtr, int incx,
void *hostPtr, int incy, cudaStream_t stream)```

This function has the same functionality as cublasGetVector(), with the exception that the data transfer is done asynchronously (with respect to the host) using the given CUDA™ stream parameter.

| Return Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters incx, incy or elemSize <= 0 |
| CUBLAS_STATUS_MAPPING_ERROR | there was an error accessing GPU memory |

### 4.14. cublasSetMatrixAsync()

```cublasStatus_t
cublasSetMatrixAsync(int rows, int cols, int elemSize, const void *A,
int lda, void *B, int ldb, cudaStream_t stream)```

This function has the same functionality as cublasSetMatrix(), with the exception that the data transfer is done asynchronously (with respect to the host) using the given CUDA™ stream parameter.

| Return Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters rows, cols < 0 or elemSize, lda, ldb <= 0 |
| CUBLAS_STATUS_MAPPING_ERROR | there was an error accessing GPU memory |

### 4.15. cublasGetMatrixAsync()

```cublasStatus_t
cublasGetMatrixAsync(int rows, int cols, int elemSize, const void *A,
int lda, void *B, int ldb, cudaStream_t stream)```

This function has the same functionality as cublasGetMatrix(), with the exception that the data transfer is done asynchronously (with respect to the host) using the given CUDA™ stream parameter.

| Return Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters rows, cols < 0 or elemSize, lda, ldb <= 0 |
| CUBLAS_STATUS_MAPPING_ERROR | there was an error accessing GPU memory |

### 4.16. cublasSetAtomicsMode()

`cublasStatus_t cublasSetAtomicsMode(cublasHandle_t handle, cublasAtomicsMode_t mode)`

Some routines, such as cublas<t>symv and cublas<t>hemv, have an alternate implementation that uses atomics to accumulate results. This implementation is generally significantly faster, but can generate results that are not strictly identical from one run to another. Mathematically, those different results are not significant, but such differences can be problematic when debugging.

This function allows or disallows the use of atomics in the CUBLAS library for all routines which have an alternate implementation. If the documentation of a CUBLAS routine does not explicitly mention atomics, it means that the routine does not have an alternate implementation that uses atomics. When atomics mode is disabled, each CUBLAS routine should produce the same results from one run to another when called with identical parameters on the same hardware.

The default value of the atomics mode is CUBLAS_ATOMICS_NOT_ALLOWED. Please see the section on the cublasAtomicsMode_t type for more details.

| Return Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the atomics mode was set successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |

### 4.17. cublasGetAtomicsMode()

`cublasStatus_t cublasGetAtomicsMode(cublasHandle_t handle, cublasAtomicsMode_t *mode)`

This function queries the atomic mode of a specific CUBLAS context.

The default value of the atomics mode is CUBLAS_ATOMICS_NOT_ALLOWED. Please see the section on the cublasAtomicsMode_t type for more details.

| Return Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the atomics mode was queried successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |

## 5. CUBLAS Level-1 Function Reference

In this chapter we describe the Level-1 Basic Linear Algebra Subprograms (BLAS1) functions that perform scalar and vector based operations. We use the abbreviation <type> for the data type and <t> for the corresponding short type to make the presentation of the implemented functions more concise and clear. Unless otherwise specified, <type> and <t> have the following meanings:

| <type> | <t> | Meaning |
| --- | --- | --- |
| float | ‘s’ or ‘S’ | real single-precision |
| double | ‘d’ or ‘D’ | real double-precision |
| cuComplex | ‘c’ or ‘C’ | complex single-precision |
| cuDoubleComplex | ‘z’ or ‘Z’ | complex double-precision |

When the parameters and returned values of the function differ, which sometimes happens for complex input, the <t> can also have the following meanings ‘Sc’, ‘Cs’, ‘Dz’ and ‘Zd’.

The abbreviations Re(.) and Im(.) will stand for the real and imaginary part of a number, respectively. Since the imaginary part of a real number does not exist, we consider it to be zero and can usually simply discard it from the equation where it is being used. Also, $\bar{\alpha }$ will denote the complex conjugate of $\alpha$ .

In general throughout the documentation, the lower case Greek symbols $\alpha$ and $\beta$ will denote scalars, lower case English letters in bold type $\mathbf{x}$ and $\mathbf{y}$ will denote vectors, and capital English letters $A$ , $B$ and $C$ will denote matrices.

### 5.1. cublasI<t>amax()

```
cublasStatus_t cublasIsamax(cublasHandle_t handle, int n,
                            const float *x, int incx, int *result)
cublasStatus_t cublasIdamax(cublasHandle_t handle, int n,
                            const double *x, int incx, int *result)
cublasStatus_t cublasIcamax(cublasHandle_t handle, int n,
                            const cuComplex *x, int incx, int *result)
cublasStatus_t cublasIzamax(cublasHandle_t handle, int n,
                            const cuDoubleComplex *x, int incx, int *result)
```

This function finds the (smallest) index of the element of the maximum magnitude. Hence, the result is the first $i$ such that $|Im\left(x\left[j\right]\right)|+|Re\left(x\left[j\right]\right)|$ is maximum for $i=1,\dots ,n$ and $j=1+\left(i-1\right)*\text{incx}$ . Notice that the last equation reflects 1-based indexing used for compatibility with Fortran.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| n | | input | number of elements in the vector x. |
| x | device | input | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |
| result | host or device | output | the resulting index, which is 0 if n,incx<=0. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_ALLOC_FAILED | the reduction buffer could not be allocated |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |
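The result definition above can be illustrated with a short CPU reference sketch (the function name ref_isamax is hypothetical, not part of CUBLAS; for real data the magnitude reduces to the absolute value):

```c
#include <math.h>

/* Illustrative CPU sketch of the cublasIsamax() result definition: the
 * smallest 1-based index i whose element x[1+(i-1)*incx] (1-based) has
 * the largest magnitude. Returns 0 if n,incx<=0, as documented. */
static int ref_isamax(int n, const float *x, int incx)
{
    if (n <= 0 || incx <= 0) return 0;
    int best = 1;
    float bestval = fabsf(x[0]);
    for (int i = 2; i <= n; ++i) {
        float v = fabsf(x[(i - 1) * incx]);
        if (v > bestval) { bestval = v; best = i; }  /* strict >, so ties keep the first index */
    }
    return best;
}
```

Note that the returned index is 1-based so it can be consumed by Fortran-style code directly; C callers must subtract one before using it as an array offset.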

### 5.2. cublasI<t>amin()

```
cublasStatus_t cublasIsamin(cublasHandle_t handle, int n,
                            const float *x, int incx, int *result)
cublasStatus_t cublasIdamin(cublasHandle_t handle, int n,
                            const double *x, int incx, int *result)
cublasStatus_t cublasIcamin(cublasHandle_t handle, int n,
                            const cuComplex *x, int incx, int *result)
cublasStatus_t cublasIzamin(cublasHandle_t handle, int n,
                            const cuDoubleComplex *x, int incx, int *result)
```

This function finds the (smallest) index of the element of the minimum magnitude. Hence, the result is the first $i$ such that $|Im\left(x\left[j\right]\right)|+|Re\left(x\left[j\right]\right)|$ is minimum for $i=1,\dots ,n$ and $j=1+\left(i-1\right)*\text{incx}$ . Notice that the last equation reflects 1-based indexing used for compatibility with Fortran.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| n | | input | number of elements in the vector x. |
| x | device | input | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |
| result | host or device | output | the resulting index, which is 0 if n,incx<=0. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_ALLOC_FAILED | the reduction buffer could not be allocated |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 5.3. cublas<t>asum()

```
cublasStatus_t  cublasSasum(cublasHandle_t handle, int n,
                            const float           *x, int incx, float  *result)
cublasStatus_t  cublasDasum(cublasHandle_t handle, int n,
                            const double          *x, int incx, double *result)
cublasStatus_t cublasScasum(cublasHandle_t handle, int n,
                            const cuComplex       *x, int incx, float  *result)
cublasStatus_t cublasDzasum(cublasHandle_t handle, int n,
                            const cuDoubleComplex *x, int incx, double *result)
```

This function computes the sum of the absolute values of the elements of vector x. Hence, the result is ${\sum }_{i=1}^{n}\left(|Im\left(x\left[j\right]\right)|+|Re\left(x\left[j\right]\right)|\right)$ where $j=1+\left(i-1\right)*\text{incx}$ . Notice that the last equation reflects 1-based indexing used for compatibility with Fortran.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| n | | input | number of elements in the vector x. |
| x | device | input | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |
| result | host or device | output | the resulting sum, which is 0.0 if n,incx<=0. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_ALLOC_FAILED | the reduction buffer could not be allocated |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |
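The result definition can be sketched on the CPU as follows (ref_sasum is a hypothetical name for illustration, not a CUBLAS API; for real vectors the per-element magnitude is simply the absolute value):

```c
#include <math.h>

/* Illustrative CPU sketch of the cublasSasum() result: the sum of the
 * absolute values of x[1+(i-1)*incx] (1-based) for i = 1,...,n,
 * and 0.0 if n,incx<=0 as documented. */
static float ref_sasum(int n, const float *x, int incx)
{
    if (n <= 0 || incx <= 0) return 0.0f;
    float sum = 0.0f;
    for (int i = 1; i <= n; ++i)
        sum += fabsf(x[(i - 1) * incx]);  /* j = 1 + (i-1)*incx, 0-based in C */
    return sum;
}
```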

### 5.4. cublas<t>axpy()

```
cublasStatus_t cublasSaxpy(cublasHandle_t handle, int n,
                           const float           *alpha,
                           const float           *x, int incx,
                           float                 *y, int incy)
cublasStatus_t cublasDaxpy(cublasHandle_t handle, int n,
                           const double          *alpha,
                           const double          *x, int incx,
                           double                *y, int incy)
cublasStatus_t cublasCaxpy(cublasHandle_t handle, int n,
                           const cuComplex       *alpha,
                           const cuComplex       *x, int incx,
                           cuComplex             *y, int incy)
cublasStatus_t cublasZaxpy(cublasHandle_t handle, int n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *x, int incx,
                           cuDoubleComplex       *y, int incy)
```

This function multiplies the vector x by the scalar $\alpha$ and adds it to the vector y, overwriting the latter with the result. Hence, the performed operation is $y\left[j\right]=\alpha ×x\left[k\right]+y\left[j\right]$ for $i=1,\dots ,n$ , $k=1+\left(i-1\right)*\text{incx}$ and $j=1+\left(i-1\right)*\text{incy}$ . Notice that the last two equations reflect 1-based indexing used for compatibility with Fortran.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| alpha | host or device | input | <type> scalar used for multiplication. |
| n | | input | number of elements in the vectors x and y. |
| x | device | input | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |
| y | device | in/out | <type> vector with n elements. |
| incy | | input | stride between consecutive elements of y. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |
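The performed operation can be sketched on the CPU as follows (ref_saxpy is a hypothetical name for illustration, not a CUBLAS API):

```c
/* Illustrative CPU sketch of the cublasSaxpy() operation:
 * y[j] = alpha*x[k] + y[j] with k = 1+(i-1)*incx and j = 1+(i-1)*incy
 * (1-based), written here with the equivalent 0-based C offsets. */
static void ref_saxpy(int n, float alpha, const float *x, int incx,
                      float *y, int incy)
{
    for (int i = 0; i < n; ++i)
        y[i * incy] += alpha * x[i * incx];
}
```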

### 5.5. cublas<t>copy()

```
cublasStatus_t cublasScopy(cublasHandle_t handle, int n,
                           const float           *x, int incx,
                           float                 *y, int incy)
cublasStatus_t cublasDcopy(cublasHandle_t handle, int n,
                           const double          *x, int incx,
                           double                *y, int incy)
cublasStatus_t cublasCcopy(cublasHandle_t handle, int n,
                           const cuComplex       *x, int incx,
                           cuComplex             *y, int incy)
cublasStatus_t cublasZcopy(cublasHandle_t handle, int n,
                           const cuDoubleComplex *x, int incx,
                           cuDoubleComplex       *y, int incy)
```

This function copies the vector x into the vector y. Hence, the performed operation is $y\left[j\right]=x\left[k\right]$ for $i=1,\dots ,n$ , $k=1+\left(i-1\right)*\text{incx}$ and $j=1+\left(i-1\right)*\text{incy}$ . Notice that the last two equations reflect 1-based indexing used for compatibility with Fortran.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| n | | input | number of elements in the vectors x and y. |
| x | device | input | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |
| y | device | output | <type> vector with n elements. |
| incy | | input | stride between consecutive elements of y. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 5.6. cublas<t>dot()

```
cublasStatus_t cublasSdot (cublasHandle_t handle, int n,
                           const float           *x, int incx,
                           const float           *y, int incy,
                           float           *result)
cublasStatus_t cublasDdot (cublasHandle_t handle, int n,
                           const double          *x, int incx,
                           const double          *y, int incy,
                           double          *result)
cublasStatus_t cublasCdotu(cublasHandle_t handle, int n,
                           const cuComplex       *x, int incx,
                           const cuComplex       *y, int incy,
                           cuComplex       *result)
cublasStatus_t cublasCdotc(cublasHandle_t handle, int n,
                           const cuComplex       *x, int incx,
                           const cuComplex       *y, int incy,
                           cuComplex       *result)
cublasStatus_t cublasZdotu(cublasHandle_t handle, int n,
                           const cuDoubleComplex *x, int incx,
                           const cuDoubleComplex *y, int incy,
                           cuDoubleComplex *result)
cublasStatus_t cublasZdotc(cublasHandle_t handle, int n,
                           const cuDoubleComplex *x, int incx,
                           const cuDoubleComplex *y, int incy,
                           cuDoubleComplex *result)
```

This function computes the dot product of vectors x and y. Hence, the result is ${\sum }_{i=1}^{n}\left(x\left[k\right]×y\left[j\right]\right)$ where $k=1+\left(i-1\right)*\text{incx}$ and $j=1+\left(i-1\right)*\text{incy}$ . Notice that in the first equation the conjugate of the element of vector x should be used if the function name ends in character ‘c’ and that the last two equations reflect 1-based indexing used for compatibility with Fortran.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| n | | input | number of elements in the vectors x and y. |
| x | device | input | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |
| y | device | input | <type> vector with n elements. |
| incy | | input | stride between consecutive elements of y. |
| result | host or device | output | the resulting dot product, which is 0.0 if n<=0. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_ALLOC_FAILED | the reduction buffer could not be allocated |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |
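The real-valued case can be sketched on the CPU as follows (ref_sdot is a hypothetical name for illustration, not a CUBLAS API):

```c
/* Illustrative CPU sketch of the cublasSdot() result: the sum of
 * x[k]*y[j] with k = 1+(i-1)*incx and j = 1+(i-1)*incy (1-based),
 * written here with the equivalent 0-based C offsets. */
static float ref_sdot(int n, const float *x, int incx,
                      const float *y, int incy)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += x[i * incx] * y[i * incy];  /* 'c' variants conjugate x[k] here */
    return sum;
}
```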

### 5.7. cublas<t>nrm2()

```
cublasStatus_t  cublasSnrm2(cublasHandle_t handle, int n,
                            const float           *x, int incx, float  *result)
cublasStatus_t  cublasDnrm2(cublasHandle_t handle, int n,
                            const double          *x, int incx, double *result)
cublasStatus_t cublasScnrm2(cublasHandle_t handle, int n,
                            const cuComplex       *x, int incx, float  *result)
cublasStatus_t cublasDznrm2(cublasHandle_t handle, int n,
                            const cuDoubleComplex *x, int incx, double *result)
```

This function computes the Euclidean norm of the vector x. The code uses a multiphase model of accumulation to avoid intermediate underflow and overflow, with the result being equivalent to $\sqrt{{\sum }_{i=1}^{n}\left(x\left[j\right]×x\left[j\right]\right)}$ where $j=1+\left(i-1\right)*\text{incx}$ in exact arithmetic. Notice that the last equation reflects 1-based indexing used for compatibility with Fortran.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| n | | input | number of elements in the vector x. |
| x | device | input | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |
| result | host or device | output | the resulting norm, which is 0.0 if n,incx<=0. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_ALLOC_FAILED | the reduction buffer could not be allocated |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 5.8. cublas<t>rot()

```
cublasStatus_t  cublasSrot(cublasHandle_t handle, int n,
                           float           *x, int incx,
                           float           *y, int incy,
                           const float  *c, const float           *s)
cublasStatus_t  cublasDrot(cublasHandle_t handle, int n,
                           double          *x, int incx,
                           double          *y, int incy,
                           const double *c, const double          *s)
cublasStatus_t  cublasCrot(cublasHandle_t handle, int n,
                           cuComplex       *x, int incx,
                           cuComplex       *y, int incy,
                           const float  *c, const cuComplex       *s)
cublasStatus_t cublasCsrot(cublasHandle_t handle, int n,
                           cuComplex       *x, int incx,
                           cuComplex       *y, int incy,
                           const float  *c, const float           *s)
cublasStatus_t  cublasZrot(cublasHandle_t handle, int n,
                           cuDoubleComplex *x, int incx,
                           cuDoubleComplex *y, int incy,
                           const double *c, const cuDoubleComplex *s)
cublasStatus_t cublasZdrot(cublasHandle_t handle, int n,
                           cuDoubleComplex *x, int incx,
                           cuDoubleComplex *y, int incy,
                           const double *c, const double          *s)
```

This function applies Givens rotation matrix

$G=\left(\begin{array}{cc}\hfill c\hfill & \hfill s\hfill \\ \hfill -s\hfill & \hfill c\hfill \end{array}\right)$

to vectors x and y.

Hence, the result is $x\left[k\right]=c×x\left[k\right]+s×y\left[j\right]$ and $y\left[j\right]=-s×x\left[k\right]+c×y\left[j\right]$ where $k=1+\left(i-1\right)*\text{incx}$ and $j=1+\left(i-1\right)*\text{incy}$ for $i=1,\dots ,n$ . Notice that the last two equations reflect 1-based indexing used for compatibility with Fortran.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| n | | input | number of elements in the vectors x and y. |
| x | device | in/out | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |
| y | device | in/out | <type> vector with n elements. |
| incy | | input | stride between consecutive elements of y. |
| c | host or device | input | cosine element of the rotation matrix. |
| s | host or device | input | sine element of the rotation matrix. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 5.9. cublas<t>rotg()

```
cublasStatus_t cublasSrotg(cublasHandle_t handle,
                           float           *a, float           *b,
                           float  *c, float           *s)
cublasStatus_t cublasDrotg(cublasHandle_t handle,
                           double          *a, double          *b,
                           double *c, double          *s)
cublasStatus_t cublasCrotg(cublasHandle_t handle,
                           cuComplex       *a, cuComplex       *b,
                           float  *c, cuComplex       *s)
cublasStatus_t cublasZrotg(cublasHandle_t handle,
                           cuDoubleComplex *a, cuDoubleComplex *b,
                           double *c, cuDoubleComplex *s)
```

This function constructs the Givens rotation matrix

$G=\left(\begin{array}{cc}\hfill c\hfill & \hfill s\hfill \\ \hfill -s\hfill & \hfill c\hfill \end{array}\right)$

that zeros out the second entry of a $2×1$ vector ${\left(a,b\right)}^{T}$ .

Then, for real numbers we can write

$\left(\begin{array}{cc}\hfill c\hfill & \hfill s\hfill \\ \hfill -s\hfill & \hfill c\hfill \end{array}\right)\left(\begin{array}{c}\hfill a\hfill \\ \hfill b\hfill \end{array}\right)=\left(\begin{array}{c}\hfill r\hfill \\ \hfill 0\hfill \end{array}\right)$

where ${c}^{2}+{s}^{2}=1$ and $r=\sqrt{{a}^{2}+{b}^{2}}$ . The parameters $a$ and $b$ are overwritten with $r$ and $z$ , respectively. The value of $z$ is such that $c$ and $s$ may be recovered using the following rules:

$\left(c,s\right)=\begin{cases}\left(\sqrt{1-{z}^{2}},\ z\right) & \text{if } |z|<1\\ \left(0,\ 1\right) & \text{if } |z|=1\\ \left(1/z,\ \sqrt{1-1/{z}^{2}}\right) & \text{if } |z|>1\end{cases}$

For complex numbers we can write

$\left(\begin{array}{cc}\hfill c\hfill & \hfill s\hfill \\ \hfill -\stackrel{̄}{s}\hfill & \hfill c\hfill \end{array}\right)\left(\begin{array}{c}\hfill a\hfill \\ \hfill b\hfill \end{array}\right)=\left(\begin{array}{c}\hfill r\hfill \\ \hfill 0\hfill \end{array}\right)$

where ${c}^{2}+\left(\stackrel{̄}{s}×s\right)=1$ and $r=\frac{a}{|a|}×\parallel {\left(a,b\right)}^{T}{\parallel }_{2}$ with $\parallel {\left(a,b\right)}^{T}{\parallel }_{2}=\sqrt{|a{|}^{2}+|b{|}^{2}}$ for $a\ne 0$ and $r=b$ for $a=0$ . Finally, the parameter $a$ is overwritten with $r$ on exit.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| a | host or device | in/out | <type> scalar that is overwritten with $r$ . |
| b | host or device | in/out | <type> scalar that is overwritten with $z$ . |
| c | host or device | output | cosine element of the rotation matrix. |
| s | host or device | output | sine element of the rotation matrix. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 5.10. cublas<t>rotm()

```
cublasStatus_t cublasSrotm(cublasHandle_t handle, int n, float  *x, int incx,
                           float  *y, int incy, const float  *param)
cublasStatus_t cublasDrotm(cublasHandle_t handle, int n, double *x, int incx,
                           double *y, int incy, const double *param)
```

This function applies the modified Givens transformation

$H=\left(\begin{array}{cc}\hfill {h}_{11}\hfill & \hfill {h}_{12}\hfill \\ \hfill {h}_{21}\hfill & \hfill {h}_{22}\hfill \end{array}\right)$

to vectors x and y.

Hence, the result is $x\left[k\right]={h}_{11}×x\left[k\right]+{h}_{12}×y\left[j\right]$ and $y\left[j\right]={h}_{21}×x\left[k\right]+{h}_{22}×y\left[j\right]$ where $k=1+\left(i-1\right)*\text{incx}$ and $j=1+\left(i-1\right)*\text{incy}$ . Notice that the last two equations reflect 1-based indexing used for compatibility with Fortran.

The elements ${h}_{11}$ , ${h}_{21}$ , ${h}_{12}$ and ${h}_{22}$ of matrix $H$ are stored in param[1], param[2], param[3] and param[4], respectively. The flag=param[0] defines the following predefined values for the matrix $H$ entries

| flag=-1.0 | flag= 0.0 | flag= 1.0 | flag=-2.0 |
| --- | --- | --- | --- |
| $\left(\begin{array}{cc}{h}_{11}&{h}_{12}\\{h}_{21}&{h}_{22}\end{array}\right)$ | $\left(\begin{array}{cc}1.0&{h}_{12}\\{h}_{21}&1.0\end{array}\right)$ | $\left(\begin{array}{cc}{h}_{11}&1.0\\-1.0&{h}_{22}\end{array}\right)$ | $\left(\begin{array}{cc}1.0&0.0\\0.0&1.0\end{array}\right)$ |

Notice that the values -1.0, 0.0 and 1.0 implied by the flag are not stored in param.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| n | | input | number of elements in the vectors x and y. |
| x | device | in/out | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |
| y | device | in/out | <type> vector with n elements. |
| incy | | input | stride between consecutive elements of y. |
| param | host or device | input | <type> vector of 5 elements, where param[0] and param[1-4] contain the flag and matrix $H$ . |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 5.11. cublas<t>rotmg()

```
cublasStatus_t cublasSrotmg(cublasHandle_t handle, float  *d1, float  *d2,
                            float  *x1, const float  *y1, float  *param)
cublasStatus_t cublasDrotmg(cublasHandle_t handle, double *d1, double *d2,
                            double *x1, const double *y1, double *param)
```

This function constructs the modified Givens transformation

$H=\left(\begin{array}{cc}\hfill {h}_{11}\hfill & \hfill {h}_{12}\hfill \\ \hfill {h}_{21}\hfill & \hfill {h}_{22}\hfill \end{array}\right)$

that zeros out the second entry of a $2×1$ vector ${\left(\sqrt{d1}*x1,\sqrt{d2}*y1\right)}^{T}$ .

The flag=param[0] defines the following predefined values for the matrix $H$ entries

| flag=-1.0 | flag= 0.0 | flag= 1.0 | flag=-2.0 |
| --- | --- | --- | --- |
| $\left(\begin{array}{cc}{h}_{11}&{h}_{12}\\{h}_{21}&{h}_{22}\end{array}\right)$ | $\left(\begin{array}{cc}1.0&{h}_{12}\\{h}_{21}&1.0\end{array}\right)$ | $\left(\begin{array}{cc}{h}_{11}&1.0\\-1.0&{h}_{22}\end{array}\right)$ | $\left(\begin{array}{cc}1.0&0.0\\0.0&1.0\end{array}\right)$ |

Notice that the values -1.0, 0.0 and 1.0 implied by the flag are not stored in param.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| d1 | host or device | in/out | <type> scalar that is overwritten on exit. |
| d2 | host or device | in/out | <type> scalar that is overwritten on exit. |
| x1 | host or device | in/out | <type> scalar that is overwritten on exit. |
| y1 | host or device | input | <type> scalar. |
| param | host or device | output | <type> vector of 5 elements, where param[0] and param[1-4] contain the flag and matrix $H$ . |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 5.12. cublas<t>scal()

```
cublasStatus_t  cublasSscal(cublasHandle_t handle, int n,
                            const float           *alpha,
                            float           *x, int incx)
cublasStatus_t  cublasDscal(cublasHandle_t handle, int n,
                            const double          *alpha,
                            double          *x, int incx)
cublasStatus_t  cublasCscal(cublasHandle_t handle, int n,
                            const cuComplex       *alpha,
                            cuComplex       *x, int incx)
cublasStatus_t cublasCsscal(cublasHandle_t handle, int n,
                            const float           *alpha,
                            cuComplex       *x, int incx)
cublasStatus_t  cublasZscal(cublasHandle_t handle, int n,
                            const cuDoubleComplex *alpha,
                            cuDoubleComplex *x, int incx)
cublasStatus_t cublasZdscal(cublasHandle_t handle, int n,
                            const double          *alpha,
                            cuDoubleComplex *x, int incx)
```

This function scales the vector x by the scalar $\alpha$ and overwrites it with the result. Hence, the performed operation is $x\left[j\right]=\alpha ×x\left[j\right]$ for $i=1,\dots ,n$ and $j=1+\left(i-1\right)*\text{incx}$ . Notice that the last equation reflects 1-based indexing used for compatibility with Fortran.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| alpha | host or device | input | <type> scalar used for multiplication. |
| n | | input | number of elements in the vector x. |
| x | device | in/out | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 5.13. cublas<t>swap()

```
cublasStatus_t cublasSswap(cublasHandle_t handle, int n, float           *x,
                           int incx, float           *y, int incy)
cublasStatus_t cublasDswap(cublasHandle_t handle, int n, double          *x,
                           int incx, double          *y, int incy)
cublasStatus_t cublasCswap(cublasHandle_t handle, int n, cuComplex       *x,
                           int incx, cuComplex       *y, int incy)
cublasStatus_t cublasZswap(cublasHandle_t handle, int n, cuDoubleComplex *x,
                           int incx, cuDoubleComplex *y, int incy)
```

This function interchanges the elements of vectors x and y. Hence, the performed operation is $y\left[j\right]⇔x\left[k\right]$ for $i=1,\dots ,n$ , $k=1+\left(i-1\right)*\text{incx}$ and $j=1+\left(i-1\right)*\text{incy}$ . Notice that the last two equations reflect 1-based indexing used for compatibility with Fortran.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| n | | input | number of elements in the vectors x and y. |
| x | device | in/out | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |
| y | device | in/out | <type> vector with n elements. |
| incy | | input | stride between consecutive elements of y. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

## 6. CUBLAS Level-2 Function Reference

In this chapter we describe the Level-2 Basic Linear Algebra Subprograms (BLAS2) functions that perform matrix-vector operations.

### 6.1. cublas<t>gbmv()

```
cublasStatus_t cublasSgbmv(cublasHandle_t handle, cublasOperation_t trans,
                           int m, int n, int kl, int ku,
                           const float           *alpha,
                           const float           *A, int lda,
                           const float           *x, int incx,
                           const float           *beta,
                           float           *y, int incy)
cublasStatus_t cublasDgbmv(cublasHandle_t handle, cublasOperation_t trans,
                           int m, int n, int kl, int ku,
                           const double          *alpha,
                           const double          *A, int lda,
                           const double          *x, int incx,
                           const double          *beta,
                           double          *y, int incy)
cublasStatus_t cublasCgbmv(cublasHandle_t handle, cublasOperation_t trans,
                           int m, int n, int kl, int ku,
                           const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           const cuComplex       *x, int incx,
                           const cuComplex       *beta,
                           cuComplex       *y, int incy)
cublasStatus_t cublasZgbmv(cublasHandle_t handle, cublasOperation_t trans,
                           int m, int n, int kl, int ku,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           const cuDoubleComplex *x, int incx,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *y, int incy)
```

This function performs the banded matrix-vector multiplication

$\mathbf{\text{y}}=\alpha \text{op}\left(A\right)\mathbf{\text{x}}+\beta \mathbf{\text{y}}$

where $A$ is a banded matrix with $kl$ subdiagonals and $ku$ superdiagonals, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ and $\beta$ are scalars. Also, for matrix $A$

$\text{op}\left(A\right)=\begin{cases}A & \text{if trans == CUBLAS_OP_N}\\ {A}^{T} & \text{if trans == CUBLAS_OP_T}\\ {A}^{H} & \text{if trans == CUBLAS_OP_C}\end{cases}$

The banded matrix $A$ is stored column by column, with the main diagonal stored in row $ku+1$ (starting in first position), the first superdiagonal stored in row $ku$ (starting in second position), the first subdiagonal stored in row $ku+2$ (starting in first position), etc. So that in general, the element $A\left(i,j\right)$ is stored in the memory location A(ku+1+i-j,j) for $j=1,\dots ,n$ and $i\in \left[max\left(1,j-ku\right),min\left(m,j+kl\right)\right]$ . Also, the elements in the array $A$ that do not conceptually correspond to the elements in the banded matrix (the top left $ku×ku$ and bottom right $kl×kl$ triangles) are not referenced.
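The storage rule above translates directly into a 0-based C offset; the macro below is an illustration of that arithmetic (BANDED_IDX is a hypothetical helper, not part of CUBLAS):

```c
/* 0-based offset of the 1-based banded-matrix element A(i,j): the element
 * lives at row ku+1+i-j (1-based) of column j in the lda x n array, i.e.
 * column-major offset (ku+i-j) + (j-1)*lda. Valid only for indices inside
 * the band, max(1,j-ku) <= i <= min(m,j+kl). */
#define BANDED_IDX(i, j, ku, lda) (((ku) + (i) - (j)) + ((j) - 1) * (lda))
```

For example, for a tridiagonal matrix (kl=ku=1) stored with lda=3, the diagonal element A(1,1) lands at offset 1, the superdiagonal element A(1,2) at offset 3, and the subdiagonal element A(2,1) at offset 2.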

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| trans | | input | operation op(A) that is non- or (conj.) transpose. |
| m | | input | number of rows of matrix A. |
| n | | input | number of columns of matrix A. |
| kl | | input | number of subdiagonals of matrix A. |
| ku | | input | number of superdiagonals of matrix A. |
| alpha | host or device | input | <type> scalar used for multiplication. |
| A | device | input | <type> array of dimension lda x n with lda>=kl+ku+1. |
| lda | | input | leading dimension of two-dimensional array used to store matrix A. |
| x | device | input | <type> vector with n elements if trans == CUBLAS_OP_N and m elements otherwise. |
| incx | | input | stride between consecutive elements of x. |
| beta | host or device | input | <type> scalar used for multiplication, if beta == 0 then y does not have to be a valid input. |
| y | device | in/out | <type> vector with m elements if trans == CUBLAS_OP_N and n elements otherwise. |
| incy | | input | stride between consecutive elements of y. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters m,n,kl,ku<0 or incx,incy=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.2. cublas<t>gemv()

```
cublasStatus_t cublasSgemv(cublasHandle_t handle, cublasOperation_t trans,
                           int m, int n,
                           const float           *alpha,
                           const float           *A, int lda,
                           const float           *x, int incx,
                           const float           *beta,
                           float           *y, int incy)
cublasStatus_t cublasDgemv(cublasHandle_t handle, cublasOperation_t trans,
                           int m, int n,
                           const double          *alpha,
                           const double          *A, int lda,
                           const double          *x, int incx,
                           const double          *beta,
                           double          *y, int incy)
cublasStatus_t cublasCgemv(cublasHandle_t handle, cublasOperation_t trans,
                           int m, int n,
                           const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           const cuComplex       *x, int incx,
                           const cuComplex       *beta,
                           cuComplex       *y, int incy)
cublasStatus_t cublasZgemv(cublasHandle_t handle, cublasOperation_t trans,
                           int m, int n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           const cuDoubleComplex *x, int incx,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *y, int incy)
```

This function performs the matrix-vector multiplication

$\mathbf{\text{y}}=\alpha \text{op}\left(A\right)\mathbf{\text{x}}+\beta \mathbf{\text{y}}$

where $A$ is an $m×n$ matrix stored in column-major format, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ and $\beta$ are scalars. Also, for matrix $A$

$\text{op}\left(A\right)=\begin{cases}A & \text{if trans == CUBLAS_OP_N}\\ A^{T} & \text{if trans == CUBLAS_OP_T}\\ A^{H} & \text{if trans == CUBLAS_OP_C}\end{cases}$
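As a concrete reading of the formula above, the following host-side sketch computes the same operation on column-major data (plain C for illustration; `sgemv_ref` and `IDX` are hypothetical names, not part of CUBLAS, and the conjugate-transpose case is omitted):

```c
/* 0-based column-major index of element (i,j) in an array with leading dimension ld */
#define IDX(i, j, ld) ((j) * (ld) + (i))

/* Host reference for y = alpha*op(A)*x + beta*y, where op(A) = A (trans = 0)
   or A^T (trans = 1). A is m x n with leading dimension lda >= m.
   Hypothetical helper for illustration; not a CUBLAS function. */
static void sgemv_ref(int trans, int m, int n, float alpha,
                      const float *A, int lda,
                      const float *x, float beta, float *y)
{
    int rows = trans ? n : m;   /* length of y */
    int cols = trans ? m : n;   /* length of x */
    for (int i = 0; i < rows; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < cols; ++j) {
            float a = trans ? A[IDX(j, i, lda)] : A[IDX(i, j, lda)];
            acc += a * x[j];
        }
        y[i] = alpha * acc + beta * y[i];
    }
}
```

Note that, exactly as in the CUBLAS call, the vector lengths swap between m and n depending on whether A is transposed.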

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| trans | | input | operation op(A) that is non- or (conj.) transpose. |
| m | | input | number of rows of matrix A. |
| n | | input | number of columns of matrix A. |
| alpha | host or device | input | <type> scalar used for multiplication. |
| A | device | input | <type> array of dimension lda x n with lda >= max(1,m). |
| lda | | input | leading dimension of two-dimensional array used to store matrix A. |
| x | device | input | <type> vector with n elements if trans == CUBLAS_OP_N and m elements otherwise. |
| incx | | input | stride between consecutive elements of x. |
| beta | host or device | input | <type> scalar used for multiplication; if beta == 0 then y does not have to be a valid input. |
| y | device | in/out | <type> vector with m elements if trans == CUBLAS_OP_N and n elements otherwise. |
| incy | | input | stride between consecutive elements of y. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters m,n<0 or incx,incy=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.3. cublas<t>ger()

```
cublasStatus_t  cublasSger(cublasHandle_t handle, int m, int n,
                           const float           *alpha,
                           const float           *x, int incx,
                           const float           *y, int incy,
                           float           *A, int lda)
cublasStatus_t  cublasDger(cublasHandle_t handle, int m, int n,
                           const double          *alpha,
                           const double          *x, int incx,
                           const double          *y, int incy,
                           double          *A, int lda)
cublasStatus_t cublasCgeru(cublasHandle_t handle, int m, int n,
                           const cuComplex       *alpha,
                           const cuComplex       *x, int incx,
                           const cuComplex       *y, int incy,
                           cuComplex       *A, int lda)
cublasStatus_t cublasCgerc(cublasHandle_t handle, int m, int n,
                           const cuComplex       *alpha,
                           const cuComplex       *x, int incx,
                           const cuComplex       *y, int incy,
                           cuComplex       *A, int lda)
cublasStatus_t cublasZgeru(cublasHandle_t handle, int m, int n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *x, int incx,
                           const cuDoubleComplex *y, int incy,
                           cuDoubleComplex *A, int lda)
cublasStatus_t cublasZgerc(cublasHandle_t handle, int m, int n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *x, int incx,
                           const cuDoubleComplex *y, int incy,
                           cuDoubleComplex *A, int lda)
```

This function performs the rank-1 update

$A=\alpha \mathbf{\text{x}}{\mathbf{\text{y}}}^{T}+A\quad \text{for ger() and geru(), or}\quad A=\alpha \mathbf{\text{x}}{\mathbf{\text{y}}}^{H}+A\quad \text{for gerc()}$

where $A$ is an $m×n$ matrix stored in column-major format, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ is a scalar.
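On the host, the real-valued update $A = \alpha xy^{T} + A$ amounts to the following sketch (illustrative C; `sger_ref` is a hypothetical name, not a CUBLAS function):

```c
/* Host reference for A = alpha*x*y^T + A (the real-valued ger() update).
   A is m x n, column-major, with leading dimension lda >= m.
   Illustration only; not part of CUBLAS. */
static void sger_ref(int m, int n, float alpha,
                     const float *x, const float *y, float *A, int lda)
{
    for (int j = 0; j < n; ++j)          /* column-major: iterate columns outermost */
        for (int i = 0; i < m; ++i)
            A[j * lda + i] += alpha * x[i] * y[j];
}
```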

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| m | | input | number of rows of matrix A. |
| n | | input | number of columns of matrix A. |
| alpha | host or device | input | <type> scalar used for multiplication. |
| x | device | input | <type> vector with m elements. |
| incx | | input | stride between consecutive elements of x. |
| y | device | input | <type> vector with n elements. |
| incy | | input | stride between consecutive elements of y. |
| A | device | in/out | <type> array of dimension lda x n with lda >= max(1,m). |
| lda | | input | leading dimension of two-dimensional array used to store matrix A. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters m,n<0 or incx,incy=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.4. cublas<t>sbmv()

```
cublasStatus_t cublasSsbmv(cublasHandle_t handle, cublasFillMode_t uplo,
                           int n, int k, const float  *alpha,
                           const float  *A, int lda,
                           const float  *x, int incx,
                           const float  *beta, float *y, int incy)
cublasStatus_t cublasDsbmv(cublasHandle_t handle, cublasFillMode_t uplo,
                           int n, int k, const double *alpha,
                           const double *A, int lda,
                           const double *x, int incx,
                           const double *beta, double *y, int incy)
```

This function performs the symmetric banded matrix-vector multiplication

$\mathbf{\text{y}}=\alpha A\mathbf{\text{x}}+\beta \mathbf{\text{y}}$

where $A$ is an $n×n$ symmetric banded matrix with $k$ subdiagonals and superdiagonals, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ and $\beta$ are scalars.

If uplo == CUBLAS_FILL_MODE_LOWER then the symmetric banded matrix $A$ is stored column by column, with the main diagonal of the matrix stored in row 1, the first subdiagonal in row 2 (starting at first position), the second subdiagonal in row 3 (starting at first position), and so on. In general, the element $A\left(i,j\right)$ is stored in the memory location A(1+i-j,j) for $j=1,\dots ,n$ and $i\in \left[j,min\left(n,j+k\right)\right]$ . Also, the elements in the array A that do not conceptually correspond to the elements in the banded matrix (the bottom right $k×k$ triangle) are not referenced.

If uplo == CUBLAS_FILL_MODE_UPPER then the symmetric banded matrix $A$ is stored column by column, with the main diagonal of the matrix stored in row k+1, the first superdiagonal in row k (starting at second position), the second superdiagonal in row k-1 (starting at third position), and so on. In general, the element $A\left(i,j\right)$ is stored in the memory location A(1+k+i-j,j) for $j=1,\dots ,n$ and $i\in \left[max\left(1,j-k\right),j\right]$ . Also, the elements in the array A that do not conceptually correspond to the elements in the banded matrix (the top left $k×k$ triangle) are not referenced.
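In 0-based C terms, the two storage schemes above reduce to small index helpers (hypothetical names, shown only to make the A(1+i-j,j) and A(1+k+i-j,j) formulas concrete):

```c
/* 0-based location of element (i,j) of a banded matrix stored as described
   above, in a column-major array with leading dimension lda >= k+1.
   Lower mode: the diagonal of each column sits in row 0; subdiagonal d in row d. */
static int band_index_lower(int i, int j, int lda)
{
    return (i - j) + j * lda;       /* valid for j <= i <= min(n-1, j+k) */
}

/* Upper mode: the diagonal of each column sits in row k; superdiagonal d in row k-d. */
static int band_index_upper(int i, int j, int k, int lda)
{
    return (k + i - j) + j * lda;   /* valid for max(0, j-k) <= i <= j */
}
```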

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| uplo | | input | indicates whether the lower or upper part of matrix A is stored; the other symmetric part is not referenced and is inferred from the stored elements. |
| n | | input | number of rows and columns of matrix A. |
| k | | input | number of sub- and super-diagonals of matrix A. |
| alpha | host or device | input | <type> scalar used for multiplication. |
| A | device | input | <type> array of dimension lda x n with lda >= k+1. |
| lda | | input | leading dimension of two-dimensional array used to store matrix A. |
| x | device | input | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |
| beta | host or device | input | <type> scalar used for multiplication; if beta == 0 then y does not have to be a valid input. |
| y | device | in/out | <type> vector with n elements. |
| incy | | input | stride between consecutive elements of y. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n,k<0 or incx,incy=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.5. cublas<t>spmv()

```
cublasStatus_t cublasSspmv(cublasHandle_t handle, cublasFillMode_t uplo,
                           int n, const float  *alpha, const float  *AP,
                           const float  *x, int incx, const float  *beta,
                           float  *y, int incy)
cublasStatus_t cublasDspmv(cublasHandle_t handle, cublasFillMode_t uplo,
                           int n, const double *alpha, const double *AP,
                           const double *x, int incx, const double *beta,
                           double *y, int incy)
```

This function performs the symmetric packed matrix-vector multiplication

$\mathbf{\text{y}}=\alpha A\mathbf{\text{x}}+\beta \mathbf{\text{y}}$

where $A$ is an $n×n$ symmetric matrix stored in packed format, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ and $\beta$ are scalars.

If uplo == CUBLAS_FILL_MODE_LOWER then the elements in the lower triangular part of the symmetric matrix $A$ are packed together column by column without gaps, so that the element $A\left(i,j\right)$ is stored in the memory location AP[i+((2*n-j+1)*j)/2] for $j=1,\dots ,n$ and $i\ge j$ . Consequently, the packed format requires only $\frac{n\left(n+1\right)}{2}$ elements for storage.

If uplo == CUBLAS_FILL_MODE_UPPER then the elements in the upper triangular part of the symmetric matrix $A$ are packed together column by column without gaps, so that the element $A\left(i,j\right)$ is stored in the memory location AP[i+(j*(j+1))/2] for $j=1,\dots ,n$ and $i\le j$ . Consequently, the packed format requires only $\frac{n\left(n+1\right)}{2}$ elements for storage.
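The packed layout can be made concrete with 0-based index helpers (hypothetical names; derived directly from the column-by-column packing, so the lower-triangle expression is the 0-based equivalent of the formula above):

```c
/* 0-based index into AP of element (i,j) of an n x n matrix packed column by
   column without gaps.
   Lower triangle (i >= j): column j holds n-j elements, starting after the
   j*n - j*(j-1)/2 elements of the previous columns. */
static int pack_index_lower(int i, int j, int n)
{
    return i + ((2 * n - j - 1) * j) / 2;
}

/* Upper triangle (i <= j): column j holds j+1 elements, starting at j*(j+1)/2. */
static int pack_index_upper(int i, int j)
{
    return i + (j * (j + 1)) / 2;
}
```

Both triangles therefore fit in n(n+1)/2 array elements, as stated above.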

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| uplo | | input | indicates whether the lower or upper part of matrix $A$ is stored; the other symmetric part is not referenced and is inferred from the stored elements. |
| n | | input | number of rows and columns of matrix $A$ . |
| alpha | host or device | input | <type> scalar used for multiplication. |
| AP | device | input | <type> array with $A$ stored in packed format. |
| x | device | input | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |
| beta | host or device | input | <type> scalar used for multiplication; if beta == 0 then y does not have to be a valid input. |
| y | device | in/out | <type> vector with n elements. |
| incy | | input | stride between consecutive elements of y. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n<0 or incx,incy=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.6. cublas<t>spr()

```
cublasStatus_t cublasSspr(cublasHandle_t handle, cublasFillMode_t uplo,
                          int n, const float  *alpha,
                          const float  *x, int incx, float  *AP)
cublasStatus_t cublasDspr(cublasHandle_t handle, cublasFillMode_t uplo,
                          int n, const double *alpha,
                          const double *x, int incx, double *AP)
```

This function performs the packed symmetric rank-1 update

$A=\alpha \mathbf{\text{x}}{\mathbf{\text{x}}}^{T}+A$

where $A$ is an $n×n$ symmetric matrix stored in packed format, $\mathbf{x}$ is a vector, and $\alpha$ is a scalar.

If uplo == CUBLAS_FILL_MODE_LOWER then the elements in the lower triangular part of the symmetric matrix $A$ are packed together column by column without gaps, so that the element $A\left(i,j\right)$ is stored in the memory location AP[i+((2*n-j+1)*j)/2] for $j=1,\dots ,n$ and $i\ge j$ . Consequently, the packed format requires only $\frac{n\left(n+1\right)}{2}$ elements for storage.

If uplo == CUBLAS_FILL_MODE_UPPER then the elements in the upper triangular part of the symmetric matrix $A$ are packed together column by column without gaps, so that the element $A\left(i,j\right)$ is stored in the memory location AP[i+(j*(j+1))/2] for $j=1,\dots ,n$ and $i\le j$ . Consequently, the packed format requires only $\frac{n\left(n+1\right)}{2}$ elements for storage.
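Putting the packing and the update together, a host sketch of the lower-mode rank-1 update looks like this (illustrative C; `sspr_lower_ref` is a hypothetical name and uses the 0-based form of the lower-triangle packing):

```c
/* Host reference for the packed symmetric rank-1 update A = alpha*x*x^T + A,
   lower triangle packed column by column (0-based: element (i,j), i >= j,
   lives at AP[i + ((2*n - j - 1)*j)/2]). Illustration only; not a CUBLAS function. */
static void sspr_lower_ref(int n, float alpha, const float *x, float *AP)
{
    for (int j = 0; j < n; ++j)
        for (int i = j; i < n; ++i)   /* only the stored (lower) triangle is touched */
            AP[i + ((2 * n - j - 1) * j) / 2] += alpha * x[i] * x[j];
}
```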

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| uplo | | input | indicates whether the lower or upper part of matrix $A$ is stored; the other symmetric part is not referenced and is inferred from the stored elements. |
| n | | input | number of rows and columns of matrix $A$ . |
| alpha | host or device | input | <type> scalar used for multiplication. |
| x | device | input | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |
| AP | device | in/out | <type> array with $A$ stored in packed format. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n<0 or incx=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.7. cublas<t>spr2()

```
cublasStatus_t cublasSspr2(cublasHandle_t handle, cublasFillMode_t uplo,
                           int n, const float  *alpha,
                           const float  *x, int incx,
                           const float  *y, int incy, float  *AP)
cublasStatus_t cublasDspr2(cublasHandle_t handle, cublasFillMode_t uplo,
                           int n, const double *alpha,
                           const double *x, int incx,
                           const double *y, int incy, double *AP)
```

This function performs the packed symmetric rank-2 update

$A=\alpha \left(\mathbf{\text{x}}{\mathbf{\text{y}}}^{T}+\mathbf{\text{y}}{\mathbf{\text{x}}}^{T}\right)+A$

where $A$ is an $n×n$ symmetric matrix stored in packed format, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ is a scalar.

If uplo == CUBLAS_FILL_MODE_LOWER then the elements in the lower triangular part of the symmetric matrix $A$ are packed together column by column without gaps, so that the element $A\left(i,j\right)$ is stored in the memory location AP[i+((2*n-j+1)*j)/2] for $j=1,\dots ,n$ and $i\ge j$ . Consequently, the packed format requires only $\frac{n\left(n+1\right)}{2}$ elements for storage.

If uplo == CUBLAS_FILL_MODE_UPPER then the elements in the upper triangular part of the symmetric matrix $A$ are packed together column by column without gaps, so that the element $A\left(i,j\right)$ is stored in the memory location AP[i+(j*(j+1))/2] for $j=1,\dots ,n$ and $i\le j$ . Consequently, the packed format requires only $\frac{n\left(n+1\right)}{2}$ elements for storage.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| uplo | | input | indicates whether the lower or upper part of matrix $A$ is stored; the other symmetric part is not referenced and is inferred from the stored elements. |
| n | | input | number of rows and columns of matrix $A$ . |
| alpha | host or device | input | <type> scalar used for multiplication. |
| x | device | input | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |
| y | device | input | <type> vector with n elements. |
| incy | | input | stride between consecutive elements of y. |
| AP | device | in/out | <type> array with $A$ stored in packed format. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n<0 or incx,incy=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.8. cublas<t>symv()

```
cublasStatus_t cublasSsymv(cublasHandle_t handle, cublasFillMode_t uplo,
                           int n, const float           *alpha,
                           const float           *A, int lda,
                           const float           *x, int incx, const float           *beta,
                           float           *y, int incy)
cublasStatus_t cublasDsymv(cublasHandle_t handle, cublasFillMode_t uplo,
                           int n, const double          *alpha,
                           const double          *A, int lda,
                           const double          *x, int incx, const double          *beta,
                           double          *y, int incy)
cublasStatus_t cublasCsymv(cublasHandle_t handle, cublasFillMode_t uplo,
                           int n, const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           const cuComplex       *x, int incx, const cuComplex       *beta,
                           cuComplex       *y, int incy)
cublasStatus_t cublasZsymv(cublasHandle_t handle, cublasFillMode_t uplo,
                           int n, const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           const cuDoubleComplex *x, int incx, const cuDoubleComplex *beta,
                           cuDoubleComplex *y, int incy)
```

This function performs the symmetric matrix-vector multiplication.

$\mathbf{\text{y}}=\alpha A\mathbf{\text{x}}+\beta \mathbf{\text{y}}$

where $A$ is an $n×n$ symmetric matrix stored in lower or upper mode, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ and $\beta$ are scalars.

This function has an alternate faster implementation using atomics that can be enabled with cublasSetAtomicsMode().

Please see the section on the function cublasSetAtomicsMode() for more details about the usage of atomics.
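The fill-mode convention — only one triangle is read, the other is inferred by symmetry — can be sketched on the host as follows (illustrative C; `ssymv_lower_ref` is a hypothetical name covering the CUBLAS_FILL_MODE_LOWER case):

```c
/* Host reference for y = alpha*A*x + beta*y with symmetric A supplied only in
   its lower triangle (column-major, lda >= n); the upper part is inferred
   via A(i,j) = A(j,i). Illustration only; not a CUBLAS function. */
static void ssymv_lower_ref(int n, float alpha, const float *A, int lda,
                            const float *x, float beta, float *y)
{
    for (int i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j) {
            /* read from the stored (lower) triangle only */
            float a = (i >= j) ? A[i + j * lda] : A[j + i * lda];
            acc += a * x[j];
        }
        y[i] = alpha * acc + beta * y[i];
    }
}
```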

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| uplo | | input | indicates whether the lower or upper part of matrix A is stored; the other symmetric part is not referenced and is inferred from the stored elements. |
| n | | input | number of rows and columns of matrix A. |
| alpha | host or device | input | <type> scalar used for multiplication. |
| A | device | input | <type> array of dimension lda x n with lda >= max(1,n). |
| lda | | input | leading dimension of two-dimensional array used to store matrix A. |
| x | device | input | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |
| beta | host or device | input | <type> scalar used for multiplication; if beta == 0 then y does not have to be a valid input. |
| y | device | in/out | <type> vector with n elements. |
| incy | | input | stride between consecutive elements of y. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n<0 or incx,incy=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.9. cublas<t>syr()

```
cublasStatus_t cublasSsyr(cublasHandle_t handle, cublasFillMode_t uplo,
                          int n, const float           *alpha,
                          const float           *x, int incx, float           *A, int lda)
cublasStatus_t cublasDsyr(cublasHandle_t handle, cublasFillMode_t uplo,
                          int n, const double          *alpha,
                          const double          *x, int incx, double          *A, int lda)
cublasStatus_t cublasCsyr(cublasHandle_t handle, cublasFillMode_t uplo,
                          int n, const cuComplex       *alpha,
                          const cuComplex       *x, int incx, cuComplex       *A, int lda)
cublasStatus_t cublasZsyr(cublasHandle_t handle, cublasFillMode_t uplo,
                          int n, const cuDoubleComplex *alpha,
                          const cuDoubleComplex *x, int incx, cuDoubleComplex *A, int lda)
```

This function performs the symmetric rank-1 update

$A=\alpha \mathbf{\text{x}}{\mathbf{\text{x}}}^{T}+A$

where $A$ is an $n×n$ symmetric matrix stored in column-major format, $\mathbf{x}$ is a vector, and $\alpha$ is a scalar.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| uplo | | input | indicates whether the lower or upper part of matrix A is stored; the other symmetric part is not referenced and is inferred from the stored elements. |
| n | | input | number of rows and columns of matrix A. |
| alpha | host or device | input | <type> scalar used for multiplication. |
| x | device | input | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |
| A | device | in/out | <type> array of dimension lda x n with lda >= max(1,n). |
| lda | | input | leading dimension of two-dimensional array used to store matrix A. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n<0 or incx=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.10. cublas<t>syr2()

```
cublasStatus_t cublasSsyr2(cublasHandle_t handle, cublasFillMode_t uplo, int n,
                           const float           *alpha, const float           *x, int incx,
                           const float           *y, int incy, float           *A, int lda)
cublasStatus_t cublasDsyr2(cublasHandle_t handle, cublasFillMode_t uplo, int n,
                           const double          *alpha, const double          *x, int incx,
                           const double          *y, int incy, double          *A, int lda)
cublasStatus_t cublasCsyr2(cublasHandle_t handle, cublasFillMode_t uplo, int n,
                           const cuComplex       *alpha, const cuComplex       *x, int incx,
                           const cuComplex       *y, int incy, cuComplex       *A, int lda)
cublasStatus_t cublasZsyr2(cublasHandle_t handle, cublasFillMode_t uplo, int n,
                           const cuDoubleComplex *alpha, const cuDoubleComplex *x, int incx,
                           const cuDoubleComplex *y, int incy, cuDoubleComplex *A, int lda)
```

This function performs the symmetric rank-2 update

$A=\alpha \left(\mathbf{\text{x}}{\mathbf{\text{y}}}^{T}+\mathbf{\text{y}}{\mathbf{\text{x}}}^{T}\right)+A$

where $A$ is an $n×n$ symmetric matrix stored in column-major format, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ is a scalar.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| uplo | | input | indicates whether the lower or upper part of matrix A is stored; the other symmetric part is not referenced and is inferred from the stored elements. |
| n | | input | number of rows and columns of matrix A. |
| alpha | host or device | input | <type> scalar used for multiplication. |
| x | device | input | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |
| y | device | input | <type> vector with n elements. |
| incy | | input | stride between consecutive elements of y. |
| A | device | in/out | <type> array of dimension lda x n with lda >= max(1,n). |
| lda | | input | leading dimension of two-dimensional array used to store matrix A. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n<0 or incx,incy=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |


### 6.11. cublas<t>tbmv()

```
cublasStatus_t cublasStbmv(cublasHandle_t handle, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int n, int k, const float           *A, int lda,
                           float           *x, int incx)
cublasStatus_t cublasDtbmv(cublasHandle_t handle, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int n, int k, const double          *A, int lda,
                           double          *x, int incx)
cublasStatus_t cublasCtbmv(cublasHandle_t handle, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int n, int k, const cuComplex       *A, int lda,
                           cuComplex       *x, int incx)
cublasStatus_t cublasZtbmv(cublasHandle_t handle, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int n, int k, const cuDoubleComplex *A, int lda,
                           cuDoubleComplex *x, int incx)
```

This function performs the triangular banded matrix-vector multiplication

$\mathbf{\text{x}}=\text{op}\left(A\right)\mathbf{\text{x}}$

where $A$ is a triangular banded matrix, and $\mathbf{x}$ is a vector. Also, for matrix $A$

$\text{op}\left(A\right)=\begin{cases}A & \text{if trans == CUBLAS_OP_N}\\ A^{T} & \text{if trans == CUBLAS_OP_T}\\ A^{H} & \text{if trans == CUBLAS_OP_C}\end{cases}$

If uplo == CUBLAS_FILL_MODE_LOWER then the triangular banded matrix $A$ is stored column by column, with the main diagonal of the matrix stored in row 1, the first subdiagonal in row 2 (starting at first position), the second subdiagonal in row 3 (starting at first position), and so on. In general, the element $A\left(i,j\right)$ is stored in the memory location A(1+i-j,j) for $j=1,\dots ,n$ and $i\in \left[j,min\left(n,j+k\right)\right]$ . Also, the elements in the array A that do not conceptually correspond to the elements in the banded matrix (the bottom right $k×k$ triangle) are not referenced.

If uplo == CUBLAS_FILL_MODE_UPPER then the triangular banded matrix $A$ is stored column by column, with the main diagonal of the matrix stored in row k+1, the first superdiagonal in row k (starting at second position), the second superdiagonal in row k-1 (starting at third position), and so on. In general, the element $A\left(i,j\right)$ is stored in the memory location A(1+k+i-j,j) for $j=1,\dots ,n$ and $i\in \left[max\left(1,j-k\right),j\right]$ . Also, the elements in the array A that do not conceptually correspond to the elements in the banded matrix (the top left $k×k$ triangle) are not referenced.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| uplo | | input | indicates whether the lower or upper part of matrix A is stored; the other part is not referenced and is inferred from the stored elements. |
| trans | | input | operation op(A) that is non- or (conj.) transpose. |
| diag | | input | indicates whether the elements on the main diagonal of matrix A are unity and should not be accessed. |
| n | | input | number of rows and columns of matrix A. |
| k | | input | number of sub- and super-diagonals of matrix A. |
| A | device | input | <type> array of dimension lda x n with lda >= k+1. |
| lda | | input | leading dimension of two-dimensional array used to store matrix A. |
| x | device | in/out | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n,k<0 or incx=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_ALLOC_FAILED | the allocation of internal scratch memory failed |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.12. cublas<t>tbsv()

```
cublasStatus_t cublasStbsv(cublasHandle_t handle, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int n, int k, const float           *A, int lda,
                           float           *x, int incx)
cublasStatus_t cublasDtbsv(cublasHandle_t handle, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int n, int k, const double          *A, int lda,
                           double          *x, int incx)
cublasStatus_t cublasCtbsv(cublasHandle_t handle, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int n, int k, const cuComplex       *A, int lda,
                           cuComplex       *x, int incx)
cublasStatus_t cublasZtbsv(cublasHandle_t handle, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int n, int k, const cuDoubleComplex *A, int lda,
                           cuDoubleComplex *x, int incx)
```

This function solves the triangular banded linear system with a single right-hand-side

$\text{op}\left(A\right)\mathbf{\text{x}}=\mathbf{\text{b}}$

where $A$ is a triangular banded matrix, and $\mathbf{x}$ and $\mathbf{b}$ are vectors. Also, for matrix $A$

$\text{op}\left(A\right)=\begin{cases}A & \text{if trans == CUBLAS_OP_N}\\ A^{T} & \text{if trans == CUBLAS_OP_T}\\ A^{H} & \text{if trans == CUBLAS_OP_C}\end{cases}$

The solution $\mathbf{x}$ overwrites the right-hand-sides $\mathbf{b}$ on exit.

No test for singularity or near-singularity is included in this function.

If uplo == CUBLAS_FILL_MODE_LOWER then the triangular banded matrix $A$ is stored column by column, with the main diagonal of the matrix stored in row 1, the first subdiagonal in row 2 (starting at first position), the second subdiagonal in row 3 (starting at first position), and so on. In general, the element $A\left(i,j\right)$ is stored in the memory location A(1+i-j,j) for $j=1,\dots ,n$ and $i\in \left[j,min\left(n,j+k\right)\right]$ . Also, the elements in the array A that do not conceptually correspond to the elements in the banded matrix (the bottom right $k×k$ triangle) are not referenced.

If uplo == CUBLAS_FILL_MODE_UPPER then the triangular banded matrix $A$ is stored column by column, with the main diagonal of the matrix stored in row k+1, the first superdiagonal in row k (starting at second position), the second superdiagonal in row k-1 (starting at third position), and so on. In general, the element $A\left(i,j\right)$ is stored in the memory location A(1+k+i-j,j) for $j=1,\dots ,n$ and $i\in \left[max\left(1,j-k\right),j\right]$ . Also, the elements in the array A that do not conceptually correspond to the elements in the banded matrix (the top left $k×k$ triangle) are not referenced.
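For the lower fill mode, the in-place solve described above is plain forward substitution over the banded layout (illustrative C; `stbsv_lower_ref` is a hypothetical name and assumes a non-unit diagonal and no transpose):

```c
/* Host reference: solve L*x = b in place, where L is lower-triangular banded
   with k subdiagonals, stored as described above (0-based: element (i,j) at
   A[(i - j) + j*lda], lda >= k+1, diagonal in row 0 of each column).
   Non-unit diagonal; like the CUBLAS routine, no singularity check is made.
   Illustration only; not a CUBLAS function. */
static void stbsv_lower_ref(int n, int k, const float *A, int lda, float *x)
{
    for (int i = 0; i < n; ++i) {
        float s = x[i];
        int j0 = (i - k > 0) ? i - k : 0;   /* only the k nearest columns contribute */
        for (int j = j0; j < i; ++j)
            s -= A[(i - j) + j * lda] * x[j];
        x[i] = s / A[i * lda];              /* divide by diagonal element (i,i) */
    }
}
```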

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| uplo | | input | indicates whether the lower or upper part of matrix A is stored; the other part is not referenced and is inferred from the stored elements. |
| trans | | input | operation op(A) that is non- or (conj.) transpose. |
| diag | | input | indicates whether the elements on the main diagonal of matrix A are unity and should not be accessed. |
| n | | input | number of rows and columns of matrix A. |
| k | | input | number of sub- and super-diagonals of matrix A. |
| A | device | input | <type> array of dimension lda x n with lda >= k+1. |
| lda | | input | leading dimension of two-dimensional array used to store matrix A. |
| x | device | in/out | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n,k<0 or incx=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.13. cublas<t>tpmv()

```
cublasStatus_t cublasStpmv(cublasHandle_t handle, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int n, const float           *AP,
                           float           *x, int incx)
cublasStatus_t cublasDtpmv(cublasHandle_t handle, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int n, const double          *AP,
                           double          *x, int incx)
cublasStatus_t cublasCtpmv(cublasHandle_t handle, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int n, const cuComplex       *AP,
                           cuComplex       *x, int incx)
cublasStatus_t cublasZtpmv(cublasHandle_t handle, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int n, const cuDoubleComplex *AP,
                           cuDoubleComplex *x, int incx)
```

This function performs the triangular packed matrix-vector multiplication

$\mathbf{\text{x}}=\text{op}\left(A\right)\mathbf{\text{x}}$

where $A$ is a triangular matrix stored in packed format, and $\mathbf{x}$ is a vector. Also, for matrix $A$

$\text{op}\left(A\right)=\begin{cases}A & \text{if trans == CUBLAS_OP_N}\\ A^{T} & \text{if trans == CUBLAS_OP_T}\\ A^{H} & \text{if trans == CUBLAS_OP_C}\end{cases}$

If uplo == CUBLAS_FILL_MODE_LOWER then the elements in the lower triangular part of the triangular matrix $A$ are packed together column by column without gaps, so that the element $A\left(i,j\right)$ is stored in the memory location AP[i+((2*n-j+1)*j)/2] for $j=1,\dots ,n$ and $i\ge j$ . Consequently, the packed format requires only $\frac{n\left(n+1\right)}{2}$ elements for storage.

If uplo == CUBLAS_FILL_MODE_UPPER then the elements in the upper triangular part of the triangular matrix $A$ are packed together column by column without gaps, so that the element $A\left(i,j\right)$ is stored in the memory location AP[i+(j*(j+1))/2] for $j=1,\dots ,n$ and $i\le j$ . Consequently, the packed format requires only $\frac{n\left(n+1\right)}{2}$ elements for storage.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | | input | handle to the CUBLAS library context. |
| uplo | | input | indicates whether the lower or upper part of matrix A is stored; the other part is not referenced and is inferred from the stored elements. |
| trans | | input | operation op(A) that is non- or (conj.) transpose. |
| diag | | input | indicates whether the elements on the main diagonal of matrix A are unity and should not be accessed. |
| n | | input | number of rows and columns of matrix A. |
| AP | device | input | <type> array with $A$ stored in packed format. |
| x | device | in/out | <type> vector with n elements. |
| incx | | input | stride between consecutive elements of x. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n<0 or incx=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_ALLOC_FAILED | the allocation of internal scratch memory failed |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.14. cublas<t>tpsv()

```cublasStatus_t cublasStpsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const float           *AP,
float           *x, int incx)
cublasStatus_t cublasDtpsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const double          *AP,
double          *x, int incx)
cublasStatus_t cublasCtpsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const cuComplex       *AP,
cuComplex       *x, int incx)
cublasStatus_t cublasZtpsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const cuDoubleComplex *AP,
cuDoubleComplex *x, int incx)```

This function solves the packed triangular linear system with a single right-hand-side

$\text{op}\left(A\right)\mathbf{\text{x}}=\mathbf{\text{b}}$

where $A$ is a triangular matrix stored in packed format, and $\mathbf{x}$ and $\mathbf{b}$ are vectors. Also, for matrix $A$

$\text{op}\left(A\right)=\begin{cases}A&\text{if trans == CUBLAS_OP_N}\\{A}^{T}&\text{if trans == CUBLAS_OP_T}\\{A}^{H}&\text{if trans == CUBLAS_OP_C}\end{cases}$

The solution $\mathbf{x}$ overwrites the right-hand-sides $\mathbf{b}$ on exit.

No test for singularity or near-singularity is included in this function.

If uplo == CUBLAS_FILL_MODE_LOWER then the elements in the lower triangular part of the triangular matrix $A$ are packed together column by column without gaps, so that the element $A\left(i,j\right)$ is stored in the memory location AP[i+((2*n-j+1)*j)/2] for $j=1,\dots ,n$ and $i\ge j$ . Consequently, the packed format requires only $\frac{n\left(n+1\right)}{2}$ elements for storage.

If uplo == CUBLAS_FILL_MODE_UPPER then the elements in the upper triangular part of the triangular matrix $A$ are packed together column by column without gaps, so that the element $A\left(i,j\right)$ is stored in the memory location AP[i+(j*(j+1))/2] for $j=1,\dots ,n$ and $i\le j$ . Consequently, the packed format requires only $\frac{n\left(n+1\right)}{2}$ elements for storage.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle |  | input | handle to the CUBLAS library context. |
| uplo |  | input | indicates if matrix A lower or upper part is stored, the other part is not referenced and is inferred from the stored elements. |
| trans |  | input | operation op(A) that is non- or (conj.) transpose. |
| diag |  | input | indicates if the elements on the main diagonal of matrix A are unity and should not be accessed. |
| n |  | input | number of rows and columns of matrix A. |
| AP | device | input | <type> array with A stored in packed format. |
| x | device | in/out | <type> vector with n elements. |
| incx |  | input | stride between consecutive elements of x. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n<0 or incx=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.15. cublas<t>trmv()

```cublasStatus_t cublasStrmv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const float           *A, int lda,
float           *x, int incx)
cublasStatus_t cublasDtrmv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const double          *A, int lda,
double          *x, int incx)
cublasStatus_t cublasCtrmv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const cuComplex       *A, int lda,
cuComplex       *x, int incx)
cublasStatus_t cublasZtrmv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const cuDoubleComplex *A, int lda,
cuDoubleComplex *x, int incx)```

This function performs the triangular matrix-vector multiplication

$\mathbf{\text{x}}=\text{op}\left(A\right)\mathbf{\text{x}}$

where $A$ is a triangular matrix stored in lower or upper mode with or without the main diagonal, and $\mathbf{x}$ is a vector. Also, for matrix $A$

$\text{op}\left(A\right)=\begin{cases}A&\text{if trans == CUBLAS_OP_N}\\{A}^{T}&\text{if trans == CUBLAS_OP_T}\\{A}^{H}&\text{if trans == CUBLAS_OP_C}\end{cases}$

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle |  | input | handle to the CUBLAS library context. |
| uplo |  | input | indicates if matrix A lower or upper part is stored, the other part is not referenced and is inferred from the stored elements. |
| trans |  | input | operation op(A) that is non- or (conj.) transpose. |
| diag |  | input | indicates if the elements on the main diagonal of matrix A are unity and should not be accessed. |
| n |  | input | number of rows and columns of matrix A. |
| A | device | input | <type> array of dimensions lda x n, with lda>=max(1,n). |
| lda |  | input | leading dimension of two-dimensional array used to store matrix A. |
| x | device | in/out | <type> vector with n elements. |
| incx |  | input | stride between consecutive elements of x. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n<0 or incx=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_ALLOC_FAILED | the allocation of internal scratch memory failed |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.16. cublas<t>trsv()

```cublasStatus_t cublasStrsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const float           *A, int lda,
float           *x, int incx)
cublasStatus_t cublasDtrsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const double          *A, int lda,
double          *x, int incx)
cublasStatus_t cublasCtrsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const cuComplex       *A, int lda,
cuComplex       *x, int incx)
cublasStatus_t cublasZtrsv(cublasHandle_t handle, cublasFillMode_t uplo,
cublasOperation_t trans, cublasDiagType_t diag,
int n, const cuDoubleComplex *A, int lda,
cuDoubleComplex *x, int incx)```

This function solves the triangular linear system with a single right-hand-side

$\text{op}\left(A\right)\mathbf{\text{x}}=\mathbf{\text{b}}$

where $A$ is a triangular matrix stored in lower or upper mode with or without the main diagonal, and $\mathbf{x}$ and $\mathbf{b}$ are vectors. Also, for matrix $A$

$\text{op}\left(A\right)=\begin{cases}A&\text{if trans == CUBLAS_OP_N}\\{A}^{T}&\text{if trans == CUBLAS_OP_T}\\{A}^{H}&\text{if trans == CUBLAS_OP_C}\end{cases}$

The solution $\mathbf{x}$ overwrites the right-hand-sides $\mathbf{b}$ on exit.

No test for singularity or near-singularity is included in this function.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle |  | input | handle to the CUBLAS library context. |
| uplo |  | input | indicates if matrix A lower or upper part is stored, the other part is not referenced and is inferred from the stored elements. |
| trans |  | input | operation op(A) that is non- or (conj.) transpose. |
| diag |  | input | indicates if the elements on the main diagonal of matrix A are unity and should not be accessed. |
| n |  | input | number of rows and columns of matrix A. |
| A | device | input | <type> array of dimension lda x n, with lda>=max(1,n). |
| lda |  | input | leading dimension of two-dimensional array used to store matrix A. |
| x | device | in/out | <type> vector with n elements. |
| incx |  | input | stride between consecutive elements of x. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n<0 or incx=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.17. cublas<t>hemv()

```cublasStatus_t cublasChemv(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const cuComplex       *alpha,
const cuComplex       *A, int lda,
const cuComplex       *x, int incx,
const cuComplex       *beta,
cuComplex       *y, int incy)
cublasStatus_t cublasZhemv(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *x, int incx,
const cuDoubleComplex *beta,
cuDoubleComplex *y, int incy)```

This function performs the Hermitian matrix-vector multiplication

$\mathbf{\text{y}}=\alpha A\mathbf{\text{x}}+\beta \mathbf{\text{y}}$

where $A$ is an $n×n$ Hermitian matrix stored in lower or upper mode, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ and $\beta$ are scalars.

This function has an alternate faster implementation using atomics that can be enabled with cublasSetAtomicsMode().

Please see the section on cublasSetAtomicsMode() for more details about the usage of atomics.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle |  | input | handle to the CUBLAS library context. |
| uplo |  | input | indicates if matrix A lower or upper part is stored, the other Hermitian part is not referenced and is inferred from the stored elements. |
| n |  | input | number of rows and columns of matrix A. |
| alpha | host or device | input | <type> scalar used for multiplication. |
| A | device | input | <type> array of dimension lda x n, with lda>=max(1,n). The imaginary parts of the diagonal elements are assumed to be zero. |
| lda |  | input | leading dimension of two-dimensional array used to store matrix A. |
| x | device | input | <type> vector with n elements. |
| incx |  | input | stride between consecutive elements of x. |
| beta | host or device | input | <type> scalar used for multiplication, if beta==0 then y does not have to be a valid input. |
| y | device | in/out | <type> vector with n elements. |
| incy |  | input | stride between consecutive elements of y. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n<0 or incx,incy=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.18. cublas<t>hbmv()

```cublasStatus_t cublasChbmv(cublasHandle_t handle, cublasFillMode_t uplo,
int n, int k, const cuComplex       *alpha,
const cuComplex       *A, int lda,
const cuComplex       *x, int incx,
const cuComplex       *beta,
cuComplex       *y, int incy)
cublasStatus_t cublasZhbmv(cublasHandle_t handle, cublasFillMode_t uplo,
int n, int k, const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *x, int incx,
const cuDoubleComplex *beta,
cuDoubleComplex *y, int incy)```

This function performs the Hermitian banded matrix-vector multiplication

$\mathbf{\text{y}}=\alpha A\mathbf{\text{x}}+\beta \mathbf{\text{y}}$

where $A$ is an $n×n$ Hermitian banded matrix with $k$ subdiagonals and superdiagonals, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ and $\beta$ are scalars.

If uplo == CUBLAS_FILL_MODE_LOWER then the Hermitian banded matrix $A$ is stored column by column, with the main diagonal of the matrix stored in row 1, the first subdiagonal in row 2 (starting at first position), the second subdiagonal in row 3 (starting at first position), etc. In general, the element $A\left(i,j\right)$ is stored in the memory location A(1+i-j,j) for $j=1,\dots ,n$ and $i\in \left[j,min\left(n,j+k\right)\right]$ . Also, the elements in the array A that do not conceptually correspond to the elements in the banded matrix (the bottom right $k×k$ triangle) are not referenced.

If uplo == CUBLAS_FILL_MODE_UPPER then the Hermitian banded matrix $A$ is stored column by column, with the main diagonal of the matrix stored in row k+1, the first superdiagonal in row k (starting at second position), the second superdiagonal in row k-1 (starting at third position), etc. In general, the element $A\left(i,j\right)$ is stored in the memory location A(1+k+i-j,j) for $j=1,\dots ,n$ and $i\in \left[max\left(1,j-k\right),j\right]$ . Also, the elements in the array A that do not conceptually correspond to the elements in the banded matrix (the top left $k×k$ triangle) are not referenced.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle |  | input | handle to the CUBLAS library context. |
| uplo |  | input | indicates if matrix A lower or upper part is stored, the other Hermitian part is not referenced and is inferred from the stored elements. |
| n |  | input | number of rows and columns of matrix A. |
| k |  | input | number of sub- and super-diagonals of matrix A. |
| alpha | host or device | input | <type> scalar used for multiplication. |
| A | device | input | <type> array of dimensions lda x n, with lda>=k+1. The imaginary parts of the diagonal elements are assumed to be zero. |
| lda |  | input | leading dimension of two-dimensional array used to store matrix A. |
| x | device | input | <type> vector with n elements. |
| incx |  | input | stride between consecutive elements of x. |
| beta | host or device | input | <type> scalar used for multiplication, if beta==0 then y does not have to be a valid input. |
| y | device | in/out | <type> vector with n elements. |
| incy |  | input | stride between consecutive elements of y. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n,k<0 or incx,incy=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.19. cublas<t>hpmv()

```cublasStatus_t cublasChpmv(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const cuComplex       *alpha,
const cuComplex       *AP,
const cuComplex       *x, int incx,
const cuComplex       *beta,
cuComplex       *y, int incy)
cublasStatus_t cublasZhpmv(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const cuDoubleComplex *alpha,
const cuDoubleComplex *AP,
const cuDoubleComplex *x, int incx,
const cuDoubleComplex *beta,
cuDoubleComplex *y, int incy)```

This function performs the Hermitian packed matrix-vector multiplication

$\mathbf{\text{y}}=\alpha A\mathbf{\text{x}}+\beta \mathbf{\text{y}}$

where $A$ is an $n×n$ Hermitian matrix stored in packed format, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ and $\beta$ are scalars.

If uplo == CUBLAS_FILL_MODE_LOWER then the elements in the lower triangular part of the Hermitian matrix $A$ are packed together column by column without gaps, so that the element $A\left(i,j\right)$ is stored in the memory location AP[i+((2*n-j+1)*j)/2] for $j=1,\dots ,n$ and $i\ge j$ . Consequently, the packed format requires only $\frac{n\left(n+1\right)}{2}$ elements for storage.

If uplo == CUBLAS_FILL_MODE_UPPER then the elements in the upper triangular part of the Hermitian matrix $A$ are packed together column by column without gaps, so that the element $A\left(i,j\right)$ is stored in the memory location AP[i+(j*(j+1))/2] for $j=1,\dots ,n$ and $i\le j$ . Consequently, the packed format requires only $\frac{n\left(n+1\right)}{2}$ elements for storage.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle |  | input | handle to the CUBLAS library context. |
| uplo |  | input | indicates if matrix A lower or upper part is stored, the other Hermitian part is not referenced and is inferred from the stored elements. |
| n |  | input | number of rows and columns of matrix A. |
| alpha | host or device | input | <type> scalar used for multiplication. |
| AP | device | input | <type> array with A stored in packed format. The imaginary parts of the diagonal elements are assumed to be zero. |
| x | device | input | <type> vector with n elements. |
| incx |  | input | stride between consecutive elements of x. |
| beta | host or device | input | <type> scalar used for multiplication, if beta==0 then y does not have to be a valid input. |
| y | device | in/out | <type> vector with n elements. |
| incy |  | input | stride between consecutive elements of y. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n<0 or incx,incy=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.20. cublas<t>her()

```cublasStatus_t cublasCher(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const float  *alpha,
const cuComplex       *x, int incx,
cuComplex       *A, int lda)
cublasStatus_t cublasZher(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const double *alpha,
const cuDoubleComplex *x, int incx,
cuDoubleComplex *A, int lda)```

This function performs the Hermitian rank-1 update

$A=\alpha \mathbf{\text{x}}{\mathbf{\text{x}}}^{H}+A$

where $A$ is an $n×n$ Hermitian matrix stored in column-major format, $\mathbf{x}$ is a vector, and $\alpha$ is a scalar.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle |  | input | handle to the CUBLAS library context. |
| uplo |  | input | indicates if matrix A lower or upper part is stored, the other Hermitian part is not referenced and is inferred from the stored elements. |
| n |  | input | number of rows and columns of matrix A. |
| alpha | host or device | input | <type> scalar used for multiplication. |
| x | device | input | <type> vector with n elements. |
| incx |  | input | stride between consecutive elements of x. |
| A | device | in/out | <type> array of dimensions lda x n, with lda>=max(1,n). The imaginary parts of the diagonal elements are assumed and set to zero. |
| lda |  | input | leading dimension of two-dimensional array used to store matrix A. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n<0 or incx=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.21. cublas<t>her2()

```cublasStatus_t cublasCher2(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const cuComplex       *alpha,
const cuComplex       *x, int incx,
const cuComplex       *y, int incy,
cuComplex       *A, int lda)
cublasStatus_t cublasZher2(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const cuDoubleComplex *alpha,
const cuDoubleComplex *x, int incx,
const cuDoubleComplex *y, int incy,
cuDoubleComplex *A, int lda)```

This function performs the Hermitian rank-2 update

$A=\alpha \mathbf{\text{x}}{\mathbf{\text{y}}}^{H}+\stackrel{ˉ}{\alpha }\mathbf{\text{y}}{\mathbf{\text{x}}}^{H}+A$

where $A$ is an $n×n$ Hermitian matrix stored in column-major format, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ is a scalar.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle |  | input | handle to the CUBLAS library context. |
| uplo |  | input | indicates if matrix A lower or upper part is stored, the other Hermitian part is not referenced and is inferred from the stored elements. |
| n |  | input | number of rows and columns of matrix A. |
| alpha | host or device | input | <type> scalar used for multiplication. |
| x | device | input | <type> vector with n elements. |
| incx |  | input | stride between consecutive elements of x. |
| y | device | input | <type> vector with n elements. |
| incy |  | input | stride between consecutive elements of y. |
| A | device | in/out | <type> array of dimension lda x n with lda>=max(1,n). The imaginary parts of the diagonal elements are assumed and set to zero. |
| lda |  | input | leading dimension of two-dimensional array used to store matrix A. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n<0 or incx,incy=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |


### 6.22. cublas<t>hpr()

```cublasStatus_t cublasChpr(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const float *alpha,
const cuComplex       *x, int incx,
cuComplex       *AP)
cublasStatus_t cublasZhpr(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const double *alpha,
const cuDoubleComplex *x, int incx,
cuDoubleComplex *AP)```

This function performs the packed Hermitian rank-1 update

$A=\alpha \mathbf{\text{x}}{\mathbf{\text{x}}}^{H}+A$

where $A$ is an $n×n$ Hermitian matrix stored in packed format, $\mathbf{x}$ is a vector, and $\alpha$ is a scalar.

If uplo == CUBLAS_FILL_MODE_LOWER then the elements in the lower triangular part of the Hermitian matrix $A$ are packed together column by column without gaps, so that the element $A\left(i,j\right)$ is stored in the memory location AP[i+((2*n-j+1)*j)/2] for $j=1,\dots ,n$ and $i\ge j$ . Consequently, the packed format requires only $\frac{n\left(n+1\right)}{2}$ elements for storage.

If uplo == CUBLAS_FILL_MODE_UPPER then the elements in the upper triangular part of the Hermitian matrix $A$ are packed together column by column without gaps, so that the element $A\left(i,j\right)$ is stored in the memory location AP[i+(j*(j+1))/2] for $j=1,\dots ,n$ and $i\le j$ . Consequently, the packed format requires only $\frac{n\left(n+1\right)}{2}$ elements for storage.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle |  | input | handle to the CUBLAS library context. |
| uplo |  | input | indicates if matrix A lower or upper part is stored, the other Hermitian part is not referenced and is inferred from the stored elements. |
| n |  | input | number of rows and columns of matrix A. |
| alpha | host or device | input | <type> scalar used for multiplication. |
| x | device | input | <type> vector with n elements. |
| incx |  | input | stride between consecutive elements of x. |
| AP | device | in/out | <type> array with A stored in packed format. The imaginary parts of the diagonal elements are assumed and set to zero. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n<0 or incx=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |

### 6.23. cublas<t>hpr2()

```cublasStatus_t cublasChpr2(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const cuComplex       *alpha,
const cuComplex       *x, int incx,
const cuComplex       *y, int incy,
cuComplex       *AP)
cublasStatus_t cublasZhpr2(cublasHandle_t handle, cublasFillMode_t uplo,
int n, const cuDoubleComplex *alpha,
const cuDoubleComplex *x, int incx,
const cuDoubleComplex *y, int incy,
cuDoubleComplex *AP)```

This function performs the packed Hermitian rank-2 update

$A=\alpha \mathbf{\text{x}}{\mathbf{\text{y}}}^{H}+\stackrel{ˉ}{\alpha }\mathbf{\text{y}}{\mathbf{\text{x}}}^{H}+A$

where $A$ is an $n×n$ Hermitian matrix stored in packed format, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ is a scalar.

If uplo == CUBLAS_FILL_MODE_LOWER then the elements in the lower triangular part of the Hermitian matrix $A$ are packed together column by column without gaps, so that the element $A\left(i,j\right)$ is stored in the memory location AP[i+((2*n-j+1)*j)/2] for $j=1,\dots ,n$ and $i\ge j$ . Consequently, the packed format requires only $\frac{n\left(n+1\right)}{2}$ elements for storage.

If uplo == CUBLAS_FILL_MODE_UPPER then the elements in the upper triangular part of the Hermitian matrix $A$ are packed together column by column without gaps, so that the element $A\left(i,j\right)$ is stored in the memory location AP[i+(j*(j+1))/2] for $j=1,\dots ,n$ and $i\le j$ . Consequently, the packed format requires only $\frac{n\left(n+1\right)}{2}$ elements for storage.

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle |  | input | handle to the CUBLAS library context. |
| uplo |  | input | indicates if matrix A lower or upper part is stored, the other Hermitian part is not referenced and is inferred from the stored elements. |
| n |  | input | number of rows and columns of matrix A. |
| alpha | host or device | input | <type> scalar used for multiplication. |
| x | device | input | <type> vector with n elements. |
| incx |  | input | stride between consecutive elements of x. |
| y | device | input | <type> vector with n elements. |
| incy |  | input | stride between consecutive elements of y. |
| AP | device | in/out | <type> array with A stored in packed format. The imaginary parts of the diagonal elements are assumed and set to zero. |

The possible error values returned by this function and their meanings are listed below.

| Error Value | Meaning |
| --- | --- |
| CUBLAS_STATUS_SUCCESS | the operation completed successfully |
| CUBLAS_STATUS_NOT_INITIALIZED | the library was not initialized |
| CUBLAS_STATUS_INVALID_VALUE | the parameters n<0 or incx,incy=0 |
| CUBLAS_STATUS_ARCH_MISMATCH | the device does not support double-precision |
| CUBLAS_STATUS_EXECUTION_FAILED | the function failed to launch on the GPU |


## 7. CUBLAS Level-3 Function Reference

In this chapter we describe the Level-3 Basic Linear Algebra Subprograms (BLAS3) functions that perform matrix-matrix operations.

### 7.1. cublas<t>gemm()

```cublasStatus_t cublasSgemm(cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n, int k,
const float           *alpha,
const float           *A, int lda,
const float           *B, int ldb,
const float           *beta,
float           *C, int ldc)
cublasStatus_t cublasDgemm(cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n, int k,
const double          *alpha,
const double          *A, int lda,
const double          *B, int ldb,
const double          *beta,
double          *C, int ldc)
cublasStatus_t cublasCgemm(cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n, int k,
const cuComplex       *alpha,
const cuComplex       *A, int lda,
const cuComplex       *B, int ldb,
const cuComplex       *beta,
cuComplex       *C, int ldc)
cublasStatus_t cublasZgemm(cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n, int k,
const cuDoubleComplex *alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *B, int ldb,
const cuDoubleComplex *beta,
cuDoubleComplex *C, int ldc)```

This function performs the matrix-matrix multiplication

$C=\alpha \text{op}\left(A\right)\text{op}\left(B\right)+\beta C$

where $\alpha$ and $\beta$ are scalars, and $A$ , $B$ and $C$ are matrices stored in column-major format, with $\text{op}\left(A\right)$ of dimension $m×k$ , $\text{op}\left(B\right)$ of dimension $k×n$ and $C$ of dimension $m×n$ , respectively. Also, for matrix $A$

$\text{op}\left(A\right)=\begin{cases}A&\text{if transa == CUBLAS_OP_N}\\{A}^{T}&\text{if transa == CUBLAS_OP_T}\\{A}^{H}&\text{if transa == CUBLAS_OP_C}\end{cases}$

and $\text{op}\left(B\right)$ is defined similarly for matrix $B$ .

| Param. | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle |  | input | handle to the CUBLAS library context. |
| transa |  | input | operation op(A) that is non- or (conj.) transpose. |
| transb |  | input | operation op(B) that is non- or (conj.) transpose. |
| m |  | input | number of rows of matrix op(A) and C. |
| n |  | input | number of columns of matrix op(B) and C. |
| k |  | input | number of columns of op(A) and rows of op(B). |
| alpha | host or device | input | <type> scalar used for multiplication. |
| A | device | input | <type> array of dimensions lda x k with lda>=max(1,m) if transa == CUBLAS_OP_N and lda x m with lda>=max(1,k) otherwise. |
| lda |  | input | leading dimension of two-dimensional array used to store the matrix A. |