6.29. Tensor Map Object Management

This section describes the tensor map object management functions of the low-level CUDA driver application programming interface. The tensor map API is only supported on devices of compute capability 9.0 or higher.

Functions

CUresult cuTensorMapEncodeIm2col ( CUtensorMap* tensorMap, CUtensorMapDataType tensorDataType, cuuint32_t tensorRank, void* globalAddress, const cuuint64_t* globalDim, const cuuint64_t* globalStrides, const int* pixelBoxLowerCorner, const int* pixelBoxUpperCorner, cuuint32_t channelsPerPixel, cuuint32_t pixelsPerColumn, const cuuint32_t* elementStrides, CUtensorMapInterleave interleave, CUtensorMapSwizzle swizzle, CUtensorMapL2promotion l2Promotion, CUtensorMapFloatOOBfill oobFill )
Create a tensor map descriptor object representing an im2col memory region.
CUresult cuTensorMapEncodeTiled ( CUtensorMap* tensorMap, CUtensorMapDataType tensorDataType, cuuint32_t tensorRank, void* globalAddress, const cuuint64_t* globalDim, const cuuint64_t* globalStrides, const cuuint32_t* boxDim, const cuuint32_t* elementStrides, CUtensorMapInterleave interleave, CUtensorMapSwizzle swizzle, CUtensorMapL2promotion l2Promotion, CUtensorMapFloatOOBfill oobFill )
Create a tensor map descriptor object representing a tiled memory region.
CUresult cuTensorMapReplaceAddress ( CUtensorMap* tensorMap, void* globalAddress )
Modify an existing tensor map descriptor with an updated global address.

Functions

CUresult cuTensorMapEncodeIm2col ( CUtensorMap* tensorMap, CUtensorMapDataType tensorDataType, cuuint32_t tensorRank, void* globalAddress, const cuuint64_t* globalDim, const cuuint64_t* globalStrides, const int* pixelBoxLowerCorner, const int* pixelBoxUpperCorner, cuuint32_t channelsPerPixel, cuuint32_t pixelsPerColumn, const cuuint32_t* elementStrides, CUtensorMapInterleave interleave, CUtensorMapSwizzle swizzle, CUtensorMapL2promotion l2Promotion, CUtensorMapFloatOOBfill oobFill )
Create a tensor map descriptor object representing an im2col memory region.
Parameters
tensorMap
- Tensor map object to create
tensorDataType
- Tensor data type
tensorRank
- Dimensionality of tensor; must be at least 3
globalAddress
- Starting address of memory region described by tensor
globalDim
- Array containing tensor size (number of elements) along each of the tensorRank dimensions
globalStrides
- Array containing stride size (in bytes) along each of the tensorRank - 1 dimensions
pixelBoxLowerCorner
- Array containing DHW dimensions of lower box corner
pixelBoxUpperCorner
- Array containing DHW dimensions of upper box corner
channelsPerPixel
- Number of channels per pixel
pixelsPerColumn
- Number of pixels per column
elementStrides
- Array containing traversal stride in each of the tensorRank dimensions
interleave
- Type of interleaved layout used by the tensor
swizzle
- Bank swizzling pattern inside shared memory
l2Promotion
- L2 promotion size
oobFill
- Indicates whether zero or a special NaN constant must be used to fill out-of-bound elements
Description

Creates a descriptor for a Tensor Memory Access (TMA) object specified by the parameters describing an im2col memory layout and returns it in tensorMap. A usage sketch follows the requirements list below.

Tensor map objects are only supported on devices of compute capability 9.0 or higher. Additionally, a tensor map object is an opaque value, and, as such, should only be accessed through CUDA API calls.
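
Support can be confirmed at run time before encoding. A minimal sketch, assuming dev is a previously acquired CUdevice:

    int ccMajor = 0, ccMinor = 0;
    // Query the device's compute capability; tensor map objects require 9.0+.
    cuDeviceGetAttribute(&ccMajor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev);
    cuDeviceGetAttribute(&ccMinor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev);
    if (ccMajor < 9) {
        // Tensor map objects (TMA) are unavailable on this device.
    }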

The parameters passed are bound to the following requirements:

  • tensorMap address must be aligned to 64 bytes.

  • tensorDataType has to be an enum from CUtensorMapDataType which is defined as:
    ‎    typedef enum CUtensorMapDataType_enum {
                  CU_TENSOR_MAP_DATA_TYPE_UINT8 = 0,       // 1 byte
                  CU_TENSOR_MAP_DATA_TYPE_UINT16,          // 2 bytes
                  CU_TENSOR_MAP_DATA_TYPE_UINT32,          // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_INT32,           // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_UINT64,          // 8 bytes
                  CU_TENSOR_MAP_DATA_TYPE_INT64,           // 8 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT16,         // 2 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT32,         // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT64,         // 8 bytes
                  CU_TENSOR_MAP_DATA_TYPE_BFLOAT16,        // 2 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT32_FTZ,     // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_TFLOAT32,        // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_TFLOAT32_FTZ     // 4 bytes
              } CUtensorMapDataType;

  • tensorRank must be 3, 4, or 5.

  • globalAddress, which specifies the starting address of the memory region described, must be 32 byte aligned when interleave is CU_TENSOR_MAP_INTERLEAVE_32B and 16 byte aligned otherwise.

  • globalDim array, which specifies tensor size of each of the tensorRank dimensions, must be non-zero and less than or equal to 2^32.

  • globalStrides array, which specifies the tensor stride of each of the lower tensorRank - 1 dimensions in bytes, must be a multiple of 16 and less than 2^40. Additionally, the stride must be a multiple of 32 when interleave is CU_TENSOR_MAP_INTERLEAVE_32B. Each dimension's stride subsumes the stride of the dimension below it:
    ‎    globalStrides[0] = globalDim[0] * elementSizeInBytes(tensorDataType) + padding[0];
              for (i = 1; i < tensorRank - 1; i++)
                  globalStrides[i] = globalStrides[i - 1] * (globalDim[i] + padding[i]);
              assert(globalStrides[i] >= globalDim[i]);

  • pixelBoxLowerCorner array specifies the coordinate offsets {D, H, W} of the bounding box from top/left/front corner. The number of offsets and their precision depends on the tensor dimensionality:
    • When tensorRank is 3, one signed offset within range [-32768, 32767] is supported.

    • When tensorRank is 4, two signed offsets each within range [-128, 127] are supported.

    • When tensorRank is 5, three signed offsets each within range [-16, 15] are supported.

  • pixelBoxUpperCorner array specifies the coordinate offsets {D, H, W} of the bounding box from bottom/right/back corner. The number of offsets and their precision depends on the tensor dimensionality:
    • When tensorRank is 3, one signed offset within range [-32768, 32767] is supported.

    • When tensorRank is 4, two signed offsets each within range [-128, 127] are supported.

    • When tensorRank is 5, three signed offsets each within range [-16, 15] are supported. The bounding box specified by pixelBoxLowerCorner and pixelBoxUpperCorner must have non-zero area.

  • channelsPerPixel, which specifies the number of elements that must be accessed along the C dimension, must be less than or equal to 256.

  • pixelsPerColumn, which specifies the number of elements that must be accessed along the {N, D, H, W} dimensions, must be less than or equal to 1024.

  • elementStrides array, which specifies the iteration step along each of the tensorRank dimensions, must be non-zero and less than or equal to 8. Note that when interleave is CU_TENSOR_MAP_INTERLEAVE_NONE, the first element of this array is ignored since TMA doesn’t support a stride for dimension zero. When all elements of the elementStrides array are one, boxDim specifies the number of elements to load. However, if elementStrides[i] is not equal to one, then TMA loads ceil( boxDim[i] / elementStrides[i] ) elements along the i-th dimension. To load N elements along the i-th dimension, boxDim[i] must be set to N * elementStrides[i].

  • interleave specifies the interleaved layout of type CUtensorMapInterleave, which is defined as:
    ‎    typedef enum CUtensorMapInterleave_enum {
                  CU_TENSOR_MAP_INTERLEAVE_NONE = 0,
                  CU_TENSOR_MAP_INTERLEAVE_16B,
                  CU_TENSOR_MAP_INTERLEAVE_32B
              } CUtensorMapInterleave;
    TMA supports interleaved layouts like NC/8HWC8, where C8 utilizes 16 bytes in memory assuming 2 bytes per channel, or NC/16HWC16, where C16 uses 32 bytes. When interleave is CU_TENSOR_MAP_INTERLEAVE_NONE and swizzle is not CU_TENSOR_MAP_SWIZZLE_NONE, the bounding box inner dimension (computed as boxDim[0] multiplied by the element size derived from tensorDataType) must be less than or equal to the swizzle size.
    • CU_TENSOR_MAP_SWIZZLE_32B implies the bounding box inner dimension will be <= 32.

    • CU_TENSOR_MAP_SWIZZLE_64B implies the bounding box inner dimension will be <= 64.

    • CU_TENSOR_MAP_SWIZZLE_128B implies the bounding box inner dimension will be <= 128.

  • swizzle, which specifies the shared memory bank swizzling pattern, has to be of type CUtensorMapSwizzle which is defined as:
    ‎    typedef enum CUtensorMapSwizzle_enum {
                  CU_TENSOR_MAP_SWIZZLE_NONE = 0,
                  CU_TENSOR_MAP_SWIZZLE_32B,
                  CU_TENSOR_MAP_SWIZZLE_64B,
                  CU_TENSOR_MAP_SWIZZLE_128B
              } CUtensorMapSwizzle;
    Data is organized in a specific order in global memory; however, this may not match the order in which the application accesses data in shared memory. This difference in data organization may cause bank conflicts when shared memory is accessed. In order to avoid this problem, data can be loaded to shared memory with shuffling across shared memory banks. Note that it’s expected that when interleave is CU_TENSOR_MAP_INTERLEAVE_32B, swizzle should be CU_TENSOR_MAP_SWIZZLE_32B. Other interleave modes can use any swizzling pattern.

  • l2Promotion specifies the L2 fetch size, which indicates the byte granularity at which L2 requests are filled from DRAM. It must be of type CUtensorMapL2promotion, which is defined as:
    ‎    typedef enum CUtensorMapL2promotion_enum {
                  CU_TENSOR_MAP_L2_PROMOTION_NONE = 0,
                  CU_TENSOR_MAP_L2_PROMOTION_L2_64B,
                  CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
                  CU_TENSOR_MAP_L2_PROMOTION_L2_256B
              } CUtensorMapL2promotion;

  • oobFill, which indicates whether zero or a special NaN constant should be used to fill out-of-bound elements, must be of type CUtensorMapFloatOOBfill which is defined as:
    ‎    typedef enum CUtensorMapFloatOOBfill_enum {
                  CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE = 0,
                  CU_TENSOR_MAP_FLOAT_OOB_FILL_NAN_REQUEST_ZERO_FMA
              } CUtensorMapFloatOOBfill;
    Note that CU_TENSOR_MAP_FLOAT_OOB_FILL_NAN_REQUEST_ZERO_FMA can only be used when tensorDataType represents a floating-point data type.
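
As a concrete illustration of the requirements above, the following is a minimal sketch of an encoding call. The tensor shape, the dense NHWC layout, the innermost-first dimension ordering, and the zero corner offsets are illustrative assumptions, not values mandated by the API; the helper name encode_im2col is likewise hypothetical:

    #include <cuda.h>

    // Logical tensor: N=16, H=64, W=64, C=64, fp16, densely packed NHWC.
    // Dimensions are listed innermost-first: {C, W, H, N}.
    static CUresult encode_im2col(CUtensorMap* tensorMap, CUdeviceptr base) {
        const cuuint64_t globalDim[4] = {64, 64, 64, 16};
        // Byte strides of the upper tensorRank - 1 dimensions (all multiples of 16):
        const cuuint64_t globalStrides[3] = {
            64ULL * 2,            // W stride = C * sizeof(fp16)
            64ULL * 64 * 2,       // H stride
            64ULL * 64 * 64 * 2   // N stride
        };
        // Zero offsets keep the pixel bounding box equal to the full H x W image;
        // nonzero offsets (within [-128, 127] for rank 4) would shrink or grow it,
        // e.g. to cover a convolution halo.
        const int lowerCorner[2] = {0, 0};
        const int upperCorner[2] = {0, 0};
        const cuuint32_t elementStrides[4] = {1, 1, 1, 1};  // dense traversal

        return cuTensorMapEncodeIm2col(
            tensorMap, CU_TENSOR_MAP_DATA_TYPE_FLOAT16,
            4,                                   // tensorRank
            (void*)base,                         // 16-byte aligned (no 32B interleave)
            globalDim, globalStrides,
            lowerCorner, upperCorner,
            64,                                  // channelsPerPixel (= C, <= 256)
            64,                                  // pixelsPerColumn (<= 1024)
            elementStrides,
            CU_TENSOR_MAP_INTERLEAVE_NONE,
            CU_TENSOR_MAP_SWIZZLE_NONE,
            CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
            CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    }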

See also:

cuTensorMapEncodeTiled, cuTensorMapReplaceAddress

CUresult cuTensorMapEncodeTiled ( CUtensorMap* tensorMap, CUtensorMapDataType tensorDataType, cuuint32_t tensorRank, void* globalAddress, const cuuint64_t* globalDim, const cuuint64_t* globalStrides, const cuuint32_t* boxDim, const cuuint32_t* elementStrides, CUtensorMapInterleave interleave, CUtensorMapSwizzle swizzle, CUtensorMapL2promotion l2Promotion, CUtensorMapFloatOOBfill oobFill )
Create a tensor map descriptor object representing a tiled memory region.
Parameters
tensorMap
- Tensor map object to create
tensorDataType
- Tensor data type
tensorRank
- Dimensionality of tensor
globalAddress
- Starting address of memory region described by tensor
globalDim
- Array containing tensor size (number of elements) along each of the tensorRank dimensions
globalStrides
- Array containing stride size (in bytes) along each of the tensorRank - 1 dimensions
boxDim
- Array containing traversal box size (number of elements) along each of the tensorRank dimensions; specifies how many elements are traversed along each tensor dimension
elementStrides
- Array containing traversal stride in each of the tensorRank dimensions
interleave
- Type of interleaved layout used by the tensor
swizzle
- Bank swizzling pattern inside shared memory
l2Promotion
- L2 promotion size
oobFill
- Indicates whether zero or a special NaN constant must be used to fill out-of-bound elements
Description

Creates a descriptor for a Tensor Memory Access (TMA) object specified by the parameters describing a tiled region and returns it in tensorMap. A usage sketch follows the requirements list below.

Tensor map objects are only supported on devices of compute capability 9.0 or higher. Additionally, a tensor map object is an opaque value, and, as such, should only be accessed through CUDA API calls.

The parameters passed are bound to the following requirements:

  • tensorMap address must be aligned to 64 bytes.

  • tensorDataType has to be an enum from CUtensorMapDataType which is defined as:
    ‎    typedef enum CUtensorMapDataType_enum {
                  CU_TENSOR_MAP_DATA_TYPE_UINT8 = 0,       // 1 byte
                  CU_TENSOR_MAP_DATA_TYPE_UINT16,          // 2 bytes
                  CU_TENSOR_MAP_DATA_TYPE_UINT32,          // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_INT32,           // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_UINT64,          // 8 bytes
                  CU_TENSOR_MAP_DATA_TYPE_INT64,           // 8 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT16,         // 2 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT32,         // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT64,         // 8 bytes
                  CU_TENSOR_MAP_DATA_TYPE_BFLOAT16,        // 2 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT32_FTZ,     // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_TFLOAT32,        // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_TFLOAT32_FTZ     // 4 bytes
              } CUtensorMapDataType;

  • tensorRank must be non-zero and less than or equal to the maximum supported dimensionality of 5. If interleave is not CU_TENSOR_MAP_INTERLEAVE_NONE, then tensorRank must additionally be greater than or equal to 3.

  • globalAddress, which specifies the starting address of the memory region described, must be 32 byte aligned when interleave is CU_TENSOR_MAP_INTERLEAVE_32B and 16 byte aligned otherwise.

  • globalDim array, which specifies tensor size of each of the tensorRank dimensions, must be non-zero and less than or equal to 2^32.

  • globalStrides array, which specifies the tensor stride of each of the lower tensorRank - 1 dimensions in bytes, must be a multiple of 16 and less than 2^40. Additionally, the stride must be a multiple of 32 when interleave is CU_TENSOR_MAP_INTERLEAVE_32B. Each dimension's stride subsumes the stride of the dimension below it:
    ‎    globalStrides[0] = globalDim[0] * elementSizeInBytes(tensorDataType) + padding[0];
              for (i = 1; i < tensorRank - 1; i++)
                  globalStrides[i] = globalStrides[i - 1] * (globalDim[i] + padding[i]);
              assert(globalStrides[i] >= globalDim[i]);

  • boxDim array, which specifies number of elements to be traversed along each of the tensorRank dimensions, must be non-zero and less than or equal to 256. When interleave is CU_TENSOR_MAP_INTERLEAVE_NONE, { boxDim[0] * elementSizeInBytes( tensorDataType ) } must be a multiple of 16 bytes.

  • elementStrides array, which specifies the iteration step along each of the tensorRank dimensions, must be non-zero and less than or equal to 8. Note that when interleave is CU_TENSOR_MAP_INTERLEAVE_NONE, the first element of this array is ignored since TMA doesn’t support a stride for dimension zero. When all elements of the elementStrides array are one, boxDim specifies the number of elements to load. However, if elementStrides[i] is not equal to one, then TMA loads ceil( boxDim[i] / elementStrides[i] ) elements along the i-th dimension; for example, with boxDim[i] = 8 and elementStrides[i] = 3, TMA loads ceil(8 / 3) = 3 elements. To load N elements along the i-th dimension, boxDim[i] must be set to N * elementStrides[i].

  • interleave specifies the interleaved layout of type CUtensorMapInterleave, which is defined as:
    ‎    typedef enum CUtensorMapInterleave_enum {
                  CU_TENSOR_MAP_INTERLEAVE_NONE = 0,
                  CU_TENSOR_MAP_INTERLEAVE_16B,
                  CU_TENSOR_MAP_INTERLEAVE_32B
              } CUtensorMapInterleave;
    TMA supports interleaved layouts like NC/8HWC8, where C8 utilizes 16 bytes in memory assuming 2 bytes per channel, or NC/16HWC16, where C16 uses 32 bytes. When interleave is CU_TENSOR_MAP_INTERLEAVE_NONE and swizzle is not CU_TENSOR_MAP_SWIZZLE_NONE, the bounding box inner dimension (computed as boxDim[0] multiplied by the element size derived from tensorDataType) must be less than or equal to the swizzle size.
    • CU_TENSOR_MAP_SWIZZLE_32B implies the bounding box inner dimension will be <= 32.

    • CU_TENSOR_MAP_SWIZZLE_64B implies the bounding box inner dimension will be <= 64.

    • CU_TENSOR_MAP_SWIZZLE_128B implies the bounding box inner dimension will be <= 128.

  • swizzle, which specifies the shared memory bank swizzling pattern, has to be of type CUtensorMapSwizzle which is defined as:
    ‎    typedef enum CUtensorMapSwizzle_enum {
                  CU_TENSOR_MAP_SWIZZLE_NONE = 0,
                  CU_TENSOR_MAP_SWIZZLE_32B,
                  CU_TENSOR_MAP_SWIZZLE_64B,
                  CU_TENSOR_MAP_SWIZZLE_128B
              } CUtensorMapSwizzle;
    Data is organized in a specific order in global memory; however, this may not match the order in which the application accesses data in shared memory. This difference in data organization may cause bank conflicts when shared memory is accessed. In order to avoid this problem, data can be loaded to shared memory with shuffling across shared memory banks. Note that it’s expected that when interleave is CU_TENSOR_MAP_INTERLEAVE_32B, swizzle should be CU_TENSOR_MAP_SWIZZLE_32B. Other interleave modes can use any swizzling pattern.

  • l2Promotion specifies the L2 fetch size, which indicates the byte granularity at which L2 requests are filled from DRAM. It must be of type CUtensorMapL2promotion, which is defined as:
    ‎    typedef enum CUtensorMapL2promotion_enum {
                  CU_TENSOR_MAP_L2_PROMOTION_NONE = 0,
                  CU_TENSOR_MAP_L2_PROMOTION_L2_64B,
                  CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
                  CU_TENSOR_MAP_L2_PROMOTION_L2_256B
              } CUtensorMapL2promotion;

  • oobFill, which indicates whether zero or a special NaN constant should be used to fill out-of-bound elements, must be of type CUtensorMapFloatOOBfill which is defined as:
    ‎    typedef enum CUtensorMapFloatOOBfill_enum {
                  CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE = 0,
                  CU_TENSOR_MAP_FLOAT_OOB_FILL_NAN_REQUEST_ZERO_FMA
              } CUtensorMapFloatOOBfill;
    Note that CU_TENSOR_MAP_FLOAT_OOB_FILL_NAN_REQUEST_ZERO_FMA can only be used when tensorDataType represents a floating-point data type.
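
As a concrete illustration, the following is a minimal, self-contained sketch that encodes a tiled map over a 1024 x 1024 row-major float tensor, reading 64 x 64 tiles. The sizes are illustrative assumptions, and error handling is reduced to a single check:

    #include <cuda.h>
    #include <stdio.h>

    int main(void) {
        cuInit(0);
        CUdevice dev;  cuDeviceGet(&dev, 0);
        CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

        CUdeviceptr dptr;  // cuMemAlloc returns memory satisfying the 16-byte requirement
        cuMemAlloc(&dptr, 1024ULL * 1024ULL * sizeof(float));

        CUtensorMap tensorMap;  // cuda.h declares CUtensorMap with 64-byte alignment
        const cuuint64_t globalDim[2]      = {1024, 1024};  // elements per dimension
        const cuuint64_t globalStrides[1]  = {1024 * sizeof(float)};  // row pitch in bytes,
                                                                      // a multiple of 16
        const cuuint32_t boxDim[2]         = {64, 64};      // tile shape in elements
        const cuuint32_t elementStrides[2] = {1, 1};        // dense traversal

        CUresult res = cuTensorMapEncodeTiled(
            &tensorMap, CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
            2,                    // tensorRank
            (void*)dptr,
            globalDim, globalStrides, boxDim, elementStrides,
            CU_TENSOR_MAP_INTERLEAVE_NONE,
            CU_TENSOR_MAP_SWIZZLE_NONE,
            CU_TENSOR_MAP_L2_PROMOTION_NONE,
            CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
        if (res != CUDA_SUCCESS)
            fprintf(stderr, "cuTensorMapEncodeTiled failed: %d\n", (int)res);

        // tensorMap can now be passed to a kernel (for example as a
        // __grid_constant__ parameter) for use with bulk tensor copies.
        cuMemFree(dptr);
        cuCtxDestroy(ctx);
        return res == CUDA_SUCCESS ? 0 : 1;
    }

Note that boxDim[0] * sizeof(float) = 256 bytes here, satisfying the multiple-of-16 requirement for CU_TENSOR_MAP_INTERLEAVE_NONE.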

See also:

cuTensorMapEncodeIm2col, cuTensorMapReplaceAddress

CUresult cuTensorMapReplaceAddress ( CUtensorMap* tensorMap, void* globalAddress )
Modify an existing tensor map descriptor with an updated global address.
Parameters
tensorMap
- Tensor map object to modify
globalAddress
- Starting address of the memory region described by the tensor; must satisfy the alignment requirements of the original encode call
Description

Modifies the descriptor for the Tensor Memory Access (TMA) object passed in tensorMap with an updated globalAddress, as shown in the sketch below.

Tensor map objects are only supported on devices of compute capability 9.0 or higher. Additionally, a tensor map object is an opaque value, and, as such, should only be accessed through CUDA API calls.
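
A minimal sketch, assuming tensorMap was previously encoded by cuTensorMapEncodeTiled or cuTensorMapEncodeIm2col and that the new buffer has the same size and layout; the helper name retarget is hypothetical:

    #include <cuda.h>

    // Point an existing descriptor at a freshly allocated buffer, e.g. for
    // double buffering. cuMemAlloc returns memory that satisfies the 16-byte
    // (or 32-byte, for CU_TENSOR_MAP_INTERLEAVE_32B) alignment requirement.
    CUresult retarget(CUtensorMap* tensorMap, size_t sizeBytes, CUdeviceptr* outBuf) {
        CUresult res = cuMemAlloc(outBuf, sizeBytes);
        if (res != CUDA_SUCCESS) return res;
        return cuTensorMapReplaceAddress(tensorMap, (void*)(*outBuf));
    }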

See also:

cuTensorMapEncodeTiled, cuTensorMapEncodeIm2col