TensorDataFlow#
Fully qualified name: cupva::TensorDataFlow
Defined in src/host/cpp_api/include/cupva_host.hpp
-
class TensorDataFlow : public cupva::BaseDataFlow#
TensorDataFlow is a DataFlow abstraction for accessing 3D tiles.
TensorDataFlow requires driver version >= 2006.
In DL use cases, a convolution algorithm is used to extract features from input data. It applies a 3-dimensional filter to the input data; the filter slides over a 3D volume in the W and H dimensions to calculate the low-level feature representation.
TensorDataFlow (TDF) provides support for data access patterns common to Convolutional Neural Network (CNN) operations. It supports reading from a source tensor in DRAM/L2SRAM with CHW layout, one 3D tile at a time, where the tile size is less than or equal to the tensor size in each of the three dimensions. Similar to RasterDataFlow (RDF), halo and roi are supported for the H and W dimensions and padding is automatically handled when reading to VMEM. The depth or C dimension does not support padding or halo. TDF also supports writing to DRAM/L2SRAM with the same restrictions as RDF: halo and padding are not supported, and the destination roi must be within the bounds of the destination size.
TensorDataFlow is designed to provide a set of high-level APIs for configuring the DMA engine for tensor access patterns. The tile traversal order supported by TensorDataFlow is: Depth (forward) -> Width (left to right) -> Height (top to bottom).
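The configuration APIs chain on the returned reference. The following is a pseudocode-style sketch of a read configuration. It assumes a CmdProgram `prog`, a DRAM/SRAM pointer `srcPtr`, a VMEM tile buffer `vmemTileBuf`, and a Parameter `handlerParam` set up elsewhere; the way the data flow head is obtained (shown here as `addDataFlowHead`, following the other cuPVA DataFlow classes) and the variable names are illustrative, and the snippet will not compile outside a cuPVA SDK project:

```cpp
// Sketch only: names below are illustrative, not verbatim cuPVA code.
TensorDataFlow &tdf = prog.addDataFlowHead<TensorDataFlow>(0)
    .handler(handlerParam)
    .src(srcPtr, width, height, depth)  // CHW source tensor in DRAM/SRAM
    .tile(tileWidth, tileHeight)        // tile depth defaults to tensor depth
    .halo(1)                            // 1-pixel halo in W and H
    .tileBuffer(vmemTileBuf);           // tile buffer in VMEM
```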
Constraints:
One of src() or dst() must be called to select writing to or reading from the tile buffer, respectively.
The pointers used for src() or dst() must be in DRAM or SRAM.
For the 2D rectangles defined by these APIs: halo() < tile() <= roi()
The VMEM pointer used for tileBuffer() must reserve sufficient memory to support the desired tile transfer pattern. The user is responsible for ensuring this, and there are helpful macros provided in the device APIs: TDF_SINGLE() and TDF_DOUBLE().
When tile dimensions are larger than 255, the amount of padding required to read full tiles to VMEM may be greater than the maximum supported by the DMA engine (255). Careful attention must be paid to the src() and roi() dimensions to ensure that there are no tiles at the boundaries which would result in excessive padding.
The number of tiles in each dimension must be less than or equal to 256 (maximum 16777216 tiles total).
The parameters provided to src(), dst() or roi() must define a valid tensor, considering the tile() size: width <= min(tileWidth * 256, rowStride), rowStride <= 65535, height <= tileHeight * 256, depthStride < rowStride * 256 and depth < tileDepth * 256.
A single tile, including any halo, may read out of bounds on either the left or right boundaries of the tensor’s W dimension, but not both simultaneously. Similarly, a single tile may not read out of bounds on both the top and bottom boundaries of the tensor’s H dimension simultaneously. Note that out of bounds reads in H and W dimensions will be filled by the DMA engine in accordance with the arguments specified to halo(), and out of bounds writes will be skipped.
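The tensor-validity constraints above can be expressed as a hypothetical checker (not part of the cuPVA API; it simply mirrors the documented inequalities):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not a cuPVA API): checks the documented validity
// rules for a source/destination tensor given a tile size.
bool isTensorValid(int32_t width, int32_t height, int32_t depth,
                   int32_t rowStride, int32_t depthStride,
                   int32_t tileWidth, int32_t tileHeight, int32_t tileDepth)
{
    // width <= min(tileWidth * 256, rowStride)
    if (width > std::min(tileWidth * 256, rowStride)) { return false; }
    // rowStride <= 65535
    if (rowStride > 65535) { return false; }
    // height <= tileHeight * 256
    if (height > tileHeight * 256) { return false; }
    // depthStride < rowStride * 256
    if (depthStride >= rowStride * 256) { return false; }
    // depth < tileDepth * 256
    if (depth >= tileDepth * 256) { return false; }
    return true;
}
```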
Public Functions
-
TensorDataFlow &handler(const Parameter &handler)#
Set TensorDataFlow’s handler.
Usage considerations
Allowed context for the API call
Thread-safe: No
API group
Init: Yes
Runtime: No
De-Init: No
- Parameters:
handler – The handler defined in VMEM and represented by the host-side Parameter.
- Returns:
TensorDataFlow& A reference to the current object.
-
template<typename T>
inline TensorDataFlow &src(T &&op, int32_t const width, int32_t const height, int32_t const depth, int32_t const rowStride = 0, int32_t const depthStride = 0)#
Set TensorDataFlow’s source tensor.
This API is applicable to device pointers or OffsetPointers representing source memory in DRAM/SRAM.
Only one of src() or dst() can be set for a TensorDataFlow.
Usage considerations
Allowed context for the API call
Thread-safe: No
API group
Init: Yes
Runtime: No
De-Init: No
- Template Parameters:
T – The source buffer type. Can be either U* (raw pointer), or OffsetPointer<U>.
- Parameters:
op – The device pointer or OffsetPointer reference to source memory.
width – The buffer width in T-typed pixels.
height – The buffer height in T-typed pixels.
depth – The buffer depth in T-typed pixels.
rowStride – The source line-pitch in T-typed pixels. When not provided or set to zero, it will default to the value of width.
depthStride – The source 2D plane size in T-typed pixels. When not provided or set to zero, it will default to (rowStride * height) if rowStride is provided or (width * height) if rowStride is not provided.
- Returns:
TensorDataFlow& A reference to the current object.
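The stride defaulting rules can be expressed as a small helper (hypothetical, for illustration only; src() performs this resolution internally per the parameter descriptions above):

```cpp
#include <cstdint>

// Illustrative helper (not a cuPVA API): resolves the effective strides the
// way src()/dst() document their defaults when 0 is passed.
struct Strides
{
    int32_t rowStride;
    int32_t depthStride;
};

Strides resolveStrides(int32_t width, int32_t height,
                       int32_t rowStride = 0, int32_t depthStride = 0)
{
    // rowStride defaults to width when omitted or zero.
    int32_t const rs = (rowStride == 0) ? width : rowStride;
    // depthStride defaults to rowStride * height (using the resolved rowStride).
    int32_t const ds = (depthStride == 0) ? rs * height : depthStride;
    return {rs, ds};
}
```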
-
template<typename T>
inline TensorDataFlow &dst(T &&op, int32_t const width, int32_t const height, int32_t const depth, int32_t const rowStride = 0, int32_t const depthStride = 0)#
Set TensorDataFlow’s destination tensor.
This API is applicable to device pointers or OffsetPointers representing destination memory in DRAM/SRAM.
Only one of src() or dst() can be set for a TensorDataFlow.
Usage considerations
Allowed context for the API call
Thread-safe: No
API group
Init: Yes
Runtime: No
De-Init: No
- Template Parameters:
T – The destination buffer type. Can be either U* (raw pointer), or OffsetPointer<U>.
- Parameters:
op – The device pointer or OffsetPointer reference to destination memory.
width – The buffer width in T-typed pixels.
height – The buffer height in T-typed pixels.
depth – The buffer depth in T-typed pixels.
rowStride – The destination line-pitch in T-typed pixels. When not provided or set to zero, it will default to the value of width.
depthStride – The destination 2D plane size in T-typed pixels. When not provided or set to zero, it will default to (rowStride * height) if rowStride is provided or (width * height) if rowStride is not provided.
- Returns:
TensorDataFlow& A reference to the current object.
-
TensorDataFlow &halo(int32_t const num, PadModeType const mode = PadModeType::PAD_CONST, int32_t const val = 0)#
Set TensorDataFlow’s halo information uniformly for width and height dimensions.
The halo is taken into consideration when rasterizing a tensor from DRAM/SRAM into tiles in VMEM. Specifying non-zero halo causes each tile to fetch some additional ‘halo’ around the specified tile dimensions. The additional data comes from either the tensor data or the DMA’s padding engine, depending on whether the pixels to be fetched are within the bounds specified by the src() API.
A halo with num = 0 may be specified. The effect of this is to control how a partial tile is extended to a full tile in the absence of halo. When the depth of the source or destination tensor is a multiple of the tile depth, TensorDataFlow will always write a full tile. Often, there may not be sufficient data in the src buffer to write a full tile (this happens near the edges of a tensor which is not a multiple of the tile dimensions). The user may choose to fill the remaining parts of such tiles with either boundary pixel extension or constant value padding by using this API. When the depth of the source or destination tensor is not a multiple of the tile depth, TensorDataFlow will not transfer data for the tiles past the remainder in the C dimension in the last transfer.
Usage considerations
Allowed context for the API call
Thread-safe: No
API group
Init: Yes
Runtime: No
De-Init: No
- Parameters:
num – The halo size in T-typed pixels.
mode – The padding mode to be applied to the input.
val – The padding value.
- Returns:
TensorDataFlow& A reference to the current object.
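As a rough illustration of the extra data fetched per tile, the following helper (hypothetical, not a cuPVA API) assumes the halo extends num pixels on every side of the tile in W and H, consistent with the description above:

```cpp
#include <cstdint>

// Illustrative arithmetic (not a cuPVA API): with halo(numX, numY), each tile
// fetch covers (tileW + 2*numX) x (tileH + 2*numY) pixels per 2D plane; the
// halo contribution is the difference from the core tile area.
int32_t haloPixelsPerPlane(int32_t tileW, int32_t tileH,
                           int32_t numX, int32_t numY)
{
    int32_t const fetched = (tileW + 2 * numX) * (tileH + 2 * numY);
    return fetched - tileW * tileH;
}
```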
-
TensorDataFlow &halo(int32_t const numX, int32_t const numY, PadModeType const mode = PadModeType::PAD_CONST, int32_t const val = 0)#
Set TensorDataFlow’s halo information in width and height dimensions.
The halo is taken into consideration when rasterizing a tensor from DRAM/SRAM into tiles in VMEM. Specifying non-zero halo causes each tile to fetch some additional ‘halo’ around the specified tile dimensions. The additional data comes from either the tensor data or the DMA’s padding engine, depending on whether the pixels to be fetched are within the bounds specified by the src() API.
A halo with num = 0 may be specified. The effect of this is to control how a partial tile is extended to a full tile in the absence of halo. When the depth of the source or destination tensor is a multiple of the tile depth, TensorDataFlow will always write a full tile. Often, there may not be sufficient data in the src buffer to write a full tile (this happens near the edges of a tensor which is not a multiple of the tile dimensions). The user may choose to fill the remaining parts of such tiles with either boundary pixel extension or constant value padding by using this API. When the depth of the source or destination tensor is not a multiple of the tile depth, TensorDataFlow will not transfer data for the tiles past the remainder in the C dimension in the last transfer.
Usage considerations
Allowed context for the API call
Thread-safe: No
API group
Init: Yes
Runtime: No
De-Init: No
- Parameters:
numX – The halo size in the width dimension, in T-typed pixels.
numY – The halo size in the height dimension, in T-typed pixels.
mode – The padding mode to be applied to the input.
val – The padding value.
- Returns:
TensorDataFlow& A reference to the current object.
-
TensorDataFlow &roi(int32_t const x, int32_t const y, int32_t const width, int32_t const height)#
Set the TensorDataFlow’s tensor region of interest.
TensorDataFlow supports a region of interest with respect to the tensor width and height dimensions set by the src/dst API.
When the src() API is used, i.e. DMA is transferring into the tileBuffer, the ROI may be set larger than the tensor to add additional padding for boundary tiles. Each tile within the ROI must contain at least one pixel of overlap with the source tensor in both the width and height dimensions. The limits on the ROI can be defined using the following expressions:
To check the ROI (x,y):
bool isRoiXYValid{(roiX + tileWidth > 0) && (roiY + tileHeight > 0)};
To check the ROI (width, height):
int32_t roiWidthRounded{((roiWidth + (tileWidth - 1)) / tileWidth) * tileWidth};
int32_t roiHeightRounded{((roiHeight + (tileHeight - 1)) / tileHeight) * tileHeight};
bool isRoiWHValid{((roiX + roiWidthRounded - tileWidth) < srcWidth) && ((roiY + roiHeightRounded - tileHeight) < srcHeight)};
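The expressions above can be checked directly; here they are wrapped into a small self-contained function (the wrapper itself is illustrative, not a cuPVA API):

```cpp
#include <cstdint>

// Wraps the documented ROI validity checks for a source tensor of
// srcWidth x srcHeight, tessellated with tileWidth x tileHeight tiles.
bool isRoiValid(int32_t roiX, int32_t roiY, int32_t roiWidth, int32_t roiHeight,
                int32_t tileWidth, int32_t tileHeight,
                int32_t srcWidth, int32_t srcHeight)
{
    // Each boundary tile must overlap the tensor by at least one pixel.
    bool const isRoiXYValid{(roiX + tileWidth > 0) && (roiY + tileHeight > 0)};
    // Round the ROI up to whole tiles, then check the last tile's origin.
    int32_t const roiWidthRounded{((roiWidth + (tileWidth - 1)) / tileWidth) * tileWidth};
    int32_t const roiHeightRounded{((roiHeight + (tileHeight - 1)) / tileHeight) * tileHeight};
    bool const isRoiWHValid{((roiX + roiWidthRounded - tileWidth) < srcWidth) &&
                            ((roiY + roiHeightRounded - tileHeight) < srcHeight)};
    return isRoiXYValid && isRoiWHValid;
}
```

For example, a 64x64 tensor with 16x16 tiles and a 1-pixel padded ROI of roi(-1, -1, 66, 66) passes both checks.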
When the dst() API is used the ROI must be entirely within the tensor extents.
An error will be encountered at the time cupva::CmdProgram::compileDataFlows() is called if the ROI violates any of the restrictions.
The ROI is defined with respect to the source tensor. The effect of halo is applied to the ROI. For example, to define a tile pattern with no overlap, but with 1 pixel of padding added to the tensor extents, call halo(0) and roi(-1, -1, srcWidth + 2, srcHeight + 2). To add padding of 2 pixels and enable overlap of 1 pixel, call halo(1) and still call roi(-1, -1, srcWidth + 2, srcHeight + 2).
Usage considerations
Allowed context for the API call
Thread-safe: No
API group
Init: Yes
Runtime: No
De-Init: No
- Parameters:
x – The start offset X
y – The start offset Y
width – The roi width in pixels
height – The roi height in pixels
- Returns:
TensorDataFlow& A reference to the current object.
-
TensorDataFlow &tile(int32_t const width, int32_t const height, int32_t const depth = 0)#
Set the TensorDataFlow tile size.
The tile size defines the 3D processing unit of tessellation to cover the tensor, not including halo(). The DMA will be configured to transfer the minimum number of tiles that completely tessellate the tensor.
Note that the DMA transfer may contain padding due to the ROI setting for any tiles within the ROI that extend beyond the tensor. This effect does not change the amount of data written to the tile buffer. See the roi() API for more details.
The depth parameter is optional. If not provided, it will default to the depth of the src/dst tensor. The value of this depth parameter must be less than 256 and cannot exceed the depth of the src/dst tensor.
When the depth of the source or destination tensor is not a multiple of the tile depth, the user's VPU kernel needs to handle the remainder in the C dimension properly to avoid processing garbage data.
Usage considerations
Allowed context for the API call
Thread-safe: No
API group
Init: Yes
Runtime: No
De-Init: No
- Parameters:
width – The tile width in pixels
height – The tile height in pixels
depth – The tile depth in pixels.
- Returns:
TensorDataFlow& A reference to the current object.
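The tessellation described above can be illustrated with a small helper (hypothetical, not a cuPVA API) that computes the tile count per dimension; per the class constraints, each count must not exceed 256 (at most 16777216 tiles total):

```cpp
#include <cstdint>

// Illustrative arithmetic (not a cuPVA API): the number of tiles needed to
// tessellate the W/H region of interest and the C depth, rounding up so
// partial tiles at the edges are counted.
struct TileCount
{
    int32_t w;
    int32_t h;
    int32_t d;
};

TileCount countTiles(int32_t roiWidth, int32_t roiHeight, int32_t depth,
                     int32_t tileW, int32_t tileH, int32_t tileD)
{
    auto const ceilDiv = [](int32_t a, int32_t b) { return (a + b - 1) / b; };
    return {ceilDiv(roiWidth, tileW), ceilDiv(roiHeight, tileH), ceilDiv(depth, tileD)};
}
```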
-
template<typename T>
inline TensorDataFlow &tileBuffer(T *const ptr)#
Set the tile buffer to move data to/from.
The tile buffer is used as the DMA transfer destination if the src() API is called to set the tensor buffer, or as the DMA transfer source if the dst() API is called.
TensorDataFlow can support either single or double buffering of tiles, with the mode selected implicitly based on how many tiles will fit into the buffer.
Like RasterDataFlow, TensorDataFlow only supports tile buffers in VMEM. See the helper macros in the device APIs to assist with reserving the appropriate sized buffer for each mode.
Usage considerations
Allowed context for the API call
Thread-safe: No
API group
Init: Yes
Runtime: No
De-Init: No
- Parameters:
ptr – The device pointer to the VMEM location of the tile buffer.
- Returns:
TensorDataFlow& A reference to the current object.
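The single/double buffering modes suggest the following sizing arithmetic. This is an illustrative assumption about what TDF_SINGLE()/TDF_DOUBLE() must account for (a halo-extended 3D tile, times one or two buffers so the DMA can fill one tile while the VPU processes the other); consult the device API macros for the authoritative sizes:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative sizing only (an assumption, not the actual TDF_SINGLE()/
// TDF_DOUBLE() definitions): one buffer holds a tile extended by the halo
// on every side in W and H; double buffering reserves two such tiles.
size_t tileBufferBytes(int32_t tileW, int32_t tileH, int32_t tileD,
                       int32_t haloX, int32_t haloY,
                       size_t bytesPerPixel, bool doubleBuffered)
{
    size_t const oneTile = static_cast<size_t>(tileW + 2 * haloX) *
                           static_cast<size_t>(tileH + 2 * haloY) *
                           static_cast<size_t>(tileD) * bytesPerPixel;
    return doubleBuffered ? 2 * oneTile : oneTile;
}
```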