RasterDataFlow#

Fully qualified name: cupva::RasterDataFlow

Defined in src/host/cpp_api/include/cupva_host.hpp

class RasterDataFlow : public cupva::BaseDataFlow#

RasterDataFlow is a DataFlow abstraction for processing tiles by raster scanning.

Using StaticDataFlow to raster scan an image requires the user to configure many parameters. Additionally, for irregularly sized images, or for use cases requiring tile overlap (halo) or padding, multiple StaticDataFlows are typically required, with complex orchestration of triggering.

RasterDataFlow is designed to abstract this process, both on the host side (StaticDataFlow creation) and on the device side (DMA triggering). Hardware features that accelerate this use case are taken advantage of automatically.

The RasterDataFlow can rasterize arbitrarily sized 1D vectors or 2D surfaces from DRAM/SRAM to tiles in VMEM. Tail tiles are handled implicitly: if the vector or surface ROI size is not divisible by the tile size, the last tile in each row or column of tiles is padded to a full tile. For filtering cases, halo (boundary padding and overlapped tiles) is supported.

RasterDataFlow also supports the reverse direction: copying tiles from a buffer in VMEM to a 1D vector or 2D surface in DRAM/SRAM. Halo is not supported when moving tiles from VMEM to DRAM/SRAM.

On the device side, RasterDataFlow (RDF) uses a different tile buffer layout depending on how many tiles fit in the tile buffer and which scan order flags are set (see tileBuffer() and scanOrder()). The device side macros RDF_SINGLE(), RDF_DOUBLE() and RDF_CIRCULAR() can be used to ensure that the tile buffer is large enough to achieve a desired layout. There are performance benefits to using RDF_DOUBLE and RDF_CIRCULAR, as VPU processing can be pipelined with DMA transfers. RDF_CIRCULAR also allows re-use of overlap between tiles (halo()).

If the device side handler is declared with VMEM_RDF_UNIFIED(), VPU code should use cupvaRasterDataFlowAcquire() and cupvaRasterDataFlowRelease() to sequence through the RDF. In this case, any pipelining between VPU and DMA will be handled automatically, and the same acquire/release pattern applies regardless of RDF_SINGLE, RDF_DOUBLE or RDF_CIRCULAR.
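For example, a minimal device-side loop under this unified pattern might look as follows. This is a sketch: the tile count, the processTile() kernel, and the exact prototypes of the acquire/release calls are illustrative assumptions; consult the device API reference for the authoritative signatures.

// Device-side sketch: sequence through all tiles via the unified handler.
// "numTiles" and "processTile" are hypothetical; prototypes are assumed.
for (int32_t i = 0; i < numTiles; ++i)
{
    // Block until the next tile (plus any halo) is settled in VMEM.
    uint32_t *tile = (uint32_t *)cupvaRasterDataFlowAcquire(hdl);
    processTile(tile); // VPU processing of the current tile
    // Signal that the tile is consumed so DMA may refill this buffer slot.
    cupvaRasterDataFlowRelease(hdl);
}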

If the user wishes to have increased control over DMA triggering, declare the handler as type RasterDataFlowHandler using the VMEM() macro. When working with manual triggering, careful attention must be paid to the derived VMEM buffer layout and its implications for the sequence of cupvaRasterDataFlowTrig() and cupvaRasterDataFlowSync() calls (a double-buffered sketch follows the list below):

  1. If only 1 tile can fit in the tile buffer, a single tile with halo is copied to the same address each time the handler is triggered. The device code must be careful to synchronize the transfer before accessing the tile buffer from VPU. See the device side macro RDF_SINGLE() for a helpful way to calculate the size of the tile buffer.

  2. If exactly two tiles can fit, or if at least two tiles can fit and the halo is 0, one tile with halo is copied into the tile buffer at an address that ping-pongs between two locations each time the handler is triggered. This allows the user to pre-fetch the next tile while the VPU accesses the current tile. The tiles are laid out in VMEM by stacking vertically, as illustrated below:

╭───────────────────────────────────╮
│               halo                │
│ ╭───────────────────────────────╮ │
│ │                               │ │
│ │                               │ │
│ │            tile1              │ │
│ │                               │ │
│ │                               │ │
│ ╰───────────────────────────────╯ │
│                                   │
├───────────────────────────────────┤
│              halo                 │
│ ╭───────────────────────────────╮ │
│ │                               │ │
│ │                               │ │
│ │            tile2              │ │
│ │                               │ │
│ │                               │ │
│ ╰───────────────────────────────╯ │
│                                   │
╰───────────────────────────────────╯

See the RDF_DOUBLE() macro for size calculation.

  3. If 3 or more tiles can fit and halo() is non-zero, one tile, including any necessary padding but excluding shared halo with the next tile in scan order, is copied each time the handler is triggered. By sharing overlap between neighboring tiles, the bandwidth pressure on DMA is reduced. The user must pre-fetch two tiles to completely fill the overlap region. Once 2 tiles are settled in the tile buffer, the user can then pre-fetch the third tile while the VPU accesses the first tile, then pre-fetch tile4 while processing tile2, and so on. The tiles are laid out in a circular buffer following the cupva::ScanOrderType flags. For example, with default flags a traditional top-left to bottom-right row major pattern is followed:

╭───────────────┬──┬─────────────┬──┬───────────────╮
│      halo     │  │   halo      │  │   halo        │
│ ╭─────────────┼──┼─────────────┼──┼─────────────╮ │
│ │             │  │             │  │             │ │
│ │             │  │             │  │             │ │
│ │    tile1    │  │   tile2     │  │   tile3     │ │
│ │             │  │             │  │             │ │
│ │             │  │             │  │             │ │
│ ╰─────────────┼──┼─────────────┼──┼─────────────╯ │
│               │  │             │  │               │
╰───────────────┴──┴─────────────┴──┴───────────────╯
                  ▲
                  │ overlap

The tile numbers would be reversed if the HORIZONTAL_REVERSED flag were set. If the COLUMN_MAJOR flag is set, then the tiles are stacked vertically rather than horizontally. See the RDF_CIRCULAR() macro in device APIs for size calculation.
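As a concrete sketch of the double-buffered (ping-pong) case above, a manually triggered loop could pipeline DMA and VPU work as follows. The prototypes, the tile count, and the slot-selection logic are illustrative assumptions; the actual slot addresses follow the layout rules described above.

// Manual double-buffering sketch (case 2 above). Names are hypothetical.
cupvaRasterDataFlowTrig(hdl);              // start fetching tile1
for (int32_t i = 0; i < numTiles; ++i)
{
    cupvaRasterDataFlowSync(hdl);          // wait for the in-flight tile
    if ((i + 1) < numTiles)
    {
        cupvaRasterDataFlowTrig(hdl);      // pre-fetch the next tile
    }
    // Tile addresses ping-pong between two slots in the tile buffer.
    uint32_t *tile = ((i % 2) == 0) ? tileSlot0 : tileSlot1;
    processTile(tile);                     // VPU processing of the current tile
}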

// Example 1: Rasterize a 1D vector in DRAM to 1D tiles in VMEM (no halo).
// The vector has 200 uint16_t-typed elements and the tile width is 16.
// The "hdl" symbol was declared in VPU code using VMEM_RDF_UNIFIED().
// Note: for brevity, these examples declare a standalone RasterDataFlow;
// in practice, DataFlows are created by and owned by cupva::CmdProgram (see below).
uint16_t *vectorPtr; // vector pointer to DRAM
RasterDataFlow rs;
rs.src(vectorPtr, 200, 1, 200)
  .tileBuffer(prog["tilePtr"].ptr<uint16_t>())
  .tile(16, 1)
  .handler(prog["hdl"]);

// Example 2: Rasterize a 2D image in DRAM to 2D tiles in VMEM.
// Assume that the VPU is going to do image filtering with a 5x5 filter kernel. The halo is 2.
// The image is 200 x 100 uint32_t pixels with a line pitch of 1024 bytes (256 pixels). The tile size is 16 x 16.
// After cupvaRasterDataFlowAcquire(), (16+2+2) x (16+2+2) = 20x20 pixels will be settled in the tileBuffer at the returned address.
// Data fetched for the halo outside the image extents will be filled with 0 (the default mode and val parameters for halo()).
uint32_t *imagePtr; // image pointer to DRAM
RasterDataFlow rs;
rs.src(imagePtr, 200, 100, 256)
  .tileBuffer(prog["tilePtr"].ptr<uint32_t>())
  .tile(16, 16)
  .halo(2)
  .handler(prog["hdl"]);

// Example 3: Like example 2, but operate on a 100x50 center crop of the image.
// Note that the DMA engine will only use data from the source image to fill the halo in this case,
// since no part of the ROI or halo will require data from outside the source image extents.
rs.src(imagePtr, 200, 100, 256)
  .tileBuffer(prog["tilePtr"].ptr<uint32_t>())
  .tile(16, 16)
  .roi(50, 25, 100, 50)
  .halo(2)
  .handler(prog["hdl"]);
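A fourth example sketches the reverse (VMEM-to-DRAM) direction; the "hdlOut" and "outTilePtr" symbol names are illustrative:

// Example 4 (sketch): Copy 16x16 uint32_t tiles from VMEM back to a
// 200 x 100 image in DRAM with the same 1024-byte line pitch.
// Halo is not supported in this direction (see above).
uint32_t *outPtr; // output image pointer to DRAM
RasterDataFlow rsOut;
rsOut.dst(outPtr, 200, 100, 256)
     .tileBuffer(prog["outTilePtr"].ptr<uint32_t>())
     .tile(16, 16)
     .handler(prog["hdlOut"]);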

Constraints:

  1. One of src() or dst() must be called to select writing to or reading from the tile buffer, respectively.

  2. The pointers used for src() or dst() must be in DRAM or SRAM.

  3. For the 2D rectangles defined by these APIs: halo() < tile() <= roi()

  4. When halo() != 0, dst() is not supported.

  5. The VMEM pointer used for tileBuffer() must reserve sufficient memory to support the desired tile transfer pattern. The user is responsible for ensuring this, and there are helpful macros provided in the device APIs: RDF_SINGLE(), RDF_DOUBLE(), RDF_CIRCULAR().

  6. When tile dimensions are larger than 255, the amount of padding required to read full tiles to VMEM may be greater than the maximum supported by the DMA engine (255). Careful attention must be paid to the src() and roi() dimensions to ensure that there are no tiles at the boundaries which would result in excessive padding.

  7. The number of tiles in each dimension must be less than or equal to 256 (maximum 65536 tiles total).

  8. The parameters provided to src(), dst() or roi() must define a valid 2D surface, considering the tile() size: width <= min(tileWidth * 256, linePitch), linePitch <= 65535, and height <= tileHeight * 256. For example, with tile(16, 16): width <= min(4096, linePitch) and height <= 4096.

Violation of the above constraints may result in an exception thrown from one of the RasterDataFlow APIs or from CmdProgram::compileDataFlows().

All DataFlows should be created by invoking methods on cupva::CmdProgram. DataFlows are owned by CmdProgram objects. Users should only operate on DataFlows via the non-owning references returned by CmdProgram objects.

Public Functions

RasterDataFlow &link(RasterDataFlow &next)#

Link the next RasterDataFlow object.

Linking to the next RasterDataFlow object means that when the current RasterDataFlow is consumed, execution logically moves to the next one. Linked RasterDataFlows cannot be triggered by the VPU in the same time period, so the DMA compiler tries to reuse hardware resources between them.
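For instance, two RasterDataFlows obtained from the same CmdProgram can be chained so that the second takes over once the first is consumed (a sketch; the object names are illustrative):

// Chain two RasterDataFlows; DMA hardware resources may be reused between them.
// "rdfLuma" and "rdfChroma" are hypothetical references obtained from a CmdProgram.
rdfLuma.link(rdfChroma);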

Usage considerations

  • Allowed context for the API call

    • Thread-safe: No

  • API group

    • Init: Yes

    • Runtime: No

    • De-Init: No

Parameters:

next – The next RasterDataFlow object’s reference

Returns:

RasterDataFlow& A reference to the current object.

RasterDataFlow &handler(const Parameter &handler)#

Set RasterDataFlow’s handler.

Usage considerations

  • Allowed context for the API call

    • Thread-safe: No

  • API group

    • Init: Yes

    • Runtime: No

    • De-Init: No

Parameters:

handler – The handler declared in VMEM and represented by the host-side Parameter

Throws:
  • cupva::Exception(InvalidArgument) – The handler is not from VMEM

  • cupva::Exception(InvalidState) – The object is not instantiated correctly

Returns:

RasterDataFlow& A reference to the current object.

template<typename T>
inline RasterDataFlow &src(
T &&op,
int32_t const width,
int32_t const height,
int32_t const linePitch = 0
)#

Set RasterDataFlow’s source surface.

This API is applicable to device pointers or OffsetPointers representing source memory in DRAM/SRAM.

If the pointer contains surface metadata, i.e., it was created by cupva::nvsci::mem::Import(), and the linePitch parameter is not provided (or is set to 0), then the line pitch will be inferred from the surface metadata.

If OffsetPointer is used rather than raw device pointer, the linePitch provided or inferred at the time this API is called will persist even if the base pointer metadata is later changed.

Only one of src() or dst() can be set for a RasterDataFlow.

Usage considerations

  • Allowed context for the API call

    • Thread-safe: No

  • API group

    • Init: Yes

    • Runtime: No

    • De-Init: No

Template Parameters:

T – The source buffer type. Can be either U* (raw pointer) or OffsetPointer<U>.

Parameters:
  • op – The device pointer or OffsetPointer reference to source memory.

  • width – The buffer width in U-typed pixels.

  • height – The buffer height in U-typed pixels.

  • linePitch – The source line-pitch in U-typed pixels (optional if the op base pointer is a cupva::mem::BufferType::SURFACE).

Returns:

RasterDataFlow& A reference to the current object.

template<typename T>
inline RasterDataFlow &dst(
T &&op,
int32_t const width,
int32_t const height,
int32_t const linePitch = 0
)#

Set RasterDataFlow’s destination surface.

This API is applicable to device pointers or OffsetPointers representing destination memory in DRAM/SRAM.

If the pointer contains surface metadata, i.e., it was created by cupva::nvsci::mem::Import(), and the linePitch parameter is not provided (or is set to 0), then the line pitch will be inferred from the surface metadata.

If OffsetPointer is used rather than raw device pointer, the linePitch provided or inferred at the time this API is called will persist even if the base pointer metadata is later changed.

Only one of src() or dst() can be set for a RasterDataFlow.

Usage considerations

  • Allowed context for the API call

    • Thread-safe: No

  • API group

    • Init: Yes

    • Runtime: No

    • De-Init: No

Template Parameters:

T – The destination buffer type. Can be either U* (raw pointer) or OffsetPointer<U>.

Parameters:
  • op – The device pointer or OffsetPointer reference to destination memory.

  • width – The buffer width in U-typed pixels.

  • height – The buffer height in U-typed pixels.

  • linePitch – The destination line-pitch in U-typed pixels (optional if the op base pointer is a cupva::mem::BufferType::SURFACE).

Returns:

RasterDataFlow& A reference to the current object.

RasterDataFlow &halo(
int32_t const num,
PadModeType const mode = PadModeType::PAD_CONST,
int32_t const val = 0
)#

Set RasterDataFlow’s halo information uniformly for all dimensions.

The halo is taken into consideration when rasterizing a vector/image from DRAM/SRAM into tiles in VMEM. Specifying non-zero halo causes each tile to fetch some additional ‘halo’ around the specified tile dimensions. The additional data comes from either the image data or the DMA’s padding engine, depending on whether the pixels to be fetched are within the bounds specified by the src() API.

A halo with num = 0 may be specified. The effect of this is to control how a partial tile is extended to a full tile in the absence of halo. RasterDataFlow always writes a full tile, and often there may not be sufficient data in the src buffer to fill one (this happens near the edges of an image whose dimensions are not a multiple of the tile dimensions). The user may choose to fill the remaining parts of such tiles with either boundary pixel extension or constant value padding by using this API.
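For example, to fetch a 2-pixel halo and fill any out-of-bounds pixels with a constant value of 255 (a sketch, continuing the "rs" object from the examples above):

// 2-pixel halo; pixels outside the src() bounds are padded with 255.
rs.halo(2, PadModeType::PAD_CONST, 255);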

Usage considerations

  • Allowed context for the API call

    • Thread-safe: No

  • API group

    • Init: Yes

    • Runtime: No

    • De-Init: No

Parameters:
  • num – The halo size in pixels.

  • mode – The padding mode to be applied to the input.

  • val – The padding value.

Throws:
  • cupva::Exception(InvalidArgument) – The halo size is out-of-range

  • cupva::Exception(InvalidState) – The object is not instantiated correctly

Returns:

RasterDataFlow& A reference to the current object.

RasterDataFlow &halo(
int32_t const numX,
int32_t const numY,
PadModeType const mode = PadModeType::PAD_CONST,
int32_t const val = 0
)#

Set RasterDataFlow’s halo information separately for X and Y dimensions.

The halo is taken into consideration when rasterizing a vector/image from DRAM/SRAM into tiles in VMEM. Specifying non-zero halo causes each tile to fetch some additional ‘halo’ around the specified tile dimensions. The additional data comes from either the image data or the DMA’s padding engine, depending on whether the pixels to be fetched are within the bounds specified by the src() API.

A halo with num = 0 may be specified. The effect of this is to control how a partial tile is extended to a full tile in the absence of halo. RasterDataFlow always writes a full tile, and often there may not be sufficient data in the src buffer to fill one (this happens near the edges of an image whose dimensions are not a multiple of the tile dimensions). The user may choose to fill the remaining parts of such tiles with either boundary pixel extension or constant value padding by using this API.
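For example, a filter with a 5x3 kernel needs 2 pixels of horizontal halo but only 1 pixel of vertical halo (a sketch using the "rs" object from the examples above):

// Asymmetric halo: 2 pixels in X, 1 pixel in Y, with default constant-0 padding.
rs.halo(2, 1);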

Usage considerations

  • Allowed context for the API call

    • Thread-safe: No

  • API group

    • Init: Yes

    • Runtime: No

    • De-Init: No

Parameters:
  • numX – The halo size in pixels for the horizontal X dimension.

  • numY – The halo size in pixels for the vertical Y dimension.

  • mode – The padding mode to be applied to the input.

  • val – The padding value.

Throws:
  • cupva::Exception(InvalidArgument) – The halo size is out-of-range

  • cupva::Exception(InvalidState) – The object is not instantiated correctly

Returns:

RasterDataFlow& A reference to the current object.

RasterDataFlow &roi(
int32_t const x,
int32_t const y,
int32_t const width,
int32_t const height
)#

Set the RasterDataFlow’s surface region of interest.

RasterDataFlow supports a region of interest with respect to the surface dimensions set by the src/dst API.

When the src() API is used, i.e., DMA is transferring into the tileBuffer, the ROI may be set larger than the surface to add additional padding for boundary tiles. Each tile within the ROI must contain at least one pixel of overlap with the source surface in each dimension. E.g., for a normal forward raster scan (ScanOrderType == 0), the limits on the ROI can be checked using the following expressions:

To check the ROI (x,y):

bool isRoiXYValid{(roiX + tileWidth > 0) && (roiY + tileHeight > 0)};

To check the ROI (width, height):

int32_t roiWidthRounded{((roiWidth + (tileWidth - 1)) / tileWidth) * tileWidth};
int32_t roiHeightRounded{((roiHeight + (tileHeight - 1)) / tileHeight) * tileHeight};
bool isRoiWHValid{((roiX + roiWidthRounded - tileWidth) < srcWidth) &&
                  ((roiY + roiHeightRounded - tileHeight) < srcHeight)};

When the dst() API is used the ROI must be entirely within the surface extents.

An error will be encountered at the time cupva::CmdProgram::compileDataFlows() is called if the ROI violates any of the restrictions.

The ROI is defined with respect to the source image. The effect of halo is applied to the ROI. For example, to define a tile pattern with no overlap, but with 1 pixel of padding added to the image extents, call halo(0) and roi(-1, -1, srcWidth + 2, srcHeight + 2). To add padding of 2 pixels and enable overlap of 1 pixel, call halo(1) and still call roi(-1, -1, srcWidth + 2, srcHeight + 2).
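Expressed as code, the two cases from the preceding paragraph might read as follows (a sketch; srcWidth and srcHeight are the dimensions passed to src()):

// 1 pixel of padding around the image extents, no tile overlap:
rs.halo(0).roi(-1, -1, srcWidth + 2, srcHeight + 2);
// 2 pixels of padding and 1 pixel of overlap between tiles:
rs.halo(1).roi(-1, -1, srcWidth + 2, srcHeight + 2);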

Usage considerations

  • Allowed context for the API call

    • Thread-safe: No

  • API group

    • Init: Yes

    • Runtime: No

    • De-Init: No

Parameters:
  • x – The start offset X

  • y – The start offset Y

  • width – The roi width in pixels

  • height – The roi height in pixels

Throws:
  • cupva::Exception(InvalidArgument) – The roi size is out-of-range

  • cupva::Exception(InvalidState) – The object is not instantiated correctly

Returns:

RasterDataFlow& A reference to the current object.

RasterDataFlow &tile(int32_t const width, int32_t const height)#

Set the RasterDataFlow tile size.

The tile size defines the 2D unit of tessellation to cover the surface, not including halo(). The DMA will be configured to transfer the minimum number of tiles that completely tessellate the surface.

For reading to VMEM, a full tile plus halo will always be available in VMEM for each DMA transfer. See halo() for more information. If reading a full tile plus halo would cause the DMA engine to read data outside of legal surface bounds, the DMA engine’s padding capability will be used instead to fill the tile. For tile dimensions larger than 255 pixels, this may require padding more pixels than can be supported by the DMA engine. Such DataFlow configurations will fail to compile.

Note that the DMA transfer may contain padding due to the ROI setting for any tiles within the ROI that extend beyond the surface. This effect does not change the amount of data written to the tile buffer. See the roi() API for more details.

Usage considerations

  • Allowed context for the API call

    • Thread-safe: No

  • API group

    • Init: Yes

    • Runtime: No

    • De-Init: No

Parameters:
  • width – The tile width in pixels

  • height – The tile height in pixels

Throws:
  • cupva::Exception(InvalidArgument) – The tile size is out-of-range

  • cupva::Exception(InvalidState) – The object is not instantiated correctly

Returns:

RasterDataFlow& A reference to the current object.

RasterDataFlow &scanOrder(uint32_t const scanOrder = 0U)#

Set the RasterDataFlow’s scan order.

The scan order defines how tiles are traversed from/to the src/dst surface. There are 8 possible scan order patterns which are expressible by bitwise ORing ScanOrderType enum values.

Note that the scanOrder() API changes how tiles are read from or written to the src()/dst() surface buffer, and may also affect where each tile is placed within the tileBuffer().
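For example, to traverse tiles column by column with columns visited from right to left (a sketch; HORIZONTAL_REVERSED and COLUMN_MAJOR are the ScanOrderType flags referenced earlier on this page, combined by bitwise OR as described above):

// Column-major traversal with horizontally reversed order.
rs.scanOrder(ScanOrderType::COLUMN_MAJOR | ScanOrderType::HORIZONTAL_REVERSED);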

Usage considerations

  • Allowed context for the API call

    • Thread-safe: No

  • API group

    • Init: Yes

    • Runtime: No

    • De-Init: No

Parameters:

scanOrder – The tile traversal order as a bitmask.

Throws:

cupva::Exception(InvalidState) – The object is not instantiated correctly

Returns:

RasterDataFlow& A reference to the current object.

RasterDataFlow &tileArena(int32_t const width, int32_t const height)#

Set the arena enclosing the tile.

RasterDataFlow can support reading a tile from within a larger 2D arena of memory. This can be useful to stream tiles cropped from an oversized buffer in VMEM to the destination surface. One common use case is when the VPU has produced data with some “slop”, such as due to vector size constraints or alignment to support transposed operations.

This API has no effect if the src() API is used to set the surface buffer, i.e., tileArena is not supported when the tile buffer is the destination of the DMA transfer.

The tileArena must fully contain the tile().
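For example, to stream 16x16 tiles cropped out of 32x16 VMEM arenas (a sketch reusing the illustrative names from Example 4 above):

// The VPU produced 32x16 arenas (e.g. padded for vector alignment); DMA writes
// only the 16x16 tile cropped from each arena to the destination surface.
rsOut.dst(outPtr, 200, 100, 256)
     .tileBuffer(prog["outTilePtr"].ptr<uint32_t>())
     .tile(16, 16)
     .tileArena(32, 16)
     .handler(prog["hdlOut"]);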

Usage considerations

  • Allowed context for the API call

    • Thread-safe: No

  • API group

    • Init: Yes

    • Runtime: No

    • De-Init: No

Parameters:
  • width – The width of the tile arena

  • height – The height of the tile arena

Throws:

cupva::Exception(InvalidState) – The object is not instantiated correctly

Returns:

RasterDataFlow& A reference to the current object.

RasterDataFlow &transpose(TranspositionMode const mode = TRANS_MODE_1)#

Set the transposition mode.

Transposition mode controls the number of consecutive data points that the transpose load operation reads before applying the line pitch address offset.

There are a few constraints regarding the usage of transpose intrinsics: in total, six transposition modes (T1/T2/T4/T8/T16/T32) are supported. However, due to HW alignment constraints, only T1 mode supports all PVA data types (Byte/Halfword/Word). Please refer to section [6.3.7] in the Orin PVA VPU Programmers Guide for the line pitch and data type constraints of each mode.

Usage considerations

  • Allowed context for the API call

    • Thread-safe: No

  • API group

    • Init: Yes

    • Runtime: No

    • De-Init: No

Parameters:

mode – The transposition mode

Throws:

cupva::Exception(InvalidState) – The object is not instantiated correctly

Returns:

RasterDataFlow& A reference to the current object.

template<typename T>
inline RasterDataFlow &tileBuffer(
T *const ptr,
T *const ptr2 = nullptr
)#

Set tile buffer to rasterize to/from.

The tile buffer is used as the DMA transfer destination if the src() API is called to set the surface buffer, or as the DMA transfer source if the dst() API is called.

RasterDataFlow can support either single or double buffering of tiles, with the mode selected implicitly based on how many tiles will fit into the buffer.

If only one tile can fit, then single buffer mode is used. The user must ensure that the trigger pattern used in device code serializes DMA and VPU access to the tile buffer so that there are no data hazards.

If two or more tiles can fit, then double buffer mode is used. In double buffered mode, the user is encouraged to trigger DMA to pre-fetch one tile ahead while VPU works on the current tile, thereby hiding the DMA latency.

RasterDataFlow only supports tile buffers in VMEM. See the helper macros in the device APIs to assist with reserving the appropriate sized buffer for each mode.

When the device pointer provided was declared as a circular buffer in VMEM, overlap between adjacent tiles is reused to save DMA bandwidth. When using a circular buffer and non-zero halo(), the user must trigger DMA twice before processing the first tile in order to fill the overlap region. From then on, the behavior is the same as double buffer mode. Single buffer mode is not supported when using a circular buffer.

The optional second pointer parameter allows the DMA transfer to be fully replicated into two tile buffers at once. The ptr2 address must be 64B aligned, and the replica buffer must be at least the same size as the primary tile buffer. Full replication supports only pixel types of 1 or 2 bytes.
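For example, to replicate each transferred tile into two VMEM buffers at once (a sketch; the symbol names are illustrative, and the second buffer must satisfy the alignment and size requirements above):

// Replicate each 1-byte-per-pixel tile transfer into two VMEM tile buffers.
rs.tileBuffer(prog["tileA"].ptr<uint8_t>(), prog["tileB"].ptr<uint8_t>());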

Usage considerations

  • Allowed context for the API call

    • Thread-safe: No

  • API group

    • Init: Yes

    • Runtime: No

    • De-Init: No

Parameters:
  • ptr – The device pointer to the VMEM location of the tile buffer.

  • ptr2 – The device pointer to the replica tile buffer.

Throws:
  • cupva::Exception(InvalidArgument) – The buffer is not in VMEM

  • cupva::Exception(InvalidState) – The object is not instantiated correctly

Returns:

RasterDataFlow& A reference to the current object.