BltoPl#

Overview#

PL (Pitch Linear) is the layout of most ordinary images: elements are arranged row by row, and elements within the same row are consecutive in memory. BL (Block Linear) [1] is a proprietary layout. Compared to PL, at a granularity of 32 bytes x 2 rows (referred to as 32x2), BL permutes the elements as follows (the 32x2 indexing is illustrated after the list):

  1. Bytes within each 32x2 block are permuted.

  2. Different 32x2 blocks within the image are also permuted.
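For intuition about the 32x2 granularity, the following plain C++ sketch maps a byte's PL coordinates to its enclosing 32x2 block and its offset within that block. This only illustrates the granularity; the actual byte and block permutations that BL applies on top of it are proprietary and are not shown here.

```cpp
#include <cstdint>

// Maps the PL coordinates (xByte, yRow) of a byte within a plane to its
// enclosing 32-byte x 2-row block and its offset inside that block.
struct BlockIndex
{
    uint32_t blockCol; // which 32-byte-wide column of blocks
    uint32_t blockRow; // which 2-row-tall row of blocks
    uint32_t inBlockX; // byte offset within the block, 0..31
    uint32_t inBlockY; // row offset within the block, 0..1
};

constexpr BlockIndex ToBlockIndex(uint32_t xByte, uint32_t yRow)
{
    return {xByte / 32u, yRow / 2u, xByte % 32u, yRow % 2u};
}

// Example: the byte at (x=70, y=5) sits in block column 2, block row 2,
// at offset (6, 1) inside that 32x2 block.
static_assert(ToBlockIndex(70u, 5u).blockCol == 2u, "block column");
static_assert(ToBlockIndex(70u, 5u).inBlockX == 6u, "offset within block");
```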

NV12 is a YUV semi-planar format with two planes: one for the Y plane and the other for the interleaved UV plane. The UV plane is subsampled, with its width and height each being half of the Y plane's. For both PL and BL, the layout applies on a per-plane basis, i.e., blocks are never permuted across planes.
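As a concrete illustration of the NV12 plane arithmetic, the sketch below (plain C++, independent of the PVA SDK) computes the two plane sizes for a tightly packed image; real surfaces may add line pitch and alignment padding on top of this.

```cpp
#include <cstddef>
#include <cstdint>

// Plane sizes for a tightly packed NV12 image: the Y plane holds width x height
// bytes, and the interleaved UV plane is subsampled 2x in both dimensions, so it
// holds (width/2) x (height/2) U/V byte pairs, i.e. width x (height/2) bytes.
struct Nv12Extents
{
    size_t yBytes;   // size of the Y plane
    size_t uvBytes;  // size of the interleaved UV plane
    size_t uvOffset; // offset of the UV plane when the planes are packed back to back
};

constexpr Nv12Extents Nv12Sizes(uint32_t width, uint32_t height)
{
    return {static_cast<size_t>(width) * height,
            static_cast<size_t>(width) * (height / 2u),
            static_cast<size_t>(width) * height};
}

// Example: a 1920x1080 NV12 image has a 1920x1080-byte Y plane and a
// 1920x540-byte UV plane (960x540 interleaved U/V pairs).
static_assert(Nv12Sizes(1920u, 1080u).yBytes == 1920u * 1080u, "Y plane size");
static_assert(Nv12Sizes(1920u, 1080u).uvBytes == 1920u * 540u, "UV plane size");
```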

Design Requirements#

  1. For deployment use cases, the input is an NvSci-imported buffer. Refer to the PVA SDK documentation and tutorials for the cuPVA NvSci Interoperability APIs.

  2. For standalone use cases, the input is cuPVA-allocated memory. In this case, the PVA SDK's utility functions can be used to convert between PL and BL, which is useful for bit-exactness comparison or profiling.

Deployment use cases are those where the PVA must interact with other IPs; a specific pipeline typically involves a series of IPs. Standalone use cases are applications that use only the PVA.

Implementation Details#

PVA SDK Interface#

Some dataflows, such as RasterDataFlow (RDF), can convert between BL and PL on the fly. For example, when a dataflow transfers data from DRAM to VMEM and the layout in DRAM is BL, a properly configured dataflow delivers the data in VMEM in PL layout. As a result, if this capability is overlooked, the source code may appear to do nothing more than copy memory from DRAM to VMEM and then from VMEM back to DRAM.

The PVA SDK provides a largely transparent programming interface for configuring the conversion between BL and PL. It tracks the surface's metadata within the PVA host's context and associates this metadata with a device pointer pointing to the top-left of the surface. For our use case, the surface can be thought of as a block of device memory that stores the image. This design can be explained further as follows:

  1. The PVA host-side API determines whether metadata is associated with a device pointer, i.e., whether the pointer lies within a plane of a surface or is a raw pointer.

  2. If it finds that the surface has a BL layout, it configures the dataflow to convert between BL and PL, since the layout within VMEM is always assumed to be PL. Note that this conversion happens only when transferring data from DRAM to VMEM or from VMEM to DRAM.

  3. The non-public details of BL remain hidden within the metadata. Calling the cupva::mem::GetSurfaceAttributes function stores the public metadata in a cupva::mem::SurfaceAttributes, e.g., whether the layout is BL or PL, the number of planes, the offsets between planes, and the width, height, and line pitch.

Based on this design, the details of BL are not exposed to users. The host-side APIs behave consistently for raw pointers and surfaces, and no extra dataflow APIs such as isBLSurface or isPLSurface are needed.
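A minimal sketch of querying this metadata on the host is shown below. The function and struct names cupva::mem::GetSurfaceAttributes and cupva::mem::SurfaceAttributes come from this section, but the header name, the exact call shape, and the attribute field layout are assumptions for illustration; consult the cuPVA host headers for the real signatures.

```cpp
#include <cupva_host.hpp> // assumed header name for the cuPVA host API

void InspectSurface(void *devicePtr)
{
    cupva::mem::SurfaceAttributes attrs{};
    cupva::mem::GetSurfaceAttributes(devicePtr, attrs); // assumed call shape (may instead return the struct)

    // attrs now holds the public metadata described above: whether the layout
    // is BL or PL, the number of planes, the per-plane offsets, and the width,
    // height, and line pitch. The proprietary BL permutation itself stays hidden.
}
```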

Dataflow Configuration#

Note that there are additional restrictions on the PL parameters that describe a BL layout:

  1. The top-left of a BL surface must be aligned to 512 bytes.

  2. Source address, destination address, and advance must be aligned to 64 bytes.

Violating these restrictions results in a dataflow compilation failure.
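The sketch below shows illustrative pre-checks mirroring these restrictions; the 512-byte and 64-byte values come from this section, while treating device addresses as plain integers is a simplification for illustration and is not part of the SDK.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if value is a multiple of alignment.
constexpr bool IsAligned(uint64_t value, uint64_t alignment)
{
    return (value % alignment) == 0u;
}

// Illustrative checks of the BL dataflow parameter restrictions listed above.
void CheckBlDataFlowParams(uint64_t surfaceTopLeft, uint64_t srcAddr, uint64_t dstAddr, uint64_t advanceBytes)
{
    assert(IsAligned(surfaceTopLeft, 512u)); // top-left of the BL surface: 512-byte aligned
    assert(IsAligned(srcAddr, 64u));         // source address: 64-byte aligned
    assert(IsAligned(dstAddr, 64u));         // destination address: 64-byte aligned
    assert(IsAligned(advanceBytes, 64u));    // advance: 64-byte aligned
}
```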

High-Level and Low-Level API#

The RDF’s lower-level device-side APIs, i.e., cupvaRasterDataFlowSync and cupvaRasterDataFlowTrig, are used instead of the higher-level APIs, i.e., cupvaRasterDataFlowOpen, cupvaRasterDataFlowAcquire, cupvaRasterDataFlowRelease, and cupvaRasterDataFlowClose. This preference is due to the manual control that the lower-level APIs provide, which is essential for the specific requirements of this application.
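One benefit of the lower-level calls is explicit control over when tile transfers are triggered and waited on, e.g., to overlap DMA with per-tile work. The sketch below shows that trigger/sync pattern using self-contained stand-ins (TrigNextTile, SyncTile, ProcessCurrentTileInVmem) rather than the real device-side calls; the actual cupvaRasterDataFlowTrig/cupvaRasterDataFlowSync signatures and the handler setup are defined by the PVA SDK device API and are not reproduced here.

```cpp
#include <cstdint>

inline void TrigNextTile() {}             // stand-in for cupvaRasterDataFlowTrig
inline void SyncTile() {}                 // stand-in for cupvaRasterDataFlowSync
inline void ProcessCurrentTileInVmem() {} // per-tile work; for this operator, essentially just moving bytes along

// Schematic per-tile pipeline on the VPU: trigger the next tile's transfer
// before working on the current one, and sync before touching a new tile.
void ProcessAllTiles(int32_t numTiles)
{
    TrigNextTile();                       // kick off the transfer of the first tile
    for (int32_t t = 0; t < numTiles; ++t)
    {
        SyncTile();                       // wait until tile t has landed in VMEM (already PL)
        if (t + 1 < numTiles)
        {
            TrigNextTile();               // overlap the transfer of tile t+1 with work on tile t
        }
        ProcessCurrentTileInVmem();       // hand tile t onward (e.g., to the output dataflow)
    }
}
```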

Tiling Size#

The tiling size for this operator is set to \(tileX=128\) and \(tileY=64\). There is no need for overlapping tiles; the RDF dataflow handles padding if the image width or height is not a multiple of tileX or tileY.

The DMA hardware works most efficiently when tileX is a multiple of 64 and the tile size is around 10 KiB. Larger tile sizes reduce dataflow overhead but also increase the overhead for tiles near image boundaries when padding is needed.

Currently, the tile size is set as two compile-time constants because double buffering is declared with the RDF_DOUBLE macro, whose arguments include the constant width and height. Some RDF functions choose between single, double, and circular buffers based on buffer size constraints.

Therefore, the current interface does not support pre-allocating a large buffer and then selecting suitable tileX and tileY at runtime. For extremely small images, the tile size can be set to \(tileX=32\) and \(tileY=2\); otherwise, some fields in the RDF dataflow may saturate.
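For reference, the tile arithmetic implied by these constants is straightforward. The sketch below (plain C++, with illustrative constant names TILE_X/TILE_Y rather than the SDK's) shows the ceiling division used to count tiles and how much boundary padding a 1920x1080 Y plane incurs with the default 128x64 tiles, which are 8 KiB each and thus within the ~10 KiB guidance above.

```cpp
#include <cstdint>

// Compile-time tile constants as described above (illustrative names).
constexpr int32_t TILE_X = 128; // bytes per tile row, a multiple of 64 for DMA efficiency
constexpr int32_t TILE_Y = 64;  // rows per tile; 128 x 64 = 8 KiB per tile

// Ceiling division: number of tiles needed to cover a dimension, including the
// partially filled (padded) tiles at the right/bottom image boundaries.
constexpr int32_t DivUp(int32_t value, int32_t tile)
{
    return (value + tile - 1) / tile;
}

// Example: a 1920x1080 Y plane needs 15 x 17 tiles; the last tile row covers
// rows 1024..1087, so 8 of its 64 rows are boundary padding handled by the RDF dataflow.
static_assert(DivUp(1920, TILE_X) == 15, "tiles along width");
static_assert(DivUp(1080, TILE_Y) == 17, "tiles along height");
```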

Performance#

| Image Size | Image Format | Execution Time | Submit Latency | Total Power |
|------------|--------------|----------------|----------------|-------------|
| 1920x1080  | NV12         | 0.432ms        | 0.019ms        | 12.633W     |
| 1920x1200  | NV12         | 0.479ms        | 0.020ms        | 12.632W     |
| 1920x1536  | NV12         | 0.606ms        | 0.020ms        | 12.634W     |
| 1936x1216  | NV12         | 0.665ms        | 0.020ms        | 13.045W     |
| 3840x2160  | NV12         | 1.675ms        | 0.021ms        | 12.643W     |

For detailed information on interpreting the performance table above and understanding the benchmarking setup, see Performance Benchmark.

Reference#

  1. https://docs.nvidia.com/drive/drive_os_5.1.6.1L/nvvib_docs/index.html#page/DRIVE_OS_Linux_SDK_Development_Guide/NvMedia/nvmedia_concept_surface.html