MixChannels#
Overview#
The MixChannels operator performs a series of copy operations from a set of input channels to a set of output channels. The inputs and outputs may be given as any number of input and output tensors, where each tensor may have one or more channels.
Algorithm Implementation#
The algorithm is implemented as a pixel-wise copy function: it reads each input channel and writes it to the corresponding output channel.
Currently, the MixChannels operator supports the following two cases of the enum PVAMixChannelsCode:
- SPLIT_RGBA8_TO_U8 = 0: splits 1 RGBA8 tensor into 4 U8 tensors.
- MERGE_U8_TO_RGBA8 = 2: merges 4 U8 tensors into 1 RGBA8 tensor.
An RGBA8 tensor is defined as an NVCVTensor with the “HWC” format and C = 4. Similarly, a U8 tensor is defined as an NVCVTensor with the “HWC” format and C = 1. Since this is a copy operation, the resolutions, that is, the heights and widths, of the input and output tensors must be identical.
More cases will be supported in the future.
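The semantics of the two cases can be sketched as a plain CPU reference in Python. This is a hypothetical illustration of the copy behavior on interleaved HWC buffers, not the PVA implementation:

```python
def split_rgba8_to_u8(rgba, height, width):
    """Split one interleaved HWC RGBA8 buffer (C = 4) into four U8 planes (C = 1)."""
    planes = [[0] * (height * width) for _ in range(4)]
    for p in range(height * width):
        for c in range(4):
            planes[c][p] = rgba[p * 4 + c]
    return planes

def merge_u8_to_rgba8(planes, height, width):
    """Merge four U8 planes back into one interleaved HWC RGBA8 buffer."""
    rgba = [0] * (height * width * 4)
    for p in range(height * width):
        for c in range(4):
            rgba[p * 4 + c] = planes[c][p]
    return rgba

# Round trip on a tiny 2x2 image: split then merge restores the original.
rgba = list(range(2 * 2 * 4))
planes = split_rgba8_to_u8(rgba, 2, 2)
assert planes[0] == [0, 4, 8, 12]              # R plane: every 4th byte
assert merge_u8_to_rgba8(planes, 2, 2) == rgba
```

Since no color or range conversion is involved, the two cases are exact inverses of each other, as the round trip above checks.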
Requirements#
The sizes of the input and output tensors must match; the algorithm does not perform chroma up/downsampling.
No color or pixel range conversions are performed between input and output formats.
Dataflows#
In the case of SPLIT_RGBA8_TO_U8, a single RasterDataFlow is used to transfer the input RGBA8 tensor, while four separate RasterDataFlows are used to transfer the four output U8 tensors. These four output RasterDataFlows share the same handler and are triggered simultaneously. Conversely, for MERGE_U8_TO_RGBA8, four input RasterDataFlows are used to transfer four U8 tensors, and a single output RasterDataFlow is used to transfer one RGBA8 tensor.
The tile size is 64x64 when C = 1 (U8 tensor) and 256x64 when C = 4 (RGBA8 tensor). The T1 transposition mode is set on all dataflows so that the line pitch is applied for transposed loads and stores.
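For a 1920x1080 frame, the tile grid implied by these sizes can be checked with a little arithmetic. This sketch assumes tiles are simply ceil-divided over the frame, which may not account for SDK-specific padding:

```python
import math

def tile_grid(width, height, tile_w, tile_h):
    """Number of tiles covering a width x height frame, assuming plain ceil division."""
    return math.ceil(width / tile_w) * math.ceil(height / tile_h)

# U8 tensors: 64x64 tiles over 1920x1080 pixels.
u8_tiles = tile_grid(1920, 1080, 64, 64)          # 30 * 17 = 510 tiles per plane
# RGBA8 tensor: 256-wide tiles in the WC direction (64 pixels * 4 channels = 256).
rgba_tiles = tile_grid(1920 * 4, 1080, 256, 64)   # same tile count
assert u8_tiles == rgba_tiles == 510
```

Note that both tile shapes cover 64 pixels per line, so the RGBA8 dataflow and each U8 dataflow walk the same 510-tile grid, which is what lets the four output (or input) dataflows share one handler and trigger together.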
VPU Kernels#
The VPU address generator (agen) based load and store operations are used to reorganize the plane/channel arrangement. During a transposed load, a vector of data is loaded along the H direction; during a transposed store, the corresponding vector is stored along the H direction. In the width-channel (WC) direction, the kernel can split one stream into four or merge four into one. For example, in the case of SPLIT_RGBA8_TO_U8, four consecutive transposed loads each load a vector of tensor data along the H direction, with each load advanced by one pixel in the WC direction. Correspondingly, four transposed stores then write the data into four separate output buffers.
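The deinterleaving effect of advancing one pixel in the WC direction can be modeled with a scalar address calculation. The addressing below is a hypothetical illustration of the pattern, not the actual agen configuration:

```python
def transposed_load(buf, base, line_pitch, vec_len):
    """Load a vector along the H direction: consecutive elements are one line apart."""
    return [buf[base + h * line_pitch] for h in range(vec_len)]

# A 4x2 interleaved RGBA8 tile; each element value encodes (row * 100 + WC offset).
H, W, C = 4, 2, 4
line_pitch = W * C                       # one tile line in the WC direction
tile = [h * 100 + wc for h in range(H) for wc in range(W * C)]

# Four consecutive transposed loads, each advanced by one element (one channel) in WC:
vectors = [transposed_load(tile, c, line_pitch, H) for c in range(C)]
assert vectors[0] == [0, 100, 200, 300]  # R samples of column 0, one per line
assert vectors[2] == [2, 102, 202, 302]  # B samples of column 0, one per line
```

Each resulting vector holds one channel only, so it can be stored contiguously into its own U8 output buffer; the merge case runs the same pattern with loads and stores swapped.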
Although a ping-pong buffer scheme is adopted to pipeline DMA transfers and VPU processing, the VPU kernels are so simple that the operator remains clearly IO-bound.
Performance#
ImageSize | MixChannelsCode | Execution Time | Submit Latency | Total Power
---|---|---|---|---
1920x1080 | SPLIT_RGBA8_TO_U8 | 0.704 ms | 0.023 ms | 14.594 W
1920x1080 | MERGE_U8_TO_RGBA8 | 0.573 ms | 0.023 ms | 14.894 W
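As a rough sanity check on the IO-bound claim, the combined DMA traffic implied by the table can be estimated with simple arithmetic. This ignores any overhead accounting in the benchmark itself:

```python
# SPLIT_RGBA8_TO_U8 on 1920x1080: read 4 bytes/pixel, write 4 x 1 byte/pixel.
pixels = 1920 * 1080
bytes_moved = pixels * 4 * 2             # RGBA8 input plus four U8 output planes
exec_time_s = 0.704e-3                   # execution time from the table
bandwidth_gbs = bytes_moved / exec_time_s / 1e9
assert round(bandwidth_gbs, 1) == 23.6   # roughly 23.6 GB/s of combined traffic
```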
For detailed information on interpreting the performance table above and understanding the benchmarking setup, see Performance Benchmark.