BackgroundSubtractor#
Overview#
Background subtraction is an algorithm used to separate foreground objects from the background in a continuous video sequence. Given the frames of the video as input, it produces a binary (or ternary, if shadow regions are included) foreground mask and, if enabled, an estimate of the static background image.
Algorithm Description#
For each pixel, a group of gaussian models is created based on the illumination variations of the video sequence. For a pixel from the image, the GMM can be described as:

\[p(\vec{x}) = \sum_{m=1}^{M} \pi_m \, \mathcal{N}\!\left(\vec{x};\, \vec{\mu}_m,\, \vec{\delta}_m^{\,2} I\right)\]

where:
\(\vec{x}\) is the pixel value
\(M\) is the total number of gaussian models
\(\pi_m\) is the weight of the m-th gaussian model
\(\vec{\mu}_m\) is the mean of the m-th gaussian model
\(\vec{\delta}_m\) is the standard deviation of the m-th gaussian model
The Mahalanobis distance is used to measure the distance between a point and a distribution; for more information, see [2]. Here we denote by \(D(\vec{x}, m)\) the Mahalanobis distance between the pixel value \(\vec{x}\) and the m-th model.
For each new pixel from the video sequence, the algorithm calculates the Mahalanobis distance \(D(\vec{x}, m)\) against each gaussian model. If the distance is within the threshold, the model is considered matched. Otherwise, a new gaussian model is created, provided the maximum number of models has not been reached. Based on the weight of the matched model, the pixel is classified as foreground or background. For more information, see [1].
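The per-pixel matching step described above can be sketched as follows. This is a minimal scalar (grayscale) illustration of the general scheme; the function name, the simplified update rule, and the threshold constants are assumptions for illustration, not the PVA implementation.

```python
MAX_MODELS = 5           # maximum gaussian models per pixel (assumed)
VAR_THRESHOLD = 16.0     # squared Mahalanobis distance threshold (assumed)
BG_WEIGHT_THRESHOLD = 0.9  # hypothetical weight cutoff for "background"

def classify_pixel(x, weights, means, variances, lr=0.05):
    """Match a scalar pixel value against its gaussian models.

    Returns (is_foreground, weights, means, variances)."""
    weights, means, variances = list(weights), list(means), list(variances)
    for m in range(len(means)):
        # squared Mahalanobis distance in the scalar case
        d2 = (x - means[m]) ** 2 / variances[m]
        if d2 < VAR_THRESHOLD:
            # model matched: update only this model
            weights[m] += lr * (1.0 - weights[m])
            means[m] += lr * (x - means[m])
            variances[m] += lr * ((x - means[m]) ** 2 - variances[m])
            # background when the matched model carries enough weight
            return weights[m] < BG_WEIGHT_THRESHOLD, weights, means, variances
    # no model matched: start a new one if there is room; pixel is foreground
    if len(means) < MAX_MODELS:
        weights.append(lr)
        means.append(float(x))
        variances.append(15.0)  # assumed initial variance
    return True, weights, means, variances
```

For example, a pixel close to a heavily weighted model is classified as background, while an outlier spawns a new low-weight model and is reported as foreground.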
Implementation Details#
Each pixel is modelled by a gaussian mixture model, with three components per model: mean, variance, and weight. The implementation uses 5 such models per pixel.
The models are stored and processed as either 16-bit or 32-bit floating point values, depending on the gmmDataType parameter. The weights, means, and variances are arrays of size img_w * img_h * 5 each. In addition, the number of valid models is stored for each pixel as either a 16-bit or 32-bit integer, again depending on the gmmDataType parameter; this is an array of size img_w * img_h. This value is interleaved with the weights, and the same RDF is used to move both the weights and the number of valid models. This avoids having to create a separate RDF for the number of valid models.
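The buffer sizing above can be summarized in a short sketch. The function name and the choice of numpy arrays are illustrative assumptions; the mapping from gmmDataType to the integer width follows the text.

```python
import numpy as np

NUM_MODELS = 5  # gaussian models per pixel

def gmm_buffer_shapes(img_w, img_h, gmm_dtype=np.float16):
    """Allocate the per-image GMM state arrays described above:
    weights, means, and variances are img_w * img_h * NUM_MODELS each;
    the number of valid models is img_w * img_h, with an integer width
    matching the floating point width."""
    int_dtype = np.int16 if gmm_dtype == np.float16 else np.int32
    weights = np.zeros((img_h, img_w, NUM_MODELS), dtype=gmm_dtype)
    means = np.zeros_like(weights)
    variances = np.zeros_like(weights)
    num_valid = np.zeros((img_h, img_w), dtype=int_dtype)
    return weights, means, variances, num_valid
```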
The gaussian mixture models are internal data structures and are not exposed to the user.
The kernel uses a tile size of 32x32 pixels. Each iteration of the kernel transfers a tile of pixels and gaussian mixture models to the VMEM. The kernel processes the tile and generates a tile of foreground mask and background image (if enabled).
The reference code uses fp32 for the gaussian mixture model components. This causes a small mismatch in the results when fp16 is used in the PVA implementation. For this reason, the tests allow for a small number of pixels to mismatch between the two implementations.
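The source of the fp16-vs-fp32 mismatch can be seen with a one-line check: a value that rounds cleanly in fp32 loses precision when stored in fp16, and these rounding errors accumulate over per-frame model updates.

```python
import numpy as np

# fp16 keeps roughly 11 bits of mantissa, so a constant like 0.1 is
# represented less accurately than in fp32. Accumulated across frames,
# this produces the small per-pixel mismatches the tests tolerate.
x16 = np.float16(0.1)
x32 = np.float32(0.1)
rounding_error = abs(float(x16) - float(x32))  # small but nonzero
```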
Parameters#
The following parameters will be copied to VMEM upon program command submission.
niter : Number of tiles to process in a full image.
varThreshold : Threshold for the Mahalanobis distance.
detectShadow : Whether to enable shadow detection.
shadowPixelValue : The pixel value used to mark detected shadow regions in the output mask.
learningRate : Learning rate for the gaussian mixture models.
getBGImage : Whether to output the estimated background image.
frameNum : Frame number (for debug).
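These parameters can be grouped as in the sketch below. The field names follow the documentation, but the struct itself and the tile_count() helper are illustrative assumptions, not the actual PVA API; tile_count() shows one plausible way to derive niter from the 32x32 tile size mentioned in the implementation notes.

```python
import math
from dataclasses import dataclass

TILE = 32  # kernel tile size (32x32 pixels)

@dataclass
class BgSubParams:
    niter: int             # number of tiles to process in a full image
    varThreshold: float    # threshold for the Mahalanobis distance
    detectShadow: bool     # enable shadow detection
    shadowPixelValue: int  # mask value for detected shadow regions
    learningRate: float    # learning rate for the gaussian models
    getBGImage: bool       # whether to output the background image
    frameNum: int          # frame number (for debug)

def tile_count(img_w: int, img_h: int) -> int:
    """Number of 32x32 tiles needed to cover the image."""
    return math.ceil(img_w / TILE) * math.ceil(img_h / TILE)
```

For a 1920x1080 image this gives 60 * 34 = 2040 tiles, since 1080 is not a multiple of 32.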
Dataflow Configuration#
The program uses a total of 9 RDFs. The 4 DRAM-to-VMEM RDFs are for the input image and the 3 gaussian mixture model components. The 5 VMEM-to-DRAM RDFs are for the foreground mask, the background image (if enabled), and the 3 gaussian mixture model components. The RDF that moves the background image only needs to be created when background output is enabled. However, background creation is a parameter of the submit function, not of the create function, where the RDFs must be set up. The create function therefore builds 2 versions of the command program - one with the RDF for the background image and one without it. The submit function then selects the correct program depending on whether the background image is enabled.
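The two-program scheme can be sketched as below. The function names, the RDF labels, and the dictionary structure are illustrative assumptions, not the real PVA create/submit API; only the RDF counts follow the text.

```python
# RDF sets as described above (labels are illustrative)
DRAM_TO_VMEM = ["input_image", "weights", "means", "variances"]
VMEM_TO_DRAM = ["fg_mask", "weights", "means", "variances"]

def create_programs():
    """Build both command-program variants at create time, since
    getBGImage is not known until submit."""
    without_bg = DRAM_TO_VMEM + VMEM_TO_DRAM               # 8 RDFs
    with_bg = DRAM_TO_VMEM + VMEM_TO_DRAM + ["bg_image"]   # 9 RDFs
    return {"with_bg": with_bg, "without_bg": without_bg}

def submit(programs, get_bg_image):
    """Pick the prebuilt variant once getBGImage is known."""
    return programs["with_bg" if get_bg_image else "without_bg"]
```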
Kernel Implementation#
Each lane of the VPU vector register processes one pixel. Single vector registers are used, so each iteration processes 16 or 8 pixels depending on fp16 or fp32 mode. At the beginning of the kernel loop, the 3 components of the 5 gaussian mixture models are loaded from VMEM into 15 single vector registers. The pixels and the number of valid models are also loaded into vector registers. Except for one instance, processing is done entirely out of these vector registers. At the end of the loop, the GMMs and the number of valid models are stored back to VMEM, along with the foreground mask and background image (if enabled).
The one instance in which VMEM loads and stores are needed in the middle of the loop is when only one of the 5 models must be updated selectively - the model that matched the pixel. To do this, the 15 GMM registers are written out to VMEM, and the vast and vlookup instructions are used to edit only one of the 5 models. Vast stands for vector address store. This instruction stores the value in each lane of a vector register into non-adjacent VMEM locations; in other words, each vector lane can be viewed as being written to a different row in its VMEM bank. This is in contrast to the vstore instruction, which stores the lanes of a vector register into adjacent VMEM locations. Similarly, with the vlookup instruction, each lane of a vector register can be loaded from a different row (non-adjacent addresses) in its VMEM bank, in contrast to the vload instruction, which loads each lane from adjacent VMEM locations.
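The selective-update pattern can be illustrated with numpy fancy indexing, which behaves like the per-lane gather (vlookup) and scatter (vast) described above. The array shapes and index values here are purely illustrative.

```python
import numpy as np

NUM_MODELS = 5
num_lanes = 8  # one pixel per vector lane

# per-pixel model means, one row of NUM_MODELS values per pixel
means = np.arange(num_lanes * NUM_MODELS,
                  dtype=np.float32).reshape(num_lanes, NUM_MODELS)
# index of the matched model for each lane (assumed for illustration)
matched = np.array([0, 3, 1, 4, 2, 0, 1, 3])

# "vlookup": each lane reads from a different (non-adjacent) row offset
gathered = means[np.arange(num_lanes), matched]

# "vast": each lane writes back to a different row offset, leaving the
# other 4 models of every pixel untouched
means[np.arange(num_lanes), matched] = gathered + 1.0
```

A plain vload/vstore would instead read or write adjacent locations, touching all 5 models of consecutive pixels rather than one chosen model per pixel.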
References#
Performance#
Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores.
For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.
| ImageSize | ShadowDetection | BGImageOut | GMMDataType | Execution Time | Submit Latency | Total Power |
|---|---|---|---|---|---|---|
| 1920x1080 | OFF | OFF | FP16 | 17.465ms | 0.064ms | 12.707W |
| 1920x1080 | OFF | OFF | FP32 | 33.596ms | 0.064ms | 12.606W |
| 1920x1080 | OFF | ON | FP16 | 19.928ms | 0.077ms | 12.705W |
| 1920x1080 | OFF | ON | FP32 | 38.255ms | 0.077ms | 12.606W |
| 1920x1080 | ON | OFF | FP16 | 25.077ms | 0.066ms | 12.506W |
| 1920x1080 | ON | OFF | FP32 | 47.611ms | 0.064ms | 12.022W |
| 1920x1080 | ON | ON | FP16 | 27.522ms | 0.081ms | 12.506W |
| 1920x1080 | ON | ON | FP32 | 52.255ms | 0.082ms | 12.406W |