BatchSVD#
Overview#
Singular Value Decomposition (SVD) is a general-purpose tool in numerical linear algebra, with wide-ranging applications in data processing, particularly data reduction and dimensionality reduction.
At its core, SVD decomposes a given matrix \(A\) into three matrices \(U\), \(\Sigma\), and \(V^T\). Mathematically, this can be expressed as:

\(A = U \Sigma V^T\)

where:

\(U\) is an orthogonal matrix whose columns are the left singular vectors.

\(\Sigma\) is a diagonal matrix containing the singular values.

\(V^T\) is the transpose of an orthogonal matrix whose columns are the right singular vectors.
The singular values in \(\Sigma\) are arranged in descending order, providing a measure of the importance or weight of each corresponding singular vector.
The Batch SVD operator performs a one-sided Jacobi SVD on a batch of matrices simultaneously, using a 32-bit float datatype. It is designed for the VPU to process one matrix SVD per vector lane. Since the VPU vector width for float is 16, the batch size is also 16 to fill the vector lanes. The total number of matrices can be divided into multiple batches, with the VPU processing one batch at a time.
For details of one-sided Jacobi SVD, please refer to Algorithm 4.1, p32 of [1].
Limitations#
There are two prerequisites for the Batch SVD operator to work.
First, the layout of input and output matrices must be in HWC format, with each element of the matrices interleaved along the C dimension. The HW dimensions correspond to matrix height and width, and all matrices must have the same height and width. The operator does not require C to be a multiple of 16; the VPU treats any redundant data in VMEM (either zeros, if it is the first batch, or matrix data from a previous batch that has not been overwritten) as part of a full batch without affecting the valid output.
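The HWC interleaving described above can be illustrated with a small NumPy sketch (an illustration of the memory layout only, not the operator's API; the helper names `pack_hwc` and `unpack_hwc` are hypothetical):

```python
import numpy as np

def pack_hwc(matrices):
    """Interleave a list of equally sized (H, W) matrices into HWC layout.

    Element (h, w) of matrix c ends up at flat offset (h * W + w) * C + c,
    so the (h, w) elements of all C matrices are adjacent in memory.
    """
    # Stack to (C, H, W), then move the matrix index to the innermost axis.
    return np.ascontiguousarray(np.transpose(np.stack(matrices), (1, 2, 0)))

def unpack_hwc(hwc):
    """Recover the list of (H, W) matrices from an HWC buffer."""
    return [hwc[:, :, c] for c in range(hwc.shape[2])]

mats = [np.arange(6, dtype=np.float32).reshape(3, 2) + c for c in range(4)]
hwc = pack_hwc(mats)          # shape (3, 2, 4): H=3, W=2, C=4 matrices
H, W, C = hwc.shape
flat = hwc.ravel()
# Element (h, w) of matrix c sits at flat offset (h * W + w) * C + c.
assert flat[(1 * W + 0) * C + 2] == mats[2][1, 0]
```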
Second, this operator computes an economy SVD, which requires that the input matrix height (m) is greater than or equal to its width (n). For each output matrix:
The left singular vectors (U) have dimensions \(m \times n\), matching the input matrix’s size.
The right singular vectors (V) form an \(n \times n\) matrix.
The singular values (S, representing \(\Sigma\)) are stored as an n-element vector, containing up to n valid values.
Note that unlike MATLAB where S is returned as a full diagonal matrix, here S is just a vector to conserve memory. All output matrices (U, S, and V) are in HWC format.
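The economy-SVD output convention above matches NumPy's `full_matrices=False` mode, which can serve as a quick sanity check of the expected shapes (an illustrative sketch, not the operator's code):

```python
import numpy as np

# Economy SVD shapes for an m x n input with m >= n (here m=5, n=3).
m, n = 5, 3
A = np.random.default_rng(0).standard_normal((m, n)).astype(np.float32)

# NumPy's full_matrices=False matches the operator's economy convention:
# U is m x n, S is a length-n vector (not a diagonal matrix), V is n x n.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
assert U.shape == (m, n)
assert S.shape == (n,)
assert Vt.shape == (n, n)
# Singular values come back in descending order.
assert np.all(S[:-1] >= S[1:])
# A is reconstructed as U @ diag(S) @ Vt.
assert np.allclose(A, U @ np.diag(S) @ Vt, atol=1e-5)
```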
Implementation Details#
The one-sided Jacobi SVD algorithm performs column-cyclic Jacobi rotations on pairs of ith and jth columns of the input matrix A to form the matrix U. The Jacobi rotations are accumulated to form the matrix V. For each column pair, the algorithm checks whether the columns are orthogonal based on a convergence criterion. The maximum convergence variable over all column pairs in a sweep is tracked to check whether all matrices in a batch have converged. To prevent infinite loops, a maximum loop count is set, and the algorithm terminates either after reaching this count or when all matrices in a batch have converged. For the ith and jth columns \(a_i\) and \(a_j\), the convergence variable can be expressed as:

\(\mathrm{conv} = \frac{|\gamma|}{\sqrt{\alpha \beta}}, \quad \text{where } \alpha = a_i^T a_i,\; \beta = a_j^T a_j,\; \gamma = a_i^T a_j\)
If one column pair in a matrix converges (becomes orthogonal), the algorithm skips updating the U and V matrices for that matrix, regardless of the updates of other matrices in the same batch. For corner cases, such as nearly singular matrices, the above convergence check may not be sufficient. Additional checks are applied to avoid wasteful iterations: If \(\gamma\) is very close to 0, the two columns are nearly orthogonal, and if \(\alpha\) or \(\beta\) is very close to 0, one of the columns has become negligible.
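A single rotation step, including the convergence variable and the corner-case guards described above, can be sketched as follows (a NumPy reference sketch of Algorithm 4.1 in [1], not the VPU kernel; the function name `rotate_pair` and the tolerance value are illustrative):

```python
import numpy as np

def rotate_pair(a_i, a_j, tol=1e-7):
    """One Jacobi rotation on a column pair.

    Returns the rotated columns, the (c, s) rotation to accumulate into V,
    and the convergence variable |gamma| / sqrt(alpha * beta).
    """
    alpha = a_i @ a_i              # squared norm of column i
    beta = a_j @ a_j               # squared norm of column j
    gamma = a_i @ a_j              # inner product of the two columns
    if alpha < tol or beta < tol:  # one column has become negligible
        return a_i, a_j, (1.0, 0.0), 0.0
    conv = abs(gamma) / np.sqrt(alpha * beta)
    if conv < tol:                 # columns already (nearly) orthogonal
        return a_i, a_j, (1.0, 0.0), conv
    zeta = (beta - alpha) / (2.0 * gamma)
    if zeta == 0.0:
        t = 1.0                    # 45-degree rotation when norms are equal
    else:
        t = np.sign(zeta) / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
    c = 1.0 / np.sqrt(1.0 + t * t)
    s = c * t
    return c * a_i - s * a_j, s * a_i + c * a_j, (c, s), conv
```

After the rotation, the two returned columns are orthogonal to machine precision, which is exactly the condition the convergence variable measures.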
The singular values are the L2 norms of the columns of the intermediate matrix U obtained from the Jacobi rotations. The left singular vectors are the normalized columns of this intermediate matrix U. These singular values, along with the final matrices U and V, are arranged in descending order. If any singular value is zero, normalizing the corresponding column divides by zero and produces NaN values in matrix U.
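Putting the rotations, convergence check, column norms, normalization, and sorting together, the whole procedure can be sketched as a compact NumPy reference (float64, one matrix at a time rather than a vectorized batch of 16; the function name `jacobi_svd` is illustrative):

```python
import numpy as np

def jacobi_svd(A, tol=1e-7, max_sweeps=30):
    """One-sided Jacobi economy SVD of an m x n matrix A (m >= n).

    U accumulates the column rotations, V the right singular vectors;
    S holds the column norms of the rotated U.
    """
    U = A.astype(np.float64).copy()
    n = A.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        max_conv = 0.0
        for i in range(n - 1):
            for j in range(i + 1, n):
                alpha = U[:, i] @ U[:, i]
                beta = U[:, j] @ U[:, j]
                gamma = U[:, i] @ U[:, j]
                if alpha < tol or beta < tol:
                    continue               # a column has become negligible
                conv = abs(gamma) / np.sqrt(alpha * beta)
                max_conv = max(max_conv, conv)
                if conv < tol:
                    continue               # this pair is already orthogonal
                zeta = (beta - alpha) / (2.0 * gamma)
                t = 1.0 if zeta == 0.0 else \
                    np.sign(zeta) / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                R = np.array([[c, s], [-s, c]])
                U[:, [i, j]] = U[:, [i, j]] @ R
                V[:, [i, j]] = V[:, [i, j]] @ R
        if max_conv < tol:
            break                          # every pair has converged
    S = np.linalg.norm(U, axis=0)          # singular values = column norms
    order = np.argsort(S)[::-1]            # sort in descending order
    S, U, V = S[order], U[:, order], V[:, order]
    U = U / S                              # normalize columns (NaN if S[k] == 0)
    return U, S, V
```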
The VPU provides a variety of floating-point vector instructions, including multiply, add/subtract, multiply-add/subtract, reciprocal, square root, and more. Two major functions, calculate_alpha_beta_gamma and update_matrix, are executed column-wise and run in \(O(n^2)\) time. Within these functions, the loop iterations correspond to the matrix height. For small-sized matrices, where the loop iterations are limited to 2, 3, or 4, a specialized kernel function is called to optimize performance. For instance, in calculate_alpha_beta_gamma, if the matrix height is 2, 3, or 4, the kernel loop is fully unrolled, resulting in shorter execution cycles compared to the general kernel loop, which is unrolled by 2. Note that chess_unroll_loop(*) requires a manifest loop count.
The implementation uses 32-bit floating-point arithmetic, which inherently limits the precision of calculations. This limitation means that convergence criteria set below 1e-7 may not be met until the maximum number of iterations is reached. Additionally, when dealing with large input values, the precision can be further compromised. To mitigate these issues, it is recommended to scale or normalize the input matrix, both of which can help improve the accuracy and stability of the computations.
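The recommended pre-scaling can be sketched as follows (an illustration using NumPy's SVD as a stand-in for the operator: dividing by the largest magnitude keeps the float32 arithmetic in a well-conditioned range, and the original singular values are recovered by multiplying by the scale factor):

```python
import numpy as np

rng = np.random.default_rng(3)
A = (rng.standard_normal((6, 4)) * 1e6).astype(np.float32)  # large-valued input

# Scale the input into a well-conditioned float32 range before the SVD.
scale = np.abs(A).max()
S_scaled = np.linalg.svd((A / scale).astype(np.float32), compute_uv=False)

# Singular values of the original A; U and V are unaffected by the scaling.
S = S_scaled * scale
assert np.allclose(S, np.linalg.svd(A.astype(np.float64), compute_uv=False),
                   rtol=1e-4)
```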
A double-buffer scheme is employed for the input matrices: the VPU processes the Ping batch while the DMA transfers the Pong batch to VMEM. While one batch of data is being processed, the next batch is being transferred, ensuring continuous operation without stalls. The output matrices U, S, and V are not double buffered, but the latency of the DMA transfer is still hidden. This is because U, S, and V are sorted in descending order of singular values only in the final step: while the Jacobi rotations run on the next batch of matrices, the DMA can write out U, S, and V, as long as that batch's sorting step does not begin before the DMA write has finished.
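The Ping/Pong alternation can be modeled with a minimal sketch (a pure-Python illustration of buffer ordering only; it shows neither real concurrency nor the DMA API, and the function name `run_batches` is hypothetical):

```python
# Ping/pong double-buffering sketch: while the "VPU" processes the batch in
# one buffer, the "DMA" fills the other.
def run_batches(batches, process):
    vmem = [None, None]                # Ping and Pong VMEM buffers
    vmem[0] = batches[0]               # DMA prefetches the first batch
    results = []
    for k in range(len(batches)):
        if k + 1 < len(batches):
            vmem[(k + 1) % 2] = batches[k + 1]  # DMA fills the other buffer
        results.append(process(vmem[k % 2]))    # VPU processes current batch
    return results

out = run_batches([[1, 2], [3, 4], [5, 6]], process=sum)
assert out == [3, 7, 11]
```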
Dataflows#
The SequenceDataFlow is used for transferring input and output matrices.
Since the VMEM destination of the input matrices employs double buffering, special care is taken in constructing the dataflow, depending on whether the number of batches is even or odd. When the number of input matrices is not a multiple of the batch size, the tail tile has a width (C extent) equal to the remainder of the matrix count divided by the batch size.
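The batch and tail-tile bookkeeping reduces to a division and a remainder (a small sketch; the helper name `batch_layout` is hypothetical):

```python
# With batch size 16 (one matrix per float vector lane), a matrix count that
# is not a multiple of 16 leaves a tail batch holding the remainder.
BATCH = 16  # VPU float vector width

def batch_layout(num_matrices):
    full = num_matrices // BATCH   # number of full batches
    tail = num_matrices % BATCH    # C extent of the tail tile (0 = no tail)
    return full, tail

assert batch_layout(64) == (4, 0)  # even multiple: no tail tile
assert batch_layout(70) == (4, 6)  # tail tile carries the last 6 matrices
```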
Another option is to use RasterDataFlow to read the input matrices. RasterDataFlow treats the matrices as a 1D vector of tiles, which has the advantage of handling the tail tile and the VMEM double buffer implicitly. However, since the tiles have no halo (unlike image inputs), this case is equally well suited to SequenceDataFlow. To avoid RasterDataFlow's maximum tile count constraint (256 in one dimension) and to retain flexible control of the VMEM destination buffer (for example, if you want to cascade further operations with SVD in one memory pass), SequenceDataFlow is chosen in this scenario.
The three output matrices, U, S, and V, are linked with TransferModeType set to CONTINUOUS and share a single SequenceDataFlow trigger. The output dataflow is updated for every tile/batch on the VPU side.
Reference check#
PVA and CPU may exhibit minor floating-point differences after a single iteration, and these differences can accumulate over multiple iterations, potentially leading to discrepancies. If a singular value is very close to zero, the corresponding columns of U and V are not unique. Additionally, normalizing U could divide one small value by another small value, causing a large difference between the PVA output and the C reference output. Therefore, the reference check for U and V only considers columns whose singular value is larger than a threshold (1e-3) or whose ratio to the largest singular value is larger than 1e-7. Sometimes, the PVA output and the C reference may exhibit U or V columns with opposite signs; if the element magnitudes of a column are close but every element has the opposite sign, the column is still considered to pass the reference check. To compare the S outputs of the PVA and the C reference code, either the ratio of the two singular values is compared against 1 when the singular values are large, or the absolute difference is checked against a threshold when the singular values are small.
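The comparison rules can be sketched as follows (an illustrative sketch, not the actual test harness; the function names and the exact tolerance values beyond the 1e-3 threshold mentioned above are assumptions):

```python
import numpy as np

def check_singular_vectors(col_pva, col_ref, atol=1e-4):
    """Sign-flip tolerant column comparison: a column whose elements all
    have the opposite sign of the reference still passes."""
    return (np.allclose(col_pva, col_ref, atol=atol)
            or np.allclose(col_pva, -col_ref, atol=atol))

def check_singular_value(s_pva, s_ref, small=1e-3, rtol=1e-4, atol=1e-5):
    """Ratio check for large singular values, absolute check for small ones."""
    if s_ref > small:
        return abs(s_pva / s_ref - 1.0) < rtol
    return abs(s_pva - s_ref) < atol
```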
Performance#
The performance of the Batch SVD operator is data dependent.
Matrix (HWC) | DataType | Execution Time | Submit Latency | Total Power
---|---|---|---|---
3x3x1024 | F32 | 0.228ms | 0.017ms | 9.78W
4x4x1024 | F32 | 0.479ms | 0.018ms | 9.773W
6x6x1024 | F32 | 1.479ms | 0.018ms | 9.673W
12x12x1024 | F32 | 10.154ms | 0.028ms | 9.657W
16x16x1024 | F32 | 22.857ms | 0.038ms | 10.423W
For detailed information on interpreting the performance table above and understanding the benchmarking setup, see Performance Benchmark.
Reference#
[1] J. Demmel and K. Veselić, “Jacobi’s method is more accurate than QR”, LAPACK Working Note 15, 1989. https://www.netlib.org/lapack/lawnspdf/lawn15.pdf