Optimizing VPU Programs#
For this tutorial, we estimate the ideal performance (Speed of Light, SOL) of the convolution vector code and show optimization techniques that help reach SOL. This tutorial is broken down into the following parts:
Determining the Speed-of-Light (SOL) performance of the convolution loop. You are introduced to the VPU’s rules for operation slotting in its Very Long Instruction Word (VLIW) execution packets.
Examining the device code disassembly of the vector convolution code to see how the compiler’s schedule compares against the SOL calculation.
Applying loop unrolling and software pipelining to improve loop scheduling.
Computing the Speed-of-Light (SOL) Performance#
In this tutorial, we are optimizing the performance of a convolution kernel whose inputs are an 8-bit image and a 16-bit 5x5 filter kernel, and whose output is an 8-bit image. On a 32-way SIMD machine, computing 32 5x5-filtered output pixels requires 25 vector MAC operations. If we assume that each vector MAC takes one cycle to complete, and there are no other dependencies, then the SOL performance for this kernel is 25/32 = 0.78 cycles/pixel. On the VPU, this kernel is executed with 8-bit and 16-bit vector load ops, 16-bit vector MAC ops, and a 16-bit to 8-bit round/saturating store. We perform the MAC with the vmaddh instruction, which performs 16-bit x 16-bit multiplies into a 16-bit register (the 8-bit input is promoted to 16-bit during the load for use by the multiply instruction).
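For reference, a plain scalar implementation of the same filter makes the 25 MACs per output pixel explicit. This is only an illustrative sketch in standard C; the function and variable names, the 32-bit accumulator, and the rounding/saturation step are assumptions and do not correspond to the VPU intrinsic code shown later.
#include <stdint.h>

/* Scalar reference sketch of the 5x5 convolution: 8-bit input, 16-bit coefficients,
   8-bit output. Illustrative only; names and rounding behavior are assumed. */
static uint8_t sat_round_u8(int32_t v)
{
    v = (v + (1 << 7)) >> 8;                     /* example normalize/round step (assumed) */
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

void conv5x5_ref(const uint8_t *in, int stride, const int16_t coef[25],
                 uint8_t *out, int width)
{
    for (int x = 0; x < width; x++) {
        int32_t acc = 0;
        for (int ky = 0; ky < 5; ky++)
            for (int kx = 0; kx < 5; kx++)       /* 25 MACs per output pixel */
                acc += in[ky * stride + x + kx] * coef[ky * 5 + kx];
        out[x] = sat_round_u8(acc);
    }
}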
The VPU is a 7-way Very Long Instruction Word (VLIW) and Wide-SIMD vector processor. It can run up to 2 scalar operations, 2 vector operations, and 3 memory operations in one execution packet in a single cycle. In this section, we focus on getting a mental count of the operation types in the loop, which gives us a baseline from which to interpret the results of the loop optimization pragmas reviewed later in this tutorial. In the subsequent sections, we see how the instructions in our convolution loop body can fit into the available instruction slots of an execution packet to reach maximum performance.
Below is the vector “for” loop declaration, along with the appended chess_unroll_loop() and chess_prepare_for_pipelining() pragmas. These pragmas give the compiler additional information on the behavior of the loop, which may allow it to produce a more optimal schedule.
for (int i = 0; i < agens.niter; i++)
    chess_unroll_loop(2) /* Adjust this factor to see differences in loop performance */
    chess_prepare_for_pipelining
{
The following two load operations constitute 2 out of the 3 total memory ops for this loop.
dvdataInH = vuchar_dvshortx_load(in1);
vcoef = vshort_load_hs(in2);
The following two vector operations constitute the totality of vector ops for this loop. The vector arithmetic operations are multiply-accumulates between the input pixels and the filter coefficients.
dvacc.lo = vmaddh(dvdataInH.lo, vcoef, dvacc.lo, 0, pred_madd);
dvacc.hi = vmaddh(dvdataInH.hi, vcoef, dvacc.hi, 0, pred_madd);
The following store operation constitutes 1 out of the 3 total memory ops for this loop.
vstore_hb(dvacc, out, pred_store); // Note: vstore_hb supported only on Gen2 or above devices
The following two scalar operations constitute the totality of scalar ops for this loop.
pred_madd = mod_inc(pred_madd, agens.niter_acc_reset - 1);
s_ctrl_store = mod_inc_pred_z(s_ctrl_store, agens.niter_acc_reset - 1, pred_store);
}
The scalar operations compute the predicate variables pred_madd and s_ctrl_store. pred_madd is used to control when the accumulation register is cleared. s_ctrl_store is used to ensure that the accumulator output is stored only at the end of the filter kernel inner loop. Note that the VPU has specific scalar instructions for these predicate calculations (modinc_notp, modinc), so these types of common periodic predicate calculations can be done in one op.
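The general pattern these instructions implement is a counter that wraps with a fixed period and asserts a predicate once per period. The plain-C sketch below is only an approximation to illustrate that pattern; the exact semantics of mod_inc and mod_inc_pred_z (including how the counter and predicate are packed) are defined in the VPU programmer’s guide and are not reproduced here.
#include <stdbool.h>

/* Illustrative approximation of a periodic predicate update (not the actual
   intrinsic semantics). The predicate fires on the last iteration of each
   period, and the counter then wraps back to zero. */
typedef struct { int count; bool pred; } periodic_pred_t;

static periodic_pred_t periodic_update(periodic_pred_t s, int period_minus_1)
{
    s.pred  = (s.count == period_minus_1);   /* assert once per period       */
    s.count = s.pred ? 0 : s.count + 1;      /* modular increment of counter */
    return s;
}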
To summarize the above, the convolution loop body gives us 2 scalar operations, 3 memory operations, and 2 vector operations. In theory these could all fit into the same VLIW execution packet and execute in a single clock cycle, as our earlier SOL calculation suggests. However, with the loop in its current form (assuming no usage of the pipelining and unroll pragmas) this is not the case, as there is an instruction-to-instruction dependency coming from the LD -> MAC -> STORE sequence of operations. Our SOL calculation is a naive approach which assumes that we can fully hide register dependencies by loop unrolling and pipelining. Doing SOL calculations in this manner is often a good place to start, as it places an upper bound on the best possible performance. In this particular case, we expect to be able to fit the scalar, vector, and memory ops into the same execution packets, so we are not bounded by any particular operation type. In other cases, an application could be bounded by the availability of slots for either scalar, vector, or memory op types.
Examining Vector Code Disassembly#
We now examine the disassembled convolution vector code loop body from the VPU Programming Basics Tutorial to evaluate how its schedule compares against the SOL estimate. The disassembled loop body is as follows:
VLDH_S *A2++,V2 || DVLDBHU_P *A1++,V0:V1
MODINC R10,P2 || [P2] VMAddH_CA V0,V2,AC0|| [P2] VMAddH_CA V1,V2,AC1
MODINC_NOTP R10,R6,P3 || [P3] DVSTHB_P AC0:AC1,*A0++
You can access this disassembly by going to the following location in your build directory:
YOUR_BUILD_OUTPUT_DIR/samples/public/tutorial/vpu_programming_basics/vpu/vpu_programming_basics_dev_gen2.elf.lst
To locate this kernel loop in the disassembly file, search for the Hardware Loop “RPT” instruction whose loop body has the same vector instructions you used in your kernel’s source code. The body of the repeat loop (RPT) begins executing 2 cycles after the RPT instruction is issued. The immediate field of RPT encodes the PC difference between the second delay slot (just before entering the loop) and the last packet of the loop.
This convolution loop’s schedule from the VPU Programming Basics Tutorial takes 3 cycles to produce 32 intermediate output pixels. It takes 75 cycles to produce the full filter output for 32 pixels (3 cycles per iteration x 25 iterations, or roughly 2.34 cycles/pixel versus the 0.78 cycles/pixel SOL). This means that our vector code performance is three times slower than the calculated SOL performance. This is because of the dependency between the load, MAC, and store operations mentioned earlier. Currently we are using at most 3 out of the 7 available instruction slots in each execution packet. In order to use more of the instruction slots in an execution packet, we need to pipeline across iterations. At this stage, the compiler does not have any additional information about this loop that it can use to improve its scheduling. We review the concepts of loop unrolling and software pipelining, which allow us to provide more information to the compiler about this loop’s behavior, which then enables it to produce a better schedule.
Loop Optimization Techniques#
In this section, we review two techniques that can be applied to the convolution loop to help improve its scheduling by the compiler:
Software Pipelining
Loop Unrolling
Software Pipelining#
Software pipelining is a compiler optimization technique that runs portions of a loop’s prior, current, and future iterations in the same loop body to help exploit more instruction-level parallelism. For example, this means that in one VPU execution packet, the following things could be happening with the convolution code:
The three memory units could be loading the coefficient and data for iteration “n+1”, and storing the filter output from iteration “n-1.”
The two vector units could be doing the multiply accumulate operations for iteration “n.”
The two scalar units could be computing the predicate values for the store of iteration “n.”
If we could achieve this specific instruction packing in our scheduled loop, we would be at the SOL performance of 1 cycle per 32 intermediate output pixels. By issuing a single instruction packet that contains instructions for the previous, current, and next iterations, we are able to maximize usage of the 7 available instruction slots.
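The sketch below expresses this overlap in plain C pseudocode to make the prologue / steady-state / epilogue structure concrete. It is only an illustration of how iterations are rotated against one another; the helper functions are placeholders and this is not the code the VPU toolchain generates.
#include <stdio.h>

#define N 8   /* illustrative iteration count */

/* Placeholder stages; in the real kernel these are the vector loads, vmaddh ops, and the store. */
static void load_inputs(int n)  { printf("load  %d\n", n); }
static void mac(int n)          { printf("mac   %d\n", n); }
static void store_output(int n) { printf("store %d\n", n); }

int main(void)
{
    /* Prologue: fill the pipeline. */
    load_inputs(0);
    load_inputs(1);
    mac(0);

    /* Steady state: each trip overlaps three iterations, one per functional-unit group. */
    for (int n = 1; n < N - 1; n++) {
        load_inputs(n + 1);   /* memory slots: loads for iteration n+1 */
        mac(n);               /* vector slots: MACs for iteration n    */
        store_output(n - 1);  /* memory slot:  store for iteration n-1 */
    }

    /* Epilogue: drain the pipeline. */
    mac(N - 1);
    store_output(N - 2);
    store_output(N - 1);
    return 0;
}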
Loop Unrolling#
When unrolling a loop, you replicate the code of the inner loop body multiple times, while reducing the overall iteration count by that same factor. For example, if you unroll a 100-iteration loop by a factor of 2, the loop runs a total of 50 iterations, and the inner loop body executes iterations “n” and “n+1.” By doing this, we can generate denser VLIW instruction packets, meaning we dispatch more instructions per cycle. For example, loop unrolling would allow us to run the MAC operations for iteration n while performing the load of the input pixels for iteration n+1. Note that there is a tradeoff with the use of loop unrolling: the larger the unrolling factor, the larger the resulting code size in the generated executable. There is an additional limiting factor in that you may run into register pressure, depending on how large your loop body is, the size of your unrolling factor, and the size of your architecture’s register file. You must also be cautious not to unroll by a factor that is incompatible with the number of iterations the loop actually runs. For example, if you have a loop which is supposed to execute 5 times and you unroll by a factor of 2, you may end up executing the loop 6 times.
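As a sketch in plain C, a manual unroll by 2 with a remainder loop looks like the following (the function and variable names are illustrative, not taken from the tutorial source):
#include <stdint.h>

/* Manual unroll-by-2 sketch. The trailing remainder loop handles iteration
   counts that are not a multiple of the unroll factor, avoiding the
   over-execution problem described above. */
void scale_buffer(int16_t *data, int16_t k, int count)
{
    int i = 0;
    for (; i + 1 < count; i += 2) {   /* unrolled body: iterations i and i+1 */
        data[i]     = (int16_t)(data[i]     * k);
        data[i + 1] = (int16_t)(data[i + 1] * k);
    }
    for (; i < count; i++) {          /* remainder: at most one leftover iteration */
        data[i] = (int16_t)(data[i] * k);
    }
}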
On VPU, the user can use the following pragma to instruct the compiler to unroll the loop:
chess_unroll_loop(n)
This informs the compiler that it can attempt to unroll the loop by a factor of “n.” Selecting the right unrolling factor for a loop depends on how much vacancy there is in a single iteration due to load-to-use latency (the minimum number of cycles required before the result of a load operation is available in registers) and/or vector math operation latency (which is 6 cycles on Gen2 VPU), so we need to experiment with different unrolling factors to find one that works. Refer to Orin_PVA_VPU_Programmers_Guide.pdf in the SDK docs for additional information on pipelining/unrolling strategies.
Compiler Pragmas for Pipelining#
Often, the best performance is achieved when loop unrolling and software pipelining are done together. We enable these types of optimizations via the following two pragmas:
chess_unroll_loop(UNROLL_FACTOR) tells the compiler to replicate the loop body “UNROLL_FACTOR” times and adjusts the loop iteration count accordingly.
chess_prepare_for_pipelining() tells the compiler to fold and schedule the original loop content into multiple iterations.
SW Pipelined Loop with unroll factor of 2#
Below is the vector “for” loop declaration, along with the appended chess_unroll_loop() and chess_prepare_for_pipelining() pragmas.
for (int i = 0; i < agens.niter; i++)
chess_unroll_loop(2) /* Adjust this factor to see differences in loop performance */
chess_prepare_for_pipelining
{
With these pragmas, we are instructing the compiler to unroll the loop by a factor of 2 and apply software pipelining. Let’s examine the disassembly output for these settings:
Prologue (Computes iterations 0 & 1, and loads inputs for iterations 2 & 3)
ADDI R2,#-2,R2 || MOVSP R0,P3 || DVLDBHU_P *A0++,V0:V1|| VLDH_S *A1++,V4
DVLDBHU_P *A0++,V2:V3|| VLDH_S *A1++,V5
[P3] VMAddH_CA V0,V4,AC0|| [P3] VMAddH_CA V1,V4,AC1|| DVLDBHU_P *A0++,V0:V1|| VLDH_S *A1++,V4
RPT R2,#15
MODINC R4,P3 || NOP || NOP || NOP || NOP || NOP
MODINC_NOTP R4,R5,P4|| MODINC R4,P3 || [P3] VMAddH_CA V3,V5,AC1|| [P3] VMAddH_CA V2,V5,AC0|| [P2] DVSTHB_P AC0:AC1,*A2++|| DVLDBHU_P *A0++,V2:V3|| VLDH_S *A1++,V5
Steady State Loop Body (Computes iterations 2 to n - 5)
MODINC_NOTP R4,R5,P2|| [P3] VMAddH_CA V0,V4,AC0|| [P3] VMAddH_CA V1,V4,AC1|| DVLDBHU_P *A0++,V0:V1|| VLDH_S *A1++,V4|| [P4] DVSTHB_P AC0:AC1,*A2++
[P2] DVSTHB_P AC0:AC1,*A2++
MODINC R4,P3
MODINC R4,P3 || MODINC_NOTP R4,R5,P4|| [P3] VMAddH_CA V2,V5,AC0|| [P3] VMAddH_CA V3,V5,AC1|| DVLDBHU_P *A0++,V2:V3|| VLDH_S *A1++,V5
Epilogue (Completes computation for remaining iterations)
MODINC R4,P3 || MODINC_NOTP R4,R5,P2|| [P3] VMAddH_CA V0,V4,AC0|| [P3] VMAddH_CA V1,V4,AC1|| [P4] DVSTHB_P AC0:AC1,*A2++
JR R15
MODINC_NOTP R4,R5,P4|| [P3] VMAddH_CA V2,V5,AC0|| [P3] VMAddH_CA V3,V5,AC1|| [P2] DVSTHB_P AC0:AC1,*A2++
[P4] DVSTHB_P AC0:AC1,*A2++
RPT R2,#14
MOVSP R0,P3
NOP
MODINC_NOTP R4,R5,P4|| VLDH_S *A1++,V4|| DVLDBHU_P *A0++,V0:V1
VLDH_S *A1++,V5 || DVLDBHU_P *A0++,V2:V3
MODINC R4,P3 || [P3] VMAddH_CA V0,V4,AC0|| [P3] VMAddH_CA V1,V4,AC1
MODINC R4,P3 || MODINC_NOTP R4,R5,P2|| [P3] VMAddH_CA V2,V5,AC0|| [P3] VMAddH_CA V3,V5,AC1|| [P2] DVSTHB_P AC0:AC1,*A2++
[P4] DVSTHB_P AC0:AC1,*A2++
In software pipelined loops, there is a prologue and epilogue that surround the steady state loop body. The prologue primes the pipeline and the epilogue winds it down. In cases where the loop iteration count is not known at compile time, the prologue/epilogue contains many branch statements. These branches can be avoided by specifying chess_loop_range(). Most of the loop iterations are run in the steady state loop body. In this case the steady-state performance is 4 cycles for every 64 intermediate output pixels. This is 2x slower than the SOL, and does not hide the 6-cycle vector math operation latency. In two of the steady state cycles we are executing 6 instructions per execution packet. In the other two cycles of the steady state loop we are executing only one instruction per packet, so there is still more room for instruction-level parallelism as we are not utilizing all of the available instruction slots.
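For illustration, chess_loop_range gives the compiler a minimum (and maximum) trip count so it can drop the iteration-count guard branches around the prologue and epilogue. The bounds below are placeholders, and the exact argument form should be confirmed against Orin_PVA_VPU_Programmers_Guide.pdf:
for (int i = 0; i < agens.niter; i++)
    chess_loop_range(4, 128)  /* hypothetical bounds; use the real trip-count range of your loop */
    chess_unroll_loop(2)
    chess_prepare_for_pipelining
{
    /* ... same loop body as above ... */
}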
SW Pipelined Loop with unroll factor of 8#
We can try to improve the parallelism by increasing the unrolling factor to 8. Below is the disassembly output for this setting:
Prologue (Computes iterations 0 & 1, and loads inputs for iterations 2 to 7)
ADDI R2,#-1,R2 || MOVSP R0,P3 || VLDH_S *A1++,V22|| DVLDBHU_P *A0++,V12:V13
VLDH_S *A1++,V20 || DVLDBHU_P *A0++,V8:V9
VLDH_S *A1++,V18 || DVLDBHU_P *A0++,V4:V5
VLDH_S *A1++,V16 || DVLDBHU_P *A0++,V0:V1
VLDH_S *A1++,V17 || DVLDBHU_P *A0++,V2:V3
RPT R2,#56 || NOP || NOP || NOP || VLDH_S *A1++,V19|| DVLDBHU_P *A0++,V6:V7
MODINC R4,P3 || NOP || [P3] VMAddH_CA V12,V22,AC0|| [P3] VMAddH_CA V13,V22,AC1|| DVLDBHU_P *A0++,V10:V11|| VLDH_S *A1++,V21|| NOP
MODINC_NOTP R4,R5,P4|| MODINC R4,P3 || [P3] VMAddH_CA V9,V20,AC1|| [P3] VMAddH_CA V8,V20,AC0|| [P2] DVSTHB_P AC0:AC1,*A2++|| DVLDBHU_P *A0++,V14:V15|| VLDH_S *A1++,V23
Steady State Loop Body (Computes iterations 2 to n - 15)
MODINC R4,P3 || MODINC_NOTP R4,R5,P5|| [P3] VMAddH_CA V4,V18,AC0|| [P3] VMAddH_CA V5,V18,AC1|| [P4] DVSTHB_P AC0:AC1,*A2++|| DVLDBHU_P *A0++,V12:V13|| VLDH_S *A1++,V22
MODINC R4,P3 || MODINC_NOTP R4,R5,P6|| [P3] VMAddH_CA V0,V16,AC0|| [P3] VMAddH_CA V1,V16,AC1|| [P5] DVSTHB_P AC0:AC1,*A2++|| DVLDBHU_P *A0++,V8:V9|| VLDH_S *A1++,V20
MODINC R4,P3 || MODINC_NOTP R4,R5,P7|| [P3] VMAddH_CA V2,V17,AC0|| [P3] VMAddH_CA V3,V17,AC1|| [P6] DVSTHB_P AC0:AC1,*A2++|| DVLDBHU_P *A0++,V4:V5|| VLDH_S *A1++,V18
MODINC R4,P3 || MODINC_NOTP R4,R5,P8|| [P3] VMAddH_CA V6,V19,AC0|| [P3] VMAddH_CA V7,V19,AC1|| [P7] DVSTHB_P AC0:AC1,*A2++|| DVLDBHU_P *A0++,V0:V1|| VLDH_S *A1++,V16
MODINC R4,P3 || MODINC_NOTP R4,R5,P9|| [P3] VMAddH_CA V10,V21,AC0|| [P3] VMAddH_CA V11,V21,AC1|| [P8] DVSTHB_P AC0:AC1,*A2++|| DVLDBHU_P *A0++,V2:V3|| VLDH_S *A1++,V17
MODINC_NOTP R4,R5,P10|| MODINC R4,P3|| [P3] VMAddH_CA V14,V23,AC0|| [P3] VMAddH_CA V15,V23,AC1|| [P9] DVSTHB_P AC0:AC1,*A2++|| DVLDBHU_P *A0++,V6:V7|| VLDH_S *A1++,V19
MODINC R4,P3 || MODINC_NOTP R4,R5,P2|| [P3] VMAddH_CA V12,V22,AC0|| [P3] VMAddH_CA V13,V22,AC1|| [P10] DVSTHB_P AC0:AC1,*A2++|| DVLDBHU_P *A0++,V10:V11|| VLDH_S *A1++,V21
MODINC R4,P3 || MODINC_NOTP R4,R5,P4|| [P3] VMAddH_CA V8,V20,AC0|| [P3] VMAddH_CA V9,V20,AC1|| DVLDBHU_P *A0++,V14:V15|| VLDH_S *A1++,V23|| [P2] DVSTHB_P AC0:AC1,*A2++
Epilogue (Completes computation for remaining iterations)
MODINC R4,P3 || MODINC_NOTP R4,R5,P5|| [P3] VMAddH_CA V4,V18,AC0|| [P3] VMAddH_CA V5,V18,AC1|| [P4] DVSTHB_P AC0:AC1,*A2++
MODINC R4,P3 || MODINC_NOTP R4,R5,P6|| [P3] VMAddH_CA V0,V16,AC0|| [P3] VMAddH_CA V1,V16,AC1|| [P5] DVSTHB_P AC0:AC1,*A2++
MODINC R4,P3 || MODINC_NOTP R4,R5,P7|| [P3] VMAddH_CA V2,V17,AC0|| [P3] VMAddH_CA V3,V17,AC1|| [P6] DVSTHB_P AC0:AC1,*A2++
MODINC R4,P3 || MODINC_NOTP R4,R5,P8|| [P3] VMAddH_CA V6,V19,AC0|| [P3] VMAddH_CA V7,V19,AC1|| [P7] DVSTHB_P AC0:AC1,*A2++
MODINC R4,P3 || MODINC_NOTP R4,R5,P9|| [P3] VMAddH_CA V10,V21,AC0|| [P3] VMAddH_CA V11,V21,AC1|| [P8] DVSTHB_P AC0:AC1,*A2++
JR R15
MODINC_NOTP R4,R5,P10|| [P3] VMAddH_CA V14,V23,AC0|| [P3] VMAddH_CA V15,V23,AC1|| [P9] DVSTHB_P AC0:AC1,*A2++
NOP || [P10] DVSTHB_P AC0:AC1,*A2++
MOVSP R0,P3 || MODINC_NOTP R4,R5,P4|| VLDH_S *A1++,V22|| DVLDBHU_P *A0++,V12:V13
MODINC_NOTP R4,R5,P5|| VLDH_S *A1++,V20|| DVLDBHU_P *A0++,V8:V9
MODINC_NOTP R4,R5,P6|| VLDH_S *A1++,V18|| DVLDBHU_P *A0++,V4:V5
MODINC_NOTP R4,R5,P7|| VLDH_S *A1++,V16|| DVLDBHU_P *A0++,V0:V1
MODINC_NOTP R4,R5,P8|| VLDH_S *A1++,V17|| DVLDBHU_P *A0++,V2:V3
MODINC_NOTP R4,R5,P9|| VLDH_S *A1++,V19|| DVLDBHU_P *A0++,V6:V7
MODINC R4,P3 || [P3] VMAddH_CA V12,V22,AC0|| [P3] VMAddH_CA V13,V22,AC1|| VLDH_S *A1++,V21|| DVLDBHU_P *A0++,V10:V11
MODINC R4,P3 || [P3] VMAddH_CA V8,V20,AC0|| [P3] VMAddH_CA V9,V20,AC1|| [P2] DVSTHB_P AC0:AC1,*A2++|| VLDH_S *A1++,V23|| DVLDBHU_P *A0++,V14:V15
MODINC R4,P3 || [P3] VMAddH_CA V4,V18,AC0|| [P3] VMAddH_CA V5,V18,AC1|| [P4] DVSTHB_P AC0:AC1,*A2++
MODINC R4,P3 || [P3] VMAddH_CA V0,V16,AC0|| [P3] VMAddH_CA V1,V16,AC1|| [P5] DVSTHB_P AC0:AC1,*A2++
MODINC R4,P3 || [P3] VMAddH_CA V2,V17,AC0|| [P3] VMAddH_CA V3,V17,AC1|| [P6] DVSTHB_P AC0:AC1,*A2++
MODINC R4,P3 || [P3] VMAddH_CA V6,V19,AC0|| [P3] VMAddH_CA V7,V19,AC1|| [P7] DVSTHB_P AC0:AC1,*A2++
JR R15 || MODINC R4,P3 || [P3] VMAddH_CA V10,V21,AC0|| [P3] VMAddH_CA V11,V21,AC1|| [P8] DVSTHB_P AC0:AC1,*A2++
MODINC_NOTP R4,R5,P10|| [P3] VMAddH_CA V14,V23,AC0|| [P3] VMAddH_CA V15,V23,AC1|| [P9] DVSTHB_P AC0:AC1,*A2++
[P10] DVSTHB_P AC0:AC1,*A2++
In this case, the steady state loop performance is 8 cycles for 256 intermediate output pixels, or 1 cycle for every 32 intermediate output pixels, which is the SOL performance we computed earlier. Upon examination of the instruction packing in the steady state loop, we can see that each packet is executing 7 instructions in parallel, so we are fully utilizing the 7-way VLIW architecture and effectively hiding the load and vector math operation latencies.