VPU utilities#
Helper functions to write VPU code.
Using these APIs ensures compatibility between native and other build modes.
Macros#
- CUPVA_ALIGNED
Helper to declare aligned member variable.
- CUPVA_CIRCULAR_BUFFER_LENGTH
Calculate the VMEM buffer length (in pixels) of a circular layout for a given configuration.
- CUPVA_DOUBLE_BUFFER_LENGTH
Calculate the VMEM buffer length (in pixels) of a double layout for a given configuration.
- CUPVA_EXPORT
Export variable in VMEM to export table.
- CUPVA_SINGLE_BUFFER_LENGTH
Calculate the VMEM buffer length (in pixels) of a single layout for a given configuration.
- EXTERN_VMEM
Declare extern VMEM buffer.
- EXTERN_VMEM_DMA_CONFIG
Declare extern table in VMEM for use with dynamic reconfiguration of DMA engine.
- EXTERN_VMEM_POINTER
Declare extern pointer in VMEM.
- EXTERN_VMEM_SURFACE
Declare extern surface handler in VMEM.
- INIT_AGEN1
Configure 1D agen from 1D AgenWrapper.
- INIT_AGEN2
Configure 2D agen from 2D AgenWrapper.
- INIT_AGEN3
Configure 3D agen from 3D AgenWrapper.
- INIT_AGEN4
Configure 4D agen from 4D AgenWrapper.
- INIT_AGEN5
Configure 5D agen from 5D AgenWrapper.
- INIT_AGEN6
Configure 6D agen from 6D AgenWrapper.
- VMEM
Declare exported VMEM data buffer.
- VMEM_NOEXPORT
Declare a private VMEM buffer.
- VMEM_POINTER
Declare pointer in VMEM.
- VMEM_SURFACE
Declare surface handler in VMEM.
Enumerations#
- BufferLayoutTypes
The VMEM buffer layout types.
- DmaTransferModeType
Valid values for DSTM/DDTM fields in DMA descriptors.
- PerfmonCounters
Available counters for Perfmon.
- TranspositionMode
Transposition mode.
- VmemBufferTypes
Valid VMEM types for export.
Functions#
- void cupvaAdvanceAgenCfg(AgenCFG *cfg, const int32_t step_in_bytes)
Advance the address of an AgenCFG stored in VMEM.
- void cupvaCircularBufferMemcpy(void *cb, uint32_t size)
Helper to copy the first 64 bytes of a circular buffer to its tail so that vload/vstore give correct results.
- void cupvaFloatingPointNANErrorEnabled(bool enable)
Enable or disable VPU exceptions when floating point calculations result in NAN.
- void * cupvaGetAgenCfgBase(AgenCFG *cfg)
Extract the base address of an agen CFG saved in VMEM.
- void cupvaICachePrefetch(uintptr_t addr_in_words, uint32_t size)
Start VPU ICache prefetch.
- void cupvaModifyAgenCfgBase(AgenCFG *cfg, void *const addr)
Update the base address of an agen CFG saved in VMEM.
- int32_t cupvaPerfmonReportRaw(PerfmonSample const &before, PerfmonSample const &after, PerfmonCounters counter)
Report the difference between two counters in two PerfmonSamples.
- void cupvaPerfmonTakeSample(PerfmonSample &sample)
Collect a perfmon sample.
- void cupvaPrefetchDoneWait(void)
Wait until all VPU issued icache prefetch requests have completed.
- void cupvaPrefetchReadyWait(void)
Wait until there is space in VPU prefetch FIFO for a new prefetch command.
- vintx cupvaSurfaceAddress2D(VPUSurfaceData const &surfData, dvintx coords)
Compute surface address (vector operations).
- uint64_t cupvaSurfaceAddress2D(VPUSurfaceData const &surfData, uint32_t const x, uint32_t const y)
Compute surface address (scalar operations).
- int printf_(const char *format,…)
Print string to host application's stdout.
- void swbrk(void)
Add software breakpoint.
Data Structures#
- AgenWrapper
Helper to configure agen by using stride instead of mod for each dim.
- PerfmonSample
Object representing a point-in-time sample of Perfmon.
- PvaDmaDescriptor
(deprecated) Data layout of 64 byte DMA descriptor.
Enumerations#
-
enum BufferLayoutTypes#
The VMEM buffer layout types.
DEFAULT_LAYOUT is translated differently by RasterDataFlow depending on whether the configuration uses a halo. With a halo, the default layout is circular; without a halo, it is double.
Values:
-
enumerator DEFAULT_LAYOUT#
-
enumerator SINGLE_LAYOUT#
-
enumerator DOUBLE_LAYOUT#
-
enumerator CIRCULAR_LAYOUT#
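The DEFAULT_LAYOUT rule described for this enum can be modeled as a small decision function. The following is a minimal host-side sketch, not cuPVA code: the enum is a local stand-in that only mirrors the enumerator names listed here (its numeric values are not taken from the real headers), and `resolveLayout` and `hasHalo` are hypothetical names introduced for illustration.

```cpp
#include <cassert>

// Local stand-in mirroring the enumerator names documented above. The real
// definition lives in the cuPVA headers; numeric values here are assumptions.
enum BufferLayoutTypes
{
    DEFAULT_LAYOUT,
    SINGLE_LAYOUT,
    DOUBLE_LAYOUT,
    CIRCULAR_LAYOUT
};

// Hypothetical helper (not a cuPVA API): models how DEFAULT_LAYOUT is
// resolved according to the description above. Explicit layouts pass through.
inline BufferLayoutTypes resolveLayout(BufferLayoutTypes requested, bool hasHalo)
{
    assert(requested >= DEFAULT_LAYOUT && requested <= CIRCULAR_LAYOUT);
    if (requested != DEFAULT_LAYOUT)
    {
        return requested;
    }
    // Halo pattern -> circular layout, non-halo pattern -> double layout.
    return hasHalo ? CIRCULAR_LAYOUT : DOUBLE_LAYOUT;
}
```

Under this model, `resolveLayout(DEFAULT_LAYOUT, true)` yields CIRCULAR_LAYOUT, while an explicit SINGLE_LAYOUT request is returned unchanged.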
-
enum DmaTransferModeType#
Valid values for DSTM/DDTM fields in DMA descriptors.
Values:
-
enumerator DMA_TRANS_MODE_INV#
invalid
-
enumerator DMA_TRANS_MODE_DRAM#
DRAM src/dst
-
enumerator DMA_TRANS_MODE_VMEM#
VMEM src/dst
-
enumerator DMA_TRANS_MODE_SRAM#
L2SRAM
-
enumerator DMA_TRANS_MODE_MMIO#
MMIO is valid as dst in VPU config mode only
-
enumerator DMA_TRANS_MODE_VPUCFG#
VPU config mode, valid for src only
-
enum PerfmonCounters#
Available counters for Perfmon.
Used to specify a value of interest between two PerfmonSample objects.
These counters give insight into the operation of the VPS and in some cases require detailed understanding of the VPS architecture to correctly interpret. Refer to the PVA SDK reference manuals for more details of the VPS architecture.
Values:
-
enumerator PERFMON_NO_INSTRUCTION#
There is no valid instruction ready to issue, due to: 1) Pipeline bubbles from jumps/branches 2) Pipeline bubble from WFE_R5 to Active state transition 3) Cycles required to perform instruction alignment 4) I-cache misses.
-
enumerator PERFMON_VALID_INSTRUCTION#
VPU issues an instruction to the processor pipeline.
Sum of this count, PERFMON_NO_INSTRUCTION, and PERFMON_STALL_INSTRUCTION_DECODE is equal to the number of cycles VPU is in Active state between samples.
-
enumerator PERFMON_STALL_INSTRUCTION_DECODE#
Instruction decode stage is stalled due to: 1) Data hazard, or dependency through registers, between the instruction packet in ID and another packet already in the pipeline 2) Accumulated stall of ID and subsequent pipeline stages, back-pressured to stall the ID stage. If this counter is high and the cause is not backpressure from other stalls, there may be many register dependencies.
This may be helped by loop unrolling and/or software pipelining.
-
enumerator PERFMON_STALL_SUPERBANK_LOAD_CONFLICT#
VMEM load stalled due to superbank conflict.
Most commonly caused by multiple loads to the same superbank in a single packet.
-
enumerator PERFMON_STALL_DATA_HAZARD_LOAD#
VMEM load stalled due to data hazard.
This occurs when there is an earlier store to an address in close proximity to the current load, and this store has not yet reached the VPU pipeline stage where the store has been completed.
The VPU is conservative in detecting Store-to-Load hazards. If there is no dependency, but close proximity (-/+ 2x64B lines), user can get around it by using Transpose Loads with Lane_Offset=0. In this case, the programmer must guarantee that overlapping stores and subsequent loads and stores are separated by a minimum number of cycles. Refer to the PVA SDK reference manuals for details.
-
enumerator PERFMON_STALL_DMA_LOAD_CONFLICT#
VMEM load stalled due to conflict with DMA.
This can happen when VPU reads from the same superbank that DMA is also reading, and DMA has higher priority. Generally VPU has priority over the DMA engine. However, after waiting too long, the DMA engine read will gain elevated priority and can cause VPU stalls.
-
enumerator PERFMON_STALL_SUPERBANK_STORE_CONFLICT#
VMEM store stalled due to superbank conflict.
Most commonly caused by multiple stores to the same superbank in a single packet.
-
enumerator PERFMON_STALL_DMA_STORE_CONFLICT#
VMEM store stalled due to conflict with DMA.
This can happen when VPU writes to the same superbank that DMA is also writing to, and DMA has higher priority. Generally VPU has priority over the DMA engine. However, after waiting too long, the DMA engine request will gain elevated priority and can cause VPU stalls.
-
enumerator PERFMON_VPU_WAITING#
Number of cycles VPU spent waiting for other engines in a low powered state.
The VPU is able to transition to a low power state while waiting on events from DMA, DLUT or its instruction cache.
-
enumerator PERFMON_ICACHE_MISSES#
Number of instruction cache misses.
-
enumerator PERFMON_ICACHE_MISS_DURATION#
Number of cycles spent waiting for instructions to become available following a cache miss.
This is a subset of PERFMON_NO_INSTRUCTION.
-
enumerator PERFMON_DLUT_BUSY#
Number of cycles during which DLUT is busy.
-
enumerator PERFMON_DLUT_WAIT_FOR_VPU#
Number of cycles DLUT spends in the finished state before VPU acknowledges completion.
-
enumerator PERFMON_COUNT#
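The relationship stated for PERFMON_VALID_INSTRUCTION (valid, no-instruction and decode-stall counts sum to the Active cycles between samples) can be turned into a quick utilization figure. The sketch below is plain host-side arithmetic under that stated relationship; `IssueBreakdown`, `activeCycles` and `issueRate` are hypothetical names, and the three fields are assumed to hold raw deltas of the kind cupvaPerfmonReportRaw returns.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for three raw counter deltas taken between the same
// pair of samples (one per counter named in the field comments).
struct IssueBreakdown
{
    int32_t valid;         // PERFMON_VALID_INSTRUCTION delta
    int32_t noInstruction; // PERFMON_NO_INSTRUCTION delta
    int32_t stallDecode;   // PERFMON_STALL_INSTRUCTION_DECODE delta
};

// Cycles the VPU spent in Active state, per the documented sum.
inline int32_t activeCycles(const IssueBreakdown &d)
{
    return d.valid + d.noInstruction + d.stallDecode;
}

// Fraction of Active cycles in which an instruction was issued.
inline double issueRate(const IssueBreakdown &d)
{
    const int32_t total = activeCycles(d);
    assert(total > 0);
    return static_cast<double>(d.valid) / static_cast<double>(total);
}
```

A low issue rate with a large decode-stall share points toward register dependencies or back-pressure, while a large no-instruction share points toward branches or I-cache misses.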
-
enum TranspositionMode#
Transposition mode.
Values:
-
enumerator TRANS_MODE_NONE#
No transposition
-
enumerator TRANS_MODE_1#
Transpose every 1 data element
-
enumerator TRANS_MODE_2#
Transpose every 2 data elements
-
enumerator TRANS_MODE_4#
Transpose every 4 data elements
-
enumerator TRANS_MODE_8#
Transpose every 8 data elements
-
enumerator TRANS_MODE_16#
Transpose every 16 data elements
-
enumerator TRANS_MODE_32#
Transpose every 32 data elements
-
enum VmemBufferTypes#
Valid VMEM types for export.
Values:
-
enumerator VMEM_TYPE_MIN#
Invalid type; do not use
-
enumerator VMEM_TYPE_DATA#
Standard data buffer
-
enumerator VMEM_TYPE_VPUC_TABLE#
Table which can be used for VPU config
-
enumerator VMEM_TYPE_POINTER#
Deprecated - use ExtMemPointerEx instead
-
enumerator VMEM_TYPE_SYSTEM#
System reserved; do not use
-
enumerator VMEM_TYPE_POINTER_EX#
Buffer representing a pointer of type ExtMemPointerEx
-
enumerator VMEM_TYPE_MAX#
Invalid type; do not use
Functions#
-
inline void cupvaAdvanceAgenCfg(AgenCFG *cfg, const int32_t step_in_bytes)#
Advance the address of an AgenCFG stored in VMEM.
This function advances the address of an AgenCFG stored in VMEM. It respects circular buffer configuration of the AgenCFG.
Pointers passed to this function must be valid and must not be NULL. Failure to do so will result in undefined behavior.
- Parameters:
cfg – [inout] AgenCFG to advance
step_in_bytes – How many bytes to advance
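To illustrate what "respecting the circular buffer configuration" could mean for an address advance, here is a host-side model of a wrapping offset update. It is only a sketch of the general idea: it does not touch AgenCFG, the name `advanceCircular` and its parameters are hypothetical, and the real function's internal behavior is not documented here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model (not the cuPVA implementation): advance a byte offset
// inside a circular buffer of bufferSize bytes, wrapping in both directions.
inline uint32_t advanceCircular(uint32_t offset, int32_t stepInBytes, uint32_t bufferSize)
{
    assert(bufferSize > 0 && offset < bufferSize);
    const int64_t size = static_cast<int64_t>(bufferSize);
    // Use 64-bit math so negative steps and large offsets do not overflow.
    int64_t next = (static_cast<int64_t>(offset) + stepInBytes) % size;
    if (next < 0)
    {
        next += size; // C++ remainder keeps the sign of the dividend
    }
    return static_cast<uint32_t>(next);
}
```

For example, advancing offset 192 by 128 bytes in a 256-byte buffer wraps to offset 64 in this model.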
-
inline void cupvaCircularBufferMemcpy(void *cb, uint32_t size)#
Helper to copy the first 64 bytes of a circular buffer to its tail so that vload/vstore give correct results.
When using agen configured for circular buffer, a load/store which crosses the circular buffer boundary may give the wrong results. It is recommended to use this function to ensure that loads are consistent.
Pointers passed to this function must be valid and must not be NULL. Failure to do so will result in undefined behavior.
- Parameters:
cb – Pointer to start of circular buffer
size – Size of circular buffer in bytes
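The head-to-tail copy can be pictured with a plain memcpy on the host. This is a minimal sketch of the idea, not the cuPVA implementation: `mirrorHeadToTail` is a hypothetical name, and the sketch assumes that 64 bytes of padding are allocated directly after the `size` bytes of the circular region, which the text above does not state explicitly.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumption: the allocation holds size + 64 bytes, so the copy below stays
// in bounds. The 64-byte figure comes from the function description above.
constexpr uint32_t kGuardBytes = 64;

// Hypothetical host-side model: mirror the first 64 bytes of the circular
// region just past its end, so an access that runs over the boundary reads
// the same data it would see after wrapping.
inline void mirrorHeadToTail(void *cb, uint32_t size)
{
    // size >= 64 also guarantees that source and destination do not overlap.
    assert(cb != nullptr && size >= kGuardBytes);
    auto *bytes = static_cast<uint8_t *>(cb);
    std::memcpy(bytes + size, bytes, kGuardBytes);
}
```

In this model the copy has to be repeated whenever the head of the buffer is rewritten, otherwise the mirrored tail goes stale.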
-
void cupvaFloatingPointNANErrorEnabled(bool enable)#
Enable or disable VPU exceptions when floating point calculations result in NAN.
Ordinarily, floating point calculations which yield NAN raise an exception which leads to immediate termination of the VPU program. This can be disabled for some sections of code when NAN may be expected and can be handled downstream.
- Parameters:
enable – true: floating point calculations resulting in NAN will terminate VPU execution. false: execution will continue after a floating point NAN.
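Because NAN exceptions are meant to be disabled only for selected sections of code, a scoped guard is one way to make sure they are switched back on. The sketch below runs on the host with a stand-in: `fakeSetNanErrorEnabled` and `g_nanErrorsEnabled` are hypothetical placeholders for the call to cupvaFloatingPointNANErrorEnabled, and `ScopedNanTolerance` is an illustrative name. Whether this RAII pattern suits a given VPU toolchain is an assumption to verify.

```cpp
#include <cassert>

// Host-side stand-in for the device state. It starts enabled, since NAN
// ordinarily raises an exception according to the description above.
static bool g_nanErrorsEnabled = true;

// Hypothetical placeholder for cupvaFloatingPointNANErrorEnabled(enable).
inline void fakeSetNanErrorEnabled(bool enable)
{
    g_nanErrorsEnabled = enable;
}

// Disables NAN exceptions for the lifetime of the object and re-enables
// them on scope exit.
class ScopedNanTolerance
{
public:
    ScopedNanTolerance() { fakeSetNanErrorEnabled(false); }
    ~ScopedNanTolerance() { fakeSetNanErrorEnabled(true); }
    ScopedNanTolerance(const ScopedNanTolerance &) = delete;
    ScopedNanTolerance &operator=(const ScopedNanTolerance &) = delete;
};
```

Note that this guard always restores the enabled state on exit, so it is only appropriate when exceptions were enabled before the scope was entered.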
-
inline void *cupvaGetAgenCfgBase(AgenCFG *cfg)#
Extract the base address of an agen CFG saved in VMEM.
Pointers passed to this function must be valid and must not be NULL. Failure to do so will result in undefined behavior.
- Parameters:
cfg – [in] AgenCFG to extract address from
- Returns:
Base address of cfg
-
void cupvaICachePrefetch(uintptr_t addr_in_words, uint32_t size)#
Start VPU ICache prefetch.
- Parameters:
addr_in_words – The starting prefetch address. The address is specified in words because the compiler generates addresses in words. Therefore, a function call like cupvaICachePrefetch((uintptr_t)<function_name>, size) will work as intended.
size – The code size to prefetch. The size is specified in bytes and will be rounded up to cache line size.
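The rounding applied to `size` can be reproduced with ordinary integer arithmetic, which helps when estimating how much code a prefetch request will actually cover. This is a host-side sketch only: `roundUpToLine` is a hypothetical helper, and the cache line size is passed in as a parameter because this page does not state it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round a byte count up to a whole number of cache
// lines. lineBytes is an assumption supplied by the caller; consult the PVA
// SDK reference manuals for the actual line size.
inline uint32_t roundUpToLine(uint32_t sizeBytes, uint32_t lineBytes)
{
    assert(lineBytes > 0);
    // Caller must keep sizeBytes + lineBytes - 1 within the uint32_t range.
    return ((sizeBytes + lineBytes - 1) / lineBytes) * lineBytes;
}
```

With an assumed 64-byte line, a 100-byte request would cover 128 bytes, and a request that is already a multiple of the line size is unchanged.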
-
inline void cupvaModifyAgenCfgBase(AgenCFG *cfg, void *const addr)#
Update the base address of an agen CFG saved in VMEM.
Pointers passed to this function must be valid and must not be NULL. Failure to do so will result in undefined behavior.
- Parameters:
cfg – [out] AgenCFG to update with new address
addr – [in] New address
-
inline int32_t cupvaPerfmonReportRaw(PerfmonSample const &before, PerfmonSample const &after, PerfmonCounters counter)#
Report the difference between two counters in two PerfmonSamples.
- Parameters:
before – [in] The earlier sample, collected with cupvaPerfmonTakeSample
after – [in] The later sample, collected with cupvaPerfmonTakeSample
counter – [in] The counter of interest
- Returns:
The difference between the earlier and later sample for the given counter
-
inline void cupvaPerfmonTakeSample(PerfmonSample &sample)#
Collect a perfmon sample.
This function will first immediately stop accumulating perfmon counters. It will then collect the perfmon sample. This can take some time. After the sample has been collected, perfmon counters will resume accumulating and this function will return.
Pausing and resuming perfmon accumulation during this function call minimizes the noise introduced by sampling. However, this function modifies the state of the VPU pipeline, and introduces a delay in VPU processing. DMA, DLUT and I$ are not paused while this function is running. Therefore, there still may be significant sampling noise in some cases. Users are advised to space out samples to reduce the impact of sampling noise.
- Parameters:
sample – [out] Opaque PerfmonSample to write
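The two perfmon functions are meant to be used as a pair: take a sample, run the code of interest, take a second sample, then report the difference for a counter. The sketch below shows that pattern on the host with stand-ins. `FakeSample`, `fakeTakeSample`, `fakeReportRaw`, `g_fakeCycleCounter` and `measureRegion` are all hypothetical; on a VPU their roles would be played by PerfmonSample, cupvaPerfmonTakeSample and cupvaPerfmonReportRaw.

```cpp
#include <cassert>
#include <cstdint>

// Host-side stand-ins so the pattern compiles off-device. None of these are
// cuPVA APIs.
struct FakeSample
{
    int32_t cycles;
};

static int32_t g_fakeCycleCounter = 0; // pretend hardware counter

inline void fakeTakeSample(FakeSample &sample)
{
    sample.cycles = g_fakeCycleCounter;
}

inline int32_t fakeReportRaw(const FakeSample &before, const FakeSample &after)
{
    assert(after.cycles >= before.cycles);
    return after.cycles - before.cycles;
}

// Sample, run the region of interest, sample again, report the delta.
template <typename Work>
inline int32_t measureRegion(Work &&work)
{
    FakeSample before{};
    FakeSample after{};
    fakeTakeSample(before);
    work();
    fakeTakeSample(after);
    return fakeReportRaw(before, after);
}
```

Keeping the measured region reasonably long follows the advice above about spacing out samples, since each sample perturbs the VPU pipeline.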
-
inline void cupvaPrefetchDoneWait(void)#
Wait until all VPU issued icache prefetch requests have completed.
-
inline void cupvaPrefetchReadyWait(void)#
Wait until there is space in VPU prefetch FIFO for a new prefetch command.
-
inline vintx cupvaSurfaceAddress2D(VPUSurfaceData const &surfData, dvintx coords)#
Compute surface address (vector operations).
- Parameters:
surfData – The reference to a VPUSurfaceData object.
coords – The x coordinates in bytes and the y coordinates in lines. The dvintx data type is used here as a struct with two fields, x and y, where the x coordinate is stored in the hi part and the y coordinate is stored in the lo part. For block linear surfaces, the x coordinate needs to be a multiple of 64, and the y coordinate needs to be a multiple of 2.
- Returns:
The translated surface address in bits 39:0 per lane. Higher bits contain metadata which is recognized by DynamicDataFlow APIs and should be preserved when passing to cupvaDDFUpdateSrcAddr etc.
-
inline uint64_t cupvaSurfaceAddress2D(VPUSurfaceData const &surfData, uint32_t const x, uint32_t const y)#
Compute surface address (scalar operations).
- Parameters:
surfData – The reference to a VPUSurfaceData object.
x – The x coordinate in bytes. For block linear surfaces, the x coordinate needs to be a multiple of 32.
y – The y coordinate in lines. For block linear surfaces, the y coordinate needs to be a multiple of 2.
- Returns:
The translated surface address in bits 39:0. Higher bits contain metadata which is recognized by DynamicDataFlow APIs and should be preserved when passing to cupvaDDFUpdateSrcAddr etc.
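Since the scalar return value packs the address into bits 39:0 and metadata into the higher bits, it can be useful to separate the two when logging or debugging. The following sketch is plain bit manipulation on a `uint64_t`; `addressBits`, `metadataBits` and `withAddress` are hypothetical helpers, and the layout of the metadata itself is not documented here, so it is treated as opaque.

```cpp
#include <cassert>
#include <cstdint>

// Mask covering bits 39:0, per the return value description above.
constexpr uint64_t kAddressMask = (uint64_t{1} << 40) - 1;

// Hypothetical helper: the translated address without the metadata bits.
inline uint64_t addressBits(uint64_t translated)
{
    return translated & kAddressMask;
}

// Hypothetical helper: the opaque metadata stored above bit 39.
inline uint64_t metadataBits(uint64_t translated)
{
    return translated >> 40;
}

// Hypothetical helper: swap in a new 40-bit address while leaving the
// metadata bits untouched, so they stay available to downstream APIs.
inline uint64_t withAddress(uint64_t translated, uint64_t newAddress)
{
    assert(newAddress <= kAddressMask);
    return (translated & ~kAddressMask) | newAddress;
}
```

When passing a value on to the DynamicDataFlow APIs, the full 64-bit value should be used rather than the masked address, as the description above asks for the metadata to be preserved.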
-
int printf_(const char *format, ...)#
Print string to host application’s stdout.
This function implements buffered printf behavior for device-side debugging. Using this function requires 1) The VPU ELF must be linked with some memory reserved for the printf buffer (256B is reserved by default) and 2) The host API to set a non-zero printf buffer must have been called, see cupva::SetVPUPrintBufferSize or CupvaSetVPUPrintBufferSize. Violating (1) will cause a linking error. Violating (2) will result in no data being displayed at stdout.
The host side printf buffer is drained to the application’s stdout on every call to cupva::Fence::wait. Additionally, the buffer is drained immediately after a successful call to cupva::Stream::submit for any cupva::Stream created with cupva_utils::CreateSyncStream.
The printf mechanism results in significant VPU runtime overhead. It should never be used in a production setting.
For compatibility with native model, users should call this function with printf(…) instead of printf_(…).
This API is for use in non-safety builds only.
- Parameters:
format – The string to undergo formatting
... – Variadic arguments to pack into the output buffer according to the content of format
- Returns:
Number of characters printed to the output buffer
-
inline void swbrk(void)#
Add software breakpoint.
When this function is called, the VPU is paused and waits for a debugger to connect. The debugger can then single-step through the code.
This API is for use in non-safety builds only.