Tegra has a very capable vertex processor, so you shouldn't automatically assume that a large triangle count is the reason for poor performance. However, in order to achieve high throughput, a number of resource constraints must be managed.
Typically, 20-30 cycles per vertex is reasonable for most assets, but peak throughput is achieved at around 10 cycles per vertex.
Improving post-transform vertex reuse is the single easiest way to increase geometry throughput on Tegra.
In order to maximize post-transform reuse, it is recommended that you use indexed triangle lists for all geometry. Indexed triangle lists allow easy construction of optimized primitives without the need for degenerate link triangles.
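To make the effect of post-transform reuse concrete, the following minimal sketch counts how many vertices would have to be transformed for a given index list. It assumes a simple FIFO cache, and the cache size passed in is purely illustrative: neither the replacement policy nor the actual cache size of any Tegra part is stated here. The function names are hypothetical helpers, not part of any NVIDIA tool.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound for the simulated cache (illustrative assumption only). */
#define MAX_CACHE_ENTRIES 64

/* Counts the indices that miss a FIFO cache of `cache_size` entries.
   Every miss corresponds to one vertex that must be transformed again. */
size_t count_vertex_cache_misses(const unsigned short *indices,
                                 size_t index_count,
                                 size_t cache_size)
{
    unsigned short cache[MAX_CACHE_ENTRIES];
    size_t used = 0;    /* number of valid cache entries */
    size_t next = 0;    /* FIFO slot that is overwritten next */
    size_t misses = 0;

    assert(cache_size > 0 && cache_size <= MAX_CACHE_ENTRIES);

    for (size_t i = 0; i < index_count; ++i) {
        int hit = 0;
        for (size_t j = 0; j < used; ++j) {
            if (cache[j] == indices[i]) {
                hit = 1;
                break;
            }
        }
        if (!hit) {
            cache[next] = indices[i];
            next = (next + 1) % cache_size;
            if (used < cache_size) {
                ++used;
            }
            ++misses;
        }
    }
    return misses;
}

/* Average number of transformed vertices per triangle (lower is better). */
double average_cache_miss_ratio(const unsigned short *indices,
                                size_t index_count,
                                size_t cache_size)
{
    assert(index_count >= 3 && index_count % 3 == 0);
    return (double)count_vertex_cache_misses(indices, index_count, cache_size)
           / (double)(index_count / 3);
}
```

For a quad drawn as the indexed triangle list {0, 1, 2, 2, 1, 3}, this model transforms four vertices (two per triangle), whereas the same quad drawn without shared indices would transform six.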
Geometry can be drawn non-indexed (DrawArrays) or indexed (DrawElements). DrawElements is recommended.
Primitives can be drawn as TRIANGLE_STRIP/TRIANGLE_FAN/TRIANGLES. TRIANGLES is recommended.
Efficiently organized vertex/index data is key to reducing load on the memory system. It's recommended that you place all geometry in VBOs, and interleave vertex attribute data whenever possible (i.e., use the "array of structures" vertex layout).
NOTE: Dynamic vertex buffer objects are identifiable by the dashboard's dynamic vertex buffer object indicator (status bar, lower right).
Dynamically modified vertex data should also be placed in buffer objects. However, modifying a submitted buffer (with glMapBufferOES/glBufferData/glBufferSubData) will stall the CPU until that buffer has been completely processed by the GPU.
Currently, buffer renaming is not supported. It is therefore recommended that you double-buffer dynamic VBO handles and update only the handle that is not referenced by the in-flight scene.
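The double-buffering scheme can be sketched as a small bookkeeping structure. This is a minimal illustration rather than a definitive implementation: the type and function names are hypothetical, `BufferHandle` stands in for a GL buffer name, and the sketch assumes the GPU is at most one frame behind the CPU, so the handle used two frames ago is safe to modify.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a GL buffer name (assumption for illustration). */
typedef unsigned int BufferHandle;

/* Two buffer handles for the same dynamic vertex data. */
typedef struct {
    BufferHandle handles[2];
    unsigned int frame;   /* number of frames started so far */
} DoubleBufferedVbo;

/* Stores two handles that were created elsewhere. */
void dbvbo_init(DoubleBufferedVbo *vbo, BufferHandle first, BufferHandle second)
{
    assert(vbo != NULL);
    assert(first != second);
    vbo->handles[0] = first;
    vbo->handles[1] = second;
    vbo->frame = 0;
}

/* Returns the handle to update and draw with this frame, alternating
   between the two so the handle used by the previous frame is left alone. */
BufferHandle dbvbo_next_writable(DoubleBufferedVbo *vbo)
{
    assert(vbo != NULL);
    BufferHandle handle = vbo->handles[vbo->frame & 1u];
    vbo->frame++;
    return handle;
}
```

Each frame, the application would bind the returned handle, upload the new vertex data into it, and draw from it, leaving the other handle untouched while the previous frame may still be in flight.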
Find a completely static scene (so that the call trace contains identical content each time). Do the following half a dozen or so times:
Save a call trace to trace_n.txt.
Look for buffer modifications (glMapBufferOES/glBufferData/glBufferSubData) in one of the trace files.
As an example, some of this can be automated with grep:
grep -B1 glMapBufferOES trace_?.txt
The -B1 option outputs the one line preceding each match (which should capture the prior buffer binding).
It's recommended that you reduce vertex fetch memory bandwidth as much as possible, and not rely on fp32 for everything. Tegra supports half-precision floats, as well as signed/unsigned byte and short formats.
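As a rough illustration of the savings, the sketch below compares an all-fp32 vertex with one that stores the normal as signed bytes and the texture coordinate as unsigned shorts. The vertex formats and helper names are hypothetical. The signed conversion assumes a simple scale by 127; the exact normalization rule depends on the API version, so treat the scale factor as an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Baseline: every attribute stored as 32-bit float (32 bytes per vertex). */
typedef struct {
    float position[3];
    float normal[3];
    float texcoord[2];
} VertexFp32;

/* Compact variant: position stays fp32, the rest is quantized (20 bytes). */
typedef struct {
    float    position[3];
    int8_t   normal[4];    /* signed normalized bytes, padded to 4 */
    uint16_t texcoord[2];  /* unsigned normalized shorts */
} VertexPacked;

/* Maps [-1, 1] to [-127, 127], clamping out-of-range input. */
int8_t pack_snorm8(float v)
{
    if (v > 1.0f) v = 1.0f;
    if (v < -1.0f) v = -1.0f;
    float scaled = v * 127.0f;
    /* Round to nearest without depending on the math library. */
    return (int8_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Maps [0, 1] to [0, 65535], clamping out-of-range input. */
uint16_t pack_unorm16(float v)
{
    if (v > 1.0f) v = 1.0f;
    if (v < 0.0f) v = 0.0f;
    return (uint16_t)(v * 65535.0f + 0.5f);
}
```

With these hypothetical formats, the per-vertex fetch drops from 32 to 20 bytes. Whether the lost precision is acceptable depends on the asset, so it is worth checking visually.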
The memory controller in a Tegra device is much more efficient when fetching from spatially coherent addresses. Storing vertex attributes as an array of structures ensures good spatial locality for the attributes making up any particular vertex, which results in an efficient memory access pattern. Not following this guidance can severely impact performance.
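An array-of-structures layout can be described with one struct per vertex, from which the byte offset and stride of each attribute follow directly. The sketch below is a minimal example with a hypothetical vertex format and helper names. It only computes the layout and makes no GL calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interleaved vertex: all attributes of one vertex are
   adjacent in memory, so one fetch touches one contiguous region. */
typedef struct {
    float position[3];
    float normal[3];
    float texcoord[2];
} InterleavedVertex;

/* Layout of one attribute inside the interleaved buffer. */
typedef struct {
    int    components;  /* number of components per attribute */
    size_t offset;      /* byte offset from the start of a vertex */
    size_t stride;      /* byte distance between consecutive vertices */
} AttributeLayout;

enum { ATTR_POSITION = 0, ATTR_NORMAL, ATTR_TEXCOORD, ATTR_COUNT };

/* Fills `out` with the layout of each attribute. */
void describe_interleaved_layout(AttributeLayout out[ATTR_COUNT])
{
    const size_t stride = sizeof(InterleavedVertex);
    assert(out != NULL);

    out[ATTR_POSITION].components = 3;
    out[ATTR_POSITION].offset = offsetof(InterleavedVertex, position);
    out[ATTR_POSITION].stride = stride;

    out[ATTR_NORMAL].components = 3;
    out[ATTR_NORMAL].offset = offsetof(InterleavedVertex, normal);
    out[ATTR_NORMAL].stride = stride;

    out[ATTR_TEXCOORD].components = 2;
    out[ATTR_TEXCOORD].offset = offsetof(InterleavedVertex, texcoord);
    out[ATTR_TEXCOORD].stride = stride;
}
```

With a VBO bound, the stride and offset of each attribute would typically be supplied as the stride and pointer arguments of glVertexAttribPointer. All three attributes share the same 32-byte stride, which is what keeps each vertex's data contiguous.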
NVIDIA® GameWorks™ Documentation Rev. 1.0.211026 ©2014-2021. NVIDIA Corporation and affiliates. All Rights Reserved.