Vertex Shader Performance

The Tegra vertex shader unit is extremely powerful and flexible. It is a super-scalar, dual-issue unit that pairs a 4-wide vector floating-point pipeline with a scalar floating-point "multi-function unit" pipeline, which provides implementations of common transcendental functions (sin, sqrt, log2, etc.).

It fully supports conditional operations and looping, and can transform vertices at a rate of more than 100M per second. Pre-transform and post-transform caches help it achieve high throughput efficiently, without loading the back-end memory system excessively.

Vertex Shader Guidelines and Optimizations

Reduce the memory footprint and optimize the layout of vertex data to minimize load on the memory system by doing the following:

Store vertex data as interleaved attribute streams ("array of structures" layout), such that "over-fetch" for an attribute tends to pre-fetch data that is likely to be useful for subsequent attributes and vertices. Storing attributes as distinct, non-interleaved ("structure of arrays") streams can lead to "page-thrashing" in the memory system, with a massive resultant drop in performance.
Use compressed vertex attribute formats. Tegra supports byte and short formats, as well as half-precision (fp16) and full-precision (fp32) float vertex attribute formats. All attributes are converted to floating point (for free) and can be normalized (according to the GLES2 spec, also for free) when fetched into the vertex unit. Judicious use of byte formats (normal and tangent vectors, colors) and half-precision floats (texture mapping coordinates) can easily reduce the vertex footprint, and thus vertex fetch memory bandwidth, by 50%, with negligible or zero impact on image quality.
Optimize geometry for pre-transform cache. Typically this involves a trivial re-ordering of vertex data (and subsequent remapping of indices) to improve cache coherency of fetched vertex attributes.
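
As a rough illustration of the layout guidelines above, the following C sketch compares an all-fp32 vertex with an interleaved, compressed one. The struct names, the attribute set, and the helpers `pack_snorm8` and `pack_unorm8` are hypothetical and not part of any Tegra API. The signed-byte quantization assumes the (2c + 1)/255 mapping that the OpenGL ES 2.0 specification uses for normalized signed bytes, and the fp16 texture coordinates are assumed to be converted offline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Uncompressed reference layout: every attribute stored as fp32. */
typedef struct {
    float position[3];
    float normal[3];
    float tangent[3];
    float color[4];
    float uv[2];
} FatVertex;                /* 15 floats = 60 bytes */

/* Interleaved, compressed layout (hypothetical attribute set). */
typedef struct {
    float    position[3];   /* fp32, 12 bytes                        */
    int8_t   normal[4];     /* normalized signed bytes, w unused     */
    int8_t   tangent[4];    /* normalized signed bytes, w unused     */
    uint8_t  color[4];      /* normalized unsigned bytes             */
    uint16_t uv[2];         /* fp16 bit patterns, converted offline  */
} PackedVertex;             /* 28 bytes */

/* Compile-time check of the sizes quoted in the comments above. */
static_assert(sizeof(FatVertex) == 60, "unexpected padding in FatVertex");
static_assert(sizeof(PackedVertex) == 28, "unexpected padding in PackedVertex");

/* Quantize a value in [-1, 1] to a signed byte, assuming the decoder
 * computes f = (2c + 1) / 255 (the ES 2.0 signed-byte mapping). */
int8_t pack_snorm8(float f)
{
    if (f > 1.0f)  f = 1.0f;
    if (f < -1.0f) f = -1.0f;
    float c = (255.0f * f - 1.0f) * 0.5f;   /* in [-128, 127]          */
    return (int8_t)((int)(c + 128.5f) - 128); /* round to nearest      */
}

/* Quantize a value in [0, 1] to an unsigned byte (decoder: f = c / 255). */
uint8_t pack_unorm8(float f)
{
    if (f > 1.0f) f = 1.0f;
    if (f < 0.0f) f = 0.0f;
    return (uint8_t)(f * 255.0f + 0.5f);
}
```

With this particular attribute set the vertex shrinks from 60 to 28 bytes, roughly in line with the 50% figure quoted above. The byte offsets of each member (for example `offsetof(PackedVertex, normal)`) and `sizeof(PackedVertex)` as the stride are the values an application would pass when describing the interleaved stream to the GL.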

Take full advantage of Tegra’s sophisticated post-transform cache by following these guidelines:

Always use indexed primitives (glDrawElements). This reduces the average number of vertices shaded per triangle, which improves overall geometry throughput considerably.
Prefer indexed triangle lists to indexed triangle strips. The degenerate triangles that are used to link triangle strips consume precious cycles for no return; therefore, using triangle strips (indexed or otherwise) is not optimal on Tegra devices.
Optimize index buffers to minimize cache misses when fetching the indexed vertex stream. For complex vertex shaders, this can be hugely beneficial: reusing vertices from the post-transform cache reduces load on the vertex shader unit at zero cost. NVIDIA provides the NvTriStrip tool to help optimize geometry for the post-transform (and pre-transform) cache.
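
To get a feel for how index order affects vertex shader load, the sketch below counts how many indices would miss a simple FIFO post-transform cache. This is only a model: the FIFO replacement policy, the function name `count_vertex_shader_runs`, and any cache size passed in are assumptions, since the cache organization and size on Tegra are not stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CACHE_ENTRIES 64   /* upper bound for this model only */

/* Count how many indices miss a FIFO cache of `cache_size` entries.
 * Each miss stands for one vertex shader invocation. */
size_t count_vertex_shader_runs(const uint16_t *indices, size_t index_count,
                                size_t cache_size)
{
    uint16_t cache[MAX_CACHE_ENTRIES];
    size_t used = 0;     /* number of valid entries                  */
    size_t next = 0;     /* slot to overwrite next (oldest once full) */
    size_t misses = 0;

    assert(cache_size > 0 && cache_size <= MAX_CACHE_ENTRIES);

    for (size_t i = 0; i < index_count; ++i) {
        int hit = 0;
        for (size_t j = 0; j < used; ++j) {
            if (cache[j] == indices[i]) { hit = 1; break; }
        }
        if (!hit) {
            cache[next] = indices[i];          /* evict the oldest entry */
            next = (next + 1) % cache_size;
            if (used < cache_size) ++used;
            ++misses;
        }
    }
    return misses;
}
```

For two triangles sharing an edge, indexed as 0,1,2 and 2,1,3, a model cache of 16 entries shades 4 vertices instead of the 6 that a non-indexed draw would need. Dividing the miss count by the triangle count gives an average cache miss ratio that can be compared before and after reordering an index buffer.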

Use vertex buffer objects (VBOs) to store ALL geometry (vertex and index) data:

By storing data in VBOs, the application makes a contract with the driver, stating that the buffer content will not change, except via certain well-defined APIs.
Mark as many VBOs as possible with GL_STATIC_DRAW. This makes a further statement to the driver that the application does not intend to modify the contents of a particular buffer in the foreseeable future. This allows the driver to deal with that buffer content as if it were immutable, without fear of the content being modified.
If the content of a VBO is likely to change, mark the buffer as GL_DYNAMIC_DRAW (or GL_STREAM_DRAW, depending on the likely update pattern).

Avoid updating dynamic VBO content for a buffer already in use by the GPU:

Attempting to access a buffer already referenced by one or more in-flight draw calls will stall the CPU until all draw calls referencing that buffer have been processed. At a minimum, use a round-robin pool of VBOs for dynamic objects (reusing them in least-recently-drawn order), and if possible, use an independent VBO set per dynamic object.
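
One possible shape for such a pool is sketched below in C. The `VboPool` type, its functions, and the pool depth of 3 are hypothetical; the buffer names are assumed to come from `glGenBuffers`, and the sketch only tracks which buffer was drawn longest ago. It does not query the GPU, so the depth must be chosen large enough to cover the draw calls that can be in flight.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 3   /* assumed depth; tune to the frames in flight */

typedef struct {
    unsigned int  vbo[POOL_SIZE];         /* GL buffer names            */
    unsigned long last_drawn[POOL_SIZE];  /* draw stamp, 0 = never used */
    unsigned long clock;                  /* monotonically increasing   */
} VboPool;

/* Store the buffer names and mark every slot as never drawn. */
void vbo_pool_init(VboPool *pool, const unsigned int *names)
{
    assert(pool && names);
    for (size_t i = 0; i < POOL_SIZE; ++i) {
        pool->vbo[i] = names[i];
        pool->last_drawn[i] = 0;
    }
    pool->clock = 0;
}

/* Return the buffer that was drawn longest ago and stamp it as the
 * most recently used one. The caller updates and draws from it. */
unsigned int vbo_pool_acquire(VboPool *pool)
{
    size_t oldest = 0;
    assert(pool);
    for (size_t i = 1; i < POOL_SIZE; ++i) {
        if (pool->last_drawn[i] < pool->last_drawn[oldest]) oldest = i;
    }
    pool->last_drawn[oldest] = ++pool->clock;
    return pool->vbo[oldest];
}
```

Each time a dynamic object needs new vertex data, the application would call `vbo_pool_acquire`, update the returned buffer, and draw from it, so that the buffer being written is never the one most recently submitted to the GPU.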

Minimize unnecessary load on the memory system when updating VBO content dynamically:

If CPU-side update of dynamic VBO content is unavoidable, the most efficient way to do this on Tegra is with the glMapBuffer/glUnmapBuffer APIs (exposed in OpenGL ES 2.0 as glMapBufferOES/glUnmapBufferOES through the GL_OES_mapbuffer extension). Using this approach removes the need for the driver to copy the data, as the application writes directly to the mapped buffer storage.
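
The pattern can be sketched as follows. The `Position` type and `write_positions` function are hypothetical; the point is only that the update code produces its output directly in the destination pointer rather than in a staging array that is later copied. The GL calls are shown in a comment because they need a live context, and they assume the GL_OES_mapbuffer entry points, which only allow write-only mappings.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } Position;

/* Write updated positions straight into `dst`, which is intended to be
 * mapped buffer storage. Every element is written exactly once, in
 * order, and `dst` is never read back. */
void write_positions(Position *dst, const Position *rest, size_t count,
                     float lift)
{
    assert(dst && rest);
    for (size_t i = 0; i < count; ++i) {
        dst[i].x = rest[i].x;
        dst[i].y = rest[i].y + lift;   /* hypothetical animation */
        dst[i].z = rest[i].z;
    }
}

/* Intended call pattern (needs a GL context, so not compiled here):
 *
 *   glBindBuffer(GL_ARRAY_BUFFER, vbo);
 *   Position *dst = (Position *)glMapBufferOES(GL_ARRAY_BUFFER,
 *                                              GL_WRITE_ONLY_OES);
 *   if (dst) {
 *       write_positions(dst, rest, count, lift);
 *       glUnmapBufferOES(GL_ARRAY_BUFFER);
 *   }
 */
```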

Character Skinning and the Vertex Unit

Moving character skinning from the CPU to the GPU is an effective way to offload the CPU and lower memory bandwidth. OpenGL ES 2.0 makes dynamic character skinning possible on the GPU, even if the skinning method does not fit the "basic bone-palette" limitations. Even more complex skinning can be done on the GPU (e.g., bone skinning combined with morph deformations). By moving all skinning to the GPU, we can also avoid using dynamic vertex buffers, since all of the source data (except matrices) can be static. The following recommendations apply to character skinning on the GPU:

Analyze the use of bone matrices per object, and avoid passing down unused bone matrices as uniforms for a given object.
Analyze bone weights per vertex offline, and cull bones with inconsequential weights.
Since bone matrices are normally rigid transforms, consider representing each bone as a 3x4 matrix (a set of three 4-vector uniforms) holding a 3x3 rotation+scale and a 3D translation, rather than as a 4x4 matrix, especially if the bone palette is large. The final transform from world or post-deformed model space to clip space can then be a single 4x4 matrix. This cuts the number of 4-vector uniforms used for bone matrices by 25%.
If multiple sub-sections of a character are to be drawn with the same shader, but each with different rendering settings, consider setting the shader and its bone transform uniforms once, then interleave texture and render state changes with sub-mesh draw calls without ever changing the shader or the bone uniforms. This can greatly lower the overhead of the sub-object rendering. In this case, since the entire character’s palette of bone matrices can be sent down once, it is fine that each subsection of the mesh does not use all of the bones.
Carefully analyze the performance of multi-pass rendering algorithms with complex GPU skinning, since GPU skinning is computed for each rendering pass, not once per frame.
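
The 3x4 bone representation recommended above can be sketched on the CPU side as follows. The `Bone3x4` type and all function names are hypothetical, and the source matrices are assumed to be column-major 4x4 matrices (the usual OpenGL convention) whose fourth row is (0, 0, 0, 1).

```c
#include <assert.h>
#include <stddef.h>

/* Three rows of four floats: one bone as three 4-vector uniforms. */
typedef struct { float row[3][4]; } Bone3x4;

/* Pack a column-major 4x4 matrix into three rows. The fourth row is
 * dropped, so it must be (0, 0, 0, 1) for the result to be equivalent. */
void pack_bone_3x4(const float m[16], Bone3x4 *out)
{
    assert(out);
    assert(m[3] == 0.0f && m[7] == 0.0f && m[11] == 0.0f && m[15] == 1.0f);
    for (int r = 0; r < 3; ++r) {
        for (int c = 0; c < 4; ++c) {
            out->row[r][c] = m[c * 4 + r];   /* element (r, c) */
        }
    }
}

/* Transform a point: three dot products of a row with (x, y, z, 1). */
void transform_point_3x4(const Bone3x4 *b, const float p[3], float out[3])
{
    for (int r = 0; r < 3; ++r) {
        out[r] = b->row[r][0] * p[0] + b->row[r][1] * p[1]
               + b->row[r][2] * p[2] + b->row[r][3];
    }
}

/* Number of 4-vector uniforms consumed by a bone palette. */
size_t bone_uniform_vec4s(size_t bone_count, size_t vec4s_per_bone)
{
    return bone_count * vec4s_per_bone;
}
```

For a palette of 32 bones, this layout needs 96 four-vector uniforms instead of 128. In the vertex shader, each bone would be read as three vec4 uniforms, and the skinned position computed with three dot products against vec4(position, 1.0), mirroring `transform_point_3x4`.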