How To: Optimize OpenGL ES 2.0 Performance for Tegra

Introduction

NVIDIA’s Tegra mobile system on a chip (SOC) series includes an extremely powerful and flexible 3D GPU whose capabilities are well matched to the OpenGL ES 2.0 APIs. For optimal content rendering, there are some basic guidelines and several tips that can help developers reach their goals. This document details these recommendations, along with a few warnings about features and choices that can limit performance in 3D-centric applications.

The 3D GPU in all Tegra series SOCs contains a programmable vertex shading unit and a programmable fragment shading unit, each of which is accessible via OpenGL ES 2.0’s GLSL-ES shading language. Tegra also includes a high-performance multi-core ARM CPU and a high-bandwidth memory controller (MC) to round out the components of 3D rendering.

Optimal performance is achieved by:

Maximizing the efficient use of the fragment shading unit and vertex shading unit via smart shader programming.
Minimizing the use of the CPU by avoiding redundant and ill-optimized rendering methods.
Optimizing the use of memory bandwidth across the fragment unit, vertex unit and display systems.

This document covers aspects of all of these elements. Note that all quoted numbers are relative to clock settings on the Tegra 3-based “Cardhu” development kit. Numbers on other Tegra variants will differ.
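As an illustration of the second point in the list above (avoiding redundant rendering work on the CPU), here is a minimal sketch of a state cache that drops repeated texture binds before they reach the driver. The names `BindCache`, `cached_bind_texture`, `bind_sequence`, and `driver_bind_texture` are hypothetical; the last one is a stand-in that only counts calls, where a real application would call `glBindTexture`. Whether a given driver already filters such calls is not stated here, so treat this as one possible approach rather than a required one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a driver entry point such as glBindTexture.
   It only counts how often the "driver" is actually reached. */
unsigned driver_bind_calls = 0;

void driver_bind_texture(uint32_t texture)
{
    (void)texture;
    ++driver_bind_calls;
}

/* Remembers the last texture handed to the driver. */
typedef struct {
    bool     valid;    /* false until the first bind has been issued */
    uint32_t texture;  /* last texture name passed to the driver */
} BindCache;

/* Forwards the bind only when it would change the current binding. */
void cached_bind_texture(BindCache *cache, uint32_t texture)
{
    if (cache->valid && cache->texture == texture) {
        return; /* redundant bind: skip the driver call entirely */
    }
    driver_bind_texture(texture);
    cache->texture = texture;
    cache->valid = true;
}

/* Binds a sequence of textures through a fresh cache and returns how
   many binds actually reached the driver. */
unsigned bind_sequence(const uint32_t *textures, unsigned count)
{
    BindCache cache = { false, 0 };
    unsigned before = driver_bind_calls;
    for (unsigned i = 0; i < count; ++i) {
        cached_bind_texture(&cache, textures[i]);
    }
    return driver_bind_calls - before;
}
```

With the sequence 7, 7, 7, 9, 9, 7, only three of the six requested binds reach the driver. Sorting draw calls so that objects sharing a texture are drawn together makes such a cache more effective.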

Of particular note:

Tegra 3 devices have double the raw fragment shader horsepower of Tegra 2 devices.
Tegra 3 devices include four A9 CPU cores, whereas Tegra 2 devices have two.
Tegra 3 systems are typically clocked significantly higher than Tegra 2:
- Cardhu has a 520 MHz GPU, 750 MHz memory, and a 1.4 GHz CPU.
- Ventana, the Tegra 2 development system, has a 300 MHz GPU, 300 MHz memory, and a 1.0 GHz CPU.

Basic Performance Notes

In real-world applications, the most common performance bottlenecks are:

Fragment fill rate for applications using long shaders and/or lots of overdraw.
Memory bandwidth on devices with large screens or when using large/deep textures without mip-selecting filter modes.
Lack of CPU/GPU parallelism for applications that use redundant or GPU-unfriendly OpenGL ES code.
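To make the first two bottlenecks more concrete, the following C sketch estimates how fragment workload and color-buffer write traffic scale with screen size and overdraw. The function names are hypothetical, and the model is deliberately crude: it assumes every overdrawn layer is fully shaded and written, and it ignores depth traffic, texture reads, and display scan-out. It is a back-of-the-envelope aid, not a description of how Tegra hardware accounts for work.

```c
#include <stdint.h>

/* Fragments shaded per second: each screen pixel is shaded once per layer
   of overdraw, every frame. All inputs are caller-supplied assumptions. */
uint64_t fragments_per_second(uint32_t width, uint32_t height,
                              double overdraw, uint32_t fps)
{
    double per_frame = (double)width * (double)height * overdraw;
    return (uint64_t)(per_frame * (double)fps);
}

/* Color-buffer write traffic in bytes per second for the same workload.
   Depth traffic, texture reads, and display scan-out are not counted. */
uint64_t color_write_bytes_per_second(uint32_t width, uint32_t height,
                                      double overdraw, uint32_t fps,
                                      uint32_t bytes_per_pixel)
{
    return fragments_per_second(width, height, overdraw, fps) * bytes_per_pixel;
}
```

As an illustrative case (the inputs are assumptions, not measurements from any device), a 1280x800 screen at 60 frames per second with an average overdraw of 2.5 yields 153,600,000 fragments per second; at 4 bytes per pixel, the color writes alone come to 614,400,000 bytes per second. Halving the overdraw halves both figures, which is why long shaders combined with heavy overdraw show up so quickly as a fill-rate limit.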