The GPU clock in a Cardhu system operates at 520 MHz. Cardhu can shade (with the simplest possible fragment shader) at a rate of two pixels per GPU cycle, giving a final peak fragment throughput of 1G fr/s.
It is worth noting that early depth and stencil tests are performed before the fragment shader with a speed of 8 fragments per clock. This means that fragments can be culled by depth and stencil tests four times faster than they can be rendered. Fragments can be culled at a rate of 4G fr/s (although the practical rate may be significantly lower than this, due to limited available memory bandwidth).
Using depth and/or stencil tests to limit fully shaded overdraw is strongly recommended for all applications with non-trivial depth complexity.
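The peak rates quoted above are the clock frequency multiplied by the per-clock throughput. The following is a minimal sketch of that arithmetic in C; the function name is illustrative and not part of any Tegra or OpenGL ES API.

```c
#include <assert.h>
#include <stdint.h>

/* Peak rate in fragments per second = clock (Hz) * fragments per clock.
 * Illustrative helper only; this is not a driver or hardware query. */
uint64_t peak_fragments_per_second(uint64_t clock_hz,
                                   uint32_t fragments_per_clock)
{
    assert(fragments_per_clock > 0);
    return clock_hz * fragments_per_clock;
}
```

With a 520 MHz clock, two shaded fragments per clock gives 1.04G fr/s and eight tested fragments per clock gives 4.16G fr/s, which the text rounds to 1G and 4G.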
glClear operates in a special optimized mode and can clear at higher than the peak 3D fragment rate. Always use glClear for whole surface clears (in preference to clearing with 3D polygons).
Use texture formats with the lowest number of bits per texel that will fulfill the needs of the source artwork and its use in the shader. The following texture formats are supported by all current Tegra devices:
|Formats||Bits per Texel|
||32 (see note)|
For each particular use case, you should choose the most compact texture format possible for the level of quality required. The inflated memory bandwidth requirement (and subsequent performance impact) of the fatter texture formats can be dramatic. For example, the difference in memory footprint between an RGBA texture stored as DXT1, versus RGBA half-float, is 16x.
Note: Tegra does not directly support 24 bit per pixel RGB textures. These are expanded by the driver at specification time to 32 bit per pixel RGBX textures. No device memory is saved with these formats, and the reformatting process at specification time requires driver work on the CPU.
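The 16x figure above can be checked with a short calculation. The sketch below assumes a single mip level whose dimensions are multiples of four (so that DXT1 works out to exactly 4 bits per texel); the function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of one mip level for a given texel size.
 * bits_per_texel examples: 4 for DXT1, 32 for RGBA8888 (and for 24-bit RGB
 * after the driver expands it to RGBX), 64 for RGBA half-float. */
uint64_t texture_level_bytes(uint32_t width, uint32_t height,
                             uint32_t bits_per_texel)
{
    assert(bits_per_texel > 0);
    return ((uint64_t)width * height * bits_per_texel) / 8;
}
```

For a 1024x1024 texture this gives 512 KiB as DXT1 and 8 MiB as RGBA half-float. A 24-bit RGB source occupies the same 4 MiB as RGBA8888 once expanded, which is why it saves no device memory.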
The Tegra fragment unit incorporates sophisticated hardware texture sampling features such as bilinear, trilinear and anisotropic filter modes. In some cases, using these features has a performance cost, and application developers should take care to understand the impact of their use.
GL_LINEAR_MIPMAP_NEAREST, are "free" and can be executed at 1 sample per clock, per fragment unit.
GL_NEAREST_MIPMAP_LINEAR, incur an extra 1 cycle per-sample penalty.
GL_TEXTURE_MAX_ANISOTROPY_EXT) unless it’s absolutely necessary to prevent visual artifacts. Blanket use of high anisotropic filter settings will substantially reduce fragment throughput and increase memory bandwidth consumption.
The Tegra fragment unit supports two levels of fragment variable precision: fp20 (an s.6.13 floating-point format) and fx10 (two’s complement s.1.8 format). Tegra can efficiently store twice as many temporaries, varyings, and uniforms in fx10 format as in fp20.
highp and mediump precision variables are both interpreted by the compiler as 20-bit floating-point values (fp20)
lowp precision variables are interpreted as fixed-point 10-bit values (fx10). As fx10 can only store values in the range (-2, 2), it is typically used only for color computations and normalized values (e.g., perfect for blending). Floating point precision is usually required for storing coordinates (e.g., interpolated texture coordinates).
lowp" rule is regarding texture mapping coordinates, which should always be mediump. The texture unit requires mediump texture coordinates, and additional ALU cycles may be required to convert lowp texture coordinates prior to sampling.
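To make the fx10 limits concrete, the sketch below models a two's complement s.1.8 value in C: ten bits give integer codes from -512 to 511, each worth 1/256. Round-to-nearest and saturation at the ends are assumptions made for illustration; the text does not say how the hardware rounds or clamps.

```c
#include <assert.h>

/* Model of fx10 (two's complement s.1.8): codes -512..511, step 1/256.
 * Rounding mode and saturation are assumptions, not documented behavior. */
double fx10_quantize(double value)
{
    const double scale = 256.0;              /* 2^8 steps per unit */
    const double min_value = -512.0 / scale; /* -2.0 */
    const double max_value = 511.0 / scale;  /* 1.99609375 */
    long code;

    assert(value == value);                  /* reject NaN */
    if (value <= min_value) return min_value;
    if (value >= max_value) return max_value;
    code = (long)(value * scale + (value >= 0.0 ? 0.5 : -0.5));
    return (double)code / scale;
}
```

Under this model a step of 1/256 is ample for 8-bit color channels, but it is coarser than one texel of a normalized coordinate on any texture wider than 256 texels, which fits the advice to keep texture coordinates in mediump.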
Minimizing the number of actively-used hardware vector registers at any point in a shader is important on Tegra for maximum performance. Registers are consumed by actively-used varying and temporary variables in the shader. A register or sub-section of a register can be used by several variables if those variables have non-overlapping lifespans. The shader compiler actively optimizes these. Best performance will be found by limiting the number of actively-used variables (temporaries and uniforms) at any given time. Note that a register can hold either one fp20 variable or two fx10 variables, so use of lowp will help maximize register usage.
One of the single heaviest consumers of memory bandwidth in any graphical Tegra application is the framebuffer. Typically millions of fragments are written, and re-written, and read (in the case of alpha-blended drawcalls) in every rendered frame.
Rendering to a 16-bit (RGB 565), rather than a full 32-bit (RGBA8888) frame buffer, saves 50% of this memory traffic in a single stroke. Often the reduction in visual quality is small; the 16-bit surface is dithered to reduce banding and other visual artifacts. If an application doesn’t require a destination alpha channel, using the 16-bit framebuffer format can be fantastic bang for the buck.
Note that owing to a quirk in the EGL specification, requesting a 16bpp RGB rendering buffer via eglChooseConfig will return 24- or 32bpp rendering configs (if available) before any 16bpp configs. Thus, it is safest to have EGL enumerate a list of all available, matching configs and search them manually for the most suitable.
Note: Selecting a 32bpp config with a 16bpp screen format (or vice-versa) can result in decreased
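The manual search recommended above can be sketched as follows. To stay self-contained, the example works on a plain array rather than live EGL handles: `ConfigInfo` and `find_rgb565_config` are hypothetical names, and the struct stands in for values an application would read per config with `eglGetConfigAttrib` (for `EGL_RED_SIZE`, `EGL_GREEN_SIZE`, and so on).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plain-data copy of the attributes read for each config
 * returned by EGL; not an EGL type. */
typedef struct {
    int red_bits, green_bits, blue_bits, alpha_bits, depth_bits;
} ConfigInfo;

/* Returns the index of the first RGB565 config without destination alpha
 * and with at least min_depth_bits of depth, or -1 if none matches. */
int find_rgb565_config(const ConfigInfo *configs, size_t count,
                       int min_depth_bits)
{
    size_t i;
    assert(configs != NULL || count == 0);
    for (i = 0; i < count; ++i) {
        const ConfigInfo *c = &configs[i];
        if (c->red_bits == 5 && c->green_bits == 6 && c->blue_bits == 5 &&
            c->alpha_bits == 0 && c->depth_bits >= min_depth_bits) {
            return (int)i;
        }
    }
    return -1;
}
```

The point of the exact-match test on the color sizes is that the first entry in the returned list is not necessarily 16bpp, so the application cannot simply take index zero.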
As mentioned previously, Tegra can reject fragments via depth and/or stencil testing at four times the peak fragment shading rate. Thus, it is best to use depth or stencil rejection when possible, in order to increase practical fragment throughput.
Tegra can reject fragments via depth-testing at a very high rate. As a result, applications that can render opaque parts of a scene, even roughly front-to-back with depth testing enabled, can see a performance improvement. This is especially true if possibly-occluded objects with expensive fragment shaders can be drawn last.
If the application uses particularly complex fragment shaders with a large amount of overdraw, then even if front-to-back sorting is not feasible, the application can see higher performance using an initial depth-only rendering pass with the color buffer masking set to GL_FALSE. It is important that for the depth-only pass all unnecessary load is removed from the vertex shader, otherwise the prepass is likely to be (unnecessarily) geometry limited.
Stencil-killed fragments are generally the fastest rejection case possible, as the stencil buffer is an 8-bit, rather than a 16-bit, surface. Stencil killing for depth complexity minimization can be more complex in terms of application setup code, and some data sets simply cannot sort geometry in this way. However, if static geometry is available pre-sorted into depth-ordered layers, stencil-kill can provide maximum performance. Applications that are fill-limited and have high per-pixel fragment depth should consider stencil-killed front-to-back rendering with depth-testing disabled. In some cases, 2D UIs done in OpenGL ES are good examples of this.
There are a number of ways to approximate a lower bound on the number of clocks required to render a fragment. These can assist in optimizing shaders. We will think of the fragment shader unit in terms of a set of pipelined "sub-units" that do different fragment-related functions. The most important sub-units are the raster sub-unit, the texture sub-unit, and the ALU sub-unit. The maximum of the cycle counts required by each of these units is a (very) rough lower bound on execution time, although obviously dependencies between the units (ALU needing a texture lookup, texture coords needing ALU computations) complicate matters.
The raster sub-unit can execute up to four interpolate instructions per cycle. Each of the four interpolations can generate either:
The texture sub-unit can retrieve any of the following:
The ALU sub-unit can execute up to four independent scalar MADs (Multiply-Adds) per clock (actually, these are technically Multiply-Multiply-Adds with limitations).
x = a * b + c
But with some limitations on d, it can do:
x = a * b + c * d
where d is equal to b or c, modulo the free register modifiers listed in a later section (e.g.
Thus, there is no way to do more than 4 adds in one cycle, so the number of adds required to render a fragment divided by 4 is the lower bound on the cycle count.
However, the ALU sub-units can be configured to perform operations other than four MADs per cycle, such as:
Generally, it is only possible to do 4 multiplications per cycle. With certain constraints it is possible to do limited cases involving 6 or 8 multiplications per clock, owing to the way that the 4 independent MADs work. But in general, the number of multiplication results required to render a fragment divided by 4 is a lower bound on the cycle count.
Immediate values (numeric constants compiled into the shader) other than 0.0 and 1.0 may be embedded as constants in the instruction code. When this occurs, the fourth ALU unit is no longer available for operation (because the immediate value is embedded into its part of the instruction word). Obviously, this reduces the number of ALU operations available per cycle from four to three, which can have a dramatic impact on performance for ALU bound shaders.
Scientific functions (e.g., sqrt) and reciprocal may take more than one cycle, and more than one unit, and complicate lower bound computations.
As an example, consider the shader seen earlier:
varying mediump vec2 uv;
varying mediump vec2 uv1;
varying lowp vec4 color;
uniform sampler2D tex0;
uniform sampler2D tex1;
void main()
{
    gl_FragColor = texture2D(tex0, uv) * (color + texture2D(tex1, uv1));
}
Analyzing it, we see the following:
The uv and color varyings are needed in one cycle (4 lowp values and 2 mediump), so one whole cycle of "Raster Sub Unit" work.
The uv1 varying alone is needed in the other cycle, so another fraction of (and thus a whole) "Raster Sub Unit" cycle is needed.
This indicates that across the board, this shader could require two cycles. With the shader compiler current at the time of writing, the shader compiled to two cycles.
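The rough lower-bound rule described earlier (the slowest sub-unit sets the pace, with at most four adds and, in general, four multiplies per ALU cycle) can be written down as a small C sketch. The function names are illustrative, and the model deliberately ignores immediates, scientific functions, and the limited 6- or 8-multiply cases, so treat its result as an estimate rather than a compiler prediction.

```c
#include <assert.h>

/* Ceiling division for unsigned counts. */
unsigned ceil_div(unsigned numerator, unsigned denominator)
{
    assert(denominator > 0);
    return (numerator + denominator - 1) / denominator;
}

/* ALU bound: up to four adds and, in general, four multiplies per cycle. */
unsigned alu_cycle_bound(unsigned adds, unsigned multiplies)
{
    unsigned add_cycles = ceil_div(adds, 4);
    unsigned mul_cycles = ceil_div(multiplies, 4);
    return add_cycles > mul_cycles ? add_cycles : mul_cycles;
}

/* Overall bound: the maximum of the per-sub-unit cycle estimates. */
unsigned fragment_cycle_bound(unsigned raster_cycles,
                              unsigned texture_cycles,
                              unsigned alu_cycles)
{
    unsigned bound = raster_cycles;
    if (texture_cycles > bound) bound = texture_cycles;
    if (alu_cycles > bound) bound = alu_cycles;
    return bound;
}
```

For the example shader, the vec4 add and vec4 multiply give an ALU estimate of one cycle, while the raster analysis above gives two. Whatever texture estimate is supplied (it is a placeholder input here), the bound is at least two cycles, which agrees with the compiled result.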
Tegra can modify the values of fragment math operands "for free" in some key cases. The hardware provides the following (optional) modifiers for each operand, in the listed order:
(2x-1) can be implemented entirely using free input modifier operations (if the compiler does not need the modifiers on the operand for another transformation).
The compiler will apply these modifiers automatically when it can, so simply be aware of their existence, and be careful not to hide the constants used for these operations in uniforms (which are opaque to the compiler).
For example, 4-component blending as follows:
newDest.rgb = oldDest.rgb * src.a + src.rgb * (1 - src.a)
newDest.a = oldDest.a * src.a + src.a * (1 - src.a)
This is possible in 1 cycle, as it can be written as:
newDest.rgb = oldDest.rgb * src.a + src.rgb * (-(src.a - 1))
newDest.a = oldDest.a * src.a + src.a * (-(src.a - 1))
This is possible since the MAD instructions use src.a multiple times and the (1 - src.a) can be computed from src.a via input operand modifiers.
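As a quick check that the rewritten form computes the same blend, the C sketch below evaluates one channel both ways. It only models the arithmetic; whether a particular shader is actually scheduled in one cycle is up to the compiler. The function names are illustrative.

```c
#include <assert.h>

/* Direct form from the text: dest * a + src * (1 - a). */
float blend_direct(float dest, float src, float src_alpha)
{
    assert(src_alpha >= 0.0f && src_alpha <= 1.0f);
    return dest * src_alpha + src * (1.0f - src_alpha);
}

/* Rewritten form: the fourth operand is derived from src_alpha itself
 * by subtracting one and negating, matching x = a * b + c * d. */
float blend_rewritten(float dest, float src, float src_alpha)
{
    float d = -(src_alpha - 1.0f);
    assert(src_alpha >= 0.0f && src_alpha <= 1.0f);
    return dest * src_alpha + src * d;
}
```

Both functions return the same value for any alpha in [0, 1], because `-(a - 1)` is just `1 - a` written so that the only operand needed is `a`.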
Tegra can modify the values of fragment math operation results "for free" in some key cases. The hardware provides the following (optional) modifiers, for each result, which are applied in the listed order:
The compiler will apply these automatically when it can, so simply be aware of their existence, and be careful not to hide the constants used for these operations in uniforms (which are opaque to the compiler).
y = clamp(x, 0.0, 1.0);
This is free if the compiler can schedule it as a modifier to the result of some other operation. So, for example, preferring:
clamp(x, 0.0, 1.0)
can improve performance for many shaders (so long as the additional clamp doesn’t break the algorithm).
Avoid conditional code in the fragment shader. Especially avoid using uniforms or other input variables to emulate discrete sets of modes. Any discrete set of modes can and should be split into a set of specialized shaders, one per mode.
If you need to use conditionals, it is better (where possible) to express them as ternary operators and GLSL functions that produce binary vectors (e.g.,
When a fragment shader includes discard, the per-fragment depth (and/or stencil) values cannot be written efficiently in the higher-level depth-test unit (simply because a fragment may pass the depth-test, then later be discarded).
Additional Fragment Shader Load
Instead, extra instructions to update the depth (and/or stencil) buffers are added to the fragment shader. For small shaders, the additional fragment shader cycles required to perform the update can cause a substantial reduction in fragment shader performance; for larger shaders, the extra instructions may have little or no impact on the instruction schedule.
Reduced Memory Efficiency
In all cases, the additional memory bandwidth required, and the reduced memory efficiency due to per-fragment rather than batched-fragment surface update, will tend to impact performance somewhat.
Zero Contribution Fragments
Primitives rendered with a discarding fragment shader typically compare some per-fragment calculated (often texture sampled) value against a uniform or constant reference value. The result of that comparison is used to determine whether the fragment is discarded or not.
Most often this technique is used to "cookie-cut" a shape (driven by a texture) from a larger polygon. All fragments shaded cost the same. If the fragment shader is 10 clocks, the discarded pixels will also all be shaded at 10 clocks per fragment (even though they contribute nothing to the rendered image).
A number of strategies are available to mitigate the impact of these zero contribution, but fully shaded, fragments. Both techniques increase overall geometry load to reduce fully shaded fragment load:
GL_EQUAL and re-draw the primitive. This ensures that only the visible layer of fragments is shaded at full cost.
For non-trivial discarding fragment shaders with lots of interior overdraw (bushes, tree foliage) and fairly low vertex shader complexity, this second approach can have enormous performance benefit. If vertex shader complexity is high, the technique may cause more harm than good.
If You Must Discard, Do It Early
It is important to understand that discard is NOT to be used as an optimization strategy on Tegra; fragment shaders run to completion, regardless of whether a pixel was discarded or not. However, any texture sample instructions after the discard will not be issued if the pixel is discarded.
This does not mean that the shader doesn’t run to completion, only that the texture sample instructions effectively become no-ops. Placing the discard as early as possible in the fragment shader helps reduce load on the texture cache from killed pixels, and may save a little memory bandwidth.
Using negative-valued lod-bias on texture samplers tends to increase texture-fetch bandwidth; the samples fetched are from a finer (larger) mip-level than normal. The increased texel-to-pixel ratio means texture memory accesses for adjacent pixels are further apart, and more (mainly unused) texture data is fetched through the cache.
A further wrinkle exists for lod-bias specified as a uniform (rather than constant immediate value).
Lod-Bias Specified via Uniform
The Tegra fragment shader compiler currently tends to generate very slow code for this use-case (for small shaders, throughput can be reduced by a factor of three). If you absolutely must use lod-bias, the most efficient way is to specify the bias as an immediate constant in the fragment program.
Alpha blending places an additional burden on the memory system, since all fragments must be read and modified before writing. The number of additional fragment shader cycles required to perform the blend is typically low (one cycle or so), although for small shaders the impact can be significant.
It is critically important, when rendering on Tegra based devices, to use alpha blending sparingly, only when necessary. Enabling alpha-blending for opaque geometry (driving the blend with uniform constant 1.0 or texture alpha channel 1.0) will severely degrade fragment shader performance.
It is not possible to mitigate the increased memory bandwidth from alpha blended drawcalls, but the impact of the additional fragment shader instructions required to calculate the blend can be minimized.
Compute Only Required Output Color Components
If rendering to a frame buffer with fewer than four color components (for example, single-channel luminance or three-channel RGB), or only a subset of the output channels are needed, compute and store only the required components. This tends to free partial fragment shader ALU cycles, which the compiler may be able to use to hide some of the cost of the blend.
Use Color Masking To Reduce Computed Color Components
If the shader used to calculate alpha blended fragments is also used to render opaque drawcalls, it might be difficult or awkward to modify the source shader specifically for use with alpha-blended primitives. The same result can be achieved by using glColorMask to disable writing of unused components when blending. The compiler picks up the current color mask setting and will try to generate efficient shader code driven by the enabled component mask.
Note: Avoid masking color components when alpha blending is disabled. If color components are masked when rendering opaque primitives the compiler may be forced to generate code to read, then modify, then write the frame buffer, in order to preserve the existing color channel value.
Some OpenGL ES 2.0 states cause the driver to inject code into the user-provided fragment shader, in order to implement an instance of the shader for that particular state configuration. In this case, the driver modifies the shader source according to the particular state configuration, and re-compiles the shader.
The set of all "injected state" that can cause the generation of a new shader instance is:
glLogicOp the fragment shader implements the operation.
Although compiling shaders is an expensive, CPU-side operation, the impact on performance is generally limited to a one-off cost the first time a shader is used. Resultant shader instances are then cached by the driver so they are available from cache subsequently.
Obviously, each shader instance occupies an entry in the shader-cache, and the number of such entries is finite.
Try to limit the number of unique injected shader variants used by your application to allow the shader-cache to work efficiently. Applications with thousands of injected shader instances will tend to suffer dire performance as they thrash the driver shader-cache, and thus will be exposed to per-drawcall shader re-compilation.
It is important to understand that shader instances may not be generated until their first use. This can cause severe performance spikes during the early, or transition, stages of an application. The best way to combat this is to "warm the shader-cache."
This is most easily done by isolating the list of all shaders and state configurations used in the application, and submitting a drawcall to the driver for each useful combination. A single degenerate triangle rendered to an off-screen buffer is all that’s needed to force the driver to generate and cache the shader instance.
Note that you should not attempt to blindly populate the cache with all possible combinations of injected state for every shader, as this is also likely to cause spikiness (an initial round of cache-thrashing until the required set of shaders settle into cache).
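The warm-up pass described above amounts to one tiny draw per shader and state combination that the application really uses. The C sketch below shows only that control flow: `ShaderVariant`, `WarmDrawFn`, `warm_shader_cache`, and `count_draw` are hypothetical names, and the callback is where a real application would bind the program and state and submit a single degenerate triangle to an off-screen buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one shader + injected-state combination. */
typedef struct {
    int shader_id;
    int state_id;
} ShaderVariant;

/* Hypothetical callback: bind the variant and issue one degenerate draw. */
typedef void (*WarmDrawFn)(const ShaderVariant *variant, void *user_data);

/* Issues one warm-up draw per listed variant; returns the number issued.
 * The list should hold only combinations the application actually uses. */
size_t warm_shader_cache(const ShaderVariant *variants, size_t count,
                         WarmDrawFn draw_degenerate_triangle, void *user_data)
{
    size_t i;
    assert(draw_degenerate_triangle != NULL);
    for (i = 0; i < count; ++i) {
        draw_degenerate_triangle(&variants[i], user_data);
    }
    return count;
}

/* Stand-in callback so the sketch runs without a GL context: counts calls. */
void count_draw(const ShaderVariant *variant, void *user_data)
{
    (void)variant;
    ++*(size_t *)user_data;
}
```

Driving the loop from an explicit list, rather than from the cross product of every shader and every state, keeps the warm-up in line with the caution above about populating the cache blindly.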
NVIDIA® GameWorks™ Documentation Rev. 1.0.200601 ©2014-2020. NVIDIA Corporation. All Rights Reserved.