Memory Bandwidth

High-Level Memory Bandwidth Overview

On Tegra, with complex, blended, texture-heavy rendering, available memory bandwidth can become a performance bottleneck if care is not taken. A few rough guidelines help the developer recognize when an application may be bumping up against such limits.

The memory interface is designed to transfer 8 bytes per memory controller (MC) clock cycle. On Cardhu, with a 750MHz memory clock, this gives a total available memory bandwidth of around 6GB/s (8 bytes * 750MHz).

Memory Efficiency

The efficiency of the memory system in real-life cases is affected by a number of factors:

How many memory clients are active simultaneously; fragment unit frame buffer write, fragment unit texture-fetch, vertex unit attribute-fetch are all GPU memory clients. Bear in mind that the Tegra memory architecture is unified, so non-GPU clients compete with the GPU for memory bandwidth.
The frequency of "bus turns"; in other words, a run of memory read requests will be processed with greater efficiency than the same number of interspersed read/write requests.
Disparate memory access patterns from a single client will expose the client heavily to "page miss" penalties, which tend to dramatically reduce efficiency and dominate performance.

Based on experimental data, it is safe to assume between 60% and 90% efficiency for fragment rendering. On Cardhu, at 6GB/s, that gives a lower bound around 3.6GB/s and an upper bound around 5.4GB/s. The following examples assume a somewhat pessimistic 66% efficiency, so the total available memory bandwidth is around 4GB/s.
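
The arithmetic behind these bounds can be sketched in a few lines of Python. This is only a back-of-envelope model built from the figures quoted above; the function and constant names are illustrative and do not come from any Tegra tool.

```python
# Back-of-envelope model of effective memory bandwidth on Cardhu.
# All figures are the ones quoted in the text; names are illustrative only.

BYTES_PER_MC_CLOCK = 8      # bytes transferred per memory controller clock
MC_CLOCK_HZ = 750e6         # 750MHz memory clock

def effective_bandwidth_gb_s(efficiency):
    """Peak bandwidth scaled by an assumed efficiency, in GB/s (1e9 bytes/s)."""
    peak_bytes_per_s = BYTES_PER_MC_CLOCK * MC_CLOCK_HZ
    return peak_bytes_per_s * efficiency / 1e9

# Lower bound, the working figure used in the examples, upper bound, and peak.
for efficiency in (0.60, 0.66, 0.90, 1.00):
    gb_s = effective_bandwidth_gb_s(efficiency)
    print(f"{efficiency:.0%} efficiency: {gb_s:.2f} GB/s")
```

Changing the efficiency value is the quickest way to see how sensitive the later examples are to this one assumption.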

Display Scan Out Gobbles a Chunk

We automatically lose a chunk of this to display scan out. At native Cardhu resolution 1366 x 768, with a 32bpp color buffer, running at 60Hz refresh (independent of application frame-rate):

Scan Out Bandwidth = 240MB/s = 60Hz * 4bytes * 1366 * 768

UI Composition Gobbles Another Chunk

In many operating systems, applications draw to an intermediate render-target which is then composited (with operating-system UI elements, for example) to a final surface. That final composite surface is what is then passed to the display hardware for scan out.

The process of composition is memory bandwidth intensive; it involves physically reading (some part of) each contributing surface from memory, and writing the generated resultant surface out to memory.

For a 32bit frame buffer, at native Cardhu resolution, at 30Hz, assuming two contributing surfaces with no blending (i.e., top of the final image from surface A, bottom from surface B), the minimum cost of composition is:

Compositor Bandwidth = 240MB/s = 30Hz * (4bytes rd + 4 bytes wr) * 1366 * 768

So, the system immediately consumes more than 10% (480MB/s) of the 4.0GB/s available memory bandwidth for display and composition; what’s left is around 3.5GB/s for all other system activity.
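
As a rough check, the fixed cost of scan out and composition can be reproduced as follows. This is a sketch using only the resolution, formats and rates given above; the helper name is made up for illustration, and the "MB" figures in the text are taken to mean 2**20 bytes because that is what makes the quoted 240MB/s come out.

```python
# Fixed bandwidth cost of display scan out and UI composition.
# Resolution, formats and rates are the ones quoted in the text.

WIDTH, HEIGHT = 1366, 768
MIB = 2 ** 20   # assumption: the text's "MB" figures are 2**20 bytes

def surface_traffic(rate_hz, bytes_per_pixel):
    """Bytes per second moved when touching every pixel at rate_hz."""
    return rate_hz * bytes_per_pixel * WIDTH * HEIGHT

scan_out = surface_traffic(60, 4)          # read a 32bpp surface at 60Hz
composition = surface_traffic(30, 4 + 4)   # read 4 + write 4 bytes at 30Hz
overhead = scan_out + composition

budget = 4.0e9                             # the 4.0GB/s working figure
remaining = budget - overhead

print(f"scan out:    {scan_out / MIB:.0f} MB/s")
print(f"composition: {composition / MIB:.0f} MB/s")
print(f"overhead:    {overhead / budget:.1%} of budget")
print(f"remaining:   {remaining / 1e9:.2f} GB/s")
```

Composition at 30Hz with a read and a write per pixel costs exactly as much as scan out at 60Hz with a read only, which is why the two figures match.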

Examples Illustrating Memory Bandwidth Availability And Performance Impact

Memory latency (for example, in texture-fetch) is well hidden on Tegra, which allows us to construct a simple model of memory performance based on transactions alone.

Example 1 Geometry Drawn To Depth-Buffer Only (Depth Pre-Pass)

Here we have the simplest possible use-case:

No fragment shader at all (fragment color write is disabled with glColorMask).
The hardware can test and write fragment depth at 8 pixels per clock.
Cardhu GPU speed is 520MHz.
Cardhu depth buffer is 16bpp (2 bytes)

Depth Test: Reject

In order to sustain the 8 pixels per clock peak rate for depth test and reject (i.e. all fragments fail the depth test), we need:

Depth test transacted memory =  7.75GB/s = 8 * 2(bytes per pixel) * 520MHz

Plainly, the load this use-case places on the memory system is far in excess of available resources; 7.75GB/s required, but only (in the worst case) 3.5GB/s available.

Following this through, the ratio of available to required memory bandwidth is 0.45 (3.5/7.75). So, according to this model:

Sustainable depth-rejected ppc = 3.6 ppc  = 8ppc * (0.45)

In fact, this use-case performs better than the model on Cardhu; measured performance is around 5.1ppc (likely due to memory efficiency being considerably higher than the worst case figure used here).
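
The model can also be run backwards to see what bandwidth the measured rate implies. The sketch below follows the text's units (the 7.75GB/s figure corresponds to 2**30 bytes) and uses illustrative names; the "implied bandwidth" is simply the depth traffic needed to sustain 5.1ppc, not a measured quantity.

```python
# Depth-reject model: predicted rate, and the bandwidth implied by the
# measured rate. Figures are from the text; names are illustrative only.

GIB = 2 ** 30            # assumption: the text's 7.75GB/s is in 2**30 bytes
GPU_CLOCK_HZ = 520e6     # Cardhu GPU clock
PEAK_PPC = 8             # peak depth test rate, pixels per clock
DEPTH_BYTES = 2          # 16bpp depth buffer

def required_gb_s(ppc, bytes_per_pixel):
    """Memory traffic needed to sustain `ppc` pixels per clock."""
    return ppc * bytes_per_pixel * GPU_CLOCK_HZ / GIB

available = 3.5                                    # worst-case figure
required = required_gb_s(PEAK_PPC, DEPTH_BYTES)    # about 7.75
model_ppc = PEAK_PPC * min(1.0, available / required)

measured_ppc = 5.1
implied = required_gb_s(measured_ppc, DEPTH_BYTES)

print(f"required at peak: {required:.2f} GB/s")
print(f"model prediction: {model_ppc:.1f} ppc")
print(f"implied by 5.1ppc: {implied:.2f} GB/s")
```

Under these assumptions the measured rate implies roughly 4.9GB/s of depth traffic, which is well above the 3.5GB/s worst case and consistent with the efficiency explanation given above.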

Depth Test: Accept

Of course, some fragments will not be trivially rejected in a real depth pre-pass. To complete the picture, we need to account for memory bandwidth in the case where fragments are written.

Depth write transacted memory =  7.75GB/s = 8 * 2(bytes per pixel) * 520MHz

If all fragments pass the depth test, the same memory bandwidth is required to update the depth buffer content as was originally needed to perform the depth-test.

Total depth accept transacted memory = 15.5GB/s = 7.75GB/s(test) + 7.75GB/s(write)

Of course, this isn’t possible, so practical depth accept performance will be massively attenuated from 8ppc. Based on the ratio of available to required memory bandwidth, 0.22 (3.5/15.5), the sustainable rate is somewhere in the region of:

Sustainable depth-accepted ppc = 1.8 ppc  = 8ppc * 0.22

Again, Cardhu actually performs slightly better than the model predicts; measured depth accept rate on Cardhu is around 2.3ppc, and again this is likely due to the pessimistic 3.5GB/s available figure used here.
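
Real scenes sit somewhere between the all-reject and all-accept cases. One way to interpolate, offered here as an assumption rather than a measured result, is to charge every fragment the 2-byte depth read and only the passing fraction the 2-byte depth write. The function name and the linear blend are illustrative.

```python
# Sustainable depth-only fragment rate as a function of the fraction of
# fragments that pass the depth test. The linear blend between the two
# cases worked through in the text is an assumption of this sketch.

GIB = 2 ** 30            # assumption: the text's GB/s figures use 2**30 bytes
GPU_CLOCK_HZ = 520e6
PEAK_PPC = 8
DEPTH_BYTES = 2

def sustainable_ppc(pass_fraction, available_gb_s=3.5):
    """Model rate in pixels per clock for a given depth-test pass fraction."""
    # Every fragment reads depth; only passing fragments write it back.
    bytes_per_fragment = DEPTH_BYTES * (1.0 + pass_fraction)
    required = PEAK_PPC * bytes_per_fragment * GPU_CLOCK_HZ / GIB
    return PEAK_PPC * min(1.0, available_gb_s / required)

for pass_fraction in (0.0, 0.25, 0.5, 1.0):
    print(f"{pass_fraction:.0%} pass: {sustainable_ppc(pass_fraction):.2f} ppc")
```

The end points reproduce the 3.6ppc and 1.8ppc figures above; everything in between depends on how well the draw order lets fragments be rejected.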

How Is This Useful

Generally, it is useful to understand the impact the memory system has on performance, because that understanding tends to steer towards bandwidth-efficient solutions, which will typically work better (use less power for greater performance) in the mobile ecosystem.

Specifically, in this instance, it’s useful because the knowledge that:

Depth-tested fragments will be rejected at 3.6ppc (this rate is the same for ALL depth-rejection; fragment shader complexity is not a factor because the fragment shader is not executed), and
Fragments cannot be accepted at better than 1.8ppc (Cardhu can actually shade fragments at this rate using a minimal 1 cycle shader)

tends to lead obviously to the conclusion that:

In order to improve fragment throughput, we should try to bias the submission order of drawcalls so as to ensure that as many fragments as possible are rejected.

The model above explains how memory bandwidth limitations directly impact depth-test for the depth pre-pass use case. Depth pre-pass is a technique used often on Cardhu to reduce the amount of fully shaded overdraw (typically it’s most valuable when fragment shader complexity is high, and vertex shader complexity is relatively low).

The practical conclusion to be drawn from the calculations above is simply that a nearest-first, depth-sorted draw order could (depending on exactly how the ratio of rejected to accepted fragments improves) have a substantial performance benefit for the depth pre-pass use-case.
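
A nearest-first ordering can be as simple as sorting opaque draw calls by their distance from the camera before submission. The sketch below is engine-agnostic and entirely hypothetical: the DrawCall type, its view_depth field and the function name are invented for illustration, and a real renderer would typically also weigh state-change cost when choosing an order.

```python
# Minimal sketch of nearest-first ordering for a depth pre-pass.
# DrawCall and view_depth are hypothetical names, not part of any API.

from dataclasses import dataclass

@dataclass
class DrawCall:
    name: str
    view_depth: float   # assumed: camera-to-object distance, larger is farther

def order_for_depth_prepass(draw_calls):
    """Return draw calls sorted nearest first, so later ones tend to be rejected."""
    return sorted(draw_calls, key=lambda call: call.view_depth)

scene = [
    DrawCall("skyline", 250.0),
    DrawCall("player", 3.0),
    DrawCall("building", 40.0),
]

for call in order_for_depth_prepass(scene):
    print(call.name, call.view_depth)
```

Sorting per object rather than per triangle is usually enough to shift a large share of fragments from the accept path to the cheaper reject path.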

Example 2 Non-Depth-Tested, Alpha Blended, Textured Geometry

This use case has no depth buffer traffic; it involves a single full-screen quad which blends between the current frame buffer and a texture-provided source, driven by texture alpha.

Framebuffer format is 32bpp RGBA
Blend equation is set to GL_FUNC_ADD
Blend function is set to GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA.
Texture is RGBA 32bit, 1366 x 768 with min/mag filter mode GL_LINEAR (i.e., no mipmap selection)

The fragment shader used is this trivial 1-clock shader:

uniform sampler2D tex0; 
varying mediump vec2 texcoord0; 
void main() 
{
     gl_FragColor = texture2D(tex0, texcoord0);
}

A single full-screen quad at native Cardhu resolution is 1366 x 768 pixels. With a 1-cycle shader all fragments can be drawn in 1366 x 768 fragment shader cycles. Cardhu has two fragment units, so that cycle count can be halved. GPU-bound cycle count for this drawcall is:

GPU clocks = 0.525 million cycles = 1366 * 768 / 2

Cardhu runs at 520MHz, so time to process these fragments is:

GPU time = ~1ms = 525,000 cycles / 520MHz.
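
The GPU-bound estimate generalizes to other shader lengths. The sketch below assumes, as the text does, that work divides evenly across the two fragment units and that nothing else stalls the pipeline; the function name is illustrative.

```python
# GPU-bound time for a full-screen quad, ignoring memory bandwidth.
# Figures are from the text; even scaling across units is an assumption.

WIDTH, HEIGHT = 1366, 768
FRAGMENT_UNITS = 2
GPU_CLOCK_HZ = 520e6

def gpu_bound_time_ms(shader_cycles, fragments=WIDTH * HEIGHT):
    """Milliseconds needed to shade `fragments` at `shader_cycles` each."""
    cycles = fragments * shader_cycles / FRAGMENT_UNITS
    return cycles / GPU_CLOCK_HZ * 1e3

for shader_cycles in (1, 2, 4):
    print(f"{shader_cycles}-cycle shader: {gpu_bound_time_ms(shader_cycles):.2f} ms")
```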

However, memory bandwidth limitations change the picture substantially.

Each of these 1.05 million fragments transacts:

4 bytes to read the frame buffer
4 bytes to write the frame buffer

So 8 bytes per fragment, with 1.05 million fragments:

Framebuffer memory transacted = 8.0MB

We assume that the texture is fetched precisely once; i.e. that the cache is perfect, which is extremely unlikely.

Texture memory transacted = 4.0MB = (1366 * 768) *  4 bytes

So in a single millisecond of GPU time, the total requested memory is 12MB.

Requested memory Bandwidth = 12GB/s = 12MB / 0.001s

Clearly, this is simply not possible; performance will be heavily bottlenecked due to limited available memory bandwidth, which is only 30% (3.5/12.0) of what’s required.

Absolute best run-time for this shader, on Cardhu, is:

Minimum possible GPU time =  3.33ms = 1ms * (1.0/0.3)
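
The same estimate can be reproduced without intermediate rounding. This sketch follows the text's loose units (MB as 2**20 bytes, and 1MB per millisecond treated as 1GB/s) and uses the rounded 1ms GPU time; variable names are illustrative.

```python
# Memory-bound run time for the blended full-screen quad.
# Units follow the text: "MB" is 2**20 bytes and 1 MB/ms is treated as 1 GB/s.

WIDTH, HEIGHT = 1366, 768
MIB = 2 ** 20

fragments = WIDTH * HEIGHT
framebuffer_mb = fragments * (4 + 4) / MIB   # 4-byte read + 4-byte write
texture_mb = fragments * 4 / MIB             # each texel fetched exactly once
total_mb = framebuffer_mb + texture_mb

gpu_time_ms = 1.0                            # GPU-bound time from the text
requested_gb_s = total_mb / gpu_time_ms
available_gb_s = 3.5

fraction = available_gb_s / requested_gb_s
min_time_ms = gpu_time_ms / fraction

print(f"transacted: {total_mb:.1f} MB")
print(f"requested:  {requested_gb_s:.1f} GB/s")
print(f"available fraction: {fraction:.2f}")
print(f"memory-bound time:  {min_time_ms:.2f} ms")
```

Carrying the unrounded fraction (about 0.29) through gives roughly 3.4ms rather than 3.33ms; either way the drawcall takes more than three times its GPU-bound time.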

A number of options are available to optimize this use-case further:

Use a more compact, 16bpp RGB, frame buffer format; this reduces required frame buffer memory bandwidth by 50%, and total required memory bandwidth by 33%.
Use a compressed texture format; converting to DXT1 reduces texture-fetch bandwidth by more than 80%, and total required memory bandwidth by more than 25%.
Reduce the size of the texture; a 75% reduction in size (i.e., 50% on each axis) results in a 75% reduction in texture-fetch bandwidth, and a consequent 25% reduction in total bandwidth.
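
The savings quoted for each option follow from the per-fragment byte counts. The sketch below assumes the baseline of 8 bytes of frame buffer traffic and 4 bytes of texture traffic per fragment, the same perfect-cache assumption as before, and DXT1's 4 bits per texel; the names are illustrative.

```python
# Per-fragment traffic for each optimization, relative to the baseline.
# Perfect texture caching (each texel fetched once) is assumed throughout.

BASE_FB = 8.0    # bytes per fragment: 4 read + 4 write at 32bpp
BASE_TEX = 4.0   # bytes per fragment: one RGBA8 texel

def savings(fb_bytes, tex_bytes):
    """Return (frame buffer, texture, total) fractional reductions."""
    base_total = BASE_FB + BASE_TEX
    return (
        1.0 - fb_bytes / BASE_FB,
        1.0 - tex_bytes / BASE_TEX,
        1.0 - (fb_bytes + tex_bytes) / base_total,
    )

options = {
    "16bpp frame buffer": (4.0, BASE_TEX),   # 2 read + 2 write
    "DXT1 texture": (BASE_FB, 0.5),          # 4 bits per texel
    "half-size texture": (BASE_FB, 1.0),     # a quarter of the texels
}

for name, (fb_bytes, tex_bytes) in options.items():
    fb, tex, total = savings(fb_bytes, tex_bytes)
    print(f"{name}: fb -{fb:.0%}, tex -{tex:.1%}, total -{total:.1%}")
```

Because the options touch different traffic, they can be combined; passing (4.0, 0.5) to savings shows the effect of using a 16bpp frame buffer together with DXT1.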

Obviously, as shader cycles increase, the required memory bandwidth tends to fall dramatically. The key issue is transacted memory per fragment cycle: if the average cycles per fragment increases, the load on the memory system decreases.
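
To illustrate, the requested bandwidth for the blended quad can be tabulated against shader length. This sketch reuses the rounded figures from Example 2 (12MB transacted, 1ms of GPU time per shader cycle, 3.5GB/s available) and assumes the transacted memory stays constant as the shader grows, which would not hold if the extra cycles added texture fetches.

```python
# Requested memory bandwidth versus shader length for the Example 2 quad.
# Assumes transacted memory per fragment does not change with shader length.

TRANSACTED_MB = 12.0       # frame buffer + texture traffic for the quad
GPU_MS_PER_CYCLE = 1.0     # GPU-bound time per shader cycle, rounded
AVAILABLE_GB_S = 3.5

def requested_gb_s(cycles_per_fragment):
    """Bandwidth needed to keep the GPU busy (1 MB/ms treated as 1 GB/s)."""
    return TRANSACTED_MB / (GPU_MS_PER_CYCLE * cycles_per_fragment)

def is_memory_bound(cycles_per_fragment):
    return requested_gb_s(cycles_per_fragment) > AVAILABLE_GB_S

for cycles in (1, 2, 3, 4, 8):
    state = "memory bound" if is_memory_bound(cycles) else "GPU bound"
    print(f"{cycles} cycles: {requested_gb_s(cycles):.1f} GB/s ({state})")
```

Under these assumptions the quad stops being memory bound at around four shader cycles per fragment.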