On Tegra, with complex, blended, texture-heavy rendering, available memory bandwidth can become a performance bottleneck if care is not taken. Understanding some rough guidelines can help the developer know when they may be bumping up against such limits.
The memory interface is designed to transfer 8 bytes per memory controller (MC) clock cycle. This gives a total available memory bandwidth of around 6GB/s for Cardhu with a 750MHz memory clock.
The efficiency of the memory system in real-life cases is affected by a number of factors:
Based on experimental data it is safe to assume between 60-90% efficiency for fragment rendering. On Cardhu, at 6GB/s, that gives a lower bound around 3.5GB/s, and an upper bound around 5.4GB/s. The following examples assume a somewhat pessimistic 66% efficiency, so the total available memory bandwidth is around 4GB/s.
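The arithmetic above can be reproduced in a few lines. This is a minimal sketch: the names are illustrative rather than taken from any NVIDIA tool, and gigabytes are assumed to mean 10^9 bytes, which is what the 6GB/s figure implies.

```python
# Peak and effective memory bandwidth for Cardhu, using the figures quoted above.
BYTES_PER_MC_CLOCK = 8      # bytes transferred per memory controller clock
MC_CLOCK_HZ = 750e6         # Cardhu memory clock

peak_bw = BYTES_PER_MC_CLOCK * MC_CLOCK_HZ   # bytes per second

def effective_bw(efficiency):
    """Bandwidth available after applying a memory-system efficiency factor (0..1)."""
    return peak_bw * efficiency

# Lower bound, the pessimistic working assumption, and the upper bound.
for eff in (0.60, 0.66, 0.90):
    print(f"{eff:.0%} efficiency: {effective_bw(eff) / 1e9:.2f} GB/s")
```

Computed exactly, 60% efficiency is 3.6GB/s, which the text rounds down to roughly 3.5GB/s.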
We automatically lose a chunk of this to display scan out. At native Cardhu resolution 1366 x 768, with a 32bpp color buffer, running at 60Hz refresh (independent of application frame-rate):
Scan Out Bandwidth = 240MB/s = 60Hz * 4bytes * 1366 * 768
In many operating systems, applications draw to an intermediate render-target which is then composited (with operating-system UI elements, for example) to a final surface. That final composite surface is what is then passed to the display hardware for scan out.
The process of composition is memory bandwidth intensive; it involves physically reading (some part of) each contributing surface from memory, and writing the generated resultant surface out to memory.
For a 32bit frame buffer, at native Cardhu resolution, at 30Hz, assuming two contributing surfaces with no blending (i.e., top of the final image from surface A, bottom from surface B), the minimum cost of composition is:
Compositor Bandwidth = 240MB/s = 30Hz * (4bytes rd + 4 bytes wr) * 1366 * 768
So, the system immediately consumes more than 10% (480MB/s) of the 4.0GB/s available memory bandwidth for display and composition; what’s left is around 3.5GB/s for all system activity.
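The fixed display costs can be checked with a short calculation. This is a sketch under stated assumptions: the 240MB/s figures work out when a megabyte is taken as 2^20 bytes, while the 4.0GB/s budget is taken as 4 x 10^9 bytes. The variable names are illustrative.

```python
# Fixed display costs at native Cardhu resolution, using the figures above.
WIDTH, HEIGHT = 1366, 768
BYTES_PER_PIXEL = 4          # 32bpp color buffer
MIB = 2 ** 20                # assumption: "MB" in the text means 2**20 bytes

pixels = WIDTH * HEIGHT

# Scan out reads every pixel once per display refresh (60Hz).
scan_out = 60 * BYTES_PER_PIXEL * pixels

# Composition at 30Hz reads and writes every pixel once (no blending).
compositor = 30 * (BYTES_PER_PIXEL + BYTES_PER_PIXEL) * pixels

fixed_cost = scan_out + compositor
budget = 4.0e9               # assumption: 66% of the 6GB/s peak, decimal bytes
remaining = budget - fixed_cost

print(f"scan out:   {scan_out / MIB:.0f} MB/s")
print(f"compositor: {compositor / MIB:.0f} MB/s")
print(f"share of budget: {fixed_cost / budget:.1%}")
print(f"remaining: {remaining / 1e9:.2f} GB/s")
```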
Memory latency (for example, in texture-fetch) is well hidden on Tegra, which allows us to construct a simple model of memory performance based on transactions alone.
Here we have the simplest possible use-case:
In order to sustain the 8 pixels per clock peak rate for depth test and reject (i.e. all fragments fail the depth test), we need:
Depth test transacted memory = 7.75GB/s = 8 * 2(bytes per pixel) * 520MHz
Plainly, the load this use-case places on the memory system is far in excess of available resources; 7.75GB/s required, but only (in the worst case) 3.5GB/s available.
Following this through, the ratio of available to required memory bandwidth is 0.45 (3.5/7.75). So, according to this model:
Sustainable depth-rejected ppc = 3.6 ppc = 8ppc * (0.45)
In fact, this use-case performs better than the model on Cardhu; measured performance is around 5.1ppc (likely due to memory efficiency being considerably higher than the worst-case figure used here).
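The depth-reject model can be written out as a short calculation. This is a minimal sketch: it assumes "GB" means 2^30 bytes (which is how 8 x 2 x 520MHz comes to 7.75GB/s), and the names are illustrative.

```python
# Bandwidth-limited depth-reject rate, following the model above.
GPU_CLOCK_HZ = 520e6
PEAK_PPC = 8                 # peak depth test rate, pixels per clock
DEPTH_BYTES = 2              # bytes per pixel read for the depth test
GIB = 2 ** 30                # assumption: "GB" here means 2**30 bytes
AVAILABLE_BW = 3.5 * GIB     # worst-case budget left after display and composition

required_bw = PEAK_PPC * DEPTH_BYTES * GPU_CLOCK_HZ

# Ratio of available to required bandwidth, capped at 1 (cannot exceed peak rate).
ratio = min(1.0, AVAILABLE_BW / required_bw)
sustainable_ppc = PEAK_PPC * ratio

print(f"required:    {required_bw / GIB:.2f} GB/s")
print(f"ratio:       {ratio:.2f}")
print(f"sustainable: {sustainable_ppc:.1f} ppc")
```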
Of course, some fragments will not be trivially rejected in a real depth pre-pass. To complete the picture, we need to account for memory bandwidth in the case where fragments are written.
Depth write transacted memory = 7.75GB/s = 8 * 2(bytes per pixel) * 520MHz
If all fragments pass the depth test, the same memory bandwidth is required to update the depth buffer content as was originally needed to perform the depth-test.
Total depth accept transacted memory = 15.5GB/s = 7.75GB/s(test) + 7.75GB/s(write)
Of course, this isn’t possible, so practical depth accept performance will be massively attenuated from 8ppc. Based on the ratio of available to required memory bandwidth, 0.22 (3.5/15.5), it will be somewhere in the region of:
Sustainable depth-accepted ppc = 1.8 ppc = 8ppc * 0.22
Again, Cardhu actually performs slightly better than the model predicts; measured depth accept rate on Cardhu is around 2.3ppc, and again this is likely due to the pessimistic 3.5GB/s available figure used here.
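One way to sanity-check the "pessimistic budget" explanation is to invert the model and ask how much bandwidth the measured rate would consume. The sketch below does this for the depth-accept case. The function names are hypothetical, "GB" is assumed to mean 2^30 bytes, and the result is only as good as the simple transaction model itself.

```python
GPU_CLOCK_HZ = 520e6
GIB = 2 ** 30                       # assumption: "GB" in the text means 2**30 bytes
AVAILABLE_BW = 3.5 * GIB            # pessimistic budget used by the model

def model_ppc(bytes_per_pixel, peak_ppc=8):
    """Pixels per clock the model predicts for a given depth traffic per pixel."""
    return min(peak_ppc, AVAILABLE_BW / (bytes_per_pixel * GPU_CLOCK_HZ))

def implied_bw(measured_ppc, bytes_per_pixel):
    """Bandwidth (bytes/s) a measured rate would consume under the same model."""
    return measured_ppc * bytes_per_pixel * GPU_CLOCK_HZ

accept_model = model_ppc(2 + 2)           # 2-byte depth read + 2-byte depth write
accept_implied = implied_bw(2.3, 2 + 2)   # 2.3ppc is the measured Cardhu figure

print(f"model:   {accept_model:.1f} ppc")
print(f"implied: {accept_implied / GIB:.2f} GB/s for the measured 2.3 ppc")
```

Under these assumptions the measured rate implies more than 3.5GB/s of depth traffic, which is consistent with the budget figure being conservative.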
Generally, it is useful to understand the impact the memory system has on rendering performance, because that understanding tends to steer towards bandwidth-efficient solutions, which will typically work better (use less power for greater performance) in the mobile ecosystem.
Specifically, in this instance, it’s useful because the knowledge that:
tends to lead obviously to the conclusion that:
The model above explains how memory bandwidth limitations directly impact depth-test for the depth pre-pass use case. Depth pre-pass is a technique used often on Cardhu to reduce the amount of fully shaded overdraw (typically it’s most valuable when fragment shader complexity is high, and vertex shader complexity is relatively low).
The practical conclusion to be drawn from the calculations above is simply that a nearest-first, depth-sorted draw order could (depending on exactly how the ratio of rejected to accepted fragments improves) have a substantial performance benefit for the depth pre-pass use-case.
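To get a feel for how much draw order could matter, the two depth cases can be blended by the fraction of fragments that are rejected. This is an extrapolation of the model above, not a measured result. It assumes every fragment costs a 2-byte depth read, only accepted fragments cost a further 2-byte write, and "GB" means 2^30 bytes. The function name is hypothetical.

```python
GPU_CLOCK_HZ = 520e6
PEAK_PPC = 8
GIB = 2 ** 30                # assumption: "GB" in the text means 2**30 bytes
AVAILABLE_BW = 3.5 * GIB     # worst-case budget used throughout

def depth_prepass_ppc(reject_fraction):
    """Model pixels per clock when a given fraction of fragments fails the depth test."""
    if not 0.0 <= reject_fraction <= 1.0:
        raise ValueError("reject_fraction must be between 0 and 1")
    # Every fragment reads 2 bytes of depth; only accepted fragments write 2 bytes back.
    bytes_per_pixel = 2 + 2 * (1.0 - reject_fraction)
    return min(PEAK_PPC, AVAILABLE_BW / (bytes_per_pixel * GPU_CLOCK_HZ))

for r in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{r:.0%} rejected: {depth_prepass_ppc(r):.1f} ppc")
```

The end points reproduce the 1.8ppc (all accepted) and 3.6ppc (all rejected) figures; a nearest-first sort moves a scene towards the rejected end of that range.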
This use-case has no depth buffer traffic; it involves a single full-screen quad which blends between the current frame buffer and a texture-provided source, driven by texture alpha.
GL_LINEAR (i.e., no mipmap selection)
uniform sampler2D tex0;
varying mediump vec2 texcoord0;
void main() {
    gl_FragColor = texture2D(tex0, texcoord0);
}
A single full-screen quad at native Cardhu resolution is 1366 x 768 pixels. With a 1-cycle shader all fragments can be drawn in 1366 x 768 fragment shader cycles. Cardhu has two fragment units, so that cycle count can be halved. GPU-bound cycle count for this drawcall is:
GPU clocks = 0.525 million cycles = 1366 * 768 / 2
Cardhu runs at 520MHz, so the time to process these fragments is:
GPU time = ~1ms = 525,000 cycles / 520MHz.
However, memory bandwidth limitations change the picture substantially.
Each of these 1.05 million fragments transacts:
So 8 bytes per fragment (for a 32bpp color buffer, a 4-byte frame buffer read plus a 4-byte write for the blend), with 1.05 million fragments:
Framebuffer memory transacted = 8.0MB
We assume that the texture is fetched precisely once; i.e. that the cache is perfect, which is extremely unlikely.
Texture memory transacted = 4.0MB = (1366 * 768) * 4 bytes
So in a single millisecond of GPU time, the total requested memory is 12MB.
Requested memory Bandwidth = 12GB/s = 12MB / 0.001s
Clearly, this is simply not possible; performance will be heavily bottlenecked due to limited available memory bandwidth, which is only 30% (3.5/12.0) of what’s required.
Absolute best run-time for this shader, on Cardhu, is:
Minimum possible GPU time = 3.33ms = 1ms * (1.0/0.3)
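The whole blend example can be expressed as a comparison between shader time and memory time. This is a sketch under stated assumptions: the 8 bytes of frame buffer traffic are taken to be a 4-byte read plus a 4-byte write, the texture has one 4-byte texel per screen pixel, and "GB" means 2^30 bytes. Because it avoids the intermediate rounding in the text, the result lands close to, but not exactly on, 3.33ms.

```python
WIDTH, HEIGHT = 1366, 768
GPU_CLOCK_HZ = 520e6
FRAGMENT_UNITS = 2           # Cardhu has two fragment units
SHADER_CYCLES = 1            # 1-cycle fragment shader
GIB = 2 ** 30                # assumption: "GB" in the text means 2**30 bytes
AVAILABLE_BW = 3.5 * GIB     # worst-case budget left after display and composition

fragments = WIDTH * HEIGHT

# Time the fragment units need if memory were free.
gpu_time = fragments * SHADER_CYCLES / FRAGMENT_UNITS / GPU_CLOCK_HZ

framebuffer_bytes = fragments * (4 + 4)   # assumed: 4-byte read + 4-byte write
texture_bytes = fragments * 4             # perfect cache: each texel fetched once

# Time the memory system needs to move that data at the available bandwidth.
memory_time = (framebuffer_bytes + texture_bytes) / AVAILABLE_BW

# The draw call cannot finish faster than the slower of the two.
frame_time = max(gpu_time, memory_time)

print(f"GPU time:    {gpu_time * 1e3:.2f} ms")
print(f"memory time: {memory_time * 1e3:.2f} ms")
```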
A number of options are available to optimize this use-case further:
Obviously, as shader cycles increase, the required memory bandwidth tends to fall dramatically. The key issue is transacted memory per fragment cycle: if the average cycles per fragment increases, the load on the memory system decreases.
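The relationship between shader length and bandwidth demand can be sketched as follows. This assumes the same 12 bytes transacted per fragment as the blend example, regardless of shader length, two fragment units at 520MHz, and "GB" meaning 2^30 bytes. Real shaders that add texture fetches would transact more per fragment, so treat this as an illustration of the trend only.

```python
GPU_CLOCK_HZ = 520e6
FRAGMENT_UNITS = 2
BYTES_PER_FRAGMENT = 12      # 8 bytes frame buffer + 4 bytes texture, as above
GIB = 2 ** 30                # assumption: "GB" in the text means 2**30 bytes
AVAILABLE_BW = 3.5 * GIB

def requested_bw(cycles_per_fragment):
    """Bandwidth (bytes/s) requested when each fragment takes this many shader cycles."""
    fragments_per_second = FRAGMENT_UNITS * GPU_CLOCK_HZ / cycles_per_fragment
    return BYTES_PER_FRAGMENT * fragments_per_second

for cycles in (1, 2, 4, 8):
    bw = requested_bw(cycles)
    status = "bandwidth bound" if bw > AVAILABLE_BW else "shader bound"
    print(f"{cycles} cycle(s): {bw / GIB:.2f} GB/s requested ({status})")
```

Under these assumptions the crossover sits between 2 and 4 cycles per fragment: below it the memory system is the limit, above it the fragment units are.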
NVIDIA® GameWorks™ Documentation Rev. 1.0.200608 ©2014-2020. NVIDIA Corporation. All Rights Reserved.