Avoiding Common Causes of Inter-frame Dependencies

Render to Texture

Rendering to textures other than the SwapChain's back buffer (assuming double buffering) is common practice and is useful in many rendering techniques. However, a texture render target differs from the back buffer in one important way. When a SwapChain is created, a discard flag can be set (D3DSWAPEFFECT_DISCARD in Direct3D 9 or DXGI_SWAP_EFFECT_DISCARD in Direct3D 10 and later). This flag tells the driver that the data written to the back buffer does not need to be preserved across frames, so once Present is called, the driver can start a new frame on a separate AFR Group without having to transfer the back buffer data from the GPU(s) in one AFR Group to the others.

Render targets (RTs), on the other hand, don't have these creation flags. As a result, the driver must preserve their contents across frames, because it doesn't know whether the application will need the data written to that render target in future frames. Some rendering techniques rely on the data being preserved, but many don't. The driver's only option, however, is to copy the RT contents to all the other GPUs so that they can access the latest state of the RT. This typically leads to large data transfers between GPUs, which are costly, as well as synchronization overhead. For example, GPU2 will stall while waiting for GPU1 to update a shared RT. The amount of SLI performance scaling will vary with the timing and size of the data dependencies and with the available PCI Express bandwidth.
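To give a rough sense of scale for these transfers, the sketch below estimates the idealized time needed to copy one render target between GPUs. The function names, the 1920x1080 RGBA8 target, and the 8 GB/s effective bandwidth used in the note afterwards are illustrative assumptions, not measured figures; real costs also include driver overhead and the synchronization stalls described above.

```cpp
#include <cassert>
#include <cstdint>

// Size in bytes of an uncompressed, single-sample render target.
std::uint64_t RenderTargetBytes(std::uint32_t width, std::uint32_t height,
                                std::uint32_t bytesPerPixel)
{
    return static_cast<std::uint64_t>(width) * height * bytesPerPixel;
}

// Idealized copy time in milliseconds: bytes / bandwidth.
// bytesPerSecond is an ASSUMED effective inter-GPU bandwidth.
double TransferMilliseconds(std::uint64_t bytes, double bytesPerSecond)
{
    assert(bytesPerSecond > 0.0);
    return static_cast<double>(bytes) / bytesPerSecond * 1000.0;
}
```

Under these assumptions, a 1920x1080 target at 4 bytes per pixel occupies 8,294,400 bytes, and TransferMilliseconds(8294400, 8.0e9) comes to about 1.04 ms, roughly 6% of a 16.7 ms frame at 60 frames per second, for a single render target.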

In the absence of an SLI profile that prevents the inter-GPU transfer of render targets, the application can still avoid this performance penalty by clearing the render target each frame before using it, calling the appropriate Clear() function. This applies to all render targets, whether used as color or depth-stencil buffers. If the application clears render targets in this fashion, the driver will assume that the data does not need to be preserved and will skip copying the render targets between GPUs.

Keep in mind that Clear calls should still be used wherever they matter for performance, on a single GPU or in SLI configurations, regardless of whether an SLI profile exists for your application. Clears are important for good performance on depth buffers and multisampled surfaces (either color or depth), and should always be preferred over full-screen quads that fill the surface with a constant value. At the same time, while clearing such surfaces once per frame before they are used is good practice, Clears should be used judiciously so that redundant Clear calls are avoided.
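One way to clear each surface exactly once per frame is to record the frame in which it was last cleared. The sketch below is a hypothetical helper: the FrameClearTracker name and its interface are assumptions made for illustration and are not part of Direct3D or any NVIDIA API. The caller supplies the actual Clear call as a callback.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical helper: remembers the frame in which each surface was last
// cleared so that a surface is cleared at most once per frame.
class FrameClearTracker
{
public:
    // Call once at the start of every frame (for example, right after Present).
    void BeginFrame() { ++m_frame; }

    // Invokes clearFn() only if 'surface' has not yet been cleared in the
    // current frame. Returns true if the clear was issued.
    template <typename ClearFn>
    bool ClearOncePerFrame(const void* surface, ClearFn clearFn)
    {
        assert(surface != nullptr);
        auto it = m_lastCleared.find(surface);
        if (it != m_lastCleared.end() && it->second == m_frame)
            return false; // already cleared this frame: skip the redundant clear
        clearFn();
        m_lastCleared[surface] = m_frame;
        return true;
    }

private:
    std::uint64_t m_frame = 0;
    std::unordered_map<const void*, std::uint64_t> m_lastCleared;
};
```

In an application, the callback would wrap the API's own clear call for that surface, so a second request to clear the same surface within one frame becomes a no-op.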

Direct3D 9 Example
pd3d9Device->SetRenderTarget(0, gppSurface);
if (bClearRT)
    pd3d9Device->Clear(0L, NULL, D3DCLEAR_TARGET, 0x00000000, 1.0f, 0L);
Direct3D 10 Example
pd3d10Device->OMSetRenderTargets(NumRenderTargets, ppRenderTargetViews, pDepthStencilView);
if (bClearRT)
{
    FLOAT clearColor[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    for (UINT i = 0; i < NumRenderTargets; i++)
        pd3d10Device->ClearRenderTargetView(ppRenderTargetViews[i], clearColor);
    pd3d10Device->ClearDepthStencilView(pDepthStencilView, D3D10_CLEAR_DEPTH | D3D10_CLEAR_STENCIL, 1.0f, 0); // once, not per render target
}

Stream Out Buffers in D3D10 and Later

Buffers used as the destination of stream output operations in D3D10 and later have the same problem as render targets. The driver doesn't know whether the data contained in the buffer will be used in future frames, so it incurs the same copy and synchronization penalties as for non-cleared render targets. There is no API mechanism to clear a buffer, so the only way currently available to avoid a performance penalty in AFR mode when using stream output is to get an SLI profile for your application added to the driver.

Unordered Access Views in D3D11

Direct3D 11 introduced the concept of Unordered Access Views (UAVs), which can be bound to the pipeline for random read/write access. Just as with render targets and buffers used as the destination of stream output operations, any potential modification of a UAV introduces inter-frame dependencies that, in the absence of a frame profile indicating otherwise, the driver handles by adding the appropriate inter-GPU synchronization operations. Two D3D11 calls (ID3D11DeviceContext::ClearUnorderedAccessViewUint and ID3D11DeviceContext::ClearUnorderedAccessViewFloat) can be used to clear these views before they are used each frame, thus removing the inter-frame dependency and avoiding the overhead of synchronization between GPUs in different AFR groups.
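A minimal sketch of such a per-frame UAV clear follows. It is written as a template so that the snippet compiles without the Direct3D headers; with d3d11.h available, Context would be ID3D11DeviceContext and Uav would be ID3D11UnorderedAccessView. The helper name ClearUavsAtFrameStart is hypothetical, and clearing to zero assumes the application does not need the previous frame's UAV contents.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: clears every non-null UAV in 'uavs' to zero.
// Intended to be called once per frame, before the first draw or dispatch
// that writes to these UAVs.
template <typename Context, typename Uav>
void ClearUavsAtFrameStart(Context* context, Uav* const* uavs, std::size_t count)
{
    assert(context != nullptr);
    // ClearUnorderedAccessViewUint takes four UINT (unsigned int) values.
    const unsigned int zeros[4] = { 0u, 0u, 0u, 0u };
    for (std::size_t i = 0; i < count; ++i)
    {
        if (uavs[i] != nullptr)
            context->ClearUnorderedAccessViewUint(uavs[i], zeros);
    }
}
```

For views with floating-point or normalized formats, the same pattern applies with ClearUnorderedAccessViewFloat and an array of four FLOAT values.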



NVIDIA® GameWorks™ Documentation Rev. 1.0.200601 ©2014-2020. NVIDIA Corporation. All Rights Reserved.