Global memory resides in device memory and can be accessed using two different data paths. Starting with devices of compute capability 2.x, global memory traffic is routed through the data caches (L1 cache and/or L2 cache). On devices of compute capability 3.5, read only global memory accesses can alternatively go through the texture cache using a standard pointer, without the need to bind a texture beforehand and without the sizing limitations of standard textures.
Independent of the data path used, the compiler offers control over the behavior of the caches along the way. Using the -dlcm compilation flag, global memory accesses can be configured at compile time to be cached in all available caches on the data path (-Xptxas -dlcm=ca, the default setting) or in L2 only (-Xptxas -dlcm=cg). While the compiler setting offers global control per compilation module, inline PTX assembly offers explicit control per individual memory instruction (see the Cache Operators section of the PTX ISA definition). In the output of this experiment, the difference in caching behavior is denoted as Cached Loads versus Uncached Loads. The former uses the L1 cache or texture cache (depending on the data path) followed by the L2 cache; the latter uses only the L2 cache.
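The per-instruction control can be sketched as follows. This is a minimal, hypothetical kernel; the ld.global.cg cache operator comes from the PTX ISA, while the kernel itself and its parameters are illustrative.

```cuda
// Sketch: bypass L1 for one specific load via an inline PTX cache
// operator (.cg = cache global, L2 only), regardless of the
// module-wide -dlcm setting.
__global__ void copy_l2_only(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v;
        // ld.global.cg: this single load is cached in L2 only.
        asm volatile("ld.global.cg.f32 %0, [%1];"
                     : "=f"(v) : "l"(in + i));
        out[i] = v;
    }
}
```

For module-wide control instead, compile with, for example, nvcc -Xptxas -dlcm=cg to make all global loads in the module L2-only.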
An L1 cache line is 128 bytes and maps to a 128-byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 (cached loads using the generic data path) are serviced with 128-byte memory transactions, whereas memory accesses that are cached in L2 only (uncached loads using the generic data path) are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.
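The over-fetch effect can be made concrete with a hypothetical scattered gather kernel; the arithmetic in the comments follows directly from the 128-byte versus 32-byte transaction sizes above.

```cuda
// Sketch: a scattered gather (illustrative kernel). If the 32 threads
// of a warp each read a 4-byte value from effectively random addresses,
// cached loads (-dlcm=ca) may pull a full 128-byte line per thread:
// up to 32 * 128 B = 4 KB moved to deliver 32 * 4 B = 128 B of payload.
// Uncached loads (-dlcm=cg) are serviced in 32-byte L2 transactions,
// capping the same worst case at 32 * 32 B = 1 KB.
__global__ void gather(const int *idx, const float *src,
                       float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[idx[i]];  // per-thread addresses may be scattered
}
```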
Global Read Only

Devices of compute capability 3.5 support reading global memory through the same cache used by the texture pipeline. As the texture cache follows very different coalescing rules than the data cache, using this memory path may improve performance, especially for highly non-uniform memory access patterns. For kernels that are limited by the memory bandwidth to and from the data cache, routing some traffic through the global read only data path can additionally benefit performance, as the texture cache is a separate physical unit with a separate data path; note, however, that global read only memory accesses share the texture cache with any traffic generated through texture fetches.
If targeted for an applicable compute device, the compiler will automatically use global read only data accesses wherever suitable. To increase the likelihood of this data path being used, declare eligible pointers with the const __restrict__ type qualifiers, or consider using the __ldg() intrinsic for those memory operations. The Source View page helps in locating and verifying individual memory instructions. For devices that do not support the global read only data path, the upper portion of the chart is greyed out.

Global Transactions Per Request

When a warp executes an instruction that accesses generic global memory, the accesses of the threads within the warp are coalesced into one or more memory transactions. The Transactions Per Request chart shows the average number of L1 transactions required per executed global memory instruction, separately for load and store operations. Lower numbers are better; the target for a single memory operation on a full warp is 1 transaction for a 4-byte access, 2 transactions for an 8-byte access, and 4 transactions for a 16-byte access. How many transactions are actually required varies with the access pattern of the memory operation and also depends on the compute capability of the target device. The CUDA C Programming Guide provides detailed information on the coalescing rules for each compute architecture (Compute Capability 2.x, Compute Capability 3.x).
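The two ways of opting into the global read only data path mentioned above can be sketched as follows; the kernel names and parameters are illustrative, while const __restrict__ and __ldg() are the standard CUDA mechanisms.

```cuda
// 1) Let the compiler prove read-only access: qualify the pointers
//    with both const and __restrict__ so loads from `in` are eligible
//    for the read only data path on compute capability 3.5 devices.
__global__ void scale(const float * __restrict__ in,
                      float * __restrict__ out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * in[i];
}

// 2) Force a specific load through the read only path explicitly
//    with the __ldg() intrinsic.
__global__ void scale_ldg(const float *in, float *out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * __ldg(in + i);
}
```

The qualifier-based form leaves the decision to the compiler; __ldg() is useful to verify, via the Source View page, that a particular load really uses the texture cache.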
NVIDIA GameWorks Documentation Rev. 1.0.150630 ©2015. NVIDIA Corporation. All Rights Reserved.