NVIDIA® Nsight™ Application Development Environment for Heterogeneous Platforms, Visual Studio Edition 5.3 User Guide
Global data resides in device memory, non-pageable system memory registered to the device, or managed ("Unified") memory. It can be accessed through several data paths: the top-level cache is the L1 or texture cache, which is present in each SM; the L2 cache is shared by the whole GPU. The data path used depends on the GPU's compute capability, the requested caching behavior, and whether the data is marked as read-only. See the CUDA Programming Guide for more information.
Independent of the data path used, the compiler offers control over the caching behavior along the way. Using the -dlcm compilation flag, global memory accesses can be configured at compile time to be cached in all available caches on the data path (-Xptxas -dlcm=ca, the default setting) or in L2 only (-Xptxas -dlcm=cg). While the compiler setting offers global control per compilation module, inline PTX assembly offers explicit control per individual memory instruction (see the Cache Operators section of the PTX ISA definition). In the output of this experiment, the difference in caching behavior is denoted as Cached Loads versus Uncached Loads. The former use the L1 or texture cache (depending on the data path) followed by the L2 cache; the latter use the L2 cache only.
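As a hedged sketch of the two control mechanisms described above, the fragment below pairs the module-wide nvcc flag with a per-instruction PTX cache operator; the function and kernel names are illustrative, not from the guide:

```cuda
// Per module:      nvcc -Xptxas -dlcm=cg app.cu   (all global loads cached in L2 only)
// Per instruction: an inline PTX cache operator, as in the helper below.
__device__ float load_l2_only(const float* p) {
    float v;
    // ld.global.cg: "cache global" - bypass L1, cache in L2 only (an uncached load).
    asm("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

__global__ void copy_uncached(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = load_l2_only(in + i);  // shows up as an Uncached Load
}
```

All other loads in the same module still follow the -dlcm setting; only the access routed through the inline PTX helper is forced to L2-only caching.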
An L1 cache line is 128 bytes and maps to a 128-byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 (cached loads using the generic data path) are serviced with 128-byte memory transactions, whereas memory accesses that are cached in L2 only (uncached loads using the generic data path) are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example in the case of scattered memory accesses.
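To make the over-fetch arithmetic concrete: if each thread of a 32-thread warp reads a 4-byte value from a different 128-byte segment, cached loads move 32 × 128 B = 4096 B per warp, while uncached loads move 32 × 32 B = 1024 B, although only 32 × 4 B = 128 B are actually consumed. A hedged sketch of such a scattered pattern (the stride and kernel name are assumed for illustration):

```cuda
#define STRIDE 32  // 32 floats = 128 bytes, so each thread hits its own L1 line

__global__ void gather_strided(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Each thread of a warp touches a different 128-byte segment:
        //   cached loads:   32 transactions x 128 B = 4096 B fetched per warp
        //   uncached loads: 32 transactions x  32 B = 1024 B fetched per warp
        // while only 32 x 4 B = 128 B are actually used.
        out[i] = in[i * STRIDE];
    }
}
```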
Chart

The chart layout depends on the architecture of the device. For first-generation Kepler devices that do not support the global read-only data path, the upper portion of the chart is greyed out. For Maxwell and Pascal devices, the L1 and texture caches are combined into a single unit. See the CUDA documentation for deeper explanations of the memory hierarchy and how it differs across the various architectures.

Global Read Only

Devices of compute capability 3.5 support reading global memory through the same cache used by the texture pipeline. As the texture cache follows very different coalescing rules than the data cache, using this memory path may improve performance, especially for highly non-uniform memory access patterns. For kernels that are limited by the memory bandwidth to and from the data cache, routing some traffic through the global read-only data path can additionally benefit performance, as the texture cache is a separate physical unit with a separate data path. Note, however, that global read-only memory accesses share the texture cache with any traffic generated through texture fetches.

If targeted for an applicable compute device, the compiler will automatically use global read-only data accesses wherever suitable. Add the const __restrict__ type qualifiers to eligible pointer arguments, or consider using the __ldg() intrinsic for individual memory operations. The Source View page helps in locating and verifying individual memory instructions.

Transactions Per Request

When a warp executes an instruction that accesses global memory, the accesses of the threads within the warp are coalesced into one or more memory transactions. The Transactions Per Request chart shows the average number of L1 transactions required per executed global memory instruction, separately for load and store operations. Lower numbers are better; the target for a single memory operation of a full warp is 1 transaction for a 4-byte access, 2 transactions for an 8-byte access, and 4 transactions for a 16-byte access. How many transactions are actually required varies with the access pattern of the memory operation and also depends on the compute capability of the target device. The CUDA C Programming Guide provides detailed information on the coalescing rules for each compute architecture (Compute Capability 3.x, Compute Capability 5.x, Compute Capability 6.x).
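The two ways of requesting the read-only data path can be sketched as follows; both kernels compute the same result, and the kernel names are illustrative:

```cuda
// Variant A: const + __restrict__ qualifiers let the compiler prove the input
// is read-only and not aliased, so it can emit LDG loads through the texture
// cache on its own.
__global__ void scale_qualified(float* __restrict__ out,
                                const float* __restrict__ in,
                                float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * in[i];
}

// Variant B: __ldg() forces a read-only data path load for one specific access.
__global__ void scale_ldg(float* out, const float* in, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * __ldg(in + i);
}
```

The qualifier approach covers every eligible load in the kernel, while __ldg() gives per-access control when aliasing cannot be ruled out for all pointers.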
NVIDIA® Nsight™ Application Development Environment for Heterogeneous Platforms, Visual Studio Edition User Guide Rev. 5.3.170616 ©2009-2017. NVIDIA Corporation. All Rights Reserved.