NVIDIA® Nsight™ Application Development Environment for Heterogeneous Platforms, Visual Studio Edition 5.3 User Guide
Global data resides in device memory, non-pageable system memory registered to the device, or managed ("Unified") memory. It can be accessed through several data paths: the top-level cache is the L1 or texture cache, which is present in each SM; the L2 cache is shared by the whole GPU. The data path used depends on the GPU's compute capability, the requested caching behavior, and whether the data is marked as read-only. See the CUDA Programming Guide for more information.
Independent of the data path used, the compiler offers control over the caching behavior along the way. Using the -dlcm compilation flag, global memory accesses can be configured at compile time to be cached in all available caches on the data path (-Xptxas -dlcm=ca, the default setting) or in L2 only (-Xptxas -dlcm=cg). While the compiler setting offers global control per compilation module, inline PTX assembly offers explicit control per individual memory instruction (see the Cache Operators section of the PTX ISA definition). In the output of this experiment, the difference in caching behavior is denoted as Cached Loads versus Uncached Loads. The former use the L1 or texture cache (depending on the data path) followed by the L2 cache; the latter use the L2 cache only.
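As a hedged sketch of the two control mechanisms described above, the fragment below pairs the module-wide nvcc flag with a per-instruction PTX cache operator; the function and kernel names are illustrative, not from the guide:

```cuda
// Per module:      nvcc -Xptxas -dlcm=cg app.cu   (all global loads cached in L2 only)
// Per instruction: an inline PTX cache operator, as in the helper below.
__device__ float load_l2_only(const float* p) {
    float v;
    // ld.global.cg: "cache global" - bypass L1, cache in L2 only (an uncached load).
    asm("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

__global__ void copy_uncached(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = load_l2_only(in + i);  // shows up as an Uncached Load
}
```

All other loads in the same module still follow the -dlcm setting; only the access routed through the inline PTX helper is forced to L2-only caching.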
An L1 cache line is 128 bytes and maps to a 128-byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 (cached loads using the generic data path) are serviced with 128-byte memory transactions, whereas memory accesses that are cached in L2 only (uncached loads using the generic data path) are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example in the case of scattered memory accesses.
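To make the over-fetch arithmetic concrete: if each thread of a 32-thread warp reads a 4-byte value from a different 128-byte segment, cached loads move 32 × 128 B = 4096 B per warp, while uncached loads move 32 × 32 B = 1024 B, although only 32 × 4 B = 128 B are actually consumed. A hedged sketch of such a scattered pattern (the stride and kernel name are assumed for illustration):

```cuda
#define STRIDE 32  // 32 floats = 128 bytes, so each thread hits its own L1 line

__global__ void gather_strided(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Each thread of a warp touches a different 128-byte segment:
        //   cached loads:   32 transactions x 128 B = 4096 B fetched per warp
        //   uncached loads: 32 transactions x  32 B = 1024 B fetched per warp
        // while only 32 x 4 B = 128 B are actually used.
        out[i] = in[i * STRIDE];
    }
}
```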
Chart

The chart layout depends on the architecture of the device. For first-generation Kepler devices that do not support the global read-only data path, the upper portion of the chart is greyed out. For Maxwell and Pascal devices, the L1 and texture caches are combined into a single unit. See the CUDA documentation for deeper explanations of the memory hierarchy and how it differs across the various architectures.

Global Read Only

Devices of compute capability 3.5 support reading global memory through the same cache used by the texture pipeline. As the texture cache follows very different coalescing rules than the data cache, using this memory path may improve performance, especially for highly non-uniform memory access patterns. For kernels that are limited by the memory bandwidth to and from the data cache, routing some traffic through the global read-only data path can additionally benefit performance, as the texture cache is a separate physical unit with a separate data path. Note, however, that global read-only memory accesses share the texture cache with any traffic generated through texture fetches.

If targeted for an applicable compute device, the compiler will automatically use global read-only data accesses wherever suitable. Add the const __restrict__ type qualifiers to eligible pointer arguments, or consider using the __ldg() intrinsic for individual memory operations. The Source View page helps in locating and verifying individual memory instructions.

Transactions Per Request

When a warp executes an instruction that accesses global memory, the accesses of the threads within the warp are coalesced into one or more memory transactions. The Transactions Per Request chart shows the average number of L1 transactions required per executed global memory instruction, separately for load and store operations. Lower numbers are better; the target for a single memory operation of a full warp is 1 transaction for a 4-byte access, 2 transactions for an 8-byte access, and 4 transactions for a 16-byte access. How many transactions are actually required varies with the access pattern of the memory operation and also depends on the compute capability of the target device. The CUDA C Programming Guide provides detailed information on the coalescing rules for each compute architecture (Compute Capability 3.x, Compute Capability 5.x, Compute Capability 6.x).
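The two ways of requesting the read-only data path can be sketched as follows; both kernels compute the same result, and the kernel names are illustrative:

```cuda
// Variant A: const + __restrict__ qualifiers let the compiler prove the input
// is read-only and not aliased, so it can emit LDG loads through the texture
// cache on its own.
__global__ void scale_qualified(float* __restrict__ out,
                                const float* __restrict__ in,
                                float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * in[i];
}

// Variant B: __ldg() forces a read-only data path load for one specific access.
__global__ void scale_ldg(float* out, const float* in, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * __ldg(in + i);
}
```

The qualifier approach covers every eligible load in the kernel, while __ldg() gives per-access control when aliasing cannot be ruled out for all pointers.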
NVIDIA® Nsight™ Application Development Environment for Heterogeneous Platforms, Visual Studio Edition User Guide Rev. 5.3.170616 ©2009-2017. NVIDIA Corporation. All Rights Reserved.