You are here: Developer Tools > Desktop Developer Tools > NVIDIA Nsight Visual Studio Edition > Memory Statistics - Global

Overview

Global memory resides in device memory and can be accessed using two different data paths: Starting with devices of compute capability 2.x and higher, global memory traffic is routed through the data caches (L1 cache and/or L2 cache). Only available on devices of compute capability 3.5, read only global memory accesses can alternatively go through the texture cache using a standard pointer without the need to bind a texture beforehand and without the sizing limitations of standard textures.

Independent of the used data path, the compiler offers control over the behavior of the caches on the way. Using the -dlcm compilation flag, global memory accesses can be configured at compile time to be cached in all available caches on the data path (-Xptxas -dlcm=ca) (this is the default setting) or in L2 only (-Xptxas -dlcm=cg). While the compiler setting offers global control per compilation module, inline PTX assembly offers explicit control per individual memory instruction (see the Cache Operators section of the PTX ISA definition). In the output of this experiment, the difference in the used caching behavior is denoted as Cached Loads versus Uncached Loads. The former uses the L1 cache or texture cache (depending on the data path) followed by the L2 cache; the latter only uses the L2 cache.

A L1 cache line is 128 bytes and maps to a 128 byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 (cached loads using the generic data path) are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only (uncached loads using the generic data path) are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.

Chart

Global Read Only

Devices of compute capability 3.5 support reading global memory through the same cache used by the texture pipeline. As the texture cache offers very different coalescing rules over the data cache, using this memory path may improve performance - especially for highly non-uniform memory access patterns. For kernels that are limited by the memory bandwidth to and from the data cache, routing some traffic through the global read only data path can additionally benefit performance as the texture cache is a separate physical unit with a separate data path; however, note that the global read only memory accesses share the texture cache with any traffic generated through texture fetches.

If targeted for an applicable compute device, the compiler will automatically use global read only data accesses wherever suitable. Add the const __restrict__ type qualifiers for pointers to memory that qualify for global read only accesses or use the __ldg() intrinsic for explicit control. To verify which data path is used for individual memory accesses inspect the Source View page. Global read only memory accesses will show up as LDG instructions; global memory reads through the data cache are represented as LD instructions.

For devices that do not support the global read only data path, the upper portion of the chart is greyed out.

Global

When a warp executes an instruction that accesses generic global memory (LD or ST assembly instructions), it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads. In general, the more transactions are necessary, the more unused words are transferred in addition to the words accessed by the threads, reducing the instruction throughput accordingly. In addition, each memory transaction requires the assembly instruction to be repeatedly issued again; causing instruction replays if more than one transaction is required to fulfill the request of a warp. More details on instruction replays and their performance implications are described in the context of the Instruction Statistics experiment.

The Transactions Per Request chart shows the average number of L1 transactions required per executed global memory instruction, separately for load and store operations. Lower numbers are better; the target for a single memory operation on a full warp is 1 transaction for a 4byte access, 2 transactions for a 8byte access, and 4 transactions for a 16byte access. How many transactions are actually required varies with the access pattern of the memory operation and is also dependent on the compute capability of the target device. The CUDA C Programming Guide provides detailed information on the coalescing rules for each compute architecture (Compute Capability 2.x, Compute Capability 3.x).

Analysis


of

NVIDIA GameWorks Documentation Rev. 1.0.150630 ©2015. NVIDIA Corporation. All Rights Reserved.