The data cache hierarchy of CUDA devices is described in the compute capability sections of the Programming Guide, for example those for compute capability 2.x and 3.x. In general, there are three types of data caches: L1, L2, and texture. Loads from the caches are made in transactions of a fixed size: L1 transactions are 128 bytes, and L2 and texture transactions are 32 bytes. An important strategy for optimizing memory usage is to group loads and stores so that the necessary data is accessed in as few cache transactions as possible. For memory cached in both L1 and L2, if every thread in a warp loads a 4-byte value from sparse locations that miss in the L1 cache, each thread incurs one 128-byte L1 transaction and four 32-byte L2 transactions. The load instruction is then reissued 32 times more often than it would be if the values were adjacent and cache-aligned, in which case the whole warp could be served by a single 128-byte L1 transaction. If bandwidth between caches becomes a bottleneck, rearranging data or algorithms to access the data more uniformly can alleviate the problem. See the Global Memory sections of the Programming Guide for more details.
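The transaction arithmetic above can be checked with a small host-side model. The following is a minimal sketch, not profiler or driver code: it assumes the 128-byte L1 and 32-byte L2 transaction sizes stated above, assumes every touched L1 line misses, and uses hypothetical helper names (`count_transactions`, `warp_addresses`, `l2_fill_transactions`) invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Sizes taken from the text above; the warp size of 32 is assumed.
constexpr std::size_t kWarpSize = 32;
constexpr std::size_t kL1LineBytes = 128;
constexpr std::size_t kL2SegmentBytes = 32;

// Count the distinct aligned segments of `segment_bytes` that are touched
// when each thread loads `access_bytes` starting at its byte address.
std::size_t count_transactions(const std::vector<std::uint64_t>& addresses,
                               std::size_t access_bytes,
                               std::size_t segment_bytes) {
    assert(access_bytes > 0 && segment_bytes > 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t addr : addresses) {
        const std::uint64_t first = addr / segment_bytes;
        const std::uint64_t last = (addr + access_bytes - 1) / segment_bytes;
        for (std::uint64_t s = first; s <= last; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}

// Byte addresses for one warp in which lane i reads the 4-byte element at
// index i * stride_elems of an array that starts at byte address `base`.
std::vector<std::uint64_t> warp_addresses(std::uint64_t base,
                                          std::size_t stride_elems) {
    std::vector<std::uint64_t> addrs(kWarpSize);
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        addrs[lane] = base + lane * stride_elems * 4;
    }
    return addrs;
}

// L2 transactions needed to fill the given number of missed L1 lines:
// each 128-byte line is fetched as four 32-byte L2 transactions.
std::size_t l2_fill_transactions(std::size_t missed_l1_lines) {
    return missed_l1_lines * (kL1LineBytes / kL2SegmentBytes);
}
```

In this model, `count_transactions(warp_addresses(0, 1), 4, kL1LineBytes)` gives 1 for adjacent, aligned loads, while a stride of 32 elements (`warp_addresses(0, 32)`) gives 32, which matches the factor of 32 described above. Shifting the adjacent case to a base address of 4 gives 2, which shows why alignment matters as well as adjacency.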
Chart
Caches: Global memory accesses are routed either through L1 and L2, or only L2, depending on the architecture and the type of instructions used. Global read-only memory accesses are routed through the texture and L2 caches. See the Memory Statistics Global experiment for an explanation of how to control whether global accesses bypass L1. Local memory is routed through L1 and L2 cache. See the Memory Statistics Local experiment for more information on local memory accesses. Texture memory is read-only device memory, and is routed through the texture cache and the L2 cache. See the Memory Statistics Texture experiment for details on texture accesses. Shared memory accesses do not go through any cache.
NVIDIA GameWorks Documentation Rev. 1.0.150630 ©2015. NVIDIA Corporation. All Rights Reserved.