Open topic with navigation
NVIDIA® Nsight™ Development Platform, Visual Studio Edition 4.7 User Guide
Threads within a block can cooperate by sharing data through shared memory and by synchronizing their execution to coordinate memory accesses. Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory. For that shared memory is equivalent to a user-managed cache: The application explicitly allocates and accesses it. A typical programming pattern is to stage data coming from device memory into shared memory, process the data in shared memory while sharing the shared view of the data across the threads of a block, and write the results back to device memory.
To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module. However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts. To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts.
For devices of compute capability 2.x, shared memory has 32 banks that are organized such that successive 32bit words map to successive banks. Each bank has a bandwidth of 32 bits per two clock cycles. A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32bit word (even though the two addresses fall in the same bank). More details and examples of common access patterns are given in the Shared Memory section for compute capability 2.x in the CUDA C Programming Guide.
For devices of compute capability 3.x, shared memory has 32 banks with two addressing modes that can be configured using
cudaDeviceSetSharedMemConfig(). Each bank has a bandwidth of 64 bits per clock cycle. In 64bit mode, successive 64bit words map to successive banks. A shared memory request for a warp does not generate a bank conflict between two threads that access any sub-word within the same 64bit word (even though the addresses of the two sub-words fall in the same bank). In 32bit mode (default), successive 32bit words map to successive banks. A shared memory request for a warp does not generate a bank conflict between two threads that access any sub-word within the same 32bit word or within two 32bit words whose indices i and j are in the same 64word aligned segment (i.e., a segment whose first index is a multiple of 64) and such that j=i+32 (even though the addresses of the two sub-words fall in the same bank). More details and examples of common access patterns are given in the Shared Memory section for compute capability 3.x in the CUDA C Programming Guide.
The number of Load/Store Requests equals the amount of shared memory instructions executed. When a warp executes an instruction that accesses shared memory, it resolves the bank conflicts as discussed previously. Each bank conflict forces a new memory transaction. The more transactions are necessary, the more unused words are transferred in addition to the words accessed by the threads, reducing the instruction throughput accordingly. Each memory transaction also requires the assembly instruction to be issued again; causing instruction replays if more than one transaction is required to fulfill the request of a warp. More details on instruction replays and their performance implications are described in the context of the Instruction Statistics experiment.
The Transactions Per Request chart shows the average number of shared memory transactions required per executed shared memory instruction, separately for load and store operations. Lower numbers are better; the target for a single shared memory operation on a full warp is 1 transaction for both, a 4byte access and an 8byte access, and 2 transactions for a 16byte access. Note that the ideal number of transactions for a given access pattern is dependent on the width of the access, the configured address mode (if available), and the target compute capability. Refer to the Ideal Shared Memory Transactions of the Memory Transactions experiment to tell the lowest number of transfers possible for a given instruction. The Transaction Size for shared memory traffic is 128 bytes for compute capability 2.x and 256 bytes for compute capability 3.x. For both devices classes the bandwidth is equal to one transaction per clock per SM.
NVIDIA GameWorks Documentation Rev. 1.0.150630 ©2015. NVIDIA Corporation. All Rights Reserved.