NVIDIA® Nsight™ Development Platform, Visual Studio Edition 4.0 User Guide
The Memory Transactions source-level experiment provides detailed statistics for each instruction that performs memory transactions. If a kernel is limited by memory bandwidth, fixing inefficient memory accesses can improve performance. This experiment makes it easy to find the instructions whose memory accesses incur large numbers of transactions and could potentially be improved. The best candidates for optimization are access patterns that transfer more bytes than are requested, or that incur more transactions than the ideal number.
When a warp executes an instruction that accesses memory, it is important to consider the access pattern created by the threads in that warp. For example, when loading data through the L1 cache, an entire 128-byte cache line is fetched, regardless of whether one thread is reading one value (the least efficient pattern), or all 32 threads are reading consecutive 4-byte values (the most efficient pattern). A memory "request" is an instruction which accesses memory, and a "transaction" is the movement of a unit of data between two regions of memory. Efficient access patterns minimize the number of transactions incurred by a request. Inefficient patterns incur large numbers of transactions, use only a small amount of data from each transaction, and waste bandwidth in the connections between regions of the Memory Hierarchy. See the Memory Statistics family of kernel-level experiments for more background.
Memory Space: Memory space accessed by the operation. Memory spaces are Global, Local, and Shared. If the instruction is a generic load or store, different threads may access different memory spaces, so lines marked Generic list all spaces accessed.
Operation: Type of memory operation executed (Load or Store).
Access Size: Size of a single memory access, in bits.
L1 Transactions Histogram: Distribution of executed L1 memory transactions per instruction executed. Bins with a non-zero count have gray bars, and the highest bin has a red bar. A good histogram has counts only in the lower bins. The number of bins is selected to allow up to the worst-case number of transactions, which varies depending on the memory type and access size. For global and shared memory accesses, there are always 32 bins, because each thread in a warp can read from a different location, incurring a separate transaction per thread. Local accesses of up to 4 bytes per thread have 32 bins for the same reason. Since local memory is strided in 4-byte segments, 8-byte local accesses require 2 transactions per thread and have 64 histogram bins, and 16-byte local accesses require 4 transactions per thread and have 128 histogram bins.
L1 Global Transactions Executed: Number of executed 128-byte memory transactions between the SM and L1 due to Global memory accesses.
L1 Local Transactions Executed: Number of executed 128-byte memory transactions between the SM and L1 due to Local memory accesses.
L1 Shared Transactions Executed: Number of executed 128-byte memory transactions between the SM and L1 due to Shared memory accesses.
L1 Global Transactions Ideal: Number of ideal 128-byte memory transactions from SM to L1 due to Global memory accesses. The calculation of ideal transactions is based on aligned, sequential data and takes the predication mask and the active mask into account. For an access pattern where multiple threads access the same data, it is possible to achieve better than ideal transactions.
L1 Local Transactions Ideal: Number of ideal 128-byte memory transactions from SM to L1 due to Local memory accesses. The calculation of ideal transactions is based on aligned, sequential data and takes the predication mask and the active mask into account. For an access pattern where multiple threads access the same data, it is possible to achieve better than ideal transactions.
L1 Shared Transactions Ideal: Number of ideal 128-byte memory transactions from SM to L1 due to Shared memory accesses. The calculation of ideal transactions is based on aligned, sequential data and takes the predication mask and the active mask into account. For an access pattern where multiple threads access the same data, it is possible to achieve better than ideal transactions.
Thread Instructions Executed: Total executed memory instructions, counted per thread, regardless of predicate or condition code.
Instructions Executed: Total executed memory instructions, counted once per warp, regardless of predicate or condition code.
Bytes Requested: Amount of data requested, in bytes, summed across the active threads that are not predicated off.
L1 Bytes Transferred: Amount of data transferred between the SM and L1, in bytes.
L2 Bytes Transferred: Amount of data transferred between L1 and L2, in bytes.
L2 Global Transactions Executed: Number of executed 32-byte transactions between L1 and L2 due to Global memory accesses.
L2 Local Transactions Executed: Number of executed 32-byte transactions between L1 and L2 due to Local memory accesses.
L1 Transactions Per Request: Number of 128-byte transactions between the SM and L1 required per request made. Lower is better.
L2 Transactions Per Request: Number of 32-byte transactions between L1 and L2 required per request made. Lower is better.
L1 Transactions Above Ideal: Number of 128-byte transactions between the SM and L1 that exceeded the ideal count for the request made. The calculation of ideal transactions is based on aligned, sequential data and takes the predication mask and the active mask into account. For an access pattern where multiple threads access the same data, it is possible to achieve better than ideal transactions; this results in a value that is lower than expected, or even negative. Use the Instruction Count source-level experiment to identify instructions that execute with less than a full warp.
L1 Transfer Overhead: Number of bytes actually transferred between the SM and L1 for each byte requested by the kernel. Lower is better.
L2 Transfer Overhead: Number of bytes actually transferred between L1 and L2 for each byte requested of L1. Lower is better.
Many of the metrics provided by this experiment can reveal a general problem: if the amount of data transferred between any two memory regions exceeds the amount of data requested, the access pattern is not optimal. This may appear as high L1 or L2 Transfer Overhead, or as a high Transactions Per Request value. See the Memory Statistics family of kernel-level experiments for analysis guidance.
NVIDIA® Nsight™ Development Platform, Visual Studio Edition User Guide Rev. 4.0.140501 ©2009-2014. NVIDIA Corporation. All Rights Reserved.