Nsight provides a series of experiments focused on how the memory subsystem is used during kernel execution. Each individual experiment covers a particular area of the memory hierarchy defined by the CUDA programming model:
These memory experiments can be collected independently of each other. To avoid unnecessary profiling overhead, skip any memory experiment that covers an area of the memory hierarchy the kernel does not use. For each executed experiment, a separate output tab is generated on the report page. In addition, the results of each memory experiment are used as input to the Overview tab, which summarizes the complete memory hierarchy with high-level performance metrics. The remainder of this document describes the Overview tab in more detail; for further information on an individual memory experiment, follow the links above.
Table

Lists the key performance metrics for each covered memory space. It repeats all information shown in the chart as raw values and also offers some additional metrics, such as the replay overhead per memory space. The replay overhead is defined as:

(Transactions - Requests) / Instruction_Issued

The number of requests equals the number of executed memory instructions. Depending on the memory access pattern and the accessed memory space, a single request may require multiple transactions to serve all active threads of a warp. Each additional memory transaction forces the memory instruction to be issued again. The more transactions are needed, the more often the instruction has to be replayed, which ultimately slows forward progress. See the Instruction Statistics experiment for more details.

Columns

Total: Reports the values as total aggregates over the course of the kernel execution.

Per Warp: Reports the applicable metrics per warp. Useful for quick comparison against estimates derived during kernel development.

Per Second: Reports the applicable metrics in relation to the kernel execution duration. For rows that refer to the amount of transferred bytes, the value in this column is the achieved memory throughput. The architectural peak memory bandwidth to device memory for the target device is given on the GPU Devices report page.
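The arithmetic behind these table values can be sketched in a few lines of Python. This is only an illustration and not part of Nsight: the function names and sample counter values are hypothetical, and dividing a total by the number of launched warps is an assumption about how the Per Warp column is derived.

```python
# Illustrative only: function names and sample numbers are hypothetical,
# not part of Nsight or any NVIDIA API.

def replay_overhead(transactions: int, requests: int, instructions_issued: int) -> float:
    """Replay overhead as defined above:
    (Transactions - Requests) / Instruction_Issued."""
    if instructions_issued <= 0:
        raise ValueError("instructions_issued must be positive")
    return (transactions - requests) / instructions_issued


def per_warp(total: float, warps_launched: int) -> float:
    """Assumed derivation of the Per Warp column: total divided by warp count."""
    if warps_launched <= 0:
        raise ValueError("warps_launched must be positive")
    return total / warps_launched


def per_second(total: float, kernel_duration_s: float) -> float:
    """Per Second column: total divided by the kernel execution duration.
    For a byte count this is the achieved memory throughput in bytes/s."""
    if kernel_duration_s <= 0:
        raise ValueError("kernel_duration_s must be positive")
    return total / kernel_duration_s


if __name__ == "__main__":
    # Made-up counters for one memory space of one kernel launch.
    requests = 1_000             # executed memory instructions
    transactions = 4_000         # transactions needed to serve those requests
    instructions_issued = 20_000
    bytes_transferred = 512_000
    duration_s = 1e-4            # 100 microseconds

    print("replay overhead:", replay_overhead(transactions, requests, instructions_issued))
    print("transactions per warp:", per_warp(transactions, 500))
    print("throughput (bytes/s):", per_second(bytes_transferred, duration_s))
```

With these sample numbers, each request needs four transactions on average, so three quarters of the transactions are replays. Comparing the computed throughput with the peak bandwidth listed on the GPU Devices report page gives a rough sense of how close the kernel is to the hardware limit.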
NVIDIA GameWorks Documentation Rev. 1.0.150630 ©2015. NVIDIA Corporation. All Rights Reserved.