Nsight Graphics Activities - Advanced Learning
This section covers the key concepts behind the primary Nsight Graphics tools, along with advanced information and how-tos.
Introduction
Since the dawn of graphics acceleration, NVIDIA has led the way in creating the most performant and feature-rich GPUs in the world. With each generation, GPUs get faster and, because of that, more complex. To create applications that take full advantage of the complex capabilities of modern GPUs, you, the programmer, must have a deep understanding of how the GPU operates, as well as a way to see the GPU's state as it relates to that operation. Lucky for you, NVIDIA Graphics Developer Tools has a simple mission: to provide an ecosystem of tools that gives you that superpower.
This reference guide was created by experts on the Developer Tools team and is meant as a gentle introduction to some critical tool features that will help you debug, profile, and ultimately optimize your application.
Feel free to skip ahead to whichever section is most relevant to you. If you find any information lacking, please contact us at NsightGraphics@nvidia.com so we can make this reference better. In addition to this guide, we're happy to work with individual developers on training. Lastly, if anything ever goes wrong, remember to use that Feedback Button at the top right of the tool window so that we have a chance to make the tool work better for you.
Thank you,
Aurelio Reis
SWE Director, Graphics Developer Tools
NVIDIA
GPU Trace
What is it?
GPU Trace is a D3D12/DXR and Vulkan/VKRay graphics profiler used to identify performance limiters in graphics applications. It uses a technique called periodic sampling to gather metrics and detailed timing statistics associated with different GPU hardware units. With the new Advanced Mode, it is similar in some ways to the Range Profiler, in that it can use multiple passes and statistical sampling to collect metrics. For standard metric collection, it takes advantage of specialized Turing hardware to capture this data in a single pass with minimal overhead.
GPU Trace saves this data to a report file and includes "diffing" functionality via a feature called "TraceCompare". The data is presented as an intuitive visualization on a timeline which is configurable and easy to navigate. The data is organized in a hierarchical top-down fashion so you can observe whole-frame behavior before zooming into individual problem areas. Problems that previously required guessing & testing can now be visually identified at a glance.
This section focuses on the key concepts that GPU Trace introduces and gives an in-depth explanation of the data it retrieves.
- To learn how to use the tool, see the Activities section of the User Guide.
- To understand the UI components, see the GPU Trace UI section of the User Interface Reference.
Units Throughput
The Unit Throughputs row overlays the %-of-max-throughput of every hardware unit in the GPU. Multiple units can concurrently reach close to 100% at any moment in time.
Unit | Pipeline Area | Description |
---|---|---|
SM | Shader | The Streaming Multiprocessor executes shader code. |
L1TEX | Memory | The L1TEX unit contains the L1 data cache for the SM, and two parallel pipelines: the LSU or load/store unit, and TEX for texture lookups and filtering. |
L2 | Memory | The L2 cache serves all units on the GPU, and is a central point of coherency. |
VRAM | Memory | The GPU's video memory, for example GDDR5X or GDDR6. |
PD | World Pipe | The Primitive Distributor fetches indices from the index buffer, and sends triangles to the vertex shader. |
VAF | World Pipe | The Vertex Attribute Fetch unit reads attribute values from memory and sends them to the vertex shader. VAF is part of the Primitive Engine. |
PES+VPC | World Pipe | The Primitive Engine orchestrates the flow of primitive and attribute data across all world pipe shader stages (Vertex, Tessellation, Geometry). PES contains the stream (transform feedback) unit. The VPC unit performs clip and cull. |
RASTER | Screen Pipe | The Raster unit receives primitives from the world pipe, and outputs pixels (fragments) and samples (coverage masks) for the PROP, Pixel Shader, and ROP to process. |
PROP | Screen Pipe | The Pre-ROP unit orchestrates the flow of depth and color pixels (fragments) and samples, for final output. PROP enforces the API ordering of pixel shading, depth testing, and color blending. Early-Z and Late-Z modes are handled in PROP. |
ZROP | Screen Pipe | The Depth Raster Operation unit performs depth tests, stencil tests, and depth/stencil buffer updates. |
CROP | Screen Pipe | The Color Raster Operation unit performs the final color blend and render-target updates. CROP implements the “advanced blend equation”. |
SM Occupancy Rows
The SM Occupancy row shows warp slot residency over time. Each Turing SM has 32 warp slots, where launched warps reside while they take turns issuing instructions.
The row shows an ordered breakdown of warp slots. From top to bottom:
- Active SM unused warp slots (dark gray)
  - The warp slot is unused, but the SM is active; other warps are running on it.
  - Active SMs may be occupancy limited, implying these dark gray warp slots may be unable to absorb additional work.
- Compute Warps : compute warps across all simultaneously running Compute Dispatches, DXR DispatchRays, DXR BuildRaytracingAccelerationStructures, and DirectML calls
- Pixel Warps : 3D pixel shader warps from all simultaneously running draw calls
- Vertex/Tess/Geometry : 3D world pipe shaders from all simultaneously running draw calls
[Figure: mixed occupancy timeslice]
[Figure: low occupancy timeslice]
Asynchronous Compute
The only way to concurrently run compute and 3D is by simultaneously:
- sending 3D work to the DIRECT queue
- sending compute work to an ASYNC_COMPUTE queue
On Ampere, you can also run compute workloads concurrently by dispatching them on both the DIRECT and ASYNC_COMPUTE queues.
You can detect whether a program is taking advantage of async compute in several ways:
- The “Compute In Flight” row contains an “Async Compute In Flight” counter.
- Observe when the compute warps executed on the SM Occupancy row, and determine whether they were Sync or Async based on the color of the “Compute In Flight” row.
- Look for multiple queue rows; the ASYNC_COMPUTE queue will appear as something other than Q0.
Compute will only run simultaneously with graphics if submitted from an ASYNC_COMPUTE queue. This can disambiguate the SM Occupancy row.
See the GPU Programming Guide for more information.
Warp Can't Launch Reasons
When a draw call or compute dispatch enqueues more work than can fit onto the SMs all-at-once, the SMs will report that “additional warps can’t launch”. We can graph these signals over time to determine the limiting factor.
3D Shaders
3D warps may not launch due to the following reasons:
- Register Allocation
- Warp Allocation
- Attribute Allocation (ISBEs for VTG, or TRAM for PS)
GPU Trace allows you to determine the following:
- When pixel shaders can’t launch, for any reason.
- When pixel shader launch was register limited.
- The warp-allocation limited regions in the Warp Occupancy chart. There are no free slots when there are no gray regions.
- By deduction, if SM Occupancy shows a large region of pixel warps with no limiter visible, it means either:
  - there was no warp launch stall, but rather very long running warps [unlikely], OR
  - warp launch was stalled due to attribute allocation.
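The deduction above can be written as a small decision procedure. This is a minimal sketch and not part of GPU Trace: the function name and its three boolean inputs are hypothetical, standing in for what you read by eye from the timeline for one region of pixel warps.

```python
def classify_pixel_launch_limiter(cant_launch: bool,
                                  register_limited: bool,
                                  free_warp_slots: bool) -> str:
    """Return the likely reason pixel warps could not launch in a region.

    Hypothetical inputs, read manually from the timeline:
      cant_launch      - the pixel shader "can't launch" signal is raised
      register_limited - the register-limited signal is raised
      free_warp_slots  - gray (unused) regions are visible in the occupancy chart
    """
    if not cant_launch:
        # Nothing is waiting to launch: the resident warps are simply long running.
        return "no launch stall (long-running warps)"
    if register_limited:
        return "register allocation"
    if not free_warp_slots:
        return "warp allocation"
    # Launches are stalled but no visible limiter applies: by elimination.
    return "attribute allocation"


if __name__ == "__main__":
    print(classify_pixel_launch_limiter(True, False, True))
```

The order of the checks mirrors the list: the directly visible limiters are tested first, and attribute allocation is reached only by elimination.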
Compute warps may not launch due to the following reasons:
- Register Allocation
- Warp Allocation
- CTA Allocation
- Shared Memory Allocation
GPU Trace allows you to determine the following:
- When compute shaders should be running, using the Compute In Flight row.
- When compute shader launch was register limited.
- The warp-allocation limited regions in the Warp Occupancy chart. There are no free slots when there are no gray regions.
- By deduction, if SM Occupancy shows a large region of compute warps with no limiter visible, it means either:
  - there was no warp launch stall, but rather very long running warps [unlikely], OR
  - warp launch was stalled for one of the other reasons (CTA allocation or shared memory allocation).
To further disambiguate between the two cases above, ask:
- Are the CTA dimensions (HLSL numthreads) the theoretical occupancy limiter for your shader? A thread group with 32 threads or fewer will be limited to half occupancy. Increasing to 64 threads per CTA will relieve this issue.
- Is the shared memory size per CTA (HLSL groupshared variables) the theoretical occupancy limiter for your shader? Does (numCTAs * shmemSizePerCTA) exceed the per-SM limit of 64KiB [compute-only mode] or 32KiB [SCG mode]?
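The two questions above amount to a small theoretical-occupancy calculation, sketched below. The 32 warp slots per SM and the 64 KiB / 32 KiB shared memory limits come from this section. The limit of 16 resident CTAs per SM is an assumption, chosen because it reproduces the half-occupancy rule stated above. Register pressure is deliberately ignored, and the function name is hypothetical.

```python
WARP_SIZE = 32          # threads per warp
WARP_SLOTS_PER_SM = 32  # Turing figure quoted in the SM Occupancy section
MAX_CTAS_PER_SM = 16    # ASSUMPTION: resident CTA limit implied by the half-occupancy rule
KIB = 1024


def theoretical_warps_per_sm(threads_per_cta, shmem_bytes_per_cta,
                             shmem_limit_bytes=64 * KIB):
    """Return (resident warps per SM, name of the limiting resource).

    Only warp slots, CTA slots, and shared memory are modeled;
    register allocation is not.
    """
    warps_per_cta = -(-threads_per_cta // WARP_SIZE)  # ceiling division
    cta_limits = {
        "warp slots": WARP_SLOTS_PER_SM // warps_per_cta,
        "CTA slots": MAX_CTAS_PER_SM,
    }
    if shmem_bytes_per_cta > 0:
        cta_limits["shared memory"] = shmem_limit_bytes // shmem_bytes_per_cta
    limiter = min(cta_limits, key=cta_limits.get)  # resource allowing the fewest CTAs
    return cta_limits[limiter] * warps_per_cta, limiter


if __name__ == "__main__":
    print(theoretical_warps_per_sm(32, 0))                  # one warp per CTA
    print(theoretical_warps_per_sm(64, 0))                  # two warps per CTA
    print(theoretical_warps_per_sm(64, 8 * KIB, 32 * KIB))  # SCG-mode shared memory limit
```

Under these assumptions, a 32-thread CTA with no shared memory is capped by CTA slots at 16 of 32 warp slots, and a 64-thread CTA fills all 32. With 8 KiB of groupshared memory per CTA under the 32 KiB SCG-mode limit, shared memory becomes the limiter at 8 warps.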
SM Throughput
SM Throughput reveals the most common computational pipeline limiters in shader code:
- Issue Active : instruction-issue limited
- ALU Pipe : INT other than multiply, bit manipulation, lower frequency FP32 like comparison and min/max.
- FMA Pipe : FP32 add & multiply, integer multiply.
- FP16+Tensor : FP16 instructions which execute a vec2 per instruction, and Tensor ops used by deep learning.
- SFU Pipe : transcendentals (sqrt, rsqrt, sin, cos, log, exp, ...)
- Pipes not covered in the SM Throughput: IPA, LSU, TEX, CBU, ADU, RTCORE, UNIFORM.
  - LSU and TEX are never limiters in the SM; see L1 Throughput instead.
  - IPA traffic to TRAM is counted as part of LSU.
  - RTCORE has its own Throughput metric.
  - UNIFORM and CBU are unaccounted for, but rarely a limiter.
Problem Solving:
L1 Throughput
GPU Trace exposes a simplified model of the L1TEX Data Cache that still reveals the most common types of memory limiters in shader programs.
The Turing and GA10x L1TEX Caches share a similar design, capable of concurrent accesses:
- Input: Simultaneously accepting an LSU instruction and TEX quad per cycle
- Input: LSUIN accepts 16 threads’ addresses per cycle from AGU
- Data: Simultaneously reading or writing the Data SRAM for LSU and TEX.
- Writeback: Simultaneously returning data for LSU and TEX reads
Additional cache properties:
- The T-Stage (cache tags), Data-Stage, and M-Stage are shared between LSU and TEX
- The T-Stage address coalescer can output up to 4 tags per cycle, for divergent accesses. This ensures T-Stage is almost never a limiter, compared to Data-Stage
- Per cycle, M-Stage can simultaneously read from L2 and write to L2. There is a crossbar (XBAR) between M-Stage and L2, not pictured above
In this simplified model, memory & texture requests follow these paths:
- Local/Global Instruction → LSUIN → T-Stage → LSU Data → LSU Writeback → SM
- Shared Memory Instruction → LSUIN → LSU Data → LSU Writeback → RF
  - Includes compute shared memory and 3D shader attributes
  - A few non-memory ops like HLSL Wave Broadcast are counted here
- Texture/Surface Read → TEXIN → T-Stage → TEX Data → Filter → Writeback → SM
  - Includes texture fetches, texture loads, surface loads, and surface atomics
- Surface Write → TEXIN... → T-Stage → LSU Data → Writeback → RF
  - Surface writes cross over to the LSU Data path
- Memory Barriers → LSUIN & TEXIN → … flows through both sides of the pipe
- Primitive Engine attribute writes to (ISBE, TRAM) → LSU Data
- Primitive Engine attribute reads from ISBE → LSU Data
- Local/Global/Texture/Surface → T-Stage [miss!] → M-Stage → XBAR → L2 → XBAR → M-Stage → Cache Data SRAM
Turing Problem Solving
Turing reports the following activity:
- Throughput of LSU Data and TEX Data
  - Separate counters in LSU Data for Local/Global vs. Shared memory
- Throughput of LSU Writeback, and TEX Filter
By comparing the values of the available throughputs, we can draw the following conclusions:
- All values are equal : most likely request limited
- TEX Data > others : Bandwidth limited; cachelines/instruction > 1
- TEX Filter > others : expensive filtering (Trilinear/Aniso) OR Texture writeback limited due to a wide sampler format (or surface format when sampler disabled)
- LSU Data > others : Possibilities are:
  - Pure bandwidth limited
  - Cachelines per instruction > 1, causing serialization
  - Shared memory bank conflicts, causing serialization
  - Vectored shared memory accesses (64-bit, 128-bit) requiring multi-cycle access
  - Heavy use of SHFL (HLSL Wave Broadcast)
- LSU Writeback > others : limited by coalesced wide loads (64-bit or 128-bit)
  - Note: this implies efficient use of LSU Data
- LSU LG Data or TEX Data close to 100% : implies a high hit-rate
  - If Sector Hit Rate is low : may imply being latency-bound by L2 or VRAM accesses
GA10x Problem Solving
GA10x reports the following activity:
- Throughput of LSU Data and TEX Data
  - Separate counters in LSU Data per Local/Global, Surface, and Shared memories
- Throughput of LSU Writeback, TEX Filter, and TEX Writeback
- L1TEX Sector Hit-Rate
  - Reports the collective hit-rate for Local, Global, Texture, Surface
  - Note that shared memory and 3D attributes do not contribute to the hit-rate
By comparing the values of the available throughputs, we can draw the following conclusions:
- All values are equal : most likely request limited
- TEX Data > others : Bandwidth limited; cachelines/instruction > 1
- TEX Filter > others : expensive filtering (Trilinear/Aniso)
- TEX Writeback > others : limited by a wide sampler format (or surface format when sampler disabled)
- LSU Data > others : Possibilities are:
  - Pure bandwidth limited
  - Cachelines per instruction > 1, causing serialization
  - Shared memory bank conflicts, causing serialization
  - Vectored shared memory accesses (64-bit, 128-bit) requiring multi-cycle access
  - Heavy use of HLSL Wave Broadcast
- LSU Writeback > others : limited by coalesced wide loads (64-bit or 128-bit)
  - Note: this implies efficient use of LSU Data
- LSU LG Data or TEX Data close to 100% : implies a high hit-rate; cross-check against the L1TEX Sector Hit Rate
  - If Sector Hit Rate is low : may imply being latency-bound by L2 or VRAM accesses
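The comparison rules above can be applied mechanically once the throughput values have been read from the report. The sketch below is illustrative only: the function name, the dictionary keys, and the 5-point tolerance used to decide that values are "equal" are all assumptions, and the hint strings paraphrase the list above.

```python
def diagnose_l1tex(throughputs, tolerance=5.0):
    """Map percent-of-peak L1TEX stage throughputs to a likely limiter.

    throughputs: dict of stage name -> percent of peak (values are hypothetical
    numbers copied by hand from the report).
    tolerance:   ASSUMPTION, the spread below which values count as "equal".
    """
    hints = {
        "TEX Data": "bandwidth limited; cachelines/instruction > 1",
        "TEX Filter": "expensive filtering (trilinear/aniso)",
        "TEX Writeback": "wide sampler or surface format",
        "LSU Data": ("bandwidth, cachelines/instruction > 1, bank conflicts, "
                     "vectored shared accesses, or wave broadcast"),
        "LSU Writeback": "coalesced wide loads (64-bit or 128-bit)",
    }
    ranked = sorted(throughputs.items(), key=lambda kv: kv[1], reverse=True)
    top_name, top_value = ranked[0]
    lowest_value = ranked[-1][1]
    if top_value - lowest_value <= tolerance:
        return "all values roughly equal: most likely request limited"
    return f"{top_name} > others: {hints.get(top_name, 'no rule for this stage')}"


if __name__ == "__main__":
    sample = {"LSU Data": 20.0, "TEX Data": 80.0, "TEX Filter": 30.0}
    print(diagnose_l1tex(sample))
```

On Turing, which does not report TEX Writeback separately, simply omit that key; a high TEX Filter value then also covers the wide-format writeback case.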
VRAM
The GPU’s VRAM is built with DRAM. DRAM is a half duplex interface, meaning the same wires are used for read and write, but not simultaneously. This is why the total VRAM bandwidth is the sum of read and write.
The VRAM Bandwidth row shows bandwidth as a stacked graph, making it easy to visualize the balance between read and write traffic. VRAM traffic implies either L2 cache misses, or L2 writeback. In either case, high VRAM traffic can be a symptom of poor L2 cache usage -- either too large a working set, or sub-optimal access patterns.
PCI Bandwidth
The GPU connects to the rest of the computer via PCI Express (PCIe). PCIe is a full duplex interface, meaning separate wires are used for reads and writes, and these can occur simultaneously. This is why the PCIe row is displayed as an overlay, where reads and writes can independently reach 100%.
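The difference between the half-duplex VRAM interface and the full-duplex PCIe interface shows up in how utilization is computed. The following sketch uses made-up bandwidth figures purely for illustration; the function names and parameters are hypothetical and do not correspond to metric names in the tool.

```python
def vram_utilization(read_gbps, write_gbps, peak_gbps):
    """Half duplex: reads and writes share one budget, so their traffic adds."""
    return 100.0 * (read_gbps + write_gbps) / peak_gbps


def pcie_utilization(read_gbps, write_gbps, peak_per_direction_gbps):
    """Full duplex: each direction has its own budget and its own percentage."""
    return (100.0 * read_gbps / peak_per_direction_gbps,
            100.0 * write_gbps / peak_per_direction_gbps)


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from any GPU.
    print(vram_utilization(200.0, 100.0, 400.0))
    print(pcie_utilization(8.0, 4.0, 16.0))
```

This is why a stacked graph suits VRAM (the two components sum toward a single 100%), while an overlay suits PCIe (each direction is measured against its own 100%).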
When the GR engine is idle (neither graphics nor compute running), it is often due to a data dependency, where a previous data transfer (DMA copy) must complete before draws or compute dispatches can start. Compare the GPU Active row against the Throughputs row to confirm that hypothesis.
Range Profiler
The Range Profiler is a full pipeline profiler that helps developers better understand how their application utilizes the GPU. It uses the tool’s replay capability to collect many detailed metrics from every GPU subunit, so you can see which parts of the GPU are under- or over-utilized. The profiling experience is guided by visualizing various ranges, or groups of draw/dispatch calls, that the user can interactively select as the focus of the profiling activity. GPUs from Turing to Maxwell are supported, as well as various graphics APIs including DirectX12 with DXR, DirectX11, Vulkan with NVIDIA VKRay, and OpenGL.
How Do I Use It?
The Range Profiler is available from both the Profiling and Frame Debugging activities. The Profiling activity, however, provides a more profiling-centric UI and disables some debugger-centric features of the tool that may introduce overhead and affect the profiling results. You can access the Range Profiler in a number of contextually aware ways. First, you can simply select the Range Profiler from the Frame Debugger menu. This brings up the view with the entire frame selected as the current “range” to investigate. The same can be accomplished via the “Profiler Frame” button on the toolbar. Next, you can use the “Profile Current Event” button on the toolbar, or right-click the event in the Scrubber or Events View, to focus the profiling on a particular event of interest, such as a particularly long draw or dispatch call. You can also initiate profiling on a given range by right-clicking the range “bar” in the Scrubber View, which is handy if you already have an idea of which section of the frame you want to investigate.
The more contextual information you can give the Range Profiler, the more useful and powerful it becomes for understanding where time is spent on the GPU. The tool tries to create ranges from common state, such as render targets, and from related workloads, via command list ranges. However, adding custom ranges using the various graphics APIs’ annotation capabilities, like ID3DUserDefinedAnnotation, VK_EXT_debug_marker, etc., can go a long way in grounding you as you work your way through the hundreds to thousands of draw and dispatch calls in a given scene.
Finally, the Range Profiler is configurable by you, the user. There are simple text “.section” files and “.py” scripts that allow you to edit the values that are presented in order to customize the UI as you gain expertise on the GPU and your particular workloads. You can modify the existing sections, create new ones, and enable/disable sections on the fly as you investigate different rendering performance problems. For more information on how to edit the section files, see Configuring The Range Profiler.
Range Profiler Key Concepts
Selecting What To Profile
We encourage developers to start by looking at the larger ranges in their scene, as these typically will have more headroom for improvement. You can also start out by looking at the time spent in various ranges compared to your allocated budgets for that part of the scene or rendering pass, again starting with the ones that are more over budget.
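One way to apply this advice is to rank ranges by how far they exceed their budget. The sketch below assumes you have copied range names and timings out of the profiler by hand. The function name, the tuple layout, and the sample numbers are all hypothetical.

```python
def rank_ranges_by_overage(ranges):
    """Return over-budget ranges, worst first.

    ranges: list of (name, measured_ms, budget_ms) tuples.
    Result: list of (name, overage_ms) for ranges where measured > budget.
    """
    over = [(name, measured - budget)
            for name, measured, budget in ranges
            if measured > budget]
    return sorted(over, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Illustrative timings only.
    frame = [
        ("Shadows", 3.0, 2.0),
        ("GBuffer", 4.0, 4.5),
        ("Lighting", 6.5, 4.0),
    ]
    for name, overage in rank_ranges_by_overage(frame):
        print(f"{name}: {overage:.1f} ms over budget")
```

Ranges that are within budget are dropped from the result, so the list doubles as a to-do order for investigation.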
Throughput Metrics
One of the key sections for understanding overall GPU performance is called the Pipeline Overview.
This section will help you understand the throughput for the currently selected range/workload and serves as a starting point for further investigation. For a better understanding of the GPU pipeline and the different subunits, consult section 3.2 from the GPU Programming Guide.
In the example above, we have selected a section of the scene responsible for the shadow pass, and we can see that our top throughput values are concentrated on the Raster and Z Blending units. So, we can tell already that we are not bound by the shader unit or something else, which makes sense since we are simply transforming the geometry and writing the z values to a buffer. From here, you can look at the Range Info section, which will show you information about primitive culling, etc. and can help you understand possible next steps for reducing the cost of determining shadow coverage.
For more details on how to understand the throughput metrics and how to use them to analyze your GPU performance, please consult Louis Bavoil's blog article on GPU performance analysis.
SM Metrics
The next section covers the SM, or shader unit, inside the GPU. This is where all of the various shaders run, and it is critical to ensuring maximum performance. The initial view shows a number of values useful for understanding SM utilization:
SM Active tells the % of time the SM unit was active during the selected range. A lower value here indicates possible underutilization and potential for adding shader work.
The SM Active Min/Max Delta indicates how well balanced a given workload was across all of the available shader units. If this number is high and you have a compute workload, for instance, there are likely not enough threads/warps to fill the GPU with work.
The SM Occupancy value tells you how many warps were available on the GPU, which is important to maximize throughput. When a warp gets stalled waiting for memory or texture fetches, having a high occupancy means there are other warps that can be swapped in so overall progress in the scene can continue, even if the current warp needs to wait for some results.
On the right side is a list of some of the warp stall reasons, indicating that a warp was not able to issue an instruction on a given clock. These will help you know the high level reasons for the shaders in the selected range not making progress. You can use this information to indicate when to open the Shader Profiler and see similar stall information broken down to specific shaders and source lines.
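To make the balance idea behind SM Active and its Min/Max Delta concrete, the sketch below computes an average and a max-minus-min spread from per-SM activity percentages. The exact formula the tool uses is not stated here, so treating the delta as max minus min across SMs is an assumption. The function name and sample values are hypothetical.

```python
def sm_active_summary(per_sm_active_pct):
    """Return (average SM active %, max - min spread across SMs).

    ASSUMPTION: the spread is taken as max minus min of the per-SM values;
    the tool's own definition of Min/Max Delta may differ.
    """
    average = sum(per_sm_active_pct) / len(per_sm_active_pct)
    spread = max(per_sm_active_pct) - min(per_sm_active_pct)
    return average, spread


if __name__ == "__main__":
    # Illustrative: half the SMs are busy and half are nearly idle.
    print(sm_active_summary([90.0, 90.0, 10.0, 10.0]))
    # Illustrative: evenly loaded SMs.
    print(sm_active_summary([50.0, 50.0, 50.0, 50.0]))
```

Both samples have the same average, but only the first has a large spread, which is the pattern described above for a workload with too few threads/warps to fill the GPU.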
Notices
Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.