NVIDIA® Nsight™ Development Platform, Visual Studio Edition 4.0 User Guide
Display Driver
You must install the NVIDIA display driver that supports the NVIDIA Nsight tools. If you have an NVIDIA graphics card installed on your target machine, you likely already have an NVIDIA display driver; however, NVIDIA Nsight requires a specific version of the driver in order to function properly. From the NVIDIA web site, download and install the following display driver (or newer):
Driver Release 332, Release 332.44 or newer
See below for more release information about:
Support for the new Maxwell architecture (for example, found in the GeForce GTX 750 Ti and 750).
- [...] $(TargetPath), $(ProjectDir), and $(LocalDebuggerCommandArguments) are able to be added on the NVIDIA Nsight properties page. (12107)
- [...] __global__ subroutine on a Maxwell GPU, a bug in the shared memory range checker will incorrectly flag those memory accesses inside the subroutine as out-of-range. (30497)
- [...] __global__ static attributes, the NVIDIA Nsight debugger might not be able to display local variables inside that function. Users can work around this issue by simply removing the static qualifier on the function. (21914)
- [...] x = cos() + sin() [...]
- [...] ARB_vertex_attrib_binding and ARB_multi_draw_indirect has been added. (26894, 23858)
- [...] rop_busy hardware counter has been removed from the list of available counters, due to a hardware bug that caused the value to not be correct. If you reinstall NVIDIA Nsight, this may still be a default counter and will show unusually high values. You can either edit your graphs to remove the counter (via Nsight > Windows > Graphics HUD Configuration), or delete your persisted settings. To do this, open Windows Explorer, navigate to %appdata%\NVIDIA Corporation, and delete the entire Nsight directory. (29203)
- [...] SwapBuffer calls for a single buffered window. (24590)
- [...] fx_N_M target) is not supported, only pure HLSL shaders. (24891)
- [...] Z, this value may be 0, even though the depth value for a fragment may be written. (22061)
- [...] #line directive to refer back to the original sources may not work as expected. (22067)
- [...] nvtxRangeBegin and nvtxRangeEnd functions, only nvtxRangePush and nvtxRangePop. (22163)
- [...] D3D11_MAP_FLAG_DO_NOT_WAIT to a Map call on a Direct3D 11 Device Context, it is possible that the operation hasn't finished so you will see a return code of 0x887A000A or DXGI_ERROR_WAS_STILL_DRAWING. This can sometimes happen when the capture is trying to restore a buffer to the frame start state and it is mapped early in the frame. Simply remove the D3D11_MAP_FLAG_DO_NOT_WAIT, and it should function properly. (24846)
- [...] This issue can be resolved by always adding an "explicit" return at the end of your shader. (14656)
- [...] RefRast) tool, which is the CPU rasterizer provided by Microsoft. The Graphics Debugger will signal an error if the IDXGIFactory::CreateSoftwareAdapter function is used for device creation.
- [...] title_static modifier.
- [...] cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, ...) with a large value, the driver may allocate many GBs of device memory, leaving little memory for the application itself. To work around this, select the lowest safely-usable value for cudaLimitDevRuntimeSyncDepth, which will leave more device memory available for both NVIDIA Nsight and the application itself to use. To see how much memory is being reserved for the CDP sync stack, run an Application Trace with Software Counters enabled, and check the Device Memory row underneath a call to cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, ...). (24592)
Analysis Activity Known Issues
- Tracing the following APIs is not supported in managed processes:
  - NVTX
  - OpenCL
  - Direct3D
  - OpenGL
- Launching a managed .exe for tracing with any of the aforementioned APIs enabled will result in an "Access Denied" pop-up message, and the analysis session will not start.
- In Trace Process Tree mode, instrumentation for tracing the aforementioned APIs can only propagate to native child processes. If a managed child process is launched, neither it nor any child process it launches (managed or native) can be instrumented by NVIDIA Nsight. The analysis session will continue unaffected, and the user will not be notified of the problem; the report will not contain data from managed processes and their children.
- System and CUDA tracing is fully supported in managed processes, and in Trace Process Tree mode, tracing support propagates to all child processes (native or managed).
- Managed processes are fully supported in the Profile CUDA modes.
- The stop collection timer is implemented in Visual Studio. The latency to communicate to the monitor and application can result in a longer duration than requested.
- CPU Thread Trace
- If the Windows Kernel Event Provider is already in use when a new capture session is launched, the collected data may produce unexpected results. For best results, ensure that no other kernel providers are running during an analysis session.
- CUDA Trace
- CUDA trace does not show implicit memory transfers for graphics interop.
- CUDA Runtime API trace does not capture the <<< >>> kernel launch syntax. Instead, the corresponding CUDA Runtime API calls are reported. Some of the CUDA Driver API calls that are executed by the CUDA Runtime may report errors, such as CUDA_ERROR_INVALID_CONTEXT, even though the usage of the CUDA Runtime API is valid. (6745)
- When collecting trace information about CUDA kernels and memory transfers, sometimes the report file will not contain complete information about the kernels and memory transfers. This happens because retrieving the data interferes with the application and affects performance, so the tool only does it after these events:
  - a call to cuCtxSynchronize()/cudaDeviceSynchronize(),
  - a call to cuCtxDestroy()/cudaDeviceReset(),
  - a call to cuStreamDestroy()/cudaStreamDestroy(),
  - the application launches enough kernels or memory transfers to fill up NVIDIA Nsight's buffer, so NVIDIA Nsight forces a context synchronize in order to retrieve the data.
  If your capture appears to be missing some or all kernel launch or memory transfer events, either force the data to flush by adding a call to cuCtxSynchronize()/cudaDeviceSynchronize() after all the CUDA work is finished, or (for an application that continuously launches kernels and memcpys) simply capture for more time and try to generate enough data to incur NVIDIA Nsight's flush for a full buffer. (4812)
- CUDA Profiler
- On Tesla GPUs, branch counters include __syncthreads().
- Profile Trigger increments by 1 per warp, not by 1 per active thread.
- The NVIDIA Nsight CUDA Profiler cannot collect all necessary data in a single pass of the kernel, so the profiler replays the kernel as many times as necessary to collect all requisite data. Between replays of the kernel, the accessible memory is restored to the state it was in before the kernel ran, ensuring the kernel will execute the same code paths. However, the L2 cache state is not restored, so all passes after the first will execute with different data cached in L2. For kernels that access small amounts of global or local memory, this may cause the L2 cache to show hit rates better than it would achieve in normal execution. Kernels that access large amounts of memory that cannot fit entirely in L2 cache will show more accurate results.
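Because Profile Trigger counts warps rather than threads, the value to expect from a launch can be estimated with a little arithmetic. The following is a minimal sketch, not part of NVIDIA Nsight: the function names are hypothetical, and it assumes a warp size of 32 and that every warp in the launch executes the trigger exactly once.

```cpp
#include <cstdint>

// Warp size on current NVIDIA GPUs.
constexpr std::uint64_t kWarpSize = 32;

// Warps needed to hold one block; a partially filled warp still counts as one.
constexpr std::uint64_t warpsPerBlock(std::uint64_t threadsPerBlock) {
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Counter value to expect if every warp executes the trigger exactly once
// (an assumption of this sketch: no loop around the trigger, no warp skips it).
constexpr std::uint64_t expectedTriggerCount(std::uint64_t blocks,
                                             std::uint64_t threadsPerBlock) {
    return blocks * warpsPerBlock(threadsPerBlock);
}

// 64 blocks of 100 threads launch 6400 threads but only 64 * 4 = 256 warps.
static_assert(expectedTriggerCount(64, 100) == 256, "one increment per warp");
```

If a measured count is far below the number of launched threads but close to this estimate, that is consistent with the per-warp behavior described above.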
- OpenCL
- The end timestamp can sometimes be recorded significantly after the completion of a command. If this occurs, adding a clFlush call after the affected command will fix the timestamp.
- The start/end range for memory read and write commands includes both host and device time. CUDA start/end range only includes device time.
- Viewing OpenCL Source or Binary code from the OpenCL Programming Builds or OpenCL Program Summary creates a temporary file in %TMP%. The temporary file is not deleted when the file is closed.
- OpenCL reports occasionally do not contain device commands. This can occur if the OpenCL context/queue is not released or fewer than 512 events occurred during a capture.
- DirectX/OpenGL Trace
- Graphics workload information, such as draw calls and dispatches, is output in groups of 16384 workload events. As a consequence, a report will not contain any graphics workload information if an insufficient number of draw calls occurred during a capture. Increasing the capture duration will help to work around this limitation.
- Some applications, such as Chrome, run in a sandbox environment. The effects of such a sandbox on NVIDIA Nsight are hard to predict, so if you run into trouble, read the documentation for the target application and disable any sandbox when possible. For Chrome, the applicable launch flag is -no-sandbox. (16426)
- When you run analysis for DX apps on a multi-GPU system, you could see a hang. When running frame timings for DX apps on a multi-GPU system, you could see a timeout waiting for the results. One possible solution is to connect the monitor to the other GPU. Failing that, run analysis with only one GPU plugged into the system.
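The 16384-event grouping described for DirectX/OpenGL Trace can be turned into a rough estimate of how long a capture needs to be. This is a sketch under stated assumptions, not NVIDIA Nsight's implementation: the function names are hypothetical, and it assumes that only complete groups reach the report and that the application issues a fixed number of workload events per frame.

```cpp
#include <cstdint>

// Group size stated in the DirectX/OpenGL Trace item above.
constexpr std::uint64_t kWorkloadGroupSize = 16384;

// Events that would reach the report, ASSUMING only complete groups are
// written and a trailing partial group is dropped when the capture ends.
constexpr std::uint64_t reportedWorkloadEvents(std::uint64_t eventsInCapture) {
    return (eventsInCapture / kWorkloadGroupSize) * kWorkloadGroupSize;
}

// Frames that must be captured before the first group is complete, for a
// hypothetical application issuing eventsPerFrame events every frame.
// Returns 0 when eventsPerFrame is 0, meaning no group is ever completed.
constexpr std::uint64_t framesForFirstGroup(std::uint64_t eventsPerFrame) {
    return eventsPerFrame == 0
               ? 0
               : (kWorkloadGroupSize + eventsPerFrame - 1) / eventsPerFrame;
}

// A capture with 16383 events reports nothing under this model.
static_assert(reportedWorkloadEvents(16383) == 0, "no complete group yet");
// At 500 events per frame, at least 33 frames are needed.
static_assert(framesForFirstGroup(500) == 33, "ceil(16384 / 500)");
```

Under these assumptions, an application that draws little per frame needs a noticeably longer capture before any graphics workload rows appear.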
Analysis Report Known Issues
- On the PM Counters report, you may encounter an error in which not all passes are displayed. (26301)
- If two different host computers use the same remote target machine, the two hosts could generate the same report directory, so that reports from both machines are mixed together. Although unlikely, this can occur when the two machines analyze an application of the same name, because the NVIDIA Nsight analysis tools on the host machine create the directory name based on the name of the application.
Timeline Known Issues
- There can be an error of approximately 1 microsecond between CPU events and GPU events.
- Percentages displayed in the row labels and tool tips are based upon the full capture time.
- The mouse forward and back buttons cannot be used to navigate the report page system.
- CTRL+- toggles to the previous document instead of Zooming Out.
- Double-clicking on a row containing a line/area graph that also has children will expand/collapse the row as opposed to increasing the height to 66% of the view.
- Using VNC (virtual network computing) software to remotely open a Timeline Report can cause Visual Studio to crash. (7157)
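When timestamps exported from a Timeline report are compared in your own tooling, the roughly 1 microsecond CPU/GPU error noted above means that closely spaced events cannot be ordered reliably. The following is a minimal sketch of a tolerance-aware comparison. The names are hypothetical, and it assumes both timestamps are in nanoseconds on a common time base.

```cpp
#include <cstdint>

// Approximate CPU/GPU timestamp error from the first Timeline item, in ns.
constexpr std::int64_t kCpuGpuErrorNs = 1000;

enum class Order { CpuFirst, GpuFirst, Ambiguous };

// Orders a CPU event against a GPU event, reporting Ambiguous when the two
// timestamps are within the stated error of each other.
constexpr Order compareCpuGpu(std::int64_t cpuNs, std::int64_t gpuNs) {
    return (gpuNs - cpuNs > kCpuGpuErrorNs)   ? Order::CpuFirst
           : (cpuNs - gpuNs > kCpuGpuErrorNs) ? Order::GpuFirst
                                              : Order::Ambiguous;
}

// Events 500 ns apart are too close to order with confidence.
static_assert(compareCpuGpu(1000, 1500) == Order::Ambiguous, "within error");
```

Treating the ambiguous case explicitly avoids drawing conclusions, such as a kernel appearing to start before its launch call, from differences smaller than the error.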
NVIDIA® Nsight™ Development Platform, Visual Studio Edition User Guide Rev. 4.0.140501 ©2009-2014. NVIDIA Corporation. All Rights Reserved.