Updates in 2024.4

General

  • Added support for the Blackwell architecture.

  • Added support for several launch__* metrics for CUDA graphs.

  • Added support for the cuMemBatchDecompressAsync API in Range Replay.

NVIDIA Nsight Compute

  • A new feature overview is now shown the first time a new UI version is opened.

  • Switched the default orientation of the Raw page to show metrics in rows and profile results in columns.

  • Added support for reporting register spilling compiler annotations on the Source page.

  • The Source page has improved search with support for regular-expression and value-based lookups.

  • Added support for setting a Source View Profile as the default profile, so that it is applied automatically when a report is opened.

  • Added hyperlinks for the line numbers and inline function addresses in the Inline Table. These let you quickly jump to the respective line number in the Source view and to the address in the SASS view. Added a new Source File column to the Inline Table that shows the name of the file to which the source belongs.

  • The memory chart can indicate or hide inactive elements.

  • Chart tooltips on the Details page now show more relevant information when a specific value is hovered.

  • Roofline charts now support showing the formula for ridge point calculation in the metric details tool window.

  • The occupancy calculator now considers the impact of block barriers for Hopper-architecture and newer GPUs. It also has improved controls to adjust input values.

  • The remote connections dialog now supports placeholders, for example to deploy files to user-specific directories on the target system.
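
For readers unfamiliar with the ridge point mentioned in the roofline item above, the following is a minimal sketch of the textbook roofline definition, in which the ridge point is the arithmetic intensity where the memory-bandwidth roof meets the compute roof. It is not taken from the tool: the exact formula Nsight Compute displays may use different metric names and units, and the function names and peak values below are hypothetical.

```python
def ridge_point(peak_flops_per_s: float, peak_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which the two roofs intersect.

    Textbook roofline definition: peak compute throughput divided by
    peak memory bandwidth. Assumed here, not read from Nsight Compute.
    """
    if peak_bytes_per_s <= 0:
        raise ValueError("peak bandwidth must be positive")
    return peak_flops_per_s / peak_bytes_per_s


def attainable_flops(intensity: float,
                     peak_flops_per_s: float,
                     peak_bytes_per_s: float) -> float:
    """Roofline bound: the lower of the memory roof and the compute roof."""
    return min(peak_flops_per_s, intensity * peak_bytes_per_s)


# Hypothetical device peaks, chosen only to keep the arithmetic easy to follow.
PEAK_FLOPS = 20.0e12  # 20 TFLOP/s
PEAK_BW = 2.0e12      # 2 TB/s

ridge = ridge_point(PEAK_FLOPS, PEAK_BW)
print(f"ridge point: {ridge:.1f} FLOP/byte")

# Left of the ridge point a kernel is bounded by memory bandwidth,
# right of it by compute throughput.
for intensity in (5.0, ridge, 40.0):
    bound = attainable_flops(intensity, PEAK_FLOPS, PEAK_BW)
    print(f"intensity {intensity:5.1f} FLOP/byte -> at most {bound:.2e} FLOP/s")
```

Under this definition, a kernel whose achieved arithmetic intensity lies below the ridge point is limited by memory bandwidth, and one above it is limited by compute throughput.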

NVIDIA Nsight Compute CLI

  • Added a new --nvtx-push-pop-scope command line option that allows setting the scope of push/pop ranges to process-wide.

Resolved Issues

  • Fixed UI scrolling issues on macOS trackpads.

  • Fixed an issue where certain Python script errors were not properly reported when loading rule files.

  • On CUDA 12.7 drivers, context switch trace can now filter events more precisely to the profiled CUDA context, even when profiling in containers.

  • NVTX filtering now properly supports start/end ranges that start and end in different threads.

  • Fixed several issues with range replay when capturing CUDA memcpy APIs.