Optimization for DGX Station#

This section discusses software optimization practices that are relevant for DGX Station.

General Practices#

Profile your application to find the hotspots that would benefit most from optimization. Assumptions about where the time goes often lead to incorrect conclusions. Suitable tools are discussed in the Profiling section below.
Create reproducible microbenchmarks when optimizing hotspots. Measuring the runtime of short functions can be tricky; consider using a framework such as Google Benchmark or nanobench. These frameworks generally handle the most notorious microbenchmarking challenges, such as data variance, outliers, warmups, and subtracting measurement overhead. Setting up microbenchmarks also helps you track your progress while optimizing a hotspot.
If possible, take advantage of vectorization, either by relying on your compiler’s auto-vectorization or by vectorizing by hand. SIMD code generally runs faster and uses less energy per unit of work than scalar code. If you are migrating vectorized x86_64 code, also see CPU Instruction Extensions.
Pay attention to memory access patterns. If appropriate, use tiling methods to maintain data locality.
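As an illustration of the tiling advice above, the following minimal sketch transposes a row-major matrix one square block at a time, so that reads and writes both stay inside a small working set. The function name transpose_tiled and the default tile size of 32 are assumptions made for this example, not tuned or recommended values; measure with a microbenchmark before adopting any particular tile size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Transposes a rows x cols row-major matrix into a cols x rows row-major
// matrix, visiting the data in tile x tile blocks to improve data locality.
// The default tile size is an illustrative guess, not a tuned value.
std::vector<float> transpose_tiled(const std::vector<float>& in,
                                   std::size_t rows, std::size_t cols,
                                   std::size_t tile = 32) {
    assert(in.size() == rows * cols && tile > 0);
    std::vector<float> out(rows * cols);
    for (std::size_t r0 = 0; r0 < rows; r0 += tile) {
        for (std::size_t c0 = 0; c0 < cols; c0 += tile) {
            // Clamp the block so that partial tiles at the edges are handled.
            const std::size_t r1 = std::min(r0 + tile, rows);
            const std::size_t c1 = std::min(c0 + tile, cols);
            for (std::size_t r = r0; r < r1; ++r) {
                for (std::size_t c = c0; c < c1; ++c) {
                    out[c * rows + r] = in[r * cols + c];
                }
            }
        }
    }
    return out;
}
```

A call such as transpose_tiled(matrix, rows, cols) returns the transposed copy. Whether tiling pays off depends on the matrix size relative to the cache sizes, so compare it against the untiled loop on representative inputs.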

Compiler Configuration#

DGX Station uses ARM Neoverse V2 cores that are based on the ARMv9.0-A architecture and include the optional crypto extensions. Each core has four 128-bit vector units, accessible with ARMv8.0 NEON (Advanced FP & SIMD) instructions or the more flexible SVE2 ISA.
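To make the NEON path concrete, here is a minimal sketch of a hand-vectorized element-wise addition that processes four floats per 128-bit register. The function name add_arrays is hypothetical. The NEON branch is guarded by the ACLE macro __ARM_NEON, so the same file still builds, using only the scalar loop, on other architectures.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Computes out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // Main loop: four 32-bit floats fit into one 128-bit NEON register.
    for (; i + 4 <= n; i += 4) {
        const float32x4_t va = vld1q_f32(a + i);
        const float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    // Scalar tail, and the full loop when NEON is not available.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

For a loop this simple, the compiler's auto-vectorizer will usually produce equivalent code on its own; hand-written intrinsics tend to be worth the effort only for patterns the compiler cannot vectorize.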

gcc and clang#

A specific architecture can be targeted by adding the -march switch, for example -march=armv9.0-a.

Since LLVM version 21 and gcc version 15, it is possible to specifically target DGX Station with -mcpu=grace. This enables optimizations for the NVIDIA Grace cores and is thus preferred over the more generic -march=armv9.0-a switch.

When compiling on the DGX Station itself, -march=native will have the compiler target the host machine’s architecture. Otherwise, for example when cross-compiling, -mcpu=grace can be used to specify the DGX Station CPU explicitly.

When targeting a baseline architecture that predates the Large System Extensions (LSE) atomics introduced in ARMv8.1-A, it is advisable to also enable -moutline-atomics. This option replaces inline atomics with calls to out-of-line helpers that select the best atomics available for the platform at runtime.
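One way to check what a given combination of -march and -mcpu switches actually enables is to inspect the feature macros the compiler predefines. The sketch below assumes the ACLE macro names __ARM_NEON, __ARM_FEATURE_SVE, __ARM_FEATURE_SVE2, and __ARM_FEATURE_ATOMICS; the function name target_feature_summary and the output format are invented for this example.

```cpp
#include <string>

// Returns a short summary of the target features that were enabled at
// compile time, based on macros predefined by the compiler.
std::string target_feature_summary() {
    std::string summary;
#if defined(__aarch64__)
    summary += "aarch64";
#else
    summary += "non-aarch64";
#endif
#if defined(__ARM_NEON)
    summary += " neon";
#endif
#if defined(__ARM_FEATURE_SVE)
    summary += " sve";
#endif
#if defined(__ARM_FEATURE_SVE2)
    summary += " sve2";
#endif
#if defined(__ARM_FEATURE_ATOMICS)
    // LSE atomics are emitted inline, so -moutline-atomics is not needed.
    summary += " lse-atomics";
#endif
    return summary;
}
```

Printing the result from a small program built with and without the target switches shows the difference between the generic baseline and the DGX Station target. The summary reflects compile-time settings only, not what the CPU running the binary supports.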

More details can be found in the NVIDIA Grace Performance Tuning Guide (compilers).

.NET Applications#

ARM Advanced FP & SIMD (also known as NEON) data types are available in modern .NET runtimes through the Vector<T>, Vector128<T>, and similar types. Operations and ISA extensions on these vectors are accessible in the System.Runtime.Intrinsics.Arm namespace. When targeting multiple architectures, the availability of a particular ISA extension can be determined at runtime, for example by checking System.Runtime.Intrinsics.Arm.Dp.IsSupported.

Access to SVE instructions is currently experimental and may have implications when using NativeAOT because of the runtime-determined vector width. See dotnet issue 93095 and Engineering the Scalable Vector Extension in .NET for more information.

Taking Advantage of UMA#

Unlike most x86_64 systems, the ARM64-based DGX Station features a unified memory architecture (UMA) where the CPU and GPU have coherent access to each other’s attached memory. In practice, the CPU and GPU have separate virtual memory spaces that map to the common physical memory. This architecture presents developers with interesting optimization opportunities, such as preparing a block of memory with the CPU and then sharing it with the GPU by simply mapping it into its address space. This practice effectively eliminates copying data into GPU memory, which would be required with PCIe-connected GPUs.

Note that certain restrictions apply and care must be taken to allow the system memory to be used in such a manner.

Coherent Memory Management#

DGX Station systems with the GB300 Superchip couple an NVIDIA Grace CPU with a Blackwell GPU over a high-bandwidth NVLink interconnect. This configuration exposes a coherent memory space in which CPU system memory can be accessed directly by the GPU, removing the need for many explicit host-device copies in typical CUDA applications. From a programmer's perspective, pointers returned by standard system allocators (for example malloc) and by CUDA pinned-memory APIs (for example cudaHostAlloc) can be used directly in CUDA kernels when coherence is present. This enables kernels to operate on host-resident data structures without transferring them to separate device buffers, which can significantly simplify data management code paths.

To verify that a given GPU supports this mode, use cudaDeviceGetAttribute and query the cudaDevAttrPageableMemoryAccessUsesHostPageTables attribute. If this attribute returns 1, the system supports coherent access to system memory and redundant host-device allocations and transfers can be eliminated.
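The attribute query described above can be sketched as a small host-side helper. The function name pageable_access_uses_host_page_tables and the convention of returning -1 on failure are assumptions made for this example. The __has_include guard is also an addition of this sketch: it lets the file compile on machines without the CUDA toolkit, where the helper simply reports -1.

```cpp
#if __has_include(<cuda_runtime.h>)
#include <cuda_runtime.h>
#define HAVE_CUDA_RUNTIME 1
#else
#define HAVE_CUDA_RUNTIME 0
#endif

// Returns 1 if the device accesses pageable system memory through the host
// page tables, 0 if it does not, and -1 if the query failed or this file was
// built without the CUDA runtime headers.
int pageable_access_uses_host_page_tables(int device) {
#if HAVE_CUDA_RUNTIME
    int value = 0;
    const cudaError_t err = cudaDeviceGetAttribute(
        &value, cudaDevAttrPageableMemoryAccessUsesHostPageTables, device);
    return err == cudaSuccess ? value : -1;
#else
    (void)device;  // Unused when the CUDA runtime is not available.
    return -1;
#endif
}
```

When building against the CUDA toolkit, link with the CUDA runtime library (for example by compiling with nvcc). An application could call this helper once at startup for device 0 and fall back to explicit device buffers and copies unless the result is 1.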

Note that cudaMalloc still allocates memory exclusively in GPU VRAM (HBM on Blackwell). These allocations remain the preferred choice for performance-critical data, as they provide the highest bandwidth and lowest latency from the GPU’s perspective, even in a coherent memory environment.

For deeper background and best practices, see the NVIDIA Developer Blog post Simplifying GPU Application Development with Heterogeneous Memory Management and the CUDA Programming Guide, section 2.4 Unified and System Memory. These documents explain how heterogeneous memory management, unified memory, and system memory interact and provide more detailed examples and recommendations.

Profiling#

Profiling your application is highly advisable before undertaking any optimization efforts. The tool of choice on Linux platforms is perf. Recording perf traces has security implications; you may need to unlock access to performance counters by adding a file to /etc/sysctl.d with the following content:

kernel.perf_event_paranoid=-1
kernel.kptr_restrict=0

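It is also possible to apply these settings only until the next reboot, for example by running sudo sysctl -w kernel.perf_event_paranoid=-1 and sudo sysctl -w kernel.kptr_restrict=0 instead of creating the file.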

To install the perf tools, execute sudo apt install linux-tools-$(uname -r).

To record a trace for your application and launch the analyzer:

sudo perf record -e cycles,(...) ./your-application
perf report

For more on profiling with perf and available hardware counters, see the NVIDIA Grace Performance Tuning Guide (measuring performance).