Optimization for DGX Spark#
This section discusses software optimization practices that are relevant for DGX Spark.
General practices#
Profile your application to find the hotspots that will benefit most from optimization; making assumptions about hotspots can lead to incorrect conclusions. Suitable tools are discussed in the Profiling section below.
Create reproducible microbenchmarks when optimizing hotspots. Measuring the runtime of short functions can be tricky; consider using a framework such as Google Benchmark or nanobench. These frameworks handle the most notorious microbenchmarking challenges, such as measurement variance, outliers, warm-up, and overhead subtraction. Setting up microbenchmarks will help you track your progress while optimizing a hotspot.
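For illustration, here is a minimal Google Benchmark sketch; the benchmarked function (BM_SumArray) and its workload are placeholders, not taken from this guide:

#include <benchmark/benchmark.h>
#include <vector>

// Google Benchmark repeats the timed loop until the measurement is
// statistically stable and reports the per-iteration cost.
static void BM_SumArray(benchmark::State& state) {
    std::vector<float> data(state.range(0), 1.0f);
    for (auto _ : state) {
        float sum = 0.0f;
        for (float x : data) sum += x;
        benchmark::DoNotOptimize(sum);  // keep the compiler from eliding the work
    }
}
BENCHMARK(BM_SumArray)->Arg(1 << 16);
BENCHMARK_MAIN();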
If possible, take advantage of vectorization, either through your compiler's auto-vectorization or by hand. SIMD code is generally faster and more power-efficient per unit of work than scalar code. If you are migrating vectorized x86_64 code, see CPU Instruction Extensions.
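As an example, here is a hand-vectorized NEON sketch of an element-wise add (the function is illustrative); a compiler at -O3 would typically auto-vectorize the equivalent scalar loop:

#include <arm_neon.h>

// Processes four float lanes per iteration, then handles the tail scalar-wise.
void addArrays(const float* a, const float* b, float* out, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];  // scalar tail
    }
}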
When parallelizing large workloads using OpenMP, also consider guided or dynamic scheduling to better balance tasks on a hybrid architecture. If tasks complete quickly, try increasing the chunk size to keep synchronization overhead under control.
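A minimal sketch of this advice (the function and the chunk size of 64 are illustrative; compile with -fopenmp):

#include <cstddef>

// dynamic scheduling hands out iterations in chunks: cores that finish early
// simply grab the next chunk, and a larger chunk size keeps scheduling
// overhead low when individual iterations are short.
void process(float* items, std::size_t n) {
    #pragma omp parallel for schedule(dynamic, 64)
    for (std::size_t i = 0; i < n; ++i) {
        items[i] *= 2.0f;  // placeholder per-item work
    }
}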
Pay attention to memory access patterns. If appropriate, use tiling methods to maintain data locality; note that DGX Spark is a hybrid architecture with varying cache sizes across its cores.
The cores in DGX Spark are organized as two clusters, each with up to 5 P-cores and 5 E-cores. Minimize concurrent writes to the same cache lines by different cores, especially cores in different clusters; cache lines are 64 bytes long. Keep thread synchronization to the necessary minimum.
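For example, padding per-thread data to the cache-line size avoids false sharing; the counter layout below is illustrative:

#include <cstdint>

// Bad: adjacent 8-byte counters share one 64-byte cache line, so concurrent
// increments from different cores (worse: different clusters) contend on it.
// std::uint64_t counters[16];

// Better: give each counter its own cache line.
struct alignas(64) PaddedCounter {
    std::uint64_t value = 0;
};
PaddedCounter counters[16];  // e.g. one per worker thread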
Compiler configuration#
DGX Spark is based on the ARMv9.2-A architecture.
Developers may choose to target an older architecture to maintain compatibility; in that case, targeting at least ARMv8.2-A with the rcpc extension is recommended.
gcc and clang#
A specific architecture can be targeted by adding the -march switch, e.g. -march=armv8.2-a.
When compiling on the Spark itself, -march=native will have the compiler target the host machine’s architecture.
Otherwise, -march=armv9.2-a can be used to specify Spark’s ARMv9.2-A architecture.
Starting with LLVM 21 and gcc 15, it is also possible to target DGX Spark specifically with -mcpu=gb10.
This enables optimizations for the Cortex-X925 and Cortex-A725 cores as well as the optional crypto extension and is thus preferred over a more generic -march switch.
When compiling on the Spark itself with one of these compiler versions, -march=native selects this target automatically.
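For example (source and output file names are illustrative):

# Compiling on DGX Spark itself:
g++ -O3 -march=native -o app app.cpp
# Explicitly targeting DGX Spark with gcc 15 / LLVM 21 or newer:
g++ -O3 -mcpu=gb10 -o app app.cpp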
When targeting an architecture older than ARMv8.2-A, it is advisable to also enable -moutline-atomics.
This option will replace inline atomics with calls to the best atomics available for the platform, as determined at runtime.
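An illustrative invocation targeting a generic ARMv8-A baseline:

g++ -O3 -march=armv8-a -moutline-atomics -o app app.cpp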
.NET applications#
ARM Advanced FP & SIMD (a.k.a. NEON) data types are available in modern .NET runtimes via Vector<T>, Vector128<T>, and related types.
Operations on these vectors, including those provided by specific ISA extensions, are exposed through the System.Runtime.Intrinsics.Arm namespace.
When targeting multiple architectures, the availability of a particular ISA extension can be determined at runtime, e.g. System.Runtime.Intrinsics.Arm.Dp.IsSupported.
Access to SVE instructions is currently experimental and may have implications when using NativeAOT because of the runtime-determined vector width. See dotnet issue 93095 and Engineering the Scalable Vector Extension in .NET for more information.
Memory reporting on UMA systems (including DGX Spark)#
DGX Spark systems use a unified memory architecture (UMA), where the GPU shares system memory (DRAM) with the CPU and other compute engines.
This design reduces latency and allows larger amounts of memory to be used for GPU workloads.
On UMA systems, the CPU can dynamically manage DRAM contents, for example by dropping page caches or moving pages out to the system's swap area.
However, the cudaMemGetInfo API does not account for memory that could potentially be reclaimed.
As a result, the memory size reported by cudaMemGetInfo may be smaller than the actual allocatable memory, since the CPU may be able to release additional DRAM pages.
To more accurately estimate the amount of allocatable device memory on DGX Spark platforms, CUDA application developers should consider the memory that can be reclaimed from the OS and not rely solely on the values returned by cudaMemGetInfo.
For a reference implementation using the C standard library, see the link below.
The snippet returns the maximum memory that can be allocated without swapping (availableMemory) and the maximum memory that can be allocated with swapping (availableMemory + freeSwap).
https://privatebin.nvidia.com/?07796e490ed5c7f6#upm5ok1suwYs1fN5nB5E6LHebWgbykmGpkZ3f8qPDkz
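A minimal sketch of the same idea, reading MemAvailable and SwapFree from /proc/meminfo (the helper name readMeminfoKb is hypothetical, and the linked reference implementation may differ in detail):

#include <cstdio>
#include <cstring>

// Returns the value of a /proc/meminfo entry (e.g. "MemAvailable") in kB,
// or -1 if the entry is not found.
static long long readMeminfoKb(const char* key) {
    FILE* f = fopen("/proc/meminfo", "r");
    if (!f) return -1;
    char line[256];
    long long value = -1;
    while (fgets(line, sizeof(line), f)) {
        size_t len = strlen(key);
        if (strncmp(line, key, len) == 0 && line[len] == ':') {
            sscanf(line + len + 1, "%lld", &value);
            break;
        }
    }
    fclose(f);
    return value;
}

int main() {
    // MemAvailable already includes reclaimable page-cache memory.
    long long availableMemory = readMeminfoKb("MemAvailable") * 1024;
    long long freeSwap = readMeminfoKb("SwapFree") * 1024;
    printf("Allocatable without swapping: %lld bytes\n", availableMemory);
    printf("Allocatable with swapping:    %lld bytes\n", availableMemory + freeSwap);
    return 0;
}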
As a workaround for debugging purposes, you can flush the buffer cache manually with the following command:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
After flushing the cache, restart your application.
Profiling#
Profiling your application is highly advisable before undertaking any optimization efforts.
The tool of choice on Linux platforms is perf.
Recording perf traces has security implications; you may need to unlock access to performance counters by adding a file to /etc/sysctl.d with the following content:
kernel.perf_event_paranoid=-1
kernel.kptr_restrict=0
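These settings can also be applied only until the next reboot, e.g. via sysctl:

sudo sysctl -w kernel.perf_event_paranoid=-1 kernel.kptr_restrict=0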
To install the perf tools, execute:
sudo apt install linux-tools-$(uname -r)
To record a trace for your application and launch the analyzer:
sudo perf record -e cycles,(...) ./your-application
perf report
For a listing of available performance monitoring counters, see Arm Cortex-X925 Performance monitoring events and Arm Cortex-A725 Performance monitoring events.