Compilers#
Many commercial and open-source compilers fully support NVIDIA Grace. This section provides information about the available compilers, the recommended versions, and the recommended command-line options.
NVIDIA HPC Compilers#
The NVIDIA HPC SDK includes proven compilers, libraries, and software tools. The HPC SDK compilers (NVHPC) enable cross-platform C, C++, and Fortran programming for NVIDIA GPUs and multicore Arm, OpenPOWER, or x86-64 CPUs. The compilers are ideal for HPC modeling and simulation applications that are written in C, C++, or Fortran with OpenMP, OpenACC, and NVIDIA CUDA®.
When building natively on Grace, NVHPC version 23.3 or later automatically optimizes for Grace without additional command-line options. To verify, pass the --version command-line option and look for -tp neoverse-v2 in the output:
nvidia@localhost:~$ nvc --version
nvc 23.3-0 linuxarm64 target on aarch64 Linux -tp neoverse-v2
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
The optimization and floating-point control flags for NVHPC are the same on NVIDIA Grace as on other CPUs. Refer to the NVIDIA HPC Compilers User’s Guide for more information.
Note
NVIDIA provides the BLAS, LAPACK, and FFT math libraries that are optimized for Grace, and we strongly recommend that you use them.
GNU Toolchain#
When using the GNU toolchain, we recommend GCC version 12.3 or later. GCC versions as old as 7 can be used on NVIDIA Grace, but because these older compilers target earlier Armv8-A architecture variants, performance will be suboptimal. The latest binary releases of the GNU toolchain are available from your GNU/Linux distribution or can be installed with Spack.
Note
When possible, always use the latest version of GCC.
Even with the latest version of GCC, additional command-line options are necessary to generate optimal code for NVIDIA Grace unless your toolchain was configured and built specifically for Grace. If no additional flags are provided, GCC generates code that targets a generic Armv8-A CPU. The recommended flags are provided in the following table.
Note
More aggressive optimizations will trade floating point accuracy for performance.
| Optimization Level | Flags | Notes |
|---|---|---|
| Aggressive | -Ofast -mcpu=neoverse-v2 | Enables fast math optimizations |
| Moderate | -O3 -mcpu=neoverse-v2 | Recommended in most cases |
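A quick way to see the practical difference between the Aggressive and Moderate rows is to check the __FAST_MATH__ macro, which GCC and Clang define when -ffast-math (implied by -Ofast) is in effect. The following minimal sketch is illustrative only; the messages are not part of any toolchain:
#include <stdio.h>

int main(void)
{
#ifdef __FAST_MATH__
    /* Defined when -ffast-math (implied by -Ofast) is active. */
    puts("fast-math enabled: results may deviate from strict IEEE 754");
#else
    puts("strict IEEE 754 floating-point semantics");
#endif
    return 0;
}
Compiling this file once with -O3 -mcpu=neoverse-v2 and once with -Ofast -mcpu=neoverse-v2 shows which mode each table row selects.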
The -mcpu=neoverse-v2 flag is used in all cases. We recommend the -mcpu flag instead of the -march and -mtune flags because -mcpu selects the target CPU directly, which is more convenient than specifying the architecture and its required extensions with -march and then the tuning model with -mtune. Refer to https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html#aarch64-feature-modifiers for more information about the instruction set features that can be enabled and disabled on a per-feature basis.
The __sync built-ins in GNU C and GNU C++ are precursors to the modern atomic extensions standardized in C11 and C++11. These built-ins are now considered legacy, and users should port their code to the C11 / C++11 atomics. This is good advice for any platform, but it is particularly relevant for CPUs that implement the AArch64 architecture because the legacy __sync built-ins tend to enforce stricter memory orderings than are necessary. Refer to https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html for more information.
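As a minimal sketch of that porting advice (the counters and the choice of memory ordering are illustrative, not a recommendation for any particular code base), the legacy and C11 forms look like this:
#include <stdatomic.h>
#include <stdio.h>

static int legacy_counter;        /* updated with a legacy __sync built-in */
static atomic_int modern_counter; /* updated with C11 atomics */

int main(void)
{
    /* Legacy form: implies a full barrier, stricter than AArch64 usually needs. */
    __sync_fetch_and_add(&legacy_counter, 1);

    /* C11 form: the ordering is explicit; relaxed ordering is enough for a plain
       statistics counter and lets the compiler emit cheaper AArch64 atomics. */
    atomic_fetch_add_explicit(&modern_counter, 1, memory_order_relaxed);

    printf("%d %d\n", legacy_counter, atomic_load(&modern_counter));
    return 0;
}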
The C standard does not specify the signedness of the char type. On x86, char is signed by default, and on Arm, char is unsigned. This difference can be addressed by using the fixed-width integer types that specify signedness when the sign of a value matters (for example, uint8_t and int8_t) or by compiling with the -fsigned-char flag to set the signedness of char at compile time.
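The following small example illustrates the difference; the byte value and messages are illustrative only:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const char bytes[] = "\xff";  /* a byte with the top bit set */
    int8_t value = -1;            /* explicitly signed 8-bit value, portable everywhere */

    if (bytes[0] < 0)
        puts("char is signed on this target (the x86 default)");
    else
        puts("char is unsigned on this target (the Arm default)");

    printf("int8_t value: %d\n", value);
    return 0;
}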
Refer to https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gcc/AArch64-Options.html for more information about the command-line options that are required for the AArch64 target in GCC.
LLVM Clang and Flang Compilers#
When you use LLVM, we recommend version 16 or later. LLVM compilers support Arm64 CPUs, but mainly for C and C++ (the clang and clang++ commands). LLVM’s Fortran compiler (flang) is not yet widely used and is still maturing. Like the GNU compilers, Clang prioritizes portability over performance, so additional flags must be added to enable optimization.
NVIDIA provides builds of LLVM Clang at developer.nvidia.com/grace/clang that are specially packaged for the Grace CPU. These builds are mainline Clang configured to support Grace, so they can be used as a drop-in replacement for Clang in your current workflows. The following table provides information about the optimization levels and flags.
| Optimization Level | Flags | Notes |
|---|---|---|
| Aggressive | -Ofast -mcpu=neoverse-v2 | Enables fast math optimizations. |
| Moderate | -O3 -mcpu=neoverse-v2 | Recommended in most cases. |
| Conservative | -O3 -ffp-contract=off -mcpu=neoverse-v2 | Disables fused math operations. |
The -mcpu=neoverse-v2 flag is used in all cases, and we recommend using the -mcpu flag instead of the -march and -mtune flags.
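To illustrate what the Conservative row changes, consider the following sketch, in which the input values are chosen only to make the rounding difference visible. At -O3 the expression a * x + y may be contracted into a single fused multiply-add, which rounds once; with -ffp-contract=off it remains a separately rounded multiply and add:
#include <stdio.h>

/* a * x + y is a candidate for fused multiply-add (FMA) contraction. */
static double axpy(double a, double x, double y)
{
    return a * x + y;
}

int main(void)
{
    /* volatile keeps the inputs out of compile-time constant folding. */
    volatile double a = 1.0 + 0x1p-27;
    volatile double x = 1.0 - 0x1p-27;
    volatile double y = -1.0;

    /* With contraction, the exact product (1 - 2^-54) flows into the add and the
       result is about -5.6e-17; without contraction, the product first rounds to
       1.0 and the result is exactly 0. */
    printf("%.17g\n", axpy(a, x, y));
    return 0;
}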
Arm Compiler for Linux and Other Commercial Compilers#
The Arm Compiler for Linux (ACfL) is a commercially supported, closed-source compiler provided by Arm. It is free of charge and is bundled with optimized BLAS, LAPACK, and FFT libraries. ACfL supports NVIDIA Grace through its support for the Neoverse-V2 CPU microarchitecture. Refer to Arm Compiler for Linux for more information about compiler options, flags, and support.
System vendors, such as HPE/Cray and Fujitsu, also provide compilers that target their own Arm-based products. The code generated by these vendor compilers tends to be highly tuned for the target platform, which makes them a good choice in performance-critical situations. Contact the system vendor for information and support.
Arm Architecture Feature Support#
As on other major CPU architectures, some instructions cannot be generated directly from portable C/C++ code. These instructions are usually exposed through compiler intrinsics, together with a set of feature macros that applications can use to test for specific architecture extensions at compile time. For the Arm architecture, the common set of intrinsics, additional data types, and architectural feature macros is defined by the Arm C / C++ Language Extensions (ACLE). Refer to ARM-software/acle for more information.
A full discussion of ACLE and intrinsics programming is out of scope for this document; refer to the ACLE specification and your compiler documentation for the supported compliance level.
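As a small example of using an ACLE feature macro together with an intrinsic (the input value and fallback message are illustrative), the following program uses the hardware CRC32C instruction only when the compiler reports the CRC32 extension:
#include <stdint.h>
#include <stdio.h>

#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>  /* ACLE intrinsics, including __crc32cd */
#endif

int main(void)
{
    uint64_t data = 0x0123456789abcdefULL;
    uint32_t crc  = 0xffffffffu;

#if defined(__ARM_FEATURE_CRC32)
    /* Single-instruction CRC32C over a 64-bit value via an ACLE intrinsic. */
    crc = __crc32cd(crc, data);
    printf("crc32c (hardware): 0x%08x\n", crc);
#else
    printf("CRC32 extension not enabled at compile time\n");
#endif
    return 0;
}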
NVIDIA Grace implements the Armv9-A architecture and several of the Armv9-A architectural extensions. To see which Arm architectural features are enabled at compile time, run the following command:
gcc -dM -E -mcpu=neoverse-v2 - < /dev/null | grep ARM_FEATURE
Here is an example of the output with GCC 12 on NVIDIA Grace:
nvidia@localhost:~$ gcc -dM -E -mcpu=native - < /dev/null | grep ARM_FEATURE | sort
#define __ARM_FEATURE_AES 1
#define __ARM_FEATURE_ATOMICS 1
#define __ARM_FEATURE_BF16_SCALAR_ARITHMETIC 1
#define __ARM_FEATURE_BF16_VECTOR_ARITHMETIC 1
#define __ARM_FEATURE_CLZ 1
#define __ARM_FEATURE_COMPLEX 1
#define __ARM_FEATURE_CRC32 1
#define __ARM_FEATURE_CRYPTO 1
#define __ARM_FEATURE_FMA 1
#define __ARM_FEATURE_FP16_FML 1
#define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1
#define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1
#define __ARM_FEATURE_FRINT 1
#define __ARM_FEATURE_IDIV 1
#define __ARM_FEATURE_JCVT 1
#define __ARM_FEATURE_MATMUL_INT8 1
#define __ARM_FEATURE_NUMERIC_MAXMIN 1
#define __ARM_FEATURE_QRDMX 1
#define __ARM_FEATURE_SHA2 1
#define __ARM_FEATURE_SHA3 1
#define __ARM_FEATURE_SHA512 1
#define __ARM_FEATURE_SM3 1
#define __ARM_FEATURE_SM4 1
#define __ARM_FEATURE_SVE 1
#define __ARM_FEATURE_SVE2 1
#define __ARM_FEATURE_SVE2_AES 1
#define __ARM_FEATURE_SVE2_BITPERM 1
#define __ARM_FEATURE_SVE2_SHA3 1
#define __ARM_FEATURE_SVE2_SM4 1
#define __ARM_FEATURE_SVE_BITS 0
#define __ARM_FEATURE_SVE_MATMUL_INT8 1
#define __ARM_FEATURE_SVE_VECTOR_OPERATORS 1
#define __ARM_FEATURE_UNALIGNED 1
Refer to the output of the following command for more information about target-specific flags on arm64:
gcc -Q --help=target
Here is the sample output:
nvidia@localhost:~$ gcc -Q --help=target
The following options are target specific:
-mabi= lp64
-march= armv8-a
-mbig-endian [disabled]
-mbionic [disabled]
-mbranch-protection=
-mcmodel= small
-mcpu= generic
-mfix-cortex-a53-835769 [enabled]
-mfix-cortex-a53-843419 [enabled]
-mgeneral-regs-only [disabled]
-mglibc [enabled]
-mharden-sls=
-mlittle-endian [enabled]
-mlow-precision-div [disabled]
-mlow-precision-recip-sqrt [disabled]
-mlow-precision-sqrt [disabled]
-mmusl [disabled]
-momit-leaf-frame-pointer [enabled]
-moutline-atomics [enabled]
-moverride=<string>
-mpc-relative-literal-loads [enabled]
-msign-return-address= none
-mstack-protector-guard-offset=
-mstack-protector-guard-reg=
-mstack-protector-guard= global
-mstrict-align [disabled]
-msve-vector-bits=<number> scalable
-mtls-dialect= desc
-mtls-size= 24
-mtrack-speculation [disabled]
-mtune= generic
-muclibc [disabled]
-mverbose-cost-dump [disabled]
Known AArch64 ABIs (for use with the -mabi= option): ilp32 lp64
Supported AArch64 return address signing scope (for use with the -msign-return-address= option): all non-leaf none
The code model option names for -mcmodel: large small tiny
Valid arguments to -mstack-protector-guard=: global sysreg
The possible SVE vector lengths: 1024 128 2048 256 512 scalable
The possible TLS dialects: desc trad
Using Code Locality to Improve Performance#
Improving executable code locality can increase efficiency on Grace because it improves the instruction cache hit rate, the iTLB hit rate, and branch prediction. Executables and large shared objects whose code is spread over a wide virtual address range are likely to see performance improvements from grouping frequently called functions into as few naturally aligned 2 MB virtual address ranges as possible. The perf record and perf script commands can help determine the program counter addresses observed over a span of time. To determine whether an application might be a candidate for this optimization, count the number of observed address ranges in the perf output.
For large applications or libraries that access more than 30 such ranges in quick succession, this type of optimization might yield speedups of as much as 50%. There are several ways to rearrange the linked binary or binaries so that frequently called functions are grouped together, or grouped with the functions they typically call. For example, some forms of automated Profile-Guided Optimization (PGO) might be beneficial in this scenario. The perf record / perf script output can also be used to capture the names of the most frequently called functions. By compiling with -ffunction-sections, the frequency-sorted list of observed function names can be used to produce a linker script that groups the “hot” functions near each other in memory, which achieves the same goal.
The scripts at NVIDIA/cpu-code-locality-tool can help automate the process of analyzing perf record output to identify candidates for this optimization and, where applicable, to produce the linker scripts described above.
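The following sketch shows the shape of the linker-script approach; the function names, the build line in the closing comment, and the section names are illustrative and are not produced by any specific tool:
/* hot_add() stands in for a function that perf record/perf script shows is called
   very frequently; cold_report() stands in for rarely executed code. */
#include <stdio.h>

int hot_add(int a, int b) { return a + b; }

void cold_report(int v) { printf("result: %d\n", v); }

int main(void)
{
    int s = 0;
    for (int i = 0; i < 1000000; ++i)
        s = hot_add(s, 1);
    cold_report(s);
    return 0;
}

/* Building with per-function sections, for example
     gcc -O3 -mcpu=neoverse-v2 -ffunction-sections example.c -o example
   places hot_add in its own .text.hot_add section, so a custom linker script can
   group it and other hot functions into the same naturally aligned 2 MB region. */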
Optimizations that decrease code size can also be beneficial because smaller code naturally spans fewer 2 MB ranges. For example, if you are using gcc -O3, consider adding -fno-ipa-cp-clone.