Operating System Settings#

This chapter provides information about the operating system settings.

Page Size#

Grace supports 64K and 4K Linux kernel page sizes. To configure your Linux kernel with the page size that suits your business needs, change the following kconfig settings during the kernel compilation:

  • 4K page size: CONFIG_ARM64_4K_PAGES=y

  • 64K page size: CONFIG_ARM64_64K_PAGES=y

The 64K page size can benefit applications that allocate large amounts of memory because it results in fewer page faults, better TLB hit rates, and improved efficiency.
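
To confirm the page size of the running kernel, you can query it with getconf; the value is reported in bytes (65536 for a 64K kernel and 4096 for a 4K kernel):

getconf PAGESIZE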

Note

The recommended default value for the page size is 64K.

Huge Pages#

Huge pages might be beneficial to applications that allocate large chunks of memory, and the main benefit is fewer TLB misses.

You can use huge pages on Grace systems in the following ways:

  • Transparent Huge Pages (THP)

    • Transparent to the application.

    • Mostly automatic with a few available kernel tuning parameters.

    • When using the recommended 64 KB page size, THP pages are currently too large for practical use in most applications (refer to Transparent Huge Pages for more information).

  • Hugetlbfs

    • Does not suffer from fragmentation concerns or from allocation latency because the huge pages are preallocated and indivisible.

    • Requires application modification.

    • Requires sysadmin setup.

Transparent Huge Pages#

Transparent Huge Pages (THP) is completely transparent to applications, so applications can get the benefit of huge pages without changing their source code. As of kernel version 6.5, only 512 MB THP pages are supported when a 64 KB system page size is configured. If 512 MB THP is too large for your application, consider using hugetlbfs as described in Hugetlbfs.

Refer to Transparent Hugepage Support for more information about THP.
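
To inspect the current THP mode and the THP page size on a running system, you can read the sysfs interface; hpage_pmd_size is reported in bytes:

cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size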

Note

The default huge page size is related to the kernel page size (refer to HugeTLBpage on ARM64 for more information).

Proactive Compaction#

Proactive compaction reduces the allocation latency of huge pages by performing the defragmentation work preemptively in the background. It does not change the probability of obtaining a huge page, but it changes how quickly you can get one.

  • Without proactive compaction, the kernel returns huge pages until it runs out of them, and the application then experiences a performance cliff while the kernel defragments the memory on demand. Proactive compaction smooths out this process.

  • With proactive compaction, when the system reaches a threshold of memory fragmentation, the kernel begins to defragment memory pages in the background, which prevents you from running out of huge pages and hitting a performance cliff.

Proactive compaction exposes a tunable, /proc/sys/vm/compaction_proactiveness, which accepts values in the [0, 100] range and defaults to 20. This tunable determines how aggressively the kernel compacts memory in the background; an aggressive value can increase address translation latency. The default value of 20 is reasonable and should only be changed based on performance data.

To limit the overhead of proactive compaction, you can use the on-demand compaction method, which is available only when CONFIG_COMPACTION is set. When 1 is written to the /proc/sys/vm/compact_memory file, all zones are compacted so that free memory is available in contiguous blocks where possible. This can be important, for example, when allocating huge pages, because it also directly compacts memory as required. Refer to Documentation for /proc/sys/vm/ for more information.
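
For example, the following commands read the current proactiveness value, adjust it (the value 30 is only illustrative), and trigger an on-demand compaction pass:

cat /proc/sys/vm/compaction_proactiveness
echo 30 | sudo tee /proc/sys/vm/compaction_proactiveness
echo 1 | sudo tee /proc/sys/vm/compact_memory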

Hugetlbfs#

By using hugetlbfs, pools of hugetlb pages can be preallocated, and applications can use the huge pages in these pools. However, this requires changes in the applications. You can specify the minimum number of huge pages that are reserved by the system, specify how big the pool can grow, and configure malloc to use hugetlbfs for an application. We strongly recommend that you test your application with hugetlbfs and, if it works, use it. The benefit of reserving a pool of huge pages at boot time is that at boot time the memory is not yet fragmented, so there is a greater chance that the requested number of huge pages can be assembled. Refer to HugeTLB Pages for more information.
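
As a sketch, a pool of huge pages of the default huge page size can be reserved at runtime through sysctl; the count of 128 is illustrative only and should be sized for your workload:

sudo sysctl -w vm.nr_hugepages=128
grep HugePages /proc/meminfo

Alternatively, adding hugepages=128 to the kernel command line reserves the pool at boot, before memory becomes fragmented.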

Configuring Linux Perf#

Refer to Configuring Perf for more information.

Performance Governor#

You can set the CPU governor using the cpupower command. For example, to set the CPU governor to Performance, run the following command:

sudo cpupower frequency-set -g performance
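
To verify the active governor, you can read it back with cpupower or from sysfs (the sysfs path below assumes the standard cpufreq layout and checks CPU 0 only):

cpupower frequency-info -p
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor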

Note

On certain distributions, like Ubuntu, the cpufrequtils package provides a cpufrequtils service that might change the CPU governor to ondemand when the system boots. To avoid this behavior, users can disable this service by running the sudo systemctl disable cpufrequtils command.

Init on Alloc#

The CONFIG_INIT_ON_ALLOC_DEFAULT_ON kernel configuration option controls whether the kernel fills newly allocated pages and heap objects with zeroes by default. You can override this setting with the init_on_alloc=[0|1] kernel parameter.

On coherent systems, such as Grace Hopper, where GPU memory is exposed as system memory, this option can have a heavy performance impact on cudaMalloc() operations.

Note

On Grace Hopper, the recommended setting is the init_on_alloc=0 parameter.

Not all distros set the CONFIG_INIT_ON_ALLOC_DEFAULT_ON option in their kernels. For example, the SUSE and RHEL kernels do not currently set this option, but the Ubuntu -generic kernel does. To check whether init_on_alloc appears on the kernel command line of a system, run the following command:

grep init_on_alloc /proc/cmdline

Here is the output:

BOOT_IMAGE=/boot/vmlinuz-6.2.0-1010-nvidia-64k
root=UUID=7123054d-9b18-4c3d-8844-c538c751b59a ro
rd.driver.blacklist=nouveau nouveau.modeset=0 earlycon
module_blacklist=nouveau acpi_power_meter.force_cap_on=y
numa_balancing=disable init_on_alloc=0 preempt=none
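
If the parameter does not appear on the command line, the kernel build default applies. On distros that ship the kernel configuration under /boot, you can check that default as follows:

grep CONFIG_INIT_ON_ALLOC_DEFAULT_ON /boot/config-$(uname -r)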

Input-Output Memory Management Unit Passthrough#

The Input-Output Memory Management Unit (IOMMU) is a hardware component that translates I/O device virtual addresses (also called I/O virtual addresses (IOVAs)) to physical addresses. Different platforms have different IOMMUs, such as the DMA Remapping Reporting (DMAR) used by the Intel IOMMU and the System Memory Management Unit (SMMU) used by the Arm platform.

Linux provides the iommu.passthrough kernel parameter, which controls whether DMA uses the IOMMU for address translation. Setting iommu.passthrough to 1 on the kernel command line bypasses IOMMU translation for DMA, and setting it to 0 uses IOMMU translation for DMA. Some applications might see performance benefits in bare metal and Virtual Machine (VM) environments when iommu.passthrough is set to 1.

This value must be set at deployment time (in the kernel configuration) or by editing the appropriate grub configuration files, and the system must be rebooted for the changes to take effect. To add the kernel parameter, complete the steps for your distro:

Ubuntu

  1. Create the /etc/default/grub.d/iommu_passthrough.cfg file with the following contents:

GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX iommu.passthrough=1"

  2. Run the following commands:

sudo update-grub
sudo reboot

RedHat

  1. Run the following commands:

    sudo grubby --update-kernel=ALL --args="iommu.passthrough=1"
    sudo reboot
    

SUSE

  1. Edit the /etc/default/grub file.

  2. On the line that contains the GRUB_CMDLINE_LINUX string, append the iommu.passthrough=1 parameter, and run the following commands:

    sudo update-bootloader --refresh
    sudo reboot
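
After the reboot, you can confirm on any of these distros that the parameter took effect by checking the running kernel command line:

grep -o "iommu.passthrough=[01]" /proc/cmdline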
    

PCIe Access Control Service#

Baremetal Systems#

IO virtualization (also known as VT-d or IOMMU on x86 and SMMU on Arm64) can interfere with NVIDIA GPUDirect by redirecting PCI point-to-point traffic to the CPU root complex, which causes a significant performance reduction or even a hang. You can check whether ACS is enabled on PCI bridges by running the following command:

$ sudo lspci -vvv | grep ACSCtl

If any line shows SrcValid+, ACS might be enabled. To determine which PCI bridge has ACS enabled, look at the full lspci output:

$ sudo lspci -vvv

If PCI switches have ACS enabled, ACS needs to be disabled. On some systems, this task can be completed from the BIOS by disabling IO virtualization or VT-d. For Broadcom devices, it can be completed from the OS, but the task needs to be repeated after each reboot.

  1. To find the PCI bus IDs of Broadcom PCI bridges, run the following command:

$ sudo lspci | grep -i "Broadcom"

  2. Use setpci to disable ACS with the following command, and replace 03:00.0 with the PCI bus ID of each PCI bridge:

$ sudo setpci -s 03:00.0 ECAP_ACS+0x6.w=0000

You can also use a script like the following:

for BDF in `lspci -d "*:*:*" | awk '{print $1}'`; do
   # skip if it doesn't support ACS
   sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1
   if [ $? -ne 0 ]; then
      continue
   fi
   sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000
done
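
After disabling ACS, you can re-run the earlier check; if the writes took effect, the ACSCtl lines for those bridges should report SrcValid- instead of SrcValid+:

$ sudo lspci -vvv | grep ACSCtl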

Virtual Machines#

The functional and performant configuration has PCIe ACS enabled and PCIe ATS disabled.

Automatic NUMA Scheduling and Balancing#

On a Grace Hopper system, we recommend that you do not use the Automatic NUMA Scheduling and Balancing (AutoNUMA) features of the Linux kernel. This is because of the additional page faults introduced by AutoNUMA, which can significantly reduce the performance of GPU-heavy applications.

  • To see the status of AutoNUMA, use cat /proc/sys/kernel/numa_balancing.

  • If the output is 1, AutoNUMA is enabled; if it is 0, it is disabled.

  • To disable AutoNUMA in a session, use echo 0 > /proc/sys/kernel/numa_balancing.

  • To disable AutoNUMA permanently, use echo "kernel.numa_balancing = 0" >> /etc/sysctl.conf.
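
As an alternative to writing the proc file directly, the same setting can be read and changed through sysctl; this is only a sketch of the equivalent commands:

sysctl kernel.numa_balancing
sudo sysctl -w kernel.numa_balancing=0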

Swap File Size#

This section applies only to Grace Hopper systems.

If an application allocates a large enough fraction of CPU memory, the kernel might decide to migrate some pages (possibly from third-party applications) from the CPU memory to the GPU memory. This behavior has been observed in applications that use deep learning frameworks, which might use nearly all of the CPU memory during initialization. Currently, if this memory was moved by the kernel to the GPU memory, it can only be reclaimed through a swap file. We recommend that, if possible, you have a large enough swap file for these scenarios.

Note

On a Grace Hopper system with sufficient disk space, we recommend that you use a swap file of at least a quarter to half of the aggregate GPU memory size in the system.
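
As a sketch, a swap file can be created with standard tools; the 240G size below is purely illustrative and should be sized to your system's aggregate GPU memory, and fallocate assumes a filesystem, such as ext4, that supports it for swap files:

sudo fallocate -l 240G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

To make the swap file persistent across reboots, add a corresponding entry to /etc/fstab.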

Libvirt Network#

  • For bare metal deployments, if libvirt is installed, libvirt adds iptables rules to allow traffic to/from guests attached to the virbr0 device in the INPUT, FORWARD, OUTPUT and POSTROUTING chains. If virtualization isn’t required, removing these rules improves network performance.

  • For bare metal environments where libvirt isn’t required, disabling the libvirt service and removing the libvirt-installed iptables rules can improve network performance.

To remove the libvirt rules, stop the libvirtd service or manually delete the networking rules.

Warning

Only disable the libvirt service or remove the iptables rules in bare metal environments. These steps will break existing libvirt deployments.

  • Stop the libvirtd service

    If there are no virtual machines and virtualization isn’t required, disable the libvirtd.service and reboot to clear the rules and delete the virbr0 interface.

    sudo systemctl stop libvirtd
    sudo systemctl disable libvirtd
    sudo reboot now
    
  • Delete only the networking rules

    Another way to delete these rules is to run the following commands to deactivate the libvirt network. This action will remove the virbr0 bridge, terminate the dnsmasq process, and remove the iptables rules.

    Deactivate the libvirt network named default:

    virsh net-destroy default
    

    Prevent the network from automatically starting on boot:

    virsh net-autostart --network default --disable
    

    To revert the changes, reactivate the default libvirt network:

    virsh net-start default
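
After stopping the service or destroying the default network, you can confirm that the virbr0 bridge is gone with a quick check; the command should report that the device does not exist:

ip link show virbr0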
    

Note

If libvirt is required, avoid NAT and consider using NIC passthrough or SR-IOV to the virtual machines for optimal performance. Performance optimization with libvirt is outside the scope of this guide.