Operating System Settings#

This chapter provides information about the operating system settings.

Page Size#

Grace supports 64K and 4K Linux kernel page sizes. To configure your Linux kernel with the page size that suits your business needs, change the following kconfig settings during the kernel compilation:

  • 4K page size: CONFIG_ARM64_4K_PAGES=y

  • 64K page size: CONFIG_ARM64_64K_PAGES=y

The 64K page size can benefit applications that allocate large amounts of memory because it results in fewer page faults, better TLB hit rates, and improved efficiency.
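
To confirm the page size of the running kernel, you can query it with getconf; the value is reported in bytes (65536 for a 64K kernel and 4096 for a 4K kernel):

getconf PAGESIZE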

Note

The recommended default value for the page size is 64K.

Huge Pages#

Huge pages might be beneficial to applications that allocate large chunks of memory, and the main benefit is fewer TLB misses.

You can use huge pages on Grace systems in the following ways:

  • Transparent Huge Pages (THP)

    • Transparent to the application.

    • Mostly automatic with a few available kernel tuning parameters.

    • When using the recommended 64 KB page size, THP pages are currently too large for practical use in most applications (refer to Transparent Huge Pages for more information).

  • Hugetlbfs

    • Does not suffer from fragmentation concerns or from allocation latency because the huge pages are preallocated and indivisible.

    • Requires application modification.

    • Requires sysadmin setup.

Transparent Huge Pages#

Transparent Huge Pages (THP) is completely transparent to applications, so applications can get the benefit of huge pages without changing their source code. As of kernel version 6.5, only 512 MB THP pages are supported when a 64 KB system page size is configured. If 512 MB THP is too large for your application, consider using hugetlbfs as described in Hugetlbfs.

Refer to Transparent Hugepage Support for more information about THP.
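
To inspect the current THP mode and the THP page size on a running system, you can read the sysfs interface; hpage_pmd_size is reported in bytes:

cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size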

Note

The default huge page size is related to the kernel page size (refer to HugeTLBpage on ARM64 for more information).

Proactive Compaction#

Proactive compaction reduces the allocation latency of huge pages by performing the defragmentation work preemptively in the background. It does not change the probability of obtaining a huge page, but it changes how quickly you can get one.

  • Without proactive compaction, the kernel returns huge pages until it runs out of them, and the application then experiences a performance cliff while the kernel defragments the memory on demand. Proactive compaction smooths out this process.

  • With proactive compaction, when the system reaches a threshold of memory fragmentation, the kernel begins to defragment memory pages in the background, which prevents you from running out of huge pages and hitting a performance cliff.

Proactive compaction exposes a tunable, /proc/sys/vm/compaction_proactiveness, which accepts values in the [0, 100] range and defaults to 20. This tunable determines how aggressively the kernel compacts memory in the background; an aggressive value can increase address translation latency. The default value of 20 is reasonable and should only be changed based on performance data.

To limit the overhead of proactive compaction, you can use the on-demand compaction method, which is available only when CONFIG_COMPACTION is set. When 1 is written to the /proc/sys/vm/compact_memory file, all zones are compacted so that free memory is available in contiguous blocks where possible. This can be important, for example, when allocating huge pages, because it also directly compacts memory as required. Refer to Documentation for /proc/sys/vm/ for more information.
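
For example, the following commands read the current proactiveness value, adjust it (the value 30 is only illustrative), and trigger an on-demand compaction pass:

cat /proc/sys/vm/compaction_proactiveness
echo 30 | sudo tee /proc/sys/vm/compaction_proactiveness
echo 1 | sudo tee /proc/sys/vm/compact_memory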

Hugetlbfs#

By using hugetlbfs, pools of hugetlb pages can be preallocated, and applications can use the huge pages in these pools. However, this requires changes in the applications. You can specify the minimum number of huge pages that are reserved by the system, specify how big the pool can grow, and configure malloc to use hugetlbfs for an application. We strongly recommend that you test your application with hugetlbfs and, if it works, use it. The benefit of reserving a pool of huge pages at boot time is that at boot time the memory is not yet fragmented, so there is a greater chance that the requested number of huge pages can be assembled. Refer to HugeTLB Pages for more information.
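
As a sketch, a pool of huge pages of the default huge page size can be reserved at runtime through sysctl; the count of 128 is illustrative only and should be sized for your workload:

sudo sysctl -w vm.nr_hugepages=128
grep HugePages /proc/meminfo

Alternatively, adding hugepages=128 to the kernel command line reserves the pool at boot, before memory becomes fragmented.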

Configuring Linux Perf#

Refer to Configuring Perf for more information.

Performance Governor#

You can set the CPU governor using the cpupower command. For example, to set the CPU governor to Performance, run the following command:

sudo cpupower frequency-set -g performance
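
To verify the active governor, you can read it back with cpupower or from sysfs (the sysfs path below assumes the standard cpufreq layout and checks CPU 0 only):

cpupower frequency-info -p
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor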

Note

On certain distributions, like Ubuntu, the cpufrequtils package provides a cpufrequtils service that might change the CPU governor to ondemand when the system boots. To avoid this behavior, users can disable this service by running the sudo systemctl disable cpufrequtils command.

Init on Alloc#

The CONFIG_INIT_ON_ALLOC_DEFAULT_ON kernel configuration option controls whether the kernel fills newly allocated pages and heap objects with zeroes by default. You can override this setting with the init_on_alloc=[0|1] kernel parameter.

On coherent systems, such as Grace Hopper, where GPU memory is exposed as system memory, this option can have a heavy performance impact on cudaMalloc() operations.

Note

On Grace Hopper, the recommended setting is the init_on_alloc=0 parameter.

Not all distros set the CONFIG_INIT_ON_ALLOC_DEFAULT_ON option in their kernels. For example, the SUSE and RHEL kernels do not currently set this option, but the Ubuntu -generic kernel does. To check whether init_on_alloc appears on the kernel command line of a system, run the following command:

grep init_on_alloc /proc/cmdline

Here is the output:

BOOT_IMAGE=/boot/vmlinuz-6.2.0-1010-nvidia-64k
root=UUID=7123054d-9b18-4c3d-8844-c538c751b59a ro
rd.driver.blacklist=nouveau nouveau.modeset=0 earlycon
module_blacklist=nouveau acpi_power_meter.force_cap_on=y
numa_balancing=disable init_on_alloc=0 preempt=none
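
If the parameter does not appear on the command line, the kernel build default applies. On distros that ship the kernel configuration under /boot, you can check that default as follows:

grep CONFIG_INIT_ON_ALLOC_DEFAULT_ON /boot/config-$(uname -r)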

Input-Output Memory Management Unit Passthrough#

The Input-Output Memory Management Unit (IOMMU) is a hardware component that translates I/O device virtual addresses (also called I/O virtual addresses (IOVAs)) to physical addresses. Different platforms have different IOMMUs, such as the DMA Remapping Reporting (DMAR) used by the Intel IOMMU and the System Memory Management Unit (SMMU) used by the Arm platform.

Linux provides the iommu.passthrough kernel parameter, which controls whether DMA uses the IOMMU for address translation. Setting iommu.passthrough to 1 on the kernel command line bypasses IOMMU translation for DMA, and setting it to 0 uses IOMMU translation for DMA. Some applications might see performance benefits in bare metal and Virtual Machine (VM) environments when iommu.passthrough is set to 1.

This value must be set at deployment time (in the kernel configuration) or by editing the appropriate grub configuration files, and the system must be rebooted for the changes to take effect. To add the kernel parameter, complete the steps for your distro:

Ubuntu

  1. Create the /etc/default/grub.d/iommu_passthrough.cfg file with the following contents:

GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX iommu.passthrough=1"

  2. Run the following commands:

sudo update-grub
sudo reboot

RedHat

  1. Run the following commands:

    sudo grubby --update-kernel=ALL --args="iommu.passthrough=1"
    sudo reboot
    

SUSE

  1. Edit the /etc/default/grub file.

  2. On the line that contains the GRUB_CMDLINE_LINUX string, append the iommu.passthrough=1 parameter, and run the following commands:

    sudo update-bootloader --refresh
    sudo reboot
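
After the reboot, you can confirm on any of these distros that the parameter took effect by checking the running kernel command line:

grep -o "iommu.passthrough=[01]" /proc/cmdline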
    

PCIe Access Control Service#

Baremetal Systems#

IO virtualization (also known as VT-d or IOMMU on x86 and SMMU on Arm64) can interfere with NVIDIA GPUDirect by redirecting PCI point-to-point traffic to the CPU root complex, which causes a significant performance reduction or even a hang. You can check whether ACS is enabled on PCI bridges by running the following command:

$ sudo lspci -vvv | grep ACSCtl

If any line shows SrcValid+, ACS might be enabled. To determine which PCI bridge has ACS enabled, look at the full lspci output:

$ sudo lspci -vvv

If PCI switches have ACS enabled, ACS needs to be disabled. On some systems, this task can be completed from the BIOS by disabling IO virtualization or VT-d. For Broadcom devices, it can be completed from the OS, but the task needs to be repeated after each reboot.

  1. To find the PCI bus IDs of Broadcom PCI bridges, run the following command:

$ sudo lspci | grep -i "Broadcom"

  2. Use setpci to disable ACS with the following command, and replace 03:00.0 with the PCI bus ID of each PCI bridge:

$ sudo setpci -s 03:00.0 ECAP_ACS+0x6.w=0000

You can also use a script like the following:

for BDF in `lspci -d "*:*:*" | awk '{print $1}'`; do
   # skip if it doesn't support ACS
   sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1
   if [ $? -ne 0 ]; then
      continue
   fi
   sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000
done
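
After disabling ACS, you can re-run the earlier check; if the writes took effect, the ACSCtl lines for those bridges should report SrcValid- instead of SrcValid+:

$ sudo lspci -vvv | grep ACSCtl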

Virtual Machines#

The functional and performant configuration has PCIe ACS enabled and PCIe ATS disabled.

Automatic NUMA Scheduling and Balancing#

On a Grace Hopper system, we recommend that you do not use the Automatic NUMA Scheduling and Balancing (AutoNUMA) features of the Linux kernel. This is because of the additional page faults introduced by AutoNUMA, which can significantly reduce the performance of GPU-heavy applications.

  • To see the status of AutoNUMA, use cat /proc/sys/kernel/numa_balancing.

  • If the output is 1, AutoNUMA is enabled; if it is 0, it is disabled.

  • To disable AutoNUMA in a session, use echo 0 > /proc/sys/kernel/numa_balancing.

  • To disable AutoNUMA permanently, use echo "kernel.numa_balancing = 0" >> /etc/sysctl.conf.
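
As an alternative to writing the proc file directly, the same setting can be read and changed through sysctl; this is only a sketch of the equivalent commands:

sysctl kernel.numa_balancing
sudo sysctl -w kernel.numa_balancing=0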

Swap File Size#

This section applies only to Grace Hopper systems.

If an application allocates a large enough fraction of CPU memory, the kernel might decide to migrate some pages (possibly from third-party applications) from the CPU memory to the GPU memory. This behavior has been observed in applications that use deep learning frameworks, which might use nearly all of the CPU memory during initialization. Currently, if this memory was moved by the kernel to the GPU memory, it can only be reclaimed through a swap file. We recommend that, if possible, you have a large enough swap file for these scenarios.

Note

On a Grace Hopper system with sufficient disk space, we recommend that you use a swap file of at least a quarter to half of the aggregate GPU memory size in the system.
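
As a sketch, a swap file can be created with standard tools; the 240G size below is purely illustrative and should be sized to your system's aggregate GPU memory, and fallocate assumes a filesystem, such as ext4, that supports it for swap files:

sudo fallocate -l 240G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

To make the swap file persistent across reboots, add a corresponding entry to /etc/fstab.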

Libvirt Network#

  • For bare metal deployments, if libvirt is installed, libvirt adds iptables rules to allow traffic to/from guests attached to the virbr0 device in the INPUT, FORWARD, OUTPUT and POSTROUTING chains. If virtualization isn’t required, removing these rules improves network performance.

  • For bare metal environments where libvirt isn’t required, disabling the libvirt service and removing the libvirt-installed iptables rules can improve network performance.

To remove the libvirt rules, stop the libvirtd service or manually delete the networking rules.

Warning

Only disable the libvirt service or remove the iptables rules in bare metal environments. These steps will break existing libvirt deployments.

  • Stop the libvirtd service

    If there are no virtual machines and virtualization isn’t required, disable the libvirtd.service and reboot to clear the rules and delete the virbr0 interface.

    sudo systemctl stop libvirtd
    sudo systemctl disable libvirtd
    sudo reboot now
    
  • Delete only the networking rules

    Another way to delete these rules is to run the following commands to deactivate the libvirt network. This action will remove the virbr0 bridge, terminate the dnsmasq process, and remove the iptables rules.

    Deactivate the libvirt network named default:

    virsh net-destroy default
    

    Prevent the network from automatically starting on boot:

    virsh net-autostart --network default --disable
    

    To revert the changes, reactivate the default libvirt network:

    virsh net-start default
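
After stopping the service or destroying the default network, you can confirm that the virbr0 bridge is gone with a quick check; the command should report that the device does not exist:

ip link show virbr0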
    

Note

If libvirt is required, avoid NAT and consider using NIC passthrough or SR-IOV to the virtual machines for optimal performance. Performance optimization with libvirt is outside the scope of this guide.