Appendix B: Platform-Dependent Workarounds
Some Grace platforms require temporary (or permanent) changes to their configuration to work around known issues, such as hardware errata. The following sections group these workarounds by Grace platform.
Note
Depending on the kernel support for your distro, at least one of these workarounds might be required to install Linux on the Grace platform.
B.1 All Grace Platforms
AST2600 BMC Workaround
Distros that are based on a Linux kernel earlier than v6.4 and that do not carry this ast driver patch require a workaround on all Grace platforms to avoid undefined behavior. Without this patch, a variety of issues can occur, such as kernel hangs and distorted output from the on-board VGA port.
To work around this issue, NVIDIA recommends denylisting the ast driver.
For most distros, this can be accomplished by adding the modprobe.blacklist=ast kernel parameter or by creating a file in /etc/modprobe.d that contains a denylist directive. As a side effect of this workaround, the on-board VGA port is inaccessible, so a serial console solution (for example, SOL) must be used for console access to the system.
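The file-based option can be sketched as follows. This is a minimal illustration, not an NVIDIA-provided script: the helper names and the file name denylist-ast.conf are hypothetical. The `blacklist` directive is the standard modprobe.d syntax, and modprobe.d reads files that end in `.conf`.

```shell
# Sketch only: function names and the file name denylist-ast.conf are
# hypothetical choices for illustration.

# Writes a denylist directive for the ast driver into directory $1
# (default: /etc/modprobe.d, which requires root privileges).
write_ast_denylist() {
    dir="${1:-/etc/modprobe.d}"
    printf 'blacklist ast\n' > "$dir/denylist-ast.conf"
}

# Succeeds if the ast module appears in lsmod-style output read from stdin.
ast_loaded() {
    grep -q '^ast[[:space:]]'
}

# Example (as root):     write_ast_denylist
# Example after reboot:  lsmod | ast_loaded && echo "ast is still loaded"
```

Depending on the distro, the initramfs might also need to be regenerated before the denylist takes effect during early boot. If the driver still loads after a reboot, the modprobe.blacklist=ast kernel parameter is the alternative described above.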
CUDA Application Workaround
CUDA applications on the Grace-Hopper platform require Address Translation Services (ATS) support. Distros that are based on a Linux kernel version earlier than v6.11 do not carry this patch, so ATS is disabled on the arm64 platform when IOMMU passthrough is enabled.
If your Linux distro meets the above condition and sets the IOMMU to passthrough mode by default (for example, with CONFIG_IOMMU_DEFAULT_PASSTHROUGH=y), CUDA applications are prevented from running. NVIDIA recommends that you add the iommu.passthrough=0 kernel parameter until this issue is resolved.
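A quick way to see whether this condition might apply is to compare the running kernel release against v6.11 and to look for the parameter on the kernel command line. The sketch below uses hypothetical helper names and assumes GNU `sort -V` is available. A version comparison is only a hint, because a distro kernel can carry backported patches and might not default to passthrough mode.

```shell
# Sketch only: helper names are hypothetical, and GNU `sort -V` is assumed.

# Succeeds if version $1 sorts strictly before version $2.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Succeeds if kernel parameter $1 appears verbatim in command line $2.
has_kernel_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints a hint for kernel release $1 and kernel command line $2.
iommu_passthrough_hint() {
    if ! version_lt "$1" "6.11"; then
        echo "kernel is v6.11 or later: workaround not expected to be needed"
    elif has_kernel_param "iommu.passthrough=0" "$2"; then
        echo "iommu.passthrough=0 already set"
    else
        echo "consider adding iommu.passthrough=0"
    fi
}

# Example: iommu_passthrough_hint "$(uname -r)" "$(cat /proc/cmdline)"
```

How the parameter is added persistently depends on the distro and boot loader. On distros that ship grubby, for instance, `grubby --update-kernel=ALL --args="iommu.passthrough=0"` is one option; treat that as an assumption about your environment and consult your distro documentation.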
B.2 Multi-Socket Grace Platforms
NVIDIA Hardware Erratum T241-FABRIC-4 Workaround
Distros that are based on a Linux kernel earlier than v6.4 and that do not carry this patch require a workaround on Grace systems with three- and four-socket configurations. Without this patch, a variety of issues can occur, such as kernel hangs, timeouts, and other undefined behavior. Refer to NVIDIA hardware erratum T241-FABRIC-4 for more information.
To work around this issue, NVIDIA recommends restricting the system to a one-socket configuration. For most distros, this can be accomplished by adding the nr_cpus=72 kernel parameter, which limits the number of cores supported by the Linux kernel to the number of cores found in a Grace processor. As a side effect of this workaround, the server is reduced to the compute capacity of a single socket: one third of a three-socket system or one quarter of a four-socket system.
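The capacity cost of this workaround follows directly from the core counts, and the limit can be verified after a reboot. The sketch below uses hypothetical helper names and takes the 72 cores per socket from the nr_cpus=72 value above.

```shell
# Sketch only: helper names are hypothetical.

# Cores in one Grace processor, per the nr_cpus=72 value above.
CORES_PER_SOCKET=72

# Prints the share of cores left online, in percent (rounded down),
# for a system with $1 sockets.
remaining_capacity_percent() {
    total=$(( $1 * CORES_PER_SOCKET ))
    echo $(( 100 * CORES_PER_SOCKET / total ))
}

# Succeeds if online CPU count $1 fits within one socket.
cpu_limit_in_effect() {
    [ "$1" -le "$CORES_PER_SOCKET" ]
}

# Example: remaining_capacity_percent 4
# Example: cpu_limit_in_effect "$(nproc)" && echo "nr_cpus limit in effect"
```

For a four-socket system, 72 of 288 cores remain online, which is 25 percent.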
Caution
Be careful when profiling a system with this temporary workaround.