Appendix D: Platform-Dependent Workarounds
Some Grace platforms require temporary (or permanent) alterations to their configurations to work around known issues, such as hardware errata. These workarounds are described in the following sections by the corresponding Grace platform.
D.1 All Grace Platforms
CUDA Application Workaround
CUDA applications on the Grace-Hopper platform require ATS support. Currently, ATS is not enabled on the arm64 platform when IOMMU passthrough is enabled. NVIDIA is working with the Linux kernel community and SUSE to resolve this issue.
Linux provides the iommu.passthrough kernel parameter, which controls whether DMA is translated by the IOMMU or bypasses it. By default, SLES 15 sets the IOMMU to passthrough mode, which prevents CUDA applications from running. NVIDIA recommends that you add the iommu.passthrough=0 kernel parameter until this issue is resolved.
To permanently deploy this workaround so that it is always active upon boot:
With administrative privileges, edit the /etc/default/grub file.
Append the iommu.passthrough=0 parameter to the end of the list of kernel boot parameters specified in GRUB_CMDLINE_LINUX_DEFAULT.
Save the file and exit the editor.
Run the following command to update GRUB:
sudo update-bootloader --refresh
Reboot the system.
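The manual edit in the steps above can also be scripted. The following is a minimal sketch, not an NVIDIA or SUSE tool: add_kernel_param is a hypothetical helper that appends a parameter to GRUB_CMDLINE_LINUX_DEFAULT only when it is not already present. It assumes GNU sed and that the variable is defined on a single double-quoted line.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not part of SLES or GRUB).
# Usage: add_kernel_param FILE PARAM
# Assumes GRUB_CMDLINE_LINUX_DEFAULT="..." sits on one double-quoted line
# and that PARAM contains no '|', '&', or backslash characters.
add_kernel_param() {
    file=$1
    param=$2

    # Refuse to edit a file that does not define the variable as expected.
    if ! grep -q '^GRUB_CMDLINE_LINUX_DEFAULT="' "$file"; then
        echo "GRUB_CMDLINE_LINUX_DEFAULT not found in $file" >&2
        return 1
    fi

    # Extract the current value between the double quotes.
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"[[:space:]]*$/\1/p' "$file")

    # Do nothing if the parameter is already in the list.
    case " $current " in
        *" $param "*) return 0 ;;
    esac

    # Insert the parameter just before the closing quote.
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\".*\)\"[[:space:]]*\$|\1 ${param}\"|" "$file"
}

# Example (run with administrative privileges):
#   add_kernel_param /etc/default/grub iommu.passthrough=0
```

After the file is updated, GRUB still has to be refreshed and the system rebooted as described in the steps above.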
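To verify the workaround, run the following command:
sudo dmesg | grep "iommu: Default"
The output should resemble the following, where Translated indicates that passthrough mode is disabled:
iommu: Default domain type: Translated (set via kernel command line)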
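For automated checks, the same log line can be parsed in a script. This is a sketch under the assumption that the kernel log contains a line of the form shown above; iommu_default_domain and iommu_is_translated are hypothetical helper names, not existing commands.

```shell
#!/bin/sh
# Hypothetical helpers (illustration only).

# Reads kernel log text on stdin and prints the default IOMMU domain type
# (for example "Translated" or "Passthrough"). Prints nothing if no
# matching line is found.
iommu_default_domain() {
    sed -n 's/.*iommu: Default domain type: \([A-Za-z-]*\).*/\1/p' | tail -n 1
}

# Succeeds only if the log on stdin reports the Translated domain type.
iommu_is_translated() {
    [ "$(iommu_default_domain)" = "Translated" ]
}

# Example:
#   sudo dmesg | iommu_is_translated && echo "workaround active"
```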
If IOMMU passthrough is required for a specific device or IOMMU group, refer to the instructions in https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-iommu_groups
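As a starting point for per-group configuration, the current domain type of a device's IOMMU group can be read from sysfs. The sketch below assumes the sysfs layout described in the linked ABI document (an iommu_group symlink under the PCI device and a type file under the group); iommu_group_type is a hypothetical helper, and the optional second argument exists only so that the function can be exercised against a test directory.

```shell
#!/bin/sh
# Hypothetical helper (illustration only).
# Usage: iommu_group_type PCI_ADDRESS [SYSFS_ROOT]
# Prints the default domain type of the IOMMU group that contains the
# given PCI device, for example "DMA" or "identity".
iommu_group_type() {
    bdf=$1
    sysfs=${2:-/sys}

    # The iommu_group entry is a symlink whose last component is the group ID.
    link=$(readlink "$sysfs/bus/pci/devices/$bdf/iommu_group") || return 1
    group=${link##*/}

    cat "$sysfs/kernel/iommu_groups/$group/type"
}

# Example (the PCI address is a placeholder):
#   iommu_group_type 0000:01:00.0
```

According to the linked ABI document, the type can be changed by writing a new value to the same file, subject to the constraints described there.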
Note
When this workaround is applied from the SLES Installer, it is automatically included in the installed system.
Refer to SUSE Modifying Kernel Boot Parameters for additional guidance about modifying kernel boot parameters.
Coresight ETM Boot Failure Workaround
Kernel versions 5.14.21-150500.55.49-64kb and later include a new coresight_etm4x module that might be incompatible with earlier firmware versions and can prevent the system from booting. NVIDIA recommends that you update the system firmware to the latest version.
If the system still cannot boot after the firmware update, you can deploy a workaround that prevents the coresight_etm4x module from loading.
To temporarily deploy this workaround for the duration of the current boot:
During boot, stop at the GRUB menu, select the boot entry, and press the e key to edit the entry.
Append module_blacklist=coresight_etm4x to the end of the list of kernel boot parameters.
Boot the entry by pressing Ctrl-X or F10.
To permanently deploy this workaround so that it is always active at boot time:
With administrative privileges, edit /etc/default/grub.
Append module_blacklist=coresight_etm4x to the end of the list of kernel boot parameters specified in GRUB_CMDLINE_LINUX_DEFAULT.
Save the file and exit the editor.
Run the following command to update GRUB:
sudo update-bootloader --refresh
Reboot the system.
To permanently remove this workaround so that it is not active at boot time:
With administrative privileges, edit /etc/default/grub.
Remove module_blacklist=coresight_etm4x from the list of kernel boot parameters specified in GRUB_CMDLINE_LINUX_DEFAULT.
Save the file and exit the editor.
Run the following command to update GRUB:
sudo update-bootloader --refresh
Reboot the system.
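The removal step above can be scripted in the same spirit. The following sketch uses a hypothetical remove_kernel_param helper that drops one exact parameter from GRUB_CMDLINE_LINUX_DEFAULT and leaves every other line untouched. It assumes the variable is defined on a single double-quoted line and that the parameter contains no backslashes.

```shell
#!/bin/sh
# Hypothetical helper (illustration only).
# Usage: remove_kernel_param FILE PARAM
remove_kernel_param() {
    file=$1
    param=$2
    tmp=$(mktemp) || return 1

    awk -v param="$param" '
        /^GRUB_CMDLINE_LINUX_DEFAULT="/ {
            # Strip the variable name and the surrounding quotes.
            value = $0
            sub(/^GRUB_CMDLINE_LINUX_DEFAULT="/, "", value)
            sub(/"[ \t]*$/, "", value)

            # Rebuild the list without the parameter being removed.
            n = split(value, words, " ")
            out = ""
            for (i = 1; i <= n; i++)
                if (words[i] != param)
                    out = out (out == "" ? "" : " ") words[i]

            print "GRUB_CMDLINE_LINUX_DEFAULT=\"" out "\""
            next
        }
        { print }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?

    rm -f "$tmp"
    return "$status"
}

# Example (run with administrative privileges):
#   remove_kernel_param /etc/default/grub module_blacklist=coresight_etm4x
```

Writing the result back with cat rather than mv keeps the ownership and permissions of the original file. GRUB must still be updated and the system rebooted afterward.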
To verify the presence of the workaround:
Check the kernel boot parameters that are set for the current boot:
grep coresight_etm4x /proc/cmdline
If the command prints the kernel command line, the workaround is active. If nothing is returned, the workaround is not active.
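A plain grep also matches unrelated parameters that happen to contain the module name. The sketch below is a stricter check that looks specifically for the module in a module_blacklist= entry, which the kernel accepts as a comma-separated list. module_is_blacklisted is a hypothetical helper; the optional second argument lets the check run against a supplied command line instead of /proc/cmdline.

```shell
#!/bin/sh
# Hypothetical helper (illustration only).
# Usage: module_is_blacklisted MODULE [CMDLINE]
# Succeeds if MODULE appears in a module_blacklist= parameter.
# Assumes MODULE contains only letters, digits, and underscores.
module_is_blacklisted() {
    module=$1
    cmdline=${2:-$(cat /proc/cmdline)}

    # Put each parameter on its own line, then match the module as a
    # whole entry of the comma-separated blacklist.
    printf '%s\n' "$cmdline" | tr ' ' '\n' |
        grep -Eq "^module_blacklist=(.*,)?${module}(,.*)?\$"
}

# Example:
#   if module_is_blacklisted coresight_etm4x; then
#       echo "workaround active"
#   fi
```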
Note
When this workaround is applied using the temporary deployment method from the SLES Installer, it is automatically included in the installed system.
Refer to SUSE Modifying Kernel Boot Parameters for additional guidance about modifying kernel boot parameters.