Appendix D: Platform-Dependent Workarounds
Some Grace platforms require temporary (or permanent) alterations to their configurations to work around known issues, such as hardware errata. These workarounds are described in the following sections by the corresponding Grace platform.
D.1 All Grace Platforms
CUDA Application Workaround
CUDA applications on the Grace-Hopper platform require ATS support. Currently, ATS is not enabled on the arm64 platform when IOMMU passthrough is enabled. NVIDIA is working with the Linux kernel community and SUSE to resolve this issue.
Linux provides the iommu.passthrough kernel parameter, which controls whether DMA bypasses the IOMMU (passthrough) or is translated by it. By default, SLES 15 sets the IOMMU to passthrough mode, which prevents CUDA applications from running. NVIDIA recommends that you add the iommu.passthrough=0 kernel parameter until this issue is resolved.
To permanently deploy this workaround so that it is always active upon boot:
1. With administrative privileges, edit the /etc/default/grub file.
2. Append the parameter iommu.passthrough=0 to the end of the list of kernel boot parameters specified in GRUB_CMDLINE_LINUX_DEFAULT.
3. Save the file and exit the editor.
4. Run the following command to update GRUB:
   sudo update-bootloader --refresh
5. Reboot the system.
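The edit to GRUB_CMDLINE_LINUX_DEFAULT described above could also be scripted. The following is a minimal sketch, not an official tool: the function name add_kernel_param is hypothetical, and it assumes that GRUB_CMDLINE_LINUX_DEFAULT is a single double-quoted line and that the parameter contains no characters that are special to sed. It leaves a backup copy of the file with a .bak suffix.

```shell
# add_kernel_param FILE PARAM  (hypothetical helper, for illustration only)
# Appends PARAM to the GRUB_CMDLINE_LINUX_DEFAULT line in FILE unless it is
# already present. Assumes the value is double-quoted on a single line.
add_kernel_param() {
    file=$1
    param=$2

    # Skip the edit if the parameter is already on the line (idempotent).
    if grep -E '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qwF -- "$param"; then
        return 0
    fi

    # Insert " PARAM" just before the closing quote; keep a .bak backup.
    sed -i.bak -E "s|^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"|\1 ${param}\"|" "$file"
}

# Example (run as root):
#   add_kernel_param /etc/default/grub iommu.passthrough=0
```

After the file is edited this way, GRUB still needs to be updated and the system rebooted as described in the steps above.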
To verify the workaround, run the following command:
   sudo dmesg | grep "iommu: Default"
The output should be similar to the following:
   iommu: Default domain type: Translated (set via kernel command line)
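For scripted checks, the domain type can be extracted from the kernel log and compared against Translated. This is a sketch under the assumption that the log line has the form shown above; the function names are hypothetical.

```shell
# iommu_default_mode (hypothetical helper)
# Reads kernel log text on stdin and prints the default IOMMU domain type
# (for example "Translated" or "Passthrough") from the last matching line.
iommu_default_mode() {
    sed -n 's/.*iommu: Default domain type: \([A-Za-z-]*\).*/\1/p' | tail -n 1
}

# iommu_workaround_active (hypothetical helper)
# Succeeds (exit status 0) when the log on stdin reports "Translated".
iommu_workaround_active() {
    [ "$(iommu_default_mode)" = "Translated" ]
}

# Example:
#   sudo dmesg | iommu_workaround_active && echo "IOMMU translation is enabled"
```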
If IOMMU passthrough is required for a specific device or IOMMU group, refer to the instructions in https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-iommu_groups
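Before changing any group, it can help to see the current domain type of each IOMMU group. The sketch below only reads the per-group type attribute described in that ABI document, where identity means the group's DMA is not translated (passthrough). It assumes a kernel that exposes /sys/kernel/iommu_groups/<group>/type; the function name and the optional root argument are hypothetical.

```shell
# list_iommu_group_types [ROOT]  (hypothetical helper, read-only)
# Prints "<group-id> <type>" for every IOMMU group that exposes a readable
# "type" attribute. ROOT defaults to /sys/kernel/iommu_groups.
list_iommu_group_types() {
    root=${1:-/sys/kernel/iommu_groups}
    for group in "$root"/*; do
        # Skip entries without a readable type file (or an unmatched glob).
        [ -r "$group/type" ] || continue
        printf '%s %s\n' "${group##*/}" "$(cat "$group/type")"
    done
}

# Example:
#   list_iommu_group_types
```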
Note
When this workaround is applied from the SLES Installer, it is automatically included in the installed system.
Refer to SUSE Modifying Kernel Boot Parameters for additional guidance about modifying kernel boot parameters.
Coresight ETM Boot Failure Workaround
Kernel versions 5.14.21-150500.55.49-64kb and later include a new coresight_etm4x module that might be incompatible with earlier firmware versions and can prevent the system from booting. NVIDIA recommends that you update the system firmware to the latest version.
If the system still cannot boot after the firmware update, you can deploy a workaround that prevents the coresight_etm4x module from loading.
To temporarily deploy this workaround for the duration of the current boot:
1. During boot, stop at the GRUB menu, select the boot entry, and press the e key to edit the entry.
2. Append module_blacklist=coresight_etm4x to the end of the list of kernel boot parameters.
3. Boot the entry by pressing Ctrl-X or F10.
To permanently deploy this workaround so that it is always active at boot time:
1. With administrative privileges, edit /etc/default/grub.
2. Append module_blacklist=coresight_etm4x to the end of the list of kernel boot parameters specified in GRUB_CMDLINE_LINUX_DEFAULT.
3. Save the file and exit the editor.
4. Run the following command to update GRUB:
   sudo update-bootloader --refresh
5. Reboot the system.
To permanently remove this workaround so that it is not active at boot time:
1. With administrative privileges, edit /etc/default/grub.
2. Remove module_blacklist=coresight_etm4x from the list of kernel boot parameters specified in GRUB_CMDLINE_LINUX_DEFAULT.
3. Save the file and exit the editor.
4. Run the following command to update GRUB:
   sudo update-bootloader --refresh
5. Reboot the system.
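Removing the parameter from GRUB_CMDLINE_LINUX_DEFAULT could be scripted as well. The following sketch is illustrative only: remove_kernel_param is a hypothetical name, the parameter is used as a regular expression, and the value is assumed to be a single double-quoted line with space-separated parameters. A backup copy is kept with a .bak suffix.

```shell
# remove_kernel_param FILE PARAM  (hypothetical helper, for illustration only)
# Deletes PARAM from the GRUB_CMDLINE_LINUX_DEFAULT line in FILE.
# PARAM is interpreted as a regular expression, so it should not contain
# characters such as '#', '[' or '*'.
remove_kernel_param() {
    file=$1
    param=$2
    sed -i.bak -E "/^GRUB_CMDLINE_LINUX_DEFAULT=/{
        s# ${param}([ \"])#\1#g
        s#\"${param} #\"#
        s#\"${param}\"#\"\"#
    }" "$file"
}

# Example (run as root):
#   remove_kernel_param /etc/default/grub module_blacklist=coresight_etm4x
```

The three substitutions cover the parameter in the middle or at the end of the list, at the start of the list, and as the only entry. GRUB still needs to be updated and the system rebooted afterward, as in the steps above.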
To verify the presence of the workaround:
Evaluate the kernel boot parameters set for the current boot:
   cat /proc/cmdline | grep coresight_etm4x
When nothing is returned, the workaround is not active.
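A slightly more thorough check can also confirm that the module is not loaded. The sketch below is an assumption-laden illustration: the function name and its optional file arguments are hypothetical, and it relies on /proc/modules listing one loaded module per line with the module name first. It also accepts the comma-separated form of module_blacklist.

```shell
# coresight_workaround_status [CMDLINE_FILE] [MODULES_FILE]  (hypothetical)
# Reports whether module_blacklist=coresight_etm4x is on the kernel command
# line and whether the coresight_etm4x module is currently loaded.
coresight_workaround_status() {
    cmdline_file=${1:-/proc/cmdline}
    modules_file=${2:-/proc/modules}

    # module_blacklist= takes a comma-separated list of module names.
    if grep -Eq 'module_blacklist=([^ ]*,)?coresight_etm4x(,| |$)' "$cmdline_file"; then
        echo "workaround: active"
    else
        echo "workaround: not active"
    fi

    if grep -q '^coresight_etm4x ' "$modules_file"; then
        echo "module: loaded"
    else
        echo "module: not loaded"
    fi
}

# Example:
#   coresight_workaround_status
```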
Note
When this workaround is applied using the temporary deployment method from the SLES Installer, it is automatically included in the installed system.
Refer to SUSE Modifying Kernel Boot Parameters for additional guidance about modifying kernel boot parameters.
SLES 15 SP7 GM Media Installation Workaround
The following crash might occur when you install from the SLES 15 SP7 GM Media ISO on a system with Mellanox devices that are configured in Ethernet mode:
[ 484.838961][ T7034] Internal error: Oops: 0000000096000004 [#1] SMP
[ 484.845378][ T7034] Modules linked in: hid_generic xfs isofs vfat fat usbhid btrfs xor xor_neon zlib_deflate raid6_pq
libcrc32c dm_multipath dm_mod 8021q garp mrp stp llc fan nfs lockd grace fscache netfs nls_iso8859_1 nls_cp437 af_packet
sg st sr_mod cdrom joydev iscsi_ibft iscsi_boot_sysfs sunrpc efivarfs mlx5_ib ib_uverbs macsec ib_core sd_mod uas usb_storage
cdc_ether usbnet mii aes_ce_blk aes_ce_cipher crct10dif_ce ghash_ce gf128mul sm4_ce_gcm sm4_ce_ccm sm4_ce sm4 sm3_ce nvme sm3
nvme_core xhci_pci sha3_ce xhci_pci_renesas nvme_keyring sha512_ce mlx5_core xhci_hcd acpi_ipmi nvme_auth sha512_arm64 mlxfw
t10_pi ipmi_ssif sha2_ce psample ipmi_devintf sha256_arm64 i2c_smbus crc64_rocksoft_generic usbcore igb tls ast crc64_rocksoft
sha1_ce sbsa_gwdt crc64 ipmi_msghandler usb_common i2c_algo_bit pci_hyperv_intf(X) gpio_tegra186 i2c_tegra thermal scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod squashfs lz4_decompress loop
[ 484.928841][ T7034] Supported: Yes, External
[ 484.933204][ T7034] CPU: 77 PID: 7034 Comm: ip Not tainted 6.4.0-150700.51-default #1 SLE15-SP7 f394bee0c1b116ec8321cc4732c29c26f89759dc
[ 484.945746][ T7034] Hardware name: NVIDIA, BIOS 02.04.11 20250611
[ 484.953752][ T7034] pstate: 83400009 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 484.961582][ T7034] pc : __alloc_pages+0x128/0x378
[ 484.966497][ T7034] lr : __alloc_pages+0xf0/0x378
[ 484.971303][ T7034] sp : ffff8000b3a2aff0
[ 484.975397][ T7034] x29: ffff8000b3a2b020 x28: ffff100072a61128 x27: ffff8000b3a2b118
[ 484.983405][ T7034] x26: ffffab142a18c000 x25: 0000000000000001 x24: 00000000000d20c0
[ 484.991410][ T7034] x23: ffffab142a6ef000 x22: 0000000000000000 x21: 0000000000000003
[ 484.999416][ T7034] x20: ffffab14299ce000 x19: 00000000000d20c0 x18: ffffffffffffffff
[ 485.007422][ T7034] x17: 0000000000000000 x16: ffffab14287a00d0 x15: 0000000000000000
[ 485.015428][ T7034] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 485.023435][ T7034] x11: ffff100013e2c800 x10: ffff100081d98000 x9 : ffffab14287f9280
[ 485.031441][ T7034] x8 : 0000000000000001 x7 : 0000000000000000 x6 : 0000000000000000
[ 485.039447][ T7034] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff5563c56081c0
[ 485.047453][ T7034] x2 : 0000000000000000 x1 : 0000000000000002 x0 : ffff5563c56081c0
[ 485.055459][ T7034] Call trace:
[ 485.058667][ T7034] __alloc_pages+0x128/0x378
[ 485.063208][ T7034] new_slab+0xf4/0x5a8
[ 485.067221][ T7034] ___slab_alloc+0x408/0x860
[ 485.071761][ T7034] __slab_alloc.isra.85+0x6c/0xb8
[ 485.076745][ T7034] __kmem_cache_alloc_node+0x120/0x2e8
[ 485.082174][ T7034] __kmalloc_node+0x64/0x258
[ 485.086716][ T7034] kvmalloc_node+0xdc/0x120
[ 485.091174][ T7034] mlx5e_open_channels+0x7c4/0x1068 [mlx5_core 9e35ee0294f530b612377d73303bc11ea4299d9e]
[ 485.101112][ T7034] mlx5e_open_locked+0x50/0x140 [mlx5_core 9e35ee0294f530b612377d73303bc11ea4299d9e]
[ 485.110668][ T7034] mlx5e_open+0x34/0x70 [mlx5_core 9e35ee0294f530b612377d73303bc11ea4299d9e]
[ 485.119508][ T7034] __dev_open+0x110/0x208
[ 485.123789][ T7034] __dev_change_flags+0x184/0x1f0
[ 485.128772][ T7034] dev_change_flags+0x2c/0x80
[ 485.133400][ T7034] do_setlink+0x3e0/0xf60
[ 485.137674][ T7034] __rtnl_newlink+0x63c/0x838
[ 485.142303][ T7034] rtnl_newlink+0x5c/0x98
[ 485.146575][ T7034] rtnetlink_rcv_msg+0x144/0x420
[ 485.151474][ T7034] netlink_rcv_skb+0x6c/0x160
[ 485.156108][ T7034] rtnetlink_rcv+0x20/0x38
[ 485.160469][ T7034] netlink_unicast+0x1fc/0x2c8
[ 485.165186][ T7034] netlink_sendmsg+0x2e8/0x420
[ 485.169903][ T7034] sock_sendmsg+0x68/0xc8
[ 485.174181][ T7034] ____sys_sendmsg+0x2a4/0x320
[ 485.178899][ T7034] ___sys_sendmsg+0x98/0x108
[ 485.183438][ T7034] __sys_sendmsg+0x74/0xe8
[ 485.187799][ T7034] __arm64_sys_sendmsg+0x28/0x40
[ 485.192693][ T7034] invoke_syscall+0x74/0x100
[ 485.197239][ T7034] el0_svc_common.constprop.1+0x84/0x1a8
[ 485.202846][ T7034] do_el0_svc+0x38/0x88
[ 485.206941][ T7034] el0_svc+0x3c/0x170
[ 485.210866][ T7034] el0t_64_sync_handler+0x9c/0xc0
[ 485.215850][ T7034] el0t_64_sync+0x1a4/0x1a8
[ 485.220304][ T7034] Code: 381f03a0 b85ec3a1 aa0303e0 b5000fa2 (b9400864)
[ 485.227244][ T7034] ---[ end trace 0000000000000000 ]---
NVIDIA recommends that you use the SLES 15 SP7 QU1 Media ISO, which includes a fix for this issue.
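To judge whether a system matches the affected configuration, you could list the network interfaces that are bound to the mlx5_core driver and report an Ethernet link type from an already running Linux environment on that system. This is only a rough sketch: the function name is hypothetical, and it assumes the standard sysfs layout in which /sys/class/net/<interface>/type is 1 for Ethernet and 32 for InfiniBand.

```shell
# list_mlx5_ethernet_ifaces [NET_ROOT]  (hypothetical helper, read-only)
# Prints the names of network interfaces that use the mlx5_core driver and
# have an Ethernet link type. NET_ROOT defaults to /sys/class/net.
list_mlx5_ethernet_ifaces() {
    net_root=${1:-/sys/class/net}
    for iface in "$net_root"/*; do
        # Virtual interfaces (for example lo) have no device/driver link.
        [ -L "$iface/device/driver" ] || continue
        driver=$(basename "$(readlink "$iface/device/driver")")
        [ "$driver" = "mlx5_core" ] || continue
        # ARPHRD_ETHER is 1; InfiniBand interfaces report 32.
        [ "$(cat "$iface/type" 2>/dev/null)" = "1" ] || continue
        printf '%s\n' "${iface##*/}"
    done
}

# Example:
#   list_mlx5_ethernet_ifaces
```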