Known Issues#
RShim Devices Not Created After Starting rshim.service#
Issue#
RShim devices might not appear under /dev/rshim* because the RShim interface is in use by another entity.
Workaround#
Edit /etc/rshim.conf and uncomment the line that reads #FORCE_MODE 1 (so that it reads FORCE_MODE 1). Then restart the RShim service:
sudo systemctl restart rshim.service
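The two steps above can be scripted. The following is a minimal sketch, assuming the configuration line appears exactly as #FORCE_MODE 1 in /etc/rshim.conf:

```shell
#!/bin/sh
# Uncomment "#FORCE_MODE 1" in rshim.conf so the RShim driver takes over
# the interface even when another entity currently owns it.
CONF=/etc/rshim.conf

# Drop the leading "#" from the FORCE_MODE line (no-op if already uncommented).
sudo sed -i 's/^#\s*FORCE_MODE 1/FORCE_MODE 1/' "$CONF"

# Restart the service so the new setting takes effect.
sudo systemctl restart rshim.service
```

After the restart, the RShim devices should appear under /dev/rshim*.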
Firmware Crash on DGX GB200 During Boot#
Issue#
On DGX GB200 systems, you might observe a firmware crash during some reboots, resulting in an extra reboot. The crash output is visible on the serial console.
Log of events seen for this issue:
INFO: th500_ras_intr_handler: External Abort reason=0 syndrome=0xbe000411 flags=0x1
ERROR: spmd_ffa_direct_message failed (4294967292) on CPU121
(the preceding ERROR line repeats many times)
ASSERT: plat/nvidia/tegra/soc/th500/plat_ras.c:408
BACKTRACE: START: assert
0: EL3: 0x78732b2648
1: EL3: 0x78732b0294
2: EL3: 0x78732c45fc
3: EL3: 0x78732c481c
4: EL3: 0x78732bda18
5: EL3: 0x78732b9f18
6: EL3: 0x78732b1258
BACKTRACE: END: assert
Workaround#
There is currently no workaround for this issue.
The nvidia_peermem Module Does Not Load After an OTA Update#
Issue#
After you perform an OTA update, the nvidia-peermem-loader package is installed, but the nvidia_peermem module is not loaded. As a result, the following error messages appear in the dmesg log:
nvidia_peermem: disagrees about version of symbol ib_register_peer_memory_client
nvidia_peermem: Unknown symbol ib_register_peer_memory_client (err -22)
Workaround#
This issue occurs because updating the mlnx-ofed-kernel package requires rebuilding the nvidia-peermem DKMS module against the new version of the mlnx-ofed-kernel package.
The following commands rebuild the necessary modules:
# Determine the installed nvidia DKMS module/version for the running kernel.
MODULE_VERSION=$(dkms status nvidia -k $(uname -r) | cut -d "," -f1) || true
if [ -n "${MODULE_VERSION}" ]; then
    # Remove and reinstall the module so it is rebuilt against the new packages.
    sudo dkms remove -m ${MODULE_VERSION} -k $(uname -r) || true
    sudo dkms install -m ${MODULE_VERSION} -k $(uname -r) || true
fi
DGX Station A100 Failed to Boot After Applying MIG Configurations#
Issue#
After MIG configurations were successfully applied to a DGX Station A100 system running DGX OS 7.0.2, the system failed to boot when you ran the sudo reboot command. Resetting the GPUs by performing a DC power cycle could not recover the system.
Workaround#
The DGX OS 7.0.2 release does not support the DGX Station A100 system with MIG enabled. To resolve the boot failure, install DGX OS 6.3.2 on the system and then apply MIG configurations.
The systemd-modules-load Service Failed to Insert the nvidia_peermem Module#
Issue#
On a DGX Station A100 or DGX Station A800, if you installed a Base OS version earlier than 7.0.2, the nvidia-peermem-loader package might have been installed on the system. As a result, the following error occurs:
$ sudo systemctl status systemd-modules-load.service
(code=exited, status=1/FAILURE)
...
systemd-modules-load[2143]: Failed to insert module 'nvidia_peermem': Invalid argument
Workaround#
To avoid this failure, remove the nvidia-peermem-loader package:
sudo apt purge nvidia-peermem-loader
BMC Redfish Interface Not Active on the First Boot After Installation#
Issue#
Reported in DGX OS 7.0.0.
When the DGX system booted for the first time after the DGX OS 7.0.0 installation, the BMC Redfish network interface was not renamed or autoconfigured correctly, as reported by the ip a command.
Workaround#
During the installation of DGX OS 7.0.x, the Redfish interface might not be configured correctly with the proper interface name or IP address. To resolve this issue, run the following command to reconfigure the interface:
sudo /usr/sbin/configure-redfish-intf.bash
No Permissions to Access /var/run/nvidia-fabricmanager for Non-Root Users on DGX B200#
Issue#
After DGX OS 7.0.x is installed on the DGX B200 system, the access that non-root accounts require to the /var/run/nvidia-fabricmanager directory of the Fabric Manager service is not set. This can cause failures when running the HPL benchmark and NCCL tests as a non-root user.
Workaround#
Change the permissions on the /var/run/nvidia-fabricmanager/
directory by running the following commands:
# If Fabric Manager is running, stop it.
sudo systemctl stop nvidia-fabricmanager.service
# Change the permission setting to 755.
sudo chmod 755 /var/run/nvidia-fabricmanager/
# Start Fabric Manager.
sudo systemctl start nvidia-fabricmanager.service
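To confirm that the permission change took effect, you can check the directory mode with stat. A minimal sketch, using the same path as the workaround above:

```shell
#!/bin/sh
# Verify that /var/run/nvidia-fabricmanager is world-readable and executable
# (mode 755) so non-root processes such as the HPL benchmark and NCCL tests
# can reach the Fabric Manager runtime files.
DIR=/var/run/nvidia-fabricmanager

MODE=$(stat -c '%a' "$DIR")
if [ "$MODE" = "755" ]; then
    echo "OK: $DIR has mode $MODE"
else
    echo "WARNING: $DIR has mode $MODE, expected 755"
fi
```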
Kernel OOPS When Activating VFs from SR-IOV Network Operator#
Issue#
A kernel oops occurs when activating 16 or more Virtual Functions (VFs) for a single parent interface or Physical Function (PF).
Primary issue impacts:
- Shell users: shell users cannot use the standard ip command for network interface troubleshooting on interfaces with 16 VFs. There are no stats, status, and so on.
- CPU cycles/log space: resource consumption is a concern. The dmesg logs roll over in minutes or seconds, preventing proper debugging.
Two failures occur on the network-operator side:
- ib-sriov process failure: this happened once and is reflected in the log of events below.
- sriov-network-config-daemon failure: happens continuously.
Both failures produce similar symptoms. The following failure was observed in the sriov-network-config-daemon logs:
DiscoverSriovDevices(): unable to get Link for device, skipping {"device":
"0000:81:00.0", "error": "message too long"}
Workaround#
Create fewer than 16 VFs for a PF.
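On Linux, the VF count for a PF is set through the sriov_numvfs sysfs attribute. The following is a minimal sketch that keeps the count below 16; the interface name enp129s0f0 is a placeholder for your actual PF:

```shell
#!/bin/sh
# Cap the number of Virtual Functions on a Physical Function below 16
# to avoid the kernel oops described above.
PF=enp129s0f0          # placeholder: substitute your parent interface (PF)
NUM_VFS=15             # any value below 16

SYSFS=/sys/class/net/$PF/device/sriov_numvfs

# The kernel requires resetting the count to 0 before a new non-zero
# VF count is accepted.
echo 0          | sudo tee "$SYSFS" > /dev/null
echo "$NUM_VFS" | sudo tee "$SYSFS" > /dev/null
```

If you manage VFs through the SR-IOV Network Operator instead, apply the same limit in its node policy.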
Log of events seen for this issue:
[128435.409604] WARNING: CPU: 113 PID: 1061323 at net/core/rtnetlink.c:3867 rtnl_getlink+0x43a/0x470
[128435.409610] Modules linked in: xt_set xt_multiport ipt_rpfilter ip_set_hash_net ip_set_hash_ip ip_set veth ipip tunnel4 ip_tunnel wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel nf_conntrack_netlink xt_addrtype xt_statistic xt_nat xt_MASQUERADE xt_mark xt_nfacct ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_chain_nat nfnetlink_acct nvidia_uvm(OE) overlay xt_conntrack xt_comment nft_compat nf_tables rpcsec_gss_krb5 qrtr auth_rpcgss rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) binfmt_misc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac skx_edac_common nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel nvidia_drm(OE) iaa_crypto nvidia_modeset(OE) cmdlinepart kvm spi_nor pmt_telemetry irqbypass mtd intel_sdsi pmt_class nvidia(OE) rapl intel_cstate video dax_hmem ecc intel_th_gth idxd mei_me isst_if_mmio isst_if_mbox_pci i2c_i801 intel_th_pci
[128435.409640] spi_intel_pci isst_if_common idxd_bus intel_vsec ast mei intel_th i2c_smbus i2c_ismt spi_intel cxl_acpi cxl_core input_leds joydev mac_hid knem(OE) ipmi_devintf ipmi_msghandler dm_multipath msr iptable_filter iptable_mangle iptable_nat xt_owner xt_REDIRECT nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nvme_fabrics efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 nls_iso8859_1 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 bonding e1000 mpt3sas raid_class sata_sil i40e ahci libahci xfs nfsv4 mptsas forcedeth udf crc_itu_t aacraid dm_thin_pool dm_persistent_data dm_bufio megaraid_sas isofs mptspi bnx2 mptscsih mptbase hpilo igb i2c_algo_bit bnxt_en e1000e reiserfs br_netfilter bridge stp llc megaraid arcmsr dm_bio_prison hpsa scsi_transport_sas btrfs blake2b_generic xor raid6_pq ixgbevf bnx2x libcrc32c tg3 sata_svw jfs nls_ucs2_utils aic7xxx sata_nv nfsv3 nfs_acl nfs lockd grace sunrpc netfs igbvf aic79xx scsi_transport_spi mlx5_ib(OE) ib_uverbs(OE) macsec
[128435.409680] ib_core(OE) hid_generic usbhid hid cdc_ether usbnet mii uas usb_storage mlx5_core(OE) crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic mlxfw(OE) ice ghash_clmulni_intel sha256_ssse3 psample sha1_ssse3 nvme mlxdevm(OE) ixgbe aesni_intel crypto_simd nvme_core tls xfrm_algo cryptd xhci_pci dca mlx_compat(OE) gnss nvme_auth mdio pci_hyperv_intf wmi xhci_pci_renesas pinctrl_emmitsburg [last unloaded: ipmi_msghandler]
[128435.409695] CPU: 113 PID: 1061323 Comm: ib-sriov Tainted: G W OE 6.8.0-51-generic #52-Ubuntu
[128435.409698] Hardware name: , BIOS 5.32 10/24/2024
[128435.409699] RIP: 0010:rtnl_getlink+0x43a/0x470
[128435.409701] Code: c7 a0 f2 76 8c e8 c6 c5 06 00 4d 85 ed 0f 84 e2 fc ff ff 49 c7 45 00 a0 f2 76 8c e9 d5 fc ff ff b8 ea ff ff ff e9 c4 fe ff ff <0f> 0b e9 a4 fe ff ff 48 c7 c7 d0 f2 76 8c e8 93 c5 06 00 4d 85 ed
[128435.409703] RSP: 0018:ff29f566a62bb700 EFLAGS: 00010246
[128435.409704] RAX: 00000000ffffffa6 RBX: ff24d605ed223480 RCX: 0000000000000000
[128435.409706] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[128435.409706] RBP: ff29f566a62bb9d8 R08: 0000000000000000 R09: 0000000000000000
[128435.409707] R10: 0000000000000000 R11: ff24d60420219e00 R12: ff24d6042021a200
[128435.409708] R13: ffffffff8df7ecc0 R14: ffffffff8df7ecc0 R15: 00000000ffffffff
[128435.409709] FS: 00000000007a9d10(0000) GS:ff24d6fdbe480000(0000) knlGS:0000000000000000
[128435.409710] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[128435.409711] CR2: 000000c000428000 CR3: 00000004a559a006 CR4: 0000000000f71ef0
[128435.409713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[128435.409713] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[128435.409715] PKRU: 55555554
[128435.409715] Call Trace:
[128435.409716] <TASK>
[128435.409718] ? show_regs+0x6d/0x80
[128435.409720] ? __warn+0x89/0x160
[128435.409722] ? rtnl_getlink+0x43a/0x470
[128435.409724] ? report_bug+0x17e/0x1b0
[128435.409727] ? handle_bug+0x51/0xa0
[128435.409730] ? exc_invalid_op+0x18/0x80
[128435.409732] ? asm_exc_invalid_op+0x1b/0x20
[128435.409735] ? rtnl_getlink+0x43a/0x470
[128435.409740] rtnetlink_rcv_msg+0x16d/0x430
[128435.409742] ? apparmor_file_alloc_security+0x43/0x1f0
[128435.409747] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[128435.409749] netlink_rcv_skb+0x5a/0x110
[128435.409752] rtnetlink_rcv+0x15/0x30
[128435.409754] netlink_unicast+0x24a/0x390
[128435.409756] netlink_sendmsg+0x214/0x470
[128435.409758] __sys_sendto+0x21b/0x230
[128435.409761] __x64_sys_sendto+0x24/0x40
[128435.409763] x64_sys_call+0x1b2d/0x25a0
[128435.409765] do_syscall_64+0x7f/0x180
[128435.409767] ? handle_pte_fault+0x1cb/0x1d0
[128435.409770] ? __handle_mm_fault+0x653/0x790
[128435.409772] ? __count_memcg_events+0x6b/0x120
[128435.409776] ? count_memcg_events.constprop.0+0x2a/0x50
[128435.409779] ? handle_mm_fault+0xad/0x380
[128435.409781] ? do_user_addr_fault+0x333/0x670
[128435.409783] ? irqentry_exit_to_user_mode+0x7b/0x260
[128435.409786] ? irqentry_exit+0x43/0x50
[128435.409787] ? exc_page_fault+0x94/0x1b0
[128435.409789] entry_SYSCALL_64_after_hwframe+0x78/0x80
[128435.409791] RIP: 0033:0x40328e
[128435.409812] Code: 48 89 6c 24 38 48 8d 6c 24 38 e8 0d 00 00 00 48 8b 6c 24 38 48 83 c4 40 c3 cc cc cc 49 89 f2 48 89 fa 48 89 ce 48 89 df 0f 05 <48> 3d 01 f0 ff ff 76 15 48 f7 d8 48 89 c1 48 c7 c0 ff ff ff ff 48
[128435.409813] RSP: 002b:000000c00036b1b8 EFLAGS: 00000206 ORIG_RAX: 000000000000002c
ACCESS_REG Command Failure with Err(-22)#
Issue#
After the initial installation of DGX OS 7.0.1 on a DGX B200 system, the following non-destructive issue has been seen on every boot. This is due to the node_exporter attempting to get telemetry from PF0 and PF1, causing a dmesg error message similar to the example below to be written to the kernel log once every 30 seconds. This might cause the kernel log to fill up.
[11176.517416] mlx5_core 0000:05:00.0: mlx5_cmd_out_err:835:(pid 18360): ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), syndrome (0x9a6171), err(-22)
[11176.534892] mlx5_core 0000:05:00.0: mlx5_cmd_out_err:835:(pid 18360): ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), syndrome (0x9a6171), err(-22)
[11176.589308] mlx5_core 0000:05:00.1: mlx5_cmd_out_err:835:(pid 10354): ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), syndrome (0x9a6171), err(-22)
[11176.607052] mlx5_core 0000:05:00.1: mlx5_cmd_out_err:835:(pid 10354): ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), syndrome (0x9a6171), err(-22)
Workaround#
Access to PF0 and PF1 is restricted. There is currently no workaround for this issue.
nv-disk-encrypt Failed on Pre-Owned NVMe Drives#
Issue#
Using the nv-disk-encrypt tool to initialize the system for NVMe drive encryption failed with the following error messages:
takeOwnership failed
SED takeownership failed on /dev/nvme0n1
Workaround#
To resolve the issue, recover from lost keys and erase the drives as shown in the following single-drive example:
1. Take ownership of all drives, one at a time:
sudo sedutil-cli --takeownership <your-sid-password> /dev/nvme6n1
2. If step 1 fails, specify the PSID to reset the drive using the sedutil-cli command.
Caution
Before performing this step, back up your important data to another location because this operation deletes everything on the drive.
You can obtain the PSID, which is printed on the label, by physically examining the drive.
For example:
sudo sedutil-cli --yesIreallywanttoERASEALLmydatausingthePSID <your-drive-psid> /dev/nvme6n1
3. Verify that taking ownership now succeeds:
sudo sedutil-cli --takeownership <your-sid-password> /dev/nvme6n1
4. Revert ownership before the initialization process for drive encryption:
sudo sedutil-cli --reverttper <your-sid-password> /dev/nvme6n1
5. Initialize the system for drive encryption using the nv-disk-encrypt init command:
sudo nv-disk-encrypt init [-k <your-vault-password>] [-f <path/to/json-file>] [-g] [-r]
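The take-ownership step can be applied to every NVMe namespace with a small loop. This sketch assumes each /dev/nvme*n1 device is a SED-capable drive and that <your-sid-password> is the same placeholder used in the steps above:

```shell
#!/bin/sh
# Attempt to take ownership of every NVMe namespace, one drive at a time.
# <your-sid-password> is a placeholder, as in the single-drive example.
SID_PASSWORD='<your-sid-password>'

for DRIVE in /dev/nvme*n1; do
    echo "Taking ownership of $DRIVE"
    sudo sedutil-cli --takeownership "$SID_PASSWORD" "$DRIVE" \
        || echo "takeownership failed on $DRIVE; reset it with its PSID"
done
```

Any drive that still fails must be reset with its PSID before retrying, as described above.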
GPUs Cannot Be Reset During MIG Configurations on DGX A100 and A800 Systems#
Issue#
When you run the nvidia-mig-parted tool to apply the MIG configurations on DGX A100 or DGX A800 systems, the following error message might occur:
The following GPUs could not be reset:
GPU 00000000:01:00.0: In use by another client
GPU 00000000:47:00.0: In use by another client
GPU 00000000:81:00.0: In use by another client
GPU 00000000:C2:00.0: In use by another client
Workaround#
To recover from this error, reboot the server to apply the most recent nvidia-mig-parted configuration.
For future updates using the nvidia-mig-parted command, be sure to run the following command before any additional nvidia-mig-parted commands:
$ sudo rmmod nvidia_drm nvidia_modeset
After the nvidia-mig-parted command is complete, reload the nvidia_drm and nvidia_modeset modules:
$ sudo modprobe nvidia_modeset
$ sudo modprobe nvidia_drm
Missing the nvidia-system-station Metapackage on DGX Station A100 and DGX Station A800#
Issue#
During the installation using the DGX OS 7.0.0 ISO on a DGX Station A100 or DGX Station A800,
the nvidia-system-station
metapackage was incorrectly removed.
Workaround#
This issue occurs only on the DGX Station A100 and DGX Station A800. To resolve this issue,
install the nvidia-system-station
metapackage manually:
sudo apt install nvidia-system-station
Virtualization Not Supported#
Issue#
Virtualization technologies, such as ESXi hypervisors and kernel-based virtual machines (KVM), are not intended use cases on DGX systems and have not been tested.