Known Issues#
RShim Devices Not Created After Starting rshim.service#
Issue#
RShim devices might not appear under /dev/rshim* because the RShim interface is in use by another entity.
Workaround#
Edit /etc/rshim.conf and uncomment the line that reads #FORCE_MODE 1 (so that it reads FORCE_MODE 1). Then restart the RShim service:
sudo systemctl restart rshim.service
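The two steps above can be scripted. The following is a minimal sketch, assuming the configuration line appears exactly as #FORCE_MODE 1 in /etc/rshim.conf:

```shell
#!/bin/sh
# Uncomment "#FORCE_MODE 1" in rshim.conf so the RShim driver takes over
# the interface even when another entity currently owns it.
CONF=/etc/rshim.conf

# Drop the leading "#" from the FORCE_MODE line (no-op if already uncommented).
sudo sed -i 's/^#\s*FORCE_MODE 1/FORCE_MODE 1/' "$CONF"

# Restart the service so the new setting takes effect.
sudo systemctl restart rshim.service
```

After the restart, the RShim devices should appear under /dev/rshim*.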
Firmware Crash on DGX GB200 During Boot#
Issue#
On DGX GB200 systems, you might observe a firmware crash during some reboots, resulting in an extra reboot. The crash output is visible on the serial console.
Log of events seen for this issue:
INFO: th500_ras_intr_handler: External Abort reason=0 syndrome=0xbe000411 flags=0x1
ERROR: spmd_ffa_direct_message failed (4294967292) on CPU121
(the preceding ERROR line repeats many times)
ASSERT: plat/nvidia/tegra/soc/th500/plat_ras.c:408
BACKTRACE: START: assert
0: EL3: 0x78732b2648
1: EL3: 0x78732b0294
2: EL3: 0x78732c45fc
3: EL3: 0x78732c481c
4: EL3: 0x78732bda18
5: EL3: 0x78732b9f18
6: EL3: 0x78732b1258
BACKTRACE: END: assert
Workaround#
There is currently no workaround for this issue.
The nvidia_peermem Module Does Not Load After an OTA Update#
Issue#
After you perform an OTA update, the nvidia-peermem-loader package is installed, but the nvidia_peermem module is not loaded. As a result, the following error messages appear in the dmesg log:
nvidia_peermem: disagrees about version of symbol ib_register_peer_memory_client
nvidia_peermem: Unknown symbol ib_register_peer_memory_client (err -22)
Workaround#
This issue occurs because updating the mlnx-ofed-kernel package requires rebuilding the nvidia-peermem DKMS module against the new version of the mlnx-ofed-kernel package.
The following commands rebuild the necessary modules:
# Determine the installed nvidia DKMS module/version for the running kernel.
MODULE_VERSION=$(dkms status nvidia -k $(uname -r) | cut -d "," -f1) || true
if [ -n "${MODULE_VERSION}" ]; then
    # Remove and reinstall the module so it is rebuilt against the new packages.
    sudo dkms remove -m ${MODULE_VERSION} -k $(uname -r) || true
    sudo dkms install -m ${MODULE_VERSION} -k $(uname -r) || true
fi
DGX Station A100 Failed to Boot After Applying MIG Configurations#
Issue#
After MIG configurations were successfully applied to a DGX Station A100 system running DGX OS 7.0.2, the system failed to boot when you ran the sudo reboot command. Resetting the GPUs by performing a DC power cycle could not recover the system.
Workaround#
The DGX OS 7.0.2 release does not support the DGX Station A100 system with MIG enabled. To resolve the boot failure, install DGX OS 6.3.2 on the system and then apply MIG configurations.
The systemd-modules-load Service Failed to Insert the nvidia_peermem Module#
Issue#
On a DGX Station A100 or DGX Station A800, if you installed a Base OS version earlier than 7.0.2, the nvidia-peermem-loader package might have been installed on the system. As a result, the following error occurs:
$ sudo systemctl status systemd-modules-load.service
(code=exited, status=1/FAILURE)
...
systemd-modules-load[2143]: Failed to insert module 'nvidia_peermem': Invalid argument
Workaround#
To avoid this failure, remove the nvidia-peermem-loader package:
sudo apt purge nvidia-peermem-loader
BMC Redfish Interface Not Active on the First Boot After Installation#
Issue#
Reported in DGX OS 7.0.0.
When the DGX system booted for the first time after the DGX OS 7.0.0 installation, the BMC Redfish network interface was not renamed or autoconfigured correctly, as reported by the ip a command.
Workaround#
During the installation of DGX OS 7.0.x, the Redfish interface might not be configured correctly with the proper interface name or IP address. To resolve this issue, run the following command to reconfigure the interface:
sudo /usr/sbin/configure-redfish-intf.bash
No Permissions to Access /var/run/nvidia-fabricmanager for Non-Root Users on DGX B200#
Issue#
After DGX OS 7.0.x is installed on the DGX B200 system, the access that non-root accounts require to the /var/run/nvidia-fabricmanager directory of the Fabric Manager service is not set. This can cause failures when running the HPL benchmark and NCCL tests as a non-root user.
Workaround#
Change the permissions on the /var/run/nvidia-fabricmanager/
directory by running the following commands:
# If Fabric Manager is running, stop it.
sudo systemctl stop nvidia-fabricmanager.service
# Change the permission setting to 755.
sudo chmod 755 /var/run/nvidia-fabricmanager/
# Start Fabric Manager.
sudo systemctl start nvidia-fabricmanager.service
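To confirm that the permission change took effect, you can check the directory mode with stat. A minimal sketch, using the same path as the workaround above:

```shell
#!/bin/sh
# Verify that /var/run/nvidia-fabricmanager is world-readable and executable
# (mode 755) so non-root processes such as the HPL benchmark and NCCL tests
# can reach the Fabric Manager runtime files.
DIR=/var/run/nvidia-fabricmanager

MODE=$(stat -c '%a' "$DIR")
if [ "$MODE" = "755" ]; then
    echo "OK: $DIR has mode $MODE"
else
    echo "WARNING: $DIR has mode $MODE, expected 755"
fi
```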
Kernel OOPS When Activating VFs from SR-IOV Network Operator#
Issue#
A kernel oops occurs when activating 16 or more Virtual Functions (VFs) for a single parent interface or Physical Function (PF).
Primary issue impacts:
- Shell users: shell users cannot use the standard ip command for network interface troubleshooting on interfaces with 16 VFs. There are no stats, status, and so on.
- CPU cycles/log space: resource consumption is a concern. The dmesg logs roll over in minutes or seconds, preventing proper debugging.
Two failures occur on the network-operator side:
- ib-sriov process failure: this happened once and is reflected in the log of events below.
- sriov-network-config-daemon failure: happens continuously.
Both failures produce similar symptoms. The following failure was observed in the sriov-network-config-daemon logs:
DiscoverSriovDevices(): unable to get Link for device, skipping {"device":
"0000:81:00.0", "error": "message too long"}
Workaround#
Create fewer than 16 VFs for a PF.
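On Linux, the VF count for a PF is set through the sriov_numvfs sysfs attribute. The following is a minimal sketch that keeps the count below 16; the interface name enp129s0f0 is a placeholder for your actual PF:

```shell
#!/bin/sh
# Cap the number of Virtual Functions on a Physical Function below 16
# to avoid the kernel oops described above.
PF=enp129s0f0          # placeholder: substitute your parent interface (PF)
NUM_VFS=15             # any value below 16

SYSFS=/sys/class/net/$PF/device/sriov_numvfs

# The kernel requires resetting the count to 0 before a new non-zero
# VF count is accepted.
echo 0          | sudo tee "$SYSFS" > /dev/null
echo "$NUM_VFS" | sudo tee "$SYSFS" > /dev/null
```

If you manage VFs through the SR-IOV Network Operator instead, apply the same limit in its node policy.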
Log of events seen for this issue:
[128435.409604] WARNING: CPU: 113 PID: 1061323 at net/core/rtnetlink.c:3867 rtnl_getlink+0x43a/0x470
[128435.409610] Modules linked in: xt_set xt_multiport ipt_rpfilter ip_set_hash_net ip_set_hash_ip ip_set veth ipip tunnel4 ip_tunnel wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel nf_conntrack_netlink xt_addrtype xt_statistic xt_nat xt_MASQUERADE xt_mark xt_nfacct ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_chain_nat nfnetlink_acct nvidia_uvm(OE) overlay xt_conntrack xt_comment nft_compat nf_tables rpcsec_gss_krb5 qrtr auth_rpcgss rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) binfmt_misc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac skx_edac_common nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel nvidia_drm(OE) iaa_crypto nvidia_modeset(OE) cmdlinepart kvm spi_nor pmt_telemetry irqbypass mtd intel_sdsi pmt_class nvidia(OE) rapl intel_cstate video dax_hmem ecc intel_th_gth idxd mei_me isst_if_mmio isst_if_mbox_pci i2c_i801 intel_th_pci
[128435.409640] spi_intel_pci isst_if_common idxd_bus intel_vsec ast mei intel_th i2c_smbus i2c_ismt spi_intel cxl_acpi cxl_core input_leds joydev mac_hid knem(OE) ipmi_devintf ipmi_msghandler dm_multipath msr iptable_filter iptable_mangle iptable_nat xt_owner xt_REDIRECT nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nvme_fabrics efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 nls_iso8859_1 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 bonding e1000 mpt3sas raid_class sata_sil i40e ahci libahci xfs nfsv4 mptsas forcedeth udf crc_itu_t aacraid dm_thin_pool dm_persistent_data dm_bufio megaraid_sas isofs mptspi bnx2 mptscsih mptbase hpilo igb i2c_algo_bit bnxt_en e1000e reiserfs br_netfilter bridge stp llc megaraid arcmsr dm_bio_prison hpsa scsi_transport_sas btrfs blake2b_generic xor raid6_pq ixgbevf bnx2x libcrc32c tg3 sata_svw jfs nls_ucs2_utils aic7xxx sata_nv nfsv3 nfs_acl nfs lockd grace sunrpc netfs igbvf aic79xx scsi_transport_spi mlx5_ib(OE) ib_uverbs(OE) macsec
[128435.409680] ib_core(OE) hid_generic usbhid hid cdc_ether usbnet mii uas usb_storage mlx5_core(OE) crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic mlxfw(OE) ice ghash_clmulni_intel sha256_ssse3 psample sha1_ssse3 nvme mlxdevm(OE) ixgbe aesni_intel crypto_simd nvme_core tls xfrm_algo cryptd xhci_pci dca mlx_compat(OE) gnss nvme_auth mdio pci_hyperv_intf wmi xhci_pci_renesas pinctrl_emmitsburg [last unloaded: ipmi_msghandler]
[128435.409695] CPU: 113 PID: 1061323 Comm: ib-sriov Tainted: G W OE 6.8.0-51-generic #52-Ubuntu
[128435.409698] Hardware name: , BIOS 5.32 10/24/2024
[128435.409699] RIP: 0010:rtnl_getlink+0x43a/0x470
[128435.409701] Code: c7 a0 f2 76 8c e8 c6 c5 06 00 4d 85 ed 0f 84 e2 fc ff ff 49 c7 45 00 a0 f2 76 8c e9 d5 fc ff ff b8 ea ff ff ff e9 c4 fe ff ff <0f> 0b e9 a4 fe ff ff 48 c7 c7 d0 f2 76 8c e8 93 c5 06 00 4d 85 ed
[128435.409703] RSP: 0018:ff29f566a62bb700 EFLAGS: 00010246
[128435.409704] RAX: 00000000ffffffa6 RBX: ff24d605ed223480 RCX: 0000000000000000
[128435.409706] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[128435.409706] RBP: ff29f566a62bb9d8 R08: 0000000000000000 R09: 0000000000000000
[128435.409707] R10: 0000000000000000 R11: ff24d60420219e00 R12: ff24d6042021a200
[128435.409708] R13: ffffffff8df7ecc0 R14: ffffffff8df7ecc0 R15: 00000000ffffffff
[128435.409709] FS: 00000000007a9d10(0000) GS:ff24d6fdbe480000(0000) knlGS:0000000000000000
[128435.409710] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[128435.409711] CR2: 000000c000428000 CR3: 00000004a559a006 CR4: 0000000000f71ef0
[128435.409713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[128435.409713] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[128435.409715] PKRU: 55555554
[128435.409715] Call Trace:
[128435.409716] <TASK>
[128435.409718] ? show_regs+0x6d/0x80
[128435.409720] ? __warn+0x89/0x160
[128435.409722] ? rtnl_getlink+0x43a/0x470
[128435.409724] ? report_bug+0x17e/0x1b0
[128435.409727] ? handle_bug+0x51/0xa0
[128435.409730] ? exc_invalid_op+0x18/0x80
[128435.409732] ? asm_exc_invalid_op+0x1b/0x20
[128435.409735] ? rtnl_getlink+0x43a/0x470
[128435.409740] rtnetlink_rcv_msg+0x16d/0x430
[128435.409742] ? apparmor_file_alloc_security+0x43/0x1f0
[128435.409747] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[128435.409749] netlink_rcv_skb+0x5a/0x110
[128435.409752] rtnetlink_rcv+0x15/0x30
[128435.409754] netlink_unicast+0x24a/0x390
[128435.409756] netlink_sendmsg+0x214/0x470
[128435.409758] __sys_sendto+0x21b/0x230
[128435.409761] __x64_sys_sendto+0x24/0x40
[128435.409763] x64_sys_call+0x1b2d/0x25a0
[128435.409765] do_syscall_64+0x7f/0x180
[128435.409767] ? handle_pte_fault+0x1cb/0x1d0
[128435.409770] ? __handle_mm_fault+0x653/0x790
[128435.409772] ? __count_memcg_events+0x6b/0x120
[128435.409776] ? count_memcg_events.constprop.0+0x2a/0x50
[128435.409779] ? handle_mm_fault+0xad/0x380
[128435.409781] ? do_user_addr_fault+0x333/0x670
[128435.409783] ? irqentry_exit_to_user_mode+0x7b/0x260
[128435.409786] ? irqentry_exit+0x43/0x50
[128435.409787] ? exc_page_fault+0x94/0x1b0
[128435.409789] entry_SYSCALL_64_after_hwframe+0x78/0x80
[128435.409791] RIP: 0033:0x40328e
[128435.409812] Code: 48 89 6c 24 38 48 8d 6c 24 38 e8 0d 00 00 00 48 8b 6c 24 38 48 83 c4 40 c3 cc cc cc 49 89 f2 48 89 fa 48 89 ce 48 89 df 0f 05 <48> 3d 01 f0 ff ff 76 15 48 f7 d8 48 89 c1 48 c7 c0 ff ff ff ff 48
[128435.409813] RSP: 002b:000000c00036b1b8 EFLAGS: 00000206 ORIG_RAX: 000000000000002c
ACCESS_REG Command Failure with Err(-22)#
Issue#
After the initial installation of DGX OS 7.0.1 on a DGX B200 system, the following non-destructive issue has been seen on every boot. This is due to the node_exporter attempting to get telemetry from PF0 and PF1, causing a dmesg error message similar to the example below to be written to the kernel log once every 30 seconds. This might cause the kernel log to fill up.
[11176.517416] mlx5_core 0000:05:00.0: mlx5_cmd_out_err:835:(pid 18360): ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), syndrome (0x9a6171), err(-22)
[11176.534892] mlx5_core 0000:05:00.0: mlx5_cmd_out_err:835:(pid 18360): ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), syndrome (0x9a6171), err(-22)
[11176.589308] mlx5_core 0000:05:00.1: mlx5_cmd_out_err:835:(pid 10354): ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), syndrome (0x9a6171), err(-22)
[11176.607052] mlx5_core 0000:05:00.1: mlx5_cmd_out_err:835:(pid 10354): ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), syndrome (0x9a6171), err(-22)
Workaround#
Access to PF0 and PF1 is restricted. There is currently no workaround for this issue.
nv-disk-encrypt Failed on Pre-Owned NVMe Drives#
Issue#
Using the nv-disk-encrypt tool to initialize the system for NVMe drive encryption failed with the following error messages:
takeOwnership failed
SED takeownership failed on /dev/nvme0n1
Workaround#
To resolve the issue, recover from lost keys and erase the drives as shown in the following single-drive example:
1. Take ownership of all drives, one at a time:
sudo sedutil-cli --takeownership <your-sid-password> /dev/nvme6n1
2. If step 1 fails, specify the PSID to reset the drive using the sedutil-cli command.
Caution
Before performing this step, back up your important data to another location because this operation deletes everything on the drive.
You can obtain the PSID, which is printed on the label, by physically examining the drive.
For example:
sudo sedutil-cli --yesIreallywanttoERASEALLmydatausingthePSID <your-drive-psid> /dev/nvme6n1
3. Verify that taking ownership now succeeds:
sudo sedutil-cli --takeownership <your-sid-password> /dev/nvme6n1
4. Revert ownership before the initialization process for drive encryption:
sudo sedutil-cli --reverttper <your-sid-password> /dev/nvme6n1
5. Initialize the system for drive encryption using the nv-disk-encrypt init command:
sudo nv-disk-encrypt init [-k <your-vault-password>] [-f <path/to/json-file>] [-g] [-r]
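The take-ownership step can be applied to every NVMe namespace with a small loop. This sketch assumes each /dev/nvme*n1 device is a SED-capable drive and that <your-sid-password> is the same placeholder used in the steps above:

```shell
#!/bin/sh
# Attempt to take ownership of every NVMe namespace, one drive at a time.
# <your-sid-password> is a placeholder, as in the single-drive example.
SID_PASSWORD='<your-sid-password>'

for DRIVE in /dev/nvme*n1; do
    echo "Taking ownership of $DRIVE"
    sudo sedutil-cli --takeownership "$SID_PASSWORD" "$DRIVE" \
        || echo "takeownership failed on $DRIVE; reset it with its PSID"
done
```

Any drive that still fails must be reset with its PSID before retrying, as described above.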
GPUs Cannot Be Reset During MIG Configurations on DGX A100 and A800 Systems#
Issue#
When you run the nvidia-mig-parted tool to apply the MIG configurations on DGX A100 or DGX A800 systems, the following error message might occur:
The following GPUs could not be reset:
GPU 00000000:01:00.0: In use by another client
GPU 00000000:47:00.0: In use by another client
GPU 00000000:81:00.0: In use by another client
GPU 00000000:C2:00.0: In use by another client
Workaround#
To recover from this error, reboot the server to apply the most recent nvidia-mig-parted configuration.
For future updates using the nvidia-mig-parted command, be sure to run the following command before any additional nvidia-mig-parted commands:
$ sudo rmmod nvidia_drm nvidia_modeset
After the nvidia-mig-parted command is complete, reload the nvidia_drm and nvidia_modeset modules:
$ sudo modprobe nvidia_modeset
$ sudo modprobe nvidia_drm
Missing the nvidia-system-station Metapackage on DGX Station A100 and DGX Station A800#
Issue#
During the installation using the DGX OS 7.0.0 ISO on a DGX Station A100 or DGX Station A800,
the nvidia-system-station
metapackage was incorrectly removed.
Workaround#
This issue occurs only on the DGX Station A100 and DGX Station A800. To resolve this issue,
install the nvidia-system-station
metapackage manually:
sudo apt install nvidia-system-station
Virtualization Not Supported#
Issue#
Virtualization technologies, such as ESXi hypervisors and kernel-based virtual machines (KVM), are not intended use cases on DGX systems and have not been tested.