Resolved Issues#

The following issues, which were previously listed as known issues, have been resolved.

ACCESS_REG Command Failure with Err(-22)#

Issue#

After the initial installation of DGX OS 7.0.1 on a DGX B200 system, the following non-destructive issue occurs on every boot. The node_exporter attempts to get telemetry from PF0 and PF1, which causes a dmesg error message similar to the following example to be written to the kernel log every 30 seconds. Over time, these messages might fill up the kernel log.

[11176.517416] mlx5_core 0000:05:00.0: mlx5_cmd_out_err:835:(pid 18360): ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), syndrome (0x9a6171), err(-22)
[11176.534892] mlx5_core 0000:05:00.0: mlx5_cmd_out_err:835:(pid 18360): ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), syndrome (0x9a6171), err(-22)
[11176.589308] mlx5_core 0000:05:00.1: mlx5_cmd_out_err:835:(pid 10354): ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), syndrome (0x9a6171), err(-22)
[11176.607052] mlx5_core 0000:05:00.1: mlx5_cmd_out_err:835:(pid 10354): ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), syndrome (0x9a6171), err(-22)
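To gauge how quickly these messages accumulate, the kernel log can be summarized per PCI function. The following is a minimal sketch and is not part of DGX OS: the function name count_access_reg_errors is hypothetical, and it assumes the message format shown in the example above.

```shell
#!/bin/sh
# Hypothetical helper (not part of DGX OS): count ACCESS_REG(0x805) failures
# per PCI function. Reads kernel log lines on stdin and prints one
# "<pci-address> <count>" line per device, sorted by address.
count_access_reg_errors() {
    grep -F 'ACCESS_REG(0x805)' \
        | sed -n 's/.*mlx5_core \([0-9a-fA-F:.]*\): .*/\1/p' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Usage (reading the kernel log may require root):
#   sudo dmesg | count_access_reg_errors
```

If the counts for 0000:05:00.0 and 0000:05:00.1 keep growing between runs, the system is likely still affected.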

Status#

Resolved in version 7.4.0.

DGX Station A100 Fails to Boot After Applying MIG Configurations#

Issue#

After MIG configurations were successfully applied to a DGX Station A100 system running DGX OS 7.0.2, the system failed to boot when you ran the sudo reboot command. Resetting the GPUs by performing a DC power cycle did not recover the system.

Workaround#

The DGX OS 7.0.2 release does not support the DGX Station A100 system with MIG enabled. To resolve the boot failure, install DGX OS 6.3.2 on the system and then apply MIG configurations.

Status#

Resolved in GPU driver versions 570.117 and later, as well as 575.20 and later.
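To check whether an installed driver already contains the fix, its version string can be compared against the two thresholds. The following is a minimal sketch under stated assumptions: the function name driver_has_mig_boot_fix is hypothetical, and it interprets "570.117 and later" as applying to the 570 branch and "575.20 and later" as applying to the 575 branch and newer, which the text above does not state explicitly. It relies on the -V (version sort) option of GNU sort.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if a driver version includes the fix.
# Assumption: 570.x needs >= 570.117; 575.x and newer need >= 575.20;
# all other branches are treated as not fixed.
driver_has_mig_boot_fix() {
    ver=$1
    major=${ver%%.*}
    case $major in
        ''|*[!0-9]*) return 2 ;;   # not a numeric version string
    esac
    if [ "$major" -eq 570 ]; then
        min=570.117
    elif [ "$major" -ge 575 ]; then
        min=575.20
    else
        return 1
    fi
    # With version sort, the fix is present when the minimum sorts first.
    lowest=$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n 1)
    [ "$lowest" = "$min" ]
}

# Usage:
#   ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_has_mig_boot_fix "$ver" && echo "fix present" || echo "fix not present"
```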

Update the MLNX Firmware for the ConnectX and BlueField-3 Adapters#

Issue#

The online network repository for DOCA 2.9.1/Ubuntu 24.04 does not contain the mlnx_fw_updater tool, which is needed to update the ConnectX and BlueField-3 adapters to their latest firmware versions.

Workaround#

Download and install the mlnx-fw-updater package, and then run the mlnx_fw_updater tool:

$ wget https://linux.mellanox.com/public/repo/mlnx_ofed/latest-24.10/ubuntu24.04/x86_64/mlnx-fw-updater_24.10-1.1.4.0_amd64.deb
$ sudo apt install ./mlnx-fw-updater_24.10-1.1.4.0_amd64.deb
$ sudo /opt/mellanox/mlnx-fw-updater/mlnx_fw_updater.pl
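Before downloading the package manually, it can be useful to confirm that the configured repositories do not offer it. The following is a minimal sketch: the function name has_install_candidate is hypothetical, and it assumes the usual apt-cache policy output layout with a "Candidate:" line.

```shell
#!/bin/sh
# Hypothetical helper: read `apt-cache policy <package>` output on stdin and
# succeed (exit 0) only if a candidate version is available for installation.
has_install_candidate() {
    candidate=$(awk '$1 == "Candidate:" { print $2; exit }')
    [ -n "$candidate" ] && [ "$candidate" != "(none)" ]
}

# Usage:
#   if apt-cache policy mlnx-fw-updater | has_install_candidate; then
#       echo "mlnx-fw-updater is available from the configured repositories"
#   else
#       echo "mlnx-fw-updater is not available; use the manual download"
#   fi
```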

Status#

Resolved in version 7.1.0.

Errors Occur When Loading Mirrored Repositories on Air-Gapped Systems#

Issue#

When you run the apt update command to load mirrored repositories on an air-gapped system, the following error messages appear:

File not found - /media/repository/mirror/security.ubuntu.com/ubuntu/dists/jammy-security/main/cnf/Commands-amd64 (2: No such file or directory)
Failed to fetch file:/media/repository/mirror/security.ubuntu.com/ubuntu/dists/jammy-security/main/cnf/Commands-amd64  File not found - /media/repository/mirror/security.ubuntu.com/ubuntu/dists/jammy-security/main/cnf/Commands-amd64 (2: No such file or directory)

Explanation#

This issue occurs because a fix for the apt-mirror package, which is available in Ubuntu 23.10, has not yet been implemented in the Ubuntu 22.04 repositories. The action to take depends on the version of your apt-mirror package:

  • Version later than 0.5.4-1: Contact NVIDIA Enterprise Services by filing a support case.

  • Version 0.5.4-1: Use the following workaround to mirror the repositories.

You can run the following command to determine the version of your apt-mirror package:

$ dpkg -l | grep apt-mirror

ii  apt-mirror                  0.5.4-1               all             APT sources mirroring tool
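The version check can also be scripted so that it reports which of the two cases above applies. The following is a minimal sketch: the function name apt_mirror_case and its messages are hypothetical, and it relies on dpkg --compare-versions for Debian-style version ordering.

```shell
#!/bin/sh
# Hypothetical helper: report which case applies to an apt-mirror version.
# The "earlier than" branch is an addition; this section does not cover it.
apt_mirror_case() {
    ver=$1
    if dpkg --compare-versions "$ver" gt 0.5.4-1; then
        echo "later than 0.5.4-1: contact NVIDIA Enterprise Services"
    elif dpkg --compare-versions "$ver" eq 0.5.4-1; then
        echo "0.5.4-1: use the workaround"
    else
        echo "earlier than 0.5.4-1: not covered by this section"
    fi
}

# Usage:
#   apt_mirror_case "$(dpkg-query -W -f='${Version}' apt-mirror)"
```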

Status#

Resolved in version 7.1.0.