DGX A100 System Firmware Update Container Version 21.11.4

The DGX Firmware Update container version 21.11.4 is available.

  • Package name: nvfw-dgxa100_21.11.4_211111.tar.gz

  • Run file name: nvfw-dgxa100_21.11.4_211111.run

  • Image name: nvfw-dgxa100:21.11.4

  • ISO image: DGXA100_FWUI-21.11.4-2021-11-12-09-20-53.iso

  • PXE netboot: pxeboot-DGXA100-FWUI-21.11.4.tgz

Highlights and Changes in this Release

  • This release is supported with the following DGX OS software:

    • DGX OS 5.0.1 or later

      Important

      This firmware update container does not support DGX OS 4.99.xx. To use the container on DGX A100 servers, update to DGX OS 5.0.1 or later.

    • EL7-21.04 or later (See Special Instructions for Red Hat Enterprise Linux 7)

    • EL8-20.11 or later

  • Fixed BMC issues

    • Fixed incorrect temperatures reported for sensors on the NVIDIA Networking ConnectX-6 single-port and dual-port VPI cards.

    • Fixed BMC user data (username, password, privileges) getting lost after BMC upgrade.

    • Added ability to set the BMC to local time instead of the default UTC.

    • Added authentication capabilities to the BMC RESTful API.

    • Added new capabilities to identify firmware update in the System Event Log (SEL) on the BMC.

    • Fixed a bug so that the BMC boots to the latest version updated on the system.

  • Fixed SBIOS issues

    • Added a leaky bucket for memory correctable ECC errors, preventing unnecessary replacement of working system DIMMs.

    • Fixed PXE boot configuration not persisting, helpful for multiple DGX A100 nodes.

    • Fixed inability to enter SBIOS Admin/User password from the Serial over LAN console.

  • Fixed U.2 NVMe driver issues

    • Improved write performance while performing drive wear-leveling.

  • Addressed the needs of security-conscious customers who no longer support Python 2.7 by using Python 3 in the NVIDIA containerless .run file.

  • IPMITool: “ipmitool -I lan” is no longer supported. Instead, use “ipmitool -I lanplus”.
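As a minimal sketch of the lanplus form for remote BMC access, the wrapper below pins the interface so that the retired lan interface is not used by accident. The function name bmc and the variables BMC_HOST and BMC_USER are illustrative assumptions, not part of this release; the -E option tells ipmitool to read the password from the IPMI_PASSWORD environment variable.

```shell
# Hypothetical helper: always talk to the BMC over the lanplus interface.
# BMC_HOST and BMC_USER are placeholder variables; export IPMI_PASSWORD
# beforehand so that -E can pick up the password from the environment.
bmc() {
    ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -E "$@"
}

# Example (requires a reachable BMC, not executed here):
#   BMC_HOST=192.0.2.10 BMC_USER=admin bmc chassis status
```

Keeping the interface in one place means existing scripts only need to swap their ipmitool call for the wrapper.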

Contents of the DGX A100 System Firmware Container

This container includes the firmware binaries and update utilities for the firmware listed in the following table.

  • If updating from 21.05.7 or 21.03.6, the total update time is approximately 44 minutes.

  • If updating from 20.12.3.3 or earlier, the total update time is approximately 3 hours and 22 minutes.

The update time for each component is provided in the following table.

Component | Version | Key Changes | Update Time from 20.12.3.3 or earlier | Update Time from 21.05.7 or 21.03.6
BMC (via CEC) | 00.16.09 | Refer to DGX A100 BMC Changes for the list of changes. | 32 minutes | 32 minutes
SBIOS | 1.09 | Refer to DGX A100 SBIOS Changes for the list of changes. | 5 minutes | 5 minutes
Broadcom 88096 PCIe switch board | 0.2.0 | No change | 1 minute | 0 minutes
BMC CEC SPI | v3.28 | No change | 22 minutes | 0 minutes
PEX88064 Retimer | 1.2f | No change | 1 minute | 0 minutes
PEX88080 Retimer | 1.2f | No change | 1 minute | 0 minutes
NvSwitch BIOS | 92.10.18.00.01 | No change | 2 minutes | 0 minutes
VBIOS (A100 40GB) | 92.00.45.00.03 | Added security protection to the I2C interface. | 7 minutes | 3 minutes
VBIOS (A100 80GB) | 92.00.45.00.05 | Added security protection to the I2C interface. | Same as above. | Same as above.
U.2 NVMe (Samsung) | EPK9CB5Q | Refer to DGX A100 U.2 NVMe Changes for the list of changes. | 4 minutes | 4 minutes
U.2 NVMe (Kioxia) | 0105 | No change | Same as above. | Same as above.
M.2 NVMe (Samsung version 1) | EDA7602Q | No change | 4 minutes | 0 minutes
M.2 NVMe (Samsung version 2) | GDC7202Q | New support | Same as above. | Same as above.
FPGA (GPU sled) | 2.A5 | No change | 22 minutes | 0 minutes
CEC1712 SPI (GPU sled) | 3.9 | No change | 3 minutes | 0 minutes
PSU (Delta) | Primary 1.6 / Secondary 1.6 / Community 1.7 | No change | 90 minutes | 0 minutes
PSU (LiteOn) | v0908 | New support | Same as above. | Same as above.
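As a quick check of the 44-minute total quoted above for updates from 21.05.7 or 21.03.6, the nonzero entries in that column can be summed. The variable names are illustrative, and rows marked “Same as above” are counted once because a system carries only one of the two variants.

```shell
# Per-component update times (minutes) from 21.05.7 or 21.03.6, as listed above.
bmc=32; sbios=5; vbios=3; u2_nvme=4
total=$(( bmc + sbios + vbios + u2_nvme ))
echo "Total: ${total} minutes"   # prints "Total: 44 minutes"
```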

BMC 00.16.12 on Newer CPU Motherboards

Newer CPU motherboards are manufactured and shipped with BMC version 00.16.12. This BMC version provides an updated PCIe setting that is required by the newer (-004) motherboards. Do not attempt to downgrade the BMC on these motherboards using the firmware update container.

Updating Components with Secondary Images

Some firmware components provide a secondary image as backup. The following is the policy when updating those components:

  • SBIOS: The two images are referred to as active and inactive, where the active is the currently running image and the inactive is the backup image. When using update_fw all, the update container updates both active and inactive images.

  • BMC: The two images are referred to as active and inactive, where the active is the currently running image and the inactive is the backup image. The update container can only update the inactive image, and will update it only if the active image needs to be updated. After the update is completed, the updated inactive image becomes the active image. Because the active image is now updated, subsequent update_fw all commands will not update the inactive image. To update the inactive image in this case, use update_fw BMC --inactive. Since the container does not support updating the active image directly, commands such as update_fw BMC -a -f will not work.
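The BMC policy above can be sketched as a small check: after update_fw all has refreshed the active image, look at the show_version output and update the inactive image only if it is still out of date. The function name bmc_inactive_outdated is hypothetical, and the parsing assumes the section layout shown in the example output later in this document.

```shell
# Sketch: return success (0) when show_version text on stdin lists the
# inactive BMC image as not up to date. Assumes a "BMC DGX" section header
# followed by Active/Inactive rows whose last column is yes or no.
bmc_inactive_outdated() {
    awk '
        /^[[:space:]]*BMC/   { in_bmc = 1; next }   # entering the BMC section
        /^[[:space:]]*SBIOS/ { in_bmc = 0 }         # the next section ends it
        in_bmc && /Inactive/ && $NF == "no" { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Intended use on a DGX A100 (not executed here):
#   RUN=./nvfw-dgxa100_21.11.4_211111.run
#   sudo "$RUN" show_version | bmc_inactive_outdated &&
#       sudo "$RUN" update_fw BMC --inactive
```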

Updating the PSU FW

  • If the PSU update fails due to a failure in the PSU recovery, power cycle the PSU and then perform the PSU update again. The following are some methods for power cycling the PSU:

    • Remove power from the failed PSU by turning off the rack PDU output to that PSU and then turning it back on after a few moments. (If necessary, run the container using the show_version option to determine which PSU is reported as “not-ok”.)

    • Physically disconnect power to the PSU by disconnecting one end of the PSU power cord and then reconnect after a few moments. (If necessary, run the container using the show_version option to determine which PSU is reported as “not-ok”.)

    • AC power cycle the server.

      $ sudo ipmitool raw 0x3c 0x04
      $ sudo ipmitool chassis power cycle
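To find the PSU that needs power cycling, the show_version output can be filtered for rows whose status is “not-ok”. This is a sketch only: the function name list_not_ok_psus is hypothetical, and it assumes a failed PSU row has the same column layout as the healthy rows in the example output of this document.

```shell
# Sketch: print the PSU rows that show_version reports as "not-ok".
# Assumes each PSU row starts with "PSU" and carries the status as a
# separate whitespace-delimited column.
list_not_ok_psus() {
    awk '$1 == "PSU" { for (i = 2; i <= NF; i++) if ($i == "not-ok") { print; break } }'
}

# Intended use (not executed here):
#   sudo ./nvfw-dgxa100_21.11.4_211111.run show_version | list_not_ok_psus
```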
      

DO NOT UPDATE DGX A100 CPLD FIRMWARE UNLESS INSTRUCTED

When updating DGX A100 firmware using the Firmware Update Container, do not update the CPLD firmware unless the DGX A100 system is being upgraded from 320GB to 640GB.

The current DGX A100 Firmware Update Container will not automatically update the CPLD firmware (for example, when running update_fw all). It is possible to update the CPLD firmware using “update_fw CPLD”; however, it is strongly recommended that the CPLD firmware not be updated manually unless specifically instructed by NVIDIA Enterprise Support (or email enterprisesupport@nvidia.com). If the DGX A100 is upgraded from 320GB to 640GB, the CPLD firmware update should be performed as instructed.

Special Instructions for Red Hat Enterprise Linux 7

This section describes the actions that must be taken before updating firmware on DGX A100 systems installed with Red Hat Enterprise Linux. There are two options for meeting these requirements.

Option 1: Update to EL7-21.01 or later

Refer to the DGX Software for Red Hat Enterprise Linux 7 Release Notes for more information.

Important

Updating the DGX software for Red Hat Enterprise Linux will update the Red Hat Enterprise Linux installation to 7.9 or later. If you do not want to update your Red Hat Enterprise Linux 7 installation, then choose Option 2.

Option 2: Install mpt3sas 31.101.01.00-0

These instructions apply if

  • You do not want to update your Red Hat Enterprise Linux installation, and

  • Your system is currently installed with Red Hat Enterprise Linux 7.7 or later.

Note

If your system is installed with Red Hat Enterprise Linux 7.6 or earlier, contact NVIDIA Enterprise Support for assistance.

  1. Perform this step if your system is no longer pointing to the NVIDIA DGX software repository.

    1. On Red Hat Enterprise Linux, run the following commands to enable additional repositories required by the DGX software.

      sudo subscription-manager repos --enable=rhel-7-server-extras-rpms
      sudo subscription-manager repos --enable=rhel-7-server-optional-rpms
      
    2. Run the following command to install the DGX software installation package and enable the NVIDIA DGX software repository.

      Attention

      By running these commands you are confirming that you have read and agree to be bound by the DGX Software License Agreement. You are also confirming that you understand that any pre-release software and materials available that you elect to install in a DGX may not be fully functional, may contain errors or design flaws, and may have reduced or different security, privacy, availability, and reliability standards relative to commercial versions of NVIDIA software and materials, and that you use pre-release versions at your risk.

      yum install -y \
      https://international.download.nvidia.com/dgx/repos/rhel-files/dgx-repo-setup-20.03-1.el7.x86_64.rpm
      
  2. Install mpt3sas 31.101.01.00-0.

    sudo yum install mpt3sas-dkms
    
  3. Load the mpt3sas driver into the Red Hat Enterprise Linux kernel.

    sudo modprobe mpt3sas
    

    You can verify the correct mpt3sas version is installed by issuing the following.

    yum list installed
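Because yum list installed prints every package, it can help to narrow the output to the driver package. The sketch below assumes the package appears as mpt3sas-dkms followed by an architecture suffix, with the version in the second column; the function name is hypothetical, and yum may wrap long entries across lines, which this simple filter does not handle.

```shell
# Sketch: read `yum list installed` output on stdin and print the version
# column of the mpt3sas-dkms package (package name taken from step 2).
installed_mpt3sas_version() {
    awk '$1 ~ /^mpt3sas-dkms\./ { print $2 }'
}

# Intended use (not executed here):
#   yum list installed | installed_mpt3sas_version   # expect a 31.101.01.00 version
#   modinfo -F version mpt3sas                        # version of the module on disk
```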
    

Instructions for Updating Firmware

This section provides a simple way to update the firmware on the system using the firmware update container. The commands use the .run file, but you can also use any method described in Using the DGX A100 FW Update Utility.

Caution

  • Do not log into the BMC dashboard UI while a firmware update is in progress.

  • Stop all unnecessary system activities before attempting to update firmware.

  • Stop all GPU activity, including accessing nvidia-smi, as this can prevent the VBIOS from updating.

  • When issuing update_fw all, stop the following services if they are launched from Docker through the docker run command:

    • dcgm-exporter

    • nvidia-dcgm

    • nvidia-fabricmanager

    • nvidia-persistenced

    • xorg-setup

    • lightdm

    • nvsm-core

    • kubelet

    The container will attempt to stop these services automatically, but will be unable to stop any that are launched from Docker.

  • Do not add additional loads on the system (such as user jobs, diagnostics, or monitoring services) while an update is in progress. A high workload can disrupt the firmware update process and result in an unusable component.

  • When initiating an update, the update software assists in determining the activity state of the DGX system and provides a warning if it detects that activity levels are above a predetermined threshold. If the warning is encountered, you are strongly advised to take action to reduce the workload before proceeding with the update.
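One way to spot services that were launched through docker run is to filter the list of running containers for the service names given above. This is a sketch under assumptions: the function name is hypothetical, and matching on substrings of the image or container name may not fit how your containers are actually named.

```shell
# Sketch: filter `docker ps` output for containers whose image or name
# matches one of the services listed in the caution above.
docker_launched_services() {
    grep -E 'dcgm-exporter|nvidia-dcgm|nvidia-fabricmanager|nvidia-persistenced|xorg-setup|lightdm|nvsm-core|kubelet'
}

# Intended use (not executed here):
#   docker ps --format '{{.ID}} {{.Image}} {{.Names}}' | docker_launched_services
#   docker stop <container-id>    # for each container reported
```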

  1. Check if updates are needed by checking the installed versions.

    $ sudo ./nvfw-dgxa100_21.11.4_211111.run show_version
    
    • If there is “no” in any up-to-date column for updatable firmware, then continue with the next step.

    • If all up-to-date column entries are “yes”, then no updates are needed and no further action is necessary.

  2. Perform the update for all firmware supported by the container.

    $ sudo ./nvfw-dgxa100_21.11.4_211111.run update_fw all
    

    Depending on the firmware that is updated, you may be prompted to either reboot the system or power cycle the system.

    • If you are prompted to reboot, issue

      $ sudo reboot
      
    • If you are prompted to power cycle, you can issue the following two commands (there is no output with the first command).

      $ sudo ipmitool raw 0x3c 0x04
      $ sudo ipmitool chassis power cycle
      
  3. After rebooting or power cycling the system, you may need to perform another update_fw all to update other firmware.

    • Either repeat Step 1 to check if updates are needed and then perform Step 2 if needed, or

    • Repeat Step 2 just in case updates are needed.

    If you perform another update_fw all, you may be prompted again to either reboot the system or power cycle the system.

    See DGX A100 Firmware Update Process for more information about the update process.
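The check-then-update loop in the steps above can be sketched as follows. The helper name needs_update is hypothetical, and it assumes the up-to-date value is always the last column of each show_version row, as in the example output in this document.

```shell
# Sketch: succeed (0) when any row of show_version output ends in "no",
# that is, when at least one component is not up to date.
needs_update() {
    awk '$NF == "no" { found = 1 } END { exit (found ? 0 : 1) }'
}

# Intended flow on the system (not executed here):
#   RUN=./nvfw-dgxa100_21.11.4_211111.run
#   if sudo "$RUN" show_version | needs_update; then
#       sudo "$RUN" update_fw all    # then reboot or power cycle as prompted
#   fi
```

Repeating the same check after each reboot or power cycle covers the case in step 3 where a second update_fw all is needed.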

You can verify the update by issuing the following.

$ sudo ./nvfw-dgxa100_21.11.4_211111.run show_version

Example output for a DGX A100 320GB system

 CEC
============
                                           Onboard Version   Manifest  up-to-date
MB_CEC(enabled)                             3.28              3.28         yes
Delta_CEC(enabled)                          3.09              3.09         yes

 BMC DGX
=========
Image Id              Status    Location    Onboard Version   Manifest  up_to_date
0:Active   Boot       Online    Local       00.16.09          00.16.09     yes
1:Inactive Updatable            Local       00.16.09          00.16.09     yes

 SBIOS
=======
Image Id                                    Onboard Version   Manifest    up_to_date
0:Inactive Updatable                        1.09              1.09          yes
1:Active   Boot Updatable                   1.09              1.09          yes

  Switches
============
PCI Bus#                  Model      Onboard Version Manifest FUB Updated? up-to-date
DGX - 0000:91:00.0(U261)  88064_Retimer  1.2.0        1.2.0      N/A         yes
DGX - 0000:88:00.0(U260)  88064_Retimer  1.2.0        1.2.0      N/A         yes
DGX - 0000:4f:00.0(U262)  88064_Retimer  1.2.0        1.2.0      N/A         yes
DGX - 0000:48:00.0(U225)  88080_Retimer  1.2.0        1.2.0      N/A         yes


DGX - 0000:01:00.0(U1)    PEX88096       2.0         2.0       N/A         yes
DGX - 0000:b1:00.0(U4)    PEX88096       2.0         2.0       N/A         yes
DGX - 0000:41:00.0(U2)    PEX88096       2.0         2.0       N/A         yes
DGX - 0000:81:00.0(U3)    PEX88096       2.0         2.0       N/A         yes


DGX - 0000:c4:00.0        LR10    92.10.18.00.01  92.10.18.00.01  N/A      yes
DGX - 0000:c5:00.0        LR10    92.10.18.00.01  92.10.18.00.01  N/A      yes
DGX - 0000:c8:00.0        LR10    92.10.18.00.01  92.10.18.00.01  N/A      yes
DGX - 0000:c6:00.0        LR10    92.10.18.00.01  92.10.18.00.01  N/A      yes
DGX - 0000:c9:00.0        LR10    92.10.18.00.01  92.10.18.00.01  N/A      yes
DGX - 0000:c7:00.0        LR10    92.10.18.00.01  92.10.18.00.01  N/A      yes



 Mass Storage
==============
Drive Name/Slot         Model Number        Onboard Version   Manifest   up-to-date
nvme0n1          Samsung MZWLJ3T8HBLS-00007  EPK9CB5Q         EPK9CB5Q     yes
nvme1n1          Samsung MZ1LB1T9HALS-00007  EDA7602Q         EDA7602Q     yes
nvme2n1          Samsung MZ1LB1T9HALS-00007  EDA7602Q         EDA7602Q     yes
nvme3n1          Samsung MZWLJ3T8HBLS-00007  EPK9CB5Q         EPK9CB5Q     yes
nvme4n1          Samsung MZWLJ3T8HBLS-00007  EPK9CB5Q         EPK9CB5Q     yes
nvme5n1          Samsung MZWLJ3T8HBLS-00007  EPK9CB5Q         EPK9CB5Q     yes

 Video BIOS
============
Bus            Model            Onboard Version   Manifest     FUB Updated? up-to-date
0000:07:00.0   A100-SXM4-40GB   92.00.45.00.03   92.00.45.00.03  yes        yes
0000:0f:00.0   A100-SXM4-40GB   92.00.45.00.03   92.00.45.00.03  yes        yes
0000:47:00.0   A100-SXM4-40GB   92.00.45.00.03   92.00.45.00.03  yes        yes
0000:4e:00.0   A100-SXM4-40GB   92.00.45.00.03   92.00.45.00.03  yes        yes
0000:87:00.0   A100-SXM4-40GB   92.00.45.00.03   92.00.45.00.03  yes        yes
0000:90:00.0   A100-SXM4-40GB   92.00.45.00.03   92.00.45.00.03  yes        yes
0000:b7:00.0   A100-SXM4-40GB   92.00.45.00.03   92.00.45.00.03  yes        yes
0000:bd:00.0   A100-SXM4-40GB   92.00.45.00.03   92.00.45.00.03  yes        yes


FPGA
========
Onboard version     Manifest  up-to-date
02.a5               02.a5        yes

Power Supply
==============
                                                           Onboard
ID                  Vendor Model        MFR ID  Status   Version   Manifest  up-to-date
PSU 0: Primary      Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 0: Secondary    Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 0: Community    Delta ECD16010092   Delta     ok       01.07     01.07      yes
PSU 1: Primary      Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 1: Secondary    Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 1: Community    Delta ECD16010092   Delta     ok       01.07     01.07      yes
PSU 2: Primary      Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 2: Secondary    Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 2: Community    Delta ECD16010092   Delta     ok       01.07     01.07      yes
PSU 3: Primary      Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 3: Secondary    Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 3: Community    Delta ECD16010092   Delta     ok       01.07     01.07      yes
PSU 4: Primary      Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 4: Secondary    Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 4: Community    Delta ECD16010092   Delta     ok       01.07     01.07      yes
PSU 5: Primary      Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 5: Secondary    Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 5: Community    Delta ECD16010092   Delta     ok       01.07     01.07      yes

  CPLD
============
                                         Onboard Version   Manifest       up-to-date
MID_CPLD                                 1.03              1.03               yes
MB_CPLD                                  1.03              1.03               yes

* CPLD won't be updated by default (`update_fw all`), use `update_fw CPLD` if it's needed

Known Issues

BMC Incorrectly Reports CPU Motherboard Overvoltage

Issue

The BMC incorrectly reports that the sensors for 3.3V and 5V_STBY on the CPU motherboard exceed the non-critical thresholds. The assertion is reported in the SEL logs.

Explanation

This is an issue with the BMC where it is not interpreting the sensor information properly. The SEL gets filled with voltage messages but otherwise there is no functional impact. The values reported in the SEL confirm that the threshold has not been exceeded.

TEMP_IO0_IB0_P0 Reading not Reported in BMC “Disabled Sensors” List

Issue

The TEMP_IO0_IB0_P0 sensor does not appear in the BMC web UI when it is disabled.

Explanation

This is an issue with the BMC web UI and will be resolved in a future release. You can issue ipmitool sensor or ipmitool sdr list to see information on disabled sensors.

nvipmitool Reports PCIe Correctable Errors as “Asserted”

Issue

nvipmitool includes the text “Asserted” when reporting PCIe correctable errors, without further explanation.

Explanation

“Asserted” just means that correctable errors were found in the test.

KVM “Power On Server” Option is Grayed Out

Issue

If the system is powered off, you may not be able to “power on” the system using the BMC KVM (“Power On Server” option is grayed out).

Explanation

To work around, log in to the BMC Web UI, then navigate to the Power Control dialog and select “Power On”.

BMC Web UI Performance Drop

Issue

Several BMC web UI tasks - such as BMC login, getting SEL lists, or getting SDR lists - may be slower to complete compared to previous BMC versions.

Explanation

NVIDIA is investigating this issue.

SOL Cannot be Activated for a Newly Created User Account

Issue

After creating a new user account, attempts to activate SOL for that account fail.

Explanation

NVIDIA is investigating the issue. To work around, enable the SOL payload for the new user.

Example:

  $ sudo ipmitool sol payload enable 1 5
Then retry activating SOL.
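In the ipmitool usage for sol payload enable, the first argument is the channel and the second is the user ID, so the example enables SOL for user ID 5 on channel 1. As a sketch, the ID of a newly created account can be looked up from the user list instead of being hard-coded; the function name user_id_for and the account name newuser are placeholders.

```shell
# Sketch: look up the numeric user ID for an account name in
# `ipmitool user list <channel>` output read from stdin. Assumes the ID is
# the first column and the name the second.
user_id_for() {
    awk -v name="$1" '$2 == name { print $1 }'
}

# Intended use (not executed here):
#   id=$(sudo ipmitool user list 1 | user_id_for newuser)
#   sudo ipmitool sol payload enable 1 "$id"
```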

Unable to Set Static IPv6 Address Using BMC Web UI

Issue

  1. From the BMC web UI, navigate to Settings -> Network -> Network IP Settings.

  2. Deselect Enable IPv6 DHCP and input an IPv6 Address and Subnet Prefix length, then click Save.

The changes are not made.

Explanation

To work around, set the IPv6 address using the command line.

The BMC KVM May Stop Accepting Keyboard Input on the OS Command Line

Issue

When this occurs, the terminal hangs or stops accepting keystrokes. After continuing to press keys, an error message appears indicating the HID queue is getting full.

Explanation

This may occur if the USB service is not enabled. To resolve, enable USB in the kernel and try again. The following is an example on Red Hat Enterprise Linux:

  1. Remove “nousb” from /boot/efi/EFI/redhat/grub.cfg.

  2. Configure grub using grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg.

  3. Reboot.

  4. Verify USB is enabled by using the command “lsscsi -H | grep usb-storage”.

  5. Try KVM console.
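Before editing the grub configuration, it can be useful to confirm that the running kernel was actually booted with the nousb parameter. The helper below is a sketch with a hypothetical name; it only inspects a command-line string and changes nothing.

```shell
# Sketch: succeed (0) when a kernel command line contains the "nousb"
# parameter as a whole word. Pass the command line as the first argument.
has_nousb() {
    printf '%s\n' "$1" | grep -qw 'nousb'
}

# Intended use on the affected system (not executed here):
#   has_nousb "$(cat /proc/cmdline)" && echo "USB is disabled at boot"
```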

BMC Kernel Panic Upon Power Cycle then BMC Reset Sequence

Issue

BMC kernel panic may occur when performing the following:

  1. Issue ‘ipmitool chassis power cycle’.

  2. Wait several seconds.

  3. Issue ‘ipmitool mc reset cold’.

Explanation

This is a timing issue that results in the loss of IRQ 8, resulting in the kernel panic. The BMC will continue to reboot until it is successful.

REDUNDANCY_PSU Sensor May Report 0x0a80 for Sensor Status

Issue

The REDUNDANCY_PSU sensor status of 0x0a80 indicates that redundancy is lost.

Explanation

NVIDIA is investigating this issue. The reported sensor status is misleading but has no functional impact.

SBIOS “Bootup NumLock State” not Enforced

Issue

When turning NumLock to OFF after setting “Bootup NumLock State” to ON from the SBIOS setup menu, NumLock remains off after rebooting the server. Similarly, when turning NumLock to ON after setting “Bootup NumLock State” to OFF from the SBIOS setup menu, NumLock remains on after rebooting the server.

Explanation

This feature is currently not implemented in the DGX A100 SBIOS.

Updating only Active or Inactive SBIOS Can Cause Internal Compatibility Issues

Issue

If you use the -a (active image only) or -i (inactive image only) option when updating the SBIOS, the fail-safe flag may get set and not removed upon reboot.

Explanation

When updating the SBIOS, both active and inactive SBIOS images must be updated. Do not use the -a or -i option. Instead, let the firmware update container automatically update both active and inactive images by using either ” update_fw all ” or ” update_fw SBIOS “.

IPMITool “Persistent” Flag Does not Work

Issue

The ipmitool persistent flag does not take effect when using the standard command format; for example,

ipmitool chassis bootdev options=persistent, efiboot

The persistent flag does work when part of the raw command.

Explanation

This is an issue with IPMITool. To use the persistent flag, issue it as part of a raw command.

Example: The following raw command corresponds to the example command in the issue description:

ipmitool raw 01 05 e0 04 00 00 00

“e0” specifies PXE boot with EFI.
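As a hedged reading of the raw command, assuming its bytes after the parameter selector are the boot-flag data bytes defined for “Set System Boot Options” in the IPMI 2.0 specification, the values e0 and 04 can be derived from the individual bits. The variable names are illustrative.

```shell
# Boot-flags data byte 1 (IPMI "Set System Boot Options", parameter 5):
#   bit 7 = flags valid, bit 6 = persistent, bit 5 = EFI boot type.
flags=$(( 0x80 | 0x40 | 0x20 ))
# Data byte 2, bits 5:2 = boot device selector; 0001b selects PXE.
device=$(( 1 << 2 ))
printf '%02x %02x\n' "$flags" "$device"   # prints "e0 04"
```

Under that reading, the persistent and EFI options live in the first byte and the PXE selection in the second.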

User is Logged Out of the BMC Web UI After Powering On the System

Issue

To reproduce the issue:

  1. AC power cycle the system.

  2. Log into the BMC web UI and then power on the system, such as through the BMC KVM.

The user is logged out of the BMC web UI.

Explanation

This behavior is the result of the BMC erroneously concluding that the BMC was idle for too long. The AC power cycle resets the BMC RTC to the default value (1999). After powering on the system, the current time is compared to the BMC RTC value and the difference exceeds the timeout value. This is a limitation in the DGX A100 BMC.

SBIOS Versions Might not be Reported After BMC Cold Reboot

Issue

After performing a BMC cold boot, the SBIOS versions (both primary and secondary) are reported as “0” either in the BMC web UI or on the command line.

Explanation

To work around, perform the following.

  1. Reboot

  2. Verify that the active SBIOS version is populated:

    $ sudo ipmitool raw 0x3c 0x24 $(sudo ipmitool raw 0x3c 0x22)
    
  3. Switch to the inactive SBIOS.

    $ sudo ipmitool raw 0x3c 0x23 $(($(sudo ipmitool raw 0x3c 0x22)^1))
    
  4. Reboot again.

  5. Verify that both active and inactive SBIOS versions are populated.

    $ sudo ipmitool raw 0x3c 0x24 0 && sudo ipmitool raw 0x3c 0x24 1
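The “^1” in step 3 is a shell XOR that flips the lowest bit, turning slot 0 into slot 1 and slot 1 into slot 0. The sketch below isolates that step; the function name is hypothetical, and it assumes the 0x22 query prints a single slot index such as 00 or 01, possibly with surrounding whitespace.

```shell
# Sketch: given the slot index reported for the active SBIOS image,
# print the index of the other slot by flipping the lowest bit.
other_sbios_slot() {
    echo $(( $1 ^ 1 ))
}

# Intended use (not executed here):
#   active=$(sudo ipmitool raw 0x3c 0x22)
#   sudo ipmitool raw 0x3c 0x23 "$(other_sbios_slot "$active")"
```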
    

NVSM Incorrectly Reports the Delta PSU Part Number Instead of the Model Numbers

Issue

When issuing show_version or show_fw_manifest, the number associated with the Delta PSU is the part number instead of the model number.

Explanation

This will be resolved in a future release.

BMC KVM Screen May Show “No Signal” Under Certain Conditions

Issue

When attempting to view the DGX A100 console from the BMC Web UI KVM, the screen may show “No Signal” if you cold reset the BMC and then reboot the server.

For example, the issue might occur after performing the following.

  1. Issue the command to cold reset the BMC.

    $ sudo ipmitool mc reset cold
    
  2. Wait about 30 seconds, then issue the command to reboot the system.

    $ sudo reboot
    

Explanation

This is due to a rare race condition between BMC and the SBIOS, and will be resolved in a future update.

BMC SEL Log May Show a Negative Value for Sensor “TEMP_MB_AD_CARD” During AC/DC/Warm reboot

Issue

After any kind of reboot (AC/DC/warm reboot), the BMC SEL log may show a negative value for “Temperature TEMP_MB_AD_CARD0”.

Explanation

This issue will be resolved in a future release.

Setting Up Active Directory Settings May Fail with “Invalid Domain Name” Error

Issue

After logging into the BMC dashboard UI and setting up and enabling Active Directory Authentication, an “Invalid Domain Name” error may occur.

Explanation

If you encounter this error, set up the DNS manually as follows:

  1. Log in to the BMC UI dashboard.

  2. Navigate to Settings > Network Settings > DNS Configuration > “Domain Name Server Setting”.

  3. Find “Domain Name Server Setting” and change “Automatic” to “Manual”.

  4. Set the “DNS Server 1” IP to “8.8.8.8” (the address of dns.google).

  5. Click Save and accept the alert to restart the BMC network.

Systems Won’t PXE Boot After BMC and CEC FW Update

Issue

After updating the BMC and CEC firmware, the system may fail to PXE boot.

Explanation

If you encounter this issue, perform a factory reset of the BMC and reconfigure usernames and passwords.

Using the BMC web UI

  1. Navigate to Maintenance > Preserve Configuration, then clear all check boxes and click Save.

  2. Navigate to Maintenance > Restore Factory Defaults, then click Save.

Using the IPMITool OEM Commands

  1. Specify “do not preserve configuration”.

    sudo ipmitool raw 0x32 0xba 0x00 0x00
    
  2. Restore defaults.

    sudo ipmitool raw 0x32 0x66
    

BMC UI May not be Accessible from Mac OS

Issue

When attempting to connect to the DGX A100 BMC from a system with Mac OS, a “Your connection is not private” message appears that prevents access to the BMC.

Explanation

Starting with version 0.13.6, the BMC provides a self-signed certificate which Mac OS flags in the browser. Most browsers will let you either accept the risk and continue, or add the certificate to the keychain and continue. The Chrome and Opera browsers, however, do not provide these options and so Mac OS users will not be able to access the BMC from the Chrome or Opera browser.

To access the DGX A100 BMC, Mac OS users can use Safari or Firefox, which provide an access path.

Unable to Launch BMC Dashboard under Pre-84.01 Firefox

Issue

After updating the BMC, attempts to access the BMC dashboard using Firefox versions earlier than 84.01 fail with a “Secure Connection Fail” message.

Explanation

To work around, update Firefox to version 84.01 or later.

System Starts the POST Process Several Times During Boot After Updating the SBIOS

Issue

After updating the SBIOS and rebooting the system, the NVIDIA splash screen appears and disappears several times before boot is completed.

Explanation

After updating the SBIOS, several component states are cleared and the system may reboot automatically 3-4 times to reset all the components. This is expected behavior.

Restoring BMC Default Affects Power LED

Issue

After restoring the factory default settings using the BMC,

  • The Power/Status LED flashes continuously after rebooting the server.

  • The Power/Status LED stays on after powering off the server.

Explanation

NVIDIA is investigating this issue. There is no functional impact.

The “Relative Mouse Mode” Option is grayed out in the KVM Menu

Issue

In the BMC Remote KVM, the Mouse > Relative Mouse Mode option is grayed out and unavailable.

Explanation

To work around, enable Relative Mouse Mode from the BMC web UI as follows:

Navigate to Settings > KVM Mouse Setting, then select “Relative Positioning (Linux)” from the Mouse Mode Configuration dialog and click Save.