DGX A100 System Firmware Update Container Version 22.12.1

The DGX Firmware Update container version 22.12.1 is available.

  • Package name: nvfw-dgxa100_22.12.1_221208.tar.gz

  • Run file name: nvfw-dgxa100_22.12.1_221208.run

  • Image name: nvfw-dgxa100:22.12.1

  • ISO image: DGXA100_FWUI-22.12.1-2022-12-09-01-25-20.iso

  • PXE netboot: pxeboot-DGXA100_FWUI-22.12.1.tgz

Highlights and Changes in this Release

This release is supported with the following DGX OS software:

  • DGX OS 5.4 or later.

Important

This firmware update container does NOT support DGX OS 4.99.xx. To use the container on DGX A100 servers, update to DGX OS 5.4 or later.

  • EL7-22.08 or later (see “Special Instructions for Red Hat Enterprise Linux 7” in the DGX A100 System Firmware Container Release Notes)

  • EL8-21.08 Update 1 or later

  • EL8-22.08 or later

Fixed BMC issues

  • Improved SNMP trap handling and updated SNMP MIB with additional description for better trap information.

  • Handled a rare NTP server configuration settings issue from BMC WebUI.

  • The BMC update includes software security enhancements. See the NVIDIA Security Bulletin DGX - December 2022 for details.

  • Improvements in Redfish

    • Addressed a Redfish URI timeout issue by appropriately handling the session authentication mechanism.

    • Addressed a rare Redfish URI connection failure issue by appropriately handling the Redfish session authentication mechanism.

    • Fixed Redfish chassis power state inconsistencies.

    • Fixed Redfish to report accurate states and readings for thermal and power sensors.

    • Revised Redfish chassis handling to fix an identify LED status reporting issue.

    • Reduced the frequency of UPnP (Universal Plug and Play) SSDP (Simple Service Discovery Protocol) advertisements from the BMC.

Fixed SBIOS Issues

  • Fixed issues relating to Redfish reporting of PCIe device types and speeds.

  • Removed unimplemented setup menu options for User Defaults and Boot NumLock State.

  • Updated AGESA to version 1.0.0.E.

  • The SBIOS update includes software security enhancements. See the NVIDIA Security Bulletin DGX - May 2023 for details.

Added Support

  • Added M.2 Micron 7400 Gen4 drive.

Known Issues

  • For more information, see Known Issues.

Contents of the DGX A100 System Firmware Container

This container includes the firmware binaries and update utilities for the firmware listed in the following table.

  • If you are updating from 21.11.4, the total update time is approximately sixty-eight (68) minutes.

  • If you are updating from 22.5.5, the total update time is approximately sixty-one (61) minutes.

    Component | Version | Key Changes | Update time from 21.11.4 (minutes) | Update time from 22.5.4 (minutes)
    BMC (via CEC) | 00.19.07 | See DGX A100 BMC Changes | 32 | 31
    SBIOS | 1.18 | See DGX A100 SBIOS Changes | 6 | 6
    Broadcom 88096 PCIe switch board | 0.2.0 | No change | 0 | 0
    BMC CEC SPI (MB_CEC) | 3.28 | No change | 0 | 0
    PEX88064 Retimer | 3.1.0 | No change | 1 | 0
    PEX88080 Retimer (U225) | 3.1.0 | No change | 1 | 0
    PEX88080 Retimer (U666) | 4.1.0 | New update | 1 | 1
    NVSwitch BIOS | 92.10.18.00.01 | No change | 0 | 0
    VBIOS (A100 PG506 SKU200 (40GB))/VBIOS (A100 40GB) | 92.00.45.00.03 | No change | 0 | 0
    VBIOS (A100 PG506 SKU210 (80GB))/VBIOS (A100 80GB) | 92.00.9E.00.01 | New update | 2 | 2
    VBIOS (A800 PG506 SKU215 (80GB)) | 92.00.A4.00.01 | New support | 2 | 2
    VBIOS (A100 PG510 SKU200 (40GB)) | 92.00.81.00.01 | New support | 2 | 2
    VBIOS (A100 PG510 SKU210 (80GB)) | 92.00.9E.00.03 | New support | 2 | 2
    VBIOS (A800 PG510 SKU215 (80GB)) | 92.00.A4.00.05 | New support | 2 | 2
    VBIOS (A100 SystemB 80GB) | 92.00.81.00.06 | No change | 0 | 0
    U.2 NVMe (Samsung) | EPK9CB5Q | No change | 0 | 0
    U.2 NVMe (Kioxia) | 105 | No change | 0 | 0
    M.2 NVMe (Samsung version 1) | EDA7602Q | No change | 0 | 0
    M.2 NVMe (Samsung version 2) | GDC7302Q | No change | 1 | 0
    M.2 Micron 7400 Gen4 | E1MU23BC | New support | 1 | 1
    FPGA (GPU sled) | 4.02 | New update | 20 | 20
    CEC1712 SPI (GPU sled) | 4 | No change | 3 | 0
    PSU (Delta rev04) | Primary 1.7/Secondary 1.7/Communication 1.7 | No change | 0 | 0
    PSU (Delta rev03) | Primary 1.6/Secondary 1.6/Communication 1.7 | No change | 0 | 0
    PSU (Delta rev02) | Primary 1.6/Secondary 1.6/Communication 1.7 | No change | 0 | 0
    PSU (LiteOn) | 908 | No change | 0 | 0

Updating Components with Secondary Images

Some firmware components provide a secondary image as backup. The following is the policy when updating those components:

  • SBIOS: The two images are referred to as active and inactive, where the active is the currently running image and the inactive is the backup image.

    When using update_fw all, the update container updates both active and inactive images.

  • BMC: The two images are referred to as active and inactive, where the active is the currently running image and the inactive is the backup image.

    The update container can only update the inactive image, and will update it only if the active image needs to be updated. After the update is completed, the updated inactive image becomes the active image. Because the active image is now updated, subsequent update_fw all commands will not update the inactive image. To update the inactive image in this case, use update_fw BMC --inactive. Since the container does not support updating the active image directly, commands such as update_fw BMC -a -f will not work.
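As a rough illustration of the BMC policy above, the following sketch reads show_version output on standard input and reports whether the inactive BMC image still differs from the manifest version, which is the situation where update_fw BMC --inactive applies. The function name bmc_inactive_outdated is hypothetical, and the parsing assumes the BMC section has the layout shown in this document's sample output (onboard version and manifest as the third-to-last and second-to-last fields). Treat it as a sketch, not a supported tool.

```shell
# Hypothetical helper: exit status 0 when the onboard version of the inactive
# BMC image differs from the manifest version in show_version output (stdin).
# Assumes the BMC section layout shown in the sample output of this document.
bmc_inactive_outdated() {
    awk '
        /^[[:space:]]*BMC/ { in_bmc = 1; next }   # section title, e.g. " BMC DGX"
        /^[[:space:]]*$/   { in_bmc = 0 }         # a blank line ends the section
        # Append "" to force a string comparison of the two version fields.
        in_bmc && $1 == "1:Inactive" && ($(NF-2) "") != ($(NF-1) "") { stale = 1 }
        END { exit !stale }
    '
}
```

One possible use, assuming the .run file is in the current directory, is to pipe the output of sudo ./nvfw-dgxa100_22.12.1_221208.run show_version into bmc_inactive_outdated and run update_fw BMC --inactive only when the function succeeds.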

DO NOT UPDATE DGX A100 CPLD FIRMWARE UNLESS INSTRUCTED

When updating DGX A100 firmware using the Firmware Update Container, do not update the CPLD firmware unless the DGX A100 system is being upgraded from 320GB to 640GB.

The current DGX A100 Firmware Update Container will not automatically update the CPLD firmware (for example, when running update_fw all). It is possible to update the CPLD firmware using “update_fw CPLD”; however, it is strongly recommended that the CPLD firmware not be updated manually unless specifically instructed by NVIDIA Enterprise Support (to contact support, email enterprisesupport@nvidia.com). If the DGX A100 is upgraded from 320GB to 640GB, the CPLD firmware update should be performed as instructed.

Special Instructions for Red Hat Enterprise Linux 7

This section describes the actions that must be taken before updating firmware on DGX A100 systems installed with Red Hat Enterprise Linux. There are two options for meeting these requirements.

Option 1: Update to EL7-22.05

Refer to the DGX Software for Red Hat Enterprise Linux 7 Release Notes for more information.

Important

Updating the DGX software for Red Hat Enterprise Linux will update the Red Hat Enterprise Linux installation to 7.9 or later. If you do not want to update your Red Hat Enterprise Linux 7 installation, then choose Option 2.

Option 2: Install mpt3sas 31.101.01.00-0

These instructions apply if:

  • You do not want to update your Red Hat Enterprise Linux installation, and

  • Your system is currently installed with Red Hat Enterprise Linux 7.7 or later.

Note

If your system is installed with Red Hat Enterprise Linux 7.6 or earlier, contact NVIDIA Enterprise Support for assistance.

  1. Perform this step if your system is no longer pointing to the NVIDIA DGX software repository.

    1. On Red Hat Enterprise Linux, run the following commands to enable additional repositories required by the DGX software.

      sudo subscription-manager repos --enable=rhel-7-server-extras-rpms
      sudo subscription-manager repos --enable=rhel-7-server-optional-rpms
      
    2. Run the following command to install the DGX software installation package and enable the NVIDIA DGX software repository.

      Attention

      By running these commands you are confirming that you have read and agree to be bound by the DGX Software License Agreement. You are also confirming that you understand that any pre-release software and materials available that you elect to install in a DGX may not be fully functional, may contain errors or design flaws, and may have reduced or different security, privacy, availability, and reliability standards relative to commercial versions of NVIDIA software and materials, and that you use pre-release versions at your risk.

      sudo yum install -y \
      https://international.download.nvidia.com/dgx/repos/rhel-files/dgx-repo-setup-20.03-1.el7.x86_64.rpm
      
  2. Install mpt3sas 31.101.01.00-0:

    sudo yum install mpt3sas-dkms
    
  3. Load the mpt3sas driver into the Red Hat Enterprise Linux kernel:

    sudo modprobe mpt3sas
    

    You can verify that the correct mpt3sas version is installed by issuing the following:

    yum list installed mpt3sas-dkms
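To check the driver itself rather than the package list, one option is to compare the version string that modinfo reports for the mpt3sas module with the expected release. The sketch below assumes the module reports a version beginning with 31.101.01.00, a value inferred from the package version named in this section and not confirmed by it. The function name mpt3sas_version_ok is illustrative.

```shell
# Hypothetical check: succeed when the mpt3sas module available to the
# running kernel reports a version starting with the expected string.
# The default, 31.101.01.00, is an assumption based on the package version.
mpt3sas_version_ok() {
    expected=${1:-31.101.01.00}
    actual=$(modinfo -F version mpt3sas 2>/dev/null) || return 1
    case $actual in
        "$expected"*) return 0 ;;
        *)            return 1 ;;
    esac
}
```

A different expected version can be passed as the first argument, for example mpt3sas_version_ok 31.101.01.00 && echo "mpt3sas version matches".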
    

Instructions for Updating Firmware

This section provides a simple way to update the firmware on the system using the firmware update container.

The commands use the .run file, but you can also use any method described in Using the DGX A100 FW Update Utility.

Caution

  • Do not log into the BMC dashboard UI while a firmware update is in progress.

  • Stop all unnecessary system activities before attempting to update firmware.

  • Stop all GPU activity, including accessing nvidia-smi, as this can prevent the VBIOS from updating.

  • When issuing update_fw all, stop the following services if they are launched from Docker through the docker run command:

    • dcgm-exporter

    • nvidia-dcgm

    • nvidia-fabricmanager

    • nvidia-persistenced

    • xorg-setup

    • lightdm

    • nvsm-core

    • kubelet

    The container will attempt to stop these services automatically, but will be unable to stop any that are launched from Docker.

  • Do not add additional loads on the system (such as user jobs, diagnostics, or monitoring services) while an update is in progress. A high workload can disrupt the firmware update process and result in an unusable component.

  • When initiating an update, the update software assists in determining the activity state of the DGX system and provides a warning if it detects that activity levels are above a predetermined threshold. If the warning is encountered, you are strongly advised to take action to reduce the workload before proceeding with the update.
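Because the container cannot stop services that were launched through docker run, it can help to look for them before starting. The following sketch filters a container listing for the service names given in the caution above. It assumes that the container name or image name contains the service name, which may not hold on every system. find_blocking_containers is a hypothetical helper, and the listing is expected in the "NAME IMAGE" form produced by docker ps --format '{{.Names}} {{.Image}}'.

```shell
# Service names taken from the caution list above.
FW_BLOCKING_SERVICES="dcgm-exporter nvidia-dcgm nvidia-fabricmanager nvidia-persistenced xorg-setup lightdm nvsm-core kubelet"

# Hypothetical helper: read "NAME IMAGE" lines on stdin and print those that
# mention one of the listed services. Prints nothing when none match.
find_blocking_containers() {
    # Turn the space-separated list into an extended-regex alternation.
    pattern=$(printf '%s' "$FW_BLOCKING_SERVICES" | tr ' ' '|')
    grep -E -e "$pattern" || true
}
```

Any container that the filter prints can then be stopped with docker stop before running update_fw all.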

  1. Check if updates are needed by checking the installed versions.

    $ sudo ./nvfw-dgxa100_22.12.1_221208.run show_version
    
    • If there is “no” in any up-to-date column for updatable firmware, then continue with the next step.

    • If all up-to-date column entries are “yes”, then no updates are needed and no further action is necessary.

  2. Perform the update for all firmware supported by the container.

    $ sudo ./nvfw-dgxa100_22.12.1_221208.run update_fw all
    

    Depending on the firmware that is updated, you may be prompted to either reboot the system or power cycle the system.

    • If you are prompted to reboot, issue

      $ sudo reboot
      
    • If you are prompted to power cycle, you can issue the following two commands (there is no output with the first command).

      $ sudo ipmitool raw 0x3c 0x04
      $ sudo ipmitool chassis power cycle
      
  3. After rebooting or power cycling the system, you may need to perform another update_fw all to update other firmware.

    • Either repeat Step 1 to check if updates are needed and then perform Step 2 if needed, or

    • Repeat Step 2 just in case updates are needed.

    If you perform another update_fw all, you may be prompted again to either reboot the system or power cycle the system.

    See DGX A100 Firmware Update Process for more information about the update process.

You can verify the update by issuing the following.

$ sudo ./nvfw-dgxa100_22.12.1_221208.run show_version

Sample output for a DGX A100 640GB system:

  CEC
============
                                             Onboard Version   Manifest       up-to-date
MB_CEC(enabled)                              3.28              3.28               yes
Delta_CEC(enabled)                           4.00              4.00               yes

 BMC DGX
=========
Image Id              Status         Location      Onboard Version   Manifest  up-to-date
0:Active   Boot       Online         Local         00.17.07          00.19.07     yes
1:Inactive Updatable                 Local         00.17.07          00.19.07     yes

 SBIOS
=======
Image Id                           Onboard Version   Manifest        up-to-date
0:Active   Boot Updatable          1.18              1.18               yes
1:Inactive Updatable               1.18              1.18               yes

  Switches
============
PCI Bus#                      Model          Onboard Version   Manifest        FUB Updated?  up-to-date
DGX - 0000:91:00.0(U261)      88064_Retimer  3.1.0             3.1.0                 N/A        yes
DGX - 0000:88:00.0(U260)      88064_Retimer  3.1.0             3.1.0                 N/A        yes
DGX - 0000:4f:00.0(U262)      88064_Retimer  3.1.0             3.1.0                 N/A        yes
DGX - 0000:48:00.0(U225)      88080_Retimer  3.1.0             3.1.0                 N/A        yes


DGX - 0000:01:00.0(U1)        PEX88096       2.0               2.0                   N/A        yes
DGX - 0000:81:00.0(U3)        PEX88096       2.0               2.0                   N/A        yes
DGX - 0000:b1:00.0(U4)        PEX88096       2.0               2.0                   N/A        yes
DGX - 0000:41:00.0(U2)        PEX88096       2.0               2.0                   N/A        yes


DGX - 0000:c4:00.0            LR10           92.10.18.00.01    92.10.18.00.01        N/A        yes
DGX - 0000:c5:00.0            LR10           92.10.18.00.01    92.10.18.00.01        N/A        yes
DGX - 0000:c6:00.0            LR10           92.10.18.00.01    92.10.18.00.01        N/A        yes
DGX - 0000:c7:00.0            LR10           92.10.18.00.01    92.10.18.00.01        N/A        yes
DGX - 0000:c8:00.0            LR10           92.10.18.00.01    92.10.18.00.01        N/A        yes
DGX - 0000:c9:00.0            LR10           92.10.18.00.01    92.10.18.00.01        N/A        yes



 Mass Storage
==============
Drive Name/Slot         Model Number         Onboard Version   Manifest        up-to-date
nvme0n1          Samsung MZWLJ3T8HBLS-00007  EPK9CB5Q          EPK9CB5Q           yes
nvme1n1          Samsung MZWLJ3T8HBLS-00007  EPK9CB5Q          EPK9CB5Q           yes
nvme2n1          Samsung MZ1LB1T9HALS-00007  EDA7602Q          EDA7602Q           yes
nvme3n1          Samsung MZ1LB1T9HALS-00007  EDA7602Q          EDA7602Q           yes
nvme4n1          Samsung MZWLJ3T8HBLS-00007  EPK9CB5Q          EPK9CB5Q           yes
nvme5n1          Samsung MZWLJ3T8HBLS-00007  EPK9CB5Q          EPK9CB5Q           yes
nvme6n1          Samsung MZWLJ3T8HBLS-00007  EPK9CB5Q          EPK9CB5Q           yes
nvme7n1          Samsung MZWLJ3T8HBLS-00007  EPK9CB5Q          EPK9CB5Q           yes
nvme8n1          Samsung MZWLJ3T8HBLS-00007  EPK9CB5Q          EPK9CB5Q           yes
nvme9n1          Samsung MZWLJ3T8HBLS-00007  EPK9CB5Q          EPK9CB5Q           yes

 Video BIOS
============
Bus            Model            Onboard Version   Manifest         FUB Updated?  up-to-date
0000:07:00.0   A100-SXM4-80GB   92.00.45.00.05    92.00.45.00.05         yes        yes
0000:0f:00.0   A100-SXM4-80GB   92.00.45.00.05    92.00.45.00.05         yes        yes
0000:47:00.0   A100-SXM4-80GB   92.00.45.00.05    92.00.45.00.05         yes        yes
0000:4e:00.0   A100-SXM4-80GB   92.00.45.00.05    92.00.45.00.05         yes        yes
0000:87:00.0   A100-SXM4-80GB   92.00.45.00.05    92.00.45.00.05         yes        yes
0000:90:00.0   A100-SXM4-80GB   92.00.45.00.05    92.00.45.00.05         yes        yes
0000:b7:00.0   A100-SXM4-80GB   92.00.45.00.05    92.00.45.00.05         yes        yes
0000:bd:00.0   A100-SXM4-80GB   92.00.45.00.05    92.00.45.00.05         yes        yes


Power Supply
==============
ID                       Vendor Model        MFR ID              Revision  Status    Onboard Version     Manifest       up-to-date
PSU 0: Communication     Delta ECD16010092   Delta               03        ok        01.07               01.07             yes
PSU 0: Secondary         Delta ECD16010092   Delta               03        ok        01.06               01.06             yes
PSU 0: Primary           Delta ECD16010092   Delta               03        ok        01.06               01.06             yes
PSU 1: Communication     Delta ECD16010092   Delta               03        ok        01.07               01.07             yes
PSU 1: Secondary         Delta ECD16010092   Delta               03        ok        01.06               01.06             yes
PSU 1: Primary           Delta ECD16010092   Delta               03        ok        01.06               01.06             yes
PSU 2: Communication     Delta ECD16010092   Delta               03        ok        01.07               01.07             yes
PSU 2: Secondary         Delta ECD16010092   Delta               03        ok        01.06               01.06             yes
PSU 2: Primary           Delta ECD16010092   Delta               03        ok        01.06               01.06             yes
PSU 3: Communication     Delta ECD16010092   Delta               03        ok        01.07               01.07             yes
PSU 3: Secondary         Delta ECD16010092   Delta               03        ok        01.06               01.06             yes
PSU 3: Primary           Delta ECD16010092   Delta               03        ok        01.06               01.06             yes
PSU 4: Communication     Delta ECD16010092   Delta               03        ok        01.07               01.07             yes
PSU 4: Secondary         Delta ECD16010092   Delta               03        ok        01.06               01.06             yes
PSU 4: Primary           Delta ECD16010092   Delta               03        ok        01.06               01.06             yes
PSU 5: Communication     Delta ECD16010092   Delta               03        ok        01.07               01.07             yes
PSU 5: Secondary         Delta ECD16010092   Delta               03        ok        01.06               01.06             yes
PSU 5: Primary           Delta ECD16010092   Delta               03        ok        01.06               01.06             yes


  CPLD
============
                                             Onboard Version   Manifest       up-to-date
MB_CPLD                                      1.05              1.05               yes
MID_CPLD                                     1.03              1.03               yes

* CPLD won't be updated by default (`update_fw all`), use `update_fw CPLD` if it's needed

FPGA
========
Onboard version     Manifest  up-to-date
03.0e               03.0e        yes
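
The up-to-date column in output like the sample above can also be checked mechanically, for example from a maintenance script. This sketch prints every row whose last column is no and succeeds only if at least one such row exists. It assumes that up-to-date is always the final column, as in the sample, and the function name fw_rows_needing_update is illustrative.

```shell
# Hypothetical helper: read show_version output on stdin, print rows whose
# final column is "no", and exit 0 only when at least one row needs an update.
fw_rows_needing_update() {
    awk '$NF == "no" { found = 1; print } END { exit !found }'
}
```

For example, piping the output of sudo ./nvfw-dgxa100_22.12.1_221208.run show_version into fw_rows_needing_update lists the components that still need updating, and a non-zero exit status indicates that every row reports yes.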

Known Issues

VBIOS cannot update due to running service processes

Issue

VBIOS update fails on Red Hat Enterprise Linux 9 because a system service or process is holding the resource to be upgraded.

Explanation

The following services (system processes) must be stopped manually for the firmware update to start:

  • process nvidia-persistenced(pid 5372)

  • process nv-hostengine(pid 2723)

  • process cache_mgr_event(pid 5276)

  • process cache_mgr_main(pid 5278)

  • process dcgm_ipc(pid 5279)

If xorg is holding the resources, try to stop it by running

$ sudo systemctl stop <display manager>

where <display manager> can be obtained by running

$ cat /etc/X11/default-display-manager
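
On Debian-based systems, /etc/X11/default-display-manager usually holds the full path of the display manager binary (for example /usr/sbin/gdm3), while systemctl expects a unit name. A minimal sketch of deriving one from the other follows. It assumes the systemd unit is named after the binary, which is common but not guaranteed, and display_manager_unit is a hypothetical helper.

```shell
# Hypothetical helper: print the display manager name derived from the path
# stored in the given file (default: /etc/X11/default-display-manager).
# Assumes the systemd unit shares the name of the binary.
display_manager_unit() {
    file=${1:-/etc/X11/default-display-manager}
    [ -r "$file" ] || return 1
    # Accept a first line with or without a trailing newline; fail if empty.
    read -r path < "$file" || [ -n "$path" ] || return 1
    basename "$path"
}
```

With this helper, the stop command could be written as sudo systemctl stop "$(display_manager_unit)".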

[BCM users only] Firmware Update Completes with Error on Base Command Manager

Issue

When attempting to update the new -0R4 CPU Trays, a failure occurs during the update process where FWUC fails to list services.

Failure messages may include:

  • Failed to install DGX 88064_Retimer dev 91 3.1.0

  • Unable to unload NVIDIA drivers. The following process(es)/service(s) need to be stopped in order for switch firmware update to occur:

  • <blank>

Workaround

  1. Run:

    $ scontrol update NodeName=hostname State=drain Reason="FW update"
    
  2. Wait for the jobs on the host to complete and for the node status to show drained.

    Note

    If the node is reported as draining rather than drained, jobs are still running and the node is not ready. Proceed to step 3 only after the following command shows the node as drained.

    $ sinfo --state=drained | grep hostname
    
  3. Stop the slurmd service on the compute node.

    $ ansible -i /opt/provisioning/inventory/ --become -m shell -a 'systemctl stop slurmd.service' 'hostname'
    
  4. After the firmware update, if the host has been rebooted, change the host state to resume:

    $ scontrol update NodeName=hostname state=resume
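
Step 2 of the workaround can be automated by polling the node state until Slurm reports it as drained. The sketch below is one way to do that with sinfo. The name wait_for_drained, the 30-second default interval, and the choice to keep waiting on any state other than drained are assumptions made for illustration.

```shell
# Hypothetical helper: poll Slurm until the node reports a drained state.
# Usage: wait_for_drained NODE [INTERVAL_SECONDS]
wait_for_drained() {
    node=$1
    interval=${2:-30}
    while :; do
        # -h: no header, -n: restrict to this node, -o %T: extended state name
        state=$(sinfo -h -n "$node" -o '%T' | head -n 1)
        case $state in
            drained*) return 0 ;;   # no jobs left on the node
            *) echo "node $node is '$state'; waiting" >&2 ;;
        esac
        sleep "$interval"
    done
}
```

The loop has no timeout, so a production script would probably add one. A call such as wait_for_drained hostname 60 blocks until the node is drained, after which the slurmd service can be stopped.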