Using the DGX A100 FW Update Utility

The NVIDIA DGX A100 System Firmware Update utility is provided in a tarball and also as a .run file. Copy the files to the DGX A100 system, then update the firmware using one of the following three methods:

  • Using NVSM, which provides convenient commands to update the firmware using the firmware update container

  • Using Docker to run the firmware update container

  • Using the .run file, which is a self-extracting package that embeds the firmware update container tarball

Note

Fan speeds may increase while updating the BMC firmware. This is a normal part of the BMC firmware update process.

Requirements

Refer to the Highlights and Changes in the specific release for the DGX OS and EL7/EL8 versions supported by the firmware update container.

The firmware update container requires that the following kernel modules are installed on the system:

  • nvidia_vgpu_vfio

  • nvidia-uvm

  • nvidia-drm

  • nvidia-modeset

  • nv_peer_mem

  • nvidia_peermem

  • nvidia

  • i2c_nvidia_gpu

  • ipmi_devintf

  • ipmi_ssif

  • acpi_ipmi

  • ipmi_si

  • ipmi_msghandler

These modules are installed as part of the standard DGX OS, EL7, or EL8 installation. The container may fail if any of these modules are not installed. Be sure to follow the provided instructions when installing or upgrading DGX OS, EL7, or EL8.
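
As a pre-check, a minimal sketch like the following can report any required module that is not installed for the running kernel. It assumes that `modinfo` is available and that a successful `modinfo` lookup is a reasonable proxy for "installed". The function name `missing_modules` and its optional probe argument are hypothetical, introduced only for illustration.

```shell
# Required kernel modules, copied from the list above.
REQUIRED_MODULES="nvidia_vgpu_vfio nvidia-uvm nvidia-drm nvidia-modeset
nv_peer_mem nvidia_peermem nvidia i2c_nvidia_gpu ipmi_devintf ipmi_ssif
acpi_ipmi ipmi_si ipmi_msghandler"

# Print every required module that the probe command cannot find.
# $1 (optional, hypothetical): probe command. It defaults to modinfo,
# which looks the module up for the running kernel.
missing_modules() {
    mm_probe="${1:-modinfo}"
    for mm_mod in $REQUIRED_MODULES; do
        "$mm_probe" "$mm_mod" >/dev/null 2>&1 || echo "$mm_mod"
    done
}
```

Calling `missing_modules` with no arguments prints nothing when every module is found; any name it prints should be investigated before running the container.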

Caution

Observe the following before running the firmware update container:

  • Do not log into the BMC dashboard UI while a firmware update is in progress.

  • Stop all unnecessary system activities before attempting to update firmware.

  • Stop all GPU activity, including running nvidia-smi, because GPU activity can prevent the VBIOS from updating.

  • When issuing update_fw all, stop the following services if they are launched from Docker through the docker run command:

    • dcgm-exporter

    • nvidia-dcgm

    • nvidia-fabricmanager

    • nvidia-persistenced

    • xorg-setup

    • lightdm

    • nvsm-core

    • kubelet

    The container will attempt to stop these services automatically, but will be unable to stop any that are launched from Docker.

  • Do not add additional loads on the system (such as user jobs, diagnostics, or monitoring services) while an update is in progress. A high workload can disrupt the firmware update process and result in an unusable component.

  • When you initiate an update, the update software checks the activity state of the DGX system and warns you if it detects that activity levels are above a predetermined threshold. If you see this warning, you are strongly advised to reduce the workload before proceeding with the update.
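
To find services from the list above that were started with docker run, one option is to scan the running containers. The sketch below is a heuristic only: it assumes that a container's name or image contains the service name, which may not hold on every system. The function name `containers_to_stop` is hypothetical.

```shell
# Services that the container cannot stop when they run under Docker.
SERVICES="dcgm-exporter nvidia-dcgm nvidia-fabricmanager nvidia-persistenced
xorg-setup lightdm nvsm-core kubelet"

# Read "NAME IMAGE" lines on stdin and print the name of each container
# whose name or image mentions one of the listed services.
containers_to_stop() {
    while read -r cts_name cts_image; do
        for cts_svc in $SERVICES; do
            case "$cts_name $cts_image" in
                *"$cts_svc"*) echo "$cts_name"; break ;;
            esac
        done
    done
}

# Usage on the DGX system:
#   sudo docker ps --format '{{.Names}} {{.Image}}' | containers_to_stop
```

Each name that is printed can then be stopped with `sudo docker stop <name>` before starting the update.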

Using NVSM

The NVIDIA DGX A100 system software includes the Docker software required to run the container.

  1. Copy the tarball to a location on the DGX system.

  2. From the directory where you copied the tarball, enter the following command to load the container image.

    $ sudo docker load -i nvfw-dgxa100_24.6.1_240604.tar.gz
    
  3. To verify that the container image is loaded, enter the following.

    $ sudo docker images
    
    REPOSITORY    TAG
    nvfw-dgxa100  24.6.1
    
  4. Using NVSM interactive mode, enter the firmware update module.

    $ sudo nvsm
    nvsm-> cd systems/localhost/firmware/install
    
  5. Set the flags corresponding to the action you want to take.

    nvsm(/systems/localhost/firmware/install)-> set Flags=<option>
    

    See the Command and Argument List section below for the list of common flags.

  6. Set the container image to run.

    nvsm(/systems/localhost/firmware/install)-> set DockerImageRef=nvfw-dgxa100:24.6.1
    
  7. Run the command.

    nvsm(/systems/localhost/firmware/install)-> start
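
Steps 4 through 7 can be summarized for a given firmware command with a small helper. The following sketch only prints the lines to type at the NVSM prompt; it does not drive NVSM itself. It applies the rule that NVSM needs a backslash before each space in the Flags value. The function name `nvsm_steps` is hypothetical.

```shell
# Print the lines to type at the NVSM prompt for one firmware command.
# $1: container image reference
# remaining arguments: the command and its arguments,
#                      for example: update_fw BMC SBIOS
nvsm_steps() {
    ns_image="$1"
    shift
    # NVSM needs a backslash before every space in the Flags value.
    ns_flags=$(printf '%s' "$*" | sed 's/ /\\ /g')
    printf '%s\n' \
        "cd systems/localhost/firmware/install" \
        "set Flags=$ns_flags" \
        "set DockerImageRef=$ns_image" \
        "start"
}
```

For example, `nvsm_steps nvfw-dgxa100:24.6.1 update_fw all` prints the four lines for a full update, with the second line reading `set Flags=update_fw\ all`.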
    

Using docker run

The NVIDIA DGX A100 system software includes the Docker software required to run the container.

  1. Copy the tarball to a location on the DGX system.

  2. From the directory where you copied the tarball, enter the following command to load the container image.

    $ sudo docker load -i nvfw-dgxa100_24.6.1_240604.tar.gz
    
  3. To verify that the container image is loaded, enter the following.

    $ sudo docker images
    
    REPOSITORY    TAG
    nvfw-dgxa100  24.6.1
    
  4. Use the following syntax to run the container image.

    $ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgxa100:24.6.1 <command> [arg1] [arg2] ... [argn]
    

See the Command and Argument List section below for the list of common commands and arguments.
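
Because the docker run options are the same for every command, they can be wrapped in a shell function. The following is a minimal sketch, not part of the utility: the function name `nvfw` and the variables `NVFW_IMAGE` and `NVFW_DRY_RUN` are hypothetical. With `NVFW_DRY_RUN=1` the function prints the command instead of running it, which is useful for checking the arguments first.

```shell
# Image to run; override NVFW_IMAGE to use a different release.
NVFW_IMAGE="${NVFW_IMAGE:-nvfw-dgxa100:24.6.1}"

# Run (or, with NVFW_DRY_RUN=1, only print) the firmware update container
# with the options shown in step 4. All arguments are passed through as
# the command and its arguments.
nvfw() {
    if [ "${NVFW_DRY_RUN:-0}" = "1" ]; then
        echo sudo docker run --rm --privileged -ti -v /:/hostfs "$NVFW_IMAGE" "$@"
    else
        sudo docker run --rm --privileged -ti -v /:/hostfs "$NVFW_IMAGE" "$@"
    fi
}
```

For example, `NVFW_DRY_RUN=1 nvfw show_version` prints the full docker run command without starting the container.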

Note

If you do not have the tarball file, but you do have the .run file, you can extract the tarball from the .run file by issuing the following:

    $ sudo ./nvfw-dgxa100_24.6.1_240604.run -x

Using the .run File

The update container is also available as a .run file. The .run file uses Docker or Podman software if either is installed on the system, but can also be run without either installed.

On Red Hat Enterprise Linux, the .run file uses Podman and attempts to use Docker if Podman is not available. On other platforms, it uses Docker. You can override this behavior by specifying -docker or -podman.

  1. After obtaining the .run file, make the file executable.

    $ chmod +x nvfw-dgxa100_24.6.1_240604.run
    
  2. Use the following syntax to run the container image.

    $ sudo ./nvfw-dgxa100_24.6.1_240604.run <command> [arg1] [arg2] ... [argn]
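
The default runtime selection described above can be written out as a small decision function. This is a sketch of the documented order only, not the .run file's actual code. The function name `pick_runtime` is hypothetical, and identifying Red Hat Enterprise Linux by the `ID` field of /etc/os-release is an assumption.

```shell
# Sketch of the documented selection order; not the .run file's own logic.
# $1: OS ID (assumed to come from the ID field of /etc/os-release)
# $2: "yes" if podman is installed
# $3: "yes" if docker is installed
pick_runtime() {
    if [ "$1" = "rhel" ] && [ "$2" = "yes" ]; then
        echo podman
    elif [ "$3" = "yes" ]; then
        echo docker
    else
        echo none   # the .run file can also run without either one
    fi
}

# Usage on a live system:
#   . /etc/os-release
#   has() { command -v "$1" >/dev/null 2>&1 && echo yes || echo no; }
#   pick_runtime "$ID" "$(has podman)" "$(has docker)"
```

If the result is not the runtime you want, pass -docker or -podman to the .run file to force the choice.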
    

Command and Argument List

The following are common commands and arguments.

  • Show the manifest

    show_fw_manifest
    
    • NVSM Example: nvsm(/systems/localhost/firmware/install)-> set Flags=show_fw_manifest

    • Docker Run Example: $ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgxa100:24.6.1 show_fw_manifest

    • .run File Example: $ sudo ./nvfw-dgxa100_24.6.1_240604.run show_fw_manifest

  • Show version information

    show_version
    
    • NVSM Example: nvsm(/systems/localhost/firmware/install)-> set Flags=show_version

    • Docker Run Example: $ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgxa100:24.6.1 show_version

    • .run File Example: $ sudo ./nvfw-dgxa100_24.6.1_240604.run show_version

  • Check the onboard firmware against the manifest and update all down-level firmware.

    update_fw all
    
    • NVSM Example: nvsm(/systems/localhost/firmware/install)-> set Flags=update_fw\ all

      For NVSM, escape each blank space with a backslash when setting the flags.

    • Docker Run Example: $ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgxa100:24.6.1 update_fw all

    • .run File Example: $ sudo ./nvfw-dgxa100_24.6.1_240604.run update_fw all

  • Check the specified onboard firmware against the manifest and update if down-level.

    update_fw [fw]
    

    Where [fw] corresponds to the specific firmware as listed in the manifest. Multiple components can be listed within the same command. The following are examples of updating the BMC and SBIOS.

    • NVSM Example: nvsm(/systems/localhost/firmware/install)-> set Flags=update_fw\ BMC\ SBIOS

      For NVSM, escape each blank space with a backslash when setting the flags.

    • Docker Run Example: $ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgxa100:24.6.1 update_fw BMC SBIOS

    • .run File Example: $ sudo ./nvfw-dgxa100_24.6.1_240604.run update_fw BMC SBIOS

  • Run the DGX A100 firmware update in non-interactive mode.

    set_flags auto=1
    
  • Force the .run file to use Docker or Podman

    • Force Docker .run File Example: $ sudo ./nvfw-dgxa100_24.6.1_240604.run -docker update_fw all

    • Force Podman .run File Example: $ sudo ./nvfw-dgxa100_24.6.1_240604.run -podman update_fw all

List of Arguments

Update flags:
   Updates all, a specified combination, or an individual firmware component
   if the image currently on the device is prior to the available version.
   syntax:
      update_fw  < firmware_components >
      update_fw < component [ -f | --force ] [ component options ] >

Update flag Definitions :
   --force  For single component updates. Bypass the checks and upgrade regardless of the version.
   all      Update firmware on all components. Cannot be used with the '--force' flag.
            syntax: update_fw all

   SBIOS    Update the System BIOS firmware.
            syntax: update_fw SBIOS [ -a | --active]
                                    [ -i | --inactive]

   BMC      Update the firmware on all, or a specified Baseboard Management
            Controller.

            syntax: update_fw BMC [ -i | --inactive]
                                  [ -b | --bmc-access-path <BMC IP:login_id:password> ]
                                  [ -m | --intermediate-fw ]
                                  [ -t | --target-bmc <target BMC> ]
            where:
               --bmc-access-path <val>   Non-default access parameters to the BMC

   SSD      Update firmware on all, or a specified Solid State Drive.
            syntax: update_fw SSD [ -s | --select-ssd <SSD target> ]
            where:
               --select-ssd <target>  Name of the specific drive to update

   PSU      Update the firmware on all, or a specified Power Supply
            syntax: update_fw PSU [ -s | --select-psu <PSU number> ] [ -S | --select-slot <PSU slot> ]
            where:
               --select-psu <target>  Name of the specific PSU to update.
               --select-slot <slot>   Name of the specific PSU slot to update

   VBIOS    Update the Video BIOS firmware on all detected GPUs.
            It is not currently possible to update individual GPU devices.
            syntax: update_fw VBIOS

   FPGA     Update firmware on the FPGA devices on lower and upper GPU trays.
            syntax: update_fw FPGA

   SWITCH   Update firmware on one, specific set, or all switch devices.
            syntax: update_fw SWITCH [ -s | --select-switch <switch-model[:BDF]> ]

   CEC      Update firmware on one or multiple CEC
            syntax: update_fw CEC [ -s | --select-cec [ MB_CEC | Delta_CEC ] ]

   CPLD     Update MB CPLD / MID CPLD firmware
            syntax: update_fw CPLD [ -s | --select-cpld [ MB_CPLD | MID_CPLD ] ]
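
Two of the rules above are easy to check before launching an update: the first argument must be a listed component, and 'all' cannot be combined with the force flag. The following sketch checks only those two rules and does not validate component-specific options or additional components. The function name `check_update_args` is hypothetical.

```shell
# Component names from the argument list above.
COMPONENTS="all SBIOS BMC SSD PSU VBIOS FPGA SWITCH CEC CPLD"

# Return 0 if the arguments intended for update_fw pass the two checks,
# otherwise print a reason to stderr and return 1.
check_update_args() {
    cua_first="$1"
    # Check 1: the first argument must be a known component.
    case " $COMPONENTS " in
        *" $cua_first "*) ;;
        *) echo "unknown component: $cua_first" >&2; return 1 ;;
    esac
    # Check 2: 'all' cannot be used with -f or --force.
    if [ "$cua_first" = "all" ]; then
        for cua_arg in "$@"; do
            case "$cua_arg" in
                -f|--force) echo "'all' cannot be used with --force" >&2; return 1 ;;
            esac
        done
    fi
    return 0
}
```

For example, `check_update_args BMC --force` succeeds, while `check_update_args all --force` fails with a message on stderr.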

Troubleshooting Update Issues

Missing Software Modules

The container may fail if any of the following modules are not installed on the system:

  • nvidia_vgpu_vfio

  • nvidia-uvm

  • nvidia-drm

  • nvidia-modeset

  • nv_peer_mem

  • nvidia_peermem

  • nvidia

  • i2c_nvidia_gpu

  • ipmi_devintf

  • ipmi_ssif

  • acpi_ipmi

  • ipmi_si

  • ipmi_msghandler

The following are examples of error messages:

  • Firmware update not started
    Following service(s)/process(es) are holding onto the resource about
    to be upgraded. These need to be manually stopped for firmware update to occur.
    If xorg is holding the resources, try to stop it by 'sudo systemctl stop <display manager>,'
    where the <display manager> can be acquired by 'cat /etc/X11/default-display-manager':
    process nvidia-persiste(pid 7554)
    ● session-1.scope - Session 1 of user swqa
    Loaded: loaded (/run/systemd/system/session-1.scope; static; vendor preset: disabled)
    Transient: yes
      Drop-In: /run/systemd/system/session-1.scope.d
         └─50-After-systemd-logind\x2eservice.conf, 50-After-systemd-user-sessions\x2eservice.conf, 50-Description.conf, 50-SendSIGHUP.conf, 50-Slice.conf, 50-TasksMax.conf
      Active: active (running) since Wed 2021-11-17 00:36:22 EST; 1min 49s ago
      CGroup: /user.slice/user-1000.slice/session-1.scope
    
  • modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.4.0-80-generic
    

To recover, update the DGX OS (refer to the DGX OS User Guide for instructions), then retry the firmware update.
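
If the update output was saved to a file, the missing module names can be pulled out of it for reference. The sketch below assumes the error text matches the modprobe message shown above; the function name `modules_not_found` and the log file name in the usage comment are hypothetical.

```shell
# Read firmware update output on stdin and print the name of each module
# that modprobe reported as not found.
modules_not_found() {
    sed -n 's/^.*modprobe: FATAL: Module \([^ ]*\) not found.*$/\1/p'
}

# Usage, assuming the output was saved to update.log:
#   modules_not_found < update.log
```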