Using the DGX A100 FW Update Utility
The NVIDIA DGX A100 System Firmware Update utility is provided in a tarball and also as a .run file. Copy the files to the DGX A100 system, then update the firmware using one of the following three methods:
Using NVSM, which provides convenient commands to update the firmware using the firmware update container
Using Docker to run the firmware update container
Using the .run file, which is a self-extracting package that embeds the firmware update container tarball
Note
Fan speeds may increase while updating the BMC firmware. This is a normal part of the BMC firmware update process.
Requirements
Refer to the Highlights and Changes in the specific release for the DGX OS and EL7/EL8 versions supported by the firmware update container.
The firmware update container requires that the following modules are installed on the system.
nvidia_vgpu_vfio
nvidia-uvm
nvidia-drm
nvidia-modeset
nv_peer_mem
nvidia_peermem
nvidia
i2c_nvidia_gpu
ipmi_devintf
ipmi_ssif
acpi_ipmi
ipmi_si
ipmi_msghandler
These modules are installed as part of the standard DGX OS, EL7, or EL8 installation. The container may fail if any of these modules are not installed. Be sure to follow the provided instructions when installing or upgrading DGX OS, EL7, or EL8.
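As a pre-flight check, the module list above can be compared against what the kernel currently reports. The following is a minimal sketch, not part of the firmware update container. The names REQUIRED_MODULES and missing_modules are made up for this example, and treating loaded modules as a proxy for installed ones is an assumption, since the text above only says the modules must be installed.

```shell
#!/bin/sh
# Module names copied from the requirements list above.
REQUIRED_MODULES="nvidia_vgpu_vfio nvidia-uvm nvidia-drm nvidia-modeset nv_peer_mem nvidia_peermem nvidia i2c_nvidia_gpu ipmi_devintf ipmi_ssif acpi_ipmi ipmi_si ipmi_msghandler"

# missing_modules (hypothetical helper): read module names on stdin, one per
# line in the first field (the layout lsmod prints), and print every required
# module that does not appear. Hyphens are normalized to underscores because
# lsmod reports names such as nvidia_uvm with underscores.
missing_modules() {
    present=$(awk '{print $1}' | tr '-' '_')
    for mod in $REQUIRED_MODULES; do
        norm=$(printf '%s' "$mod" | tr '-' '_')
        if ! printf '%s\n' "$present" | grep -qxF "$norm"; then
            printf '%s\n' "$mod"
        fi
    done
}
```

Running lsmod | missing_modules prints one line per module that is not loaded. A module can be installed without being loaded, so a name in the output is a prompt to investigate rather than proof of a problem.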
Caution
Observe the following before running the firmware update container:
Do not log into the BMC dashboard UI while a firmware update is in progress.
Stop all unnecessary system activities before attempting to update firmware.
Stop all GPU activity, including accessing nvidia-smi, as this can prevent the VBIOS from updating.
When issuing update_fw all, stop the following services if they are launched from Docker through the docker run command:
dcgm-exporter
nvidia-dcgm
nvidia-fabricmanager
nvidia-persistenced
xorg-setup
lightdm
nvsm-core
kubelet
The container will attempt to stop these services automatically, but will be unable to stop any that are launched from Docker.
Do not add additional loads on the system (such as user jobs, diagnostics, or monitoring services) while an update is in progress. A high workload can disrupt the firmware update process and result in an unusable component.
When initiating an update, the update software assists in determining the activity state of the DGX system and provides a warning if it detects that activity levels are above a predetermined threshold. If the warning is encountered, you are strongly advised to take action to reduce the workload before proceeding with the update.
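Because the container cannot stop services that were launched through docker run, it can help to look for such containers before starting. The sketch below filters a container listing for the service names in the caution above. It assumes that a container's name or image contains the service name, which may not hold on your system, and FW_SERVICES and containers_to_stop are names made up for this example.

```shell
#!/bin/sh
# Service names copied from the caution above.
FW_SERVICES="dcgm-exporter nvidia-dcgm nvidia-fabricmanager nvidia-persistenced xorg-setup lightdm nvsm-core kubelet"

# containers_to_stop (hypothetical helper): read "name image" lines on stdin
# and print the lines that mention one of the services.
# Assumption: the container name or image contains the service name.
containers_to_stop() {
    pattern=$(printf '%s' "$FW_SERVICES" | tr ' ' '|')
    # grep exits non-zero when nothing matches; treat that as "none found".
    grep -E "$pattern" || true
}
```

A possible invocation is sudo docker ps --format '{{.Names}} {{.Image}}' | containers_to_stop. Any line printed names a container to stop manually before running update_fw all.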
Using NVSM
The NVIDIA DGX A100 system software includes Docker software required to run the container.
Copy the tarball to a location on the DGX system.
From the directory where you copied the tarball, enter the following command to load the container image.
$ sudo docker load -i nvfw-dgxa100_24.6.1_240604.tar.gz
To verify that the container image is loaded, enter the following.
$ sudo docker images
REPOSITORY TAG
nvfw-dgxa100 24.6.1
Using NVSM interactive mode, enter the firmware update module.
$ sudo nvsm
nvsm-> cd systems/localhost/firmware/install
Set the flags corresponding to the action you want to take.
$ nvsm(/system/localhost/firmware/install)-> set Flags=<option>
See the Command and Argument List section below for the list of common flags.
Set the container image to run.
$ nvsm(/system/localhost/firmware/install)-> set DockerImageRef=nvfw-dgxa100:24.6.1
Run the command.
$ nvsm(/system/localhost/firmware/install)-> start
Using docker run
The NVIDIA DGX A100 system software includes Docker software required to run the container.
Copy the tarball to a location on the DGX system.
From the directory where you copied the tarball, enter the following command to load the container image.
$ sudo docker load -i nvfw-dgxa100_24.6.1_240604.tar.gz
To verify that the container image is loaded, enter the following.
$ sudo docker images
REPOSITORY TAG
nvfw-dgxa100 24.6.1
Use the following syntax to run the container image.
$ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgxa100:24.6.1 <command> [arg1] [arg2] ... [argn]
See the Command and Argument List section below for the list of common commands and arguments.
Note
If you do not have the tarball file, but you do have the .run file, you can extract the tarball from the .run file by issuing the following:
$ sudo ./nvfw-dgxa100_24.6.1_240604.run -x
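The docker run syntax in this section is long enough that a small helper can reduce typing mistakes. The sketch below only prints the command line so that it can be reviewed before use. NVFW_IMAGE and nvfw_cmd are names invented for this example, and the image tag is the one used in the steps above.

```shell
#!/bin/sh
# Image reference taken from the steps in this section.
NVFW_IMAGE="nvfw-dgxa100:24.6.1"

# nvfw_cmd (hypothetical helper): print the docker run command line for the
# given container command and arguments. Nothing is executed.
nvfw_cmd() {
    printf 'sudo docker run --rm --privileged -ti -v /:/hostfs %s' "$NVFW_IMAGE"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}
```

For example, nvfw_cmd show_version prints the same command as the Docker Run example for show_version, ready to be copied to the prompt.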
Using the .run File
The update container is also available as a .run file.
The .run file uses Docker or Podman software if either is installed on the system, but can also be run without either installed.
The container uses Podman on Red Hat Enterprise Linux and attempts to use Docker if Podman is not available.
The container uses Docker on other platforms.
You can override the behavior by specifying -docker or -podman.
After obtaining the .run file, make the file executable.
$ chmod +x nvfw-dgxa100_24.6.1_240604.run
Use the following syntax to run the container image.
$ sudo ./nvfw-dgxa100_24.6.1_240604.run <command> [arg1] [arg2] ... [argn]
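The runtime selection described in this section can be summarized as a small decision function. This is a sketch of the documented behavior only, not the logic inside the .run file. The function name pick_runtime and its arguments are hypothetical, and the result for a non-RHEL system that has Podman but not Docker is an assumption, since that case is not described above.

```shell
#!/bin/sh
# pick_runtime OS_ID HAS_PODMAN HAS_DOCKER [OVERRIDE]   (hypothetical helper)
#   OS_ID     distribution ID such as "rhel" or "ubuntu"
#   HAS_*     "yes" or "no"
#   OVERRIDE  optional "-docker" or "-podman"
# Prints "podman", "docker", or "none".
pick_runtime() {
    # An explicit override wins over any detection.
    case "${4:-}" in
        -docker) echo docker; return ;;
        -podman) echo podman; return ;;
    esac
    if [ "$1" = "rhel" ] && [ "$2" = "yes" ]; then
        echo podman          # Podman is preferred on Red Hat Enterprise Linux
    elif [ "$3" = "yes" ]; then
        echo docker          # RHEL fallback, and the choice on other platforms
    else
        echo none            # assumption: run without a container runtime
    fi
}
```

For instance, pick_runtime rhel no yes prints docker, which matches the fallback described above for Red Hat Enterprise Linux.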
Command and Argument List
The following are common commands and arguments.
Show the manifest
show_fw_manifest
NVSM Example:
$ nvsm(/system/localhost/firmware/install)-> set Flags=show_fw_manifest
Docker Run Example:
$ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgxa100:24.6.1 show_fw_manifest
.run File Example:
$ sudo ./nvfw-dgxa100_24.6.1_240604.run show_fw_manifest
Show version information
show_version
NVSM Example:
$ nvsm(/system/localhost/firmware/install)-> set Flags=show_version
Docker Run Example:
$ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgxa100:24.6.1 show_version
.run File Example:
$ sudo ./nvfw-dgxa100_24.6.1_240604.run show_version
Check the onboard firmware against the manifest and update all down-level firmware.
update_fw all
NVSM Example:
$ nvsm(/system/localhost/firmware/install)-> set Flags=update_fw\ all
For NVSM, an escape is needed before blank spaces when setting the flags.
Docker Run Example:
$ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgxa100:24.6.1 update_fw all
.run File Example:
$ sudo ./nvfw-dgxa100_24.6.1_240604.run update_fw all
Check the specified onboard firmware against the manifest and update if down-level.
update_fw [fw]
Where [fw] corresponds to the specific firmware as listed in the manifest. Multiple components can be listed within the same command. The following are examples of updating the BMC and SBIOS.
NVSM Example:
$ nvsm(/system/localhost/firmware/install)-> set Flags=update_fw\ BMC\ SBIOS
For NVSM, an escape is needed before blank spaces when setting the flags.
Docker Run Example:
$ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgxa100:24.6.1 update_fw BMC SBIOS
.run File Example:
$ sudo ./nvfw-dgxa100_24.6.1_240604.run update_fw BMC SBIOS
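The NVSM examples differ from the Docker and .run examples only in the backslash before each space. The following sketch converts a command and its arguments into the escaped form used after set Flags=. The name nvsm_flags is made up for this example, and the helper is not part of NVSM.

```shell
#!/bin/sh
# nvsm_flags (hypothetical helper): join the arguments with "\ " so that the
# result can be typed after "set Flags=" at the NVSM prompt.
nvsm_flags() {
    out=""
    for word in "$@"; do
        if [ -z "$out" ]; then
            out=$word
        else
            out="$out\\ $word"   # backslash followed by a space
        fi
    done
    printf '%s\n' "$out"
}
```

Calling nvsm_flags update_fw BMC SBIOS produces the same string that appears in the NVSM example above.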
Run the DGX A100 firmware update in non-interactive mode.
set_flags auto=1
Force the .run file to use Docker or Podman
Force Docker .run File Example:
$ sudo ./nvfw-dgxa100_24.6.1_240604.run -docker update_fw all
Force Podman .run File Example:
$ sudo ./nvfw-dgxa100_24.6.1_240604.run -podman update_fw all
List of Arguments
Update flags:
Updates all, a specified combination, or an individual firmware component
if the image currently on the device is prior to the available version.
syntax:
update_fw < firmware_components >
update_fw < component [ -f | --force ] [ component options ] >
Update flag Definitions :
--force For single component updates. Bypass the checks and upgrade regardless of the version.
all Update firmware on all components. Cannot be used with the '--force' flag.
syntax: update_fw all
SBIOS Update the System BIOS firmware.
syntax: update_fw SBIOS [ -a | --active]
[ -i | --inactive]
BMC Update the firmware on all, or a specified Baseboard Management
Controller.
syntax: update_fw BMC [ -i | --inactive]
[ -b | --bmc-access-path <BMC IP:login_id:password> ]
[ -m | --intermediate-fw ]
[ -t | --target-bmc <target BMC> ]
where:
--bmc-access-path <val> Non-default access parameters to the BMC
SSD Update firmware on all, or a specified Solid State Drive.
syntax: update_fw SSD [ -s | --select-ssd <SSD target> ]
where:
--select-ssd <target> Name of the specific drive to update
PSU Update the firmware on all, or a specified Power Supply
syntax: update_fw PSU [ -s | --select-psu <PSU number> ] [ -S | --select-slot <PSU slot> ]
where:
--select-psu <target> Name of the specific PSU to update.
--select-slot <slot> Name of the specific PSU slot to update
VBIOS Update the Video BIOS firmware on all detected GPUs.
It is not currently possible to update individual GPU devices.
syntax: update_fw VBIOS
FPGA Update firmware on the FPGA devices on lower and upper GPU trays.
syntax: update_fw FPGA
SWITCH Update firmware on one, specific set, or all switch devices.
syntax: update_fw SWITCH [ -s | --select-switch <switch-model[:BDF]> ]
CEC Update firmware on one or multiple CEC
syntax: update_fw CEC [ -s | --select-cec [ MB_CEC | Delta_CEC ] ]
CPLD Update MB CPLD / MID CPLD firmware
syntax: update_fw CPLD [ -s | --select-cpld [ MB_CPLD | MID_CPLD ] ]
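The list above states that all cannot be combined with the --force flag. When update_fw invocations are generated by a script, that rule can be checked before the container is started. The sketch below is one way to do so. The function name check_update_args and its error message are invented for this example and are not output of the update container.

```shell
#!/bin/sh
# check_update_args (hypothetical helper): return 0 if the update_fw
# arguments respect the rule above, 1 if "all" is combined with -f/--force.
check_update_args() {
    has_all=no
    has_force=no
    for arg in "$@"; do
        case "$arg" in
            all) has_all=yes ;;
            -f|--force) has_force=yes ;;
        esac
    done
    if [ "$has_all" = yes ] && [ "$has_force" = yes ]; then
        # Message text is this sketch's own wording.
        echo "update_fw: 'all' cannot be used with --force" >&2
        return 1
    fi
    return 0
}
```

For example, check_update_args BMC --force succeeds, while check_update_args all --force returns a non-zero status.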
Troubleshooting Update Issues
Missing Software Modules
The container may fail if any of these modules are not installed on the system.
nvidia_vgpu_vfio
nvidia-uvm
nvidia-drm
nvidia-modeset
nv_peer_mem
nvidia_peermem
nvidia
i2c_nvidia_gpu
ipmi_devintf
ipmi_ssif
acpi_ipmi
ipmi_si
ipmi_msghandler
The following are examples of error messages:
Firmware update not started
Following service(s)/process(es) are holding onto the resource about to be upgraded.
These need to be manually stopped for firmware update to occur.
If xorg is holding the resources, try to stop it by 'sudo systemctl stop <display manager>,' where the <display manager> can be acquired by 'cat /etc/X11/default-display-manager':
process nvidia-persiste(pid 7554)
● session-1.scope - Session 1 of user swqa
   Loaded: loaded (/run/systemd/system/session-1.scope; static; vendor preset: disabled)
   Transient: yes
   Drop-In: /run/systemd/system/session-1.scope.d
            └─50-After-systemd-logind\x2eservice.conf, 50-After-systemd-user-sessions\x2eservice.conf, 50-Description.conf, 50-SendSIGHUP.conf, 50-Slice.conf, 50-TasksMax.conf
   Active: active (running) since Wed 2021-11-17 00:36:22 EST; 1min 49s ago
   CGroup: /user.slice/user-1000.slice/session-1.scope
modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.4.0-80-generic
To recover, perform an update of the DGX OS (refer to the DGX OS User Guide for instructions), then retry the firmware update.
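When the update output is saved to a log, the modprobe error shown above can be picked out automatically to see which module and kernel version are involved. The sketch below assumes the message has exactly the form of the example above. The function name parse_modprobe_fatal is made up for this illustration.

```shell
#!/bin/sh
# parse_modprobe_fatal (hypothetical helper): read log text on stdin and
# print "<module> <kernel version>" for each modprobe "not found" line.
# Assumption: the message matches the example format shown above.
parse_modprobe_fatal() {
    sed -n 's|^modprobe: FATAL: Module \([^ ]*\) not found in directory /lib/modules/\([^ ]*\).*$|\1 \2|p'
}
```

For the example message above, the function prints nvidia 5.4.0-80-generic. Comparing the second field with the output of uname -r shows whether the error refers to the running kernel.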