Version 24.3.1
The DGX Firmware Update container version 24.3.1 is available.
Package name:
nvfw-dgx2_24.3.1_240304.tar.gz
Run file name:
nvfw-dgx2_24.3.1_240304.run
Image name:
nvfw-dgx2:24.3.1
Highlights and Changes in this Release
Operating System Support
DGX OS 5.5
DGX OS 6.2
DGX EL8-24.01
DGX EL9-23.12
Fixed BMC Issues
Increased the configurable maximum limit of KVM idle timeout to 10,800 seconds. This avoids timeouts while mounting large ISO images over a slow BMC network.
The following table lists a potential security vulnerability that has been reported by AMI. It is addressed in DGX-2 BMC version 01.09.00.
Affected BMC versions: All BMC versions prior to 01.09.00
Updated BMC version: 01.09.00
Firmware container version: 24.3.1
CVE IDs Addressed | Vendor (per NVD) |
---|---|
CVE-2023-37293 | AMI |
Contents of the DGX-2 System Firmware Container
This container includes the firmware binaries and update utilities for the firmware listed in the following table.
Component | Update 8 Version | Key Changes |
---|---|---|
BMC | 01.09.00 | Refer to DGX-2 BMC Changes. |
SBIOS | 0.33 | No change |
M.2 NVMe (Samsung PM963) | CXV8601Q | No change |
U.2 SSD (Micron) 9200 | SSD 9200: 101008S0 | No change |
U.2 SSD (Micron) 9300 | SSD 9300: 11300DU0 | No change |
VBIOS (DGX-2) | 88.00.6B.00.01 | No change |
VBIOS (DGX-2H) | 88.00.6B.00.08 | No change |
PSU | 3.1 | No change |
FPGA | 3.1 | No change. Note: The FPGA has two images. The Firmware Update Container updates the primary FPGA image only. |
Note
Refer to Special Instructions for PSU, SBIOS, U.2 SSD, and BMC Firmware Updates to determine which actions apply to your system.
Change to the Update Process
Previously, only certain firmware components, such as the SBIOS, required a system power cycle after an update. To ensure that all DGX-2 services continue running, you must now power cycle the DGX-2 after updating any firmware component or group of components.
The addition of Intel ME update capability results in the need to run update_fw all twice when updating all firmware components. Refer to Instructions for Updating Firmware for detailed instructions.
Updating Components with Secondary Images
Some firmware components provide a secondary image as backup. The following is the policy when updating those components:
SBIOS: Only the primary image is updated. To update both images, follow the instructions at Special Instructions for PSU, SBIOS, U.2 SSD, and BMC Firmware Updates.
BMC: Only the primary image is updated. To update the secondary (backup) image, include the --update-backup-bmc option in the update command.
FPGA: Only the primary image is updated.
Enabling SNMP RO/RW Strings
The SNMP RO/RW strings are disabled by default. The following table provides the ipmitool arguments for enabling the strings. After enabling, disabling, or setting the RO/RW strings, either issue the restart SNMP Server command or reset the BMC for the changes to go into effect.
LUN | CMD | Requested Data Offset | Requested Data Description |
---|---|---|---|
3Ch/00h | 26h | 1 | 00h: Enable RO string; 01h: Enable RW string; 02h: Disable RO string; 03h: Disable RW string; 04h: Set RO string; 05h: Set RW string; 06h: Start SNMP Server; 07h: Stop SNMP Server; 08h: Restart SNMP Server |
3Ch/00h | 26h | 2:21 | Community string in ASCII code. Maximum string length is 20 characters. If byte 1 is set to 04h or 05h and bytes 2 through 21 are empty, the corresponding community string is cleared. |
For example, to enable the RO string, set the community string to “test”, and then restart the SNMP service on the BMC, run the following commands:
1. Enable the RO string.
   sudo ipmitool raw 0x3c 0x26 0x00
2. Set the RO string to “test”.
   sudo ipmitool raw 0x3c 0x26 0x04 0x74 0x65 0x73 0x74
3. Restart the SNMP service on the BMC.
   sudo ipmitool raw 0x3c 0x26 0x08
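The trailing bytes in the Set RO string command are the ASCII codes of the community string. As a minimal sketch, the hypothetical helper below (community_to_hex is an illustrative name, not part of ipmitool or the firmware container) converts a string into the 0x-prefixed bytes that ipmitool raw expects and enforces the 20-character limit from the table. It only prints the resulting command and sends nothing to the BMC.

```shell
#!/bin/sh
# Hypothetical helper: convert an ASCII community string to 0x-prefixed bytes.
community_to_hex() {
    # The table above limits the community string to 20 characters.
    if [ "${#1}" -gt 20 ]; then
        echo "community string exceeds 20 characters" >&2
        return 1
    fi
    out=""
    # od prints each input byte as two hex digits separated by whitespace.
    for byte in $(printf '%s' "$1" | od -An -v -tx1); do
        out="$out${out:+ }0x$byte"
    done
    printf '%s\n' "$out"
}

# Print (do not run) the command that sets the RO string to "test".
echo "sudo ipmitool raw 0x3c 0x26 0x04 $(community_to_hex test)"
```

For the string "test" this prints the same Set RO string command shown above. Review the printed command before running it against a BMC.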
Special Instructions for PSU, SBIOS, U.2 SSD, and BMC Firmware Updates
Before updating the PSU, SBIOS, U.2 SSD, or the BMC, refer to the following special instructions for guidance to ensure the updates are successful.
PSU Updates
If the BMC version is older than 01.00.01, update the BMC before updating the PSU. See Updating the BMC from Versions older than 01.00.01.
SBIOS Updates
If the current BMC is version 1.05.7, update the BMC first before updating the SBIOS.
To update both primary and secondary SBIOS (after updating the BMC) using the container, do the following (assumes the primary SBIOS is the current, active SBIOS).
1. Refer to Special Instructions for all Updates to see whether services need to be stopped and how to stop them.
2. Update the active SBIOS using the update_fw SBIOS argument from the firmware update container.
3. Designate booting from the secondary (inactive) SBIOS on the next boot.
   sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:24.3.1 sbios_slot --switch-nextboot-slot
4. Reboot the DGX-2 to switch to the secondary SBIOS.
   sudo telinit 1
   sudo umount /raid
   sync
   sudo ipmitool chassis power cycle
5. Update the secondary (now active) SBIOS.
6. Designate booting from the primary SBIOS on the next boot (to restore the primary SBIOS as the active SBIOS).
   sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:24.3.1 sbios_slot --switch-nextboot-slot
7. Reboot the DGX-2 to switch back to the primary SBIOS.
   sudo telinit 1
   sudo umount /raid
   sync
   sudo ipmitool chassis power cycle
U.2 SSD Updates
Before updating U.2 SSD firmware, unmount the /raid volume and stop the RAID array as follows.
1. Stop the NFS cache on the array:
   sudo systemctl stop cachefilesd
2. Check for any processes using the /raid volume and stop them. To identify processes using the volume, run:
   sudo lsof /raid
3. Unmount the /raid volume and stop the RAID array:
   sudo umount /raid
   sudo mdadm --stop /dev/md1
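The steps above can be collected into one guarded sequence. The sketch below is an illustration only: stop_raid_array and the DRY_RUN variable are hypothetical names, and it assumes the array is /dev/md1 mounted at /raid, as in the commands above. With DRY_RUN set, it prints the commands instead of running them; otherwise it refuses to unmount while lsof still reports open files on the volume.

```shell
#!/bin/sh
# Hypothetical wrapper around the U.2 SSD preparation steps.
stop_raid_array() {
    run=""
    # With DRY_RUN set, print each command instead of executing it.
    if [ -n "${DRY_RUN:-}" ]; then
        run="echo"
    fi
    $run sudo systemctl stop cachefilesd || return 1
    # lsof exits 0 when at least one process has a file open on the volume.
    if [ -z "$run" ] && sudo lsof /raid >/dev/null 2>&1; then
        echo "processes are still using /raid; stop them and retry" >&2
        return 1
    fi
    $run sudo umount /raid || return 1
    $run sudo mdadm --stop /dev/md1
}
```

Running DRY_RUN=1 stop_raid_array first lets you review the command sequence before anything touches the array.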
BMC Updates
If the current BMC is older than 01.00.01, follow the instructions at Updating the BMC from Versions older than 01.00.01.
If the current BMC is 01.00.01, follow the instructions at Updating the BMC from Version 01.00.01.
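Which BMC instructions apply depends on how the installed version compares with 01.00.01. As a rough sketch, the hypothetical functions below (version_lt and bmc_update_path are illustrative names, not container commands) compare dotted version strings field by field and report which case applies. The installed version is passed in by hand, for example as read from the show_version output, and the sketch assumes exactly three numeric fields per version.

```shell
#!/bin/sh
# Return 0 (true) when dotted version $1 is strictly older than $2.
# Assumes three numeric fields, such as 01.00.01.
version_lt() {
    # Split each version on dots into its three fields.
    IFS=. read -r a1 a2 a3 <<EOF
$1
EOF
    IFS=. read -r b1 b2 b3 <<EOF
$2
EOF
    # Compare field by field; [ -lt ] reads leading-zero fields as decimal.
    for pair in "$a1:$b1" "$a2:$b2" "$a3:$b3"; do
        a=${pair%%:*}
        b=${pair##*:}
        [ "${a:-0}" -lt "${b:-0}" ] && return 0
        [ "${a:-0}" -gt "${b:-0}" ] && return 1
    done
    return 1  # the versions are equal
}

# Report which of the BMC cases above applies to the given version.
bmc_update_path() {
    if version_lt "$1" 01.00.01; then
        echo "older than 01.00.01: follow the instructions for older versions"
    elif version_lt 01.00.01 "$1"; then
        echo "newer than 01.00.01: neither special BMC case applies"
    else
        echo "01.00.01: follow the instructions for version 01.00.01"
    fi
}
```

For example, bmc_update_path 01.00.00 reports the older-version case, while bmc_update_path 01.09.00 reports that neither special case applies.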
Instructions for Updating Firmware
This section provides a simple way to update the firmware on the system using the firmware update container. It includes instructions for performing a transitional update for systems that require it. The commands use the .run file, but you can also use any method described in Using the DGX-2 FW Update Utility.
Caution
Stop all unnecessary system activities before attempting to update firmware.
Stop all GPU activity, including accessing nvidia-smi, as this can prevent the VBIOS from updating.
Do not add additional loads on the system (such as user jobs, diagnostics, or monitoring services) while an update is in progress. A high workload can disrupt the firmware update process and result in an unusable component.
When initiating an update, the update software assists in determining the activity state of the DGX system and provides a warning if it detects that activity levels are above a predetermined threshold. If the warning is encountered, you are strongly advised to take action to reduce the workload before proceeding with the update.
1. Check whether updates are needed by checking the installed versions.
   sudo ./nvfw-dgx2_24.3.1_240304.run show_version
   If there is “no” in any up-to-date column for updatable firmware, then continue with the next step.
   If all up-to-date column entries are “yes”, then no updates are needed and no further action is necessary.
2. If step 1 shows that U.2 SSD firmware updates are required, follow the special instructions described in U.2 SSD Updates and then proceed.
3. Begin the process of updating all the firmware supported by the container.
   sudo ./nvfw-dgx2_24.3.1_240304.run update_fw all
   You will be prompted to power cycle the server.
4. Power cycle the server.
   sudo ipmitool chassis power cycle
5. After power cycling the system, perform another update_fw all to update the Intel ME.
   Note
   Since the Intel ME is part of the SBIOS, the container messaging may indicate that the SBIOS is getting updated. This is expected.
6. Perform another power cycle.
   sudo ipmitool chassis power cycle
See DGX-2 Firmware Update Process for more information about the update process.
To verify the update, run the following command.
sudo ./nvfw-dgx2_24.3.1_240304.run show_version
Known Issues
Running Another Firmware Update Container Using Docker or Podman Causes the First Container to Abort
Issue
When you run the Firmware Update Container through Podman, any attempts to start a second Firmware Update Container through Docker or Podman will cause the first instance of the Firmware Update Container to halt. Running multiple instances of the Firmware Update Container concurrently on the same system is not supported and can lead to system issues.