DGX H100/H200 System Firmware Update Guide Version 24.08.1
Note
Starting with this release, the DGX H100/H200 documentation uses a 5-digit version number: the first two digits are the year, the next two digits are the month, and the last digit is the build number. For example, version 24.08.1 is the first build released in August 2024.
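The version layout described in this note can be sketched in a few lines of Python. This is only an illustration: the `DocVersion` type and the `parse_doc_version` function are hypothetical names and are not part of any NVIDIA tool. The sketch also assumes that the two-digit year refers to the 2000s.

```python
from typing import NamedTuple


class DocVersion(NamedTuple):
    year: int   # four-digit year, assuming the 2000s (assumption)
    month: int  # 1 through 12
    build: int  # single-digit build number within that month


def parse_doc_version(version: str) -> DocVersion:
    """Split a YY.MM.B version string into year, month, and build."""
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected YY.MM.B, got {version!r}")
    yy, mm, build = parts
    # Enforce the 2 + 2 + 1 digit layout described in the note.
    if (len(yy), len(mm), len(build)) != (2, 2, 1):
        raise ValueError(f"expected YY.MM.B, got {version!r}")
    if not (yy + mm + build).isdecimal():
        raise ValueError(f"non-numeric field in {version!r}")
    month = int(mm)
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {version!r}")
    return DocVersion(2000 + int(yy), month, int(build))


print(parse_doc_version("24.08.1"))
# DocVersion(year=2024, month=8, build=1)
```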
Highlights
Added Support
Introducing support for the NVIDIA DGX H200 System.
Enabled the 3 + 3 power limiting feature, which keeps the system powered if a power distribution unit fails, but at a reduced performance level.
Added Redfish API support for creating, modifying, and deleting power policies.
Added support for deploying firmware updates using the Web UI.
Redfish Disable Host Interface: keeps Redfish functional from BIOS to BMC but prevents the direct path from OS to BMC.
Added ability to specify intermediate certificate authorities in a provisioned certificate chain.
Incorporates updated firmware for GPU tray, network, and NVMe drives.
BMC Fixes
Included additional Redfish metrics reports.
Fixed SNMP, syslog, and rsyslog issues.
Added a per-BMC AES key for encrypting user and password files during the configuration save and restore process.
Fixed invalid domain issues in the LDAP/AD settings.
Enhanced Redfish diagnostics.
General performance improvements in Redfish APIs and IPMI.
Added support for ConnectX-7 temperature sensors.
Improved resolution for energy counters.
Enhanced Remote Media with support for port numbers and domain names.
General improvements to the Web UI.
SBIOS Fixes
DIMMs that experience uncorrectable errors at runtime are mapped out on the next boot.
Exposed the C1AutoDemotion, C1AutoUnDemotion, and C6Enable setup options.
Moved the CPU setup options page under the Advanced page in the setup UI.
Added a setup option to restrict host access via IPMI.
Provided the NvramVarsProtectionInOs setup option to prevent the OS from changing the NVRAM at runtime.
Implemented uncorrectable error rate limiting, disabled CSMI (correctable system management interrupts) on error flooding and on the core that reported MLC (middle-level cache) yellow state, and SEL logging when ANF (advisory non-fatal error) threshold was crossed.
Changed the SncEn default setting to disable.
The nvfwupd Command Updates
Improved log sanitization to mask the IP address and login credentials by default.
Added support for the --target and --package override from the command-line interface (CLI) using a configuration file.
Enhanced the --target option with the servertype sub-option to resolve unidentified platform errors.
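To illustrate the kind of log sanitization mentioned in the first item above, the following minimal sketch masks the values of `key=value` pairs in a log line. It is not how nvfwupd implements the feature: the `sanitize` function, the `****` mask, and the `ip`, `user`, and `password` key names are assumptions chosen for illustration.

```python
import re

# Keys whose values are treated as sensitive (assumed names, for illustration).
SENSITIVE = re.compile(r"\b(ip|user|password)=\S+")


def sanitize(line: str) -> str:
    """Replace the value of each sensitive key=value pair with a fixed mask."""
    return SENSITIVE.sub(r"\1=****", line)


print(sanitize("connect ip=192.0.2.10 user=admin password=secret ok"))
# connect ip=**** user=**** password=**** ok
```

Matching on the key rather than on a bare dotted-quad pattern avoids masking firmware version strings such as 0.2.1.8, which have the same shape as an IPv4 address.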
Firmware Package Details
This firmware release supports the following systems:
NVIDIA DGX H100
NVIDIA DGX H200
This firmware release supports the following operating systems:
NVIDIA DGX OS 6.2.1, 6.1, 6.0.11, and higher
NVIDIA DGX Software EL9-24.06, EL9-23.12, and EL9-23.08
NVIDIA DGX Software EL8-24.07, EL8-24.01, and EL8-23.08
For more information about the operating systems, refer to the NVIDIA Base OS documentation.
You can download firmware packages from the NVIDIA Enterprise Support Portal.
The following table shows the firmware package files:
Components | Sample File Name |
---|---|
Combined archive | The combined archive includes the firmware for the system components and the firmware for the GPU tray. |
If you are updating from version 1.1.3, the total update time is approximately:
92 minutes for the CPU tray using sequential updating.
34 minutes for the CPU tray using parallel updating.
12 minutes for the GPU tray using parallel updating.
The following table shows the component firmware versions and the update time breakdown.
Component | Version | Update Time from 1.1.3 (Minutes) |
---|---|---|
Host BMC | 24.08.20. Refer to BMC Changes for DGX H100/H200 Systems for the list of changes. | 25 |
Host BMC ERoT | 04.0052 | 2 |
SBIOS ERoT | 04.0052 | 2 |
SBIOS | 1.05.03. Refer to SBIOS Changes for DGX H100/H200 Systems for the list of changes. | 7 |
Motherboard CPLD | 0.2.1.8 | 19 |
Midplane CPLD | 0.2.1.1 | 13 |
PSU (Delta ECD16020137) | Primary 0204; Secondary 0201; Community 0203 | PSU_0: 2.75; PSU_1: 2.75; PSU_2: 2.75; PSU_3: 2.75; PSU_4: 2.75; PSU_5: 2.75 |
Broadcom Gen5 PCIe Switch (PEX89072-B01) | Switch 0: 0.0.7; Switch 1: 1.0.7 | Switch 0: 1; Switch 1: 1 |
Astera Labs Gen5 PCIe Retimer (PT5161L) | 2.07.19 | Retimer 0: 3; Retimer 1: 2.5 |
Network (Cluster) Card - ConnectX-7 | 28.39.3560 | |
Network (Storage) Card - ConnectX-7 | 28.39.3560 | |
Network Card - BlueField-3 | 32.40.1000 | |
GPU Tray (total) | | 12 |
NVSwitch (GPU Tray) | 96.10.57.00.01 | |
ERoT (GPU Tray) | 02.0182 | |
HMC (GPU Tray) | HGX-22.10-1-rc67 | |
FPGA (GPU Tray) | 2.53 | |
PCIe Switch (GPU Tray) | 1.9.5F | |
Astera Labs Gen5 PCIe Retimer (GPU Tray) (PT5161L) | 2.7.20 | |
Intel 10G Ethernet | v3.60 | |
Intel Ethernet Network Adapter (E810-C-Q2) | v4.50 | |
M.2 NVMe (Samsung PM9A3) | GDC7502Q | |
M.2 NVMe (Micron 7450) | E2MU200 | |
U.2 Kioxia Gen5 CM7 | 1UET7104 | |
U.2 Samsung (EVT2 PM1733) | MPK95B5Q | |
U.2 Samsung (Gen5 PM1743) | OPPA4B5Q | |
FRU | 0.6 | |
TPM | v15.21 | |
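As a cross-check, the per-component times listed in the table above for the CPU tray add up to the sequential total of approximately 92 minutes quoted earlier. The short sketch below performs that sum. The dictionary keys are shortened labels chosen for illustration, and the parallel figure is not derived here because the table does not say which components update concurrently.

```python
# Per-component update times in minutes for the CPU tray, copied from the table.
cpu_tray_minutes = {
    "Host BMC": 25,
    "Host BMC ERoT": 2,
    "SBIOS ERoT": 2,
    "SBIOS": 7,
    "Motherboard CPLD": 19,
    "Midplane CPLD": 13,
    "PSU_0": 2.75,
    "PSU_1": 2.75,
    "PSU_2": 2.75,
    "PSU_3": 2.75,
    "PSU_4": 2.75,
    "PSU_5": 2.75,
    "PCIe Switch 0": 1,
    "PCIe Switch 1": 1,
    "PCIe Retimer 0": 3,
    "PCIe Retimer 1": 2.5,
}

# Sequential updating runs the components one after another, so the times add.
sequential_total = sum(cpu_tray_minutes.values())
print(sequential_total)  # 92.0
```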
Firmware Update Procedure
Refer to Firmware Update Steps.