DGX H100/H200 System Firmware Update Guide Version 24.08.1

Note

Starting with this release, the DGX H100/H200 documentation uses a five-digit version number: two digits for the year, two digits for the month, and one digit for the build number. For example, version 24.08.1 is the first build released in August 2024.
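
As a minimal sketch of the version scheme described in this note, the following Python splits a version string into its year, month, and build fields. The names `ReleaseVersion` and `parse_release_version` are hypothetical, and mapping the two-digit year onto the 2000s is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseVersion:
    year: int   # four-digit year, assuming the 2000s
    month: int  # 1 through 12
    build: int  # single-digit build number within the month


def parse_release_version(text: str) -> ReleaseVersion:
    """Parse a YY.MM.B version string such as "24.08.1"."""
    parts = text.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected YY.MM.B, got {text!r}")
    yy, mm, build = parts
    # Two digits of year, two digits of month, one digit of build.
    if not (yy.isdigit() and mm.isdigit() and build.isdigit()):
        raise ValueError(f"non-numeric field in {text!r}")
    if (len(yy), len(mm), len(build)) != (2, 2, 1):
        raise ValueError(f"expected YY.MM.B, got {text!r}")
    month = int(mm)
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {text!r}")
    return ReleaseVersion(year=2000 + int(yy), month=month, build=int(build))


version = parse_release_version("24.08.1")
```

Older version strings that predate this scheme, such as 1.1.3, do not match the pattern and are rejected by this sketch.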

Highlights

Added Support

  • Introducing support for the NVIDIA DGX H200 System.

  • Enabled the 3 + 3 power limiting feature, which keeps the system powered, at a reduced performance level, if a power distribution unit fails.

  • Added Redfish API support for creating, modifying, and deleting power policies.

  • Added support for deploying firmware updates using the Web UI.

  • Redfish Disable Host Interface: keeps Redfish functional from the BIOS to the BMC but blocks the direct path from the OS to the BMC.

  • Added ability to specify intermediate certificate authorities in a provisioned certificate chain.

  • Incorporates updated firmware for GPU tray, network, and NVMe drives.

BMC Fixes

  • Included additional Redfish metrics reports.

  • Fixed SNMP, syslog, and rsyslog issues.

  • Added a per-BMC AES key for encrypting user and password files during the configuration save and restore process.

  • Fixed invalid domain issues in the LDAP/AD settings.

  • Enhanced Redfish diagnostics.

  • General performance improvements in Redfish APIs and IPMI.

  • Added support for ConnectX-7 temperature sensors.

  • Improved resolution for energy counters.

  • Enhanced Remote Media with support for port numbers and domain names.

  • General improvements to the Web UI.

SBIOS Fixes

  • DIMMs that experience uncorrectable errors at runtime are mapped out on the next boot.

  • Exposed the C1AutoDemotion, C1AutoUnDemotion, and C6Enable setup options.

  • Moved the CPU setup options page under the Advanced page in the setup UI.

  • Added a setup option to restrict host access via IPMI.

  • Provided the NvramVarsProtectionInOs setup option to prevent the OS from changing the NVRAM at runtime.

  • Implemented uncorrectable error rate limiting, disabling of CSMI (correctable system management interrupts) on error flooding and on the core that reported an MLC (middle-level cache) yellow state, and SEL logging when the ANF (advisory non-fatal error) threshold is crossed.

  • Changed the SncEn default setting to disable.

The nvfwupd Command Updates

  • Improved log sanitization to mask the IP address and login credentials by default.

  • Added support for overriding the --target and --package values from the command-line interface (CLI) when using a configuration file.

  • Enhanced the --target option with the servertype sub-option to resolve unidentified platform errors.
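
To illustrate how the servertype sub-option might be passed with --target, here is a hedged Python sketch that only assembles an argument list and does not run anything. The `ip=`, `user=`, and `password=` sub-option names, the `DGX` server type value, and the `show_version` command are assumptions based on general nvfwupd usage and are not stated in this section. The helper name `build_nvfwupd_command` is hypothetical.

```python
import shlex


def build_nvfwupd_command(bmc_ip, user, password,
                          servertype="DGX", command="show_version"):
    """Assemble an nvfwupd argument list with an explicit servertype.

    Assumption: --target accepts space-separated key=value sub-options.
    """
    target = [
        f"ip={bmc_ip}",
        f"user={user}",
        f"password={password}",
        f"servertype={servertype}",  # sub-option named in the release notes
    ]
    return ["nvfwupd", "--target", *target, command]


# 192.0.2.10 is a documentation-only address; the credentials are placeholders.
argv = build_nvfwupd_command("192.0.2.10", "admin", "example-password")
print(shlex.join(argv))
```

Building the command as a list rather than a single string keeps the sketch safe to hand to `subprocess.run` without shell quoting issues, should you choose to execute it.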

Firmware Package Details

This firmware release supports the following systems:

  • NVIDIA DGX H100

  • NVIDIA DGX H200

This firmware release supports the following operating systems:

  • NVIDIA DGX OS 6.2.1, 6.1, 6.0.11, and higher

  • NVIDIA DGX Software EL9-24.06, EL9-23.12, and EL9-23.08

  • NVIDIA DGX Software EL8-24.07, EL8-24.01, and EL8-23.08

For more information about the operating systems, refer to the NVIDIA Base OS documentation.

You can download firmware packages from the NVIDIA Enterprise Support Portal.

The following table shows the firmware package files:

Components                  Sample File Name
--------------------------  ----------------------------------------
Combined archive            DGXH100_H200_24.08.1.tar
Motherboard tray package    nvfw_DGX_240820.1.0.fwpkg
GPU tray package            nvfw_HGX_DGXH100-H200x8_240603.1.0.fwpkg

The combined archive includes the firmware for the system components and the firmware for the GPU tray.
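
As a rough sketch of how you might confirm which firmware packages a downloaded combined archive contains, the following Python lists every `.fwpkg` member of a tar file. The helper name `list_fwpkg_members` is hypothetical, and the internal directory layout of the archive is not documented here, so the function simply searches all members.

```python
import tarfile


def list_fwpkg_members(archive_path: str) -> list[str]:
    """Return the sorted names of all .fwpkg files inside a tar archive."""
    # "r:*" lets tarfile detect compression automatically, if any is used.
    with tarfile.open(archive_path, "r:*") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.endswith(".fwpkg")
        )
```

For example, calling `list_fwpkg_members("DGXH100_H200_24.08.1.tar")` on the downloaded file would be expected to report the motherboard tray and GPU tray packages named above.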

If you are updating from version 1.1.3, the total update time is approximately:

  • 92 minutes for the CPU tray using sequential updating.

  • 34 minutes for the CPU tray using parallel updating.

  • 12 minutes for the GPU tray using parallel updating.

The following table lists the component firmware versions and the update time for each component.

Component                                           Version                                       Update Time from 1.1.3 (Minutes)
--------------------------------------------------  --------------------------------------------  --------------------------------
Host BMC                                            24.08.20                                      25
Host BMC ERoT                                       04.0052                                       2
SBIOS ERoT                                          04.0052                                       2
SBIOS                                               1.05.03                                       7
Motherboard CPLD                                    0.2.1.8                                       19
Midplane CPLD                                       0.2.1.1                                       13
PSU (Delta ECD16020137)                             Primary 0204, Secondary 0201, Community 0203  PSU_0 to PSU_5: 2.75 each
Broadcom Gen5 PCIe Switch (PEX89072-B01)            Switch 0: 0.0.7, Switch 1: 1.0.7              Switch 0: 1, Switch 1: 1
Astera Labs Gen5 PCIe Retimer (PT5161L)             2.07.19                                       Retimer 0: 3, Retimer 1: 2.5
Network (Cluster) Card - ConnectX-7                 28.39.3560
Network (Storage) Card - ConnectX-7                 28.39.3560
Network Card - BlueField-3                          32.40.1000
VBIOS (H100 80GB)                                   96.00.A5.00.01                                GPU Tray (total): 12
VBIOS (H200 141GB)                                  96.00.A5.00.03
NVSwitch (GPU Tray)                                 96.10.57.00.01
ERoT (GPU Tray)                                     02.0182
HMC (GPU Tray)                                      HGX-22.10-1-rc67
FPGA (GPU Tray)                                     2.53
PCIe Switch (GPU Tray)                              1.9.5F
Astera Labs Gen5 PCIe Retimer (GPU Tray) (PT5161L)  2.7.20
Intel 10G Ethernet                                  v3.60
Intel Ethernet Network Adapter (E810-C-Q2)          v4.50
M.2 NVMe (Samsung PM9A3)                            GDC7502Q
M.2 NVMe (Micron 7450)                              E2MU200
U.2 Kioxia Gen5 CM7                                 1UET7104
U.2 Samsung (EVT2 PM1733)                           MPK95B5Q
U.2 Samsung (Gen5 PM1743)                           OPPA4B5Q
FRU                                                 0.6
TPM                                                 v15.21

Host BMC: refer to BMC Changes for DGX H100/H200 Systems for the list of changes.
SBIOS: refer to SBIOS Changes for DGX H100/H200 Systems for the list of changes.
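
The per-component update times in the table above can be cross-checked against the sequential total quoted earlier. The following Python sketch sums the listed times, with the PSU, PCIe switch, and retimer entries expanded to one entry per unit. Treating these components as the set that makes up the sequential CPU tray figure is an assumption. The parallel figure depends on which components update concurrently, which the table does not state, so it is not modeled here.

```python
# Per-component update times in minutes, copied from the table above.
UPDATE_MINUTES = {
    "Host BMC": 25,
    "Host BMC ERoT": 2,
    "SBIOS ERoT": 2,
    "SBIOS": 7,
    "Motherboard CPLD": 19,
    "Midplane CPLD": 13,
    "PCIe Switch 0": 1,
    "PCIe Switch 1": 1,
    "PCIe Retimer 0": 3,
    "PCIe Retimer 1": 2.5,
}
# Six power supplies, PSU_0 through PSU_5, at 2.75 minutes each.
UPDATE_MINUTES.update({f"PSU_{index}": 2.75 for index in range(6)})


def sequential_total(times):
    """Total time when every component is updated one after another."""
    return sum(times.values())


total = sequential_total(UPDATE_MINUTES)
print(f"Sequential total: {total} minutes")
```

The sum agrees with the 92 minutes listed for sequential updating of the CPU tray.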

Firmware Update Procedure

Refer to Firmware Update Steps.