NVIDIA DGX SuperPOD: Release Notes 10.25.03#

Introduction#

This document covers the NVIDIA Base Command™ Manager (BCM) 10.25.03 software release on NVIDIA DGX SuperPOD™ configurations. Except for Component Versions, the information herein is the same as in the NVIDIA Base Command Manager Release Notes.

Information about BCM and DGX SuperPOD is available at:

Component Versions#

DGX SuperPOD component versions for this release are listed in Table 1, Latest Validated SuperPOD Component Matrix.

Table 1 Latest Validated SuperPOD Component Matrix#

| Component | DGX B200 with BaseOS 7.x | DGX H100/H200 with BaseOS 6.x |
|---|---|---|
| BCM ISO | 10.25.03 | 10.25.03 |
| DGX OS | 7.0.2 | 6.3.1 |
| Ubuntu | 24.04.2 | 22.04.4 LTS |
| Enroot | 3.5.0 | 3.5.0 |
| CUDA toolkit | 12.8.1 | 12.4.1 |
| DCGM | 4.1.1 | 4.1.1 |
| Cumulus OS | 5.11.0 | 5.11.0 |
| Mellanox InfiniBand Switch | MLNX OS version: 3.12.2002; HCA Firmware: CX7 - 28.43.2026 | MLNX OS version: 3.12.2002; HCA Firmware: CX7 - 28.43.2026 |
| Slurm | 23.11.10 | 23.02.8 |
| Mellanox OFED Driver | DOCA OFED 2.9.1-3 | MLNX_OFED_LINUX-23.10-4.0.9.1 |
| DGX kernel | 6.8.0-55-generic | 5.15.0-1063-nvidia |
| GPU Driver | 570.124.06 | 550.90.07 |
| Lustre Client | ddn145 | ddn145 |
| UFM | UFM Enterprise Appliance 1.10.1 | UFM Enterprise Appliance 1.10.1 |
| HPL | hpc-benchmarks:25.04 | hpc-benchmarks:25.04 |
| NCCL | 2.25.1 | 2.25.1 |
| DGX FW | 25.02.5 | 24.09.17 |
| Kubernetes | 1.31 | 1.31 |
| GPU Operator | 24.9.1 | 24.9.1 |
| Network Operator | 254.7.0 | 254.7.0 |
| metallb | 0.14.9 | 0.14.9 |
| MPI Operator | 0.6.0 | 0.6.0 |
| kube-prometheus-stack | 65.2.4 | 65.2.4 |
| Calico | 3.28.2 | 3.28.2 |
| Run:AI Control Plane | 2.19.58 | 2.19.58 |

Note

The table above shows the latest validated DGX SuperPOD component matrix. Customers may update components if specific versions are publicly available. Contact your NVIDIA Technical Account Manager or support team for more details.
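
One way to use Table 1 is to compare a node's installed versions against the validated ones. The following is a minimal sketch, not an NVIDIA tool: it copies a few rows from Table 1 into a dictionary, and the `find_mismatches` helper and the sample `inventory` are hypothetical names and values chosen for illustration.

```python
# Validated versions copied from Table 1 (a subset of rows only).
VALIDATED = {
    "DGX B200 with BaseOS 7.x": {
        "DGX OS": "7.0.2",
        "CUDA toolkit": "12.8.1",
        "Slurm": "23.11.10",
        "GPU Driver": "570.124.06",
        "NCCL": "2.25.1",
    },
    "DGX H100/H200 with BaseOS 6.x": {
        "DGX OS": "6.3.1",
        "CUDA toolkit": "12.4.1",
        "Slurm": "23.02.8",
        "GPU Driver": "550.90.07",
        "NCCL": "2.25.1",
    },
}


def find_mismatches(platform, installed):
    """Return {component: (installed, validated)} for every differing version.

    Components missing from `installed` are reported with None.
    """
    expected = VALIDATED[platform]
    return {
        name: (installed.get(name), version)
        for name, version in expected.items()
        if installed.get(name) != version
    }


# Hypothetical inventory, e.g. collected by hand from a single node.
inventory = {
    "DGX OS": "7.0.2",
    "CUDA toolkit": "12.8.1",
    "Slurm": "23.11.10",
    "GPU Driver": "570.86.15",  # made-up value that differs from Table 1
    "NCCL": "2.25.1",
}
mismatches = find_mismatches("DGX B200 with BaseOS 7.x", inventory)
print(mismatches)
```

A mismatch is not necessarily a problem, since the note above allows component updates; the sketch only flags where a system departs from the validated matrix.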

General#

New Features#

  • Updated multiple Kubernetes operators for new setups: ingress-nginx 4.12.1 (fixes CVE-2025-1097, CVE-2025-1098, CVE-2025-1974, CVE-2025-24513, CVE-2025-24514), kube-prometheus-stack 70.3.0 (fixes CVE-2025-22868, CVE-2025-22870), and kube-state-metrics 5.31.0. Existing Kubernetes setups must be updated manually.

  • Added BFB pre/post-install sections in cm-dpu-manage

  • Added CUDA 12.6 packages

  • Added mlnx-ofed24.10 packages

  • Updated Slurm 24.05 to 24.05.7

  • Updated Slurm 24.11 to 24.11.3

  • Updated cm-nvhpc to 25.1

  • Updated cuda-driver to 570.124.06

  • Updated cuda-driver-535 to 535.216.03

  • Updated cuda-driver-550 to 550.127.08

  • Updated cuda12.6 to 12.6.3

  • Updated mlnx-ofed23.10 to 23.10-4.0.9.1

  • Updated mlnx-ofed58 to 5.8-6.0.4.2

  • Updated the Ubuntu 24.04 base distribution to Ubuntu 24.04.1

  • Updated BaseOS image to version 7.0.2

Fixed Issues#

  • Updated the GRUB images provided by cm-tftpboot to support booting PE32+ Linux kernels; without this update, aarch64 nodes cannot boot with RHEL 9.5

CMDaemon#

New Features#

  • Added new advanced configuration option HttpStrictTransportSecurityMaxAge

  • Update the Slurm topology.conf file when a cloud node goes UP or DOWN

  • Allow the option to configure a network security group for a specific VNIC in OCI

  • Added bounds checks in the monitoring storage to prevent possible crashes due to data corruption

  • Extend cm-deploy-lite-daemon to download packages on the head node for switches that are not connected to the external network

  • Include the arch/os software image information in the CMDaemon XML dump file

  • Allow the option to specify onboot=no for network interfaces via extra configuration values

  • Added total CPU and memory utilization metrics

  • Restrict the ability to use cmsh foreach or range commands for inefficient power or terminate operations on a large number of devices

  • Add a new REST API endpoint for WLM drain operations

  • Allow the option to override the list of enabled OCI agent plugins

  • Allow the option to customize /var/lib/kubelet/config.yaml via the Kubelet role

Fixed Issues#

  • An issue where, in the case of a write failure, CMDaemon may generate a large number of WebSocket-related error log messages

  • An issue parsing the BGP information in cm-lite-daemon

  • An issue where an invalid “plain/text” MIME type may be used by CMDaemon

  • An issue where DPU apply is not working for a brief period of time immediately after the node becomes UP

  • An issue where on Ubuntu 24.04 CMDaemon configures the ntp service instead of ntpsec

  • An issue with generating Slurm topology configuration from switches connected to other switches

  • An issue where redfish metrics containing a ~ symbol are being exported to Prometheus

  • Perform periodic checks for certificate signing requests (CSRs) for new Kubelets and certificate rotations; without these checks, the CSRs may not be approved

  • Rare CMDaemon crash when performing PDU-port power operations

  • An issue where the DPU apply RPC timeout is too short for the operation to complete

  • Improved pagination of the REST API /network/topology endpoint

  • An issue where the sysinfo GPU UUID does not match the nvidia-smi UUID

  • An issue where switching to a different consolidator can leave behind the old monitoring data

  • An issue where kill-no-job-user-ssh-sessions returns no-data instead of failing

  • An issue where recently added labeled entities may be removed and prevent returning correct PromQL results

  • An issue where sysinfo disk information is shown multiple times for devices managed by the lite-daemon

  • An issue with parsing jobs information when group name is not set

  • Rare deadlock within the head node CMDaemon process when both head nodes and many other nodes are being committed at the same time from different threads

  • An issue where the slurmctld service may be restarted when CMDaemon is restarted

  • An issue where chargeback queries can incorrectly report the value is out of range
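
Two of the CMDaemon items above concern the Slurm topology.conf file, including switches that are connected to other switches. For orientation, here is a minimal sketch of what a tree-style topology.conf looks like, using Slurm's `SwitchName=... Nodes=...` and `SwitchName=... Switches=...` line forms. It is not CMDaemon's implementation; the `render_topology` helper and all switch and node names are hypothetical.

```python
def render_topology(leaf_nodes, uplinks):
    """Render Slurm topology.conf lines for a switch tree.

    leaf_nodes: {leaf switch name: [node names attached to it]}
    uplinks:    {parent switch name: [child switch names]}
    """
    lines = []
    # Leaf switches list the compute nodes attached to them.
    for switch in sorted(leaf_nodes):
        nodes = ",".join(sorted(leaf_nodes[switch]))
        lines.append(f"SwitchName={switch} Nodes={nodes}")
    # Higher-level switches list the switches connected below them.
    for switch in sorted(uplinks):
        children = ",".join(sorted(uplinks[switch]))
        lines.append(f"SwitchName={switch} Switches={children}")
    return "\n".join(lines) + "\n"


# Hypothetical two-level fabric: two leaf switches under one spine.
leaves = {"leaf01": ["node001", "node002"], "leaf02": ["node003", "node004"]}
spines = {"spine01": ["leaf01", "leaf02"]}
topology = render_topology(leaves, spines)
print(topology, end="")
```

The switch-to-switch line is the part that only appears when switches are connected to other switches; a flat fabric would contain `Nodes=` lines alone.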

Node Installer#

Fixed Issues#

  • An issue with copying dangling symbolic links from the /cm/conf/- configuration directories to the node
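
A dangling symbolic link is a link whose target does not exist, which is easy to mishandle when a copy routine follows links instead of recreating them. The sketch below illustrates the general technique with Python's standard library; it is not the node installer's code, and the directory and file names are hypothetical.

```python
import os
import shutil
import tempfile


def find_dangling_symlinks(root):
    """Return sorted relative paths of symlinks under root whose target is missing."""
    dangling = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append(os.path.relpath(path, root))
    return sorted(dangling)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "conf")  # hypothetical source directory
    dst = os.path.join(tmp, "node")  # hypothetical destination directory
    os.makedirs(src)
    with open(os.path.join(src, "settings"), "w") as f:
        f.write("key=value\n")
    # Create a link that points at a target which does not exist.
    os.symlink("missing-target", os.path.join(src, "broken"))

    # symlinks=True recreates links as links instead of following them,
    # so a link with a missing target is still copied.
    shutil.copytree(src, dst, symlinks=True)

    copied_dangling = find_dangling_symlinks(dst)
    print(copied_dangling)
```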

COD#

New Features#

  • Disable public network access for the storage account in Azure by default

Machine Learning#

New Features#

  • Added NCCL 2.25.1 and cuDNN 9.6 and 9.7 for CUDA 12.8

cm-kubernetes-setup#

New Features#

  • Modify the default configuration for NVIDIA container toolkit to match the Run:ai requirements

  • Tune the Kubernetes API Server to more sensible defaults for production systems

  • [Kubernetes] Various improvements to the out-of-the-box Kube Prometheus Stack configuration, plus a patch to fix existing BCM setups from before 10.25.02 (cm-kubernetes-setup --patch-kube-prometheus-stack)

  • [Kubernetes] Simplify the BCM landing page ingress (a running Pod is no longer needed)

  • Enable Typha by default on clusters with fewer than 50 nodes when setting up Kubernetes with Calico CNI

  • Allow the option to set up NetQ 4.13 with cm-kubernetes-setup using Kubernetes v1.31 on Ubuntu 22.04

  • Allow the option to install NVIDIA GPU Operator without installing BCM NVIDIA GPU packages

Improvements#

  • Allow the option to configure Kubernetes Ingress HTTPS on port 443 on the head node with SSL passthrough

Fixed Issues#

  • An issue where the cluster-admin service account is not created in the correct ‘default’ namespace

  • Suppress an incorrect warning message about existing /etc/kubernetes directory as a symlink on the secondary head node

  • Allow the option to install BCM NVIDIA GPU packages without installing NVIDIA GPU Operator

  • Wait for the Etcd information to become available to prevent cm-kubernetes-setup failures with error message ‘NoneType’ object has no attribute ‘advertiseClientUrls’
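
The last item describes a common failure pattern: code reads an attribute from a value that is still `None` because the data has not been published yet. A generic way to avoid it is to poll until the value is available or a timeout expires. The following sketch shows that pattern only; it is not the cm-kubernetes-setup implementation, and `wait_for`, the fake responses, and the URL are hypothetical.

```python
import time


def wait_for(fetch, timeout=30.0, interval=1.0):
    """Call fetch() until it returns a non-None value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        value = fetch()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("value did not become available in time")
        time.sleep(interval)


# Hypothetical stand-in for a lookup that returns None until data is published.
responses = iter([None, None, {"advertiseClientUrls": ["https://10.141.0.1:2379"]}])
etcd_info = wait_for(lambda: next(responses), timeout=5.0, interval=0.0)

# Safe to read the attribute now, because wait_for never returns None.
print(etcd_info["advertiseClientUrls"])
```

Raising `TimeoutError` instead of returning `None` keeps the failure explicit, so the caller sees a clear message and not an attribute error further down.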

cm-wlm-setup#

New Features#

  • Allow the option to select NRT GPU configuration settings in cm-wlm-setup

Fixed Issues#

  • An issue with setting up pyxis if the secondary head node is down

cmsh#

Fixed Issues#

  • An issue with creating a ‘node’ type execution multiplexer in cmsh

  • An issue with the addinterface command which can result in a crash of cmsh

jupyter#

Fixed Issues#

  • An issue where kernel icons are not available when the certificates are generated with openssl 3.2.2

pythoncm#

New Features#

  • Added pythoncm parallel MIG function RPC