NVIDIA DGX SuperPOD: Release Notes 10.24.01

Introduction

This document covers the NVIDIA Base Command™ Manager (BCM) 10.24.01 software release on NVIDIA DGX SuperPOD™ configurations. Except for Component Versions, the information herein is the same as in the NVIDIA Base Command Manager Release Notes.

Information about BCM and DGX SuperPOD is available at:

Important

The NVIDIA DGX SuperPOD: Release Notes 10.24.01 is also available as a PDF.

Component Versions

DGX SuperPOD component versions for this release are in Table 1.

Table 1. Common component versions

| Component | Version |
| --- | --- |
| BCM ISO | 10.24.01 |
| DGX OS | 6.1.0 |
| Ubuntu | Ubuntu 22.04.1 LTS |
| Enroot | 3.4.1-1 |
| CUDA toolkit | 12.2 |
| DCGM | 3.1.8 |
| Cumulus OS | 5.5.1 |
| Mellanox InfiniBand Switch (DGX H100) | MLNX OS version: 3.11.2016; HCA firmware: CX7 - 28.39.2048 |
| Mellanox InfiniBand Switch (DGX A100) | MLNX OS version: 3.11.2016; HCA firmware: CX7 - 28.39.2048 |
| Slurm | 23.02.7 |
| Mellanox OFED Driver (A100 and H100) | 23.10-1.1.9.0 LTS (Slogin and DGX nodes) |
| DGX kernel | 5.15.0-1042-nvidia |
| GPU Driver | 535.129.03 |
| Lustre Client | lustre-client-modules-5.19.0-45-generic 2.14.0-ddn125 |
| UFM | UFM Enterprise SW: 6.15.1-4 |
| HPL | hpc-benchmarks:23.10 |
| NCCL | tensorrt:23.12-py3 |
| DGX FW | 1.1.3 |

Change Requests

General

New Features

  • The head node installer now creates a new /etc/cm-install-release file that records the installation time and the installation media that was used (see the sketch after this list)

  • Added support for upgrading BCM3/Bright9.2 clusters to BCM 10
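
The exact layout of /etc/cm-install-release is not documented here; the sketch below assumes a simple text file and, where possible, a hypothetical "key: value" layout. It only reads and prints the recorded information.

```python
# Minimal sketch: read /etc/cm-install-release on the head node and print its
# contents. The "key: value" layout assumed here is hypothetical; if the file
# is free-form text, the unparsed lines are reported as-is.
from pathlib import Path

RELEASE_FILE = Path("/etc/cm-install-release")

def read_install_release(path: Path = RELEASE_FILE) -> dict:
    """Return any 'key: value' pairs found in the file; keep other lines raw."""
    info, raw = {}, []
    for line in path.read_text().splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            info[key.strip()] = value.strip()
        elif line.strip():
            raw.append(line)
    if raw:
        info["_unparsed"] = raw
    return info

if __name__ == "__main__":
    if RELEASE_FILE.exists():
        for key, value in read_install_release().items():
            print(f"{key}: {value}")
    else:
        print(f"{RELEASE_FILE} not found (pre-10.24.01 installation?)")
```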

Improvements

  • Added cuda-driver-535 package

  • The mlnx-ofed packages’ installation scripts now pin the kernel packages on Ubuntu when deploying MOFED (see the sketch after this list)

  • Updated mlnx-ofed58 to 5.8-4.1.5.0

  • Updated mlnx-ofed23.10 to 23.10-1.1.9.0

  • Updated cuda12.3 to 12.3 update 2

  • Updated cm-nvhpc to 23.11
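
On Ubuntu, pinning kernel packages is commonly done with `apt-mark hold`. The sketch below only lists any kernel packages that are currently held, as a way to verify the effect of a MOFED deployment; it does not reproduce the mlnx-ofed installation scripts themselves, and the use of `apt-mark` is an assumption about the pinning mechanism.

```python
# Sketch: list Ubuntu kernel packages currently held (pinned) by apt.
# Assumes pinning was done via "apt-mark hold", which is one common mechanism,
# not necessarily the exact one used by the mlnx-ofed installation scripts.
import subprocess

def held_kernel_packages() -> list[str]:
    out = subprocess.run(
        ["apt-mark", "showhold"], capture_output=True, text=True, check=True
    ).stdout
    return [pkg for pkg in out.splitlines() if pkg.startswith("linux-")]

if __name__ == "__main__":
    held = held_kernel_packages()
    if held:
        print("Held kernel packages:")
        for pkg in held:
            print(f"  {pkg}")
    else:
        print("No kernel packages are currently held.")
```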

CMDaemon

Improvements

  • Update the Kubernetes users’ configuration files with Run:ai configuration settings

  • Redirect the output from cm-burn to tty1

  • Added new GPU totals metrics for temperature and NVLink bandwidth (see the Prometheus query sketch after this list)

  • The BCM GPU autodetection configuration mechanism can now also be selected in the Slurm WLM cluster settings, not only in the Slurm WLM client role

  • Ensure that kubelets can join a Kubernetes cluster even after the initial certificates have expired (which typically happens after 4 hours)
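
Monitoring data can be queried through a Prometheus-compatible HTTP API once the Prometheus integration is set up. The sketch below issues an instant query against such an endpoint; the endpoint URL and the metric name are placeholders, not the actual names of the new GPU totals metrics.

```python
# Sketch: run a Prometheus instant query for a GPU metric.
# PROMETHEUS_URL and METRIC are placeholders; substitute the real endpoint
# and the actual metric name exported for the new GPU totals.
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"   # placeholder endpoint
METRIC = "gpu_temperature_total"           # hypothetical metric name

def instant_query(expr: str) -> list[dict]:
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]

if __name__ == "__main__":
    for series in instant_query(METRIC):
        print(series["metric"], series["value"])
```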

Fixed Issues

  • An issue with sorting the data passed to the PromQL engine, which can result in an error “expanding series: closed SeriesSet” when running instant queries

  • An issue where the exclude list snippets are not being cloned when cloning a software image

  • A rare deadlock in CMDaemon that could occur while committing a head node

  • An issue with configuring the GATEWAYDEV on RHEL 9 when a VLAN network interface is configured on top of the BOOTIF interface

  • An issue where /etc/systemd/resolved.conf is not being added to the imageupdate exclude list for the compute nodes

  • An issue where install-license may not copy some certificates to all /cm/shared* directories on a multi-arch or multi-OS cluster

  • An issue with the Prometheus exporter when entities have recently been removed

  • An issue with parsing multiple pending Kubernetes CSRs per node, which could result in none of the CSRs being approved (see the sketch after this list)

  • On the SLES base distribution, an issue with updating the cluster landing page with links to the dashboards of other integrations such as Kubernetes or Ceph

  • An issue where CMDaemon may not restart the Slurm services automatically when the configured number of CPUs changes for some nodes

  • An issue where CMDaemon may hang waiting for events while stopping

  • An issue where the cmsh call to create a certificate may return before the certificate is written

  • An issue where entering the cmsh biossettings mode may result in an “Error parsing JSON” error message

  • In some cases, an issue with configuring Slurm when GPU automatic configuration by BCM has been selected

  • In some cases, an issue with setting up Etcd due to insufficient permissions to access the Etcd certificate files

  • An issue where a WLM job process ID may be added to an incorrect cgroup, which in some cases may result in the process being killed when another WLM job running on the same node completes

  • An issue with collecting GPU job metrics for containerized Pyxis jobs
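
For the CSR item above, the sketch below lists pending node CSRs, grouped by requesting user, using the official Kubernetes Python client. It is only an inspection aid for an administrator with a valid kubeconfig; it is not the CMDaemon code path that approves the CSRs.

```python
# Sketch: list pending (not yet approved or denied) Kubernetes CSRs, grouped
# by requesting username, using the official kubernetes Python client.
from collections import defaultdict
from kubernetes import client, config

def pending_csrs_by_user() -> dict[str, list[str]]:
    config.load_kube_config()
    api = client.CertificatesV1Api()
    pending = defaultdict(list)
    for csr in api.list_certificate_signing_request().items:
        conditions = (csr.status and csr.status.conditions) or []
        if not any(c.type in ("Approved", "Denied") for c in conditions):
            pending[csr.spec.username].append(csr.metadata.name)
    return dict(pending)

if __name__ == "__main__":
    for user, names in pending_csrs_by_user().items():
        print(f"{user}: {', '.join(names)}")
```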

cm-kubernetes-setup

New Features

  • Use Calico 3.27, Run:ai 2.15.2, and GPU operator v23.9.1 for new Kubernetes deployments using cm-kubernetes-setup

Improvements

  • Allow the option to choose Network Operator version 23.10.0

  • Allow the option to configure a custom Kubernetes Ingress certificate
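
In Kubernetes, a custom Ingress certificate is typically stored as a TLS secret that the Ingress references via spec.tls[].secretName. The sketch below creates such a secret with the Kubernetes Python client; the namespace, secret name, and file paths are placeholders, and this does not reproduce what cm-kubernetes-setup does internally.

```python
# Sketch: create a kubernetes.io/tls secret holding a custom certificate,
# which an Ingress can reference via spec.tls[].secretName. Names and file
# paths are placeholders; this is not cm-kubernetes-setup itself.
from pathlib import Path
from kubernetes import client, config

NAMESPACE = "ingress-nginx"         # placeholder namespace
SECRET_NAME = "custom-ingress-tls"  # placeholder secret name

def create_tls_secret(cert_file: str, key_file: str) -> None:
    config.load_kube_config()
    secret = client.V1Secret(
        metadata=client.V1ObjectMeta(name=SECRET_NAME, namespace=NAMESPACE),
        type="kubernetes.io/tls",
        string_data={
            "tls.crt": Path(cert_file).read_text(),
            "tls.key": Path(key_file).read_text(),
        },
    )
    client.CoreV1Api().create_namespaced_secret(NAMESPACE, secret)

if __name__ == "__main__":
    create_tls_secret("/path/to/tls.crt", "/path/to/tls.key")
```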

cm-lite-daemon

Improvements

  • Added new metrics for the total traffic on network interfaces
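
On Linux, per-interface traffic totals come from kernel counters such as /sys/class/net/<iface>/statistics/rx_bytes and tx_bytes. The sketch below reads those counters directly to show the kind of data behind such metrics; it is not cm-lite-daemon's own implementation.

```python
# Sketch: read the cumulative rx/tx byte counters for every network interface
# from /sys/class/net. These kernel counters are the kind of source a "total
# traffic" metric is built from; this is not cm-lite-daemon code.
from pathlib import Path

def interface_totals() -> dict[str, dict[str, int]]:
    totals = {}
    for iface in Path("/sys/class/net").iterdir():
        stats = iface / "statistics"
        try:
            totals[iface.name] = {
                "rx_bytes": int((stats / "rx_bytes").read_text()),
                "tx_bytes": int((stats / "tx_bytes").read_text()),
            }
        except OSError:
            continue  # interface disappeared or counters unreadable
    return totals

if __name__ == "__main__":
    for name, counters in sorted(interface_totals().items()):
        print(f"{name}: rx={counters['rx_bytes']} B, tx={counters['tx_bytes']} B")
```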

cm-wlm-setup

Fixed Issues

  • In some cases, an issue with installing Pyxis on multi-arch or multi-distro software images

  • Pyxis’ enroot is now configured to use its internal default for the cache directory, which was previously set to a directory under /run

cmsh

New Features

  • Added a new cmsh “multiplexers” command in monitoring / setup mode, which shows which nodes will run a specified data producer on behalf of other entities

pythoncm

Improvements

  • Added a new pythoncm example script, total-job-power-usage.py, for calculating the power usage of WLM jobs (a conceptual sketch of the calculation follows)
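
The actual example script ships with pythoncm; the sketch below only illustrates the underlying arithmetic of turning sampled node power readings into a job's energy use. The sample data is made up, and trapezoidal integration is one straightforward way to do the calculation, not necessarily the method used by the shipped script.

```python
# Conceptual sketch (not the shipped total-job-power-usage.py): estimate a WLM
# job's energy use by integrating power samples over the job's runtime with
# the trapezoidal rule. The samples below are made-up (time in seconds,
# power in watts) for the nodes the job ran on.
samples = {
    "dgx-01": [(0, 5200.0), (60, 6100.0), (120, 6050.0), (180, 5300.0)],
    "dgx-02": [(0, 5150.0), (60, 6000.0), (120, 6080.0), (180, 5250.0)],
}

def node_energy_joules(points: list[tuple[float, float]]) -> float:
    """Trapezoidal integration of (time_s, power_W) samples -> energy in joules."""
    return sum(
        (t1 - t0) * (p0 + p1) / 2.0
        for (t0, p0), (t1, p1) in zip(points, points[1:])
    )

if __name__ == "__main__":
    total_j = sum(node_energy_joules(pts) for pts in samples.values())
    print(f"Estimated job energy: {total_j / 3.6e6:.3f} kWh")
```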