NVIDIA DGX SuperPOD: Release Notes 10.23.12#

Introduction#

This document covers the NVIDIA Base Command™ Manager (BCM) 10.23.12 software release on NVIDIA DGX SuperPOD™ configurations. Except for Component Versions, the information herein is the same as in the NVIDIA Base Command Manager Release Notes.

Information about BCM and DGX SuperPOD is available at:

Important

The NVIDIA DGX SuperPOD: Release Notes 10.23.12 is also available as a PDF.

Component Versions#

DGX SuperPOD component versions for this release are in Table 13.

Table 13 Common component versions#
Component	Version
BCM ISO	10.23.12
DGX OS	6.1.0
Ubuntu	Ubuntu 22.04.1 LTS
Enroot	3.4.1-1
CUDA toolkit	12.2
DCGM	3.1.8
Cumulus OS	5.5.1
Mellanox InfiniBand Switch (DGX H100)	MLNX OS version: 3.11.1014; HCA Firmware: CX7 - 28.36.2050
Mellanox InfiniBand Switch (DGX A100)	MLNX OS version: 3.11.1014; HCA Firmware: CX7 - 28.36.2050
Slurm	23.02.6
Mellanox OFED Driver (A100 and H100)	23.10-0.5.5.0 (Slogin and DGX nodes)
DGX kernel	5.15.0-1040-nvidia
GPU Driver	535.129.03
Lustre Client	lustre-client-modules-5.19.0-45-generic
UFM	UFM 3.0 SDN version: 1.3.1
HPL	hpc-benchmarks:23.10
NCCL	tensorrt:23.11-py3

Change Requests#

General#

New Features#

Added support for Kubernetes v1.28

Improvements#

Added mlnx-ofed23.10 package
Added CUDA 12.3 packages
Updated cuda-driver to 545.23.08
Updated cm-openssl to 3.1.4
Updated cuda-driver-legacy-470 to 470.223.02

CMDaemon#

New Features#

Allow the Kubernetes kubelet service to start when swap is enabled, which resolves an issue where Kubernetes setup may fail if the head node has swap and is selected as a Kubernetes master node

Improvements#

Added support for two VLANs on top of a bond interface with a VLAN as the provisioning interface
Added a periodic CMDaemon maintenance task to free heap allocations that are not automatically freed, which reduces memory usage
Display both the node and the accelerator counts in the license information in cmsh
Added an hourly cron job to clean up files left behind when the sample-ipmi script is killed
Prevent already outdated monitoring data with timestamps in the past from being saved in the CMDaemon database
Added an “export as CSV” option in the REST monitoring data API
Added new ManagedServicesOK monitoring metrics for the partition and the categories, which give totals for all of the nodes in the cluster or for the nodes in a given category
Allow the option to disable the nvlink and nvswitch metrics by specifying extra_values configuration settings for the nodes
Include the outdated packages for the passive head node, the node-installer, and the software images in the notification for available BCM updates

Fixed Issues#

An issue with sorting on timestamps in a monitoring plot consisting of raw and consolidated data, which in some cases can result in CMDaemon returning no monitoring data for certain requests
An issue with reporting the correct InfiniBand count in the hardware overview of the compute nodes
A timing issue that, in some cases, may prevent the pbsmom service from starting in an on-prem+edge workload manager setup
An issue with the Prometheus exporter for monitoring data when the data contains measurable names with spaces
An issue with performing parallel device power operations with cmsh
Allow the option to use GET in the Prometheus sampler for older exporters that do not allow POST operations
An issue with moving the software image revisions directories when updating the path of the parent software image
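
The Prometheus exporter item above involves measurable names that contain spaces. Prometheus itself only accepts metric names matching `[a-zA-Z_:][a-zA-Z0-9_:]*`, so an exporter has to map such names to valid ones. The following is a minimal sketch of one possible mapping, not the CMDaemon implementation; the function name `to_prometheus_name` and the sample measurable name are hypothetical.

```python
import re

# Prometheus metric names must match [a-zA-Z_:][a-zA-Z0-9_:]*
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9_:]")


def to_prometheus_name(measurable: str) -> str:
    """Map an arbitrary measurable name to a valid Prometheus metric name.

    Hypothetical helper for illustration; not part of BCM.
    """
    # Replace every disallowed character (spaces included) with "_".
    name = _INVALID_CHARS.sub("_", measurable.strip())
    # A metric name may not be empty or start with a digit.
    if not name or name[0].isdigit():
        name = "_" + name
    return name


# "Free Space" is an assumed example name, not taken from the release notes.
print(to_prometheus_name("Free Space"))
```

Replacing characters can make two different measurable names collide, so a real exporter would likely also need to keep the original name, for example in a label.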

Node Installer#

Improvements#

On Ubuntu-based distributions, make the updates of the ntp.conf drift file settings consistent between the Node Installer and CMDaemon, which until now could generate different configuration files

cm-diagnose#

Improvements#

Include syslog in cm-diagnose
Sanitize all mysqldumps in cm-diagnose

cm-harbor#

Fixed Issues#

A race condition where, in some cases, Harbor from the cm-harbor package and Shorewall concurrently update the iptables rules, which can prevent the required iptables rules from being enabled

cm-kubernetes-setup#

Fixed Issues#

An issue selecting the correct Kubernetes namespace in the retry mechanism of cm-kubernetes-setup when uninstalling failed operators

cm-scale#

New Features#

The Auto Scaler now takes into account the NVIDIA GPU requests made by Kubernetes pods and jobs when selecting the compute nodes to power on

Fixed Issues#

The Auto Scaler now takes the Slurm mincpus parameter into account
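
The two cm-scale items above both concern which compute nodes are considered when resources are requested: GPU requests from Kubernetes pods and jobs, and the Slurm mincpus parameter (a minimum CPU count per node). The sketch below illustrates the general idea with a simple greedy selection. It is an assumption-laden illustration and not the Auto Scaler's actual algorithm; the `Node` class, the `pick_nodes_to_power_on` function, and the sample node data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical description of a powered-off compute node."""
    name: str
    cpus: int
    gpus: int


def pick_nodes_to_power_on(nodes, gpus_needed, min_cpus=1):
    """Greedily pick nodes until the requested GPU count is covered.

    Nodes with fewer than min_cpus CPUs, or without GPUs, are skipped.
    Returns an empty list if the request cannot be satisfied.
    """
    eligible = [n for n in nodes if n.cpus >= min_cpus and n.gpus > 0]
    # Prefer nodes with more GPUs so that fewer nodes are powered on.
    eligible.sort(key=lambda n: n.gpus, reverse=True)
    chosen, remaining = [], gpus_needed
    for node in eligible:
        if remaining <= 0:
            break
        chosen.append(node)
        remaining -= node.gpus
    return chosen if remaining <= 0 else []


# Assumed sample inventory, for illustration only.
inventory = [
    Node("node001", cpus=8, gpus=0),
    Node("node002", cpus=64, gpus=4),
    Node("node003", cpus=32, gpus=8),
    Node("node004", cpus=4, gpus=8),
]
selected = pick_nodes_to_power_on(inventory, gpus_needed=10, min_cpus=16)
print([n.name for n in selected])
```

In this sketch node004 is excluded by the CPU minimum even though it has enough GPUs, which is the kind of interaction the mincpus fix addresses.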

cm-setup#

Fixed Issues#

An issue with ssh connections initiated by cm-setup scripts when an ssh-agent is running and has an SSH ECDSA certificate added to the agent
A regression in cm-container-registry-setup for Harbor on an HA head node setup, which can result in “no such file or directory” error messages for files not present on the passive head node

cm-wlm-setup#

Improvements#

Create the enroot cache shared directory automatically if it does not already exist

Fixed Issues#

An issue where enroot is not configured by default on a head node when the pyxis Slurm plugin is enabled
Allow the option to configure full BCM GPU autodetection for Slurm with cm-wlm-setup

jupyter#

Improvements#

In some cases, an issue where duplicated pods or services may be created due to a race condition in the Kubernetes API
Update the JupyterLab and JupyterHub dependencies to the most recent versions

pythoncm#

Improvements#

Added an option to use gzip compression in RPC calls
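
These notes do not say how the gzip option is enabled in pythoncm, so the following does not show the pythoncm API. As a general illustration of what gzip compression of an RPC payload involves, here is a minimal sketch using only the Python standard library. It assumes a JSON-encoded payload; the function names `encode_payload` and `decode_payload` and the sample data are hypothetical.

```python
import gzip
import json


def encode_payload(obj, compress=True):
    """Serialize obj to JSON bytes, optionally gzip-compressed."""
    raw = json.dumps(obj).encode("utf-8")
    return gzip.compress(raw) if compress else raw


def decode_payload(data):
    """Decode bytes produced by encode_payload, compressed or not."""
    # gzip streams start with the magic bytes 0x1f 0x8b; JSON text never does.
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    return json.loads(data.decode("utf-8"))


# Assumed sample payload: a long, repetitive node list compresses well.
payload = {"nodes": ["node001"] * 1000}
plain = encode_payload(payload, compress=False)
packed = encode_payload(payload, compress=True)
print(len(plain), len(packed))
```

Compression trades CPU time for bandwidth, so it tends to pay off for large, repetitive responses such as node lists or monitoring data and matters less for small calls.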