NVIDIA DGX SuperPOD: Release Notes 10.23.12
Introduction
This document covers the NVIDIA Base Command™ Manager (BCM) 10.23.12 software release on NVIDIA DGX SuperPOD™ configurations. Except for Component Versions, the information herein is the same as in the NVIDIA Base Command Manager Release Notes.
Information about BCM and DGX SuperPOD is available at:
Important
The NVIDIA DGX SuperPOD: Release Notes 10.23.12 is also available as a PDF.
Component Versions
DGX SuperPOD component versions for this release are in Table 1.
Component | Version |
---|---|
BCM ISO | 10.23.12 |
DGX OS | 6.1.0 |
Ubuntu | Ubuntu 22.04.1 LTS |
Enroot | 3.4.1-1 |
CUDA toolkit | 12.2 |
DCGM | 3.1.8 |
Cumulus OS | 5.5.1 |
Mellanox InfiniBand Switch (DGX H100) | MLNX OS version: 3.11.1014; HCA Firmware: CX7 - 28.36.2050 |
Mellanox InfiniBand Switch (DGX A100) | MLNX OS version: 3.11.1014; HCA Firmware: CX7 - 28.36.2050 |
Slurm | 23.02.6 |
Mellanox OFED Driver (A100 and H100) | 23.10-0.5.5.0 (Slogin and DGX nodes) |
DGX kernel | 5.15.0-1040-nvidia |
GPU Driver | 535.129.03 |
Lustre Client | lustre-client-modules-5.19.0-45-generic |
UFM | UFM 3.0 SDN version: 1.3.1 |
HPL | hpc-benchmarks:23.10 |
NCCL | tensorrt:23.11-py3 |
Change Requests
General
New Features
Added support for Kubernetes v1.28
Improvements
Added mlnx-ofed23.10 package
Added CUDA 12.3 packages
Updated cuda-driver to 545.23.08
Updated cm-openssl to 3.1.4
Updated cuda-driver-legacy-470 to 470.223.02
CMDaemon
New Features
Allow the Kubernetes kubelet service to start when swap is enabled, which resolves an issue where Kubernetes setup may fail if the head node has swap enabled and is selected as a Kubernetes master node
Improvements
Added support for two VLANs on top of a bond interface with a VLAN as the provisioning interface
Added a periodic CMDaemon maintenance task to free the heap allocations that are not automatically freed, which reduces the memory usage
Display both the nodes and the accelerator counts in the license information in cmsh
Added an hourly cron job to clean up files left behind when the sample-ipmi script is killed
Prevent outdated monitoring data with timestamps in the past from being saved in the CMDaemon database
Added an “export as CSV” option in the REST monitoring data API
Added new ManagedServicesOK monitoring metrics for the partition and the categories, which give totals for all of the nodes in the cluster or for the nodes in a given category
Allow the option to disable the nvlink and nvswitch metrics by specifying extra_values configuration settings for the nodes
Include the outdated packages for the passive head node, the node-installer, and the software images in the notification for available BCM updates
Fixed Issues
An issue with sorting on timestamps in a monitoring plot consisting of raw and consolidated data, which in some cases can result in CMDaemon returning no monitoring data for certain requests
An issue with reporting the correct InfiniBand count in the hardware overview of the compute nodes
In some cases, a timing issue that may prevent the pbsmom service from starting in an on-prem+edge workload manager setup
An issue with the Prometheus exporter for monitoring data when the data contains measurable names with spaces
An issue with performing parallel device power operations with cmsh
Allow the option to use GET in the Prometheus sampler for older exporters that do not allow POST operations
An issue with moving the software image revisions directories when updating the path of the parent software image
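The Prometheus exporter item above concerns measurable names that contain spaces. Prometheus metric names must match `[a-zA-Z_:][a-zA-Z0-9_:]*`, so a name with a space is not a valid metric name. The sketch below illustrates one common way to handle this; it is not taken from CMDaemon, and the helper name `sanitize_metric_name` is hypothetical.

```python
import re

# Pattern that a valid Prometheus metric name must match.
VALID_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")


def sanitize_metric_name(name: str) -> str:
    """Hypothetical helper: map an arbitrary measurable name to a valid
    Prometheus metric name by replacing disallowed characters with "_"."""
    cleaned = re.sub(r"[^a-zA-Z0-9_:]", "_", name.strip())
    # A metric name must not be empty or start with a digit.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned
```

Different measurable names can map to the same sanitized name, so an exporter that uses this approach may also need to keep the original name in a label.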
Node Installer
Improvements
On Ubuntu-based distributions, make the updates of the ntp.conf drift file settings consistent between the Node Installer and CMDaemon, which until now could generate different configuration files
cm-diagnose
Improvements
Include syslog in cm-diagnose
Sanitize all mysqldumps in cm-diagnose
cm-harbor
Fixed Issues
A race condition where Harbor from the cm-harbor package and Shorewall concurrently update the iptables rules, which in some cases can prevent the required iptables rules from being enabled
cm-kubernetes-setup
Fixed Issues
An issue selecting the correct Kubernetes namespace in the retry mechanism of cm-kubernetes-setup when uninstalling failed operators
cm-scale
New Features
The Auto Scaler now takes into account the NVIDIA GPU requests made by Kubernetes pods and jobs when selecting the compute nodes to power on
Fixed Issues
The Auto Scaler now takes the Slurm mincpus parameter into account
cm-setup
Fixed Issues
An issue with ssh connections initiated by cm-setup scripts when an ssh-agent is running and has an SSH ECDSA certificate added to the agent
A regression in cm-container-registry-setup for Harbor on an HA head node setup, which can result in “no such file or directory” error messages for files not present on the passive head node
cm-wlm-setup
Improvements
Create the enroot cache shared directory automatically if it does not already exist
Fixed Issues
An issue where enroot is not configured by default on a head node when the pyxis Slurm plugin is enabled
Allow the option to configure full BCM GPU autodetection for Slurm with cm-wlm-setup
jupyter
Improvements
In some cases, an issue where duplicated pods or services may be created due to a race condition in the Kubernetes API
Update the JupyterLab and JupyterHub dependencies to the most recent versions
pythoncm
Improvements
Added an option to use gzip compression in RPC calls
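These notes do not describe how pythoncm exposes the gzip option. As a general illustration of the idea only, the sketch below compresses a JSON request body with the Python standard library and decodes it again. The function names `encode_payload` and `decode_payload` are hypothetical and are not part of pythoncm.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def encode_payload(obj, compress=True):
    """Hypothetical helper: serialize obj to JSON bytes, optionally gzip-compressed."""
    raw = json.dumps(obj).encode("utf-8")
    return gzip.compress(raw) if compress else raw


def decode_payload(data):
    """Hypothetical helper: accept either plain or gzip-compressed JSON bytes."""
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    return json.loads(data.decode("utf-8"))
```

Compression tends to pay off for large, repetitive payloads such as long lists of entities; for very small payloads the gzip header can make the message larger.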