NVIDIA DGX SuperPOD: Release Notes 10.24.01
Introduction
This document covers the NVIDIA Base Command™ Manager (BCM) 10.24.01 software release on NVIDIA DGX SuperPOD™ configurations. Except for Component Versions, the information herein is the same as in the NVIDIA Base Command Manager Release Notes.
Information about BCM and DGX SuperPOD is available in the NVIDIA Base Command Manager and NVIDIA DGX SuperPOD documentation.
Important
The NVIDIA DGX SuperPOD: Release Notes 10.24.01 is also available as a PDF.
Component Versions
DGX SuperPOD component versions for this release are in Table 1.
Table 1. DGX SuperPOD component versions

| Component | Version |
|---|---|
| BCM ISO | 10.24.01 |
| DGX OS | 6.1.0 |
| Ubuntu | Ubuntu 22.04.1 LTS |
| Enroot | 3.4.1-1 |
| CUDA toolkit | 12.2 |
| DCGM | 3.1.8 |
| Cumulus OS | 5.5.1 |
| Mellanox InfiniBand Switch (DGX H100) | MLNX OS version: 3.11.2016; HCA Firmware: CX7 - 28.39.2048 |
| Mellanox InfiniBand Switch (DGX A100) | MLNX OS version: 3.11.2016; HCA Firmware: CX7 - 28.39.2048 |
| Slurm | 23.02.7 |
| Mellanox OFED Driver (A100 and H100) | 23.10-1.1.9.0 LTS (Slogin and DGX nodes) |
| DGX kernel | 5.15.0-1042-nvidia |
| GPU Driver | 535.129.03 |
| Lustre Client | lustre-client-modules-5.19.0-45-generic 2.14.0-ddn125 |
| UFM | UFM Enterprise SW: 6.15.1-4 |
| HPL | hpc-benchmarks:23.10 |
| NCCL | tensorrt:23.12-py3 |
| DGX FW | 1.1.3 |
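On a deployed DGX node, several of these versions can be confirmed directly from the command line. A minimal sketch; output will vary by node and image:

```
# GPU driver version as reported by the driver
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Running DGX kernel
uname -r
# Ubuntu release
lsb_release -d
```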
Change Requests
General
New Features
The head node installer will now create a new /etc/cm-install-release file to record the installation time and the media that was used (an example of reading it follows this list)
Added support for upgrading BCM3/Bright9.2 clusters to BCM 10
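As a quick check, the new record can be read directly on the head node. A minimal sketch; the exact layout of the file's contents is not described here:

```
# Show when this head node was installed and from which media
cat /etc/cm-install-release
```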
Improvements
Added cuda-driver-535 package
The mlnx-ofed packages’ installation scripts will now pin the kernel packages on Ubuntu when deploying MOFED
Updated mlnx-ofed58 to 5.8-4.1.5.0
Updated mlnx-ofed23.10 to 23.10-1.1.9.0
Updated cuda12.3 to 12.3 update 2
Updated cm-nvhpc to 23.11
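After these package updates are applied to a node or software image, the installed versions can be verified in place. A minimal sketch, assuming Ubuntu-based nodes with MOFED installed; package names may differ per image:

```
# Report the installed MLNX_OFED version string
ofed_info -s
# List the CUDA driver and NVHPC-related packages known to dpkg
dpkg -l | grep -E 'cuda-driver|nvhpc'
```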
CMDaemon
Improvements
Update the Kubernetes users’ configuration files with Run:ai configuration settings
Redirect the output from cm-burn to tty1
Added new GPU totals metrics for temperature and NVLink bandwidth
Allow the option to select the BCM GPU autodetection configuration mechanism in the Slurm WLM cluster settings as well, not only in the Slurm WLM client role
Ensure kubelets can join a Kubernetes cluster even after the initial certificates have expired (which typically happens after 4 hours)
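For the kubelet certificate change, the expiry of a node's current kubelet client certificate can be checked with openssl. A minimal sketch; the certificate path is an assumption and may differ on BCM-managed Kubernetes nodes:

```
# Print the expiry date of the kubelet client certificate
# (path assumes the common kubelet default location)
openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem
```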
Fixed Issues
An issue with sorting the data passed to the PromQL engine, which can result in an error “expanding series: closed SeriesSet” when running instant queries
An issue where the exclude list snippets are not being cloned when cloning a software image
Rare deadlock in CMDaemon which can occur while committing a head node
An issue with configuring the GATEWAYDEV on RHEL 9 when a VLAN network interface is configured on top of the BOOTIF interface
An issue where /etc/systemd/resolved.conf is not being added to the imageupdate exclude list for the compute nodes
An issue where install-license may not copy some certificates to all /cm/shared* directories on a multi-arch or multi-OS cluster
An issue with the Prometheus exporter when entities have recently been removed
An issue with parsing multiple pending Kubernetes CSRs per node, which can result in none of the CSRs being approved
On SLES base distribution, an issue with updating the cluster landing page with links to the dashboards of other integrations such as Kubernetes or Ceph
An issue where CMDaemon may not restart the Slurm services automatically when the number of CPUs configuration settings for some nodes change
An issue where CMDaemon may hang waiting for events while stopping
An issue where the cmsh call to create a certificate may return before the certificate is written
An issue where entering the cmsh biossettings mode may result in an “Error parsing JSON” error message
In some cases, an issue with configuring Slurm when GPU automatic configuration by BCM has been selected
In some cases, an issue with setting up Etcd due to insufficient permissions to access the Etcd certificate files
An issue where a WLM job process ID may be added to an incorrect cgroup, which in some cases may result in the process being killed when another WLM job running on the same node completes
An issue with collecting GPU job metrics for containerized Pyxis jobs
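Related to the cgroup fix above, the cgroup placement of a running job process can be inspected with standard Linux tooling; this is a generic check, not specific to BCM or the WLM:

```
# Show which cgroups a job process belongs to
PID=12345   # replace with the process ID of the WLM job step
cat /proc/$PID/cgroup
```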
cm-kubernetes-setup
New Features
Use Calico 3.27, Run:ai 2.15.2, and GPU operator v23.9.1 for new Kubernetes deployments using cm-kubernetes-setup
Improvements
Allow the option to choose Network Operator version 23.10.0
Allow the option to configure a custom Kubernetes Ingress certificate
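For the custom Ingress certificate option, a certificate and key pair can be prepared ahead of time, for example with openssl. A minimal sketch that generates a self-signed pair for testing; the hostname is a placeholder, and how the files are passed to cm-kubernetes-setup is not shown here:

```
# Generate a self-signed certificate/key pair for testing an Ingress endpoint
openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
  -keyout ingress.key -out ingress.crt -subj "/CN=k8s-ingress.example.local"
```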
cm-lite-daemon
Improvements
Added new metrics for the total traffic on network interfaces
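For reference, per-interface traffic counters of this kind are exposed by the Linux kernel and can be read directly. The sketch below only illustrates the underlying counters, not how cm-lite-daemon collects them; the interface name is a placeholder:

```
# Cumulative bytes received and transmitted on an interface since boot
cat /sys/class/net/eth0/statistics/rx_bytes
cat /sys/class/net/eth0/statistics/tx_bytes
```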
cm-wlm-setup
Fixed Issues
In some cases, an issue with installing Pyxis on multi-arch or multi-distro software images
Pyxis enroot is now configured to use its internal default for the cache directory, which was previously set to a directory under /run
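Enroot's cache directory is controlled by the ENROOT_CACHE_PATH setting. A minimal sketch for checking whether it is overridden on a node, assuming the standard enroot configuration file location:

```
# Show the configured enroot cache path, if set in the system-wide config
grep -H ENROOT_CACHE_PATH /etc/enroot/enroot.conf
```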
cmsh
New Features
Added a new cmsh “multiplexers” command in monitoring / setup, which shows which nodes will run a specified data producer for other entities
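The new command lives in cmsh's monitoring setup mode, and its built-in help can be viewed non-interactively. A minimal sketch; the exact arguments accepted by the command are not listed here:

```
# Show the built-in help text for the new multiplexers command
cmsh -c "monitoring setup; help multiplexers"
```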
pythoncm
Improvements
Added a new pythoncm example script, total-job-power-usage.py, for calculating the power usage of WLM jobs