NVIDIA DGX SuperPOD: Release Notes 10.23.11#

Introduction#

This document covers the NVIDIA Base Command™ Manager (BCM) 10.23.11 software release on NVIDIA DGX SuperPOD™ configurations. Except for Component Versions, the information herein is the same as in the NVIDIA Base Command Manager Release Notes.

Information about BCM and DGX SuperPOD is available at:

Important

The NVIDIA DGX SuperPOD: Release Notes 10.23.11 is also available as a PDF.

Component Versions#

DGX SuperPOD component versions for this release are in Table 9.

Table 9 Common component versions#
Component	Version
BCM ISO	10.23.11
DGX OS	6.1.0
Ubuntu	Ubuntu 22.04.1 LTS
Enroot	3.4.1-1
CUDA toolkit	12.2
DCGM	3.1.8
Cumulus OS	5.5.1
Mellanox InfiniBand Switch (DGX H100)	MLNX OS version: 3.11.1014; HCA Firmware: CX7 - 28.36.2024
Mellanox InfiniBand Switch (DGX A100)	MLNX OS version: 3.11.1014; HCA Firmware: CX7 - 28.36.2024
Slurm	23.02.6
Mellanox OFED Driver (A100 and H100)	23.10-0.5.5.0 (Slogin and DGX nodes)
DGX kernel	5.15.0-1040-nvidia
GPU Driver	535.129.03
Lustre Client	lustre-client-modules-5.19.0-45-generic
UFM	UFM 3.0 SDN version: 1.3.1
HPL	hpc-benchmarks:23.10
NCCL	tensorrt:23.10-py3

Change Requests#

General#

New Features#

Added support for SLES15 SP5

Improvements#

Changed NVIDIA Container Toolkit default values for accept-nvidia-visible-devices-as-volume-mounts (false -> true) and accept-nvidia-visible-devices-envvar-when-unprivileged (true -> false)
Updated cuda-driver package to 535.129.03
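
The changed NVIDIA Container Toolkit defaults can be checked against an existing configuration. The following is a minimal sketch, not part of BCM: it assumes the two settings appear as plain `key = true|false` lines in the toolkit's configuration file, and the helper names `parse_bool_settings` and `differing_from_new_defaults` are hypothetical. The parser is deliberately tiny and is not a full TOML parser.

```python
# New defaults as stated in this release note.
NEW_DEFAULTS = {
    "accept-nvidia-visible-devices-as-volume-mounts": True,
    "accept-nvidia-visible-devices-envvar-when-unprivileged": False,
}


def parse_bool_settings(text):
    """Collect `key = true|false` lines; comments and other lines are ignored."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing or full-line comments
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if value in ("true", "false"):
            settings[key] = value == "true"
    return settings


def differing_from_new_defaults(settings):
    """Return the settings that are present but differ from the new defaults."""
    return {
        key: settings[key]
        for key, default in NEW_DEFAULTS.items()
        if key in settings and settings[key] != default
    }


# Hypothetical configuration excerpt that still carries the old defaults.
sample = """
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
"""

print(differing_from_new_defaults(parse_bool_settings(sample)))
```

A non-empty result means the file explicitly pins a value that differs from the new default, which may be worth reviewing after the upgrade.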

CMDaemon#

New Features#

Added a cmsh command (wlm grid) to create a timelapse view of the jobs that have run
Added a special default gateway value (255.255.255.255) to use the one provided by dhcpd
Added cmsh command to show dhcpd leases
Added Border Gateway Protocol (BGP) overview for Cumulus switches
Added Link Layer Discovery Protocol (LLDP) overview for Cumulus switches
Added bootstrap.pem and signature checks in cm-check-certificates and switched from MD5 to SHA1
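
To illustrate the difference between the two digest algorithms named in the last item, the sketch below computes MD5 and SHA1 digests of the same bytes with Python's standard `hashlib`. It is only an illustration of the algorithms and does not reflect how cm-check-certificates is implemented. The `fingerprint` helper and the sample bytes are hypothetical.

```python
import hashlib


def fingerprint(data: bytes, algorithm: str = "sha1") -> str:
    """Return a colon-separated hex digest of `data` using the named algorithm."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


# Placeholder bytes standing in for certificate contents.
sample = b"example certificate bytes"

md5_fp = fingerprint(sample, "md5")    # 128-bit digest, 16 byte pairs
sha1_fp = fingerprint(sample, "sha1")  # 160-bit digest, 20 byte pairs

print("MD5 :", md5_fp)
print("SHA1:", sha1_fp)
```

The SHA1 digest is 160 bits long, compared with 128 bits for MD5, so the two fingerprints of the same input are not comparable with each other.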

Improvements#

Allow nodes to be automatically powered off or reset upon installer failure
Allow devices to be identified by serial number in DHCP
Relaxed SSL checks when registering a new Cumulus switch via ZTP
Improved CMDaemon startup speed in HA mode
Prevent multiple identical failover group statuses
Added a flag to allow changing a user home directory to an existing directory
Added a flag to pythoncm.cluster that lets entity.commit run without suffering from update race conditions
Write chrony.conf instead of ntp.conf in node-installer on RHEL9
Allow role exclude list entries for provisioning to be removed using exclude list snippets starting with ‘+’

Fixed Issues#

Fixed counting of nodes and accelerators towards the license limit
Fixed service status in cmsh of a lite-node
Fixed crash in ArchOSInfo::is_arch_os when cm-config-os-arch is not installed on the head node
Store services added to a lite-node in the database
Fixed cmsh imageupdate ^^pattern <path>

Workload Management#

New Features#

Automatically configure non-MIG GPUs in Slurm when detected
Updated slurm23.02 packages to version 23.02.6 (CVE-2023-41914)
Added new package pyxis-sources to allow building pyxis in air-gapped environments

Improvements#

Allow the management of jobs even if one of the nodes has an incorrect configuration in slurm.conf

Fixed Issues#

Fixed configuring AutoDetect in slurm.conf if GRES is set with addtogresconf=no in the slurm client role
Cleaned up database node entries of Slurm jobs that were requeued
Fixed pyxis epilog failure when unpacked images are shared and user does not specify a container name
Install enroot dependencies on Ubuntu 20.04

Container Engines#

Improvements#

Stopped using deprecated upstream Kubernetes repositories (versions 1.23 and older are no longer available)
Introduced support for RAPIDS Accelerator for Apache Spark in the Jupyter kernel templates

Monitoring#

New Features#

Collect new DCGM metrics: DCGM_FI_DEV_POWER_VIOLATION and DCGM_FI_DEV_THERMAL_VIOLATION
Added ManagedServicesOk health check to lite devices

Improvements#

Increased the variability and frequency of the ssh2node health check to reduce load on the head nodes
Optimized startup of compute nodes in clusters with a large number of nodes and many monitored jobs
Do not use linear interpolation for health check data, but rather the last known value
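
The last item above can be illustrated with a small comparison. This is a sketch under stated assumptions, not BCM's monitoring code: health check results are modelled as `(timestamp, value)` pairs with 0 for PASS and 1 for FAIL, and both function names are hypothetical. Linear interpolation between a PASS and a FAIL sample yields a fractional value that corresponds to no real state, whereas the last known value always returns a state that was actually observed.

```python
from bisect import bisect_right


def last_known_value(samples, t):
    """Return the value of the most recent sample at or before time t."""
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    if i == 0:
        return None  # no sample has been recorded yet at time t
    return samples[i - 1][1]


def linear_interpolation(samples, t):
    """Return the value at time t interpolated between neighbouring samples."""
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    if i == 0:
        return None
    if i == len(samples):
        return float(samples[-1][1])  # past the last sample: hold its value
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Assumed encoding: 0 = PASS, 1 = FAIL; samples sorted by timestamp (seconds).
samples = [(0, 0), (60, 1)]

print(last_known_value(samples, 30))      # 0   (still PASS until the next sample)
print(linear_interpolation(samples, 30))  # 0.5 (not a valid health check state)
```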

Fixed Issues#

Fixed a monitoring bug which prevented new device metrics from being saved to the database if CMDaemon on the head node was restarted right after they were created
Fixed job-metrics in the base-view monitoring tree