NVIDIA DGX SuperPOD: Release Notes 10.23.11

Introduction

This document covers the NVIDIA Base Command™ Manager (BCM) 10.23.11 software release on NVIDIA DGX SuperPOD™ configurations. Except for Component Versions, the information herein is the same as in the NVIDIA Base Command Manager Release Notes.

Information about BCM and DGX SuperPOD is available at:

Important

The NVIDIA DGX SuperPOD: Release Notes 10.23.11 is also available as a PDF.

Component Versions

DGX SuperPOD component versions for this release are in Table 1.

Table 1. Common component versions

| Component | Version |
|---|---|
| BCM ISO | 10.23.11 |
| DGX OS | 6.1.0 |
| Ubuntu | Ubuntu 22.04.1 LTS |
| Enroot | 3.4.1-1 |
| CUDA toolkit | 12.2 |
| DCGM | 3.1.8 |
| Cumulus OS | 5.5.1 |
| Mellanox InfiniBand Switch (DGX H100) | MLNX OS version: 3.11.1014; HCA Firmware: CX7 - 28.36.2024 |
| Mellanox InfiniBand Switch (DGX A100) | MLNX OS version: 3.11.1014; HCA Firmware: CX7 - 28.36.2024 |
| Slurm | 23.02.6 |
| Mellanox OFED Driver (A100 and H100) | 23.10-0.5.5.0 (Slogin and DGX nodes) |
| DGX kernel | 5.15.0-1040-nvidia |
| GPU Driver | 535.129.03 |
| Lustre Client | lustre-client-modules-5.19.0-45-generic |
| UFM | UFM 3.0 SDN version: 1.3.1 |
| HPL | hpc-benchmarks:23.10 |
| NCCL | tensorrt:23.10-py3 |
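When validating a node against Table 1, it can help to compare the versions a node reports with the versions listed for this release. The following is a minimal sketch, not part of BCM: the expected values are copied from Table 1, while the function name `version_mismatches` and the sample `reported` dictionary are hypothetical. How the reported versions are collected from a node is left out.

```python
# Expected versions copied from Table 1 of these release notes.
EXPECTED = {
    "DGX OS": "6.1.0",
    "CUDA toolkit": "12.2",
    "DCGM": "3.1.8",
    "Slurm": "23.02.6",
    "DGX kernel": "5.15.0-1040-nvidia",
    "GPU Driver": "535.129.03",
}


def version_mismatches(reported, expected=EXPECTED):
    """Return {component: (reported, expected)} for each missing or different version.

    `reported` is a hypothetical dict of component name -> version string,
    gathered by whatever inventory method the site already uses.
    """
    return {
        name: (reported.get(name), want)
        for name, want in expected.items()
        if reported.get(name) != want
    }


if __name__ == "__main__":
    # Hypothetical node report: older GPU driver, Slurm version not collected.
    reported = {
        "DGX OS": "6.1.0",
        "CUDA toolkit": "12.2",
        "DCGM": "3.1.8",
        "DGX kernel": "5.15.0-1040-nvidia",
        "GPU Driver": "535.104.05",
    }
    for name, (have, want) in version_mismatches(reported).items():
        print(f"{name}: reported {have}, Table 1 lists {want}")
```

A component that is absent from the report shows up with a reported value of `None`, so missing data is not mistaken for a match.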

Change Requests

General

New Features

  • Added support for SLES15 SP5

Improvements

  • Changed NVIDIA Container Toolkit default values for accept-nvidia-visible-devices-as-volume-mounts (false -> true) and accept-nvidia-visible-devices-envvar-when-unprivileged (true -> false)

  • Updated cuda-driver package to 535.129.03
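Because the NVIDIA Container Toolkit defaults changed, existing images may carry explicit settings that differ from the new defaults. The sketch below is one way to spot that, assuming the two settings appear as top-level boolean keys in a TOML-style configuration file (commonly `/etc/nvidia-container-runtime/config.toml`, which is an assumption about the deployment and not stated in these notes). The helper names and the sample file content are hypothetical, and the line scanner handles only simple `key = true|false` lines; a full TOML parser would be more robust.

```python
# New default values, as stated in the release note above.
NEW_DEFAULTS = {
    "accept-nvidia-visible-devices-as-volume-mounts": True,
    "accept-nvidia-visible-devices-envvar-when-unprivileged": False,
}


def read_top_level_bools(toml_text):
    """Return top-level `key = true|false` pairs from TOML-style text."""
    found = {}
    for raw in toml_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if line.startswith("["):  # the first table header ends the top level
            break
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if value in ("true", "false"):
            found[key] = value == "true"
    return found


def differing_settings(toml_text):
    """Map each setting that differs from NEW_DEFAULTS to (found, expected).

    A found value of None means the key is not set explicitly in the text.
    """
    found = read_top_level_bools(toml_text)
    return {
        key: (found.get(key), expected)
        for key, expected in NEW_DEFAULTS.items()
        if found.get(key) != expected
    }


# Hypothetical file content that still carries the previous defaults.
OLD_STYLE_CONFIG = """
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true

[nvidia-container-cli]
ldconfig = "@/sbin/ldconfig.real"
"""

if __name__ == "__main__":
    for key, (found, expected) in differing_settings(OLD_STYLE_CONFIG).items():
        print(f"{key}: found {found}, new default {expected}")
```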

CMDaemon

New Features

  • Added a cmsh command (wlm grid) to create a timelapse view of the jobs that have run

  • Added a special default gateway value (255.255.255.255) to use the one provided by dhcpd

  • Added a cmsh command to show dhcpd leases

  • Added Border Gateway Protocol (BGP) overview for Cumulus switches

  • Added Link Layer Discovery Protocol (LLDP) overview for Cumulus switches

  • Added bootstrap.pem and signature checks in cm-check-certificates and switched from MD5 to SHA1

Improvements

  • Allow nodes to be automatically powered off or reset upon installer failure

  • Allow devices to be identified by serial in DHCP

  • Relaxed SSL checks when registering a new Cumulus switch via ZTP

  • Improved CMDaemon startup speed in HA mode

  • Prevent multiple identical failover group statuses

  • Added a flag to allow changing a user home directory to an existing directory

  • Added a pythoncm.cluster flag that lets entity.commit run without suffering from update race conditions

  • Write chrony.conf instead of ntp.conf in node-installer on RHEL9

  • Allow role exclude list entries for provisioning to be removed using exclude list snippets starting with ‘+’

Fixed Issues

  • Fixed counting of nodes and accelerators towards the license limit

  • Fixed the service status shown in cmsh for a lite-node

  • Fixed crash in ArchOSInfo::is_arch_os when cm-config-os-arch is not installed on the head node

  • Store services added to a lite-node in the database

  • Fixed cmsh imageupdate ^^pattern <path>

Workload Management

New Features

  • Automatically configure non-MIG GPUs in Slurm when detected

  • Updated slurm23.02 packages to version 23.02.6 (CVE-2023-41914)

  • Added new package pyxis-sources to allow building pyxis in air-gapped environments

Improvements

  • Allow the management of jobs even if one of the nodes has an incorrect configuration in slurm.conf

Fixed Issues

  • Fixed configuring AutoDetect in slurm.conf if GRES is set with addtogresconf=no in the slurm client role

  • Cleaned up database node entries of Slurm jobs that were requeued

  • Fixed pyxis epilog failure when unpacked images are shared and user does not specify a container name

  • Install enroot dependencies on Ubuntu 20.04

Container Engines

Improvements

  • Stopped using deprecated upstream Kubernetes repositories (versions 1.23 and older are no longer available)

  • Introduced support for RAPIDS Accelerator for Apache Spark in the Jupyter kernel templates

Monitoring

New Features

  • Collect new DCGM metrics: DCGM_FI_DEV_POWER_VIOLATION and DCGM_FI_DEV_THERMAL_VIOLATION

  • Added ManagedServicesOk health check to lite devices

Improvements

  • Increased the variability and frequency of the ssh2node health check to reduce load on the head nodes

  • Optimized startup of compute nodes in clusters with a large number of nodes and many monitored jobs

  • Use the last known value for health check data instead of linear interpolation
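The difference between linear interpolation and the last known value matters most for discrete data such as health check results. The sketch below illustrates the two lookup strategies on a hypothetical series in which 1 stands for PASS and 0 for FAIL. The encoding, the function names, and the sample data are assumptions made for illustration and do not reflect how CMDaemon stores or queries monitoring data.

```python
from bisect import bisect_right


def last_known_value(samples, t):
    """Return the value of the latest sample at or before time t, or None.

    `samples` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    return samples[i - 1][1] if i else None


def linear_interpolation(samples, t):
    """Return the linearly interpolated value at time t, or None before the first sample."""
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    if i == 0:
        return None
    if i == len(samples):
        return samples[-1][1]  # no later sample to interpolate towards
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Hypothetical health check samples: (seconds, result) with 1 = PASS, 0 = FAIL.
SAMPLES = [(0, 1), (60, 1), (120, 0), (180, 1)]

if __name__ == "__main__":
    t = 90  # halfway between a PASS at 60 s and a FAIL at 120 s
    print("last known value:    ", last_known_value(SAMPLES, t))
    print("linear interpolation:", linear_interpolation(SAMPLES, t))
```

At 90 seconds the interpolated result lies halfway between PASS and FAIL, which corresponds to no real check outcome, while the last known value reports the PASS that was actually measured at 60 seconds.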

Fixed Issues

  • Fixed a monitoring bug which prevented new device metrics from being saved to the database if CMDaemon on the head node was restarted right after they were created

  • Fixed job-metrics in the base-view monitoring tree