NVIDIA DGX SuperPOD: Release Notes 10.23.11

Introduction

This document covers the NVIDIA Base Command™ Manager (BCM) 10.23.11 software release on NVIDIA DGX SuperPOD™ configurations. Except for Component Versions, the information herein is the same as in the NVIDIA Base Command Manager Release Notes.

Information about BCM and DGX SuperPOD is available at:

Important

The NVIDIA DGX SuperPOD: Release Notes 10.23.11 is also available as a PDF.

Component Versions

DGX SuperPOD component versions for this release are in Table 1.

Table 1. Common component versions

Component                               Version
--------------------------------------  ---------------------------------------
BCM ISO                                 10.23.11
DGX OS                                  6.1.0
Ubuntu                                  Ubuntu 22.04.1 LTS
Enroot                                  3.4.1-1
CUDA toolkit                            12.2
DCGM                                    3.1.8
Cumulus OS                              5.5.1
Mellanox InfiniBand Switch (DGX H100)   MLNX OS version: 3.11.1014
                                        HCA Firmware: CX7 - 28.36.2024
Mellanox InfiniBand Switch (DGX A100)   MLNX OS version: 3.11.1014
                                        HCA Firmware: CX7 - 28.36.2024
Slurm                                   23.02.6
Mellanox OFED Driver (A100 and H100)    23.10-0.5.5.0 (Slogin and DGX nodes)
DGX kernel                              5.15.0-1040-nvidia
GPU Driver                              535.129.03
Lustre Client                           lustre-client-modules-5.19.0-45-generic
UFM                                     UFM 3.0 SDN version: 1.3.1
HPL                                     hpc-benchmarks:23.10
NCCL                                    tensorrt:23.10-py3

Change Requests

General

New Features

  • Added support for SLES15 SP5

Improvements

  • Changed NVIDIA Container Toolkit default values for accept-nvidia-visible-devices-as-volume-mounts (false -> true) and accept-nvidia-visible-devices-envvar-when-unprivileged (true -> false)

  • Updated cuda-driver package to 535.129.03
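
Because the two NVIDIA Container Toolkit defaults listed above were flipped, it can be useful to check what a given node is configured with. The following is a minimal sketch, assuming the settings appear as top-level `key = true|false` lines in the toolkit's TOML configuration file. The sample content, the helper names, and the simplified line parser are illustrative assumptions and not part of BCM or the toolkit; the parser handles only simple boolean lines and is not a full TOML parser.

```python
import re

# Keys and new default values as stated in the release note above.
NEW_DEFAULTS = {
    "accept-nvidia-visible-devices-as-volume-mounts": True,
    "accept-nvidia-visible-devices-envvar-when-unprivileged": False,
}

# Hypothetical sample content; the file on a real node may differ.
SAMPLE_CONFIG = """\
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false
# other settings omitted
"""

# Matches simple lines of the form: key = true|false  (optional trailing comment)
_BOOL_LINE = re.compile(r"^\s*([A-Za-z0-9_-]+)\s*=\s*(true|false)\s*(?:#.*)?$")


def read_bool_settings(text):
    """Return {key: bool} for every simple boolean line in the text."""
    settings = {}
    for line in text.splitlines():
        match = _BOOL_LINE.match(line)
        if match:
            settings[match.group(1)] = match.group(2) == "true"
    return settings


def diff_from_new_defaults(text):
    """Return {key: found_value} for keys that differ from NEW_DEFAULTS.

    A key that is absent from the text is reported with the value None.
    """
    found = read_bool_settings(text)
    return {
        key: found.get(key)
        for key, expected in NEW_DEFAULTS.items()
        if found.get(key) != expected
    }


if __name__ == "__main__":
    print(diff_from_new_defaults(SAMPLE_CONFIG))
```

To check a real node, the content of its configuration file would be passed in place of `SAMPLE_CONFIG`; an empty result means both settings match the new defaults.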

CMDaemon

New Features

  • Added a cmsh command (wlm grid) to create a timelapse view of the jobs that have run

  • Added a special default gateway value (255.255.255.255) to use the one provided by dhcpd

  • Added cmsh command to show dhcpd leases

  • Added Border Gateway Protocol (BGP) overview for Cumulus switches

  • Added Link Layer Discovery Protocol (LLDP) overview for Cumulus switches

  • Added bootstrap.pem and signature checks in cm-check-certificates and switched from MD5 to SHA1
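
The special default gateway value in the list above works as a sentinel: when the configured gateway is 255.255.255.255, the gateway supplied by dhcpd is used instead. The selection logic can be sketched as follows. This is an illustration under stated assumptions, not CMDaemon's implementation; the function and parameter names are hypothetical.

```python
import ipaddress

# Sentinel value from the release note: "use the gateway provided by dhcpd".
USE_DHCP_GATEWAY = ipaddress.IPv4Address("255.255.255.255")


def effective_gateway(configured, dhcp_provided=None):
    """Return the gateway to use (hypothetical helper).

    configured    -- gateway set in the configuration, as a dotted-quad string
    dhcp_provided -- gateway offered by the DHCP server, or None if there is none
    """
    configured_addr = ipaddress.IPv4Address(configured)
    if configured_addr == USE_DHCP_GATEWAY:
        if dhcp_provided is None:
            raise ValueError("sentinel gateway set, but DHCP provided no gateway")
        return ipaddress.IPv4Address(dhcp_provided)
    # Any other value is treated as an explicit gateway address.
    return configured_addr


if __name__ == "__main__":
    print(effective_gateway("10.141.255.254", "10.141.0.1"))
    print(effective_gateway("255.255.255.255", "10.141.0.1"))
```

The addresses in the usage lines are made-up examples. A limited-broadcast address can never be a valid gateway, which is what makes it safe to use as a sentinel.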

Improvements

  • Allow nodes to be automatically powered off or reset upon installer failure

  • Allow devices to be identified by serial in DHCP

  • Relaxed SSL checks when registering a new Cumulus switch via ZTP

  • Improved CMDaemon startup speed in HA mode

  • Prevent multiple identical failover group statuses

  • Added a flag to allow changing a user home directory to an existing directory

  • Added a flag to pythoncm.cluster that lets entity.commit run without suffering from update race conditions

  • Write chrony.conf instead of ntp.conf in node-installer on RHEL9

  • Allow role exclude list entries for provisioning to be removed using exclude list snippets starting with ‘+’

Fixed Issues

  • Fixed counting of nodes and accelerators towards the license limit

  • Fixed the service status shown in cmsh for a lite-node

  • Fixed crash in ArchOSInfo::is_arch_os when cm-config-os-arch is not installed on the head node

  • Store services added to a lite-node in the database

  • Fixed cmsh imageupdate ^^pattern <path>

Workload Management

New Features

  • Automatically configure non-MIG GPUs in Slurm when detected

  • Updated slurm23.02 packages to version 23.02.6 (CVE-2023-41914)

  • Added new package pyxis-sources to allow building pyxis in air-gapped environments

Improvements

  • Allow the management of jobs even if one of the nodes has an incorrect configuration in slurm.conf

Fixed Issues

  • Fixed configuring AutoDetect in slurm.conf if GRES is set with addtogresconf=no in the slurm client role

  • Cleaned up database node entries of Slurm jobs that were requeued

  • Fixed pyxis epilog failure when unpacked images are shared and user does not specify a container name

  • Install enroot dependencies on Ubuntu 20.04

Container Engines

Improvements

  • Stopped using deprecated upstream Kubernetes repositories (versions 1.23 and older are no longer available)

  • Introduced support for RAPIDS Accelerator for Apache Spark in the Jupyter kernel templates

Monitoring

New Features

  • Collect new DCGM metrics: DCGM_FI_DEV_POWER_VIOLATION and DCGM_FI_DEV_THERMAL_VIOLATION

  • Added ManagedServicesOk health check to lite devices

Improvements

  • Increased the variability and frequency of the ssh2node health check to reduce load on the head nodes

  • Optimized startup of compute nodes in clusters with a large number of nodes and many monitored jobs

  • Do not use linear interpolation for health check data, but rather the last known value
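
The last improvement above matters because health check results are discrete states, so a value interpolated between two samples describes a state that was never observed. The sketch below contrasts the two lookup strategies on a small series. It is a generic illustration, not BCM's monitoring code; the function names and the 0 = PASS, 1 = FAIL encoding are assumptions made for the example.

```python
from bisect import bisect_right


def last_known_value(samples, t):
    """Return the value of the latest sample taken at or before time t.

    samples -- list of (timestamp, value) pairs with strictly increasing timestamps
    Returns None if t is earlier than the first sample.
    """
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    return samples[i - 1][1] if i else None


def linear_interpolation(samples, t):
    """Return the linearly interpolated value at time t (for comparison only)."""
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    if i == 0:
        return None  # before the first sample
    if i == len(samples):
        return samples[-1][1]  # after the last sample: nothing to interpolate toward
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


if __name__ == "__main__":
    # Assumed encoding: 0 = PASS at t=0 s, 1 = FAIL at t=60 s.
    checks = [(0, 0), (60, 1)]
    print(last_known_value(checks, 30))      # 0, the state last observed
    print(linear_interpolation(checks, 30))  # 0.5, a state that never occurred
```

With the last known value, a query at t=30 reports PASS until the FAIL sample at t=60 is recorded, whereas interpolation reports a meaningless halfway value.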

Fixed Issues

  • Fixed a monitoring bug which prevented new device metrics from being saved to the database if CMDaemon on the head node was restarted right after they were created

  • Fixed job-metrics in the base-view monitoring tree