NVIDIA DGX SuperPOD: Release Notes 10.24.03

Introduction

This document covers the NVIDIA Base Command™ Manager (BCM) 10.24.03 software release on NVIDIA DGX SuperPOD™ configurations. Except for Component Versions, the information herein is the same as in the NVIDIA Base Command Manager Release Notes.

Information about BCM and DGX SuperPOD is available at:

Important

The NVIDIA DGX SuperPOD: Release Notes 10.24.03 is also available as a PDF.

Component Versions

DGX SuperPOD component versions for this release are in Table 1.

Table 1. Common component versions

Component                                Version
---------------------------------------  -------------------------------------------------------
BCM ISO                                  10.24.03
DGX OS                                   6.2.0
Ubuntu                                   Ubuntu 22.04.3 LTS
Enroot                                   3.4.1
CUDA toolkit                             12.2
DCGM                                     3.3.5
Cumulus OS                               5.5.1
Mellanox InfiniBand Switch (DGX H100)    MLNX OS version: 3.11.2016
                                         HCA Firmware: CX7 - 28.39.2048
Mellanox InfiniBand Switch (DGX A100)    MLNX OS version: 3.11.2016
                                         HCA Firmware: CX7 - 20.39.1002
Slurm                                    23.02.7
Mellanox OFED Driver (A100 and H100)     23.10-1.1.9.0 LTS (Slogin and DGX nodes)
DGX kernel                               5.15.0-1046-nvidia
GPU Driver                               535.161.07
Lustre Client                            lustre-client-modules-5.19.0-45-generic 2.14.0-ddn125
UFM                                      UFM Enterprise SW: 1.7.0
HPL                                      hpc-benchmarks:23.10
NCCL                                     tensorrt:24.02-py3
DGX FW                                   1.1.3

Change Requests

General

New Features

  • Added support for RHEL8u9 and ROCKY8u9

  • Added support for RHEL9u3 and ROCKY9u3

  • The DGX OS software images included on the head node installer ISOs are now based on DGX OS 6.2.0 (Release 1)

  • Enable NVSM metrics for DGX systems

  • Added a new BCM package for NVIDIA Nsight systems (cm-nsight-systems)

Improvements

  • Include the GSP firmware with the cuda-driver-* packages

  • Updated DCGM to 3.3.5

  • Updated cuda-driver to 550.54.15

  • Updated cuda-driver-535 to 535.161.07

  • Updated cuda-driver-legacy-470 to 470.239.06

  • Updated munge to v0.5.16

  • Updated cm-nvhpc to 24.1

  • Updated cm-openssl to 3.1.5

  • Removed the dependency of the cm-nvidia-container-toolkit package on the cuda-driver package, which otherwise can cause some package managers to remove the toolkit package when the CUDA driver is replaced with a different version

Fixed Issues

  • Increase the stack and nofile limits in cm-config-limits for the root user on Ubuntu 22.04 to prevent possible issues with head nodes hanging under heavy load
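To see which stack and nofile limits a process actually runs with, the values can be read back from the kernel. The following is a minimal sketch using Python's standard `resource` module; the helper names `describe` and `current_limits` are illustrative and not part of BCM, and the script reports the limits of whichever user runs it (run it as root on the head node to check the limits referred to above). It does not show the values that cm-config-limits configures, which are not stated in these notes.

```python
import resource


def describe(limit):
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def current_limits():
    """Return {name: (soft, hard)} for the stack and nofile limits of this process."""
    names = {
        "stack": resource.RLIMIT_STACK,    # reported in bytes
        "nofile": resource.RLIMIT_NOFILE,  # maximum number of open file descriptors
    }
    return {name: resource.getrlimit(res) for name, res in names.items()}


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={describe(soft)} hard={describe(hard)}")
```

Note that `getrlimit` reports the stack limit in bytes, whereas the shell's `ulimit -s` reports it in KiB.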

CMDaemon

New Features

  • An issue where CMDaemon may store a Slurm job array’s main job information in the CMDaemon monitoring DB, creating unnecessary entries in the DB because these jobs expand into individual tasks as they are scheduled

  • Allow the option to open a remote-request-assistance session from within cmsh

  • CMDaemon user profiles can now include new tokens, such as SET_USER_PROFILE_TOKEN, allowing a cmsh or Base View client connected using a certificate with this profile to set or update the profile setting for users

  • CMDaemon will now generate a combined kubeconfig file in .kube/config in the home directory of a user containing all clusters the user has access to, allowing the user to connect to the Kubernetes cluster without first loading the environment module

  • Allow the option to use negative matching such as “!resource!=category-name” in the monitoring comparison expressions
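The first item in this list distinguishes between a Slurm job array's main job record and the individual tasks it expands into. As an illustration only, the sketch below classifies job ID strings in the forms that Slurm prints in `squeue` output; the function name `classify_job_id` is hypothetical, and the code does not reflect how CMDaemon itself identifies or stores these jobs.

```python
import re

# Job ID forms as printed by squeue:
#   "1234"        a regular job
#   "1234_7"      task 7 of job array 1234
#   "1234_[0-9]"  the array record whose tasks are still pending
#                 (the bracket may also carry a "%N" concurrency limit)
_PLAIN = re.compile(r"^\d+$")
_ARRAY_TASK = re.compile(r"^(\d+)_(\d+)$")
_ARRAY_MAIN = re.compile(r"^(\d+)_\[(.+)\]$")


def classify_job_id(job_id):
    """Return (kind, base_job_id); kind is 'job', 'array_task' or 'array_main'."""
    if _PLAIN.match(job_id):
        return ("job", int(job_id))
    match = _ARRAY_TASK.match(job_id)
    if match:
        return ("array_task", int(match.group(1)))
    match = _ARRAY_MAIN.match(job_id)
    if match:
        return ("array_main", int(match.group(1)))
    raise ValueError(f"unrecognized Slurm job ID: {job_id!r}")


if __name__ == "__main__":
    for job_id in ("1234", "1234_7", "1234_[0-9%2]"):
        print(job_id, classify_job_id(job_id))
```

Only the `array_task` and `job` kinds correspond to work that actually runs; the `array_main` form is a placeholder for tasks that have not yet been scheduled.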

Improvements

  • Improved Prometheus exporter cache management to ensure the memory usage does not grow over time

  • CMDaemon will now set the InfiniBand (IB) interface GUID as extra values for the IB interfaces, which then can be shown in cmsh / Base View

  • Exclude the link MAC entry from the Cumulus switch overview information

  • Allow the option to set the MAC address of non-node devices to the MAC address identified on the respective switch port for the device

  • Users with a profile that allows them to add other users are now restricted from also setting arbitrary group IDs for the users they create. The behavior can be tuned with advanced configuration options

  • Increase the Slurm job queue QOS table size in the CMDaemon DB

  • Redirect all base-view and userportal HTTPS calls from the passive to the active head node

  • CMDaemon will now send an event when network interfaces are added or removed

Fixed Issues

  • An issue where setting the extra_values nvlink property to false is not picked up by CMDaemon, causing CMDaemon to continue sampling nvlink metrics

  • An issue that can result in duplicate entries in the /exporter Prometheus endpoint

  • An issue where a network switch ZTP.sh can point to an incorrect IP address for the head node

  • An issue where a disabled backup role is still being used for monitoring backup, preventing it from being removed from the target list

  • An issue where a small buffer size for the user’s groups can prevent CMDaemon from correctly listing all of the user’s groups

  • Ensure the head nodes can provision each other regardless of the ProvisioningRole selections

  • In some cases, an issue with folding the compute node hostnames when generating the slurm.conf configuration file

  • An issue with the Prometheus sampler when used with an exporter that only supports HTTP GET

  • An issue where committing a monitoring action without setting a script can cause CMDaemon on the head node to crash

  • An issue where Slurm job management operations in cmsh and Base View are unable to handle Slurm job array IDs

  • An issue where metrics from the AggregateNode producer do not have correct data, or do not report “no data” values, when there are no nodes in a rack

  • In some cases, an issue that can leave a passive head node CMDaemon process using 100% CPU

  • In some cases, an issue that may prevent CMDaemon from loading old job information from the CMDaemon DB

  • An issue with sending a test email when using cmsh

  • An issue with the passive head node forwarding labeled entity information to the active, preventing it from being used in PromQL queries

  • An issue where terminated cloud nodes that are subsequently powered on may still be listed as ‘terminated’ in cmsh

  • An issue that prevents two different configuration overlays with the same priority and different generic roles from being committed in CMDaemon
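One of the fixes in this list concerns duplicate entries in the /exporter Prometheus endpoint. A quick way to check any exporter's output for this class of problem is to count how often each series appears in the text exposition format. The sketch below is a simplified check, not the CMDaemon implementation: the function name `duplicate_series` is hypothetical, and it treats a series as the literal metric name plus label string, so two lines with the same labels in a different order are not recognized as duplicates.

```python
from collections import Counter


def duplicate_series(exposition_text):
    """Return {series: count} for series that appear more than once.

    A series is the metric name plus its label set, taken verbatim
    from each sample line of Prometheus text exposition output.
    """
    seen = Counter()
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # With labels: the series ends at the last closing brace.
            series = line[: line.rfind("}") + 1]
        else:
            # Without labels: the series is the first whitespace-separated field.
            series = line.split()[0]
        seen[series] += 1
    return {series: count for series, count in seen.items() if count > 1}


if __name__ == "__main__":
    sample = (
        "# TYPE node_up gauge\n"
        'node_up{hostname="node001"} 1\n'
        'node_up{hostname="node001"} 1\n'
        'node_up{hostname="node002"} 1\n'
    )
    print(duplicate_series(sample))
```

To use it against a live endpoint, pass in the response body fetched from the exporter URL of the head node; the sample hostnames above are placeholders.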

Node Installer

Fixed Issues

  • An issue where setting the frozenFilesPerNode directive may not cause the node-installer to freeze /etc/sysconfig/network on RHEL

Head Node Installer

Improvements

  • Updated the default partition sizes for the standard RAID1 and RAID5 head node disk layouts to match the sizes of the non-raid standard layout

Fixed Issues

  • An issue with head node installations with Lmod where the DefaultModules.lua module file is not created by default, resulting in messages about empty LMOD_SYSTEM_DEFAULT_MODULES environment variable

cm-kubernetes-setup

Fixed Issues

  • Improved error reporting when kubeadm init step fails

cm-scale

New Features

  • A lack of vCPUs in AWS is now handled in the same way as a lack of capacity

cm-wlm-setup

New Features

  • Allow the option to set the enroot temporary directory using cm-wlm-setup

Fixed Issues

  • Ensure cm-wlm-setup can install AGE 2023.1.1 (8.8.1)

cmsh

Improvements

  • Added a new cmsh WLM jobs mode command, pidsgpus, to list the PIDs and GPUs used by a WLM job

Fixed Issues

  • An issue in cmsh user mode with case-sensitive comparison of profile names

pyxis-sources

New Features

  • Updated pyxis sources package to 0.17.0

slurm23.11

Improvements

  • Updated slurm23.11 to 23.11.4