NVIDIA DGX SuperPOD: Release Notes 10.24.05

Introduction

This document covers the NVIDIA Base Command™ Manager (BCM) 10.24.05 software release on NVIDIA DGX SuperPOD™ configurations. Except for Component Versions, the information herein is the same as in the NVIDIA Base Command Manager Release Notes.

Information about BCM and DGX SuperPOD is available at:

Important

The NVIDIA DGX SuperPOD: Release Notes 10.24.05 is also available as a PDF.

Component Versions

DGX SuperPOD component versions for this release are in Table 1.

Table 1. Common component versions

Component                               Version
BCM ISO                                 10.24.05
DGX OS                                  6.2.1
Ubuntu                                  Ubuntu 22.04.2 LTS
Enroot                                  3.4.1
CUDA toolkit                            12.2
DCGM                                    3.3.5
Cumulus OS                              5.5.1
Mellanox InfiniBand Switch (DGX H100)   MLNX OS version: 3.11.2016
                                        HCA Firmware: CX7 - 28.39.2048
Mellanox InfiniBand Switch (DGX A100)   MLNX OS version: 3.11.2016
                                        HCA Firmware: CX6 - 20.39.2048
Slurm                                   23.02.7
Mellanox OFED Driver (A100 and H100)    MLNX_OFED_LINUX-23.10-2.1.3.1 LTS
DGX kernel                              5.15.0-1053-nvidia
GPU Driver                              535.161.08
Lustre Client                           lustre-client-modules-5.19.0-45-generic 2.14.0-ddn125
UFM                                     UFM Enterprise SW: 1.7.0
HPL                                     hpc-benchmarks:24.03
NCCL                                    tensorrt:24.02-py3
DGX FW                                  1.1.3

Change Requests

General

New Features

  • Added mlnx-ofed24.01 package

  • Added CUDA 12.4 toolkit packages

  • Added PBS Professional 2024 packages

  • Replaced cmdaemon-apidocs with cm-api-docs; the documentation can be accessed via the landing page

  • Added the cm-nsight-systems-cli package, which contains the latest CLI version of the nsight-systems package and replaces the current cm-nsight-systems package

  • bcm-post-install: Allow the slogin node base name to be configurable

  • bcm-post-install: Added a command for SuperPOD validation

Improvements

  • Updated mlnx-ofed23.10 to 23.10-2.1.3.1

  • For new Ubuntu 22.04 head node installations, use fixed port numbers for the NFS lockd, statd, and mountd daemons

Fixed Issues

  • An issue that prevents cm-diagnose from completing when single quotes are used in cmd.conf for dbuser/dbpass

  • An issue with the runtime and PID path settings in the nvidia-persistenced service unit file from the cuda-driver packages

  • An issue with the SuperPOD enroot configuration

CMDaemon

New Features

  • Mark the devices monitored by MQTT as UP when recent monitoring data exists

  • Allow the option to run Slurm accounting database daemon in high-availability mode

Improvements

  • Added a per-node Slurm state metric

  • Allow the option to automatically select a random free port for the IMEX service

  • Allow the option to define an exclude list in the network interfaces healthcheck to skip specified interfaces

  • Added /var/lib/rancher/.* to the default exclude list for the ProcMounts monitoring producer, which otherwise can create unnecessary monitoring metrics

  • Improved performance of the cm-mqtt service

  • Allow the option to automatically set the MAC addresses of the bond and its members in the CMDaemon node interface entities when the node boots

  • Include the perm-mac-address under a bond interface when verifying the license, which resolves an issue with verifying the license when a bond network interface is created after the license is requested

  • Added lxc- interfaces to the default exclude list for the ProcNetDev monitoring producer

  • Allow the option to disable MQTT with a flag in the configuration file
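
The default exclude list entry for the ProcMounts producer mentioned above is a pattern, `/var/lib/rancher/.*`. As an illustration only, the following sketch shows how such a pattern could filter mount points before metrics are created. The function and variable names are hypothetical, and matching the pattern against the whole mount point path is an assumption; the release notes do not describe how CMDaemon applies the pattern.

```python
import re

# Pattern quoted from the release note; the variable name is hypothetical.
DEFAULT_EXCLUDES = [r"/var/lib/rancher/.*"]


def filter_mount_points(mount_points, exclude_patterns=DEFAULT_EXCLUDES):
    """Return the mount points that match none of the exclude patterns."""
    compiled = [re.compile(pattern) for pattern in exclude_patterns]
    # Assumption: a pattern must match the complete mount point path.
    return [
        mount
        for mount in mount_points
        if not any(regex.fullmatch(mount) for regex in compiled)
    ]


mounts = [
    "/",
    "/home",
    "/var/lib/rancher/k3s/agent/containerd/tmpmounts/x",
    "/cm/shared",
]
kept = filter_mount_points(mounts)
print(kept)
```

Under the whole-path assumption, only paths below `/var/lib/rancher/` are dropped; the directory `/var/lib/rancher` itself would still be reported.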

Fixed Issues

  • An issue with validating the LDAP group during commit of a user

  • An issue with clearing the memory when a large number of entities that have failing health checks are added and then removed

  • An issue where DNS allow-query configuration entries are not added on edge directors for Kubernetes networks, preventing queries from these networks

  • An issue where monitoring consolidators may not be created for all entity-measurable pairs

  • An issue where a temporary resolv.conf bind mount created in the software images is being added to the CMDaemon monitoring database

  • Added a retry mechanism around gethostbyname when writing the IMEX configuration files, which can otherwise throw an exception

  • An issue with chargeback calculations using per-node requested CPU/GPU information

  • An issue in the drain action manager code that in some cases can lead to high CMDaemon memory usage

  • An issue with determining the number of requested CPUs for multi-node Slurm jobs when storing the jobs information in CMDaemon

  • An issue which can lead to high CMDaemon memory usage if the post-provisioning monitoring-resume operation has failed

  • An issue with hard-coded references to /sbin/arping which in some cases can prevent CMDaemon from using arping in the event of a failover

  • An issue in the RPC status code which in some cases can result in an infinite recursion on the passive head node in the event of a failover

  • An issue with the node profile missing the UPDATE_CONFIG_FILES_AFTER_IMAGE_UPDATE_TOKEN which prevents the /cm/conf files from being copied from the software image to the provisioned nodes when using non-head-node provisioners

  • An issue with the reporting of the GPU chargeback information

  • An issue where a category can be removed while another category’s provisioning role still has a reference to it

  • An issue where Azure cloud compute nodes are cloned with an incorrect power status when the original node’s power status is ON

  • An issue where failed AWS node power-on actions produce the error message “Unable to parse output” instead of the AWS error message

  • An issue with the cmsh dropunused command that can result in removing too many measurables

  • An issue with the cmsh device syncinfo command when specifying an fspart path

Base View

Fixed Issues

  • An issue with displaying the SNMP system information data for switches

Cluster Tools

Fixed Issues

  • An issue where cm-mysql-sanitize.py, which is required by cm-diagnose, is not part of the cluster-tools package

COD

New Features

  • Added support for creating HA COD clusters in Azure

  • Allow the option to skip shared storage setup with the cm-cloud-ha-setup tool

  • Added support for OCI defined tags. This renames the original --head-node-tags command line option to --head-node-freeform-tags and adds a new command line option, --head-node-defined-tags

Improvements

  • Allow the option to select the Azure availability zone on the command line of the cluster create command

Machine Learning

New Features

  • Introduced ML NCCL and cuDNN packages for CUDA 12.4

Fixed Issues

  • An issue where WLM kernels may be unexpectedly restarted if one of the kernels fails to start

cm-clone-install

Fixed Issues

  • An issue with handling the configuration of bond interfaces and bond members on the Ubuntu base distribution

cm-cluster-extension

Fixed Issues

  • An issue where ‘germany’ is incorrectly listed as an Azure region

cm-create-image

Fixed Issues

  • An issue where missing modular metadata for the ‘default’ package group on the RHEL8 and RHEL9 ISOs can prevent the creation of software images

cm-kubernetes-setup

Improvements

  • Ensure the /var/lib/etcd directory has the correct permissions (0700) so that an etcd member can join the etcd cluster
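
To check by hand that a directory has the 0700 mode mentioned above, something like the following sketch could be used. The helper name is hypothetical, and the example runs against a temporary directory rather than the real /var/lib/etcd, so it can be run without root privileges.

```python
import os
import stat
import tempfile


def has_mode(path, expected=0o700):
    """Return True if the permission bits of path equal expected."""
    return stat.S_IMODE(os.stat(path).st_mode) == expected


# Demonstrate on a temporary directory instead of /var/lib/etcd.
with tempfile.TemporaryDirectory() as tmp:
    os.chmod(tmp, 0o755)
    too_open = has_mode(tmp)   # group and others still have access
    os.chmod(tmp, 0o700)
    owner_only = has_mode(tmp)  # only the owner has access

print(too_open, owner_only)
```

On a node, the same check would be `has_mode("/var/lib/etcd")`, run as a user that is allowed to stat the directory.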

Fixed Issues

  • A regression in cm-kubernetes-setup that allows the user to select nodes in a way that results in the overlap of compute nodes between different Kubernetes clusters

  • An issue where cm-kubernetes-setup --pull is unable to complete if the pod is evicted due to disk pressure while the images are being pulled

cm-scale

New Features

  • Allow the option to reboot the nodes in FULL install mode when the cm-scale engine is changed

Fixed Issues

  • An issue where the shutdown state from files may be used incorrectly

cm-wlm-setup

Fixed Issues

  • Setting up pyxis will no longer configure it to clean the data directory from epilog since enroot can perform this automatically

cmsh

Improvements

  • Allow the option to specify the IP increment with the cmsh addinterface command

  • Include the job run time data in the cmsh WLM jobs info command

  • Allow the option to specify the network CIDR on the cmsh “add network” command line

Fixed Issues

  • An issue where the cmsh monitoring trigger info command does not show grouped expressions

  • An issue with importing older formats of the .cmshhistory file, which can result in duplicating all entries in the cmsh command history

jupyter

New Features

  • Allow the option to use sqsh files to run Jupyter kernels based on enroot

  • Restrict the access to Jupyter based on group memberships

Improvements

  • Allow the option to install and configure VNC when setting up Jupyter

pythoncm

Fixed Issues

  • An issue in the pythoncm cluster.py implementation where an incorrect logger variable name is being used

pyxis-sources

Improvements

  • Updated pyxis-sources to 0.19.0