NVIDIA DGX SuperPOD: Release Notes 10.24.11#

Introduction#

This document covers the NVIDIA Base Command™ Manager (BCM) 10.24.11 software release on NVIDIA DGX SuperPOD™ configurations. Except for Component Versions, the information herein is the same as in the NVIDIA Base Command Manager Release Notes.

Information about BCM and DGX SuperPOD is available at:

Component Versions#

DGX SuperPOD component versions for this release are in Table 1.

Table 1. Common component versions#

Component                                    Version
-------------------------------------------  -----------------------------------------------------------
BCM ISO                                      10.24.11
DGX OS                                       6.3.1
Ubuntu                                       22.04.4 LTS
Enroot                                       3.5.0
CUDA toolkit                                 12.4
DCGM                                         3.3.8
Cumulus OS                                   5.10.0
Mellanox InfiniBand Switch (DGX A100/H100)   MLNX OS version: 3.11.2300; HCA Firmware: CX7 - 28.39.3560
Slurm                                        23.02.8
Mellanox OFED Driver (A100 and H100)         MLNX_OFED_LINUX-23.10.3.2.0 LTS
DGX kernel                                   5.15.0-1063-nvidia
GPU Driver                                   550.90.07
Lustre Client                                ddn145
UFM                                          UFM Enterprise SW: 1.7.0
HPL                                          hpc-benchmarks:24.09
NCCL                                         tensorrt:24.10-py3
DGX FW                                       24.09.17

General#

New Features#

  • Added cuda-driver-550 and cuda-fabric-manager-550 packages

  • Added mlnx-ofed24.07 package

  • Added support for SUSE Linux Enterprise Server (SLES) 15 SP6

Improvements#

  • Updated Nsight Systems to 2024.6.1

  • Updated cm-openssl to 3.1.7

  • Updated cuda-driver to 565.57.01

  • Updated cuda-driver-535 to 535.216.01

  • Updated cuda12.6-toolkit to 12.6 Update 2

  • Updated freeipmi to 1.6.14

CMDaemon#

Improvements#

  • Reduced time required for all compute nodes to reconnect when the head node CMDaemon is restarted

  • Use 64-bit OID versions for in and out octets in the SNMP switch monitoring sampler

  • Added a REST endpoint to get the network topology

  • Allow the option to configure /home exports per category for individual users/tenants

  • Improved REST rack API call result to include the device type information

  • Allow the option to configure additional DNS forward zones for networks defined in CMDaemon

  • Added support for Equal Cost Multi-Path Route (ECMP) to IP Routing in layer3 setups

  • Added the /[a-f0-9]{12}_[hc]/ regex to the default list of options for the IgnoreInotifyInterface advanced configuration option

  • Added subgroups support in the WLM check-alloc implementation for allowing or denying user logins to compute nodes

  • Allow the option to deploy cm-lite-daemon with a cm-deploy-lite-daemon.sh deployment script without using ZTP on Cumulus switches

  • Improved CMDaemon commit validation for bootable networks when there is an overlap in the network ranges/CIDR

  • Allow the option to disable the JSON login for a list of users defined in the advanced configuration option DisableLoginServiceUsers

  • Added Raritan PDU monitoring sampler script

  • Allow the option to use FQDN for the compute nodes with the global configuration option ShortHostname=0

  • Kubernetes module files will now be created on all nodes with kubelet or firewall roles

  • Use the Cumulus nv commands for setting up the username and password on Cumulus 5.9 and newer

  • Sample the DCGM_FI_DEV_NVSWITCH_FATAL_ERRORS metric for GPU NVSwitches

  • CMDaemon will now manage the mst service on all nodes with a DPU

  • Added a ZTP stage directory /cm/local/apps/cmd/etc/htdocs/dpu/ztp for scripts executed on DPUs after BFB push

  • Added new endpoint to REST API for power management

  • Added new REST API endpoint for node categories

  • Added Forge IB interface support

  • Added CMDaemon health check for expiring Kubernetes certificates
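
To illustrate the regular expression added to the IgnoreInotifyInterface defaults, the short sketch below tests a few interface names against it. The interface names are hypothetical, and matching the whole name (rather than a substring) is an assumption made for this example; how CMDaemon applies the pattern internally is not described in these notes.

```python
import re

# Pattern quoted in the release note above.
IGNORE_PATTERN = re.compile(r"[a-f0-9]{12}_[hc]")


def is_ignored(name: str) -> bool:
    """Return True if the whole interface name matches the pattern.

    Whole-name matching is an assumption made for this illustration.
    """
    return IGNORE_PATTERN.fullmatch(name) is not None


# Hypothetical interface names, for illustration only.
candidates = [
    "0123456789ab_h",  # 12 lowercase hex characters followed by _h
    "deadbeef0001_c",  # 12 lowercase hex characters followed by _c
    "0123456789ab_x",  # suffix is neither _h nor _c
    "ABCDEF012345_h",  # uppercase hex is not covered by the character class
    "eth0",            # ordinary interface name
]

for name in candidates:
    print(f"{name}: {'ignored' if is_ignored(name) else 'not ignored'}")
```

Only the first two names match: the pattern requires exactly twelve lowercase hexadecimal characters, an underscore, and a final h or c.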

Fixed Issues#

  • An issue where Azure cloud node instance creation failures may not be correctly reported by cmsh

  • An issue with the head node HA shared interfaces not being brought up automatically after they are manually brought down

  • An issue with the duplex regex in the interfaces health check

  • An issue with the Slurm takeover script

  • An issue where the unit of the PDUUptime and SwitchUptime metrics is not correctly shown in cmsh/Base View

  • An issue where automatic file system exports may not be removed when a network configuration is updated

  • An issue with missing newlines in /var/spool/cmd/events.log

  • An issue where cmd -x can produce an XML configuration file with duplicate values for the “revision” or “extra values” properties

  • An issue where the TotalGPUTemperature metric is reported as 0

  • An issue with configuring the Slurm accounting service on edge setups

  • An issue where the sgeexecd service on the compute nodes may be restarted when CMDaemon is restarted

  • An issue with applying the search domain index in the partition and category settings when generating the resolver configuration

  • An issue with applying the search domain index in the network settings when generating the resolver configuration

  • An issue where level 3 switches are not being added to the Slurm topology.conf file when the tree Slurm plugin is configured

  • An issue with the Prometheus monitoring sampler data collection when a username and a password are configured

  • An issue with the Slurm power management scripts raising an AttributeError exception

  • An issue where empty configuration options in the Kubelet role may not result in Kubernetes manifest files updates

  • An issue where duplicate provisioning requests may be queued when a cloud director is not UP

  • An issue where the megaraid health checks may not be able to report a failure because the FAIL message is written to an incorrect file descriptor

  • An issue where the node-installer is unable to copy symbolic links from the /cm/conf/ directories

  • A timing issue with image update of compute nodes running systemd-managed automount filesystems, where CMDaemon may not detect the automounted filesystem mount point and may not add it to the exclude list

  • An issue where CMDaemon may not be able to update the kernel hash in the MySQL database when a software image initrd is updated

  • An issue with the Redfish monitoring sampler printing informational messages to an incorrect file descriptor

  • An issue where, in some cases, ramdisk creation may be started while CMDaemon is stopping

  • An issue where Azure compute nodes may be left dangling after powering on more nodes than the allowed quota

  • An issue where the WLM slots values may be expressed in bytes in cmsh/Base View

  • An issue where switch control scripts may not correctly separate the stdout and stderr output

  • An issue where the restart required flag is set for DPUs when they are not running CMDaemon

  • An issue where the dhcpd service configuration file may include compute nodes BMC interfaces which are not in use

  • An issue where, for nodes running Ubuntu, the bond options of a bonded interface are not added correctly to the networking configuration file

  • An issue with the system interrupts metrics not being expressed in number of interrupts per second

  • An issue where CPU- metrics are displayed in Jiffies instead of Jiffies/s

  • An issue where ProcSNMP metrics such as IpInDelivers are not configured as cumulative

  • An issue where the SlurmState metrics may not include hostnames that include hyphens

  • An issue where a ramdisk creation task does not transition to a failed state when trying to create a ramdisk for a locked image
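
Several of the fixes above concern cumulative counters (interrupts, CPU jiffies, ProcSNMP values such as IpInDelivers) that should be presented as per-second rates. As a minimal sketch of that conversion, the hypothetical helper below derives a rate from two samples of a counter. It is not CMDaemon code, and the sample values are invented for illustration.

```python
from typing import Optional


def rate_per_second(prev_value: float, prev_time: float,
                    curr_value: float, curr_time: float) -> Optional[float]:
    """Convert two samples of a cumulative counter into a per-second rate.

    Returns None when the counter went backwards (for example after a
    reboot), because no meaningful rate can be derived from that pair.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be taken at strictly increasing times")
    delta = curr_value - prev_value
    if delta < 0:
        return None
    return delta / elapsed


# Hypothetical samples: 1500 jiffies at t=10 s and 2100 jiffies at t=40 s.
print(rate_per_second(1500, 10.0, 2100, 40.0))  # 20.0 jiffies/s
```

A raw counter value only ever grows, so plotting it directly says little about current load; the difference between consecutive samples divided by the elapsed time is what a unit such as Jiffies/s expresses.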

Node Installer#

Fixed Issues#

  • An issue where the /etc/machine-id values of the compute nodes are not unique

  • An issue where, for nodes running Ubuntu, the bond options of a bonded interface are not added correctly to the networking configuration file

COD#

Fixed Issues#

  • COD OpenStack: Make cluster start wait for renamed nodes

Machine Learning#

New Features#

  • Added NCCL 2.23.4 for CUDA12.6

cm-bios-tools#

Improvements#

  • An issue with random Redfish disconnect errors

  • An issue with performing flash operations of H100 GPU tray firmware in parallel

cm-cluster-extension#

Fixed Issues#

  • A validation issue in the advanced settings dialog which can result in validation error messages such as “Create tunnel networks is not integer”

cm-kubernetes-setup#

Improvements#

  • Allow the option to configure Kubernetes Ingress HTTPS on port 443 on the head node with SSL passthrough

  • Allow the option to set up Kubernetes version 1.31 with cm-kubernetes-setup. Kubernetes versions 1.27 and older are no longer available options for performing new Kubernetes setups

  • The use of kube-rbac-proxy is now deprecated in the Jupyter operator and in the permissions manager in favor of using the internal kubebuilder mechanism

  • Added support for the NIM operator in cm-kubernetes-setup

  • Updated Kubernetes OVN CNI to 1.1.13

  • Updated local path provisioner to version 0.0.29

  • Improved retry mechanism when Kubernetes certificate signing requests time out

Fixed Issues#

  • An issue with using the cm-kubernetes-setup --pull command line option on Ubuntu 24.04

  • An issue with handling older versions of the Kubernetes permission manager where not all API endpoints exist

cm-lite-daemon#

Improvements#

  • Added MemoryUtilization metric for devices running cm-lite-daemon

  • Added ARP table information to the switch overview

  • Added reported network speed metric

cm-scale#

Improvements#

  • Allow the option to reboot compute nodes when the software image changes instead of performing a power off and on cycle

cmsh#

Improvements#

  • An issue with calculating the IPs when cloning devices with cmsh when using “layer3” network setup

  • Allow the option to override the default timeout for monitoring scripts when running samplenow with the --max-run-time option

  • An issue in cmsh with displaying the monitoring data when using monitoringdump --uncompress

  • cmsh will now also show the Azure availability zone when the zone has been auto-selected by Azure

Fixed Issues#

  • An issue where, when using --next-ip to clone a device with multiple network interfaces on the same network, the resulting IPs of the cloned device may be identical

  • An issue with using regular expressions with the foreach command in the interfaces submode for devices

  • An issue where the network's IP is incorrectly updated when an interface is configured with startif = active and the device is cloned with cmsh

  • An issue where the cmsh monitoringbackuprings command does not take into account that a backup role may be disabled when showing the information

  • An issue with alignment of the power results table when some hostnames are too long

  • A timing issue in cmsh where a device power operation may not be executed if it is initiated shortly (within ~2s) after the device is committed

pythoncm#

Improvements#

  • Added send_warning_event pythoncm cluster method

slurm#

Improvements#

  • Updated Slurm 24.05.4 Sharp Plugin to 1.0.1

topograph#

Improvements#

  • The cluster-topology-generator is now renamed to topograph