NVIDIA DGX SuperPOD: Release Notes 10.24.09#

Introduction#

This document covers the NVIDIA Base Command™ Manager (BCM) 10.24.09 software release on NVIDIA DGX SuperPOD™ configurations. Except for Component Versions, the information herein is the same as in the NVIDIA Base Command Manager Release Notes.

Information about BCM and DGX SuperPOD is available at:

Component Versions#

DGX SuperPOD component versions for this release are in Table 1.

Table 1. Common component versions#

| Component | Version |
|---|---|
| BCM ISO | 10.24.09 |
| DGX OS | 6.3.0 |
| Ubuntu | 22.04.4 LTS |
| Enroot | 3.5.0 |
| CUDA toolkit | 12.2 |
| DCGM | 3.3.8 |
| Cumulus OS | 5.10.0 |
| Mellanox InfiniBand Switch (DGX A100/H100) | MLNX OS version: 3.11.2300; HCA Firmware: CX7 - 28.39.3560 |
| Slurm | 23.02.8 |
| Mellanox OFED Driver (A100 and H100) | MLNX_OFED_LINUX-23.10.3.2.0 LTS |
| DGX kernel | 5.15.0-1063-nvidia |
| GPU Driver | 550.90.07 |
| Lustre Client | ddn145 |
| UFM | UFM Enterprise SW: 1.7.0 |
| HPL | hpc-benchmarks:24.06 |
| NCCL | tensorrt:24.08-py3 |
| DGX FW | 24.08.1 |
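As an illustration of how the versions in Table 1 might be used during validation, the following minimal sketch compares a set of observed versions against a few of the table entries. The function name `find_mismatches` and the shape of the `observed` dictionary are assumptions made for this example; they are not part of BCM or DGX OS, and collecting the observed values from a node is left to the reader.

```python
# Expected versions copied from Table 1. The dictionary keys are labels
# chosen for this sketch, not identifiers used by BCM.
EXPECTED = {
    "DGX OS": "6.3.0",
    "DGX kernel": "5.15.0-1063-nvidia",
    "GPU Driver": "550.90.07",
    "Slurm": "23.02.8",
    "DCGM": "3.3.8",
}


def find_mismatches(observed, expected=EXPECTED):
    """Return {component: (expected, observed)} for every component that differs.

    A component missing from `observed` is reported with an observed value of None.
    """
    mismatches = {}
    for component, want in expected.items():
        got = observed.get(component)
        if got != want:
            mismatches[component] = (want, got)
    return mismatches


# Hypothetical observed values: everything matches except the GPU driver.
observed = dict(EXPECTED, **{"GPU Driver": "535.161.08"})
for component, (want, got) in find_mismatches(observed).items():
    print(f"{component}: expected {want}, found {got}")
```

An exact string comparison is used deliberately, because several entries (for example the kernel version) are not plain dotted numbers.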

General#

New Features#

  • Added the compose and buildx CLI plugins to cm-docker

  • Added CuDNN 8.9 for CUDA 12.4

  • Added the cm-ngc-cli package for RHEL9 and Ubuntu22/24

  • Added support for Ubuntu 24.04

  • Added CuDNN 9.3

  • Added cm-iperf.py wrapper around iperf3 to make testing multiple nodes easier

  • Added mlnx-ofed24.04 packages

  • Added CUDA 12.5 packages

Improvements#

  • Updated cm-nvidia-container-toolkit to v1.16.2 (CVE-2024-0132 and CVE-2024-0133)

  • Updated cm-docker to v26.1.5

  • Allow DPU nodes to be defined without an OOB interface

  • Updated NCCL2 for CUDA 12.5

  • Added pmix3/pmix4 plugin to Slurm 23.02, 23.11 and 24.05

  • Updated mlnx-ofed58 to 5.8-5.1.1.2

  • Updated mlnx-ofed23.10 to 23.10-3.2.2.0

  • Upgraded cm-nsight-systems to 2024.5.1

  • Updated cm-iperf to 3.17.1

  • Public IP address resources in cluster extension Azure are now created with the “Standard” SKU

Fixed Issues#

  • An issue with Slurm jobs chargeback when a job requests MIGs

  • An issue with cmdaemonctl on the compute nodes using cm-cmd-ports, which is only installed on the head nodes

  • An issue that allowed a user with the readonly profile to restart services

  • An issue with cm-nfs-checker not writing its logs

CMDaemon#

New Features#

  • Allow the prometheus service and exporter to be authenticated with a bearer token

  • Added flag to enable creation of /tftpboot/pxelinux.cfg/<mac> symlinks

  • Added option to log events to a hook script

  • Allow the configuration of the DNS resolver on Ubuntu in stub or uplink mode via the extra_values flag resolv=stub or resolv=uplink in the nodes or categories

  • Added /rest/v1/status/wait REST endpoint

  • Fetch job history from the workload managers even if cmdaemon was down when the job was running

  • Added support for specifying SSH-authorized keys for users via notes
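To illustrate what bearer-token authentication looks like from a client's point of view, the sketch below builds an HTTP request that carries the standard `Authorization: Bearer <token>` header. The host name, port, and token are hypothetical, and the `/exporter` path is used only as an example target; how the token is configured in CMDaemon is not covered in these notes.

```python
import urllib.request


def build_scrape_request(url, token):
    """Build (but do not send) a GET request that carries a bearer token."""
    request = urllib.request.Request(url, method="GET")
    # Standard HTTP bearer scheme: "Authorization: Bearer <token>".
    request.add_header("Authorization", f"Bearer {token}")
    return request


# Hypothetical host, port, and token, for illustration only.
req = build_scrape_request(
    "https://headnode.example.com:8081/exporter", "example-token"
)
print(req.get_method(), req.full_url)
print(req.get_header("Authorization"))
```

The request is only constructed here, not sent. Passing it to `urllib.request.urlopen` would perform the actual scrape against a reachable endpoint.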

Improvements#

  • Optimized WLM job operations by executing them in batches

  • Reduced load on the head node caused by many nodes changing states quickly

  • Reduced load caused by Slurm config writer calling MIG status too often

  • Changed audit log to write in the machine local timezone instead of GMT

  • Use home_mode instead of umask from /etc/login.defs when setting home directory permissions for a new user

  • Display new certificate request event with info, not notice, when autosign takes care of it

  • Sped up verification of profile/token access to RPC

  • Allow DataTranslator::delay to throw away data that fails to resolve for 60s, to prevent it from staying in the cache forever when metrics are deleted in the meantime

  • Added advanced config option DisableLdap=1 to disable all LDAP integration

  • Added advanced config option SoftwareImageDisableZFS=1 to disable ZFS

  • Added support for BMC event logs using Redfish

  • Optimized Trigger::Actuator evaluation of triggers for all data

  • Sped up the entity, measurable, and parameter comparison in triggers

  • Reduced number of log messages when /cm/shared is not mounted

  • Increased the BFB push timeout from 15m to 30m

  • Allow the option to manage separated DPUs without a MAC

  • Configure an SSH ProxyJump via the DPU host when a DPU is running in embedded mode

  • Improved BF3 support in cm-dpu-manage

  • Added “Disable PXE” extra_value flag to prevent CMDaemon from writing the dhcpd.conf

  • Added rshim to the services managed by CMDaemon

  • Enabled the “Compute RDMA GPU Monitoring” agent plugin for OCI

  • Allow Ubuntu-based distributions to configure a layer 3 network setup with a /31 connection between the node and the switch

  • Spread out the schedulers healthcheck to reduce the flood of parallel WLM calls

  • Optimized memory usage on large numbers of jobs when job tracing is disabled
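The home_mode change listed above can be illustrated with a short worked example. In `/etc/login.defs`, `UMASK` is a mask that is subtracted from full permissions, whereas `HOME_MODE` states the directory mode directly. The sketch below is not CMDaemon code; the function names are hypothetical, and the fallback to `UMASK` when `HOME_MODE` is absent follows the behavior documented for shadow-utils and is an assumption here.

```python
def parse_login_defs(text):
    """Parse 'KEY VALUE' lines from login.defs-style text, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


def home_directory_mode(settings, default_umask=0o022):
    """Return the permission bits a new home directory would receive."""
    if "HOME_MODE" in settings:
        # HOME_MODE is an explicit octal mode and is used as-is.
        return int(settings["HOME_MODE"], 8)
    # Assumed fallback: derive the mode by masking 0777 with UMASK.
    umask = int(settings["UMASK"], 8) if "UMASK" in settings else default_umask
    return 0o777 & ~umask


sample = """
# Example login.defs fragment (illustrative values)
UMASK      022
HOME_MODE  0700
"""
print(oct(home_directory_mode(parse_login_defs(sample))))
print(oct(home_directory_mode({"UMASK": "022"})))
```

With both settings present, the explicit `HOME_MODE` gives a private directory, while the same `UMASK` on its own would give a world-readable one.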

Fixed Issues#

  • An issue preventing diskless nodes from running rsync with --xattrs and --acls options

  • An issue with CMD_SERVER_IP pointing to the node IP in /tftpboot/

  • An issue where incomplete HTTP headers can result in cmdaemon threads reading SSL and consuming memory

  • An issue with cmdaemon prometheus scraper not accepting metrics without labels

  • An issue with adding AllocNodes to the Slurm partition parameters

  • An issue with the prometheus /exporter endpoint not returning data grouped by metric

  • An issue with Trigger::Actuator being too slow to process all incoming samples

  • An issue with shared_mutex lock that can cause a crash or increased memory consumption if two threads change the same data

  • An issue in gzip inflate that can cause an RPC to loop while consuming CPU resources

  • An issue with service cmd stop taking too long on compute nodes when the active head node is down

  • An issue in cm-lite-daemon that prevented the calculation of the derivative for cumulative metrics in samplenow

  • An issue preventing ZTP from getting a new certificate when cluster.pem has been updated

  • An issue with an unequal split of the DHCPD range when it is shared between two head nodes

  • An issue with removing stopped service information from the database

  • An issue when adding AccountingStorageTRES to slurm.conf

  • An issue with cmdaemonctl logconf

  • An issue with slurm_states sampler returning exit code 1 when presented with a node state that it doesn’t handle

  • An issue with the reporting of power status for non-instantiated nodes in Azure

  • An issue with Slurm configuration update when bcm slurmautodetect is set
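Two of the fixes above concern Prometheus metrics: samples without labels, and output grouped by metric. For readers unfamiliar with the text exposition format, the following simplified sketch parses sample lines with and without a label set and groups them by metric name. It is not the CMDaemon scraper; the function names are hypothetical, and the parser ignores timestamps and does not handle a `}` character inside a label value.

```python
import re

# metric_name, optional {label="value",...} block, then the sample value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (name, labels, value) for one exposition-format sample line."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    # A sample without a label block simply has an empty label set.
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def group_by_metric(lines):
    """Group samples by metric name, skipping blank and comment lines."""
    grouped = {}
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        name, labels, value = parse_sample(line)
        grouped.setdefault(name, []).append((labels, value))
    return grouped


lines = [
    "# TYPE node_load1 gauge",
    "node_load1 0.42",
    'http_requests_total{method="get",code="200"} 1027',
    'http_requests_total{method="post",code="200"} 3',
]
for name, samples in group_by_metric(lines).items():
    print(name, samples)
```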

Node Installer#

Fixed Issues#

  • An issue with node-installer not always setting the selected node MAC when rebooting

cmsh#

New Features#

  • Added cmsh redfishsubscriptions command

  • Added mlxconfig reset to the DPU commands

Improvements#

  • Improved help for the GPU profiling command

  • Improved tab completion to support metrics with a space in their name

Fixed Issues#

  • An issue when setting the value of the switchport property from devices to a single value after the property had multiple values

  • An issue entering selinux mode

  • An issue with cmsh sometimes hanging when running with -c

  • An issue with event acknowledge

  • An issue when automatically setting IP in the clone command

pythoncm#

New Features#

  • Added pythoncm.Network.devices

  • Added RPC timeouts to pythoncm

  • Added an example of overloading the event handler in pythoncm

Base View#

Fixed Issues#

  • An issue with run command not showing the full output

Base View NG#

Fixed Issues#

  • An issue with package updates not showing in Ubuntu

Cluster Tools#

Improvements#

  • Ensure cm-container-registry-setup writes the registry certificates to software images without assigned nodes

  • Update shorewall routes with Kube Service Network for air-gapped Kubernetes deployments

  • Improved NetQ version vs. Kubernetes version compatibility checks in cm-kubernetes-setup

  • Improved checks for the prerequisite SSH configuration required for NetQ in cm-kubernetes-setup

  • Do not allow cm-chroot-sw-img to run on the passive head node or provisioning nodes unless forced

Fixed Issues#

  • An issue with the retry of NetQ installation where a bootstrap reset purge-db was not done before retrying

  • An issue with an infinite loop in request-remote-assistance with nohup

  • An issue in request-license performing an incorrect hostname comparison and then unnecessarily trying to ssh to itself

  • An issue with cmha-setup that prevented clusters with BCME licenses from setting up HA

  • An issue when modifying the secondary head node entity during cm-cloud-ha-setup

COD#

New Features#

  • Added support for setting up accelerated networking for head nodes when creating clusters with cm-cod-azure

Improvements#

  • Set the default region for cm-cod-oci to us-sanjose-1

  • Allow enabling/disabling creation of the NAT gateway and shared public IP in cm-cloud-ha-setup via the GUI

  • Store the cluster password hashed (instead of plain text) on the head node

  • Improved robustness of cluster deletion in cm-cod-oci

Fixed Issues#

  • An issue when recreating clusters created with managed identity

pyxis-sources#

Improvements#

  • Updated pyxis to 0.20.0

slurm#

New Features#

  • Added integration with the Topology Generation Service

  • Set the ENROOT_MOUNT_HOME configuration option to “no” by default for new setups

slurm23.02#

Improvements#

  • Updated Slurm 23.02 to 23.02.8

slurm23.11#

Improvements#

  • Updated Slurm 23.11 to 23.11.10

slurm24.05#

New Features#

  • Added extra Slurm packages with support for NVIDIA SHARP

Improvements#

  • Updated Slurm 24.05 to 24.05.3