NVIDIA DGX SuperPOD: Release Notes 10.24.09#
Introduction#
This document covers the NVIDIA Base Command™ Manager (BCM) 10.24.09 software release on NVIDIA DGX SuperPOD™ configurations. Except for Component Versions, the information herein is the same as in the NVIDIA Base Command Manager Release Notes.
Information about BCM and DGX SuperPOD is available at:
Component Versions#
DGX SuperPOD component versions for this release are in Table 1.
| Component | Version |
|---|---|
| BCM ISO | 10.24.09 |
| DGX OS | 6.3.0 |
| Ubuntu | 22.04.4 LTS |
| Enroot | 3.5.0 |
| CUDA toolkit | 12.2 |
| DCGM | 3.3.8 |
| Cumulus OS | 5.10.0 |
| Mellanox InfiniBand Switch (DGX A100/H100) | MLNX OS version: 3.11.2300; HCA Firmware: CX7 - 28.39.3560 |
| Slurm | 23.02.8 |
| Mellanox OFED Driver (A100 and H100) | MLNX_OFED_LINUX-23.10.3.2.0 LTS |
| DGX kernel | 5.15.0-1063-nvidia |
| GPU Driver | 550.90.07 |
| Lustre Client | ddn145 |
| UFM | UFM Enterprise SW: 1.7.0 |
| HPL | hpc-benchmarks:24.06 |
| NCCL | tensorrt:24.08-py3 |
| DGX FW | 24.08.1 |
General#
New Features#
Added the compose and buildx CLI plugins to cm-docker
Added cuDNN 8.9 for CUDA 12.4
Added the cm-ngc-cli package for RHEL9 and Ubuntu22/24
Added support for Ubuntu 24.04
Added cuDNN 9.3
Added cm-iperf.py wrapper around iperf3 to make testing multiple nodes easier
Added mlnx-ofed24.04 packages
Added CUDA 12.5 packages
Improvements#
Updated cm-nvidia-container-toolkit to v1.16.2 (CVE-2024-0132 and CVE-2024-0133)
Updated cm-docker to v26.1.5
Allow DPU nodes to be defined without an OOB interface
Updated NCCL2 for CUDA 12.5
Added pmix3/pmix4 plugin to Slurm 23.02, 23.11 and 24.05
Updated mlnx-ofed58 to 5.8-5.1.1.2
Updated mlnx-ofed23.10 to 23.10-3.2.2.0
Upgraded cm-nsight-systems to 2024.5.1
Updated cm-iperf to 3.17.1
Public IP address resources in Azure cluster extension are now created with the “Standard” SKU
Fixed Issues#
An issue with Slurm jobs chargeback when a job requests MIGs
An issue with cmdaemonctl on the compute nodes using cm-cmd-ports, which is only installed on the head nodes
An issue that allowed a user with the readonly profile to restart services
An issue with cm-nfs-checker not writing its logs
CMDaemon#
New Features#
Allow the prometheus service and exporter to be authenticated with a bearer token
Added flag to enable creation of /tftpboot/pxelinux.cfg/<mac> symlinks
Added option to log events to a hook script
Allow the configuration of the DNS resolver on Ubuntu in stub or uplink mode via the extra_values flag resolv=stub or resolv=uplink in the nodes or categories
Added /rest/v1/status/wait REST endpoint
Fetch job history from the workload managers even if cmdaemon was down when the job was running
Added support for specifying SSH-authorized keys for users via notes
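As a rough illustration of the bearer-token authentication mentioned above, the following sketch builds an HTTP request that carries a token in the standard `Authorization: Bearer` header. The URL, port, and token value are placeholders chosen for illustration only. They are assumptions, not values documented in these release notes, and the function name `build_scrape_request` is hypothetical.

```python
from urllib.request import Request


def build_scrape_request(url, token):
    """Build a GET request that presents a bearer token to an exporter.

    Both arguments are supplied by the caller; nothing here is specific
    to CMDaemon beyond the use of the standard Authorization header.
    """
    headers = {
        "Authorization": f"Bearer {token}",  # standard bearer-token scheme
        "Accept": "text/plain",              # Prometheus text exposition format
    }
    return Request(url, headers=headers)


# Placeholder endpoint and token (assumptions for illustration only).
request = build_scrape_request("https://head-node.example:8081/exporter", "EXAMPLE_TOKEN")
print(request.get_header("Authorization"))
```

The request object can then be passed to `urllib.request.urlopen(request, timeout=10)` to perform the actual scrape against a reachable endpoint.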
Improvements#
Optimized WLM job operations by executing them in batches
Reduced load on the head node caused by many nodes changing states quickly
Reduced load caused by Slurm config writer calling MIG status too often
Changed the audit log to write in the machine's local time zone instead of GMT
Use home_mode instead of umask from /etc/login.defs when setting home directory permissions for a new user
Display new certificate request event with info, not notice, when autosign takes care of it
Sped up verification of profile/token access to RPC
Allow DataTranslator::delay to throw away data that fails to resolve for 60s, to prevent it from staying in the cache forever when metrics are deleted in the meantime
Added advanced config option DisableLdap=1 to disable all LDAP integration
Added advanced config option SoftwareImageDisableZFS=1 to disable ZFS
Added support for BMC event logs using Redfish
Optimized Trigger::Actuator evaluation of triggers for all data
Sped up the entity, measurable, and parameter comparison in triggers
Reduced number of log messages when /cm/shared is not mounted
Increased the BFB push timeout from 15m to 30m
Allow the option to manage separated DPUs without a MAC
Configure an SSH ProxyJump via the DPU host when a DPU is running in embedded mode
Improved BF3 support in cm-dpu-manage
Added “Disable PXE” extra_value flag to prevent CMDaemon from writing the dhcpd.conf
Added rshim to the services managed by CMDaemon
Enabled the “Compute RDMA GPU Monitoring” agent plugin for OCI
Allow Ubuntu based distributions to configure a layer3 network setup with a /31 connection between the node and the switch
Spread out the schedulers healthcheck to reduce the flood of parallel WLM calls
Optimized memory usage on large numbers of jobs when job tracing is disabled
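The home_mode change in the list above can be pictured with a small sketch. It reads `HOME_MODE` from the contents of a login.defs file and, as an assumption modelled on how `useradd` behaves, falls back to a mode derived from `UMASK` when `HOME_MODE` is absent. The function name `home_dir_mode` is hypothetical, and this is not CMDaemon's actual implementation.

```python
def home_dir_mode(login_defs_text):
    """Return the permission bits to use for a new home directory.

    Prefers HOME_MODE. If it is not set, derives the mode from UMASK
    (assumed fallback, defaulting to a umask of 022).
    """
    settings = {}
    for raw in login_defs_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        if not line:
            continue
        parts = line.split(None, 1)          # "KEY   value"
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()

    if "HOME_MODE" in settings:
        return int(settings["HOME_MODE"], 8)  # value is written in octal

    umask = int(settings.get("UMASK", "022"), 8)
    return 0o777 & ~umask


example = "UMASK 022\nHOME_MODE 0750\n"
print(oct(home_dir_mode(example)))
```

With both settings present, as in the example text, the explicit `HOME_MODE` value wins over the mode that `UMASK` alone would produce.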
Fixed Issues#
An issue preventing diskless nodes from running rsync with --xattrs and --acls options
An issue with CMD_SERVER_IP pointing to the node IP in /tftpboot/
An issue where incomplete HTTP headers can result in cmdaemon threads reading ssl and consuming memory
An issue with cmdaemon prometheus scraper not accepting metrics without labels
An issue with adding AllocNodes to the Slurm partition parameters
An issue with the prometheus /exporter endpoint not returning data grouped by metric
An issue with Trigger::Actuator being too slow to process all incoming samples
An issue with a shared_mutex lock that can cause a crash or increasing memory consumption if two threads change the same data
An issue in gzip inflate that can cause an RPC to loop while consuming CPU resources
An issue with service cmd stop taking too long on compute nodes when the active head node is down
An issue in cm-lite-daemon that prevented the calculation of the derivative for cumulative metrics in samplenow
An issue preventing ZTP from getting a new certificate when cluster.pem has been updated
An issue with an unequal split of the DHCPD range when it is shared between two head nodes
An issue with removing stopped service information from the database
An issue when adding AccountingStorageTRES to slurm.conf
An issue with cmdaemonctl logconf
An issue with slurm_states sampler returning exit code 1 when presented with a node state that it doesn’t handle
An issue with the reporting of power status for non-instantiated nodes in Azure
An issue with Slurm configuration update when bcm slurmautodetect is set
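Two of the Prometheus fixes above concern metrics without labels and output grouped by metric. The sketch below shows what these mean in the Prometheus text exposition format, where all lines for a given metric are expected to appear together. The sample names, labels, and the function `render_grouped` are hypothetical and do not reflect CMDaemon's internal code.

```python
from collections import defaultdict


def render_grouped(samples):
    """Render (name, labels, value) samples so each metric's lines are adjacent.

    Metrics keep the order in which they first appear. A sample with an
    empty label dict is rendered without braces.
    """
    groups = defaultdict(list)  # dicts preserve insertion order
    for name, labels, value in samples:
        groups[name].append((labels, value))

    lines = []
    for name, entries in groups.items():
        for labels, value in entries:
            if labels:
                body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
                lines.append(f"{name}{{{body}}} {value}")
            else:
                lines.append(f"{name} {value}")  # metric without labels
    return "\n".join(lines) + "\n"


# Interleaved input: the two gpu_temp samples are separated by another metric.
samples = [
    ("gpu_temp", {"node": "dgx01"}, 41),
    ("load", {}, 0.5),
    ("gpu_temp", {"node": "dgx02"}, 43),
]
print(render_grouped(samples), end="")
```

In the rendered text both `gpu_temp` lines come first and the label-free `load` line follows, even though the input interleaved them.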
Node Installer#
Fixed Issues#
An issue with node-installer not always setting the selected node MAC when rebooting
cmsh#
New Features#
Added cmsh redfishsubscriptions command
Added mlxconfig reset to the DPU commands
Improvements#
Improved help for the GPU profiling command
Improved tab completion to support metrics with a space in their name
Fixed Issues#
An issue when setting the switchport property of a device to a single value after the property had held multiple values
An issue entering selinux mode
An issue with cmsh sometimes hanging when running with -c
An issue with event acknowledge
An issue when automatically setting IP in the clone command
pythoncm#
New Features#
Added pythoncm.Network.devices
Added RPC timeouts to pythoncm
Added an example of overloading the event handler in pythoncm
Base View#
Fixed Issues#
An issue with run command not showing the full output
Base View NG#
Fixed Issues#
An issue with package updates not showing in Ubuntu
Cluster Tools#
Improvements#
Ensure cm-container-registry-setup writes the registry certificates to software images without assigned nodes
Update shorewall routes with Kube Service Network for air-gapped Kubernetes deployments
Improved NetQ version vs. Kubernetes version compatibility checks in cm-kubernetes-setup
Improved checks for the prerequisite SSH configuration required for NetQ in cm-kubernetes-setup
Do not allow cm-chroot-sw-img to run on the passive head node or provisioning nodes unless forced
Fixed Issues#
An issue with the retry of NetQ installation where a bootstrap reset purge-db was not done before retrying
An issue with an infinite loop in request-remote-assistance with nohup
An issue in request-license performing an incorrect hostname comparison and then unnecessarily trying to ssh to itself
An issue with cmha-setup that prevented clusters with BCME licenses from setting up HA
An issue when modifying the secondary head node entity during cm-cloud-ha-setup
COD#
New Features#
Added support for setting up accelerated networking for head nodes when creating clusters with cm-cod-azure
Improvements#
Set the default region for cm-cod-oci to us-sanjose-1
Allow enabling/disabling creation of the NAT gateway and shared public IP in cm-cloud-ha-setup via the GUI
Store the cluster password hashed (instead of plain text) on the head node
Improved robustness of cluster deletion in cm-cod-oci
Fixed Issues#
An issue when recreating clusters created with managed identity
pyxis-sources#
Improvements#
Updated pyxis to 0.20.0
slurm#
New Features#
Added integration with the Topology Generation Service
Set the ENROOT_MOUNT_HOME configuration option to “no” by default for new setups
slurm23.02#
Improvements#
Updated Slurm 23.02 to 23.02.8
slurm23.11#
Improvements#
Updated Slurm 23.11 to 23.11.10
slurm24.05#
New Features#
Added extra Slurm packages with support for NVIDIA SHARP
Improvements#
Updated Slurm 24.05 to 24.05.3