NVIDIA DGX SuperPOD: Release Notes 10.24.11#
Introduction#
This document covers the NVIDIA Base Command™ Manager (BCM) 10.24.11 software release on NVIDIA DGX SuperPOD™ configurations. Except for Component Versions, the information herein is the same as in the NVIDIA Base Command Manager Release Notes.
Information about BCM and DGX SuperPOD is available at:
Component Versions#
DGX SuperPOD component versions for this release are in Table 1.
Component | Version |
---|---|
BCM ISO | 10.24.11 |
DGX OS | 6.3.1 |
Ubuntu | 22.04.4 LTS |
Enroot | 3.5.0 |
CUDA toolkit | 12.4 |
DCGM | 3.3.8 |
Cumulus OS | 5.10.0 |
Mellanox InfiniBand Switch (DGX A100/H100) | MLNX OS version: 3.11.2300; HCA Firmware: CX7 - 28.39.3560 |
Slurm | 23.02.8 |
Mellanox OFED Driver (A100 and H100) | MLNX_OFED_LINUX-23.10.3.2.0 LTS |
DGX kernel | 5.15.0-1063-nvidia |
GPU Driver | 550.90.07 |
Lustre Client | ddn145 |
UFM | UFM Enterprise SW: 1.7.0 |
HPL | hpc-benchmarks:24.09 |
NCCL | tensorrt:24.10-py3 |
DGX FW | 24.09.17 |
General#
New Features#
Added cuda-driver-550 and cuda-fabric-manager-550 packages
Added mlnx-ofed24.07 package
Added support for SUSE Linux Enterprise Server (SLES) 15 SP6
Improvements#
Updated Nsight Systems to 2024.6.1
Updated cm-openssl to 3.1.7
Updated cuda-driver to 565.57.01
Updated cuda-driver-535 to 535.216.01
Updated cuda12.6-toolkit to 12.6 Update 2
Updated freeipmi to 1.6.14
CMDaemon#
Improvements#
Reduced time required for all compute nodes to reconnect when the head node CMDaemon is restarted
Use the 64-bit OID versions for in and out octets in the SNMP switch monitoring sampler
Added a REST endpoint to get the network topology
Allow the option to configure /home exports per category for individual users/tenants
Improved REST rack API call result to include the device type information
Allow the option to configure additional DNS forward zones for networks defined in CMDaemon
Added support for Equal-Cost Multi-Path (ECMP) routing to IP routing in layer3 setups
Added the /[a-f0-9]{12}_[hc]/ regex to the default list of options for the IgnoreInotifyInterface advanced configuration option
Added subgroups support in the WLM check-alloc implementation for allowing or denying user logins to compute nodes
Allow the option to deploy cm-lite-daemon with a cm-deploy-lite-daemon.sh deployment script without using ZTP on Cumulus switches
Improved CMDaemon commit validation for bootable networks when there is an overlap in the network ranges/CIDR
Allow the option to disable the JSON login for a list of users defined in the advanced configuration option DisableLoginServiceUsers
Added Raritan PDU monitoring sampler script
Allow the option to use FQDN for the compute nodes with the global configuration option ShortHostname=0
Kubernetes module files will now be created on all nodes with kubelet or firewall roles
Use the Cumulus nv commands for setting up the username and password on Cumulus 5.9 and newer
Sample the DCGM_FI_DEV_NVSWITCH_FATAL_ERRORS metric for GPU NVSwitches
CMDaemon will now manage the mst service on all nodes with a DPU
Added a ZTP stage directory /cm/local/apps/cmd/etc/htdocs/dpu/ztp for scripts executed on DPUs after BFB push
Added new endpoint to REST API for power management
Added new REST API endpoint for node categories
Added Forge IB interface support
Added CMDaemon health check for expiring Kubernetes certificates
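The `/[a-f0-9]{12}_[hc]/` pattern added to the IgnoreInotifyInterface defaults can be tried out in isolation. The following is a minimal sketch using Python's `re` module. It assumes the pattern is applied to the whole interface name, which these notes do not state, and the interface names are hypothetical.

```python
import re

# Pattern quoted in the release note. Whether CMDaemon anchors it to the
# full interface name is an assumption made for this sketch.
IGNORE_PATTERN = re.compile(r"[a-f0-9]{12}_[hc]")


def is_ignored(interface_name: str) -> bool:
    """Return True if the whole interface name matches the pattern."""
    return IGNORE_PATTERN.fullmatch(interface_name) is not None


# Hypothetical interface names, for illustration only.
samples = [
    "0a1b2c3d4e5f_h",  # 12 lowercase hex characters followed by _h
    "0a1b2c3d4e5f_c",  # 12 lowercase hex characters followed by _c
    "eth0",            # ordinary interface name
    "0A1B2C3D4E5F_h",  # uppercase hex is outside the character class
    "0a1b2c3d4e5f_x",  # suffix is neither _h nor _c
]
results = {name: is_ignored(name) for name in samples}
```

Under that assumption, only names consisting of twelve lowercase hexadecimal characters followed by `_h` or `_c` are matched.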
Fixed Issues#
An issue where Azure cloud node instance creation failures may not be correctly reported by cmsh
An issue with the head node HA shared interfaces not being brought up automatically after they are manually brought down
An issue with the duplex regex in the interfaces health check
An issue with the Slurm takeover script
An issue where the unit of the PDUUptime and SwitchUptime metrics is not correctly shown in cmsh/Base View
An issue where automatic file system exports may not be removed when a network configuration is updated
An issue with missing newlines in /var/spool/cmd/events.log
An issue where cmd -x can produce an XML configuration file with duplicate values for the “revision” or “extra values” properties
An issue where the TotalGPUTemperature metric is reported as 0
An issue with configuring the Slurm accounting service on edge setups
An issue where the sgeexecd service on the compute nodes may be restarted when CMDaemon is restarted
An issue with applying the search domain index in the partition and category settings when generating the resolver configuration
An issue with applying the search domain index in the network settings when generating the resolver configuration
An issue where level 3 switches are not being added to the Slurm topology.conf file when the tree Slurm plugin is configured
An issue with the Prometheus monitoring sampler data collection when a username and a password are configured
An issue with the Slurm power management scripts raising an AttributeError exception
An issue where empty configuration options in the Kubelet role may not result in Kubernetes manifest file updates
An issue where duplicate provisioning requests may be queued when a cloud director is not UP
An issue where the megaraid health checks may not be able to report a failure because the FAIL message is written to an incorrect file descriptor
An issue where the node-installer is unable to copy symbolic links from the /cm/conf/ directories
A timing issue with image update of compute nodes running systemd-managed automount filesystems, where CMDaemon may not detect the automounted filesystem mount point and may not add it to the exclude list
An issue where CMDaemon may not be able to update the kernel hash in the MySQL database when a software image initrd is updated
An issue with the Redfish monitoring sampler printing informational messages to an incorrect file descriptor
An issue where, in some cases, ramdisk creation may be started while CMDaemon is stopping
An issue where Azure compute nodes may be left dangling after powering on more nodes than the allowed quota
An issue where the WLM slots values may be expressed in bytes in cmsh/Base View
An issue where switch control scripts may not correctly separate the stdout and stderr output
An issue where the restart required flag is set for DPUs when they are not running CMDaemon
An issue where the dhcpd service configuration file may include compute node BMC interfaces that are not in use
An issue where, for nodes running Ubuntu, the bond options of a bonded interface are not added correctly to the networking configuration file
An issue with the system interrupts metrics not being expressed in number of interrupts per second
An issue where CPU- metrics are displayed in Jiffies instead of Jiffies/s
An issue where ProcSNMP metrics such as IpInDelivers are not configured as cumulative
An issue where the SlurmState metrics may not include hostnames that include hyphens
An issue where a ramdisk creation task does not transition to a failed state when trying to create a ramdisk for a locked image
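Several of the fixes above concern cumulative counters (interrupts, CPU jiffies, ProcSNMP values such as IpInDelivers) that are meant to be shown as per-second rates. The sketch below shows the usual arithmetic for turning two samples of such a counter into a rate. It is an illustration only, not CMDaemon's implementation; the function name and the single-wrap handling are assumptions.

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time, counter_bits=64):
    """Per-second rate between two samples of a cumulative counter.

    Times are in seconds. A negative difference is treated as a single
    wrap of an unsigned counter that is counter_bits wide (an assumption).
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter wrapped around between the two samples.
        delta += 2 ** counter_bits
    return delta / elapsed


# 500 jiffies accumulated over 5 seconds.
jiffies_per_second = counter_rate(1000, 0.0, 1500, 5.0)

# A 32-bit counter that wrapped between the samples.
wrapped_rate = counter_rate(2 ** 32 - 10, 0.0, 10, 10.0, counter_bits=32)
```

A narrow counter on a fast link can wrap more than once between samples, which this arithmetic cannot detect; wider counters make that far less likely.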
Node Installer#
Fixed Issues#
An issue where the /etc/machine-id values of the compute nodes are not unique
An issue where, for nodes running Ubuntu, the bond options of a bonded interface are not added correctly to the networking configuration file
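To check whether a cluster was affected by the /etc/machine-id issue, the IDs collected from the nodes can be grouped and compared. The sketch below assumes the IDs have already been gathered into a dictionary by some other means; the node names and ID values are hypothetical.

```python
from collections import defaultdict


def duplicate_machine_ids(ids_by_node):
    """Map each machine-id shared by more than one node to those nodes."""
    nodes_by_id = defaultdict(list)
    for node, machine_id in ids_by_node.items():
        # Strip the trailing newline that reading the file would leave.
        nodes_by_id[machine_id.strip()].append(node)
    return {
        machine_id: sorted(nodes)
        for machine_id, nodes in nodes_by_id.items()
        if len(nodes) > 1
    }


# Hypothetical data: node001 and node003 share an ID.
collected = {
    "node001": "0123456789abcdef0123456789abcdef\n",
    "node002": "fedcba9876543210fedcba9876543210\n",
    "node003": "0123456789abcdef0123456789abcdef\n",
}
duplicates = duplicate_machine_ids(collected)
```

An empty result means every node reported a distinct ID.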
COD#
Fixed Issues#
COD OpenStack: Make cluster start wait for renamed nodes
Machine Learning#
New Features#
Added NCCL 2.23.4 for CUDA 12.6
cm-bios-tools#
Improvements#
An issue with random Redfish disconnect errors
An issue with performing flash operations of H100 GPU tray firmware in parallel
cm-cluster-extension#
Fixed Issues#
A validation issue in the advanced settings dialog which can result in validation error messages such as “Create tunnel networks is not integer”
cm-kubernetes-setup#
Improvements#
Allow the option to configure Kubernetes Ingress HTTPS on port 443 on the head node with SSL passthrough
Allow the option to set up Kubernetes version 1.31 with cm-kubernetes-setup. Kubernetes versions 1.27 and older are no longer available options for performing new Kubernetes setups
The use of kube-rbac-proxy is now deprecated in the Jupyter operator and in the permissions manager in favor of using the internal kubebuilder mechanism
Added support for the NIM operator in cm-kubernetes-setup
Updated Kubernetes OVN CNI to 1.1.13
Updated local path provisioner to version 0.0.29
Improved retry mechanism when Kubernetes certificate signing requests time out
Fixed Issues#
An issue with using the cm-kubernetes-setup --pull command line option on Ubuntu 24.04
An issue with handling older versions of the Kubernetes permission manager where not all API endpoints exist
cm-lite-daemon#
Improvements#
Added MemoryUtilization metric for devices running cm-lite-daemon
Added ARP table information to the switch overview
Added reported network speed metric
cm-scale#
Improvements#
Allow the option to reboot compute nodes when the software image changes instead of performing a power off and on cycle
cmsh#
Improvements#
An issue with calculating the IPs when cloning devices with cmsh when using “layer3” network setup
Allow the option to override the default timeout for monitoring scripts when running samplenow with the --max-run-time option
An issue in cmsh with displaying the monitoring data when using monitoringdump --uncompress
cmsh will now also show the Azure availability zone when the zone has been auto-selected by Azure
Fixed Issues#
An issue where, when using --next-ip to clone a device with multiple network interfaces on the same network, the resulting IPs of the cloned device may be identical
An issue with using regular expressions with the foreach command in the interfaces submode for devices
An issue where the networks IP is incorrectly updated when an interface is configured with startif = active and the device is cloned with cmsh
An issue where the cmsh monitoringbackuprings command does not take into account that a backup role may be disabled when showing the information
An issue with alignment of the power results table when some hostnames are too long
A timing issue in cmsh where a device power operation may not be executed if it is initiated shortly (within ~2s) after the device is committed
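The --next-ip fix above concerns a cloned device receiving the same address on several interfaces of one network. One way to guarantee distinct addresses is to record each address as taken as soon as it is handed out. The sketch below illustrates that idea with Python's `ipaddress` module. It is not cmsh's algorithm; the function name, network, and addresses are hypothetical.

```python
import ipaddress


def next_free_ips(network, used, count):
    """Return `count` distinct unused host addresses from `network`."""
    if count <= 0:
        return []
    net = ipaddress.ip_network(network)
    taken = {ipaddress.ip_address(ip) for ip in used}
    picked = []
    for host in net.hosts():
        if host in taken:
            continue
        picked.append(str(host))
        # Mark the address as taken so the next interface cannot reuse it.
        taken.add(host)
        if len(picked) == count:
            return picked
    raise ValueError("not enough free addresses in " + str(net))


# Hypothetical clone with two interfaces on the same network.
clone_ips = next_free_ips("10.141.0.0/29", ["10.141.0.1", "10.141.0.2"], 2)
```

Passing the addresses already assigned on the network as `used` keeps the clone from colliding with existing devices as well as with its own interfaces.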
pythoncm#
Improvements#
Added send_warning_event pythoncm cluster method
slurm#
Improvements#
Updated Slurm 24.05.4 Sharp Plugin to 1.0.1
topograph#
Improvements#
The cluster-topology-generator is now renamed to topograph