NVIDIA DGX SuperPOD: Release Notes 10.24.07#
This document covers the NVIDIA Base Command™ Manager (BCM) 10.24.07 software release on NVIDIA DGX SuperPOD™ configurations. Except for Component Versions, the information herein is the same as in the NVIDIA Base Command Manager Release Notes.
Information about BCM and DGX SuperPOD is available at:
The NVIDIA DGX SuperPOD: Release Notes 10.24.07 is also available as a PDF.
Component Versions#
DGX SuperPOD component versions for this release are in Table 1.
| Component | Version |
| | 10.24.07 |
| | 6.2.1 |
| Ubuntu | Ubuntu 22.04.4 LTS |
| Enroot | 3.5.0 |
| CUDA toolkit | 12.2 |
| | 3.3.5 |
| Cumulus OS | 5.5.1 |
| Mellanox InfiniBand Switch (DGX A100/H100) | MLNX OS version: 3.11.2016; HCA Firmware: CX7 - 28.39.2048 |
| Slurm | 23.02.7 |
| Mellanox OFED Driver (A100 and H100) | |
| DGX kernel | 5.15.0-1053-nvidia |
| GPU Driver | 535.161.08 |
| Lustre Client | ddn145 |
| | UFM Enterprise SW: 1.7.0 |
| | hpc-benchmarks:23.10 |
| | tensorrt:24.06-py3 |
| | 1.1.3 |
Change Requests#
New Features#
Added support for air-gapped Kubernetes setup
Added support for Slurm 24.05
Added support for Kubernetes v1.29
Updated Ubuntu 22.04 base distribution to 22.04.4
Updated cm-nvhpc to 24.5
Updated cm-openssl to 3.1.6
Updated cuda-driver-535 to 535.183.06
Updated DCGM to 3.3.6
Updated enroot to 3.5.0
Updated Lmod to 8.7.39
Updated PBS Professional 2022 to 2022.1.6
Updated PBS Professional 2024 to 2024.1.1
Updated Slurm 23.11 to 23.11.8
New Features#
Allow the option to configure the real hostname of the head node instead of the “master” hostname alias as the StorageHost in slurmdbd.conf
Use the hardware information which corresponds to the cloud instance type when configuring cloud compute nodes in Slurm. This allows the nodes to be configured before the nodes have been powered on for the first time
Allow the option to update the IB switches firmware via cmsh or pythoncm
Added support for accelerated networking (SR-IOV) in Azure
Allow the option to use NVIDIA sharp plugin as the Slurm resource selection plugin (SelectType)
Added a new OOM kill count metric based on data from /proc/vmstat
Added new GPU/NVlink monitoring metrics for CRC errors and remapped rows
Added the cm-remove-orphaned-pending-job-info.py helper script, which removes job information from the CMDaemon database for WLM jobs that cmd has cached as pending but that have since been removed from the workload manager
Allow the option with Ubuntu base distributions to configure a layer3 network setup with a /31 connection between the individual node switch port and the router
Spread out the schedulers health check to reduce the flood of parallel WLM calls
Display the standalone entities as a path in the monitoring tree in Base View
Include the version information in the info-message reported by the cuda-dcgm health check when the health check is sampled with the --debug option
Allow the option to select the ICMP protocol for the firewall role’s openports options
The DNS allow-transfer setting is now set to ‘none’ by default for all zones maintained by CMDaemon
Allow the option to define extra Prometheus labels for devices
Allow the option to expose enum values as labels with the Prometheus exporter
Allow the option to dump the monitoring data for health checks and enum metrics grouped by value
Allow the option to update the “FrozenFile” setting with the cm-manipulate-advanced-config.py helper script
Improved execution speed of the rogueprocess health check
Improved speed of the monitoring triggers evaluation when a regex is used
Validate the project managers when adding, updating, or removing users
Added a commit validation warning when a GPU sampler update frequency is too high. A frequency that is too high can lead to incorrect values for certain GPU metrics
Improved IMEX epilog stop script with a timeout of 15s and ability to kill the process
Allow the option to periodically check whether the head node IP has changed on external DHCP renewal
Allow the option to change the behavior of the monitoring drain action to not set a drain reason when draining the node(s)
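For readers who want to see where an OOM kill count can come from, here is a minimal sketch of reading such a counter from /proc/vmstat. It assumes the counter is the `oom_kill` field that reasonably recent Linux kernels expose in that file; the release notes do not say which field CMDaemon reads. The function names are hypothetical and are not part of BCM.

```python
def parse_oom_kill_count(vmstat_text: str) -> int:
    """Return the oom_kill counter from the text of /proc/vmstat.

    Assumption: the kernel exposes a line of the form "oom_kill <count>".
    """
    for line in vmstat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "oom_kill":
            return int(fields[1])
    raise KeyError("oom_kill counter not found in vmstat data")


def read_oom_kill_count(path: str = "/proc/vmstat") -> int:
    """Read the file at `path` and return its oom_kill counter."""
    with open(path, "r", encoding="ascii") as handle:
        return parse_oom_kill_count(handle.read())


if __name__ == "__main__":
    # Prints the cumulative number of OOM kills since boot on this host.
    print(read_oom_kill_count())
```

The counter is cumulative since boot, so a monitoring consumer would typically track the difference between successive samples.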
Fixed Issues#
An issue with the pythoncm Prometheus range_query not converting the interval to nanoseconds
An issue with using cmha status for failover groups
An issue with freezing the /etc/systemd/resolved.conf on Ubuntu base distributions
An issue with the sample_ibmetrics.py script not returning floating point values
An issue with adding the additionalHostnames of GenericDevice and LiteNode devices to the DNS configuration
An issue with applying node tags to bare-metal instances in OCI
An issue with providing Kubernetes job information to the Auto Scaler
An issue with the interfaces health check when the interface’s speed is defined with a unit
An issue that prevents auditd from being added or removed when SELinux settings are changed for a partition or a category
An issue that can cause CMDaemon to crash on clusters with head node HA setup when an RPC is called within a small window during startup
An issue with setting the user and group ownership of static configuration files managed by generic roles
An issue where executing MIG, BIOS, or DPU commands may not clear the “busy” flag when the commands time out
Allow the option to select the AWS node-installer volume type with an advanced configuration option NodeInstallerEbsVolumeType
An issue where the values of enum metrics may not be translated to enums on the compute nodes until CMDaemon is restarted
Standardize all monitoring scripts to use CMD_SCRIPTTIMEOUT environment variable. The CMD_SCRIPT_TIMEOUT environment variable is no longer passed to the monitoring action scripts, now only CMD_SCRIPTTIMEOUT is used
An issue with the monitoring trigger actuator when many samples are picked up at once or when using complex expressions
An issue where a Cumulus ZTP script may not use the hostname provided by DHCP
An issue with generating Slurm topology.conf in the case when the switches are connected to both nodes and other switches
An issue where CMDaemon may execute systemctl daemon-reload even when the Slurm service drop-in file has not changed
An issue with creating AWS compute nodes with multiple EFA interfaces
An issue with parsing the requested CPUs setting in UGE job information
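As an illustration of the CMD_SCRIPTTIMEOUT change listed above, a custom monitoring action script might read the variable as in the sketch below. It assumes the value is a number of seconds; the release notes do not state the format. The fallback value and the function name are hypothetical.

```python
import os

# Hypothetical fallback used when the variable is absent or malformed.
DEFAULT_TIMEOUT_SECONDS = 30.0


def script_timeout(environ=os.environ, default=DEFAULT_TIMEOUT_SECONDS) -> float:
    """Return the timeout taken from CMD_SCRIPTTIMEOUT, or `default`.

    Only CMD_SCRIPTTIMEOUT is consulted; the older CMD_SCRIPT_TIMEOUT
    spelling is deliberately ignored because it is no longer passed.
    Assumption: the value is a positive number of seconds.
    """
    raw = environ.get("CMD_SCRIPTTIMEOUT")
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    return value if value > 0 else default


if __name__ == "__main__":
    print(f"Using a timeout of {script_timeout()} seconds")
```

Scripts that still read CMD_SCRIPT_TIMEOUT would silently fall back to their own defaults after this release, so it is worth searching custom scripts for the old name.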
Node Installer#
Fixed Issues#
An issue with setting up bonded network interfaces on diskless nodes
An issue where the auditd service may not be disabled by the node-installer when SELinux is not enabled
Cluster Tools#
Allow the option to create clusters spanning over multiple existing AWS subnets
Allow the option to set up Azure HA COD clusters by using existing public IP resources
Allow the option to not create a shared public IP for HA COD clusters in OCI
Allow the option to deploy docker on diskless nodes
The /etc/cm-install-release information file is now also created for Cluster On Demand head nodes
Updated oci-hpc-network-device-names to 1.0.13 with support for L40S
Allow the option to create clusters spanning over multiple existing AWS subnets
Added support for creating AWS clusters in a subnet outside of the main VPC CIDR block
Fixed Issues#
An issue with creating COD clusters when FIDO/U2F keys (e.g. ED25519-SK) are added to a running SSH agent
An issue where the delete --dry-run output does not show the actual resources that are to be deleted
An issue with restricting SSH access to OCI clusters to only a predefined CIDR due to default OCI security list rules
An issue with creating COD clusters when encrypted SSH keys are used
Head Node Installer#
Updated the STIG disk setup configurations
Machine Learning#
Updated cuDNN to 9.1.1
Allow the option to deploy docker on diskless nodes
New Features#
Added support for GPU operator v24.3.0
Added support for Kubernetes v1.29
Updated local path provisioner to 0.0.28
The Grafana web interface is now added to the cluster landing page for new Kubernetes setup installations when the Prometheus Operator stack is selected in cm-kubernetes-setup
Use Kyverno 1.11.4 for new Kubernetes deployments and, by default, add tolerations that allow Kyverno to run on the control-plane nodes as well as the worker nodes, so that the worker nodes can be shut down without affecting Kyverno
Deprecated Features#
Kubernetes v1.24 and v1.25 are no longer available options for setting up Kubernetes
Fixed Issues#
An issue with Cumulus switch overview showing state=unknown when CumulusOS 5.9 is installed
Fixed Issues#
An issue that in some cases affects saving the state of drained nodes when the head node is restarted, which can prevent the Auto Scaler from considering the nodes as available for scaling up
Allow the option with cmsh to set a trigger expression as a parsed string
Allow the option to show raw numbers and columns with all zero data in the chargeback report table
Fixed Issues#
An issue with the cmsh addinterface command assigning incremental IPs without sorting the nodes by their hostnames
An issue that can cause cmsh to crash when running many cmsh -c ‘…’ commands one after the other
Migrate the WLM kernel templates to Jupyter Kernel Starter