Release notes for NVIDIA Base Command™ Manager (BCM) 10.25.03
Released: 28 March 2025
General
New Features
Updated multiple Kubernetes operators when performing new setups: ingress-nginx 4.12.1 (fix CVE-2025-1097, CVE-2025-1098, CVE-2025-1974, CVE-2025-24513, CVE-2025-24514), kube-prometheus-stack 70.3.0 (fix CVE-2025-22868, CVE-2025-22870), kube-state-metrics 5.31.0. Existing Kubernetes setups need to be updated manually.
Added BFB pre/post-install sections in cm-dpu-manage
Added CUDA 12.6 packages
Added mlnx-ofed24.10 packages
Updated Slurm 24.05 to 24.05.7
Updated Slurm 24.11 to 24.11.3
Updated cm-nvhpc to 25.1
Updated cuda-driver to 570.124.06
Updated cuda-driver-535 to 535.216.03
Updated cuda-driver-550 to 550.127.08
Updated cuda12.6 to 12.6.3
Updated mlnx-ofed23.10 to 23.10-4.0.9.1
Updated mlnx-ofed58 to 5.8-6.0.4.2
Updated the Ubuntu 24.04 base distribution to Ubuntu 24.04.1
Updated BaseOS image to version 7.0.2
Fixed Issues
Updated the grub images provided by cm-tftpboot to support booting PE32+ Linux kernels; without this support, aarch64 nodes fail to boot with RHEL 9.5
CMDaemon
New Features
An issue where, in the case of a write failure, CMDaemon may generate a large number of websocket-related error log messages
An issue parsing the BGP information in cm-lite-daemon
An issue where an invalid “plain/text” MIME type may be used by CMDaemon
Added new advanced configuration option HttpStrictTransportSecurityMaxAge
Update the Slurm topology.conf file when a cloud node goes UP or DOWN
Allow the option to configure a network security group for a specific VNIC in OCI
An issue where DPU apply is not working for a brief period of time immediately after the node becomes UP
Added bound checks in the monitoring storage to prevent possible crashes due to data corruption
Extend cm-deploy-lite-daemon to download packages on the head node for switches that are not connected to the external network
An issue where on Ubuntu 24.04 CMDaemon configures the ntp service instead of ntpsec
Include the arch/os software image information in the CMDaemon XML dump file
Allow the option to specify onboot=no for network interfaces via extra configuration values
Added total CPU and memory utilization metrics
An issue with generating Slurm topology configuration from switches connected to other switches
Restrict the ability to use cmsh foreach or range commands for inefficient power or terminate operations on a large number of devices
Add a new REST API endpoint for WLM drain operations
Allow the option to override the list of enabled OCI agent plugins
Allow the option to customize /var/lib/kubelet/config.yaml via the Kubelet role
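The name of the HttpStrictTransportSecurityMaxAge option listed above suggests that it controls the max-age directive of the HTTP Strict-Transport-Security response header (RFC 6797). These notes do not describe how CMDaemon applies the value, so the following is only a minimal sketch of what such a header looks like. The function hsts_header and its parameters are hypothetical and are not part of BCM.

```python
def hsts_header(max_age_seconds: int, include_subdomains: bool = False) -> tuple[str, str]:
    """Build a Strict-Transport-Security header as a (name, value) pair.

    Hypothetical helper for illustration only; not a CMDaemon or BCM API.
    """
    if max_age_seconds < 0:
        raise ValueError("max-age must be a non-negative number of seconds")
    value = f"max-age={max_age_seconds}"
    if include_subdomains:
        # Optional directive defined by RFC 6797.
        value += "; includeSubDomains"
    return ("Strict-Transport-Security", value)


# Example: a one-year max-age (365 days * 86400 seconds).
name, value = hsts_header(365 * 86400)
print(f"{name}: {value}")
```

A max-age of 0 tells browsers to drop any cached HSTS policy for the host, which can be useful when rolling the setting back.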
Fixed Issues
An issue where Redfish metrics containing a ~ symbol are being exported to Prometheus
Perform periodic checks for certificate signing requests (CSRs) for new Kubelets and certificate rotations; without these checks, CSR approvals may be prevented
Rare CMDaemon crash when performing PDU-port power operations
An issue where the DPU apply RPC timeout is too short for the operation to complete
Improved pagination of the REST API /network/topology endpoint
An issue where the sysinfo GPU UUID does not match the nvidia-smi UUID
An issue where switching to a different consolidator can leave behind the old monitoring data
An issue where kill-no-job-user-ssh-sessions returns no-data instead of failing
An issue where recently added labeled entities may be removed and prevent returning correct PromQL results
An issue where sysinfo disk information is shown multiple times for devices managed by the lite-daemon
An issue with parsing jobs information when group name is not set
Rare deadlock within the head node CMDaemon process when both head nodes and many other nodes are being committed at the same time from different threads
An issue where the slurmctld service may be restarted when CMDaemon is restarted
An issue where chargeback queries can incorrectly report the value is out of range
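As background for the Redfish item in the list above: the legacy Prometheus metric-name pattern is [a-zA-Z_:][a-zA-Z0-9_:]*, which does not include the ~ character. The sketch below shows one way to detect and rewrite such names. It does not reflect how CMDaemon handles them, which these notes do not describe. The helper names and the sample metric name are hypothetical.

```python
import re

# Legacy Prometheus metric-name pattern (the rules used before UTF-8 names).
LEGACY_METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def is_legacy_metric_name(name: str) -> bool:
    """Return True if the whole name matches the legacy pattern."""
    return LEGACY_METRIC_NAME.fullmatch(name) is not None


def sanitize_metric_name(name: str) -> str:
    """Hypothetical helper: replace disallowed characters with underscores."""
    cleaned = re.sub(r"[^a-zA-Z0-9_:]", "_", name)
    # A legacy name may not start with a digit, so prefix an underscore.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


sample = "redfish_fan~1_speed"  # made-up name for illustration
print(is_legacy_metric_name(sample))
print(sanitize_metric_name(sample))
```

Replacing characters can make two distinct source names collide, so a real exporter would also need a policy for duplicates.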
Node Installer
Fixed Issues
An issue with copying dangling symbolic links from the /cm/conf/* configuration directories to the node
COD
New Features
Disable public network access by default for the storage account in Azure
Machine Learning
New Features
Added NCCL 2.25.1 and cuDNN 9.6 and 9.7 for CUDA 12.8
cm-kubernetes-setup
New Features
Modify the default configuration for NVIDIA container toolkit to match the Run:ai requirements
Tune the Kubernetes API Server to more sensible defaults for production systems
[Kubernetes] Various improvements to the out-of-the-box Kube Prometheus Stack configuration, plus a patch to fix existing BCM setups from before 10.25.02 (cm-kubernetes-setup --patch-kube-prometheus-stack)
[Kubernetes] Simplify the BCM landing page ingress (no need for a running Pod)
Enable Typha by default on clusters with fewer than 50 nodes when setting up Kubernetes with Calico CNI
Allow the option to set up NetQ 4.13 with cm-kubernetes-setup using Kubernetes v1.31 on Ubuntu 22.04
Allow the option to install NVIDIA GPU Operator without installing BCM NVIDIA GPU packages
Improvements
Allow the option to configure Kubernetes Ingress HTTPS on port 443 on the head node with SSL passthrough
Fixed Issues
An issue where the cluster-admin service account is not created in the correct ‘default’ namespace
Suppress an incorrect warning message about existing /etc/kubernetes directory as a symlink on the secondary head node
Allow the option to install BCM NVIDIA GPU packages without installing NVIDIA GPU Operator
Wait for the Etcd information to become available to prevent cm-kubernetes-setup failures with the error message 'NoneType' object has no attribute 'advertiseClientUrls'
cm-wlm-setup
New Features
Allow the option to select NRT GPU configuration settings in cm-wlm-setup
Fixed Issues
An issue with setting up pyxis if the secondary head node is down
cmsh
Fixed Issues
An issue with creating a ‘node’ type execution multiplexer in cmsh
An issue with the addinterface command which can result in a crash of cmsh
jupyter
Fixed Issues
An issue where kernel icons are not available when the certificates are generated with openssl 3.2.2
pythoncm
New Features
Added pythoncm parallel MIG function RPC