Release notes for NVIDIA Base Command™ Manager (BCM) 10.25.03

Released: 28 March 2025

General

New Features

  • Updated multiple Kubernetes operators for new setups: ingress-nginx 4.12.1 (fixes CVE-2025-1097, CVE-2025-1098, CVE-2025-1974, CVE-2025-24513, CVE-2025-24514), kube-prometheus-stack 70.3.0 (fixes CVE-2025-22868, CVE-2025-22870), kube-state-metrics 5.31.0. Existing Kubernetes setups must be updated manually.

  • Added BFB pre/post-install sections in cm-dpu-manage

  • Added CUDA 12.6 packages

  • Added mlnx-ofed24.10 packages

  • Updated Slurm 24.05 to 24.05.7

  • Updated Slurm 24.11 to 24.11.3

  • Updated cm-nvhpc to 25.1

  • Updated cuda-driver to 570.124.06

  • Updated cuda-driver-535 to 535.216.03

  • Updated cuda-driver-550 to 550.127.08

  • Updated cuda12.6 to 12.6.3

  • Updated mlnx-ofed23.10 to 23.10-4.0.9.1

  • Updated mlnx-ofed58 to 5.8-6.0.4.2

  • Updated the Ubuntu 24.04 base distribution to Ubuntu 24.04.1

  • Updated BaseOS image to version 7.0.2
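
Because existing Kubernetes setups are not updated automatically, it can help to compare the versions deployed on a cluster against the versions listed above. The following is a minimal sketch of such a comparison. The `installed` mapping is hypothetical sample data, and how the deployed versions are collected is deliberately left out.

```python
# Versions listed in these release notes for new setups.
FIXED_VERSIONS = {
    "ingress-nginx": "4.12.1",
    "kube-prometheus-stack": "70.3.0",
    "kube-state-metrics": "5.31.0",
}


def parse_version(version):
    """Turn a dotted version such as '4.12.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def outdated_components(installed):
    """Return the sorted names of components older than the listed versions."""
    return sorted(
        name
        for name, version in installed.items()
        if name in FIXED_VERSIONS
        and parse_version(version) < parse_version(FIXED_VERSIONS[name])
    )


# Hypothetical versions found on an existing setup (illustrative only).
installed = {
    "ingress-nginx": "4.11.3",
    "kube-prometheus-stack": "70.3.0",
    "kube-state-metrics": "5.27.0",
}
print(outdated_components(installed))
```

Versions are compared as integer tuples rather than strings, so that 4.9.0 correctly sorts before 4.12.1.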

Fixed Issues

  • Updated the grub images provided by cm-tftpboot to support booting PE32+ Linux kernels; without this support, aarch64 nodes could not boot with RHEL 9.5

CMDaemon

New Features

  • An issue where, in the case of a write failure, CMDaemon may generate a large number of websocket-related error log messages

  • An issue with parsing the BGP information in cm-lite-daemon

  • An issue where an invalid “plain/text” MIME type may be used by CMDaemon

  • Added new advanced configuration option HttpStrictTransportSecurityMaxAge

  • Update the Slurm topology.conf file when a cloud node goes UP or DOWN

  • Allow the option to configure network security group for a specific VNIC in OCI

  • An issue where DPU apply is not working for a brief period of time immediately after the node becomes UP

  • Added bound checks in the monitoring storage to prevent possible crashes due to data corruption

  • Extend cm-deploy-lite-daemon to download packages on the head node for switches that are not connected to the external network

  • An issue where on Ubuntu 24.04 CMDaemon configures the ntp service instead of ntpsec

  • Include the arch/os software image information in the CMDaemon XML dump file

  • Allow the option to specify onboot=no for network interfaces via extra configuration values

  • Added total CPU and memory utilization metrics

  • An issue with generating Slurm topology configuration from switches connected to other switches

  • Restrict the ability to use cmsh foreach or range commands for inefficient power or terminate operations on a large number of devices

  • Add a new REST API endpoint for WLM drain operations

  • Allow the option to override the list of enabled OCI agent plugins

  • Allow the option to customize /var/lib/kubelet/config.yaml via the Kubelet role
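
Two of the items above concern the Slurm topology.conf file. As an illustration of the file format only, and not of how CMDaemon generates it, the sketch below renders the entries for a tree in which leaf switches connect to nodes and a parent switch connects to other switches. All switch and node names are hypothetical.

```python
def topology_lines(switch_nodes, switch_uplinks):
    """Render topology.conf lines for a tree of switches.

    switch_nodes:   leaf switch name -> list of node names attached to it
    switch_uplinks: parent switch name -> list of child switch names
    """
    lines = []
    for switch in sorted(switch_nodes):
        nodes = ",".join(sorted(switch_nodes[switch]))
        lines.append(f"SwitchName={switch} Nodes={nodes}")
    for switch in sorted(switch_uplinks):
        children = ",".join(sorted(switch_uplinks[switch]))
        lines.append(f"SwitchName={switch} Switches={children}")
    return lines


# Hypothetical two-level layout: two leaf switches under one spine switch.
leaves = {
    "leaf1": ["node001", "node002"],
    "leaf2": ["node003", "node004"],
}
spines = {"spine1": ["leaf1", "leaf2"]}

print("\n".join(topology_lines(leaves, spines)))
```

Switches connected to other switches appear as `Switches=` entries, while switches with directly attached nodes appear as `Nodes=` entries.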

Fixed Issues

  • An issue where Redfish metrics containing a ~ symbol are exported to Prometheus

  • Perform periodic checks for certificate signing requests (CSRs) from new Kubelets and certificate rotations; without these checks, the CSRs might not be approved

  • Rare CMDaemon crash when performing PDU-port power operations

  • An issue where the DPU apply RPC timeout is too short for the operation to complete

  • Improved pagination of the REST API /network/topology endpoint

  • An issue where the sysinfo GPU UUID does not match the nvidia-smi UUID

  • An issue where switching to a different consolidator can leave behind the old monitoring data

  • An issue where kill-no-job-user-ssh-sessions returns no-data instead of failing

  • An issue where recently added labeled entities may be removed and prevent returning correct PromQL results

  • An issue where sysinfo disk information is shown multiple times for devices managed by the lite-daemon

  • An issue with parsing jobs information when group name is not set

  • Rare deadlock within the head node CMDaemon process when both head nodes and many other nodes are being committed at the same time from different threads

  • An issue where the slurmctld service may be restarted when CMDaemon is restarted

  • An issue where chargeback queries can incorrectly report that the value is out of range
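
The Redfish item above involves metric names containing a ~ symbol, which the classic Prometheus metric-name rule does not allow. The sketch below shows one way to detect and sanitize such names. It is an assumption for illustration and does not describe how CMDaemon resolves the issue. The sample metric name is hypothetical.

```python
import re

# Classic Prometheus metric-name rule: [a-zA-Z_:][a-zA-Z0-9_:]*
VALID_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def is_valid_metric_name(name):
    """Return True if the whole name matches the classic naming rule."""
    return VALID_NAME.fullmatch(name) is not None


def sanitize_metric_name(name):
    """Replace every disallowed character with an underscore."""
    cleaned = re.sub(r"[^a-zA-Z0-9_:]", "_", name)
    # A leading digit is not allowed either, so prefix an underscore.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


# Hypothetical metric name, used only for illustration.
name = "redfish_fan~speed"
print(is_valid_metric_name(name))
print(sanitize_metric_name(name))
```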

Node Installer

Fixed Issues

  • An issue with copying dangling symbolic links from the /cm/conf/* configuration directories to the node

COD

New Features

  • Disable public network access to the storage account in Azure by default

Machine Learning

New Features

  • Added NCCL 2.25.1 and cuDNN 9.6 and 9.7 for CUDA 12.8

cm-kubernetes-setup

New Features

  • Modify the default configuration for NVIDIA container toolkit to match the Run:ai requirements

  • Tune the Kubernetes API Server to more sensible defaults for production systems

  • [Kubernetes] Various improvements to the out-of-the-box Kube Prometheus Stack configuration, plus a patch to fix existing BCM setups from pre-10.25.02 (cm-kubernetes-setup --patch-kube-prometheus-stack)

  • [Kubernetes] Simplify the BCM landing page ingress (a running Pod is no longer needed)

  • Enable Typha by default on clusters with fewer than 50 nodes when setting up Kubernetes with Calico CNI

  • Allow the option to set up NetQ 4.13 with cm-kubernetes-setup using Kubernetes v1.31 on Ubuntu 22.04

  • Allow the option to install NVIDIA GPU Operator without installing BCM NVIDIA GPU packages

Improvements

  • Allow the option to configure Kubernetes Ingress HTTPS on port 443 on the head node with SSL passthrough

Fixed Issues

  • An issue where the cluster-admin service account is not created in the correct ‘default’ namespace

  • Suppress an incorrect warning message about the existing /etc/kubernetes directory being a symlink on the secondary head node

  • Allow the option to install BCM NVIDIA GPU packages without installing NVIDIA GPU Operator

  • Wait for the Etcd information to become available to prevent cm-kubernetes-setup failures with the error message "'NoneType' object has no attribute 'advertiseClientUrls'"
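
The last item above describes waiting for information to become available instead of using it while it is still missing. A generic polling helper along those lines might look like the sketch below. It is not the cm-kubernetes-setup implementation: `wait_for`, `fetch_etcd_info`, and the sample URL are hypothetical names and data chosen for illustration.

```python
import time


def wait_for(fetch, timeout=60.0, interval=2.0):
    """Call fetch() until it returns something other than None.

    Raises TimeoutError if nothing is available within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = fetch()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("information did not become available in time")
        time.sleep(interval)


# Hypothetical stand-in for a lookup that returns None on the first two calls.
_answers = iter([None, None, {"advertiseClientUrls": ["https://10.0.0.1:2379"]}])


def fetch_etcd_info():
    return next(_answers)


info = wait_for(fetch_etcd_info, timeout=5.0, interval=0.01)
print(info["advertiseClientUrls"])
```

Accessing an attribute or key only after the helper returns avoids the 'NoneType' failure mode, and the timeout keeps a permanently missing value from blocking the setup forever.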

cm-wlm-setup

New Features

  • Allow the option to select NRT GPU configuration settings in cm-wlm-setup

Fixed Issues

  • An issue with setting up pyxis if the secondary head node is down

cmsh

Fixed Issues

  • An issue with creating a ‘node’ type execution multiplexer in cmsh

  • An issue with the addinterface command which can result in a crash of cmsh

jupyter

Fixed Issues

  • An issue where kernel icons are not available when the certificates are generated with openssl 3.2.2

pythoncm

New Features

  • Added pythoncm parallel MIG function RPC