Release notes for NVIDIA Base Command™ Manager (BCM) 10.24.01

Released: 15 February 2024

General

New Features

  • Added support for upgrading BCM 3 and Bright 9.2 clusters to BCM 10

  • The head node installer will now create a new /etc/cm-install-release file to keep a record of the installation time and the installation media that was used
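    For example, the new record can be inspected on the head node after the installation has completed (shown only as an illustration; the exact contents of the file may differ):

        $ cat /etc/cm-install-release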

Improvements

  • Added cuda-driver-535 package

  • The mlnx-ofed packages’ installation scripts will now pin the kernel packages for Ubuntu when deploying MOFED

  • Updated mlnx-ofed58 to 5.8-4.1.5.0

  • Updated mlnx-ofed23.10 to 23.10-1.1.9.0

  • Updated cuda12.3 to 12.3 update 2

  • Updated cm-nvhpc to 23.11

CMDaemon

Improvements

  • Update the Kubernetes users’ configuration files with Run:ai configuration settings

  • Redirect the output from cm-burn to tty1

  • Added new GPU totals metrics for temperature and NVLink bandwidth

  • Allow the BCM GPU autodetection configuration mechanism to be selected in the Slurm WLM cluster settings as well, and not only in the Slurm WLM client role

  • Ensure kubelets are able to join a Kubernetes cluster even after the initial certificates have expired (which typically happens after 4 hours)
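    As general Kubernetes background for the item above (standard kubectl usage, not a BCM-specific procedure), pending kubelet certificate signing requests can be listed and, if needed, approved manually:

        $ kubectl get csr
        $ kubectl certificate approve <csr-name>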

Fixed Issues

  • An issue with sorting the data passed to the PromQL engine, which can result in an error “expanding series: closed SeriesSet” when running instant queries

  • An issue where the exclude list snippets are not being cloned when cloning a software image

  • A rare deadlock in CMDaemon which can occur while committing a head node

  • An issue with configuring the GATEWAYDEV on RHEL 9 when a VLAN network interface is configured on top of the BOOTIF interface

  • An issue where /etc/systemd/resolved.conf is not being added to the imageupdate exclude list for the compute nodes

  • An issue where install-license may not copy some certificates to all /cm/shared* directories on a multi-arch or multi-OS cluster

  • An issue with the Prometheus exporter when entities have recently been removed

  • An issue with parsing multiple pending Kubernetes CSRs per node, which can result in none of the CSRs being approved

  • On the SLES base distribution, an issue with updating the cluster landing page with links to the dashboards of other integrations such as Kubernetes or Ceph

  • An issue where CMDaemon may not restart the Slurm services automatically when the configured number of CPUs changes for some nodes

  • An issue where CMDaemon may hang waiting for events while stopping

  • An issue where the cmsh call to create a certificate may return before the certificate is written

  • An issue where entering the cmsh biossettings mode may result in an “Error parsing JSON” error message

  • In some cases, an issue with configuring Slurm when automatic GPU configuration by BCM has been selected

  • In some cases, an issue with setting up Etcd due to insufficient permissions to access the Etcd certificate files

  • An issue where a WLM job process ID may be added to an incorrect cgroup, which in some cases may result in the process being killed when another WLM job running on the same node completes

  • An issue with collecting GPU job metrics for containerized Pyxis jobs

cm-kubernetes-setup

New Features

  • Use Calico 3.27, Run:ai 2.15.2, and GPU Operator v23.9.1 for new Kubernetes deployments using cm-kubernetes-setup

Improvements

  • Allow the option to choose Network Operator version 23.10.0

  • Allow the option to configure a custom Kubernetes Ingress certificate
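    As general Kubernetes background for the custom Ingress certificate option (cm-kubernetes-setup collects the certificate through its own dialogs; the command below is standard kubectl usage shown only for illustration), an Ingress TLS certificate and key are typically stored as a Kubernetes TLS secret that the Ingress resource references:

        $ kubectl create secret tls my-ingress-cert --cert=ingress.crt --key=ingress.key -n my-namespace

    Here my-ingress-cert, ingress.crt, ingress.key, and my-namespace are placeholder names.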

Fixed Issues

  • An issue with setting up Run:ai which could prevent the Run:ai cluster installer from completing successfully

  • An issue with the interactive uninstall question in cm-kubernetes-setup when the Kubernetes API is not responsive

  • In some cases, an issue where cm-kubernetes-setup may not wait for the required nodes, such as control-plane or worker nodes, to come back up after a reboot

cm-lite-daemon

Improvements

  • Added new metrics for the total traffic on network interfaces

cm-wlm-setup

Fixed Issues

  • In some cases, an issue with installing Pyxis on multi-arch or multi-distro software images

  • Pyxis enroot is now configured to use its internal default for the cache directory, which was previously set to a directory under /run

cmsh

New Features

  • Added a new cmsh “multiplexers” command in monitoring setup mode, which shows which nodes will run a specified data producer on behalf of other entities
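    A sketch of how the new command might be invoked (the exact syntax and arguments are illustrative assumptions; ssh2node is used here only as an example data producer):

        $ cmsh
        [headnode]% monitoring setup
        [headnode->monitoring->setup]% multiplexers ssh2node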

Fixed Issues

  • An issue with importing a JSON file that was exported on a different cluster

  • An issue with entering the SlurmJobQueueAccessList submode of the SlurmSubmit role when the role is assigned directly to a node

pythoncm

Improvements

  • Added a new pythoncm example script, total-job-power-usage.py, for calculating the power usage of WLM jobs