Release notes for NVIDIA Base Command™ Manager (BCM) 10.23.12

Released: 14 December 2023

General

Allow the Kubernetes kubelet service to start when swap is enabled, which resolves an issue where Kubernetes setup may fail if the head node has swap enabled and is selected as a Kubernetes master node

Added support for two VLANs on top of a bond interface with a VLAN as the provisioning interface
Added a periodic CMDaemon maintenance task to free the heap allocations that are not automatically freed, which reduces the memory usage
Display both the nodes and the accelerator counts in the license information in cmsh
Added an hourly cron job to clean up files left behind when the sample-ipmi script is killed
Prevent already outdated monitoring data with timestamps in the past from being saved in the CMDaemon database
Added an “export as CSV” option in the REST monitoring data API
Added new ManagedServicesOK monitoring metrics for the partition and the categories, which provide totals for all of the nodes in the cluster or for the nodes in a given category
Allow the option to disable the nvlink and nvswitch metrics by specifying extra_values configuration settings for the nodes
Include the outdated packages for the passive head node, the node-installer, and the software images in the notification for available BCM updates

An issue with sorting on timestamps in a monitoring plot consisting of raw and consolidated data, which in some cases can result in CMDaemon returning no monitoring data for certain requests
An issue with reporting the correct InfiniBand count in the hardware overview of the compute nodes
In some cases, a timing issue that may prevent the pbsmom service from starting in an on-prem+edge workload manager setup
An issue with the Prometheus exporter for monitoring data when the data contains measurable names with spaces
An issue with performing parallel device power operations with cmsh
Allow the option to use GET in the Prometheus sampler for older exporters that do not allow POST operations
An issue with moving the software image revisions directories when updating the path of the parent software image

On Ubuntu-based distributions, make the updates of the ntp.conf drift file settings consistent between the Node Installer and CMDaemon, which until now could generate different configuration files

In some cases, fixed a race condition where Harbor from the cm-harbor package and Shorewall concurrently update the iptables rules, which can prevent the required iptables rules from being enabled

An issue with selecting the correct Kubernetes namespace in the retry mechanism of cm-kubernetes-setup when uninstalling failed operators

The Auto Scaler now takes into account the NVIDIA GPU requests made by Kubernetes pods and jobs when selecting the compute nodes to power on

An issue with ssh connections initiated by cm-setup scripts when an ssh-agent is running and has an SSH ECDSA certificate added to the agent
A regression in cm-container-registry-setup for Harbor on an HA head node setup, which can result in “no such file or directory” error messages for files not present on the passive head node

Create the enroot cache shared directory automatically if it does not already exist

An issue where enroot is not configured by default on a head node when the pyxis Slurm plugin is enabled
Allow the option to configure full BCM GPU autodetection for Slurm with cm-wlm-setup

In some cases, an issue where duplicated pods or services may be created due to a race condition in the Kubernetes API
Update the JupyterLab and JupyterHub dependencies to the most recent versions