Release notes for NVIDIA Base Command™ Manager (BCM) 10.23.12

Released: 14 December 2023

General

New Features

  • Added support for Kubernetes v1.28

Improvements

  • Added mlnx-ofed23.10 package

  • Added CUDA 12.3 packages

  • Updated cuda-driver to 545.23.08

  • Updated cm-openssl to 3.1.4

  • Updated cuda-driver-legacy-470 to 470.223.02

CMDaemon

New Features

  • Allow the Kubernetes kubelet service to start when swap is enabled, which resolves an issue where Kubernetes setup may fail if the head node has swap enabled and is selected as a Kubernetes master node

Improvements

  • Added support for two VLANs on top of a bond interface with a VLAN as the provisioning interface

  • Added a periodic CMDaemon maintenance task to free the heap allocations that are not automatically freed, which reduces the memory usage

  • Display both the nodes and the accelerator counts in the license information in cmsh

  • Added an hourly cron job to clean up files left behind when the sample-ipmi script is killed

  • Prevent already outdated monitoring data with timestamps in the past from being saved in the CMDaemon database

  • Added an “export as CSV” option in the REST monitoring data API

  • Added new ManagedServicesOK monitoring metrics for the partition and for the categories, which give totals for all of the nodes in the cluster or for the nodes in a given category

  • Allow the option to disable the nvlink and nvswitch metrics by specifying extra_values configuration settings for the nodes

  • Include the outdated packages for the passive head node, the node-installer, and the software images in the notification for available BCM updates

Fixed Issues

  • An issue with sorting on timestamps in a monitoring plot consisting of raw and consolidated data, which in some cases can result in CMDaemon returning no monitoring data for certain requests

  • An issue with reporting the correct InfiniBand count in the hardware overview of the compute nodes

  • A timing issue that, in some cases, may prevent the pbsmom service from starting in an on-prem+edge workload manager setup

  • An issue with the Prometheus exporter for monitoring data when the data contains measurable names with spaces

  • An issue with performing parallel device power operations with cmsh

  • Allow the option to use GET in the Prometheus sampler for older exporters that do not allow POST operations

  • An issue with moving the software image revisions directories when updating the path of the parent software image
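
The Prometheus exporter fix above concerns measurable names that contain spaces. These notes do not describe how BCM resolved it, so the following is only a minimal sketch of one common approach. It relies on the Prometheus rule that metric names must match [a-zA-Z_:][a-zA-Z0-9_:]*. The function name sanitize_metric_name and the sample measurable names are hypothetical and are not part of BCM.

```python
import re

# Characters outside the set Prometheus accepts in a metric name.
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9_:]")


def sanitize_metric_name(name: str) -> str:
    """Map an arbitrary measurable name to a valid Prometheus metric name.

    Hypothetical helper for illustration; not the BCM implementation.
    """
    # Replace every disallowed character (including spaces) with "_".
    cleaned = _INVALID_CHARS.sub("_", name.strip())
    # A metric name may not be empty or start with a digit.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


if __name__ == "__main__":
    # Hypothetical measurable names, used only to exercise the helper.
    for raw in ("GPU utilization", "1min load"):
        print(raw, "->", sanitize_metric_name(raw))
```

A mapping like this is lossy, since two different measurable names can collapse to the same metric name. An exporter that uses it would typically also carry the original name in a label.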

Node Installer

Improvements

  • On Ubuntu-based distributions, make the updates of the ntp.conf drift file settings consistent between the Node Installer and CMDaemon, which until now could generate different configuration files

cm-diagnose

Improvements

  • Include syslog in cm-diagnose

  • Sanitize all mysqldumps in cm-diagnose

cm-harbor

Fixed Issues

  • A race condition where Harbor from the cm-harbor package and Shorewall concurrently update the iptables rules, which in some cases can prevent the required iptables rules from being enabled

cm-kubernetes-setup

Fixed Issues

  • An issue selecting the correct Kubernetes namespace in the retry mechanism of cm-kubernetes-setup when uninstalling failed operators

cm-scale

New Features

  • The Auto Scaler now takes into account the NVIDIA GPU requests made by Kubernetes pods and jobs when selecting the compute nodes to power on

Fixed Issues

  • The Auto Scaler now takes the Slurm mincpus parameter into account

cm-setup

Fixed Issues

  • An issue with ssh connections initiated by cm-setup scripts when an ssh-agent is running and has an SSH ECDSA certificate added to the agent

  • A regression in cm-container-registry-setup for Harbor on an HA head node setup, which can result in “no such file or directory” error messages for files not present on the passive head node

cm-wlm-setup

Improvements

  • Create the enroot cache shared directory automatically if it does not already exist

Fixed Issues

  • An issue where enroot is not configured by default on a head node when the pyxis Slurm plugin is enabled

  • Allow the option to configure full BCM GPU autodetection for Slurm with cm-wlm-setup

jupyter

Improvements

  • In some cases, an issue where duplicated pods or services may be created due to a race condition in the Kubernetes API

  • Update the JupyterLab and JupyterHub dependencies to the most recent versions

pythoncm

Improvements

  • Added an option to use gzip compression in RPC calls
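
These notes do not name the pythoncm option that enables gzip compression, so the sketch below does not use pythoncm at all. It only illustrates the general idea with the Python standard library, assuming a JSON-encoded request body. The function names encode_payload and decode_payload and the sample payload are hypothetical.

```python
import gzip
import json


def encode_payload(payload: dict) -> bytes:
    """Serialize a payload to JSON and gzip-compress it.

    Hypothetical helper for illustration; not the pythoncm API.
    """
    raw = json.dumps(payload).encode("utf-8")
    return gzip.compress(raw)


def decode_payload(blob: bytes) -> dict:
    """Reverse encode_payload: decompress, then parse the JSON."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical, highly repetitive payload, typical of a node listing.
    payload = {"nodes": ["node%03d" % i for i in range(1000)]}
    raw_size = len(json.dumps(payload).encode("utf-8"))
    blob = encode_payload(payload)
    print("uncompressed bytes:", raw_size)
    print("compressed bytes:", len(blob))
```

Compression of this kind trades some CPU time for less data on the wire, which tends to pay off for large, repetitive responses and matters less for small ones.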