Release notes for NVIDIA Base Command™ Manager (BCM) 10.24.01

Released: 15 February 2024

General

New Features

  • Added support for upgrading BCM 3 and Bright 9.2 clusters to BCM 10

  • The head node installer will now create a new /etc/cm-install-release file to keep a record of the installation time and the installation media that was used
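    For example, the new record can be inspected on the head node after the installation has completed (shown only as an illustration; the exact contents of the file may differ):

        $ cat /etc/cm-install-release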

Improvements

  • Added cuda-driver-535 package

  • The mlnx-ofed packages’ installation scripts will now pin the kernel packages for Ubuntu when deploying MOFED

  • Updated mlnx-ofed58 to 5.8-4.1.5.0

  • Updated mlnx-ofed23.10 to 23.10-1.1.9.0

  • Updated cuda12.3 to 12.3 update 2

  • Updated cm-nvhpc to 23.11

CMDaemon

Improvements

  • Update the Kubernetes users’ configuration files with Run:ai configuration settings

  • Redirect the output from cm-burn to tty1

  • Added new GPU totals metrics for temperature and NVLink bandwidth

  • Allow the BCM GPU autodetection configuration mechanism to be selected in the Slurm WLM cluster settings as well, and not only in the Slurm WLM client role

  • Ensure kubelets are able to join a Kubernetes cluster even after the initial certificates have expired (which typically happens after 4 hours)
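    As general Kubernetes background for the item above (standard kubectl usage, not a BCM-specific procedure), pending kubelet certificate signing requests can be listed and, if needed, approved manually:

        $ kubectl get csr
        $ kubectl certificate approve <csr-name>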

Fixed Issues

  • An issue with sorting the data passed to the PromQL engine, which can result in an error “expanding series: closed SeriesSet” when running instant queries

  • An issue where the exclude list snippets are not being cloned when cloning a software image

  • A rare deadlock in CMDaemon which can occur while committing a head node

  • An issue with configuring the GATEWAYDEV on RHEL 9 when a VLAN network interface is configured on top of the BOOTIF interface

  • An issue where /etc/systemd/resolved.conf is not being added to the imageupdate exclude list for the compute nodes

  • An issue where install-license may not copy some certificates to all /cm/shared* directories on a multi-arch or multi-OS cluster

  • An issue with the Prometheus exporter when entities have recently been removed

  • An issue with parsing multiple pending Kubernetes CSRs per node, which can result in none of the CSRs being approved

  • On the SLES base distribution, an issue with updating the cluster landing page with links to the dashboards of other integrations such as Kubernetes or Ceph

  • An issue where CMDaemon may not restart the Slurm services automatically when the configured number of CPUs changes for some nodes

  • An issue where CMDaemon may hang waiting for events while stopping

  • An issue where the cmsh call to create a certificate may return before the certificate is written

  • An issue where entering the cmsh biossettings mode may result in an “Error parsing JSON” error message

  • In some cases, an issue with configuring Slurm when automatic GPU configuration by BCM has been selected

  • In some cases, an issue with setting up Etcd due to insufficient permissions to access the Etcd certificate files

  • An issue where a WLM job process ID may be added to an incorrect cgroup, which in some cases may result in the process being killed when another WLM job running on the same node completes

  • An issue with collecting GPU job metrics for containerized Pyxis jobs

cm-kubernetes-setup

New Features

  • Use Calico 3.27, Run:ai 2.15.2, and GPU Operator v23.9.1 for new Kubernetes deployments using cm-kubernetes-setup

Improvements

  • Allow the option to choose Network Operator version 23.10.0

  • Allow the option to configure a custom Kubernetes Ingress certificate
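    As general Kubernetes background for the custom Ingress certificate option (cm-kubernetes-setup collects the certificate through its own dialogs; the command below is standard kubectl usage shown only for illustration), an Ingress TLS certificate and key are typically stored as a Kubernetes TLS secret that the Ingress resource references:

        $ kubectl create secret tls my-ingress-cert --cert=ingress.crt --key=ingress.key -n my-namespace

    Here my-ingress-cert, ingress.crt, ingress.key, and my-namespace are placeholder names.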

Fixed Issues

  • An issue with setting up Run:ai which could prevent the Run:ai cluster installer from completing successfully

  • An issue with the interactive uninstall question in cm-kubernetes-setup when the Kubernetes API is not responsive

  • In some cases, an issue where cm-kubernetes-setup may not wait for the required nodes, such as control-plane or worker nodes, to come back up after a reboot

cm-lite-daemon

Improvements

  • Added new metrics for the total traffic on network interfaces

cm-wlm-setup

Fixed Issues

  • In some cases, an issue with installing Pyxis on multi-arch or multi-distro software images

  • Pyxis enroot is now configured to use its internal default for the cache directory, which was previously set to a directory under /run

cmsh

New Features

  • Added a new cmsh “multiplexers” command in monitoring setup mode, which shows which nodes will run a specified data producer on behalf of other entities
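    A sketch of how the new command might be invoked (the exact syntax and arguments are illustrative assumptions; ssh2node is used here only as an example data producer):

        $ cmsh
        [headnode]% monitoring setup
        [headnode->monitoring->setup]% multiplexers ssh2node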

Fixed Issues

  • An issue with importing a JSON file that was exported on a different cluster

  • An issue with entering the SlurmJobQueueAccessList submode of the SlurmSubmit role when the role is assigned directly to a node

pythoncm

Improvements

  • Added a new pythoncm example script, total-job-power-usage.py, for calculating the power usage of WLM jobs