Release notes for NVIDIA Base Command™ Manager (BCM) 10.23.10

Released: 23 October 2023

General

New Features

  • Added mlnx-ofed23.07 package

  • Added cm-pmix4 package

Improvements

  • Added drainstatus to cm-diagnose

  • Updated cuda-driver package to 535.104.12

  • Updated cm-libprometheus package to 0.47.0

  • Updated cm-openssl package to 3.1.3

CMDaemon

New Features

  • Added advanced config flag DisableRemoteShell to disable all remote shell RPCs

  • Added events for Cumulus service management operations

Improvements

  • Added cmsh clone device option to increment IP addresses by values other than 1

  • Allow the IP address of a lite node to be set during cmsh device add

  • Display an error when setting an invalid software image in cmsh

  • Update /etc/resolv.conf via netconfig on SLES15 instead of writing the file directly

  • Added the ability to add model and serial number information to new switches (ZTP)

  • Kill the active ramdisk creation process when a software image is removed
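
As an illustration of the clone option above that increments IP addresses by values other than 1, the following Python sketch computes the addresses that a series of clones would receive for a given step. It is not CMDaemon code and does not show cmsh syntax; the function name clone_ips, its parameters, and the example addresses are hypothetical.

```python
import ipaddress
from typing import List


def clone_ips(base_ip: str, count: int, step: int = 1) -> List[str]:
    """Return the IPs of `count` clones, each `step` addresses after the previous one.

    Hypothetical helper for illustration only; it mimics the arithmetic,
    not the actual CMDaemon behaviour.
    """
    base = ipaddress.ip_address(base_ip)
    # The i-th clone is offset from the source device by step * i addresses.
    return [str(base + step * i) for i in range(1, count + 1)]


if __name__ == "__main__":
    # Three clones of a device at 10.141.0.1, spaced two addresses apart.
    for ip in clone_ips("10.141.0.1", count=3, step=2):
        print(ip)
```

With a step of 2, consecutive clones are spaced two addresses apart, and the arithmetic carries into the next octet when the last octet overflows.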

Fixed Issues

  • Fixed provisioning trigger when an image name starts with the name of another image

  • Allow cm-cmd-ports --get to work without an active cmd

  • Prevent the “Reboot required: Interfaces have been modified” event from being shown for a node that has a VLAN interface on a bridge interface that includes a bond interface

  • Fixed cm-burn failing to complete successfully when both the pre and post sections are absent

  • Image updates on provisioning nodes now wait for provisioning operations on other nodes to complete before proceeding

  • Allow a Slurm drain reason to be appended or skipped when a health check fails with the drain action enabled

  • Fixed crash of pythoncm parallel node termination function

  • Fixed an edge case that caused hostlist generation failures when there are three numeric fields in the hostname

  • Fixed service management for cm-lite-daemon
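
The first fix in this list concerns an image whose name starts with the name of another image. The Python sketch below illustrates the general pitfall of prefix matching on names or paths; it is not CMDaemon's implementation, and the function names and example paths are hypothetical.

```python
from pathlib import PurePosixPath


def matches_by_prefix(path: str, image_dir: str) -> bool:
    # Naive check: a plain string prefix test also matches sibling
    # directories whose names merely start with the image name.
    return path.startswith(image_dir)


def matches_by_component(path: str, image_dir: str) -> bool:
    # Safer check: compare whole path components, so the path must be
    # the image directory itself or lie somewhere below it.
    p, d = PurePosixPath(path), PurePosixPath(image_dir)
    return p == d or d in p.parents


if __name__ == "__main__":
    image = "/cm/images/default-image"                # hypothetical image directory
    changed = "/cm/images/default-image-new/etc"      # belongs to a different image
    print(matches_by_prefix(changed, image))
    print(matches_by_component(changed, image))
```

Comparing whole components rather than raw string prefixes keeps a change in one image from being attributed to another image with a shorter, overlapping name.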

cm-scale

Fixed Issues

  • Allow terminated cloud nodes to be started when their state is one of the node installer states

  • Terminate AWS spot instance requests that are no longer needed

  • Fixed the termination of cloud nodes when multiple clone operations are issued in parallel

  • Fixed the startup of nodes by cm-scale when Slurm sets the predicted start time of a job in the future

  • Fixed handling of job arrays with a range from 1 to a number with more than one digit

Cloud

New Features

  • Added support for AWS FSx on Ubuntu for cmjob

Improvements

  • Improved error message when starting a cloud node with incorrect VPC/subnet configuration

Fixed Issues

  • Fixed issue with cm-cloud-storage-setup when using us-east-1 region

  • Prevent a cloud instance that is terminated while the cloud director is down from being listed as UP+terminated

  • Fixed starting spot instances after a no-capacity condition occurs in an availability zone

  • Unfulfilled spot instance requests now stay in the PENDING state until fulfilled or terminated

  • Store availability zones for networks created by COD or manually, which enables AutoScaler to distribute loads between availability zones in COD deployments

Kubernetes

New Features

  • Added support for NGC token authentication in cm-kubernetes-setup

Improvements

  • Made the wizard fail earlier when it should (incorrect return code checks previously caused the installer to fail confusingly at later stages)

  • Kubernetes wizard errors now show more context information where possible

  • Increased timeouts for kubeadm init and clusterctl init operations to handle slow connections

Fixed Issues

  • The add user wizard now uses the BCM user name instead of the commonName

Workload Management

New Features

  • Added enroot and enroot+caps packages

Fixed Issues

  • Update the state of AWS spot instances in Slurm when they are terminated outside of BCM

Container Engines

Improvements

  • Improved internal IP detection logic for etcd (similar to the internal IP detection for Kubernetes Calico and Flannel)

Monitoring

New Features

  • Added Prometheus /rules, /alert, and /alertmanagers endpoints

  • Added operstate metrics (operational state, i.e., UP/DOWN) via cm-lite-daemon for Cumulus switches

Improvements

  • Display K/M/G in cmsh for consolidated averages when no unit is set for a metric

Fixed Issues

  • Added support for running the health check with storcli in addition to megacli

Cluster on Demand

Improvements

  • Improved the display of the EULA when running from a Docker image

  • Allow CMDaemon to work with a cluster-on-demand cluster that spans multiple regions (requires manual setup)

Base View

Improvements

  • Provide notifications in Base View if BCM package updates are available

  • Visualize used and available licensed GPUs in Base View