Release notes for NVIDIA Base Command™ Manager (BCM) 10.23.10
Released: 23 October 2023
General
New Features
Added mlnx-ofed23.07 package
Added cm-pmix4 package
Improvements
Added drainstatus to cm-diagnose
Updated cuda-driver package to 535.104.12
Updated cm-libprometheus package to 0.47.0
Updated cm-openssl package to 3.1.3
CMDaemon
New Features
Added advanced config flag DisableRemoteShell to disable all remote shell RPC
Added events for Cumulus service management operations
Improvements
Added cmsh clone device option to increment IP addresses by values other than 1
Allow lite node IP to be set during cmsh device add
Display an error when setting an invalid software image in cmsh
Update /etc/resolv.conf via netconfig on SLES15 instead of writing file
Added the ability to add model and serial number information to new switches (ZTP)
Kill active ramdisk create process when software image is removed
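The clone option that increments IP addresses by a value other than 1 comes down to address arithmetic. A minimal Python sketch, assuming each clone receives the source address plus a multiple of the step; `cloned_ips` is a hypothetical helper for illustration, not part of cmsh or pythoncm:

```python
import ipaddress
from typing import List


def cloned_ips(base_ip: str, count: int, step: int = 1) -> List[str]:
    """Return `count` addresses after base_ip, spaced `step` apart (assumed scheme)."""
    base = ipaddress.ip_address(base_ip)
    # Clone i gets base + i * step, so step=4 yields .5, .9, .13 from .1
    return [str(base + step * i) for i in range(1, count + 1)]
```

For example, `cloned_ips("10.141.0.1", 3, step=4)` spaces the three clones four addresses apart.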
Fixed Issues
Fixed provisioning trigger when an image name starts with the name of another image
Allow cm-cmd-ports --get to work without an active cmd
Prevent “Reboot required: Interfaces have been modified” event from being shown for a node if the node has a VLAN interface on a Bridge interface that includes a bond interface
Fixed cm-burn failing to complete successfully when both the pre and post sections are absent
Image updates on provisioning nodes now wait for provisioning operations on other nodes to complete before proceeding
Allow appending a Slurm drain reason, or skipping it, when a healthcheck fails with the drain action enabled
Fixed crash of pythoncm parallel node termination function
Fixed an edge case that causes hostlist generation failures when there are 3 numeric fields in the hostname
Fixed service management for cm-lite-daemon
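These notes do not say what caused the provisioning trigger issue for image names that start with the name of another image. One common source of this class of bug is plain string-prefix matching, and the Python sketch below illustrates it under that assumption. The function names and example paths are hypothetical and are not taken from CMDaemon:

```python
def triggers_prefix(changed_path: str, image_path: str) -> bool:
    """Naive check: any path that begins with image_path counts as inside it."""
    return changed_path.startswith(image_path)


def triggers_component(changed_path: str, image_path: str) -> bool:
    """Stricter check: match only on whole path components."""
    root = image_path.rstrip("/")
    # Require an exact match or a path separator right after the image root
    return changed_path == root or changed_path.startswith(root + "/")
```

With images at `/cm/images/default-image` and `/cm/images/default-image-gpu`, the naive check reports a change in the second image as a change in the first, while the component-wise check does not.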
cm-scale
Fixed Issues
Allow starting terminated cloud nodes that are in one of the node installer states
Terminate useless AWS spot instance requests
Fixed the termination of cloud nodes when multiple clone operations are issued in parallel
Fixed the startup of nodes by cm-scale when Slurm sets a job's predicted start time in the future
Fixed handling of job arrays whose range runs from 1 to a number with more than one digit
Cloud
New Features
Added support for AWS FSx on Ubuntu for cmjob
Improvements
Improved error message when starting a cloud node with incorrect VPC/subnet configuration
Fixed Issues
Fixed issue with cm-cloud-storage-setup when using us-east-1 region
Prevent cloud instances that are terminated while the cloud director is down from being listed as UP+terminated
Fixed starting spot instances after a no-capacity condition occurs in an availability zone
Unfulfilled spot instance requests stay in PENDING state until fulfilled or terminated
Store availability zones for networks created by COD or manually, which enables AutoScaler to distribute loads between availability zones in COD deployments
Kubernetes
New Features
Added support for NGC token authentication in cm-kubernetes-setup
Improvements
Improved the wizard so that it fails earlier when it should (incorrect return code checks caused the installer to fail confusingly at later stages)
Kubernetes wizard errors will now show more context information where possible
Increased timeouts for kubeadm init and clusterctl init operations to effectively handle slow connections
Fixed Issues
The add user wizard now uses the BCM user name instead of commonName
Workload Management
New Features
Added enroot and enroot+caps packages
Fixed Issues
Update AWS spot instances state in Slurm when they are terminated outside of BCM
Container Engines
Improvements
Improved internal IP detection logic for etcd (similar to the internal IP detection for Kubernetes Calico and Flannel)
Monitoring
New Features
Added Prometheus /rules, /alert, and /alertmanagers endpoints
Added operstate metrics (operational state, i.e., UP/DOWN) via cm-lite-daemon for Cumulus switches
Improvements
Display K/M/G in cmsh for consolidated averages when no unit is set for a metric
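The K/M/G display for unitless consolidated averages can be pictured as a simple magnitude scaling. A minimal Python sketch, assuming decimal (base 1000) steps and one decimal place; the notes do not state which base or precision cmsh uses, and `human_readable` is a hypothetical name:

```python
def human_readable(value: float, base: int = 1000) -> str:
    """Scale value down by `base` until it fits, then append K, M, or G."""
    for suffix in ("", "K", "M", "G"):
        # Stop at the first suffix that fits; G is the largest suffix used here
        if abs(value) < base or suffix == "G":
            return f"{value:.1f}{suffix}"
        value /= base
    return f"{value:.1f}"  # not reached; kept so every path returns a string
```

Passing `base=1024` switches the same logic to binary multiples.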
Fixed Issues
Added support for running the healthcheck with storcli software alongside megacli software
Cluster on Demand
Improvements
Improved the display of the EULA when running from docker image
Allow CMDaemon to work with cluster-on-demand cluster spanning multiple regions (requires manual setup)
Base View
Improvements
Provide notifications in Base View if BCM package updates are available
Visualize used and available licensed GPUs in Base View