Release notes for NVIDIA Base Command™ Manager (BCM) 10.24.05
Released: 24 May 2024
General
New Features
Added mlnx-ofed24.01 package
Added CUDA 12.4 toolkit packages
Added PBS Professional 2024 packages
Replaced cmdaemon-apidocs with cm-api-docs; the documentation can be accessed via the landing page
Added cm-nsight-systems-cli package containing the latest CLI version of the nsight-systems package, replacing the current cm-nsight-systems package
Improvements
Updated mlnx-ofed23.10 to 23.10-2.1.3.1
For new Ubuntu 22.04 head node installations, use fixed port numbers for the NFS lockd, statd, and mountd daemons
Fixed Issues
An issue that prevents cm-diagnose from completing when single quotes are used in cmd.conf for dbuser/dbpass
An issue with the runtime and PID path settings in the nvidia-persistenced service unit file from the cuda-driver-* packages
CMDaemon
New Features
Mark the devices monitored by MQTT as UP when recent monitoring data exists
Allow the option to run Slurm accounting database daemon in high-availability mode
Improvements
Added a per-node Slurm state metric
Allow the option to automatically select a random free port for the IMEX service
Allow the option to define an exclude list in the network interfaces healthcheck to skip specified interfaces
Added /var/lib/rancher/.* to the default exclude list for the ProcMounts monitoring producer, which otherwise can create unnecessary monitoring metrics
Improved performance of the cm-mqtt service
Allow the option to automatically set the MAC addresses of the bond and its members in the CMDaemon node interface entities when the node boots
Include the perm-mac-address under a bond interface when verifying the license, which resolves an issue with verifying the license when a bond network interface is created after the license is requested
Added lxc* interfaces to the default exclude list for the ProcNetDev monitoring producer
Allow the option to disable MQTT with a flag in the configuration file
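The exclude lists mentioned in the improvements above can be illustrated with a minimal sketch. It assumes the entries are regular expressions that must match the whole name. The function name filter_excluded and the sample mount points are hypothetical, and this is not CMDaemon's actual implementation.

```python
import re

# Hypothetical exclude list, modelled on the default entry named in these notes.
EXCLUDE_PATTERNS = [r"/var/lib/rancher/.*"]


def filter_excluded(names, patterns=EXCLUDE_PATTERNS):
    """Return the names that match none of the exclude patterns."""
    compiled = [re.compile(p) for p in patterns]
    # Assumption: a pattern must match the whole name, not just a substring.
    return [n for n in names if not any(c.fullmatch(n) for c in compiled)]


# Sample mount points (made up for illustration).
mounts = ["/", "/home", "/var/lib/rancher/k3s/agent", "/var/lib/ranchero"]
kept = filter_excluded(mounts)
```

With these sample values, only the path under /var/lib/rancher/ is dropped, so no metrics would be created for it.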
Fixed Issues
An issue with validating the LDAP group when committing a user
An issue with clearing the memory when a large number of entities that have failing health checks are added and then removed
An issue where DNS allow-query configuration entries are not added on edge directors for Kubernetes networks, preventing queries from these networks
An issue where monitoring consolidators may not be created for all entity-measurable pairs
An issue where a temporary resolv.conf bind mount created in the software images is being added to the CMDaemon monitoring database
Added a retry mechanism around gethostbyname when writing the IMEX configuration files, which can otherwise throw an exception
An issue with chargeback calculations using per node requested CPU/GPU information
An issue in the drain action manager code that in some cases can lead to high CMDaemon memory usage
An issue with determining the number of requested CPUs for multi-node Slurm jobs when storing the jobs information in CMDaemon
An issue which can lead to high CMDaemon memory usage if the post-provisioning monitoring-resume operation has failed
An issue with hard-coded references to /sbin/arping which in some cases can prevent CMDaemon from using arping in the event of a failover
An issue in the RPC status code which in some cases can result in an infinite recursion on the passive head node in the event of a failover
An issue with the node profile missing the UPDATE_CONFIG_FILES_AFTER_IMAGE_UPDATE_TOKEN which prevents the /cm/conf files from being copied from the software image to the provisioned nodes when using non-head-node provisioners
An issue with the reporting of the GPU chargeback information
An issue where a category can be removed while another category’s provisioning role still has a reference to it
An issue where Azure cloud compute nodes are cloned with an incorrect power status when the original node’s power status is ON
An issue where failed AWS node power-on actions produce the error message “Unable to parse output” instead of the AWS error message
An issue with the cmsh dropunused command that can result in removing too many measurables
An issue with the cmsh device syncinfo command when specifying an fspart path
Base View
Fixed Issues
An issue with displaying the SNMP system information data for switches
Cluster Tools
Fixed Issues
An issue where cm-mysql-sanitize.py, which is required by cm-diagnose, is not part of the cluster-tools package
COD
New Features
Added support for creating HA COD clusters in Azure
Allow the option to skip shared storage setup with the cm-cloud-ha-setup tool
Added support for OCI defined tags. This changes the original --head-node-tags command line option to --head-node-freeform-tags and adds a new command line option --head-node-defined-tags
Improvements
Allow the option to select the Azure availability zone on the command line of the cluster create command
Machine Learning
New Features
Introduced ML NCCL and cuDNN packages for CUDA 12.4
Fixed Issues
An issue where WLM kernels may be unexpectedly restarted if one of the kernels fails to start
cm-clone-install
Fixed Issues
An issue with handling the configuration of bond interfaces and bond members on the Ubuntu base distribution
cm-cluster-extension
Fixed Issues
An issue where ‘germany’ is incorrectly listed as an Azure region
cm-create-image
Fixed Issues
An issue where missing modular metadata for the ‘default’ package group on the RHEL8 and RHEL9 ISOs can prevent the creation of software images
cm-kubernetes-setup
Improvements
Ensure the /var/lib/etcd directory has the correct permissions (0700) so that an etcd member can join the etcd cluster
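The permission requirement above can be checked by hand with a short sketch. The helper name has_mode is hypothetical, and this is not how cm-kubernetes-setup itself performs the check.

```python
import os
import stat


def has_mode(path, expected=0o700):
    """Return True if the permission bits of path equal `expected`."""
    return stat.S_IMODE(os.stat(path).st_mode) == expected
```

On a node that runs etcd, has_mode("/var/lib/etcd") should then return True.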
Fixed Issues
A regression in cm-kubernetes-setup that allows the user to select nodes in a way that results in the overlap of compute nodes between different Kubernetes clusters
An issue where cm-kubernetes-setup --pull is unable to complete if the pod is evicted due to disk pressure while the images are being pulled
cm-scale
New Features
Allow the option to reboot the nodes in FULL install mode when the cm-scale engine is changed
Fixed Issues
An issue where the shutdown state from files may be used incorrectly
cm-wlm-setup
Fixed Issues
Setting up pyxis will no longer configure it to clean the data directory from epilog since enroot can perform this automatically
cmsh
Improvements
Allow the option to specify the IP increment with the cmsh addinterface command
Include the job run time data in the cmsh WLM jobs info command
Allow the option to specify the network CIDR on the cmsh “add network” command line
Fixed Issues
An issue where the cmsh monitoring trigger info command does not show grouped expressions
An issue with importing older formats of the .cmshhistory file, which can result in duplicating all entries in the cmsh command history
jupyter
New Features
Allow the option to use sqsh files to run Jupyter kernels based on enroot
Restrict the access to Jupyter based on group memberships
Improvements
Allow the option to install and configure VNC when setting up Jupyter
pythoncm
Fixed Issues
An issue in the pythoncm cluster.py implementation where an incorrect logger variable name is being used
pyxis-sources
Improvements
Updated pyxis-sources to 0.19.0