Release notes for NVIDIA Base Command™ Manager (BCM) 10.24.03
Released: 28 March 2024
General
New Features
Added support for RHEL8u9 and ROCKY8u9
Added support for RHEL9u3 and ROCKY9u3
Allow the option to use Jupyter kernels with environments based on Conda and enroot images (see the Conda sketch after this list)
The DGX OS software images included on the head node installer ISOs are now based on DGX OS 6.2.0 (Release 1)
Enable NVSM metrics for DGX systems
Added a new BCM package for NVIDIA Nsight Systems (cm-nsight-systems)
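As an illustration of the Conda-based Jupyter kernel feature above, a user could first prepare a Conda environment containing ipykernel; the environment name and Python version below are examples only, and the kernel itself is then created through the BCM Jupyter integration:
    # Example only: a Conda environment that a Jupyter kernel can be based on
    conda create -n demo-env python=3.11 ipykernel
    conda activate demo-env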
Improvements
Include the GSP firmware with the cuda-driver* packages
Updated DCGM to 3.3.5
Updated cuda-driver to 550.54.15
Updated cuda-driver-535 to 535.161.08
Updated cuda-driver-legacy-470 to 470.239.06
Updated munge to v0.5.16
Updated cm-nvhpc to 24.1
Updated cm-openssl to 3.1.5
Removed the dependency of the cm-nvidia-container-toolkit package on the cuda-driver package, which otherwise can cause some package managers to remove the toolkit package when the CUDA driver is replaced with a different version
Fixed Issues
Increase the stack and nofile limits in cm-config-limits for the root user on Ubuntu 22.04 to prevent possible issues with head nodes hanging under heavy load
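A quick way to verify the raised limits on an Ubuntu 22.04 head node is from a root shell; the exact values depend on the cm-config-limits configuration:
    ulimit -Hn   # hard limit on open files (nofile)
    ulimit -Hs   # hard limit on stack size, in KiB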
CMDaemon
New Features
An issue where CMDaemon may store the Slurm array’s main job information in the CMDaemon monitoring DB, which creates unnecessary entries in the DB since these jobs expand to individual tasks as they are scheduled
Allow the option to open a remote-request-assistance session from within cmsh
CMDaemon user profiles can now include new tokens, such as SET_USER_PROFILE_TOKEN, which allows a cmsh or Base View client connected using a certificate with this profile to set or update the profile setting for users (see the cmsh sketch after this list)
CMDaemon will now generate a combined kubeconfig file in .kube/config in the home directory of a user, containing all clusters the user has access to; this allows the user to connect to the Kubernetes cluster without first loading the environment module (see the kubectl sketch after this list)
Allow the option to use negative matching such as “!resource!=category-name” in the monitoring comparison expressions
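A minimal cmsh sketch of granting the new SET_USER_PROFILE_TOKEN to a profile; the profile name admin-lite is hypothetical, and the exact profile-mode syntax should be checked against the BCM administrator manual:
    [headnode]% profile
    [headnode->profile]% use admin-lite
    [headnode->profile[admin-lite]]% append tokens SET_USER_PROFILE_TOKEN
    [headnode->profile[admin-lite]]% commit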
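With the combined kubeconfig in place, a user could, for example, list the available contexts and query one of the clusters with standard kubectl commands; the context name is a placeholder:
    kubectl config get-contexts
    kubectl --context <context-name> get nodes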
Improvements
Improved Prometheus exporter cache management to ensure the memory usage does not grow over time
CMDaemon will now set the InfiniBand (IB) interface GUID as extra values for the IB interfaces, which then can be shown in cmsh / Base View
Exclude the link MAC entry from the Cumulus switch overview information
Allow the option to set the MAC address of non-node devices to the MAC address identified on the respective switch port for the device
Restrict the ability of users whose profile allows them to add other users to also set arbitrary group IDs for the users they create. The behavior can be tuned with advanced configuration options
Increase the Slurm job queue QOS table size in the CMDaemon DB
Redirect all base-view and userportal HTTPS calls from the passive to the active head node
CMDaemon will now send an event when network interfaces are added or removed
Fixed Issues
An issue where setting the extra_values nvlink property to false is not picked up by CMDaemon, causing CMDaemon to continue sampling nvlink metrics
An issue that can result in duplicate entries in the /exporter Prometheus endpoint
An issue where a network switch’s ZTP.sh can point to an incorrect IP address for the head node
An issue where a disabled backup role is still being used for monitoring backup, preventing it from being removed from the target list
An issue where a too-small buffer size for a user’s groups can prevent CMDaemon from correctly listing all of the user’s groups
Ensure the head nodes can provision each other regardless of the ProvisioningRole selections
In some cases, an issue with folding the compute node hostnames when generating the slurm.conf configuration file
An issue with the Prometheus sampler when used with an exporter that only supports HTTP GET
An issue where committing a monitoring action without setting a script can cause CMDaemon on the head node to crash
An issue where Slurm job management operations in cmsh and Base View were unable to handle Slurm job array IDs
An issue where metrics from the AggregateNode producer do not have correct data, or do not have “no data” values, when there are no nodes in a rack
In some cases, an issue that can leave a passive head node CMDaemon process using 100% CPU
In some cases, an issue that may prevent CMDaemon from loading old jobs information from the CMDaemon DB
An issue with sending a test email when using cmsh
An issue with the passive head node forwarding labeled entity information to the active head node, preventing it from being used in PromQL queries
An issue where terminated cloud nodes that are subsequently powered on may still be listed as ‘terminated’ in cmsh
An issue that prevents two different configuration overlays with the same priority and different generic roles from being committed in CMDaemon
Node Installer
Fixed Issues
An issue where setting the frozenFilesPerNode directive may not cause the node-installer to freeze /etc/sysconfig/network on RHEL
Head Node Installer
Improvements
Updated the default partition sizes for the standard RAID1 and RAID5 head node disk layouts to match the sizes of the standard non-RAID layout
Fixed Issues
An issue with head node installations with Lmod where the DefaultModules.lua module file is not created by default, resulting in messages about an empty LMOD_SYSTEM_DEFAULT_MODULES environment variable
cm-kubernetes-setup
Fixed Issues
Improved error reporting when the kubeadm init step fails
cm-scale
New Features
A lack of vCPUs in AWS is now handled in the same way as a lack of capacity
cm-wlm-setup
New Features
Allow the option to set the enroot temporary directory using cm-wlm-setup
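Assuming this option maps onto enroot's standard ENROOT_TEMP_PATH setting, the resulting configuration would resemble the following; the path is an example:
    # /etc/enroot/enroot.conf (illustrative)
    ENROOT_TEMP_PATH /raid/enroot-tmp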
Fixed Issues
Ensure cm-wlm-setup can install AGE 2023.1.1 (8.8.1)
cmsh
Improvements
Added a new cmsh WLM jobs mode command, pidsgpus, to list the PIDs and GPUs used by a WLM job
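An illustrative cmsh session for the new command; the WLM instance name and job ID are examples:
    [headnode]% wlm use slurm
    [headnode->wlm[slurm]]% jobs
    [headnode->wlm[slurm]->jobs]% pidsgpus 12345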
Fixed Issues
An issue in cmsh user mode with a case-sensitive comparison of profile names
pyxis-sources
New Features
Updated pyxis sources package to 0.17.0
slurm23.11
New Features
Added Slurm 23.11 packages. The cm-setup and cmdaemon packages need to be updated to their most recent versions to support Slurm 23.11 setup
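For example, on a RHEL-family head node the prerequisite packages could be brought up to date before running the Slurm 23.11 setup with cm-wlm-setup:
    dnf update cm-setup cmdaemon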