Release notes for NVIDIA Base Command™ Manager (BCM) 10.23.11
Released: 16 November 2023
General
New Features
Added support for SLES15 SP5
Improvements
Changed NVIDIA Container Toolkit default values for accept-nvidia-visible-devices-as-volume-mounts (false -> true) and accept-nvidia-visible-devices-envvar-when-unprivileged (true -> false)
Updated cuda-driver package to 535.129.03
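To check whether a node already carries the new NVIDIA Container Toolkit values listed above, the two settings can be read from the toolkit configuration. The following is a minimal sketch, not a BCM tool. It assumes the settings are top-level keys in a TOML file (commonly /etc/nvidia-container-runtime/config.toml, which is an assumption here), and the helper names and sample text are hypothetical. A key that is absent or commented out is reported as differing, even though the packaged default may still apply.

```python
# Minimal sketch: report which of the two changed settings differ from the
# new defaults. SAMPLE_CONFIG is illustrative text, not a real config file.
SAMPLE_CONFIG = """
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false
"""

# New default values as stated in these release notes.
NEW_DEFAULTS = {
    "accept-nvidia-visible-devices-as-volume-mounts": True,
    "accept-nvidia-visible-devices-envvar-when-unprivileged": False,
}


def read_top_level_booleans(text):
    """Parse 'key = true|false' lines that appear before the first [section]."""
    values = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if line.startswith("["):
            break  # top-level keys end at the first table header
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if value in ("true", "false"):
            values[key] = value == "true"
    return values


def differs_from_new_defaults(text):
    """Return the keys whose value is missing or differs from NEW_DEFAULTS."""
    found = read_top_level_booleans(text)
    return sorted(k for k, v in NEW_DEFAULTS.items() if found.get(k) != v)
```

To check a real file, pass its contents to differs_from_new_defaults; an empty list means both settings match the new defaults.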
CMDaemon
New Features
Added a cmsh command (wlm grid) to create a timelapse view of the jobs that have run
Added a special default gateway value (255.255.255.255) that selects the gateway provided by dhcpd
Added a cmsh command to show dhcpd leases
Added Border Gateway Protocol (BGP) overview for Cumulus switches
Added Link Layer Discovery Protocol (LLDP) overview for Cumulus switches
Added bootstrap.pem and signature checks in cm-check-certificates and switched from MD5 to SHA1
Improvements
Allow nodes to be automatically powered off or reset upon installer failure
Allow devices to be identified by serial in DHCP
Relaxed SSL checks when registering a new Cumulus switch via ZTP
Improved CMDaemon startup speed in HA mode
Prevent multiple identical failover group statuses
Added a flag to allow changing a user home directory to an existing directory
Added a flag that lets pythoncm.cluster perform entity.commit without suffering from update race conditions
Write chrony.conf instead of ntp.conf in node-installer on RHEL9
Allow role exclude list entries for provisioning to be removed using exclude list snippets starting with ‘+’
Fixed Issues
Fixed counting of nodes and accelerators towards the license limit
Fixed the service status shown in cmsh for a lite-node
Fixed crash in ArchOSInfo::is_arch_os when cm-config-os-arch is not installed on the head node
Store services added to a lite-node in the database
Fixed cmsh imageupdate --pattern <path>
Workload Management
New Features
Automatically configure non-MIG GPUs in Slurm when detected
Updated slurm23.02 packages to version 23.02.6 (CVE-2023-41914)
Added new package pyxis-sources to allow building pyxis in air-gapped environments
Improvements
Allow the management of jobs even if one of the nodes has an incorrect configuration in slurm.conf
Fixed Issues
Fixed configuring AutoDetect in slurm.conf if GRES is set with addtogresconf=no in the slurm client role
Cleaned up database node entries of Slurm jobs that were requeued
Fixed pyxis epilog failure when unpacked images are shared and user does not specify a container name
Install enroot dependencies on Ubuntu 20.04
Container Engines
Improvements
Stopped using deprecated upstream Kubernetes repositories (versions 1.23 and older are no longer available)
Introduced support for RAPIDS Accelerator for Apache Spark in the Jupyter kernel templates
Monitoring
New Features
Collect new DCGM metrics: DCGM_FI_DEV_POWER_VIOLATION and DCGM_FI_DEV_THERMAL_VIOLATION
Added ManagedServicesOk health check to lite devices
Improvements
Increased the variability and frequency of the ssh2node health check to reduce load on the head nodes
Optimized startup of compute nodes in clusters with a large number of nodes and many monitored jobs
Do not use linear interpolation for health check data, but rather the last known value
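The change from linear interpolation to the last known value for health check data can be illustrated with a small sketch. This is not CMDaemon code. The function names and the 0 = PASS, 1 = FAIL encoding are assumptions made for illustration only. With interpolation, a query between a PASS and a FAIL sample yields a value that is neither state; holding the last known value always returns a state that was actually recorded.

```python
from bisect import bisect_right


def last_known_value(samples, t):
    """Return the value of the most recent sample at or before time t."""
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    if i == 0:
        return None  # no sample has been recorded yet
    return samples[i - 1][1]


def linear_interpolation(samples, t):
    """Return the value at time t on the line between the surrounding samples."""
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    if i == 0:
        return None  # before the first sample
    if i == len(samples):
        return samples[-1][1]  # after the last sample: nothing to interpolate
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# (timestamp in seconds, result) pairs; illustrative encoding 0 = PASS, 1 = FAIL
samples = [(0, 0), (60, 1), (120, 0)]

print(linear_interpolation(samples, 30))  # 0.5, which is neither PASS nor FAIL
print(last_known_value(samples, 30))      # 0, the state recorded at t = 0
```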
Fixed Issues
Fixed a monitoring bug which prevented new device metrics from being saved to the database if CMDaemon on the head node was restarted right after they were created
Fixed job-metrics in the base-view monitoring tree