Release notes for NVIDIA Base Command™ Manager (BCM) 10.23.11

Released: 16 November 2023

General

New Features

  • Added support for SLES15 SP5

Improvements

  • Changed NVIDIA Container Toolkit default values for accept-nvidia-visible-devices-as-volume-mounts (false -> true) and accept-nvidia-visible-devices-envvar-when-unprivileged (true -> false)

  • Updated cuda-driver package to 535.129.03
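
To check which values a given node actually uses for the two NVIDIA Container Toolkit options above, a small script can read them from the toolkit configuration. The following is a minimal sketch, not part of BCM: it assumes the options are top-level boolean keys in a TOML file (commonly /etc/nvidia-container-runtime/config.toml, but that path is an assumption here), and the helper name read_top_level_bools and the SAMPLE text are hypothetical.

```python
# Hypothetical helper: report the two NVIDIA Container Toolkit options named above.
# Assumption: both options are top-level boolean keys in a TOML config file.

KEYS = (
    "accept-nvidia-visible-devices-as-volume-mounts",
    "accept-nvidia-visible-devices-envvar-when-unprivileged",
)


def read_top_level_bools(text, keys=KEYS):
    """Return {key: bool} for the requested keys found before the first [table]."""
    found = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.startswith("["):  # the first table header ends the top level
            break
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key in keys and value in ("true", "false"):
            found[key] = value == "true"
    return found


# Illustrative config text reflecting the new defaults listed in these notes.
SAMPLE = """\
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false

[nvidia-container-cli]
# table contents are ignored by this sketch
"""

if __name__ == "__main__":
    for key, value in read_top_level_bools(SAMPLE).items():
        print(f"{key} = {value}")
```

To inspect a real node, the contents of its configuration file could be passed to read_top_level_bools in place of SAMPLE. Keys that are commented out or missing are simply absent from the result.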

CMDaemon

New Features

  • Added a cmsh command (wlm grid) to create a timelapse view of the jobs that have run

  • Added a special default gateway value (255.255.255.255) to use the one provided by dhcpd

  • Added a cmsh command to show dhcpd leases

  • Added Border Gateway Protocol (BGP) overview for Cumulus switches

  • Added Link Layer Discovery Protocol (LLDP) overview for Cumulus switches

  • Added bootstrap.pem and signature checks in cm-check-certificates and switched from MD5 to SHA1

Improvements

  • Allow nodes to be automatically powered off or reset upon installer failure

  • Allow devices to be identified by serial in DHCP

  • Relaxed SSL checks when registering a new Cumulus switch via ZTP

  • Improved CMDaemon startup speed in HA mode

  • Prevent multiple identical failover group statuses

  • Added a flag to allow changing a user home directory to an existing directory

  • Added a pythoncm.cluster flag that allows entity.commit to proceed without suffering from update race conditions

  • Write chrony.conf instead of ntp.conf in node-installer on RHEL9

  • Allow role exclude list entries for provisioning to be removed using exclude list snippets starting with ‘+’

Fixed Issues

  • Fixed counting of nodes and accelerators towards the license limit

  • Fixed the service status shown in cmsh for a lite-node

  • Fixed crash in ArchOSInfo::is_arch_os when cm-config-os-arch is not installed on the head node

  • Store services added to a lite-node in the database

  • Fixed cmsh imageupdate --pattern <path>

Workload Management

New Features

  • Automatically configure non-MIG GPUs in Slurm when detected

  • Updated slurm23.02 packages to version 23.02.6 (CVE-2023-41914)

  • Added new package pyxis-sources to allow building pyxis in air-gapped environments

Improvements

  • Allow the management of jobs even if one of the nodes has an incorrect configuration in slurm.conf

Fixed Issues

  • Fixed configuring AutoDetect in slurm.conf if GRES is set with addtogresconf=no in the slurm client role

  • Cleaned up database node entries of Slurm jobs that were requeued

  • Fixed pyxis epilog failure when unpacked images are shared and user does not specify a container name

  • Install enroot dependencies on Ubuntu 20.04

Container Engines

Improvements

  • Stopped using deprecated upstream Kubernetes repositories (versions 1.23 and older are no longer available)

  • Introduced support for RAPIDS Accelerator for Apache Spark in the Jupyter kernel templates

Monitoring

New Features

  • Collect new DCGM metrics: DCGM_FI_DEV_POWER_VIOLATION and DCGM_FI_DEV_THERMAL_VIOLATION

  • Added ManagedServicesOk health check to lite devices

Improvements

  • Increased the variability and frequency of the ssh2node health check to reduce load on the head nodes

  • Optimized startup of compute nodes in clusters with a large number of nodes and many monitored jobs

  • Do not use linear interpolation for health check data, but rather the last known value
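
The difference between linear interpolation and last-known-value lookup for health check data can be illustrated with a short sketch. This is not CMDaemon code: the function names, the numeric encoding (0 for PASS, 1 for FAIL), and the sample timestamps are all assumptions made for illustration. It shows why interpolating between discrete health check states can yield values that correspond to no real state.

```python
from bisect import bisect_right


def last_known_value(samples, t):
    """Return the value of the most recent sample at or before time t, else None.

    samples is a list of (timestamp, value) pairs sorted by timestamp.
    """
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    return samples[i - 1][1] if i else None


def linear_interpolation(samples, t):
    """Return the value at time t, linearly interpolated between neighbours."""
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    if i == 0:
        return None  # t is before the first sample
    if i == len(samples):
        return samples[-1][1]  # t is at or after the last sample
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Hypothetical health check history: (seconds, state) with 0 = PASS, 1 = FAIL.
SAMPLES = [(0, 0), (60, 1), (120, 0)]

if __name__ == "__main__":
    for t in (30, 60, 90):
        print(t, last_known_value(SAMPLES, t), linear_interpolation(SAMPLES, t))
```

At t = 30 the interpolated result is 0.5, which is neither PASS nor FAIL under this encoding, whereas the last known value is 0 (PASS).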

Fixed Issues

  • Fixed a monitoring bug that prevented new device metrics from being saved to the database if CMDaemon on the head node was restarted right after they were created

  • Fixed job-metrics in the base-view monitoring tree