Release notes for NVIDIA Base Command™ Manager (BCM) 10.24.05

Released: 24 May 2024

General

New Features

  • Added mlnx-ofed24.01 package

  • Added CUDA 12.4 toolkit packages

  • Added PBS Professional 2024 packages

  • Replaced cmdaemon-apidocs with cm-api-docs; the documentation can be accessed via the landing page

  • Added the cm-nsight-systems-cli package, which contains the latest CLI version of nsight-systems and replaces the current cm-nsight-systems package

Improvements

  • Updated mlnx-ofed23.10 to 23.10-2.1.3.1

  • For new Ubuntu 22.04 head node installations, use fixed port numbers for the NFS lockd, statd, and mountd daemons
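As an illustration of the fixed NFS ports noted above, the following minimal sketch reads an nfs.conf-style fragment and reports which of the lockd, statd, and mountd daemons have a fixed port. The port numbers (4001 to 4003), the sample text, and the fixed_ports helper are assumptions made for the example. The release notes do not state which ports BCM assigns or which file it writes them to.

```python
import configparser

# Hypothetical nfs.conf-style fragment; the port values are made up for
# illustration and are not taken from BCM.
SAMPLE_NFS_CONF = """
[lockd]
port = 4001
udp-port = 4001

[mountd]
port = 4002

[statd]
port = 4003
"""


def fixed_ports(conf_text):
    """Return {daemon: port} for each daemon that has a 'port' option set."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    ports = {}
    for daemon in ("lockd", "statd", "mountd"):
        # has_option() is False when the section or the option is missing.
        if parser.has_option(daemon, "port"):
            ports[daemon] = parser.getint(daemon, "port")
    return ports


if __name__ == "__main__":
    for daemon, port in fixed_ports(SAMPLE_NFS_CONF).items():
        print(f"{daemon}: {port}")
```

Fixed ports of this kind are typically wanted so that firewall rules can name them explicitly, which is not possible when the daemons pick ports dynamically.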

Fixed Issues

  • An issue that prevents cm-diagnose from completing when single quotes are used in cmd.conf for dbuser/dbpass

  • An issue with the runtime and PID path settings in the nvidia-persistenced service unit file from the cuda-driver-* packages

CMDaemon

New Features

  • Mark the devices monitored by MQTT as UP when recent monitoring data exists

  • Allow the option to run Slurm accounting database daemon in high-availability mode

Improvements

  • Added a per-node Slurm state metric

  • Allow the option to automatically select a random free port for the IMEX service

  • Allow the option to define an exclude list in the network interfaces healthcheck to skip specified interfaces

  • Added /var/lib/rancher/.* to the default exclude list for the ProcMounts monitoring producer, which otherwise can create unnecessary monitoring metrics

  • Improved performance of the cm-mqtt service

  • Allow the option to automatically set the MAC addresses of the bond and its members in the CMDaemon node interface entities when the node boots

  • Include the perm-mac-address under a bond interface when verifying the license, which resolves an issue with verifying the license when a bond network interface is created after the license is requested

  • Added lxc* interfaces to the default exclude list for the ProcNetDev monitoring producer

  • Allow the option to disable MQTT with a flag in the configuration file
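The automatic selection of a free port mentioned above for the IMEX service can be illustrated with a common technique: binding a socket to port 0 and letting the operating system pick an unused port. This is a minimal sketch of that general technique only. The release notes do not describe how CMDaemon chooses the port, and pick_free_port is a hypothetical helper name.

```python
import socket


def pick_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        # getsockname() returns (address, port) for AF_INET sockets.
        return sock.getsockname()[1]


if __name__ == "__main__":
    print(pick_free_port())
```

The port is released when the socket closes, so another process could claim it before the real service binds to it. A production implementation would need to handle that race.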

Fixed Issues

  • An issue with validating the LDAP group during commit of a user

  • An issue with clearing the memory when a large number of entities that have failing health checks are added and then removed

  • An issue where DNS allow-query configuration entries are not added on edge directors for Kubernetes networks, preventing queries from these networks

  • An issue where monitoring consolidators may not be created for all entity-measurable pairs

  • An issue where a temporary resolv.conf bind mount created in the software images is being added to the CMDaemon monitoring database

  • An issue where writing the IMEX configuration files can throw an exception when gethostbyname fails; a retry mechanism has been added around the call

  • An issue with chargeback calculations using per-node requested CPU/GPU information

  • An issue in the drain action manager code that in some cases can lead to high CMDaemon memory usage

  • An issue with determining the number of requested CPUs for multi-node Slurm jobs when storing the jobs information in CMDaemon

  • An issue which can lead to high CMDaemon memory usage if the post-provisioning monitoring-resume operation has failed

  • An issue with hard-coded references to /sbin/arping which in some cases can prevent CMDaemon from using arping in the event of a failover

  • An issue in the RPC status code which in some cases can result in an infinite recursion on the passive head node in the event of a failover

  • An issue with the node profile missing the UPDATE_CONFIG_FILES_AFTER_IMAGE_UPDATE_TOKEN which prevents the /cm/conf files from being copied from the software image to the provisioned nodes when using non-head-node provisioners

  • An issue with the reporting of the GPU chargeback information

  • An issue where a category can be removed while another category’s provisioning role still has a reference to it

  • An issue where Azure cloud compute nodes are cloned with an incorrect power status when the original node’s power status is ON

  • An issue where failed AWS node power-on actions produce the error message “Unable to parse output” instead of the AWS error message

  • An issue with the cmsh dropunused command that can result in removing too many measurables

  • An issue with the cmsh device syncinfo command when specifying an fspart path
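The retry around gethostbyname mentioned above for the IMEX configuration files follows a general pattern that can be sketched as below. This is an illustrative Python version, not CMDaemon's code. The function name resolve_with_retry, the attempt count, and the delay are assumptions, and the resolver parameter exists only so the behavior can be exercised without a real DNS lookup.

```python
import socket
import time


def resolve_with_retry(hostname, attempts=3, delay=0.5,
                       resolver=socket.gethostbyname):
    """Resolve hostname, retrying on lookup errors up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return resolver(hostname)
        except OSError as exc:
            # socket.gaierror and socket.herror are subclasses of OSError.
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)
    # Every attempt failed: surface the last lookup error to the caller.
    raise last_error


if __name__ == "__main__":
    print(resolve_with_retry("localhost"))
```

Retrying helps with transient lookup failures. A persistent failure still raises after the last attempt, so the caller can report it.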

Base View

Fixed Issues

  • An issue with displaying the SNMP system information data for switches

Cluster Tools

Fixed Issues

  • An issue where cm-mysql-sanitize.py, which is required by cm-diagnose, is not part of the cluster-tools package

COD

New Features

  • Added support for creating HA COD clusters in Azure

  • Allow the option to skip shared storage setup with the cm-cloud-ha-setup tool

  • Added support for OCI defined tags. This renames the original --head-node-tags command line option to --head-node-freeform-tags and adds a new command line option, --head-node-defined-tags

Improvements

  • Allow the option to select the Azure availability zone on the command line of the cluster create command

Machine Learning

New Features

  • Introduced ML NCCL and CuDNN packages for CUDA 12.4

Fixed Issues

  • An issue where WLM kernels may be unexpectedly restarted if one of the kernels fails to start

cm-clone-install

Fixed Issues

  • An issue with handling the configuration of bond interfaces and bond members on the Ubuntu base distribution

cm-cluster-extension

Fixed Issues

  • An issue where ‘germany’ is incorrectly listed as an Azure region

cm-create-image

Fixed Issues

  • An issue where missing modular metadata for the ‘default’ package group on the RHEL8 and RHEL9 ISOs can prevent the creation of software images

cm-kubernetes-setup

Improvements

  • Ensure the /var/lib/etcd directory has the correct permissions (0700) so that an etcd member can join the etcd cluster
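The 0700 permission requirement above can be checked and corrected with a few lines of Python. This is a minimal sketch of such a check and not the code used by cm-kubernetes-setup. The helper name ensure_mode is an assumption, and the example runs against a temporary directory rather than /var/lib/etcd.

```python
import os
import stat
import tempfile


def ensure_mode(path, mode=0o700):
    """Set the permission bits of path to mode; return True if changed."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    if current != mode:
        os.chmod(path, mode)
        return True
    return False


if __name__ == "__main__":
    # Demonstrate on a scratch directory instead of the real etcd data dir.
    scratch = tempfile.mkdtemp()
    os.chmod(scratch, 0o755)
    print("changed:", ensure_mode(scratch))
    print("mode:", oct(stat.S_IMODE(os.stat(scratch).st_mode)))
    os.rmdir(scratch)
```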

Fixed Issues

  • A regression in cm-kubernetes-setup that allows the user to select nodes in a way that results in the overlap of compute nodes between different Kubernetes clusters

  • An issue where cm-kubernetes-setup --pull is unable to complete if the pod is evicted due to disk pressure while the images are being pulled

cm-scale

New Features

  • Allow the option to reboot the nodes in FULL install mode when the cm-scale engine is changed

Fixed Issues

  • An issue where the shutdown state from files may be used incorrectly

cm-wlm-setup

Fixed Issues

  • Setting up pyxis will no longer configure it to clean the data directory from epilog since enroot can perform this automatically

cmsh

Improvements

  • Allow the option to specify the IP increment with the cmsh addinterface command

  • Include the job run time data in the cmsh WLM jobs info command

  • Allow the option to specify the network CIDR on the cmsh “add network” command line
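The arithmetic behind the network CIDR and IP increment options above can be sketched with Python's ipaddress module. The example shows only what a CIDR expands to and how an increment spaces out consecutive addresses. It does not reproduce cmsh syntax, and the function names and sample addresses are assumptions.

```python
import ipaddress


def describe_cidr(cidr):
    """Split a CIDR string into (base address, prefix length, netmask)."""
    net = ipaddress.ip_network(cidr, strict=True)
    return str(net.network_address), net.prefixlen, str(net.netmask)


def incremented_addresses(start, increment, count):
    """Return count addresses beginning at start, spaced by increment."""
    first = ipaddress.ip_address(start)
    return [str(first + i * increment) for i in range(count)]


if __name__ == "__main__":
    print(describe_cidr("10.141.0.0/16"))
    print(incremented_addresses("10.141.0.1", 2, 3))
```

With strict=True, ip_network rejects a CIDR whose host bits are set, such as 10.141.0.1/16, which catches a common input mistake early.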

Fixed Issues

  • An issue where the cmsh monitoring trigger info command does not show grouped expressions

  • An issue with importing older formats of the .cmshhistory file, which can result in duplicating all entries in the cmsh command history

jupyter

New Features

  • Allow the option to use sqsh files to run Jupyter kernels based on enroot

  • Restrict the access to Jupyter based on group memberships

Improvements

  • Allow the option to install and configure VNC when setting up Jupyter

pythoncm

Fixed Issues

  • An issue in the pythoncm cluster.py implementation where an incorrect logger variable name is being used

pyxis-sources

Improvements

  • Updated pyxis-sources to 0.19.0