Release notes for NVIDIA Base Command™ Manager (BCM) 10.23.09

Released: 22 September 2023

General

NVIDIA Base Command™ Manager (BCM) 10.23.09 is the first public release for version 10, a new major version of NVIDIA cluster management software.

New Features

  • Support for Oracle Cloud Infrastructure for Cluster On Demand

  • Support for NVIDIA Spectrum switches provisioning (Cumulus OS 5) and management via cm-lite-daemon

  • Support for NVIDIA BlueField-2 and BlueField-3 Data Processing Units (DPUs) provisioning (BFB) and management

  • Support for NVIDIA AI Enterprise software versions

  • New DGX SuperPOD post install setup tool (cm-pod-setup)

  • New DGX SuperPOD network configuration setup tool (bcm-netautogen)

  • Switch to GPU-based licensing

  • Add cm-cron service

  • Add cm-list-image-conf-files.py script to list all special files in <image>/cm/conf/

  • Add cuda12.2 packages

  • Add mlnx-ofed23.04 package

  • Add cuda-driver-legacy-470 package to support older datacenter/Tesla GPUs requiring NVIDIA CUDA driver version 470

Improvements

  • Update cm-openssl package to 3.1.2

  • Update mlnx-ofed58 package to 5.8-3.0.7.0

  • Update mlnx-ofed54 package to 5.4-3.7.5.0

  • Update mlnx-ofed49 package to 4.9-7.1.0.0

  • Update mlnx-ofed59 DGX H100 package to 5.9.0.5.6.0.125

CMDaemon

New Features

  • Add cmsh device switchports command to get an overview of available switch ports

  • Send a warning event when a provisioning request has stalled for longer than 2 hours (the default value can be configured)

Improvements

  • Switch to UUIDs to uniquely identify entities

  • Allow cm-mig-manage to support GPUs that do not have index = minorID

  • Turn on MIG on DGX H100 after node reboot when MIG.profiles are set in GPU settings

  • Increase DHCP maximal search domains to 32 by default

  • Add cmsh chassis set members as compact device list

  • Preserve files in /cm/images/<image>/cm/conf/{node,category}/ while updating images with rsync

  • Show an error message when cmsh createramdisk is run without arguments or an image set

  • Improve the daily cron script that creates monthly backup files for openldap-servers so that it also includes backups older than 1 year

  • Add a new ‘--all’ option to the cmsh sysinfo command to show extra information that has been collected by CMDaemon

  • Prevent CMDaemon crash when missing or truncated files are present in the monitoring backup directory

  • Increase systemd-resolved.service reload timeout

  • Redirect all stdout/stderr from a cmburn test script to a log file

  • Show inherited kernel properties in cmsh device get

  • Add multiline support for cmsh rack display

  • Add free extra_values to all entities to store additional information

  • Remove field for the CPU frequency scaling governor

  • Add --certificate and --key options in cmsh help

  • Add user/group name validation in cmsh

  • Do not populate status for each node in the environment to avoid multiple slow RPCs

Fixed Issues

  • Fix killing jobs on a node when CMDaemon is restarted on that node

  • Fix RemoteMountChecker when a custom port is specified as the NFSCheckerPort AdvancedConfig parameter when querying cm-nfs-checker

  • Handle cm-lite-daemon restart properly

  • Fix help of cmsh cert removerequest command

  • Fix HPL test start in cmburn on SLES 15 base distribution

  • Automatically adjust overlay.category references when a category is removed

  • Do not clone switchports when cloning a device

  • Fix CMDaemon crash when malformed JSON data is sent

  • Update node environment cache when automatically changing FS exports

  • Honor backup role disabled=yes configuration

  • Detect xvd* disk in sysinfo

  • Prevent the addition of duplicate nameservers in /etc/resolv.conf

  • Delete duplicate entries in /etc/nginx/nginx.conf

  • Fix cmsh crash when cloning an entity without specifying a name in the genericresources submode

  • Hide all events in cmsh if --hide-events is used

  • Remove verbose logs in /tmp/aws* from cm-setup

  • Fix cmsh table formatting with long lines

  • Fix default gateway for edge nodes running Ubuntu

  • Fix duplicate nodes for monitoring pickup scheduler

  • Fix database storage of drained provisioning nodes

  • Ensure named gets reloaded when network changes are made

  • Fix false negative open --failbeforedown when a status value is unchanged

  • Fix typo guage -> gauge
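
As an illustration of the duplicate-nameserver fix listed above, the following is a minimal sketch of order-preserving de-duplication of nameserver lines. It is not CMDaemon's implementation: the function name dedupe_nameservers and the sample addresses are hypothetical, chosen only for this example.

```python
def dedupe_nameservers(resolv_conf_text: str) -> str:
    """Return resolv.conf text with repeated nameserver entries removed.

    Hypothetical helper for illustration only; not part of BCM.
    The first occurrence of each nameserver is kept, so resolver
    order is preserved. All other lines pass through unchanged.
    """
    seen = set()
    kept = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            address = fields[1]
            if address in seen:
                continue  # skip a nameserver that was already listed
            seen.add(address)
        kept.append(line)
    return "\n".join(kept) + "\n"


# Sample input with one repeated entry (addresses are made up).
sample = (
    "search cluster.local\n"
    "nameserver 10.141.255.254\n"
    "nameserver 10.141.255.254\n"
    "nameserver 8.8.8.8\n"
)
print(dedupe_nameservers(sample), end="")
```

Keeping the first occurrence rather than the last matters here, because resolvers try nameservers in the order they are listed.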

Node Installer

Fixed Issues

  • Fix booting of compute nodes with separate /usr filesystem

  • Allow cloning of head nodes with btrfs filesystems

  • Fix disk management script to correctly assemble MD raids

cm-scale

New Features

  • Support for Oracle Cloud Infrastructure for Auto Scaler

  • Automatically detect memory and GPUs for cloud nodes

Improvements

  • Support multi-partition Slurm jobs in Auto Scaler

Fixed Issues

  • Fix incorrect number of CPUs for Slurm jobs in Auto Scaler

  • Handle lack of availability zone capacity for AWS spot instances in Auto Scaler

  • Fix Auto Scaler ignoring queue priorities for multi-queue Slurm jobs

Linux and Hardware Integration

New Features

  • Support for DGX OS 6.1

  • Add cm-dpu-setup tool to define NVIDIA BlueField-2 and BlueField-3 Data Processing Units (DPUs) in the cluster

  • Add cm-dpu-manage to perform management actions on NVIDIA BlueField-2 and BlueField-3 Data Processing Units (DPUs)

Cloud

New Features

  • Add cm-cod-oci to create Cluster on Demand in Oracle Cloud Infrastructure

  • Allow COD-AWS cluster to span multiple regions (contact support for assistance)

  • Add support for AWS FSx on Ubuntu

Fixed Issues

  • Fix various issues with Azure locations caused by Azure API errors

  • Improve support for AWS spot instances

Kubernetes

New Features

  • Change Kubernetes deployment to use kubeadm

  • Change Kubernetes deployment to use packages from kubernetes.io instead of cm-kubernetesXXX packages

  • Support for Cluster API (CAPI) as a deployment method for new Kubernetes clusters

Improvements

  • Update Kyverno to 3.0.4 (due to incompatibility with Kubernetes 1.27.x)

  • Support for multiple NVIDIA GPU operator versions

  • Deploy the NVIDIA GPU Operator with toolkit.enabled=false by default

Fixed Issues

  • Fix NVIDIA GPU Operator deployment always resulting in NVIDIA packages being installed

  • Update exclude lists for Kubernetes to avoid failures on “grabimage”

  • Do not include kubelet.service file in exclude list (this can interfere with assigning additional nodes to the Kubernetes roles and prevent the kubelet service from starting correctly)

Workload Management

New Features

  • Support data and cache sharing options for pyxis and enroot

  • Allow management of Slurm prolog/epilog timeouts

Improvements

  • Rely on MIG autodetection to configure gres.conf

  • Update Slurm package to 23.02 (older versions are not supported anymore)

  • Use pmix4 with Slurm 23.02

  • pyxis may now be compiled and installed from a local tarball with sources

  • All RPCs for job management API in CMDaemon also return an exit code of the operation

Fixed Issues

  • Fix parsing of Slurm job CPUs

  • Fix fetching job information when UGE accounting rotation is configured

  • Fix UGE AdditionalSubmitHosts advanced configuration flag

  • Advanced accounting (job types and account hierarchy monitoring)

Jupyter

New Features

  • Manage Spark and PostgreSQL instances from JupyterLab

  • Manage Pods and data migration from/to Persistent Volume Claims

  • Read Pod logs and events from Jupyter interface

  • Support for multi-factor authentication

Improvements

  • Support for private NGC credentials in Kubernetes kernel templates

Container Engines

Improvements

  • Update cm-docker package to 23.0.6

  • Update cm-containerd package to 1.7.1

  • Update cm-apptainer package to 1.1.9

Container Registries

Improvements

  • Update cm-harbor package to 2.8.2

  • Update cm-docker-registry package to 2.8.1

Fixed Issues

  • Generate containerd certificates when a registry mirror is not configured

Ceph

Improvements

  • Update Ceph to Ceph Quincy

Monitoring

New Features

  • Add new NVSwitch metrics

  • Support for Grafana 10

Improvements

  • Disable job metrics collection when JobSampler is not set up to run in OOB mode

  • Sample node JobsRunning metric even when there are no jobs running

  • Reduce memory usage spike when using PromQL over short timespans

  • Multiply metric value by 100 when displaying % in pythoncm

  • Exclude rdma* by default in /proc/net/dev sampler

  • Exclude virtual ibp*v* interface from monitoring
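
The percentage display change listed above can be pictured with a small sketch. This is not the pythoncm API: the function format_metric is hypothetical, and the example assumes that a metric with unit "%" is stored as a fraction between 0 and 1, which these notes do not state.

```python
def format_metric(value: float, unit: str) -> str:
    """Format a metric value for display.

    Hypothetical helper for illustration only; not part of pythoncm.
    Assumption: values with unit "%" are stored as fractions (0.25
    means 25%), so they are multiplied by 100 before display.
    """
    if unit == "%":
        return f"{value * 100:.1f}%"
    # Other units are shown unscaled, followed by the unit name.
    return f"{value:g} {unit}".strip()


print(format_metric(0.25, "%"))
print(format_metric(3.5, "GiB"))
```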

Fixed Issues

  • Fix the Slurm job_gpu_utilization and job_gpu_wasted metric calculations when running GPU processes within sbatch scripts

  • Fix calculation of job_gpu_wasted metric when the node has multiple GPUs

  • Fix samplenow CPUUsage metric

  • Ensure job_gpu_* have correct values in the first few seconds of a job being started

  • Ensure first data sample of a Prometheus sampler is stored to the database

  • Propagate cumulative values passed by a JSON sampler during initialize

  • Fix metrics sampling when temperatures are not provided by the Redfish API

  • Clean up job monitoring when jobs are removed from cache