NVIDIA DGX SuperPOD: Release Notes 10.24.05#


This document covers the NVIDIA Base Command™ Manager (BCM) 10.24.05 software release on NVIDIA DGX SuperPOD™ configurations. Except for Component Versions, the information herein is the same as in the NVIDIA Base Command Manager Release Notes.

Information about BCM and DGX SuperPOD is available at:


The NVIDIA DGX SuperPOD: Release Notes 10.24.05 is also available as a PDF.

Component Versions#

DGX SuperPOD component versions for this release are in Table 1.

Table 1. Common component versions#

  • Ubuntu 22.04.2 LTS

  • CUDA toolkit

  • Cumulus OS

  • Mellanox InfiniBand Switch (DGX H100)
      • MLNX OS version: 3.11.2016
      • HCA Firmware: CX7 - 28.39.2048

  • Mellanox InfiniBand Switch (DGX A100)
      • MLNX OS version: 3.11.2016
      • HCA Firmware: CX6 - 20.39.2048

  • Mellanox OFED Driver (A100 and H100)

  • DGX kernel

  • GPU Driver

  • Lustre Client
      • lustre-client-modules-5.19.0-45-generic 2.14.0-ddn125

  • UFM Enterprise SW: 1.7.0

Change Requests#


New Features#

  • Added mlnx-ofed24.01 package

  • Added CUDA 12.4 toolkit packages

  • Added PBS Professional 2024 packages

  • cmdaemon-apidocs has been replaced with cm-api-docs; the documentation can be accessed via the landing page

  • Added cm-nsight-systems-cli package containing the latest CLI version of nsight-systems package, replacing the current cm-nsight-systems package

  • bcm-post-install: Allow the slogin node base name to be configurable

  • bcm-post-install: added command for SuperPOD validation


  • Updated mlnx-ofed23.10 to 23.10-

  • For new Ubuntu 22.04 head node installations, use fixed port numbers for the NFS lockd, statd, and mountd daemons

Fixed Issues#

  • An issue that prevents cm-diagnose from completing when single quotes are used in cmd.conf for dbuser/dbpass

  • An issue with the runtime and PID path settings in the nvidia-persistenced service unit file from the cuda-driver packages

  • Fix SuperPOD enroot configuration


New Features#

  • Mark the devices monitored by MQTT as UP when recent monitoring data exists

  • Allow the option to run Slurm accounting database daemon in high-availability mode


  • Added per node Slurm state metric

  • Allow the option to automatically select a random free port for the IMEX service

  • Allow the option to define an exclude list in the network interfaces healthcheck to skip specified interfaces

  • Added /var/lib/rancher/.* to the default exclude list for the ProcMounts monitoring producer, which otherwise can create unnecessary monitoring metrics

  • Improved performance of the cm-mqtt service

  • Allow the option to automatically set the bond and bond member MAC addresses in the CMDaemon node interface entities when the node boots

  • Include the perm-mac-address under a bond interface when verifying the license, which resolves an issue with verifying the license when a bond network interface is created after the license is requested

  • Added lxc- interfaces to the default exclude list for the ProcNetDev monitoring producer

  • Allow the option to disable MQTT with a flag in the configuration file
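
The exclude-list entries above (for example `/var/lib/rancher/.*` for ProcMounts and `lxc-` interfaces for ProcNetDev) can be pictured as regular-expression filters over mount points or interface names. The following is a minimal sketch of that idea, not CMDaemon's implementation: the function name `filter_excluded` is hypothetical, and matching each pattern from the start of the name is an assumption made for illustration.

```python
import re
from typing import Iterable, List


def filter_excluded(names: Iterable[str], exclude_patterns: Iterable[str]) -> List[str]:
    """Return the names that match none of the exclude patterns.

    Assumption: each pattern is a regular expression matched from the
    start of the name (re.match). The real producers may match differently.
    """
    compiled = [re.compile(pattern) for pattern in exclude_patterns]
    return [name for name in names if not any(c.match(name) for c in compiled)]


# Mount points as a ProcMounts-style producer might see them (sample data).
mounts = ["/", "/home", "/var/lib/rancher/k3s/agent", "/cm/shared"]
kept_mounts = filter_excluded(mounts, [r"/var/lib/rancher/.*"])

# Interface names as a ProcNetDev-style producer might see them (sample data).
interfaces = ["eth0", "ib0", "lxc-1a2b3c", "bond0"]
kept_interfaces = filter_excluded(interfaces, [r"lxc-"])
```

Dropping such entries before metrics are created is what keeps short-lived container mounts and interfaces from producing unnecessary monitoring data.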

Fixed Issues#

  • An issue with validating the LDAP group when committing a user

  • An issue with clearing the memory when a large number of entities that have failing health checks are added and then removed

  • An issue where DNS allow-query configuration entries are not added on edge directors for Kubernetes networks, preventing queries from these networks

  • An issue where monitoring consolidators may not be created for all entity-measurable pairs

  • An issue where a temporary resolv.conf bind mount created in the software images is being added to the CMDaemon monitoring database

  • Added a retry mechanism around gethostbyname when writing the IMEX configuration files, which can otherwise throw an exception

  • An issue with chargeback calculations using per node requested CPU/GPU information

  • An issue in the drain action manager code that in some cases can lead to high CMDaemon memory usage

  • An issue with determining the number of requested CPUs for multi-node Slurm jobs when storing the jobs information in CMDaemon

  • An issue which can lead to high CMDaemon memory usage if the post-provisioning monitoring-resume operation has failed

  • An issue with hard-coded references to /sbin/arping which in some cases can prevent CMDaemon from using arping in the event of a failover

  • An issue in the RPC status code which in some cases can result in an infinite recursion on the passive head node in the event of a failover

  • An issue with the node profile missing the UPDATE_CONFIG_FILES_AFTER_IMAGE_UPDATE_TOKEN which prevents the /cm/conf files from being copied from the software image to the provisioned nodes when using non-head-node provisioners

  • An issue with the reporting of the GPU chargeback information

  • An issue where a category can be removed while another category’s provisioning role still has a reference to it

  • An issue where Azure cloud compute nodes are cloned with an incorrect power status when the original node’s power status is ON

  • An issue where failed AWS node power-on actions produce the error message “Unable to parse output” instead of the AWS error message

  • An issue with the cmsh dropunused command that can result in removing too many measurables

  • An issue with the cmsh device syncinfo command when specifying an fspart path
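
The retry around `gethostbyname` mentioned in the list above follows a common pattern: a name lookup that fails transiently is attempted again after a short pause instead of raising immediately. The following is a minimal sketch of that pattern only. The function name `resolve_with_retry`, the attempt count, and the delay are assumptions, not the values CMDaemon uses.

```python
import socket
import time
from typing import Callable


def resolve_with_retry(
    hostname: str,
    attempts: int = 3,
    delay: float = 0.5,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> str:
    """Resolve hostname to an IPv4 address, retrying on lookup failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return resolver(hostname)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError.
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            time.sleep(delay)
    raise AssertionError("unreachable")
```

The `resolver` parameter exists only so the lookup can be replaced in tests; by default the standard library call is used.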

Base View#

Fixed Issues#

  • An issue with displaying the SNMP system information data for switches

Cluster Tools#

Fixed Issues#

  • An issue where cm-mysql-sanitize.py, which is required by cm-diagnose, is not part of the cluster-tools package


New Features#

  • Added support for creating HA COD clusters in Azure

  • Allow the option to skip shared storage setup with the cm-cloud-ha-setup tool

  • Added support for OCI defined tags. This changes the original --head-node-tags command line option to --head-node-freeform-tags and adds a new command line option, --head-node-defined-tags


  • Allow the option to select the Azure availability zone on the command line of the cluster create command
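
The difference between the two OCI tag kinds named above is structural: freeform tags are flat key/value pairs, while defined tags are grouped under a tag namespace. The following sketch illustrates the two shapes. The input string formats (`key=value` and `namespace.key=value`) and both function names are hypothetical and chosen for illustration; they are not taken from the tool's documented command line syntax.

```python
from typing import Dict, List


def parse_freeform_tags(items: List[str]) -> Dict[str, str]:
    """Build a flat {key: value} mapping from hypothetical 'key=value' strings."""
    tags: Dict[str, str] = {}
    for item in items:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        tags[key] = value
    return tags


def parse_defined_tags(items: List[str]) -> Dict[str, Dict[str, str]]:
    """Build {namespace: {key: value}} from hypothetical 'namespace.key=value' strings."""
    tags: Dict[str, Dict[str, str]] = {}
    for item in items:
        name, sep, value = item.partition("=")
        namespace, dot, key = name.partition(".")
        if not sep or not dot or not namespace or not key:
            raise ValueError(f"expected namespace.key=value, got {item!r}")
        tags.setdefault(namespace, {})[key] = value
    return tags


freeform = parse_freeform_tags(["owner=hpc-team"])
defined = parse_defined_tags(["Operations.CostCenter=42", "Operations.Env=prod"])
```

Here `freeform` is a single-level dictionary, whereas `defined` nests both keys under the `Operations` namespace.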

Machine Learning#

New Features#

  • Introduced ML NCCL and CuDNN packages for CUDA 12.4

Fixed Issues#

  • An issue where WLM kernels may be unexpectedly restarted if one of the kernels fails to start


Fixed Issues#

  • An issue with handling bond interface and bond member configuration on the Ubuntu base distribution


Fixed Issues#

  • An issue where ‘germany’ is incorrectly listed as an Azure region


Fixed Issues#

  • An issue where missing modular metadata for the ‘default’ package group on the RHEL8 and RHEL9 ISOs can prevent the creation of software images



  • Ensure the /var/lib/etcd directory has the correct permissions (0700) for an etcd member to be able to join the etcd cluster
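
A quick way to confirm the permission bits described above is to compare the directory mode against 0700. The following is a small sketch using only the Python standard library. The helper name `has_mode` is hypothetical, and the demonstration runs against a temporary directory rather than a real etcd data directory.

```python
import os
import stat
import tempfile


def has_mode(path: str, expected: int = 0o700) -> bool:
    """Return True if path is a directory whose permission bits equal expected."""
    st = os.stat(path)
    return stat.S_ISDIR(st.st_mode) and stat.S_IMODE(st.st_mode) == expected


# Demonstration on a throwaway directory (stand-in for /var/lib/etcd).
with tempfile.TemporaryDirectory() as demo_dir:
    os.chmod(demo_dir, 0o755)
    too_open = has_mode(demo_dir)   # group/other bits set, so not 0700
    os.chmod(demo_dir, 0o700)
    correct = has_mode(demo_dir)    # owner-only access
```

On a node, the same check would be called as `has_mode("/var/lib/etcd")`, which requires permission to stat that path.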

Fixed Issues#

  • A regression in cm-kubernetes-setup that allows the user to select nodes in a way that results in the overlap of compute nodes between different Kubernetes clusters

  • An issue where cm-kubernetes-setup --pull is unable to complete if the pod is evicted due to disk pressure while the images are being pulled


New Features#

  • Allow the option to reboot the nodes in FULL install mode when the cm-scale engine is changed

Fixed Issues#

  • An issue where the shutdown state from files may be used incorrectly


Fixed Issues#

  • Setting up pyxis will no longer configure it to clean the data directory from epilog since enroot can perform this automatically



  • Allow the option to specify the IP increment with the cmsh addinterface command

  • Include the job run time data in the cmsh WLM jobs info command

  • Allow the option to specify the network CIDR on the cmsh “add network” command line
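
CIDR notation packs a network's base address and prefix length into one string, which is why it is convenient on a command line. The following sketch shows what a CIDR value expands to, using Python's standard `ipaddress` module. The function name `describe_cidr` and the sample network `10.141.0.0/16` are illustrative assumptions, and the sketch does not reproduce how cmsh parses its arguments.

```python
import ipaddress
from typing import Dict, Union


def describe_cidr(cidr: str) -> Dict[str, Union[str, int]]:
    """Expand a CIDR string into base address, netmask, and size.

    strict=True rejects values with host bits set, such as 10.141.0.1/16.
    """
    net = ipaddress.ip_network(cidr, strict=True)
    return {
        "base_address": str(net.network_address),
        "netmask": str(net.netmask),
        "prefix_length": net.prefixlen,
        "addresses": net.num_addresses,
    }


info = describe_cidr("10.141.0.0/16")
```

For the sample network, `info` holds the base address `10.141.0.0`, the netmask `255.255.0.0`, a prefix length of 16, and 65536 addresses.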

Fixed Issues#

  • An issue where the cmsh monitoring trigger info command does not show grouped expressions

  • An issue with importing older formats of the .cmshhistory file, which can result in duplicating all entries in the cmsh command history


New Features#

  • Allow the option to use sqsh files to run Jupyter kernels based on enroot

  • Restrict the access to Jupyter based on group memberships


  • Allow the option to install and configure VNC when setting up Jupyter


Fixed Issues#

  • An issue in the pythoncm cluster.py implementation where an incorrect logger variable name is being used



  • Updated pyxis-sources to 0.19.0