Release notes for NVIDIA Base Command™ Manager (BCM) 11.25.08

Released: 1 October 2025

General

New Features

  • Updated Slurm 25.05 to 25.05.2

  • Updated Slurm 24.11 to 24.11.6

  • Updated Topograph to 3.2.0

  • Updated etcd to 3.5.22

  • Updated cm-nvhpc to 25.5

  • Added cm-kubeadm-manage-joins helper script

Fixed Issues

  • Fixed an issue where systemd could kill slurmdbd

NVIDIA Mission Control

New Features

  • Support for NVIDIA Mission Control (NMC) 2.0

  • Updated run:ai to 2.22

  • Increased the Run:ai self-hosted Helm chart installation timeout to 40 minutes (20 minutes was not enough)

  • Increased the Run:ai cluster installation (helm install) maximum timeout to 20 minutes (instead of the Helm default)

  • AHR (Autonomous Hardware Recovery) setup no longer checks whether Kyverno is disabled

  • The Run:ai wizard now includes the correct MPI / Kubeflow CRDs

Fixed Issues

  • The Run:ai cluster installation stage now always installs to the correct Kubernetes cluster (in case of multiple Kubernetes clusters on a single BCM cluster)

  • Fixed incorrectly rendered TUI dialogs in the Run:ai configuration dialog when hostnames were long

  • Run:ai self-hosted now allows selecting a custom version as well as the latest patch release versions

CMDaemon

New Features

  • Standardize on array result for rest/sysinfo

  • Fixed an issue where a changed cluster UUID was not updated in BCM

  • The KubernetesCertsExpiration healthcheck is now also added to existing deployments

  • CMDaemon now runs scontrol reconfig only once for combined slurm.conf and topology.conf updates

  • Fixed an issue where labeled entities were not listed in the Base View monitoring view

  • BCM can set node-role.kubernetes.io/runai-gpu-worker=true for GPU nodes (and the CPU worker variant).

  • Added an opt-in BCM GlobalConfig setting to disable writing out root kubeconfig configuration files on Kubernetes worker nodes.
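
The single-reconfigure behavior noted above can be illustrated with a minimal coalescing sketch (the apply_config_updates helper and the injectable run parameter are hypothetical; CMDaemon's actual implementation is internal):

```python
import subprocess

# Files whose changes require a Slurm controller reconfigure.
RECONFIG_FILES = {"slurm.conf", "topology.conf"}

def apply_config_updates(changed_files, run=subprocess.run):
    """After writing out changed files, run 'scontrol reconfigure' at most once.

    'changed_files' is an iterable of file names; 'run' is injectable so the
    logic can be tested without Slurm. Returns the number of reconfigure
    invocations (0 or 1).
    """
    if not RECONFIG_FILES.intersection(changed_files):
        return 0
    # One reconfigure covers the whole batch, instead of one call per file.
    run(["scontrol", "reconfigure"], check=True)
    return 1
```

Because the reconfigure is issued once per batch of changed files rather than once per file, simultaneous slurm.conf and topology.conf updates trigger only a single scontrol invocation.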

Fixed Issues

  • Ensured that hostname changes are actual hostname changes, and restored etcd:etcd ownership after restoring the PKI backup (done before and after kubeadm reset)

  • Introduced the new KubernetesCertsExpiration healthcheck, which warns when nodes have certificates expiring within 30 days

  • Fixed regression in KubernetesChildNode healthcheck (and others)

  • Extend v1/rest/status to include rack and system name

  • The cm-kubeadm-manage helper script no longer overwrites etcd certificates handled by BCM

  • Fixed regressions in metrics collection scripts in Kubernetes

  • Only invoke kubeadm kubeconfig user commands when actually needed
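
The 30-day warning window used by the KubernetesCertsExpiration healthcheck can be sketched as follows (the certs_expiring_soon helper and its input shape are hypothetical; the real healthcheck inspects the certificates on each node):

```python
from datetime import datetime, timedelta, timezone

# Certificates expiring within this window trigger a warning.
EXPIRY_WARN_WINDOW = timedelta(days=30)

def certs_expiring_soon(cert_expiries, now=None):
    """Return the names of certificates expiring within the warning window.

    'cert_expiries' maps a certificate name to its notAfter datetime
    (timezone-aware). Already-expired certificates are also flagged.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name for name, not_after in cert_expiries.items()
        if not_after - now <= EXPIRY_WARN_WINDOW
    )
```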

COD

New Features

  • ALL: Change default BCM version to 11.0.

  • ALL: Create inbound rules for both protocols (TCP/UDP) if the user didn’t explicitly request a protocol.

  • AWS: Preserve internal IP address when recovering broken AWS HA head node.

  • AWS: Create placement group with spread strategy for better HA head node separation.

  • Azure: Added support for NAT gateway for outbound connectivity.

  • GCP: Add cloudsetting network_performance_config.

  • GCP: Always delete empty subnets in BCM created networks.

  • OCI: Ensure the VM architecture is compatible with the requested image.
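
The protocol-defaulting rule above (create inbound rules for both TCP and UDP when the user did not explicitly request a protocol) can be sketched as (hypothetical expand_inbound_rules helper and rule representation, for illustration only):

```python
def expand_inbound_rules(port, protocol=None):
    """Expand a user-requested inbound rule into per-protocol rules.

    If no protocol was explicitly requested, create one rule for each of
    TCP and UDP; otherwise honor the requested protocol.
    """
    protocols = [protocol] if protocol else ["tcp", "udp"]
    return [{"port": port, "protocol": p} for p in protocols]
```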

Fixed Issues

  • ALL: Improve error on missing architecture config.

  • ALL: Preserve SSH host keys on cm-cloud-ha-setup.

  • ALL: HA setup checks that head node IP is within CIDR.

  • ALL: Harmonize --cluster-tags and --head-node-tags cluster create flags.

  • ALL: Updated image search defaults and validation. By default, only the latest image is listed. The --all-revisions flag overrides --latest; when both --latest and --all-revisions are used, an info message is logged.

  • Azure: Fixed an error that could be raised when powering on multiple nodes created several VHDs at the same time.

  • GCP: Fix cryptic invalid refresh token grant errors.

  • GCP: Set correct instance hostname on cluster create.

  • GCP: Reduce API requests on cluster delete.

  • GCP: Fix timeout creating HA shared storage.

  • GCP: Optimize cluster list.

  • GCP: Improve reliability of cluster delete.

  • OCI: Fixed a cluster create hang when a non-existent region is specified.
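
The image-listing flag precedence described above can be sketched as (hypothetical select_images helper; not the actual COD code):

```python
def select_images(revisions, latest=False, all_revisions=False, log=print):
    """Pick image revisions according to the listing flags.

    By default only the latest revision is listed. --all-revisions
    overrides --latest; when both flags are given, an info message is
    logged. 'revisions' is assumed sorted oldest-to-newest.
    """
    if all_revisions:
        if latest:
            log("INFO: --all-revisions overrides --latest")
        return list(revisions)
    # Default and --latest behave the same: list only the latest image.
    return revisions[-1:]
```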

cm-kubernetes-setup

New Features

  • Started using containerd v2 by default for new Kubernetes setups

  • Updated Kubeflow training operator to 1.9.2

  • Removed the NetQ / Kubernetes version mapping validation, only for NetQ version 4.15.0

  • The wizard now allows choosing ‘toolkit.enabled=true’, delegating the toolkit installation and containerd configuration to the NVIDIA GPU Operator

Fixed Issues

  • Fixed the kube_get_available_nodes remote_request RPC call to also consider head nodes as Kubernetes workers

  • CAPI templates (stored in the kubernetes submode) can no longer be uninstalled via cm-kubernetes-setup.

  • Fixed the MetalLB/BGP screen being shown in cm-kubernetes-setup when the Tigera operator is selected

cm-wlm-setup

Fixed Issues

  • Fixed adding pyxis plugin to secondary software images on multi-arch/multi-distro setups