Base Command Manager 11
About this Document
This document provides important release-specific considerations for Base Command Manager 11.
Introduction to Base Command Manager
NVIDIA Base Command Manager provides cluster management software for streamlining cluster provisioning, workload management, and infrastructure monitoring. It provides all the tools for deploying and managing an AI data center.
Version 11 is a major release of Base Command Manager. It provides the same functionality as Base Command Manager 10, except where explicitly stated otherwise.
Base Command Manager is now also included in NVIDIA Mission Control. NVIDIA Base Command Manager in Mission Control includes advanced features designed for the NVIDIA Blackwell Architecture.
Note
Base Command Manager 11 can be evaluated by requesting a free license. Support for the product is not included with the free license; it can be purchased separately at a later time.
Please visit https://www.nvidia.com/en-us/data-center/base-command/manager/ to request a license for Base Command Manager.
For additional documentation and information about Base Command Manager, refer to the official product documentation.
Base Command Manager 11 New Features
The following are the key new features of Base Command Manager 11:
NVIDIA DGX GB200 with NVLINK support [NVIDIA Mission Control only]
Base Command Manager introduces support for NVIDIA DGX GB200 racks with NVLINK.
Base Command Manager can provision and manage compute trays and switch trays in a rack, with advanced network automation and a new rackoverview mode in CMSH.
New attributes are added to the rack entities to model GB200 racks.
NVIDIA Mission Control license
A new NVIDIA Mission Control edition is introduced for Base Command Manager licenses purchased via NVIDIA Mission Control.
Base Command Manager 11 clusters with an NVIDIA Mission Control license include exclusive advanced capabilities.
A new dedicated panel is added to Base View to manage NVIDIA Mission Control advanced capabilities.
Leak action policies for NVIDIA DGX GB200 with NVLINK [NVIDIA Mission Control only]
Base Command Manager interacts with the BMCs of compute trays and switch trays in a GB200 rack via the Redfish API, periodically fetching metrics to detect leaks.
Administrators can define a set of rules and policies to customize the Base Command Manager behavior when leaks with different severity levels are detected.
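The actual rules and policies are configured in Base Command Manager itself; as a rough illustration of the idea, a severity-to-action lookup might be sketched as follows (all names here are hypothetical, not the BCM API):

```python
# Hypothetical sketch of a leak-action policy lookup. The real policies are
# defined by administrators in Base Command Manager; these severity levels
# and action names are illustrative only.
LEAK_POLICIES = {
    "warning": "log_event",        # e.g. record the event and notify admins
    "minor": "drain_tray",         # e.g. drain workloads from the affected tray
    "critical": "power_off_rack",  # e.g. isolate and power off the rack
}

def action_for_leak(severity: str) -> str:
    """Return the configured action for a leak severity, defaulting to logging."""
    return LEAK_POLICIES.get(severity.lower(), "log_event")
```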
Building Management Systems integration [NVIDIA Mission Control only]
Base Command Manager can interact via MQTT with a Building Management System (BMS) conforming to the Controls and Monitoring Reference Design by NVIDIA for advanced data center monitoring.
Base Command Manager can send electrical or liquid isolation requests to the BMS according to information acquired from the BMC of compute trays and switch trays in a GB200 rack.
Base Command Manager can read information from the BMS and receive leak-related events happening outside the racks.
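The message formats are defined by the Controls and Monitoring Reference Design; purely as an illustration of what an isolation request over MQTT might contain, consider this sketch (the topic layout and payload fields are assumptions, not the actual schema):

```python
import json

# Hypothetical shape of an isolation request sent to the BMS over MQTT.
# The real topic names and payload schema come from the NVIDIA Controls
# and Monitoring Reference Design; everything below is illustrative only.
def isolation_request(rack_id: str, kind: str = "liquid") -> tuple:
    """Build an (mqtt_topic, json_payload) pair for an isolation request."""
    topic = "bms/racks/{}/isolation".format(rack_id)  # assumed topic layout
    payload = json.dumps({
        "rack": rack_id,
        "type": kind,            # "liquid" or "electrical"
        "requested_by": "bcm",
    })
    return topic, payload
```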
NVIDIA NMX and NVIDIA Internode Memory Exchange Service support [NVIDIA Mission Control only]
Base Command Manager introduces support for NVIDIA NMX to configure and monitor NVLINK and enable GPU-to-GPU communication in GB200 racks.
The existing bcm-netautogen and bcm-post-install tools are augmented to support the provisioning and configuration of NVLINK switches with NMX Controller and NMX Telemetry.
Base Command Manager introduces two ways to configure NVIDIA IMEX communication channels: per-job (via prolog) or as a global service.
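In the per-job case, the prolog assigns an IMEX channel to the job before it starts. As a simplified sketch (the channel count and the job-id-modulo allocation strategy are assumptions, not BCM's actual logic), the mapping from job to channel device node might look like:

```python
# Illustrative sketch of per-job IMEX channel selection, roughly what a
# Slurm prolog could do. The device path follows the usual IMEX layout
# (/dev/nvidia-caps-imex-channels/channelN); the allocation strategy here
# is a simplification for illustration only.
MAX_CHANNELS = 2048  # assumed upper bound on IMEX channels per node

def imex_channel_device(job_id: int) -> str:
    """Map a Slurm job id to an IMEX channel device node."""
    channel = job_id % MAX_CHANNELS
    return "/dev/nvidia-caps-imex-channels/channel{}".format(channel)
```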
Power Reservation Steering support [NVIDIA Mission Control only]
Base Command Manager introduces support for Power Reservation Steering (PRS), a prediction-based datacenter-scale dynamic power management technology. PRS manages the power budget dynamically, ensuring power budget compliance while minimizing the impact on application performance.
New packages, a setup wizard (cm-prs-setup), and new entities and options are added to install and configure PRS in Base View and CMSH.
Workload Power Profile Settings [NVIDIA Mission Control only]
The Blackwell architecture introduces support for performance profiles that control some operating parameters of the GPUs via DCGM. Base Command Manager introduces support for Workload Power Profile Settings (WPPS) to make it easier for users to apply Blackwell performance profiles on GPUs for their Slurm jobs.
If WPPS is enabled, new BCM prolog and epilog scripts are enabled in Slurm, and users can specify which performance profile should be used for jobs via comments.
Base Command Manager administrators can specify which users are allowed to change performance profiles with a new flag in CMSH.
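The exact comment syntax is defined by BCM's WPPS prolog; assuming a hypothetical `wpps:<profile>` token for illustration, a prolog-side parser could be sketched as:

```python
# Hypothetical parser for a performance-profile request embedded in a Slurm
# job comment. The real comment syntax is BCM-specific; the "wpps:<profile>"
# token format below is an assumption for illustration only.
def profile_from_comment(comment):
    """Extract a requested profile name from a job comment, or None."""
    for token in comment.split():
        if token.startswith("wpps:"):
            return token[len("wpps:"):]
    return None
```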
Base View
The new Base View (previously known as base-view-ng) becomes the official and only GUI of Base Command Manager.
The new Base View includes support for new wizards and monitoring dashboards, improved scalability, and better user experience.
Google Cloud Platform for Cluster On Demand
Base Command Manager now includes support for Google Cloud Platform (GCP) via Cluster on Demand, with the new cm-cod-gcp tool.
CMDaemon authenticates to GCP using a service account that allows the head node to obtain ephemeral tokens in order to interact with GCP APIs. Static credentials are not supported for GCP.
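On GCP, ephemeral access tokens are obtained from the instance metadata server rather than from stored keys. This sketch only constructs the documented metadata-server request; how CMDaemon actually performs this internally is not shown here:

```python
import urllib.request

# GCP instance metadata endpoint for short-lived service-account tokens.
# The URL and the required "Metadata-Flavor: Google" header are part of the
# documented GCP metadata-server API; the helper itself is illustrative.
TOKEN_URL = ("http://metadata.google.internal/computeMetadata/v1/"
             "instance/service-accounts/default/token")

def metadata_token_request():
    """Build the metadata-server request for an ephemeral access token."""
    return urllib.request.Request(TOKEN_URL,
                                  headers={"Metadata-Flavor": "Google"})

# Usage (on a GCP instance):
#   resp = urllib.request.urlopen(metadata_token_request())
#   token = json.load(resp)["access_token"]
```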
Slurm improvements
Slurm package files are moved from /cm/shared/apps/slurm to /cm/local/apps/slurm. This change enables easier in-place updates of Slurm, as recommended by SchedMD.
Support for the topology/block plugin (topology.conf) is added, also enabling NVLINK-aware scheduling of jobs on GB200 systems. New commands are added to configure the SlurmBlockTopologySettings in Base View and CMSH.
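With topology/block, Slurm's topology.conf describes blocks of NVLINK-connected nodes such as GB200 racks. A minimal example (block names, node ranges, and sizes are illustrative and depend on the cluster) might look like:

```
# Example /etc/slurm/topology.conf for the topology/block plugin.
# Block names and node ranges below are illustrative only.
BlockName=rack1 Nodes=node[001-018]
BlockName=rack2 Nodes=node[019-036]
BlockSizes=18,36
```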
Support for topograph is added to automatically discover cluster network topology in cloud environments. A new option is added to configure topograph in the SlurmBlockTopologySettings in Base View and CMSH.
NVSM integration
Base Command Manager introduces support for NVSM in Base View and CMSH.
The commands invokable via Base Command Manager are convenient wrappers around NVSM commands. They can be executed on one or more target nodes (or categories of nodes), and the results are collected.
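As a rough illustration of the fan-out idea, the sketch below builds one remote invocation per target node for a standard NVSM query ("nvsm show health" is a real NVSM command; the ssh-based dispatch is a simplification of BCM's actual execution model):

```python
# Illustrative sketch of fanning an NVSM query out to a set of target nodes.
# BCM's wrappers handle dispatch and result collection internally; the
# ssh-per-node approach here is a simplification for illustration only.
def nvsm_commands(nodes, query="show health"):
    """Build one ssh command per target node running the given NVSM query."""
    return [["ssh", node, "nvsm {}".format(query)] for node in nodes]
```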
Run:ai setup wizard
The cm-kubernetes-setup wizard is improved to streamline the installation of the Run:ai control plane on a BCM cluster (self-hosted).
Base Command Manager verifies the correctness of the input credentials and certificates, installs all the dependencies for Run:ai, and performs some post-installation configuration to make sure the control plane and the workload cluster are operational.
CUDA driver and MOFED packages
CUDA driver and MOFED packages are no longer distributed by Base Command Manager and have been removed from its repositories.
Administrators are now required to install the CUDA driver and MOFED packages directly from the NVIDIA repositories.
Kernel Provisioning for JupyterLab integration
Base Command Manager removes Jupyter Enterprise Gateway from the integration with JupyterLab in favor of the new native Kernel Provisioning. This enables secure native communication with remote kernels; certificate generation and exchange are no longer required.
This underlying change is transparent to JupyterLab users. However, templates and kernels created with previous version(s) of BCM are not compatible with Base Command Manager 11 and the new Kernel Provisioning.
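Kernel Provisioning is configured per kernel spec in Jupyter's kernel.json, where a provisioner is selected under metadata.kernel_provisioner. The snippet below uses jupyter_client's built-in local-provisioner as a stand-in; the provisioner BCM actually installs for remote kernels is not shown here:

```json
{
  "display_name": "Python 3 (example)",
  "language": "python",
  "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "metadata": {
    "kernel_provisioner": {
      "provisioner_name": "local-provisioner",
      "config": {}
    }
  }
}
```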
Other features and improvements
Introduced support for BaseOS 7
Introduced support for multi-arch Kubernetes deployments
Introduced support for Blackwell GPUs autodetection for Slurm
Introduced support for cgroups v2 in Slurm
Improved Prometheus metrics parser
Reduced coupling between CMDaemon and Kubernetes for easier lifecycle management
Introduced the new cm-etcd-manage tool to manage etcd members (add/remove nodes, promote leaders, etc.) and perform backup/restore of the data
Introduced new cm-kubeadm-manage tool to handle Kubernetes components and related certificates
Migrated the Calico deployment from legacy Kubernetes addons to the Tigera Operator
Migrated the Ingress NGINX Controller deployment from legacy Kubernetes addons to an Operator
Base Command Manager 11 Removed Features
The following are the key features removed from Base Command Manager 11:
cmjob
Machine learning packages (RPMs and DEBs)
Ceph integration
Apptainer/Singularity package
Ubuntu 20.04 (end of life: April 2025)
GigaIO integration
Altair Grid Engine integration
Direct attached storage for HA setups
BeeGFS integration
OpenShift integration