Base Command Manager 11

About this Document

This document provides important release-specific considerations for Base Command Manager 11.

Introduction to Base Command Manager

NVIDIA Base Command Manager is cluster management software that streamlines cluster provisioning, workload management, and infrastructure monitoring. It provides all the tools needed to deploy and manage an AI data center.

Version 11 is a major release of Base Command Manager. It provides the same functionality as Base Command Manager 10, except where explicitly stated otherwise.

Base Command Manager is now also included in NVIDIA Mission Control. NVIDIA Base Command Manager in Mission Control includes advanced features designed for the NVIDIA Blackwell Architecture.

Note

Base Command Manager 11 can be evaluated by requesting a free license. Support for the product is not included with the free license; it can be purchased separately at a later time.

Please visit https://www.nvidia.com/en-us/data-center/base-command/manager/ to request a license for Base Command Manager.

For additional documentation and information about Base Command Manager, refer to the NVIDIA Base Command Manager documentation.

Base Command Manager 11 New Features

The following are the key new features of Base Command Manager 11:

NVIDIA Mission Control license

  • A new NVIDIA Mission Control edition is introduced for Base Command Manager licenses purchased via NVIDIA Mission Control.

  • Base Command Manager 11 clusters with an NVIDIA Mission Control license include exclusive advanced capabilities.

  • A new dedicated panel is added to Base View to manage NVIDIA Mission Control advanced capabilities.

Building Management Systems integration [NVIDIA Mission Control only]

  • Base Command Manager can interact via MQTT with a Building Management System (BMS) that conforms to the NVIDIA Controls and Monitoring Reference Design, enabling advanced data center monitoring.

  • Base Command Manager can send electrical or liquid isolation requests to the BMS based on information acquired from the BMCs of compute trays and switch trays in a GB200 rack.

  • Base Command Manager can read information from the BMS and receive leak-related events happening outside the racks.
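
As a rough illustration of this MQTT-based integration, the sketch below subscribes to leak-related events published by a BMS broker, assuming the paho-mqtt 2.x Python client. The broker address, topic name, and payload format are placeholders for illustration only; they are not the topics defined by the Controls and Monitoring Reference Design.

    # Minimal illustrative MQTT subscriber; broker and topic names are hypothetical.
    import json
    import paho.mqtt.client as mqtt

    BROKER = "bms.example.local"     # placeholder BMS broker address
    TOPIC = "bms/events/leak/#"      # placeholder leak-event topic

    def on_connect(client, userdata, flags, reason_code, properties=None):
        client.subscribe(TOPIC)

    def on_message(client, userdata, msg):
        event = json.loads(msg.payload)
        print(f"leak event on {msg.topic}: {event}")

    client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
    client.on_connect = on_connect
    client.on_message = on_message
    client.connect(BROKER, 1883)
    client.loop_forever()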

NVIDIA NMX and NVIDIA Internode Memory Exchange Service support [NVIDIA Mission Control only]

  • Base Command Manager introduces support for NVIDIA NMX to configure and monitor NVLink and enable GPU-to-GPU communication in GB200 racks.

  • The existing bcm-netautogen and bcm-post-install tools are augmented to support the provisioning and configuration of NVLink switches with NMX Controller and NMX Telemetry.

  • Base Command Manager introduces two ways to configure NVIDIA IMEX communication channels: per-job (via prolog) or as a global service.
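
For the per-job option, the general idea is that a prolog creates the IMEX channel device node(s) a job needs before the job starts. The following Python sketch only illustrates that mechanism, assuming the standard /dev/nvidia-caps-imex-channels device naming registered by the NVIDIA driver; it is not the actual BCM prolog, and paths and permissions are placeholders.

    # Rough illustration of creating an IMEX channel device node, as a per-job
    # prolog might do. Requires root; assumes the NVIDIA driver has registered
    # "nvidia-caps-imex-channels" in /proc/devices. Not the actual BCM prolog.
    import os
    import stat

    def imex_channel_major():
        with open("/proc/devices") as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2 and parts[1] == "nvidia-caps-imex-channels":
                    return int(parts[0])
        raise RuntimeError("nvidia-caps-imex-channels is not registered")

    def create_imex_channel(minor=0):
        dev_dir = "/dev/nvidia-caps-imex-channels"
        os.makedirs(dev_dir, exist_ok=True)
        path = os.path.join(dev_dir, f"channel{minor}")
        if not os.path.exists(path):
            os.mknod(path, stat.S_IFCHR | 0o666, os.makedev(imex_channel_major(), minor))
        return path

    if __name__ == "__main__":
        print(create_imex_channel(0))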

Power Reservation Steering support [NVIDIA Mission Control only]

  • Base Command Manager introduces support for Power Reservation Steering (PRS), a prediction-based datacenter-scale dynamic power management technology. PRS manages the power budget dynamically, ensuring power budget compliance while minimizing the impact on application performance.

  • New packages, a setup wizard (cm-prs-setup), and new entities and options are added to install and configure PRS in Base View and CMSH.

Workload Power Profile Settings [NVIDIA Mission Control only]

  • The Blackwell architecture introduces support for workload power profiles that control some operating parameters of the GPUs via DCGM. Base Command Manager introduces support for Workload Power Profile Settings (WPPS) to make it easier for users to apply Blackwell power profiles to the GPUs used by their Slurm jobs.

  • If WPPS is enabled, new BCM prolog and epilog scripts are enabled in Slurm, and users can specify which power profile should be used for their jobs via job comments.

  • Base Command Manager administrators can specify which users are allowed to change power profiles with a new flag in CMSH.

Base View

  • The new Base View (previously known as base-view-ng) becomes the official and only GUI of Base Command Manager.

  • The new Base View includes support for new wizards and monitoring dashboards, improved scalability, and better user experience.

Google Cloud Platform for Cluster On Demand

  • Base Command Manager now includes support for Google Cloud Platform (GCP) via Cluster on Demand, with the new cm-cod-gcp tool.

  • CMDaemon authenticates to GCP using a service account that allows the head node to obtain ephemeral tokens in order to interact with GCP APIs. Static credentials are not supported for GCP.
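
The token flow mentioned above can be sketched as follows: a head node running with an attached service account requests short-lived access tokens from the GCE metadata server instead of using static keys. The Python sketch below shows the general pattern only; it is not CMDaemon's implementation.

    # Illustration of obtaining an ephemeral access token from the GCE metadata
    # server via the VM's attached service account (no static credentials).
    import json
    import urllib.request

    METADATA_URL = (
        "http://metadata.google.internal/computeMetadata/v1/"
        "instance/service-accounts/default/token"
    )

    def fetch_access_token():
        req = urllib.request.Request(METADATA_URL, headers={"Metadata-Flavor": "Google"})
        with urllib.request.urlopen(req) as resp:
            token = json.load(resp)
        return token["access_token"], token["expires_in"]

    if __name__ == "__main__":
        access_token, expires_in = fetch_access_token()
        print(f"token expires in {expires_in} seconds")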

Slurm improvements

  • Slurm package files are moved from /cm/shared/apps/slurm to /cm/local/apps/slurm. This change enables easier in-place updates of Slurm, as recommended by SchedMD.

  • Support for the Slurm topology/block plugin is added (topology.conf), also enabling NVLink-aware scheduling of jobs on GB200 systems. New commands are added to configure the SlurmBlockTopologySettings in Base View and CMSH.

  • Support for topograph is added to automatically discover the cluster network topology in the cloud. A new option is added to configure topograph in the SlurmBlockTopologySettings in Base View and CMSH.
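
For reference, the topology/block plugin groups nodes into blocks in topology.conf, with TopologyPlugin=topology/block set in slurm.conf. The excerpt below is only an illustrative hand-written example; node names and block sizes are placeholders, not output generated by Base Command Manager.

    # Illustrative topology.conf for the Slurm topology/block plugin (placeholder values)
    BlockName=rack1 Nodes=node[001-018]
    BlockName=rack2 Nodes=node[019-036]
    BlockSizes=18,36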

NVSM integration

  • Base Command Manager introduces support for NVSM (NVIDIA System Management) in Base View and CMSH.

  • The commands invokable via Base Command Manager are convenient wrappers around NVSM commands. They are executed on one or more target nodes (or a category of nodes), and the results are collected.

Run:ai setup wizard

  • The cm-kubernetes-setup wizard is improved to streamline the installation of the Run:ai control plane on a BCM cluster (self-hosted).

  • Base Command Manager verifies the correctness of the input credentials and certificates, installs all the dependencies for Run:ai, and performs some post-installation configuration to make sure the control plane and the workload cluster are operational.

CUDA driver and MOFED packages

  • CUDA driver and MOFED packages are no longer distributed by Base Command Manager and have been removed from its repositories.

  • Administrators are now required to install the CUDA driver and MOFED packages directly from the NVIDIA repositories.

Kernel Provisioning for JupyterLab integration

  • Base Command Manager removes Jupyter Enterprise Gateway from its JupyterLab integration in favor of the new native Kernel Provisioning, which enables secure native communication with remote kernels (certificate generation and exchange are no longer required).

  • This underlying change is transparent to JupyterLab users. However, templates and kernels created with previous versions of BCM are not compatible with Base Command Manager 11 and the new Kernel Provisioning.

Other features and improvements

  • Introduced support for BaseOS 7

  • Introduced support for multi-arch Kubernetes deployments

  • Introduced support for Blackwell GPUs autodetection for Slurm

  • Introduced support for cgroups v2 in Slurm

  • Improved Prometheus metrics parser

  • Reduced coupling between CMDaemon and Kubernetes for easier lifecycle management

  • Introduced new cm-etcd-manage tool to manage etcd members (add/remove nodes, promote leaders, etc.) and perform backup/restore of the data

  • Introduced new cm-kubeadm-manage tool to handle Kubernetes components and related certificates

  • Migrated the Calico deployment from legacy Kubernetes addons to the Tigera Operator

  • Migrated the Ingress NGINX Controller from legacy Kubernetes addons to an operator-based deployment

Base Command Manager 11 Removed Features

The following are the key features removed from Base Command Manager 11:

  • cmjob

  • Machine learning packages (RPMs and DEBs)

  • Ceph integration

  • Apptainer/Singularity package

  • Ubuntu 20.04 (end of life: April 2025)

  • GigaIO integration

  • Altair Grid Engine integration

  • Direct attached storage for HA setups

  • BeeGFS integration

  • OpenShift integration