Introduction

This guide helps a deployment engineer complete the software installation of Base Command Manager (BCM) 11 on the head node, set up the control plane nodes, and prepare the BCM software to power on, provision, and manage the NVIDIA GB200 NVL72 rack(s) in the cluster. It also shows the deployment engineer how to prepare the cluster for NVIDIA Mission Control software installation. The guide is designed to be followed in order, with each section building on the previous one unless indicated otherwise. It assumes that the reader has a basic understanding of Linux and networking concepts, as well as familiarity with NVIDIA hardware and software.

The major steps that will be covered in this guide are:

  1. BCM 11 software installation on a server designated as the head node.

    • Installing the BCM 11 software on the head node.

    • Configuring the head node for mixed architecture provisioning support.

    • Setting up the head node for network connectivity.

  2. Configuring networking for the entire cluster within BCM 11, either manually (OEM deployments) or with the bcm-netautogen tool (DGX SuperPOD deployments only).

  3. Creating categories and their respective software images for node provisioning and NVIDIA Mission Control Software installation, including:

    • Slurm login.

    • Kubernetes (K8s) Administrator.

    • Kubernetes (K8s) User space.

    • DGX GB200 compute trays.

  4. Individual control plane node hardware setup:

    • Slurm control plane nodes.

    • K8s Administrator control plane nodes.

    • K8s User space control plane nodes.

  5. GB200 rack setup, either through the GB200 rack import process or through a manual setup in which the major rack devices are added to BCM:

    • GB200 compute trays.

    • NVLink Switch devices.

    • Power shelves.

  6. High availability (HA) setup.

  7. Adding shared storage (NFS) to BCM.
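
As a quick sanity check at various points in the guide, the objects created in these steps can be listed with the cmsh shell that ships with BCM. The commands below are a minimal sketch using standard cmsh modes; the objects they list will depend on how far through the guide the cluster is:

    # Spot checks from the head node; output will differ per deployment.
    cmsh -c "network; list"        # networks configured in step 2
    cmsh -c "category; list"       # categories created in step 3
    cmsh -c "softwareimage; list"  # software images created in step 3
    cmsh -c "device; list"         # head node, control plane nodes, and GB200 rack devices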

Note

NVIDIA DGX SuperPOD supports workload orchestration with Slurm or with Kubernetes (K8s) through RunAI, but not both concurrently.