Configuration for Node Provisioning#

The control plane for the DGX SuperPOD reference architecture is primarily a mix of ARM/C2 servers and x86 servers. The head nodes can use either x86 or ARM/C2 processors; however, it is recommended to include a set of ARM/C2 nodes so that ARM code can be compiled natively to run on the GB200 compute trays.

BCM 11 creates a default image, a node-installer image, and a /cm/shared image for each architecture. On PXE boot, the BCM PXE bootloader detects the server's architecture and then loads the matching node-installer.
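For example, the per-architecture software images can be inspected from cmsh on the head node; the exact image names are cluster-specific:

```
# List the software images BCM 11 created; with mixed architectures there is
# typically one set of images per architecture (names are cluster-specific).
cmsh -c "softwareimage; list"
```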

This section covers the complete configuration process for node provisioning, from the initial mixed-architecture setup through power-on and provisioning. The process involves:

* setting up mixed architectures
* creating software images
* defining node categories
* configuring both control plane and compute nodes

Mixed Architecture, Software Image, and Category Setup#

Mixed Architecture Setup

Configure support for mixed ARM/C2 and x86 server architectures by importing OS images for different architectures than the head node.

Software Image Setup

Create and customize software images for each node category and type, including control plane nodes and DGX GB200 nodes.
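As a minimal sketch, a per-category image is typically produced by cloning an existing image and then customizing it in a chroot; the image names used here are hypothetical:

```
# Clone the default image into a dedicated image for one node type
# (both image names are hypothetical examples).
cmsh -c "softwareimage; clone default-image dgx-gb200-image; commit"

# Chroot into the cloned image tree to install or configure software inside it.
cm-chroot-sw-img /cm/images/dgx-gb200-image
```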

Category Creation

Define and configure individual categories for different node types with appropriate settings and software image assignments.
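A hedged sketch of creating one such category and assigning it a software image; the category and image names are hypothetical:

```
# Clone the default category for a specific node type and point it at the
# matching software image (category and image names are hypothetical).
cmsh -c "category; clone default dgx-gb200; commit"
cmsh -c "category; use dgx-gb200; set softwareimage dgx-gb200-image; commit"
```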

Initial Setup Verification Checklist

Comprehensive verification checklist to ensure all initial setup processes are complete and correct before proceeding to node configuration.

Control Plane Node Configuration#

Control Plane Node Entries

Create node entries for control plane servers by defining “golden” nodes and cloning them for each node type.
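As an illustrative sketch, once a golden node entry is fully configured, it can be cloned in cmsh and the clone given its own identity; the node names and MAC address below are hypothetical:

```
# Clone a fully configured "golden" node entry; cmsh switches context to the
# clone, so its MAC can be set before committing (all values are hypothetical).
cmsh -c "device; clone knode01 knode02; set mac 04:AA:BB:CC:DD:02; commit"
```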

Control Plane Configuration Verification Checklist

Comprehensive verification checklist to ensure all control plane node configurations are complete and correct before proceeding to provisioning.

GB200 Rack Node Configuration#

With a new feature of NVIDIA Mission Control, an entire DGX GB200 rack can be imported into the BCM device list from a rack inventory file. For each rack, the file includes the MACs and serial numbers for all eighteen GB200 compute trays and all nine NVLink Switch trays, in addition to the power shelves. Once imported, the entire rack can be controlled as a single cohesive entity through a series of new rack-level commands. These rack management commands are not covered in this document; see the NVIDIA Mission Control User Guide.

If manual configuration of the node entries is needed, the .json examples in this section can be modified and imported into BCM 11, or the node entries can be added manually, as was done for the control plane nodes.
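The authoritative schema is given by the .json examples in this section; the fragment below is only a hypothetical illustration of the kind of per-rack data such a file carries, and its field names are assumptions rather than the actual schema:

```json
{
  "rack": "rack01",
  "computeTrays": [
    { "name": "rack01-tray01", "mac": "04:AA:BB:CC:DD:01", "serial": "SN0001" }
  ],
  "nvlinkSwitchTrays": [
    { "name": "rack01-nvsw01", "mac": "04:AA:BB:CC:DD:11", "serial": "SN0101" }
  ],
  "powerShelves": [
    { "name": "rack01-pshelf01", "serial": "SN0201" }
  ]
}
```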

Automated Rack Import Process

Import entire DGX GB200 racks, including all compute trays, NVLink Switch trays, and power shelves, using the bcm-netautogen tool.

Manual Rack Import Process

Manually edit or create .json files to import compute tray node entries, NVLink Switch tray node entries, and power shelf node entries.

Manual Addition of GB200 Rack Entries

Manually add DGX GB200 rack entries when automatic import is not available or when custom configurations are required.
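A minimal sketch of adding a single compute tray entry by hand, following the same pattern as the control plane nodes; the node name, MAC, and category are hypothetical:

```
# Manually add one GB200 compute tray node entry and assign its MAC and
# category (all values shown are hypothetical examples).
cmsh -c "device; add physicalnode rack01-tray01; set mac 04:AA:BB:CC:DD:01; set category dgx-gb200; commit"
```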

Manual Addition of NVLink Switch Rack Entries

Manually add NVLink Switch entries with full Zero Touch Provisioning (ZTP) configuration, including NVOS firmware installation and CM Lite Daemon setup.

GB200 Rack Configuration Verification Checklist

Comprehensive verification checklist to ensure all GB200 rack configurations are complete and correct before proceeding to rack bring-up and provisioning.

Final Setup and Provisioning#

Finalize Headnode Setup

Complete the headnode configuration to prepare for cluster-wide provisioning operations.

Control Plane Power On and Provisioning

Execute the final power-on and provisioning sequence for control plane nodes to bring the cluster online.
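As a sketch, assuming the control plane nodes are grouped under a common category, the power-on can be driven from cmsh and the installation watched via node status; the category name is hypothetical:

```
# Power on every node in a category, then watch node status as the
# node-installer provisions them (category name is hypothetical).
cmsh -c "device; power on -c control"
cmsh -c "device; status"
```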

Unified Fabric Manager (UFM) Pre-Setup

Configure UFM prerequisites before beginning the node provisioning process.