Introduction#

This document provides supplemental instructions for creating and managing DGX OS and BaseOS images in Base Command Manager (BCM). Refer to the Base Command Manager Documentation for complete guides and reference material.

Important

BaseOS Software refers to the software that is released on top of, and independently from, the OS distribution. It describes the software stack included in DGX OS. Images created with the instructions in this document share the same Ubuntu Linux distribution as DGX OS.

BCM manages the local file system of a compute node as a software image, a blueprint for the contents of the file system. In practice, a software image is a directory on the head node containing a full Linux filesystem.

When a compute node boots, the node provisioning system sets up the node with a copy of the software image. Once the node is fully booted, it is possible to instruct the node to re-synchronize its local filesystems with the software image. This procedure can be used to distribute changes of the software image to the nodes without rebooting them (imageupdate). It is also possible to lock a software image so that no node is able to pick up the image until the software image is unlocked.
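As a sketch of the imageupdate procedure described above, the following non-interactive cmsh command pushes pending software image changes to a running node without a reboot (node001 is a placeholder name; the -w option waits for the synchronization to finish):

  $ cmsh -c "device; imageupdate -n node001 -w"

On a real cluster, verify the node state before and after with the status command in the device context.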

BCM includes a default image (default-image) derived from the same Linux distribution as the head node with the BCM node management software. The default image provides a base for customized images and is used on nodes that don’t have a specific image assigned to them. Most clusters typically utilize customized images tailored to specific node roles (e.g., login nodes, Slurm partitions, or Kubernetes worker nodes). Users manage these customizations on the head node by creating a new image or cloning and modifying existing images to suit specific requirements.
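As a hedged example of the cloning workflow, an existing image can be duplicated and committed with cmsh (my-image is a placeholder name):

  $ cmsh -c "softwareimage; clone default-image my-image; commit"

The clone is created as a copy of the image directory under /cm/images/ and can then be customized and assigned to nodes or categories.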

Compute nodes can also be provisioned with a system image that includes the NVIDIA BaseOS Software. NVIDIA BaseOS delivers an essential software stack designed for a robust, production-ready operating system environment critical for high-performance AI compute nodes. Key features include optimized system configurations, enhanced drivers, comprehensive diagnostics, and advanced monitoring tools.

BCM further facilitates the management and deployment of architecture-specific images, which is necessary for managing diverse hardware within mixed-architecture clusters.

References#

NVIDIA Base Command Manager

Provides comprehensive information and user guides for Base Command Manager (BCM).

NVIDIA BaseOS

Additional information about the NVIDIA BaseOS Software, including features, installation instructions, and configuration options.

Additional BCM Concepts#

The following list provides a brief overview of BCM concepts relevant to this guide.

Refer to Base Command Manager 11 Admin Guide - Section 2.1 (Concepts) for a more extensive list and detailed information.

Devices, Chassis, and Racks#

A device in the BCM infrastructure represents a component of a cluster, such as a physical node, a virtual or cloud node, a GPU, or a switch, each with properties that can be configured through BCM.

A cluster can have local nodes grouped physically into racks, which can be visualized in BCMView and the console and also operated on, for example, to reboot an entire rack.

Node Categories#

Node categories provide administrators a mechanism for grouping nodes that share common characteristics, such as hardware specifications, intended roles, or configuration requirements. Nodes within a category inherit default property values and settings from that category, simplifying management of large clusters. Settings defined in a category can, however, be overridden for a particular node.

By configuring the category, administrators can apply configurations, assign roles, or perform operations on an entire group of nodes simultaneously:

  • Configure a large group of nodes concurrently, for example, to set up a group of nodes with a particular disk layout.

  • Operate on a large group of nodes concurrently, for example, to conduct a reboot on an entire category.

Node categories are typically assigned a specific software image, ensuring all members of that group boot with the correct software stack. Multiple categories can use the same image, and some properties of an image can be modified by category-specific settings, such as the disk layout, kernel parameters, etc.

System types other than login or compute nodes are automatically placed in the default node category, defaultcategory.
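As an illustrative sketch, a category can be cloned and pointed at a custom software image with cmsh (my-category and my-image are placeholder names, and a category named default is assumed to exist):

  $ cmsh -c "category; clone default my-category; commit"
  $ cmsh -c "category; use my-category; set softwareimage my-image; commit"

All nodes assigned to my-category would then provision with my-image on their next reboot or image update.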

Node Groups#

A node group is a collection of individual nodes assembled for administrative convenience to perform operations simultaneously. Its primary purpose is to allow administrators to run commands and carry out operations on multiple nodes at once, such as a concurrent reboot or a network configuration update.

A single node can belong to multiple node groups simultaneously, or none at all. Nodes within a group do not necessarily share the same hardware, operating system image, or configuration. This is different from a node category, where all nodes share the exact same configuration, disk layout, or software image.

While roles and categories determine what a node is and how it is configured, node groups define arbitrary collections of nodes for operational management tasks.
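For illustration, a node group can be created and populated with cmsh (the group name rack1-nodes and the node names are placeholders; the node001..node004 range syntax is assumed to expand to the individual nodes):

  $ cmsh -c "nodegroup; add rack1-nodes; append nodes node001..node004; commit"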

Roles#

Roles are specific functions or tasks that can be assigned to a node, a group of nodes, or a category. When a role is assigned to a category, all nodes in that category implicitly inherit it. Roles enable efficient management and monitoring of large-scale infrastructure by defining the specific responsibilities of nodes within the cluster.

Assigning a role to a node activates the corresponding functionality, allowing administrators to customize the configuration and purpose of the node within the cluster. For example, a node can be assigned the “provisioning node” role to handle the deployment of other nodes, or the “storage node” role to manage data storage. The “Slurm client” role uses parameters to control how a node is configured within the Slurm workload manager, such as defining its queues or the number of GPUs it utilizes.
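As a sketch of category-level role assignment, the Slurm client role could be assigned so that all member nodes inherit it (the category name default is an assumption; adjust to your cluster):

  % category use default
  [headnode->category[default]]% roles
  [headnode->category[default]->roles]% assign slurmclient
  [headnode->category[default]->roles*[slurmclient*]]% commit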

Image Configurations and Power Operations#

BCM provides administrators with extensive options to configure and operate clusters. The following table covers only image configuration and the reboot and power-cycle operations used to reprovision systems.

Concept        Object         Reboot  Power Op  Assign Image  Kernel Parameters  Kernel Modules
-------------  -------------  ------  --------  ------------  -----------------  --------------
Images         softwareimage  N/A     N/A       N/A           Yes                Yes
Node           device         Yes     Yes       Yes           Yes                Yes
Node Category  category       Yes     Yes       Yes           Yes                Yes
Node Group     nodegroup      Yes     Yes       No            No                 No
Roles          role           Yes     Yes       No            No                 No
Rack           rack           Yes     Yes       No            No                 No

Note

BCM does not merge kernel parameters from different objects but replaces them with the following priorities:

  • Parameters defined in a category replace those defined in an image.

  • Parameters defined in a node replace those defined in a category or in an image.

Note

To reboot or power cycle node categories or groups, use % device reboot or % device power with the following options:

  • -n <node> for a node

  • -g <group> for a node group

  • -c <category> for a node category

  • -r <rack> for a rack

Kernel parameters should be set at the respective level: go to device or category and use set kernelparameters.

You can always use help <command> and press <tab> <tab> to see a list of options in the current context.
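Putting these options together, a category-wide reboot, a group power reset, and a category-level kernel parameter change might look like the following sketch (the category name default, the group name rack1-nodes, and the console=ttyS0 parameter string are placeholders):

  % device reboot -c default
  % device power reset -g rack1-nodes

  $ cmsh -c "category; use default; set kernelparameters console=ttyS0; commit"

Because parameters defined in a category replace those defined in an image, this setting takes effect on all nodes in the category at their next boot.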

The cmsh Shell Command#

The cmsh shell can be used either interactively as a command-line shell or non-interactively by passing commands as arguments directly to cmsh. Throughout this guide, the examples provided use these options interchangeably.

Note

To navigate the levels in cmsh:

  • Use exit to return to the previous level or to leave the shell entirely at the top level.

  • Use Ctrl+D as a shortcut to exit the shell.

  • Interactive shell

    1. Execute cmsh to start an interactive shell at the root level showing the node name:

      $ cmsh
      
      [headnode]%
      
    2. Enter softwareimage to change to software images:

      [headnode]% softwareimage
      
    3. Issue list to list all images on the system.

      [headnode->softwareimage]% list
      
      Name (key)                        Path (key)                                   Kernel version           Nodes
      ---------------------------       -------------------------------------------  --------------------     -----
      default-image                     /cm/images/default-image                    6.8.0-51-generic-64k        0
      
    4. Use use to select an image to modify:

      [headnode->softwareimage]% use my-image
      

      Note

      You can also use softwareimage use my-image directly upon entering the shell.

    5. Exit the shell by pressing Ctrl+D.

  • Passing commands as arguments

    Use cmsh with the -c option and a list of commands separated by “;” :

    $ cmsh -c "softwareimage;list"
    
    Name (key)                        Path (key)                                   Kernel version           Nodes
    ---------------------------       -------------------------------------------  --------------------     -----
    default-image                     /cm/images/default-image                    6.8.0-51-generic-64k        0
    

Prompt Nomenclature throughout the Documentation#

Prompts in the code examples indicate the context in which commands are executed.

  • $ indicates a standard shell prompt on the head node or another node.

  • % is a short form for all cmsh prompts, such as [headnode]%, [headnode->softwareimage]%, etc.

  • # indicates the root user prompt inside the context of an image (chroot).