Node and Category Management

When working with workload management systems in BCM, it is important to understand node categories. It is typically more efficient to assign BCM roles through categories or configuration overlays than node by node, because a single change then applies to every node in the group.

Introduction to Categories

A node category is a group of regular nodes that share the same configuration. Ultimately, node categories provide an efficient way to manage a pool of compute nodes. Nodes are typically divided into categories based on hardware specifications and their specific purpose.

Here is a snapshot of what a list of categories could look like on a real system:

[bcm-headnode-01]% category

[bcm-headnode-01->category]% ls
Name (key)                 Software image                           Nodes
--------------------------- ------------------------------------------- --------
badler-gb200               baseos7.1rc2-image-arm64-2025.04.30-2      0
badler-gb200-lustre        baseos7.1rc2-image-arm64-2025.04.30-2      0
cvt-x86                    cvt-image-x86_64                           0
default-ubuntu2404-aarch64 default-image                              0
default-ubuntu2404-x86_64  default-image-ubuntu2404-x86_64            0
dgx-gb200                  baseos7-image-arm64-03-06-2025             72
gtc-gb200                  gtc-570.124.06-image-20250313              0
hwqa-gb200                 hwqa-570.133.20-image-20250422             18
k8s-ctrl-node              k8s-ctrl-image                             3
nmx-m                      NMX-M-image                                 3
perf-team                  perf-team-baseos7-image-arm64-20250505     35
slogin                     slogin-image                                2
spare-aarch64              spare-image-aarch64                        0
spare-x86                  spare-image-x86_64                         0
swqa-570.124.06            swqa-570.124.06-image-20250324             0
swqa-570.133.20            swqa-570.133.20-image-20250414             0
swqa-575.51.03             swqa-575.51.03-image-20250418              0
test-baseos7.1rc2          baseos7.1rc2v3-image-arm64-05-07-2025.orig 1

The left column lists the categories, the middle column shows the software image each category uses, and the right column shows how many nodes belong to that category. Note that categories can share a software image: the mapping between categories and software images is not one-to-one.
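To see at a glance which software images are shared, the listing can be post-processed on the head node. The following is a minimal sketch, assuming the three-column layout shown above and names that contain no spaces. The function name count_categories_per_image is hypothetical, and cmsh -c is used in the usage comment to run the listing non-interactively.

```shell
#!/usr/bin/env bash
# Sketch: count how many categories reference each software image.
# Assumes the three-column "Name / Software image / Nodes" layout shown
# above, with no spaces inside names. The function name is hypothetical.
count_categories_per_image() {
  awk '
    /^Name \(key\)/ || /^-+/ { next }   # skip the header and separator rows
    NF == 3 { count[$2]++ }            # column 2 is the software image
    END { for (image in count) print count[image], image }
  ' | sort -k1,1nr -k2,2
}

# Usage on a head node (not executed here):
#   cmsh -c "category; list" | count_categories_per_image
```

Images with a count greater than one are shared, so a change to such an image affects every category that references it.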

The help command shows which actions are available in the current module of cmsh. Here is what it looks like for the category module:

root@bcm-headnode-01:~# cmsh
[bcm-headnode-01]% category
[bcm-headnode-01->category]% help
...
=============================== category ===============================
add ........................... Create and use a category
append ........................ Append value(s) to category property
biossettings .................. Enter BIOS settings mode
bmcsettings ................... Enter BMC settings setup mode
clear ......................... Clear specific category property
clone ......................... Clone and use a category
commit ........................ Commit local changes
createramdisk ................. Create a ramdisk for a category
dpusettings ................... Enter DPU settings setup mode
foreach ....................... Execute a set of commands on several categorys
format ........................ Modify or view current list format
fsexports ..................... Enter fsexport setup mode
fsmounts ...................... Enter fsmount setup mode
get ........................... Get specific category property
gpusettings ................... Enter GPU settings setup mode
kernelmodules ................. Enter kernel module mode
list .......................... List overview
listnodes ..................... List nodes in a category
range ......................... Set a range of several categorys to execute future commands on
refresh ....................... Revert local changes
remove ........................ Remove a category
removefrom .................... Remove value(s) from category property
roles ......................... Enter role setup mode
selinuxsettings ............... Enter SELinux settings setup mode
services ...................... Enter service config mode
set ........................... Set category properties
show .......................... Show category properties
sort .......................... Modify or view current list sort order
staticroutes .................. Enter staticroute setup mode
swap .......................... Swap uuid names of two category
undefine ...................... Undefine specific category property
use ........................... Use the specified category
usedby ........................ List all entities which depend on this category
validate ...................... Remote validate a category
ztpsettings ................... Enter ZTP settings setup mode

The output above shows that the category module within cmsh provides a wide range of actions. Notably, you can add or clone categories. You can also control many category-wide settings, from BIOS settings and kernel modules to disks, networks, and filesystems.

Overview of BaseOS for NMC DGX Systems

NMC provides an enhanced BaseOS image that builds upon the standard DGX OS. This Ubuntu-based image comes ready to run on DGX nodes within the NMC SuperPOD environment. Its enhancements include:

  • BCM CMDaemon process for command and control from BCM head nodes.

  • Node Telemetry services.

  • Autonomous Hardware Recovery agent.

In addition, BaseOS includes the NVIDIA services and components you would expect in other DGX OS-based environments:

  • Data Center GPU Manager (DCGM)

  • NVIDIA GPU Driver

  • Optimized Kernel

  • NVIDIA System Management (NVSM)

  • DOCA OFED

It is good practice to keep backup clones of your software images. To create one, run the following cmsh commands within the softwareimage module:

[bcm-headnode-01->softwareimage]% clone <source-image-name> <backup-image-name>
[bcm-headnode-01->softwareimage*[<backup-image-name>*]]% commit
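For repeatable backups, the clone can be scripted with a dated name. The following is a minimal sketch, not a definitive procedure. The function name backup_clone_cmd and the "<image>-backup-<date>" naming scheme are illustrative assumptions (the scheme mirrors some names in the image listing in this section). The trailing commit is included on the assumption that a clone, like other local changes in cmsh, is only staged until it is committed.

```shell
#!/usr/bin/env bash
# Sketch: build the cmsh command string that clones an image to a dated backup.
# The "<image>-backup-<date>" naming scheme and the function name are
# illustrative assumptions; adapt them to your site's conventions.
backup_clone_cmd() {
  local image="$1"
  local stamp="${2:-$(date +%m-%d-%Y)}"   # optional date override
  printf 'softwareimage; clone %s %s-backup-%s; commit' "$image" "$image" "$stamp"
}

# Usage on a head node (not executed here):
#   cmsh -c "$(backup_clone_cmd baseos7-image-arm64)"
```

Building the command string separately from running it makes it easy to review exactly what would be sent to cmsh before touching a production image.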

It is also useful to show a software image to inspect its properties. The following example enters the softwareimage module in cmsh, lists the software images, selects one, and shows its properties:

[bcm-headnode-01]% softwareimage

[bcm-headnode-01->softwareimage]% ls
Name (key)                             Path (key)                                          Kernel version           Nodes
-------------------------------------------  ------------------------------------------------------  ----------------------   --------
NMX-M-image                            /cm/images/NMX-M-image                              6.8.0-51-generic         3
baseos7-image-arm64                    /cm/images/baseos7-image-arm64                      6.8.0-1021-nvidia-64k   0
baseos7-image-arm64-03-06-2025         /cm/images/baseos7-image-arm64-03-06-2025           6.8.0-1021-nvidia-64k   72
baseos7-image-arm64-03-06-2025-backup  /cm/images/baseos7-image-arm64-03-06-2025-backup                            0
baseos7-image-arm64-backup-04-23-2025  /cm/images/baseos7-image-arm64-backup-04-23-2025   6.8.0-1021-nvidia-64k   0
baseos7.1rc2-image-arm64-2025.04.25    /cm/images/baseos7.1rc2-image-arm64-2025.04.25     6.8.0-1025-nvidia-64k   0
baseos7.1rc2-image-arm64-2025.04.25-backup /cm/images/baseos7.1rc2-image-arm64-2025.04.25-backup 6.8.0-1025-nvidia-64k   0
baseos7.1rc2-image-arm64-2025.04.29-backup /cm/images/baseos7.1rc2-image-arm64-2025.04.29-backup 6.8.0-1025-nvidia-64k   0
baseos7.1rc2-image-arm64-2025.04.30-2  /cm/images/baseos7.1rc2-image-arm64-2025.04.30-2   6.8.0-1025-nvidia-64k   0
baseos7.1rc2v2-image-arm64-2025.05.02-1 /cm/images/baseos7.1rc2v2-image-arm64-2025.05.02-1 6.8.0-1025-nvidia-64k   0
baseos7.1rc2v3-image-arm64-05-07-2025.orig /cm/images/baseos7.1rc2v3-image-arm64-05-07-2025.orig 6.8.0-1025-nvidia-64k   1
cvt-image-x86_64                       /cm/images/cvt-image-x86_64                         6.8.0-51-generic         0
default-image                          /cm/images/default-image                            6.8.0-51-generic-64k     0
default-image-ubuntu2404-x86_64        /cm/images/default-image-ubuntu2404-x86_64          6.8.0-51-generic         0
gtc-570.124.06-image-20250313          /cm/images/gtc-570.124.06-image-20250313            6.8.0-1021-nvidia-64k   0
hwqa-570.133.20-image-20250422         /cm/images/hwqa-570.133.20-image-20250422           6.8.0-1021-nvidia-64k   18
k8s-ctrl-image                         /cm/images/k8s-ctrl-image                           6.8.0-51-generic-64k     3
perf-team-baseos7-image-arm64-20250505 /cm/images/perf-team-baseos7-image-arm64-20250505   6.8.0-1021-nvidia-64k   35
slogin-image                           /cm/images/slogin-image                             6.8.0-51-generic-64k     2
spare-image-aarch64                    /cm/images/spare-image-aarch64                      6.8.0-51-generic-64k     0
spare-image-x86_64                     /cm/images/spare-image-x86_64                       6.8.0-51-generic         0
swqa-570.124.06-image-20250324         /cm/images/swqa-570.124.06-image-20250324           6.8.0-1021-nvidia-64k   0
swqa-570.133.20-image-20250414         /cm/images/swqa-570.133.20-image-20250414           6.8.0-1021-nvidia-64k   0
swqa-575.51.03-image-20250418          /cm/images/swqa-575.51.03-image-20250418            6.8.0-1021-nvidia-64k   0

[bcm-headnode-01->softwareimage]% use baseos7-image-arm64

[bcm-headnode-01->softwareimage[baseos7-image-arm64]]% show
Parameter                      Value
-------------------------------- ------------------------------------------------
Name                           baseos7-image-arm64
Nodes                          0
Revision
Path                           /cm/images/baseos7-image-arm64
Creation time                  Fri, 28 Feb 2025 14:15:12 PST
Kernel version                 6.8.0-1021-nvidia-64k
Kernel parameters              nouveau.modeset=0
Kernel output console          tty0
Enable SOL                     no
SOL Port                       ttyS1
SOL Speed                      115200
SOL Flow Control               yes
FSPart                         /cm/images/baseos7-image-arm64
Boot FSPart                    /cm/images/baseos7-image-arm64/boot
Notes                          <0B>
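When only one property is needed, for example to compare kernel versions across images in a script, the show output can be filtered. The following is a minimal sketch, assuming the two-column layout above, where each label starts its line and is separated from the value by at least two spaces. The function name show_value is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the value of one parameter from "show" output read on stdin.
# Assumes the two-column layout above: the label starts the line and is
# separated from its value by at least two spaces. The function name is
# hypothetical.
show_value() {
  awk -v label="$1" '
    index($0, label) == 1 {
      rest = substr($0, length(label) + 1)
      if (rest ~ /^  +[^ ]/) {       # label must be followed by the column gap
        sub(/^ +/, "", rest)
        print rest
        exit
      }
    }
  '
}

# Usage on a head node (not executed here):
#   cmsh -c "softwareimage; use baseos7-image-arm64; show" | show_value "Kernel version"
```

Requiring the column gap after the label means that a partial label such as "Kernel" does not accidentally match "Kernel version" or "Kernel parameters".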

Category for DGX Compute Nodes Using BaseOS

This section assumes you have access to the NMC BaseOS software image for DGX nodes.

The following commands create a category and associate a software image with it:

[bcm-headnode-01]% category
[bcm-headnode-01->category]% add dgx-gb200
[bcm-headnode-01->category*[dgx-gb200*]]% set softwareimage baseos7-image-arm64
[bcm-headnode-01->category*[dgx-gb200*]]% commit

The new dgx-gb200 category now has the baseos7-image-arm64 software image associated with it. All nodes using this category will boot into this software image.

Node deployment revolves around the PXE boot process, which allows compute nodes to boot over the network instead of from local disks. It is important to configure the related properties correctly for the category.

Here is an example of adjusting properties associated with PXE at the category level:

[bcm-headnode-01->category[dgx-gb200]]% set kerneloutputconsole tty0
[bcm-headnode-01->category*[dgx-gb200*]]% set bootloaderprotocol http
[bcm-headnode-01->category*[dgx-gb200*]]% set bootloader grub
[bcm-headnode-01->category*[dgx-gb200*]]% commit
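The same settings can be applied non-interactively, which helps when several categories need identical boot properties. The following is a minimal sketch: the function name category_set_cmd and the property=value argument convention are hypothetical, while the property names are the ones used in the example above.

```shell
#!/usr/bin/env bash
# Sketch: build one cmsh command string that sets several category properties
# and commits once. The function name is hypothetical; the property names are
# the ones used in the interactive example above.
category_set_cmd() {
  local category="$1"; shift
  local cmd="category; use ${category}"
  local pair
  for pair in "$@"; do
    # each argument has the form property=value
    cmd="${cmd}; set ${pair%%=*} ${pair#*=}"
  done
  printf '%s; commit' "$cmd"
}

# Usage on a head node (not executed here):
#   cmsh -c "$(category_set_cmd dgx-gb200 kerneloutputconsole=tty0 bootloaderprotocol=http bootloader=grub)"
```

A single commit at the end keeps the change atomic from the operator's point of view: either all properties are staged and committed, or the string can be inspected and discarded.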

Here is what you might see when you show the dgx-gb200 category:

[bcm-headnode-01->category[dgx-gb200]]% show
Parameter                      Value
-------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------
Name                           dgx-gb200
Nodes                          72
Revision
Use exclusively for
User node login                ALWAYS
Data node                      no
Allow networking restart       no
Default gateway                7.241.16.1 (network: internalnet)
Default gateway metric         0

Slurm Roles and Configuration Overlays

In this section, we will talk about categories and roles in the context of setting up compute nodes for Slurm.

By default, the cm-wlm-setup tool creates several configuration overlays and assigns the corresponding roles to them. The following shows what cm-wlm-setup provides for Slurm:

Slurm Configuration Overlays Table

In this example, the default category takes on the slurmclient and slurmsubmit roles (far-right column) through its association with the slurm-client and slurm-submit configuration overlays.

The wlm-headnode-submit configuration overlay is a special overlay. It is applied only to the head node and enables the head node to have the submit role for a given workload manager.

Integrating Categories with Slurm Roles and Configuration Overlays

The following steps associate the dgx-gb200 category with the appropriate Slurm configuration overlays and roles.

  1. Enter cmsh -> category and use the dgx-gb200 category:

    [bcm-headnode-01]% category
    [bcm-headnode-01->category]% use dgx-gb200
    
  2. Enable the configuration overlays for slurm-client and slurm-submit:

    [bcm-headnode-01->category[dgx-gb200]]% configurationoverlay
    [bcm-headnode-01->category[dgx-gb200]->configurationoverlay]% set slurm-client
    [bcm-headnode-01->category[dgx-gb200]->configurationoverlay*]% set slurm-submit
    [bcm-headnode-01->category[dgx-gb200]->configurationoverlay*]% commit
    

This will associate the dgx-gb200 category with both the slurm-client and slurm-submit configuration overlays, giving nodes in this category the necessary Slurm roles.
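The association can also be approached from the overlay side, which may be convenient when scripting. The following is a hedged sketch only. It assumes that overlays can be edited from the top-level configurationoverlay mode of cmsh and that each overlay exposes a list property named categories. Neither detail is confirmed by this document, so verify both on your BCM release before use. The function name overlay_add_category_cmd is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build a cmsh command string that adds a category to each listed overlay.
# ASSUMPTION: overlays are edited from the top-level configurationoverlay mode
# and expose a list property named "categories". Verify on your BCM release.
overlay_add_category_cmd() {
  local category="$1"; shift
  local cmd="configurationoverlay"
  local overlay
  for overlay in "$@"; do
    # commit per overlay so that each object is saved before switching
    cmd="${cmd}; use ${overlay}; append categories ${category}; commit"
  done
  printf '%s' "$cmd"
}

# Usage on a head node (not executed here):
#   cmsh -c "$(overlay_add_category_cmd dgx-gb200 slurm-client slurm-submit)"
```

Using append rather than set is a deliberate choice in this sketch: it is intended to leave any categories already associated with the overlay, such as default, in place.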