Node and Category Management#
When working with workload management systems in BCM, it is important to understand node categories. Assigning BCM roles through categories or configuration overlays is typically more efficient than assigning them node by node, because many nodes that share a category can be managed at once.
Introduction to Categories#
A node category is a group of regular nodes that share the same configuration. Categories provide an efficient way to manage a pool of compute nodes, which are typically divided into categories based on their hardware specifications and purpose.
Here is a snapshot of what a list of categories could look like on a real system:
[bcm-headnode-01]% category
[bcm-headnode-01->category]% ls
Name (key) Software image Nodes
--------------------------- ------------------------------------------- --------
badler-gb200 baseos7.1rc2-image-arm64-2025.04.30-2 0
badler-gb200-lustre baseos7.1rc2-image-arm64-2025.04.30-2 0
cvt-x86 cvt-image-x86_64 0
default-ubuntu2404-aarch64 default-image 0
default-ubuntu2404-x86_64 default-image-ubuntu2404-x86_64 0
dgx-gb200 baseos7-image-arm64-03-06-2025 72
gtc-gb200 gtc-570.124.06-image-20250313 0
hwqa-gb200 hwqa-570.133.20-image-20250422 18
k8s-ctrl-node k8s-ctrl-image 3
nmx-m NMX-M-image 3
perf-team perf-team-baseos7-image-arm64-20250505 35
slogin slogin-image 2
spare-aarch64 spare-image-aarch64 0
spare-x86 spare-image-x86_64 0
swqa-570.124.06 swqa-570.124.06-image-20250324 0
swqa-570.133.20 swqa-570.133.20-image-20250414 0
swqa-575.51.03 swqa-575.51.03-image-20250418 0
test-baseos7.1rc2 baseos7.1rc2v3-image-arm64-05-07-2025.orig 1
As you can see, the left column lists the categories, the middle column shows the software image each category uses, and the right column shows how many nodes are in each category. Note that categories can share software images: the mapping between categories and software images is not one-to-one. For example, badler-gb200 and badler-gb200-lustre both use baseos7.1rc2-image-arm64-2025.04.30-2.
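To make the shared-image relationship concrete, here is a minimal Python sketch that groups categories by software image. It works on a few rows copied from the listing above rather than on live cmsh output, and the function name categories_by_image is a hypothetical helper for illustration only.

```python
from collections import defaultdict

# A few rows copied from the category listing above:
# category name, software image, node count.
LISTING = """\
badler-gb200 baseos7.1rc2-image-arm64-2025.04.30-2 0
badler-gb200-lustre baseos7.1rc2-image-arm64-2025.04.30-2 0
dgx-gb200 baseos7-image-arm64-03-06-2025 72
hwqa-gb200 hwqa-570.133.20-image-20250422 18
perf-team perf-team-baseos7-image-arm64-20250505 35
"""


def categories_by_image(listing: str) -> dict[str, list[str]]:
    """Group category names by the software image they reference."""
    groups = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        # A data row has exactly three fields and a numeric node count;
        # header and separator lines fail this check and are skipped.
        if len(fields) != 3 or not fields[2].isdigit():
            continue
        name, image, _nodes = fields
        groups[image].append(name)
    return dict(groups)


groups = categories_by_image(LISTING)

# Keep only the images referenced by more than one category.
shared = {image: names for image, names in groups.items() if len(names) > 1}
for image, names in shared.items():
    print(f"{image}: {', '.join(names)}")
```

For these sample rows, the only shared image is baseos7.1rc2-image-arm64-2025.04.30-2, used by both badler categories.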
The help command is useful for seeing which actions are available in a given cmsh module. Here is what it looks like for the category module:
root@bcm-headnode-01:~# cmsh
[bcm-headnode-01]% category
[bcm-headnode-01->category]% help
...
=============================== category ===============================
add ........................... Create and use a category
append ........................ Append value(s) to category property
biossettings .................. Enter BIOS settings mode
bmcsettings ................... Enter BMC settings setup mode
clear ......................... Clear specific category property
clone ......................... Clone and use a category
commit ........................ Commit local changes
createramdisk ................. Create a ramdisk for a category
dpusettings ................... Enter DPU settings setup mode
foreach ....................... Execute a set of commands on several categorys
format ........................ Modify or view current list format
fsexports ..................... Enter fsexport setup mode
fsmounts ...................... Enter fsmount setup mode
get ........................... Get specific category property
gpusettings ................... Enter GPU settings setup mode
kernelmodules ................. Enter kernel module mode
list .......................... List overview
listnodes ..................... List nodes in a category
range ......................... Set a range of several categorys to execute future commands on
refresh ....................... Revert local changes
remove ........................ Remove a category
removefrom .................... Remove value(s) from category property
roles ......................... Enter role setup mode
selinuxsettings ............... Enter SELinux settings setup mode
services ...................... Enter service config mode
set ........................... Set category properties
show .......................... Show category properties
sort .......................... Modify or view current list sort order
staticroutes .................. Enter staticroute setup mode
swap .......................... Swap uuid names of two category
undefine ...................... Undefine specific category property
use ........................... Use the specified category
usedby ........................ List all entities which depend on this category
validate ...................... Remote validate a category
ztpsettings ................... Enter ZTP settings setup mode
In the above output, we can see that the category module within cmsh provides a range of actions. Notably, you can add or clone categories. You can also enter submodes that control BIOS, BMC, GPU, and DPU settings, kernel modules, filesystem mounts and exports, static routes, services, and roles.
Overview of BaseOS for NMC DGX Systems#
NMC provides an enhanced BaseOS image that builds upon the standard DGX OS. This Ubuntu-based image comes ready to run on DGX nodes within the NMC SuperPOD environment. The enhancements include:
BCM CMDaemon process for command and control from BCM head nodes.
Node Telemetry services.
Autonomous Hardware Recovery agent.
In addition, BaseOS includes NVIDIA services you would expect in other DGX OS based environments:
Datacenter GPU Manager (DCGM)
NVIDIA GPU Driver
Optimized Kernel
NVIDIA System Management (NVSM)
DOCA OFED
It is good practice to clone your software images as backups. This can be done with the following cmsh command, within the softwareimage module:
[bcm-headnode-01->softwareimage]% clone <source-image-name> <backup-image-name>
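If backups are taken regularly, a small helper can keep the backup names consistent. The following Python sketch builds a dated backup name and a one-shot cmsh command line. The naming pattern mirrors image names that appear elsewhere on this page (for example baseos7-image-arm64-backup-04-23-2025). Running cmsh non-interactively with -c, and the need for a commit after clone, are assumptions to verify against your BCM version. Both function names are hypothetical.

```python
import shlex
from datetime import date


def backup_image_name(source: str, day: date) -> str:
    """Return a dated backup name, e.g. '<source>-backup-04-23-2025'."""
    return f"{source}-backup-{day.strftime('%m-%d-%Y')}"


def clone_command(source: str, day: date) -> str:
    """Build a shell command that clones an image inside cmsh.

    Assumption: 'cmsh -c' accepts a semicolon-separated command string.
    """
    steps = [
        "softwareimage",
        f"clone {source} {backup_image_name(source, day)}",
        "commit",  # assumption: the clone must be committed to be saved
    ]
    return "cmsh -c " + shlex.quote("; ".join(steps))


# Print the command for review instead of executing it.
print(clone_command("baseos7-image-arm64", date(2025, 4, 23)))
```

The sketch only prints the command, so it can be reviewed before anything is run on the head node.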
It is also useful to show a software image to see its properties. The following example enters the softwareimage module in cmsh, lists the software images, selects one, and shows its properties:
[bcm-headnode-01]% softwareimage
[bcm-headnode-01->softwareimage]% ls
Name (key) Path (key) Kernel version Nodes
------------------------------------------- ------------------------------------------------------ ---------------------- --------
NMX-M-image /cm/images/NMX-M-image 6.8.0-51-generic 3
baseos7-image-arm64 /cm/images/baseos7-image-arm64 6.8.0-1021-nvidia-64k 0
baseos7-image-arm64-03-06-2025 /cm/images/baseos7-image-arm64-03-06-2025 6.8.0-1021-nvidia-64k 72
baseos7-image-arm64-03-06-2025-backup /cm/images/baseos7-image-arm64-03-06-2025-backup 0
baseos7-image-arm64-backup-04-23-2025 /cm/images/baseos7-image-arm64-backup-04-23-2025 6.8.0-1021-nvidia-64k 0
baseos7.1rc2-image-arm64-2025.04.25 /cm/images/baseos7.1rc2-image-arm64-2025.04.25 6.8.0-1025-nvidia-64k 0
baseos7.1rc2-image-arm64-2025.04.25-backup /cm/images/baseos7.1rc2-image-arm64-2025.04.25-backup 6.8.0-1025-nvidia-64k 0
baseos7.1rc2-image-arm64-2025.04.29-backup /cm/images/baseos7.1rc2-image-arm64-2025.04.29-backup 6.8.0-1025-nvidia-64k 0
baseos7.1rc2-image-arm64-2025.04.30-2 /cm/images/baseos7.1rc2-image-arm64-2025.04.30-2 6.8.0-1025-nvidia-64k 0
baseos7.1rc2v2-image-arm64-2025.05.02-1 /cm/images/baseos7.1rc2v2-image-arm64-2025.05.02-1 6.8.0-1025-nvidia-64k 0
baseos7.1rc2v3-image-arm64-05-07-2025.orig /cm/images/baseos7.1rc2v3-image-arm64-05-07-2025.orig 6.8.0-1025-nvidia-64k 1
cvt-image-x86_64 /cm/images/cvt-image-x86_64 6.8.0-51-generic 0
default-image /cm/images/default-image 6.8.0-51-generic-64k 0
default-image-ubuntu2404-x86_64 /cm/images/default-image-ubuntu2404-x86_64 6.8.0-51-generic 0
gtc-570.124.06-image-20250313 /cm/images/gtc-570.124.06-image-20250313 6.8.0-1021-nvidia-64k 0
hwqa-570.133.20-image-20250422 /cm/images/hwqa-570.133.20-image-20250422 6.8.0-1021-nvidia-64k 18
k8s-ctrl-image /cm/images/k8s-ctrl-image 6.8.0-51-generic-64k 3
perf-team-baseos7-image-arm64-20250505 /cm/images/perf-team-baseos7-image-arm64-20250505 6.8.0-1021-nvidia-64k 35
slogin-image /cm/images/slogin-image 6.8.0-51-generic-64k 2
spare-image-aarch64 /cm/images/spare-image-aarch64 6.8.0-51-generic-64k 0
spare-image-x86_64 /cm/images/spare-image-x86_64 6.8.0-51-generic 0
swqa-570.124.06-image-20250324 /cm/images/swqa-570.124.06-image-20250324 6.8.0-1021-nvidia-64k 0
swqa-570.133.20-image-20250414 /cm/images/swqa-570.133.20-image-20250414 6.8.0-1021-nvidia-64k 0
swqa-575.51.03-image-20250418 /cm/images/swqa-575.51.03-image-20250418 6.8.0-1021-nvidia-64k 0
[bcm-headnode-01->softwareimage]% use baseos7-image-arm64
[bcm-headnode-01->softwareimage[baseos7-image-arm64]]% show
Parameter Value
-------------------------------- ------------------------------------------------
Name baseos7-image-arm64
Nodes 0
Revision
Path /cm/images/baseos7-image-arm64
Creation time Fri, 28 Feb 2025 14:15:12 PST
Kernel version 6.8.0-1021-nvidia-64k
Kernel parameters nouveau.modeset=0
Kernel output console tty0
Enable SOL no
SOL Port ttyS1
SOL Speed 115200
SOL Flow Control yes
FSPart /cm/images/baseos7-image-arm64
Boot FSPart /cm/images/baseos7-image-arm64/boot
Notes <0B>
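A listing like the one above can also be used to check which in-use images have no obvious backup clone. The Python sketch below works on a few rows copied from the listing. The rule it applies (a backup's name starts with the source image name and contains "backup") is an assumption based on the names shown here, not a BCM convention, and both function names are hypothetical.

```python
# A few rows copied from the softwareimage listing above:
# name, path, kernel version (may be absent), node count.
SAMPLE = """\
NMX-M-image /cm/images/NMX-M-image 6.8.0-51-generic 3
baseos7-image-arm64 /cm/images/baseos7-image-arm64 6.8.0-1021-nvidia-64k 0
baseos7-image-arm64-03-06-2025 /cm/images/baseos7-image-arm64-03-06-2025 6.8.0-1021-nvidia-64k 72
baseos7-image-arm64-03-06-2025-backup /cm/images/baseos7-image-arm64-03-06-2025-backup 0
hwqa-570.133.20-image-20250422 /cm/images/hwqa-570.133.20-image-20250422 6.8.0-1021-nvidia-64k 18
"""


def parse_images(listing: str) -> list[dict]:
    """Parse listing rows into dicts; rows may lack a kernel version."""
    images = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows have 3 or 4 fields and end with a numeric node count.
        if len(fields) not in (3, 4) or not fields[-1].isdigit():
            continue
        images.append({
            "name": fields[0],
            "path": fields[1],
            "kernel": fields[2] if len(fields) == 4 else None,
            "nodes": int(fields[-1]),
        })
    return images


def in_use_without_backup(images: list[dict]) -> list[str]:
    """Return in-use images with no image named '<name>...backup...'."""
    names = [img["name"] for img in images]
    missing = []
    for img in images:
        if img["nodes"] == 0:
            continue  # only images currently used by nodes are checked
        has_backup = any(
            other != img["name"]
            and other.startswith(img["name"])
            and "backup" in other
            for other in names
        )
        if not has_backup:
            missing.append(img["name"])
    return missing


images = parse_images(SAMPLE)
print(in_use_without_backup(images))
```

Because the check relies purely on naming, a backup stored under an unrelated name would not be detected.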
Category for DGX Compute Nodes Using BaseOS#
This section assumes you have access to the NMC BaseOS software image for DGX nodes.
Here is how you can create a category and associate a software image with it:
[bcm-headnode-01]% category
[bcm-headnode-01->category]% add dgx-gb200
[bcm-headnode-01->category*[dgx-gb200*]]% set softwareimage baseos7-image-arm64
[bcm-headnode-01->category*[dgx-gb200*]]% commit
The new dgx-gb200 category now has the baseos7-image-arm64 software image associated with it. All nodes in this category will boot into this software image.
The deployment process for nodes revolves around PXE boot, which allows compute nodes to boot over the network instead of from local disks. It is important to configure this properly for the category.
Here is an example of adjusting properties associated with PXE at the category level:
[bcm-headnode-01->category*[dgx-gb200*]]% set kerneloutputconsole tty0
[bcm-headnode-01->category*[dgx-gb200*]]% set bootloaderprotocol http
[bcm-headnode-01->category*[dgx-gb200*]]% set bootloader grub
[bcm-headnode-01->category*[dgx-gb200*]]% commit
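When the same category setup has to be repeated across clusters, the commands above can be generated from data rather than typed by hand. The Python sketch below emits the same add, set, and commit sequence shown in this section. The function name is hypothetical, and feeding the result to cmsh from a file (for example with cmsh -f) is an assumption to confirm for your installation.

```python
def category_setup_commands(category: str, image: str,
                            properties: dict[str, str]) -> list[str]:
    """Return the cmsh commands that create and configure a category."""
    commands = [
        "category",
        f"add {category}",
        f"set softwareimage {image}",
    ]
    # One 'set' per property, in the order given.
    commands += [f"set {key} {value}" for key, value in properties.items()]
    commands.append("commit")
    return commands


# PXE-related properties taken from the example above.
pxe_properties = {
    "kerneloutputconsole": "tty0",
    "bootloaderprotocol": "http",
    "bootloader": "grub",
}

commands = category_setup_commands(
    "dgx-gb200", "baseos7-image-arm64", pxe_properties
)

# Print one command per line so the script can be reviewed before use.
print("\n".join(commands))
```

Keeping the properties in a dictionary makes it easy to compare the intended settings of two categories before committing anything.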
Here is what you might see when you show the dgx-gb200 category:
[bcm-headnode-01->category[dgx-gb200]]% show
Parameter Value
-------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------
Name dgx-gb200
Nodes 72
Revision
Use exclusively for
User node login ALWAYS
Data node no
Allow networking restart no
Default gateway 7.241.16.1 (network: internalnet)
Default gateway metric 0
Slurm Roles and Configuration Overlays#
In this section, we will talk about categories and roles in the context of setting up compute nodes for Slurm.
By default, the cm-wlm-setup tool creates some configuration overlays and assigns roles to them accordingly. The following shows what cm-wlm-setup provides for Slurm:

In this example, you can see that the default category takes on the slurmclient and slurmsubmit roles (far right column) by being associated with the slurm-client and slurm-submit configuration overlays.
The wlm-headnode-submit configuration overlay is a special overlay. It is applied only to the head node and enables the head node to have the submit role for a given workload manager.
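The relationship between categories, configuration overlays, and roles can be modeled in a few lines. The Python sketch below is a simplified illustration, not BCM's implementation: the Overlay class and its head_node_only flag are hypothetical, only the three overlays named in this section are included, and the role attached to wlm-headnode-submit is assumed to be slurmsubmit.

```python
from dataclasses import dataclass, field


@dataclass
class Overlay:
    """Simplified stand-in for a configuration overlay."""
    name: str
    roles: list[str]
    categories: list[str] = field(default_factory=list)
    head_node_only: bool = False  # hypothetical flag for this sketch


OVERLAYS = [
    Overlay("slurm-client", ["slurmclient"], categories=["default"]),
    Overlay("slurm-submit", ["slurmsubmit"], categories=["default"]),
    # Assumed role; applies to the head node, not to any category.
    Overlay("wlm-headnode-submit", ["slurmsubmit"], head_node_only=True),
]


def roles_for_category(category: str, overlays: list[Overlay]) -> list[str]:
    """Collect the roles a category inherits from its overlays."""
    roles = []
    for overlay in overlays:
        if category not in overlay.categories:
            continue
        for role in overlay.roles:
            if role not in roles:  # avoid listing a role twice
                roles.append(role)
    return roles


print(roles_for_category("default", OVERLAYS))    # ['slurmclient', 'slurmsubmit']
print(roles_for_category("dgx-gb200", OVERLAYS))  # []
```

In this model a category such as dgx-gb200 has no Slurm roles until it is added to the relevant overlays, which is why a new category must be associated with them explicitly.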
Integrating Categories with Slurm Roles and Configuration Overlays#
Here we will go through the basics of associating the dgx-gb200 category with the proper Slurm configuration overlays and roles.
Enter cmsh -> category and use the dgx-gb200 category:
[bcm-headnode-01]% category
[bcm-headnode-01->category]% use dgx-gb200
Enable the configuration overlays for slurm-client and slurm-submit:
[bcm-headnode-01->category[dgx-gb200]]% configurationoverlay
[bcm-headnode-01->category[dgx-gb200]->configurationoverlay]% set slurm-client
[bcm-headnode-01->category[dgx-gb200]->configurationoverlay*]% set slurm-submit
[bcm-headnode-01->category[dgx-gb200]->configurationoverlay*]% commit
This will associate the dgx-gb200 category with both the slurm-client and slurm-submit configuration overlays, giving nodes in this category the necessary Slurm roles.