Adding and Removing Nodes from Run:ai or Slurm
Overview
These instructions cover adding compute nodes to, and removing them from, an existing workload manager installation (either Run:ai or Slurm) within an NVIDIA Mission Control environment.
Prerequisites
BCM node categories make this straightforward: you change a compute node's category and reboot the node. Make sure your categories are configured correctly before you begin.
Basic requirements:
The cluster nodes are managed by BCM
Categories have been established for compute nodes to be a part of Run:ai and/or Slurm
Please refer to Node and Category Management and NVIDIA Run:ai Installation for more information on setting up node categories.
Adding or removing a compute node from a Run:ai cluster
This assumes you already have Run:ai deployed and running.
These steps apply when adding a new compute node, moving a compute node from Slurm to Run:ai, or removing a node from an existing Run:ai cluster.
Add a compute node
Add the compute node to the proper Run:ai GPU worker category (e.g. “dgx-gb200-k8s”).
Reboot the compute node (wait for it to come back online).
After adding a node, it should appear in the Run:ai control plane as an available worker node. If you moved this compute node from an existing Slurm cluster, no further action is needed; Slurm will automatically remove this node from the relevant partitions.
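The two steps above can be sketched as a small shell helper. This is a minimal sketch, not an official procedure: the `cmsh -c` invocations are an assumed typical form for setting a category and rebooting a device, and `node001`, `run`, and `move_node_to_category` are hypothetical names introduced for illustration. By default the script only prints the commands it would run, so it can be reviewed safely before use on the head node.

```shell
#!/bin/sh
# Sketch: change a node's BCM category, then reboot it.
# ASSUMPTION: the cmsh command strings below are a typical form, not taken
# from this document; verify them against your BCM version.
# DRY_RUN=1 (default) prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command line (argument quoting is not preserved in the printout).
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

move_node_to_category() {
  # $1 = node hostname (placeholder), $2 = target category
  run cmsh -c "device use $1; set category $2; commit"
  run cmsh -c "device use $1; reboot"
}

# Example: place the hypothetical node001 in the Run:ai GPU worker category.
move_node_to_category node001 dgx-gb200-k8s
```

Setting `DRY_RUN=0` would execute the commands rather than print them. The same helper fits any category change, since only the category name differs.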
Remove a compute node
Change the compute node's category to any non-Run:ai category.
Reboot the compute node (wait for it to come back online).
After you remove a node, it should automatically disappear from the Run:ai control plane.
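One way to confirm the result from the command line is to check the node list of the Kubernetes cluster that backs Run:ai. The sketch below assumes `kubectl` access to that cluster; `node_in_list` is a hypothetical helper, and the sample text stands in for real `kubectl get nodes -o name` output, which prints one `node/<hostname>` entry per line.

```shell
#!/bin/sh
# Sketch: test whether a hostname appears in `kubectl get nodes -o name` output.
# node_in_list is a hypothetical helper; it reads the node list on stdin.
node_in_list() {
  # -x matches the whole line, -F treats the pattern as a fixed string
  grep -qxF "node/$1"
}

# Illustrative sample standing in for: kubectl get nodes -o name
sample='node/node001
node/node002'

if printf '%s\n' "$sample" | node_in_list node003; then
  echo "node003 is still registered"
else
  echo "node003 is not registered"
fi
```

On a live cluster the equivalent check would be `kubectl get nodes -o name | node_in_list <node hostname>`.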
Adding or removing a compute node from a Slurm cluster
This assumes you already have Slurm deployed and running.
These steps apply when adding a new compute node, moving a compute node from Run:ai to Slurm, or removing a node from an existing Slurm cluster.
Add a compute node
Add the compute node to the proper Slurm category (e.g. “dgx-gb200”).
Reboot the compute node (wait for it to come back online).
After adding a node, it should appear in Slurm as an available node. If you moved this compute node from an existing Run:ai cluster, no further action is needed; Run:ai will automatically remove this node from its worker pool.
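A quick way to check that the node is available is to look at its state in node-oriented `sinfo` output. The following is a sketch under stated assumptions: `node_is_schedulable` is a hypothetical helper, the sample lines stand in for the output of `sinfo -h -N -o '%N %T'`, and treating only `idle`, `mixed`, and `allocated` as healthy is an illustrative policy rather than a rule from this document.

```shell
#!/bin/sh
# Sketch: decide whether a node shows a healthy state in sinfo output.
# Input on stdin: lines of "<hostname> <state>", as printed by
#   sinfo -h -N -o '%N %T'
# node_is_schedulable is a hypothetical helper. States carrying a suffix such
# as "*" (node not responding) are deliberately not accepted.
node_is_schedulable() {
  awk -v n="$1" '
    $1 == n && $2 ~ /^(idle|mixed|allocated)$/ { ok = 1 }
    END { exit !ok }
  '
}

# Illustrative sample standing in for real sinfo output
sample='node001 idle
node002 down*'

for host in node001 node002; do
  if printf '%s\n' "$sample" | node_is_schedulable "$host"; then
    echo "$host: available"
  else
    echo "$host: not available"
  fi
done
```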
Remove a compute node
Change the compute node's category to any non-Slurm category.
Reboot the compute node (wait for it to come back online).
After removing a node, it should no longer appear in Slurm.
To check whether a compute node is currently part of the Slurm cluster, run sinfo or scontrol show node <node hostname>
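For scripted checks, the `State=` field in `scontrol show node` output can be extracted directly. This is a minimal sketch: `node_state` is a hypothetical helper, and the sample text is a trimmed, illustrative stand-in for real `scontrol` output, which contains many more fields.

```shell
#!/bin/sh
# Sketch: pull the State= value out of `scontrol show node <hostname>` output.
# node_state is a hypothetical helper; it reads the scontrol output on stdin
# and prints the first State= value it finds (nothing if there is none).
node_state() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^State=/) { print substr($i, 7); exit }
  }'
}

# Trimmed, illustrative sample standing in for: scontrol show node node001
sample='NodeName=node001 Arch=aarch64 CoresPerSocket=36
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1'

printf '%s\n' "$sample" | node_state
```

On a live system this would be used as `scontrol show node <node hostname> | node_state`. An empty result means no `State=` field was found, which is what you would expect for a node that Slurm does not know about.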