Adding and Removing Nodes from Run:ai or Slurm#

Overview#

These instructions cover adding compute nodes to, and removing them from, an existing workload manager installation (either Run:ai or Slurm) within an NVIDIA Mission Control environment.

Prerequisites#

BCM node categories make this straightforward: to move a compute node, you change the node's category and reboot it. Make sure the required categories are properly configured before you begin.

Basic requirements:

  • The cluster nodes are managed by BCM

  • Categories have been established for compute nodes to be a part of Run:ai and/or Slurm

Please refer to Node and Category Management and NVIDIA Run:ai Installation for more information on setting up node categories.
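Before moving a node, it can help to confirm that the target category actually exists. The following is a minimal sketch, not an official procedure: it assumes cmsh is available on the BCM head node and that `cmsh -c "category; list"` prints the category name in the first column. The helper name `category_exists` is hypothetical.

```shell
# Hypothetical helper: reads the output of `cmsh -c "category; list"` on stdin
# and succeeds only if a row whose first column equals the given name exists.
category_exists() {
    local category="$1"
    awk -v c="$category" '$1 == c { found = 1 } END { exit !found }'
}

# Usage on the head node (category name is an example from this page):
#   cmsh -c "category; list" | category_exists dgx-gb200-k8s && echo "category present"
```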

Adding or removing a compute node from a Run:ai cluster#

This assumes you already have Run:ai deployed and running.

These steps apply when adding a new compute node, moving a compute node from Slurm to Run:ai, or removing a node from an existing Run:ai cluster.

Add a compute node#

  1. Add the compute node to the proper Run:ai GPU worker category (e.g. “dgx-gb200-k8s”).

  2. Reboot the compute node (wait for it to come back online).

After adding a node, it should appear in the Run:ai control plane as an available worker node. If you moved this compute node from an existing Slurm cluster, no further action is needed; Slurm will automatically remove this node from the relevant partitions.
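The two steps above can be scripted with cmsh. This is a sketch under stated assumptions: the helper names `category_cmd` and `move_node` are hypothetical, the node name `dgx-001` is a placeholder, and the `reboot -n` form in cmsh device mode should be checked against the BCM version in use.

```shell
# Hypothetical helper: print the cmsh command string that assigns a node
# to a category and commits the change.
category_cmd() {
    local node="$1"
    local category="$2"
    printf 'device; use %s; set category %s; commit' "$node" "$category"
}

# Hypothetical helper: switch the category, then reboot the node through cmsh.
# Assumes cmsh is on PATH on the BCM head node.
move_node() {
    local node="$1"
    local category="$2"
    cmsh -c "$(category_cmd "$node" "$category")" &&
        cmsh -c "device; reboot -n $node"
}

# Example (placeholder node name, example category from this page):
#   move_node dgx-001 dgx-gb200-k8s
```

The same helper covers removal and moves between workload managers, since every case reduces to assigning a different category and rebooting.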

Remove a compute node#

  1. Change the compute node's category to any non-Run:ai category.

  2. Reboot the compute node (wait for it to come back online).

After the reboot, the node should automatically disappear from the Run:ai control plane.
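Because Run:ai runs on Kubernetes, one way to cross-check membership from the command line is to look at the Kubernetes node list. The sketch below assumes kubectl is configured for the Run:ai cluster and that the Kubernetes node name matches the BCM hostname; both are assumptions, and `node_in_k8s` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl get nodes -o name` output on stdin
# (one "node/<name>" entry per line) and succeeds if the node is listed.
node_in_k8s() {
    local node="$1"
    grep -Fxq "node/$node"
}

# Usage (placeholder node name):
#   if kubectl get nodes -o name | node_in_k8s dgx-001; then
#       echo "dgx-001 is still registered"
#   else
#       echo "dgx-001 is no longer registered"
#   fi
```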

Adding or removing a compute node from a Slurm cluster#

This assumes you already have Slurm deployed and running.

These steps apply when adding a new compute node, moving a compute node from Run:ai to Slurm, or removing a node from an existing Slurm cluster.

Add a compute node#

  1. Add the compute node to the proper Slurm category (e.g. “dgx-gb200”).

  2. Reboot the compute node (wait for it to come back online).

After adding a node, it should appear in Slurm as an available node. If you moved this compute node from an existing Run:ai cluster, no further action is needed; Run:ai will automatically remove this node from its worker pool.
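The "wait for it to come back online" part of step 2 can be automated with a small polling loop. This is a generic sketch rather than a BCM feature: `wait_until` is a hypothetical helper that retries an arbitrary command, and the choice of ping as the readiness check in the usage comment is an assumption.

```shell
# Hypothetical helper: run a command up to <attempts> times, sleeping <delay>
# seconds after each failure. Succeeds as soon as the command succeeds.
wait_until() {
    local attempts="$1"
    local delay="$2"
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Usage (placeholder node name): poll every 10 seconds for up to 10 minutes.
#   wait_until 60 10 ping -c 1 dgx-001 && echo "dgx-001 is reachable again"
```

A node that answers ping has not necessarily finished provisioning, so a workload-manager check is still worth running once the loop succeeds.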

Remove a compute node#

  1. Change the compute node's category to any non-Slurm category.

  2. Reboot the compute node (wait for it to come back online).

After removing a node, it should no longer appear in Slurm.

To confirm whether a compute node is currently part of the Slurm cluster, run sinfo or scontrol show node <node hostname>. An added node should be listed; a removed node should not.
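The sinfo check can also be scripted. The sketch below uses sinfo's node-oriented listing (`-N`), suppresses the header (`-h`), and prints the node name and state (`-o "%N %T"`); the helper name `slurm_node_state` and the node name `dgx-001` are placeholders introduced for illustration.

```shell
# Hypothetical helper: reads `sinfo -h -N -o "%N %T"` output on stdin and
# prints the state of the given node. Prints nothing if the node is absent.
# A node in several partitions is listed once per partition, so stop at the
# first match.
slurm_node_state() {
    local node="$1"
    awk -v n="$node" '$1 == n { print $2; exit }'
}

# Usage (placeholder node name):
#   state="$(sinfo -h -N -o "%N %T" | slurm_node_state dgx-001)"
#   if [ -n "$state" ]; then
#       echo "dgx-001 is in Slurm with state: $state"
#   else
#       echo "dgx-001 is not in Slurm"
#   fi
```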