Adding and Removing Nodes from Run:ai or Slurm#
Overview#
These instructions cover the steps for adding and removing compute nodes from existing workload manager installations, either Run:ai or Slurm, within an NVIDIA Mission Control environment.
Prerequisites#
Node categories in BCM make this process straightforward: essentially, all you do is switch a compute node's category and reboot the node. Be sure that categories are properly configured before proceeding.
Basic requirements:
The cluster nodes are managed by BCM
Categories have been established for the compute nodes that will be part of Run:ai and/or Slurm
Please refer to Node and Category Management and NVIDIA Run:ai Installation for more information on setting up node categories.
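For reference, switching a node's category is done from cmsh on the BCM head node. The command below is a minimal sketch; the node name (b07-p1-dgx-07-c02) and category name (dgx-gb200-k8s) are examples and should be replaced with values from your own cluster:
cmsh -c "device use b07-p1-dgx-07-c02; set category dgx-gb200-k8s; commit"
The same change can be made interactively in cmsh (device mode, use the node, set category, commit) or through Base View. After committing the change, reboot the node as described in the sections below.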
Adding a compute node to an existing Run:ai cluster#
This assumes you already have Run:ai deployed and running.
These steps apply both when adding a new compute node to the cluster and when moving a compute node from Slurm to Run:ai.
Add the compute node to the proper Run:ai GPU worker category (e.g. “dgx-gb200-k8s”).
Reboot the compute node (wait for it to come back online)
On the BCM head node, execute the provided helper script to fully configure the Run:ai GPU worker node and add it to the proper Kubernetes cluster.
/cm/local/apps/cmd/scripts/cm-kubeadm-manage-joins --kube-cluster <cluster-name> --add-node <compute hostname>
Here is a complete, real example of running the command and the expected output:
root@a03-p1-head-01:~# /cm/local/apps/cmd/scripts/cm-kubeadm-manage-joins --kube-cluster runai-col5 --add-node b07-p1-dgx-07-c02
2025-10-15 10:42:07,579 - cm-kubeadm-manage-joins - INFO - Preparing CA certificate for cluster 'runai-col5' on node 'b07-p1-dgx-07-c02'...
2025-10-15 10:42:07,639 - cm-kubeadm-manage-joins - DEBUG - Executing: ssh b07-p1-dgx-07-c02 mkdir -p /etc/kubernetes/pki/runai-col5
2025-10-15 10:42:08,219 - cm-kubeadm-manage-joins - DEBUG - Executing: scp /tmp/tmpwclwyxn6 b07-p1-dgx-07-c02:/etc/kubernetes/pki/runai-col5/ca.crt
2025-10-15 10:42:08,628 - cm-kubeadm-manage-joins - DEBUG - Executing: ssh b07-p1-dgx-07-c02 chmod 644 /etc/kubernetes/pki/runai-col5/ca.crt
2025-10-15 10:42:09,099 - cm-kubeadm-manage-joins - INFO - CA certificate successfully written to /etc/kubernetes/pki/runai-col5/ca.crt on node b07-p1-dgx-07-c02.
2025-10-15 10:42:09,100 - cm-kubeadm-manage-joins - INFO - Adding node b07-p1-dgx-07-c02 to the Kubernetes cluster...
2025-10-15 10:42:09,100 - cm-kubeadm-manage-joins - DEBUG - Executing: ssh b07-p1-dgx-07-c02 mkdir -p /root/.kube
2025-10-15 10:42:11,184 - cm-kubeadm-manage-joins - INFO - Executing join command on b07-p1-dgx-07-c02 and saving output to /root/.kube/kubeadm-join.out...
2025-10-15 10:42:11,185 - cm-kubeadm-manage-joins - DEBUG - Executing: ssh b07-p1-dgx-07-c02 bash -c 'kubeadm join 127.0.0.1:10444 --token 3fzmuk.8bukym6wjlq259l5 --discovery-token-ca-cert-hash sha256:1b6c808873087c9288b1aceb0f69f18397d886962698d9215fcc932f4648e229 |& tee -a /root/.kube/kubeadm-join.out'
2025-10-15 10:42:12,602 - cm-kubeadm-manage-joins - INFO - [preflight] Running pre-flight checks
[preflight] Reading configuration from the "kubeadm-config" ConfigMap in namespace "kube-system"...
[preflight] Use 'kubeadm init phase upload-config --config your-config.yaml' to re-upload it.
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s
[kubelet-check] The kubelet is healthy after 501.404579ms
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
At this point, the compute node has been added to Run:ai and should be visible to the Run:ai control plane as an available worker node.
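To confirm from the Kubernetes side, run kubectl get nodes against the Run:ai cluster (the hostname below comes from the example above; adjust for your environment and kubeconfig):
kubectl get nodes b07-p1-dgx-07-c02
The node should report a STATUS of Ready once the kubelet has completed its TLS bootstrap.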
If you moved this compute node from an existing Slurm cluster, no further action is needed. Slurm will automatically remove this node from the relevant partitions.
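Optionally, you can confirm the removal on the Slurm controller. As a quick sketch (the hostname is the example node from above), the node should no longer appear in the node-oriented sinfo output:
sinfo -N | grep b07-p1-dgx-07-c02
No output indicates the node is no longer assigned to any Slurm partition.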
Adding a compute node to an existing Slurm cluster#
This assumes you already have Slurm deployed and running.
These steps apply both when adding a new compute node to the cluster and when moving a compute node from Run:ai to Slurm.
Add the compute node to the proper Slurm category (e.g. “dgx-gb200”)
Reboot the compute node (wait for it to come back online)
That is all that is required. BCM’s category management handles the heavy lifting of configuring the node and adding it to Slurm. Additionally, if the node was previously part of a Run:ai cluster, it will automatically be removed from Run:ai.
Verify that the compute node appears in the Slurm cluster by running sinfo or scontrol show node <node hostname>.
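For example, using the hostname from the earlier output (replace with your own node):
sinfo
scontrol show node b07-p1-dgx-07-c02
The node should appear in the expected partition and report a healthy state (for example, idle) rather than down or drained.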