NVIDIA Mission Control Upgrade From 2.0 to 2.1#

Overview#

NVIDIA Mission Control (NMC) 2.1 makes no changes to the Control-Plane architecture introduced in version 2.0.

For GB200, it has an Admin Kubernetes Cluster (NMC Mgmt Nodes) and a User Kubernetes Cluster (User Mgmt Nodes). For B200, it has a single Kubernetes Cluster (NMC Mgmt Nodes). However, for both systems NMC 2.1 brings new Kubernetes versions (v1.33.0 and v1.34.0).

For this reason, there is no clean upgrade path between those versions. To “upgrade” from 2.0 to 2.1, the simplified steps are:

  1. Back up any necessary data and configuration files.

  2. Reset the current cluster (i.e., remove the Admin Cluster and User Cluster)

  3. Finally, follow the NVIDIA Mission Control Software Installation Guide 2.1.

Backing up necessary data and configuration files#

Before performing a destructive operation on the cluster, ensure that you back up any important data and configuration files from your NVIDIA Mission Control cluster. Examples of information you might want to back up include:

  • Prometheus data for the metrics

  • Any custom Grafana dashboards

  • Any custom AHR playbooks (GB200 only)

  • Any data that might be on Kubernetes volumes (e.g., the local storage path on your BCM installation)
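
As one illustration, data on a local storage path can be archived before the reset. This is a hedged sketch only; `BACKUP_SRC` and `BACKUP_DEST` below are placeholder paths, not NMC defaults, and must be adjusted to your installation:

```shell
# Hedged sketch: archive a local storage path before resetting the clusters.
# BACKUP_SRC and BACKUP_DEST are placeholders, not NMC defaults.
BACKUP_SRC="${BACKUP_SRC:-/var/lib/nmc-local-storage}"
BACKUP_DEST="${BACKUP_DEST:-/tmp/nmc-2.0-backup}"

mkdir -p "$BACKUP_DEST"
if [ -d "$BACKUP_SRC" ]; then
    # Archive the directory so it can be restored after the 2.1 install
    tar -czf "$BACKUP_DEST/local-storage-$(date +%Y%m%d).tar.gz" \
        -C "$(dirname "$BACKUP_SRC")" "$(basename "$BACKUP_SRC")"
    echo "backup written to $BACKUP_DEST"
else
    echo "source path $BACKUP_SRC not found; nothing to archive"
fi
```

Copy the resulting archive off the cluster (or to the headnode) before removing the Kubernetes clusters.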

Additionally, you should confirm that you are not using any of the features that are deprecated in Kubernetes v1.34.0. The list of features deprecated in v1.34.0 can be found in the Kubernetes deprecated features documentation.

  1. Find Deprecated API Usage: Use kubectl to scan for deprecated resources, client warnings, or API usage:

$ kubectl get --raw="/metrics" | grep deprecated
# Lists any deprecated metrics; review each match
$ kubectl logs -n kube-system -l k8s-app=kube-apiserver | grep deprecated
# Must return nothing
  2. Check for AppArmor Profiles:

$ kubectl get pods --all-namespaces -o yaml | grep apparmor
# Should return nothing
  3. Detect Deprecated Traffic Distribution:

$ kubectl get svc --all-namespaces -o yaml | grep PreferClose
# Should return nothing
  4. Check the containerd version:

$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
# Should not return a containerd v1.x.x runtime
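
The containerd check above can be hard to eyeball on a large cluster. As a hedged sketch (the helper name is hypothetical, not part of NMC), the runtime strings reported by kubectl can be classified in shell:

```shell
# Hypothetical helper: flag a node whose container runtime is still containerd 1.x.
# Takes a containerRuntimeVersion string as reported by kubectl, e.g. "containerd://1.7.27".
runtime_needs_upgrade() {
    case "$1" in
        containerd://1.*) return 0 ;;  # 1.x runtime: flag this node
        *) return 1 ;;                 # anything else passes this check
    esac
}

runtime_needs_upgrade "containerd://1.7.27" && echo "node needs a containerd upgrade"
runtime_needs_upgrade "containerd://2.0.4" || echo "node runtime is fine"
```

To apply it cluster-wide, pipe the jsonpath output from the previous step through a loop such as `while read node rt; do runtime_needs_upgrade "$rt" && echo "$node"; done`.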

Resetting The Current Clusters#

When first creating an NMC cluster during the Rack Bring-Up Guide, you create BCM node categories and associate them with nodes. In BCM, a category ultimately translates to the host image/OS and configuration installed on a node.

Resetting the current clusters ultimately means removing the old clusters from the NMC 2.0 installation. There is no need to create new categories or software images if the headnode is not being upgraded to 2.1.

For NMC 2.1 the categories remain the same as in 2.0, but you will need to create a new software image for the new Kubernetes version:

  • slogin (for Slurm Login nodes)

  • k8s-system-user (for the User Kubernetes Cluster)

  • k8s-system-admin (for the Kubernetes Admin Cluster)

  • dgx-gb200 (for the Slurm Compute nodes in GB200)

  • dgx-b200 (for the Slurm Compute nodes in B200)

  • runai-compute (for Run:ai Compute nodes)

If the headnode is being upgraded to 2.1, you will need to follow the Rack Bring-Up Install Guide, more specifically the section Configuration for Node Provisioning.

To reset the current clusters, follow the steps below:

Uninstall Admin and User Kubernetes Clusters#

Let’s start by gracefully removing the NMC 2.0 Admin and User Kubernetes Clusters configuration:

Run the following commands:

$ cm-kubernetes-setup --remove --yes-i-really-mean-it --name k8s-user
$ cm-kubernetes-setup --remove --yes-i-really-mean-it --name k8s-admin

Afterwards, list the available modules. You should NOT see the following modules:

$ module avail
------------------------------------------------------------------------------------------- /cm/local/modulefiles -------------------------------------------------------------------------------------------
kubernetes/k8s-admin/1.32.7-1.1  kubernetes/k8s-user/1.32.7-1.1
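
As a hedged sketch for scripting that verification (the function name is hypothetical), the output of `module avail` can be checked for leftover Kubernetes modulefiles:

```shell
# Hypothetical check: succeed only when no kubernetes/k8s-* modulefiles remain.
modules_clean() {
    ! printf '%s\n' "$1" | grep -q 'kubernetes/k8s-'
}

modules_clean "openmpi/4.1.5  slurm/23.02" && echo "clean"
modules_clean "kubernetes/k8s-admin/1.32.7-1.1" || echo "clusters not fully removed"
```

On the headnode this would be invoked as `modules_clean "$(module avail 2>&1)"` (note that `module avail` writes its listing to stderr).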

NVIDIA Mission Control Software Installation Guide 2.1#

With the clusters removed, the next step is simply to follow the NVIDIA Mission Control Software Installation Guide as usual.

During the Workload Manager (WLM) and Kubernetes setup steps, the wizards might complain about previously installed versions, but you can request to start a new installation.