NVIDIA Mission Control Upgrade From 2.1 to 2.2#

Overview#

NVIDIA Mission Control (NMC) 2.2 introduces changes to the control plane architecture for B200 compared to version 2.1.

For B200, the architecture now includes an Admin Kubernetes Cluster (NMC Mgmt Nodes), requiring an additional 3 control nodes. Deploy the additional control nodes per the existing Deployment Guide for DGX B200 Systems with NVIDIA Mission Control. Refer to the NVIDIA Mission Control Software Installation Guide for 2.2 to complete the installation. The existing Kubernetes cluster becomes the User Kubernetes Cluster (User Mgmt Nodes). Refer to the NVIDIA Run:ai Node Categories section of the NMC 2.2 installation guide for BCM node category naming guidance for Run:ai.

For GB200 and GB300, the architecture includes an Admin Kubernetes Cluster (NMC Mgmt Nodes) and a User Kubernetes Cluster (User Mgmt Nodes). For B300, there is a single Kubernetes cluster running Kubernetes version 1.34.

There is no in-place upgrade path between component versions. To upgrade components from 2.1 to 2.2, complete the following steps:

  1. Back up necessary data and configuration files.

  2. Reset the current cluster by removing the Admin Cluster and User Cluster.

  3. Follow the NVIDIA Mission Control Software Installation Guide for version 2.2 to install the components.

Backing Up Necessary Data and Configuration Files#

Before performing a destructive operation on the cluster, back up any important data and configuration files from your NVIDIA Mission Control cluster. Examples of data you should back up include:

  • Prometheus data for the metrics

  • Any custom Grafana dashboards

  • Any custom AHR playbooks (GB200 only)

  • Any data that might be on Kubernetes volumes (for example, the local storage path on your BCM installation)
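The backup targets above can be sketched as a short script. This is a hedged sketch, not a definitive procedure: the namespace (`monitoring`), the `grafana_dashboard=1` label, and the `prometheus-k8s-0` pod name are assumptions that will differ per site, and the Prometheus snapshot API only works when Prometheus runs with `--web.enable-admin-api`.

```shell
#!/bin/sh
# Hedged backup sketch -- namespace, label, and pod name below are
# assumptions; adjust them to match your deployment before running.
BACKUP_DIR=./nmc-2.1-backup
mkdir -p "$BACKUP_DIR"

if command -v kubectl >/dev/null 2>&1; then
    # Export custom Grafana dashboards stored as ConfigMaps
    # (the grafana_dashboard=1 label is an assumption)
    kubectl get configmaps -A -l grafana_dashboard=1 -o yaml \
        > "$BACKUP_DIR/grafana-dashboards.yaml"

    # Trigger a Prometheus TSDB snapshot (pod name is an assumption;
    # requires Prometheus to run with --web.enable-admin-api)
    kubectl exec -n monitoring prometheus-k8s-0 -- \
        curl -s -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
else
    echo "kubectl not found; run this on a node with cluster access" >&2
fi
```

Copy the resulting files, along with any data on Kubernetes volumes, off the cluster before proceeding.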

Additionally, confirm that you are not using any features that are deprecated in Kubernetes v1.34.0. For a list of deprecated features, refer to Deprecations and Removals in the Kubernetes v1.34.0 release documentation.

  1. Find deprecated API usage. Use kubectl to scan for deprecated resources, client warnings, or API usage:

$ kubectl get --raw="/metrics" | grep deprecated
# Lists deprecated APIs that are still in use
$ kubectl logs -n kube-system -l k8s-app=kube-apiserver | grep deprecated
# Should return nothing

  2. Check for AppArmor profiles:

$ kubectl get pods --all-namespaces -o yaml | grep apparmor
# Should return nothing

  3. Detect deprecated traffic distribution:

$ kubectl get svc --all-namespaces -o yaml | grep PreferClose
# Should return nothing

  4. Verify the containerd version:

$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
# Should not return v1.x.x
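The containerd check can be wrapped in a small helper that flags any node still on a 1.x runtime. This is a convenience sketch: on a live cluster you would pipe the jsonpath output from the previous command into it, and the node names and versions in the sample input below are made up for illustration.

```shell
# Flags nodes whose container runtime is still containerd 1.x.
# Pipe the tab-separated jsonpath output from the previous command into it.
check_containerd() {
    while IFS=$'\t' read -r node runtime; do
        case "$runtime" in
            containerd://1.*) echo "UPGRADE NEEDED: $node ($runtime)" ;;
            *)                echo "OK: $node ($runtime)" ;;
        esac
    done
}

# Illustrative sample input (node names and versions are made up):
printf 'node-a\tcontainerd://1.7.27\nnode-b\tcontainerd://2.0.2\n' | check_containerd
```

Any node reported as `UPGRADE NEEDED` must have its container runtime updated before the 2.2 installation.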

Resetting the Current Clusters#

When you first create an NMC cluster by following the Rack Bring-Up Guide, you create BCM node categories and associate them with nodes. In BCM, a category determines the host image/OS and configuration installed on a node.

Resetting the current clusters means removing the old clusters from the NMC 2.1 installation. If you are not upgrading the headnode to 2.2, you do not need to create new categories or software images.

For NMC 2.2, the categories remain the same as in 2.1, but you must create a new software image for the new Kubernetes version. The categories are:

  • slogin (for Slurm Login nodes)

  • k8s-system-user (for the User Kubernetes Cluster)

  • k8s-system-admin (for the Kubernetes Admin Cluster)

  • dgx-gb200 (for the Slurm Compute nodes in GB200)

  • dgx-b200 (for the Slurm Compute nodes in B200)

  • runai-compute (for Run:ai Compute nodes)
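Creating the new software image is typically done by cloning an existing image in cmsh and then updating Kubernetes packages in the clone. The sketch below is hypothetical: the image names are assumptions, and the exact names depend on your installation.

```shell
# Hypothetical cmsh sketch -- image names are assumptions; adjust them
# to the software images actually present on your head node.
if command -v cmsh >/dev/null 2>&1; then
    cmsh -c "softwareimage; clone k8s-system-user-image k8s-system-user-k8s134-image; commit"
    status="cloned"
else
    status="cmsh not available; run this on the BCM head node"
fi
echo "$status"
```

After cloning, assign the new image to the corresponding category so that provisioned nodes receive it.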

If you are upgrading the headnode to 2.2, follow the Rack Bring-Up Install Guide, specifically the Configuration for Node Provisioning section.

To reset the current clusters, complete the following steps:

Uninstall Admin and User Kubernetes Clusters#

Gracefully remove the NMC 2.1 Admin and User Kubernetes cluster configuration:

Run the following commands:

$ cm-kubernetes-setup --remove --yes-i-really-mean-it --name k8s-system-user
$ cm-kubernetes-setup --remove --yes-i-really-mean-it --name k8s-system-admin

After the clusters are removed, verify the available modules by running the following command:

$ module avail
------------------------------------------------------------------------------------------- /cm/local/modulefiles -------------------------------------------------------------------------------------------
kubernetes/k8s-system-admin/1.34.3-1.1  kubernetes/k8s-system-user/1.34.3-1.1
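A quick grep can confirm that modules for the new Kubernetes version are visible. In this sketch, the sample string mirrors the `module avail` output above; on a live head node you would pipe `module avail 2>&1` into the loop instead.

```shell
# Check that both new Kubernetes modules are visible. The sample string
# below mirrors the `module avail` output; substitute real output on a
# head node.
modules='kubernetes/k8s-system-admin/1.34.3-1.1  kubernetes/k8s-system-user/1.34.3-1.1'

for m in k8s-system-admin k8s-system-user; do
    if printf '%s\n' "$modules" | grep -q "kubernetes/$m/1.34"; then
        echo "found: $m"
    else
        echo "MISSING: $m" >&2
    fi
done
```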

Install NVIDIA Mission Control 2.2#

After you remove the clusters, follow the NVIDIA Mission Control Software Installation Guide for version 2.2 to complete the installation.

During the Workload Manager (WLM) and Kubernetes setup, the wizards might display warnings about previously installed versions. You can proceed by choosing to start a new installation.