Installing RHEL OpenShift
This chapter describes additional steps that are required or recommended to install OpenShift and Red Hat CoreOS on DGX worker nodes.
Prerequisites
Here are the prerequisites for using RHEL OpenShift.
Red Hat Subscription
Installing and running OpenShift requires a Red Hat account and additional subscriptions. Please refer to Red Hat OpenShift for more information.
OpenShift 4.9.9 or later
Support for DGX systems was added in version 1.9 of the GPU Operator. The operator requires OpenShift 4.9.9 or later.
Helm Management Tool
Helm is used to install NVIDIA System Management (NVSM) on the DGX worker nodes. NVSM is a software framework for collecting health status information that helps users analyze hardware and software issues. Refer to Installing Helm for instructions on installing the Helm tool on the system you use to interact with the OpenShift cluster.
Installing Red Hat CoreOS
Installing OpenShift and Red Hat CoreOS on clusters with DGX worker nodes is the same as installing on other systems.
Follow the instructions described in Installing a cluster on bare metal or other methods to create an OpenShift cluster and to install Red Hat CoreOS on the nodes.
Installing the NVIDIA GPU Operator
The NVIDIA GPU Operator is required to manage and allocate GPU resources to workloads. It uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision the GPU. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Toolkit, and DCGM-based monitoring.
To install the Node Feature Discovery (NFD) Operator and NVIDIA GPU Operator, follow the instructions in the GPU Operator on OpenShift user guide. The NFD Operator manages the detection of hardware features and configurations in an OpenShift Container Platform cluster by labeling the nodes with hardware-specific information. These labels are required by the GPU Operator to identify machines with a valid GPU.
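Once NFD is running, GPU nodes carry PCI vendor feature labels (0x10de is NVIDIA's PCI vendor ID), and you can use a label selector to spot candidate GPU nodes. The following is a sketch: `oc` is stubbed with a shell function so it runs without a live cluster; remove the stub to run the query for real.

```shell
# Stub `oc` so this sketch runs without a live cluster; delete on a real system.
oc() { echo "oc $*"; }

# Nodes that NFD labeled with NVIDIA's PCI vendor ID (0x10de) are GPU candidates.
oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true
```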
Installing NVSM
NVIDIA System Management (NVSM) is a software framework for monitoring NVIDIA DGX nodes in a data center. It includes active health monitoring, system alerts, and log generation, and also supports a stand-alone mode from the command line to get a quick health report of the system. Running NVSM is typically requested by the NVIDIA Enterprise Support team to resolve a reported problem.
NVSM can be deployed on the DGX nodes with the NVIDIA System Management NGC Container. It allows users to execute the NVSM tool remotely and on-demand in the containers that are deployed to the DGX nodes.
The installation uses the NVSM Helm Chart to create the necessary resources on the cluster. The deployment is limited to systems that are manually labeled with nvidia.com/gpu.nvsm.deploy=true.
To deploy NVSM on the DGX worker nodes:
Issue the following command to get a list of all DGX nodes in the cluster:

    oc get nodes --show-labels | grep 'nvidia.com/gpu.machine=.*DGX[^,]*'
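The grep filter above can be exercised offline against sample output. In this sketch, the node names and label values (including the DGX machine label) are illustrative, not from a real cluster:

```shell
# Hypothetical sample resembling `oc get nodes --show-labels` output
# (node names and label values are illustrative).
cat > /tmp/nodes-demo.txt <<'EOF'
worker-0   Ready   worker   8h   v1.22.8   kubernetes.io/arch=amd64,nvidia.com/gpu.machine=DGX-A100,node-role.kubernetes.io/worker=
worker-1   Ready   worker   8h   v1.22.8   kubernetes.io/arch=amd64,node-role.kubernetes.io/worker=
EOF

# The filter keeps only lines whose gpu.machine label names a DGX system.
grep 'nvidia.com/gpu.machine=.*DGX[^,]*' /tmp/nodes-demo.txt
```

Only the worker-0 line survives the filter, since worker-1 carries no DGX machine label.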
Set the nvidia.com/gpu.nvsm.deploy=true flag on the DGX worker nodes on which you want to deploy NVSM (replace WORKER1 and so on with the actual names of the nodes):

    oc label node/WORKER1 nvidia.com/gpu.nvsm.deploy=true
    oc label node/WORKER2 nvidia.com/gpu.nvsm.deploy=true
    ...
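When many workers need the label, a loop saves repetition. This is a sketch with placeholder node names; `oc` is stubbed with a shell function so it runs offline, and the stub should be removed on a real cluster:

```shell
# Stub `oc` so this sketch runs without a live cluster; delete on a real system.
oc() { echo "oc $*"; }

# WORKER1/WORKER2 are placeholders for your actual node names.
for node in WORKER1 WORKER2; do
  oc label "node/$node" nvidia.com/gpu.nvsm.deploy=true
done
```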
Get the Helm chart for deploying NVSM on the cluster (replace <NGC_API_KEY> with your NGC API key):

    helm fetch https://helm.ngc.nvidia.com/nvidia/cloud-native/charts/nvsm-1.0.1.tgz --username='$oauthtoken' --password=<NGC_API_KEY>
Ensure the file is in your local directory:

    ls ./nvsm-1.0.1.tgz
To ensure the default settings are correct for your installation, inspect the contents of the values.yaml file in the downloaded tar file. If the settings do not match your cluster configuration, update the values.yaml file accordingly.
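One way to review the chart's default values is to print values.yaml straight from the archive with tar (Helm users can also run `helm show values ./nvsm-1.0.1.tgz`). The sketch below builds a tiny stand-in archive so the extraction step can be shown offline; the real chart from NGC will have different contents:

```shell
# Build a stand-in chart archive (the real nvsm-1.0.1.tgz comes from NGC;
# its contents may differ from this illustrative values.yaml).
mkdir -p /tmp/nvsm-demo/nvsm
printf 'platform:\n  openshift: false\n' > /tmp/nvsm-demo/nvsm/values.yaml
tar -czf /tmp/nvsm-demo/nvsm-1.0.1.tgz -C /tmp/nvsm-demo nvsm

# Print values.yaml to stdout without unpacking the rest of the archive.
tar -xzOf /tmp/nvsm-demo/nvsm-1.0.1.tgz nvsm/values.yaml
```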
Deploy NVSM to the cluster.

The cluster installs the container on all nodes that were labeled in the previous steps. The following command creates the nvidia-nvsm namespace and deploys the resources in that namespace:

    helm install --set platform.openshift=true --create-namespace -n nvidia-nvsm nvidia-nvsm ./nvsm-1.0.1.tgz
Validate that NVSM has been deployed on all selected DGX nodes. You should see an nvidia-nvsm-XXXX pod instance for each node:

    oc get pods -n nvidia-nvsm -o wide
    NAME                READY   STATUS    RESTARTS   AGE   IP            NODE       ...
    nvidia-nvsm-d9d9t   1/1     Running   1          8h    10.128.2.11   worker-0   ...
    nvidia-nvsm-tt8g5   1/1     Running   1          8h    10.131.0.11   worker-1   ...
NVSM is now installed and can be run remotely using oc exec.
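As a sketch of such a remote invocation, the following runs an NVSM health report inside one of the deployed pods (the pod name nvidia-nvsm-d9d9t is taken from the example output and will differ on your cluster). Here `oc` is stubbed with a shell function so the example runs offline; remove the stub on a real cluster:

```shell
# Stub `oc` so this sketch runs without a live cluster; delete on a real system.
oc() { echo "oc $*"; }

# Pod name is illustrative; substitute a pod from `oc get pods -n nvidia-nvsm`.
oc exec -n nvidia-nvsm nvidia-nvsm-d9d9t -- nvsm show health
```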