Installing RHEL OpenShift

This chapter describes additional steps that are required or recommended to install OpenShift and Red Hat CoreOS on DGX worker nodes.

  1. Installing Red Hat CoreOS.

  2. Installing the NFD Operator and NVIDIA GPU Operator.

  3. Installing and Using NVSM.

Prerequisites

The following prerequisites must be met before you install OpenShift on a cluster with DGX worker nodes.

  • Red Hat Subscription

    Installing and running OpenShift requires a Red Hat account and additional subscriptions. Refer to Red Hat OpenShift for more information.

  • OpenShift 4.9.9 or later

    Support for DGX systems was added in version 1.9 of the GPU Operator. The operator requires OpenShift 4.9.9 or later.

  • Helm Management Tool

    NVIDIA System Management (NVSM) is a software framework for collecting health status information that helps users analyze hardware and software issues. NVSM is installed on the DGX worker nodes by using Helm. Refer to Installing Helm for instructions on installing the Helm tool on the system that you use to interact with the OpenShift cluster. A quick check of both the OpenShift version and the Helm installation is shown after this list.
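
You can verify these prerequisites from the system that you use to manage the cluster. The following commands are only a quick check and are not part of the installation procedure:

    oc get clusterversion
    helm version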

Installing Red Hat CoreOS

Installing OpenShift and Red Hat CoreOS on clusters with DGX worker nodes is the same as installing on other systems.

Follow the instructions described in Installing a cluster on bare metal or other methods to create an OpenShift cluster and to install Red Hat CoreOS on the nodes.

Installing the NFD Operator and NVIDIA GPU Operator

The NVIDIA GPU Operator is required to manage and allocate GPU resources to workloads. It uses the operator framework within Kubernetes to automate the management of all NVIDIA software components that are needed to provision the GPU. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Toolkit, and DCGM-based monitoring.

To install the Node Feature Discovery (NFD) Operator and NVIDIA GPU Operator, follow the instructions in the GPU Operator on OpenShift user guide. The NFD Operator manages the detection of hardware features and configurations in an OpenShift Container Platform cluster by labeling the nodes with hardware-specific information. These labels are required by the GPU Operator to identify machines with a valid GPU.
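
After both operators are installed, you can confirm that NFD has labeled the nodes and that the GPU Operator pods are running. This is only a minimal check; the nvidia-gpu-operator namespace reflects the default suggested in the GPU Operator on OpenShift user guide, and <WORKER_NODE> is a placeholder for a node name in your cluster:

    oc describe node <WORKER_NODE> | grep feature.node.kubernetes.io
    oc get pods -n nvidia-gpu-operator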

Installing NVSM

NVIDIA System Management (NVSM) is a software framework for monitoring NVIDIA DGX nodes in a data center. It includes active health monitoring, system alerts, and log generation, and also supports a stand-alone mode from the command line to get a quick health report of the system. Running NVSM is typically requested by the NVIDIA Enterprise Support team to resolve a reported problem.

NVSM can be deployed on the DGX nodes with the NVIDIA System Management NGC Container. This allows users to run the NVSM tool remotely, on demand, in the containers that are deployed to the DGX nodes.

The installation uses the NVSM Helm Chart to create the necessary resources on the cluster. The deployment is limited to systems that are manually labeled with nvidia.com/gpu.nvsm.deploy=true.
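
Because the deployment is keyed to that label, you can check at any time which nodes NVSM will be (or has been) scheduled on by selecting on the label. This is an optional check, not part of the procedure below:

    oc get nodes -l nvidia.com/gpu.nvsm.deploy=true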

To deploy NVSM on the DGX worker nodes:

  1. Issue the following command to get a list of all DGX nodes in the cluster.

    oc get nodes --show-labels | grep 'nvidia.com/gpu.machine=.*DGX[^,]*'
    
  2. Set the nvidia.com/gpu.nvsm.deploy=true label on the DGX worker nodes on which you want to deploy NVSM (replace WORKER1 and so on with the actual names of the nodes).

    oc label node/WORKER1 nvidia.com/gpu.nvsm.deploy=true
    oc label node/WORKER2 nvidia.com/gpu.nvsm.deploy=true
    ...
    
  3. Get the Helm chart for deploying NVSM on the cluster (replace <NGC_API_KEY> with your NGC API key).

    helm fetch https://helm.ngc.nvidia.com/nvidia/cloud-native/charts/nvsm-1.0.1.tgz --username='$oauthtoken' --password=<NGC_API_KEY>
    
  4. Ensure that the chart archive is in your local directory.

    ls ./nvsm-1.0.1.tgz
    
  5. Inspect the contents of the values.yaml file in the chart archive to ensure the default settings are correct for your installation.

    If the settings are not correct, update the values.yaml file to match the cluster configuration, or override individual settings with --set options on the helm install command in the next step.
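
    For example, one way to print the chart's default values without unpacking the archive is the helm show values command from the Helm 3 CLI; this is only an optional check:

    helm show values ./nvsm-1.0.1.tgz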
    
  6. Deploy NVSM to the cluster.

    The cluster installs the container on all nodes that were labeled in step 2. The following command creates the nvidia-nvsm namespace and deploys the NVSM resources in that namespace:

    helm install --set platform.openshift=true --create-namespace -n nvidia-nvsm nvidia-nvsm ./nvsm-1.0.1.tgz
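
    Optionally, before you check the pods, you can confirm that the Helm release was created by listing the releases in the namespace with the standard helm list command:

    helm list -n nvidia-nvsm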
    
  7. Validate that NVSM has been deployed on all selected DGX nodes.

    You should see an nvidia-nvsm-XXXX pod instance for each node:

    oc get pods -n nvidia-nvsm -o wide
    NAME                READY   STATUS    RESTARTS   AGE   IP           NODE     ...
    nvidia-nvsm-d9d9t   1/1     Running   1          8h    10.128.2.11  worker-0 ...
    nvidia-nvsm-tt8g5   1/1     Running   1          8h    10.131.0.11  worker-1 ...
    

NVSM is now installed and can be run remotely using oc exec.
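
For example, a quick health report can be requested from one of the deployed pods. The pod name below is taken from the sample output above, so substitute a pod name from your cluster; the command also assumes that the standard nvsm CLI is available inside the container:

    oc exec -it -n nvidia-nvsm nvidia-nvsm-d9d9t -- nvsm show health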