Step #1: Install and configure the Node Feature Discovery Operator


OpenShift’s Node Feature Discovery (NFD) Operator manages the detection of hardware features and configuration in the OpenShift Container Platform. NFD labels each node with node-specific hardware attributes, and the NVIDIA GPU Operator relies on those labels, so NFD is a prerequisite for it. In this first step, you will install the NFD Operator from the Red Hat OperatorHub catalog in the OpenShift Container Platform web console.

  1. Using the credentials below, log in to Red Hat OCP through the OpenShift Console link in the left-hand navigation pane of this page.

    • Username: nvadmin

    • Password: nvopenshift

  2. In the left pane of the OpenShift Web Console, expand the Operators section and select OperatorHub.

  3. Use the search bar to search for Node Feature Discovery. Two items should be returned as a result. Select the operator that is not tagged as the community operator. This is the version supported by Red Hat.

  4. Click Install. The menu that follows lets you customize how and where the operator will be installed. You will rarely need to change any of these options. Select Install to continue with the defaults.

  5. Once you see Installed operator - ready for use, click View Operator.

    You are now in the openshift-nfd project created as part of the operator installation.


  6. Click the Create instance button to create a new NFD object.

    The section titled Provided APIs shows the Kubernetes objects this operator provides. The NFD operator only provides one object called NodeFeatureDiscovery.


  7. The next menu provides options to configure how the NFD operator will scan your cluster. You will rarely need to change any of the defaults. Click Create to instantiate the NFD resource on your cluster.

    Note

    The values pre-populated by the OperatorHub are valid for the GPU Operator.

    Creating the instance starts the Node Feature Discovery Operator, which then labels the nodes in the cluster that have GPUs.


  8. Using the left-hand menu, navigate to Workloads and then DaemonSets.

    You will see two DaemonSets that the NFD operator created for the NFD resource you just created: nfd-master and nfd-worker. One master instance runs on each of the cluster’s control plane nodes. One worker instance runs on each node available for application scheduling, typically called the application nodes.

  9. Navigate to Workloads and then Pods. Here you will see the nfd-controller running with the nfd-masters and nfd-workers.

    The worker pods will scan the node to which they are assigned and detect different PCI devices and hardware capabilities. The nfd-workers then report this information back to the nfd-masters, which, in turn, apply labels to the nodes.

  10. Navigate back to Operators, then Installed Operators. Select the Node Feature Discovery operator. From the horizontal menu bar, select NodeFeatureDiscovery. You will see one instance listed here, titled nfd-instance. Wait for the Status to show that the deployment is finished. Once complete, the status text may show “Conditions: Available, Upgradeable”.

  11. With the instance available, the Node Feature Discovery Operator labels the nodes in the cluster that have GPUs.
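
The console checks in steps 8 through 10 can also be run from a terminal. The following is a minimal sketch, assuming the `oc` client is already logged in to the cluster and that the operator was installed into the `openshift-nfd` project as described above. The function name `nfd_status` and the `NFD_NAMESPACE` variable are hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: CLI equivalents of the console checks for the NFD deployment.
# Assumption: `oc` is logged in and NFD runs in the openshift-nfd project.

NFD_NAMESPACE="${NFD_NAMESPACE:-openshift-nfd}"

# Hypothetical helper: show the NFD DaemonSets, pods, and NFD instances.
nfd_status() {
  echo "== DaemonSets in ${NFD_NAMESPACE} =="
  oc get daemonsets -n "${NFD_NAMESPACE}"

  echo "== Pods in ${NFD_NAMESPACE} =="
  oc get pods -n "${NFD_NAMESPACE}"

  echo "== NodeFeatureDiscovery instances in ${NFD_NAMESPACE} =="
  oc get nodefeaturediscovery -n "${NFD_NAMESPACE}"
}
```

After sourcing the script, running `nfd_status` prints the same three views the console shows, which may be convenient when re-checking the deployment repeatedly.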

The Node Feature Discovery Operator uses PCI vendor IDs to identify hardware in a node. NVIDIA’s PCI vendor ID is 10de. Use the OpenShift Container Platform web console or the CLI to verify that the Node Feature Discovery Operator functions correctly.

  1. Click Compute > Nodes from the side menu.

  2. Select a worker node that you know contains a GPU.

  3. Click the Details tab.

  4. Under Node labels, verify that the following label is present: feature.node.kubernetes.io/pci-10de.present=true.

    Note

    0x10de is the PCI vendor ID that is assigned to NVIDIA.


  5. As an alternative to verifying with the OpenShift Console, you can use the following CLI command from the LaunchPad System Console to list all nodes with an NVIDIA PCI device detected.

    oc get nodes -l feature.node.kubernetes.io/pci-10de.present


  6. You have now installed and configured the Node Feature Discovery Operator, and confirmed it discovered and labeled nodes with GPUs.
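
The label check can be scripted as well. The sketch below assumes the label format shown above (`feature.node.kubernetes.io/pci-<vendor ID>.present`) and a logged-in `oc` client. The helper names `pci_vendor_label` and `nodes_with_pci_vendor` are hypothetical and not part of any NVIDIA or Red Hat tooling.

```shell
#!/usr/bin/env bash
# Sketch: build the NFD label for a PCI vendor ID and list matching nodes.
# Assumption: labels follow feature.node.kubernetes.io/pci-<vendor>.present.

# Hypothetical helper: accept "10de" or "0x10de" and print the label key.
pci_vendor_label() {
  local vendor_id="${1#0[xX]}"   # strip an optional 0x / 0X prefix
  # Normalize to lowercase, matching the label shown in the console.
  vendor_id="$(printf '%s' "${vendor_id}" | tr 'A-F' 'a-f')"
  printf 'feature.node.kubernetes.io/pci-%s.present\n' "${vendor_id}"
}

# Hypothetical helper: list nodes whose label for this vendor is "true".
nodes_with_pci_vendor() {
  oc get nodes -l "$(pci_vendor_label "$1")=true" -o name
}
```

For example, `nodes_with_pci_vendor 0x10de` would list the nodes on which NFD detected an NVIDIA PCI device. An empty result suggests that either no GPU node is present or the labels have not been applied yet.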

© Copyright 2022-2023, NVIDIA. Last updated on Jan 10, 2023.