Install the Node Feature Discovery (NFD) Operator
NVIDIA AI Enterprise 2.0 or later
OpenShift’s Node Feature Discovery (NFD) manages the detection of hardware features and their configuration in the OpenShift Container Platform. NFD labels each host with node-specific attributes, so it is a prerequisite for the NVIDIA GPU Operator. As a cluster administrator, you can install the NFD Operator using the Red Hat OperatorHub catalog in the OpenShift Container Platform web console.
Using the left menu bar, expand the Operators section and select OperatorHub.
Use the search bar to search for Node Feature Discovery. Two items should be returned as a result. Select the operator that is not tagged as the community operator. This is the version supported by Red Hat.
Click Install. The next menu allows you to customize how and where the operator will be installed. You rarely need to change any of these options. Select Install to continue with the defaults.
Wait while the NFD operator installs. Once you see “Installed operator - ready for use”, click View Operator. Notice that you are now in the openshift-nfd project that was created as part of the operator installation.
The section titled Provided APIs shows the Kubernetes objects that are provided by this operator. The NFD operator provides only one object, called NodeFeatureDiscovery. Click the Create Instance button to create a new NFD object.
This next menu provides options to configure how the NFD operator will scan your cluster. For the NFD operator, you rarely need to change any of the defaults. Click Create to instantiate the NFD resource on your cluster.
Note: The values pre-populated by the OperatorHub are valid for the GPU Operator. This starts the Node Feature Discovery Operator that proceeds to label the nodes in the cluster that have GPUs.
Using the left menu, navigate to Workloads and then DaemonSets. You should see two DaemonSets that have been created by the NFD operator as part of the NFD resource you just created. They are nfd-master and nfd-worker. One instance of the master will run on each of the cluster’s control plane nodes. One worker instance will run on each of the nodes that are available for application scheduling, typically called the application nodes.
Navigate to Workloads and then Pods. Here you will see the nfd-controller running with the nfd-masters and nfd-workers. The worker pods will scan the node to which they are assigned and detect different PCI devices and hardware capabilities. The nfd-workers then report this information back to the nfd-masters which, in turn, apply labels to the nodes.
Navigate back to Operators then Installed Operators. Select the Node Feature Discovery operator. From the horizontal menu bar, select NodeFeatureDiscovery. You should see one instance listed here titled nfd-instance. Wait for the Status to show the deployment is finished.
Note: Red Hat's documentation, The Node Feature Discovery Operator, also describes how to install the Node Feature Discovery Operator.
Alternatively, you can verify the Node Feature Discovery Operator installation using the OpenShift CLI.
Locate your login information in the OCP web console.
Log into the OpenShift CLI with the copied login command:
$ oc login --token=sha256~<token> --server=https://api.<OCP-URL>.com:6443
Using the OpenShift CLI, type the following command:
$ oc get pods -n openshift-nfd
NAME                                      READY   STATUS    RESTARTS   AGE
nfd-controller-manager-7f86ccfb58-nqgxm   2/2     Running   0          11m
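The pod listing above can also be checked by a script rather than by eye. The following is a minimal sketch, not part of the oc CLI: the function name nfd_pods_not_running is hypothetical, and it assumes the default table layout of oc get pods (NAME, READY, STATUS as the first three columns).

```shell
#!/bin/sh
# Hypothetical helper (not part of oc): reads the table printed by
# `oc get pods` on stdin and prints the name of every pod that is either
# not in the Running state or has fewer ready containers than expected.
# No output means every listed pod is Running and fully ready.
nfd_pods_not_running() {
  awk 'NR > 1 {
         split($2, ready, "/")          # READY column, e.g. "2/2"
         if ($3 != "Running" || ready[1] != ready[2]) print $1
       }'
}

# Usage against a live cluster (assumes you are already logged in):
#   oc get pods -n openshift-nfd | nfd_pods_not_running
```

If the usage command prints nothing, all pods in the openshift-nfd project are running. Any pod name it prints is worth inspecting in the web console.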
The Node Feature Discovery Operator uses vendor PCI IDs to identify hardware in a node. NVIDIA uses the PCI ID 10de. Use the OpenShift Container Platform web console or the CLI to verify that the Node Feature Discovery Operator is functioning correctly.
In the OpenShift Container Platform web console, click Compute > Nodes from the side menu.
Select a worker node that you know contains a GPU.
Click the Details tab.
Under Node labels, verify that the following label is present:
feature.node.kubernetes.io/pci-10de.present=true
0x10de is the PCI vendor ID that is assigned to NVIDIA.
Alternatively, you can verify that the GPU device (pci-10de) is discovered on the GPU node using the OpenShift CLI:
$ oc get nodes -l feature.node.kubernetes.io/pci-10de.present
NAME                           STATUS   ROLES    AGE   VERSION
nvaie-ocp-7rfr8-worker-7x5km   Ready    worker   20d   v1.22.3+e790d7f
nvaie-ocp-7rfr8-worker-jntsp   Ready    worker   11d   v1.22.3+e790d7f
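The label check can be sketched as a small script. This is an illustration only: the function names nfd_pci_label and count_listed_nodes are hypothetical, and the label key assumes the vendor-only format shown in this section (pci-<vendor>.present), which may differ if the NFD configuration has been customized.

```shell
#!/bin/sh
# Builds the node label key that NFD sets for a PCI vendor ID, assuming the
# vendor-only label format used in this section (an assumption, see above).
nfd_pci_label() {
  printf 'feature.node.kubernetes.io/pci-%s.present\n' "$1"
}

# Hypothetical helper: counts the nodes in `oc get nodes` table output read
# from stdin, skipping the header row. Prints 0 when the input is empty.
count_listed_nodes() {
  awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}

# Usage against a live cluster (assumes you are already logged in):
#   oc get nodes -l "$(nfd_pci_label 10de)=true" | count_listed_nodes
```

A count of zero suggests that NFD has not yet labeled any node as containing an NVIDIA PCI device, so the GPU Operator would have no nodes to target.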