Install NVIDIA Network Operator (Optional)#

The NVIDIA Network Operator simplifies the provisioning and management of NVIDIA networking resources in a Kubernetes cluster. The operator is not required if the cluster has no NVIDIA networking devices installed, or if the deployment does not require multinode high-speed networking. The operator automatically installs the required host networking software, bringing together all of the components needed to provide high-speed network connectivity: the NVIDIA networking driver, the Kubernetes device plugin, CNI plugins, the IP address management (IPAM) plugin, and others. The NVIDIA Network Operator works in conjunction with the NVIDIA GPU Operator to support GPUDirect workloads that deliver high-throughput, low-latency networking for scale-out GPU computing clusters.

Install NVIDIA Network Operator using the Web Console#

Install the NVIDIA Network Operator using the Red Hat Software Catalog (Red Hat OperatorHub in versions before 4.20).

Access the Red Hat OpenShift console.

  1. Navigate to Ecosystem -> Software Catalog.

  2. Search for “NVIDIA Network Operator” (or Network Operator).

  3. Select the NVIDIA Network Operator.

  4. Click Install.

  5. Select the desired Installation mode (usually “All namespaces on the cluster (default)” for cluster-wide functionality).

  6. Select the Installed Namespace (usually nvidia-network-operator).

  7. Select the desired Update approval strategy (Manual or Automatic).

  8. Click Install.

  9. Wait for the Operator to be installed and its status to change to Succeeded on the Installed Operators page.

Note

After installation, you will need to create an instance of the NicClusterPolicy Custom Resource (CR) to deploy the necessary networking components. This is typically done on the Operator Details page by clicking Create instance on the “Nic Cluster Policy” API.

Refer to the documentation for details on creating the NicClusterPolicy for different deployment options on OpenShift. Creating a NicClusterPolicy that supports deploying GPUDirect with RDMA on Red Hat OpenShift depends on the hardware topology. For use cases such as distributed inferencing with llm-d or NVIDIA Dynamo, which are built on the NVIDIA Inference Xfer Library (NIXL), GPUDirect with RDMA is not strictly required but can be highly beneficial for large workloads.
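As an illustration, a NicClusterPolicy for an RDMA shared-device deployment might look like the following sketch. The interface names, resource name, `rdmaHcaMax` value, and image version below are placeholders and must be adjusted to match the cluster's ConnectX hardware topology and the component versions listed in the Network Operator release notes.

```yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: "<see release notes>"   # placeholder; use a supported version
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ens1f0"]
            }
          }
        ]
      }
```

The `resourceName` here determines the extended resource (`rdma/rdma_shared_device_a`) that the device plugin advertises on each matching node.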

Refer to the official documentation from Red Hat for steps on using the CLI and configuring the NVIDIA Network Operator on Red Hat OpenShift.
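For reference, a CLI-based installation typically creates a namespace, an OperatorGroup, and a Subscription, as sketched below. The channel, source, and sourceNamespace values are assumptions and should be verified against the cluster's catalog (for example, with `oc get packagemanifests`).

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-network-operator
---
# An OperatorGroup with no targetNamespaces selects all namespaces,
# matching the "All namespaces on the cluster" installation mode.
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-network-operator-group
  namespace: nvidia-network-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nvidia-network-operator
  namespace: nvidia-network-operator
spec:
  channel: stable                    # assumed channel; verify in the catalog
  name: nvidia-network-operator
  source: certified-operators        # assumed catalog source
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic
```

Apply the manifests with `oc apply -f <file>` and monitor the install with `oc get csv -n nvidia-network-operator`.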

Verify presence of NVIDIA Networking Devices#

The NVIDIA Network Operator requires the proper node labels to be in place for pod scheduling. The Node Feature Discovery Operator creates these node labels to advertise the presence of supported networking devices. Run the following command to validate that the labels match expectations.

# Nodes that have a Mellanox (ConnectX) PCI device
oc get nodes -l feature.node.kubernetes.io/pci-15b3.present=true

Verify the NicClusterPolicy is Ready.

oc get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}'

If using RDMA shared devices, inspect the number of RDMA resources in the cluster.

oc get nodes -o custom-columns='NAME:.metadata.name,RDMA_A:.status.capacity.rdma/rdma_shared_device_a,RDMA_B:.status.capacity.rdma/rdma_shared_device_b'
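Once the nodes advertise RDMA capacity, workloads consume it through a resource request. The sketch below shows the general shape; the pod name and container image are illustrative, and the resource name must match the `resourceName` configured in the NicClusterPolicy.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test-pod                # illustrative name
spec:
  containers:
  - name: rdma-test
    image: registry.example.com/rdma-test:latest   # placeholder image
    resources:
      limits:
        rdma/rdma_shared_device_a: 1   # resource advertised by the device plugin
```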

Note

Refer to Appendix E of Installing OpenShift on DGX for a network connectivity validation test.