Install NVIDIA Network Operator (Optional)
The NVIDIA Network Operator simplifies the provisioning and management of NVIDIA networking resources in a Kubernetes cluster. The operator is not required if the cluster has no NVIDIA networking devices installed, or if the deployment does not need multinode high-speed networking. The operator automatically installs the required host networking software, bringing together the components needed to provide high-speed network connectivity. These components include the NVIDIA networking driver, the Kubernetes device plugin, CNI plugins, the IP address management (IPAM) plugin, and others. The NVIDIA Network Operator works in conjunction with the NVIDIA GPU Operator to support GPUDirect workloads that deliver high-throughput, low-latency networking for scale-out GPU computing clusters.
Refer to the documentation for details on deploying GPUDirect with RDMA on Red Hat OpenShift. For use cases such as distributed inferencing with llm-d or NVIDIA Dynamo, which are built on the NVIDIA Inference Xfer Library (NIXL), GPUDirect with RDMA is not strictly required but can be highly beneficial for large workloads.
Verify presence of NVIDIA Networking Devices
The NVIDIA Network Operator requires proper node labels to be in place for pod scheduling. The Node Feature Discovery Operator creates the node labels that advertise the presence of supported networking devices. Run the following command to validate that the labels match expectations:
# Nodes that have Mellanox PCI (ConnectX)
oc get nodes -l feature.node.kubernetes.io/pci-15b3.present=true
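The label check above can be wrapped in a small guard for scripted installs. The following is a minimal sketch, not part of the official procedure: the function name `check_nic_nodes` is hypothetical, and it assumes that an empty result from the label query means no supported networking device was discovered.

```shell
# Hypothetical helper: fail early if no node carries the NFD label
# for PCI vendor 15b3 (the label used in the command above).
check_nic_nodes() {
  nic_nodes="$(oc get nodes -l feature.node.kubernetes.io/pci-15b3.present=true -o name)"
  if [ -z "$nic_nodes" ]; then
    echo "No nodes labeled feature.node.kubernetes.io/pci-15b3.present=true" >&2
    return 1
  fi
  # Print the matching nodes, one per line (for example node/worker-0).
  echo "$nic_nodes"
}
```

Calling `check_nic_nodes` before installing the operator stops the script with a non-zero status when no labeled node is found.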
Refer to the official documentation on installing the NVIDIA Network Operator on Red Hat OpenShift.
Verify that the NicClusterPolicy is ready.
oc get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}'
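Because the policy can take some time to reconcile, it may be convenient to poll the state instead of checking it once. The sketch below is illustrative only: the function name `wait_for_nic_policy`, the default timeout and interval, and the assumption that the success state is reported as the string `ready` are not taken from this guide.

```shell
# Hypothetical helper: poll the NicClusterPolicy state until it reports "ready".
# Assumes the policy is named nic-cluster-policy, as in the command above,
# and that the success state is the string "ready" (an assumption).
wait_for_nic_policy() {
  ncp_timeout="${1:-600}"   # total seconds to wait (assumed default)
  ncp_interval="${2:-10}"   # seconds between polls (assumed default)
  ncp_elapsed=0
  while [ "$ncp_elapsed" -lt "$ncp_timeout" ]; do
    ncp_state="$(oc get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}' 2>/dev/null)"
    if [ "$ncp_state" = "ready" ]; then
      echo "NicClusterPolicy is ready"
      return 0
    fi
    echo "Current state: ${ncp_state:-unknown}; retrying in ${ncp_interval}s"
    sleep "$ncp_interval"
    ncp_elapsed=$((ncp_elapsed + ncp_interval))
  done
  echo "Timed out waiting for NicClusterPolicy" >&2
  return 1
}
```

For example, `wait_for_nic_policy 900 15` would wait up to 15 minutes, checking every 15 seconds.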
Inspect the number of RDMA resources in the cluster:
oc get nodes -o custom-columns='NAME:.metadata.name,RDMA_A:.status.capacity.rdma/rdma_shared_device_a,RDMA_B:.status.capacity.rdma/rdma_shared_device_b'
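To spot nodes that do not advertise the RDMA resources, the same query can be filtered. This is a rough sketch under stated assumptions: the function name `report_missing_rdma` is hypothetical, and it relies on `oc` printing `<none>` for a column whose field is absent. Nodes without NVIDIA networking devices (for example control plane nodes) are expected to appear in the output.

```shell
# Hypothetical helper: list nodes that lack either RDMA shared-device resource.
# Returns non-zero when at least one such node is found.
report_missing_rdma() {
  oc get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,RDMA_A:.status.capacity.rdma/rdma_shared_device_a,RDMA_B:.status.capacity.rdma/rdma_shared_device_b' |
    awk '$2 == "<none>" || $3 == "<none>" { print $1; missing = 1 } END { exit missing }'
}
```

The resource names `rdma_shared_device_a` and `rdma_shared_device_b` are the ones used in the command above; adjust them if the cluster's device plugin configuration uses different names.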
Note
Refer to Appendix E of Installing OpenShift on DGX for a network connectivity validation test.