NVIDIA Network Operator Overview

NVIDIA Network Operator is an open-source technology designed to automate the provisioning and management of networking components for NVIDIA hardware in Kubernetes clusters.

It leverages Kubernetes Custom Resource Definitions (CRDs) and the Operator SDK, and integrates with Kubernetes tooling such as Helm, kubectl, and Node Feature Discovery (NFD), to manage networking-related components and enable Remote Direct Memory Access (RDMA) and GPUDirect high-performance workloads in a Kubernetes cluster.
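For instance, the operator is typically deployed with Helm. A minimal sketch, assuming the upstream chart defaults (the release name and namespace below are illustrative choices, not requirements):

```shell
# Add NVIDIA's Helm repository and deploy the Network Operator chart.
# Release name and namespace are illustrative; adjust for your cluster.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install network-operator nvidia/network-operator \
  --namespace nvidia-network-operator --create-namespace --wait
```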

NVIDIA Network Operator works by:

  • Utilizing Kubernetes CRDs and the Operator SDK to manage and orchestrate NVIDIA networking resources.

  • Simplifying setup for features critical to AI/ML, HPC, and data-intensive applications by providing integrated support for Mellanox NICs, secondary networks, and low-latency interfaces.

  • Deploying required host software — such as drivers, device plugins, Container Network Interface (CNI) plugins, and IP Address Management (IPAM) plugins — onto nodes with compatible NVIDIA hardware.

  • Integrating closely with the NVIDIA GPU Operator to deliver high-throughput, low-latency data paths between GPUs across the cluster, essential for distributed deep learning and data analytics.
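The host-software deployment described above is driven by a custom resource. A minimal sketch of such a policy follows; the exact field names, image names, and versions vary by operator release, so treat every value here as a placeholder to check against the CRD reference for your version:

```yaml
# Illustrative NicClusterPolicy sketch -- fields and values are
# placeholders; consult the CRD documentation for your release.
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:                 # NIC driver container installed on matching nodes
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: <driver-version>
  rdmaSharedDevicePlugin:     # device plugin exposing RDMA devices to pods
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: <plugin-version>
```

Once applied, the operator reconciles this resource and deploys the named components only onto nodes with compatible hardware.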

Key features include:

  • Automatic installation and management of NVIDIA networking drivers and supporting software.

  • Enablement of RDMA and GPUDirect.

  • Support for both Ethernet and InfiniBand networking modes, including SR-IOV (Single Root I/O Virtualization) for optimal resource sharing in virtualized environments.

  • Use of NFD to automatically label Kubernetes nodes with hardware/networking features needed for operator configuration.
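Because NFD labels nodes by PCI vendor ID (NVIDIA/Mellanox NICs use vendor ID 15b3), a quick way to list eligible nodes, assuming the standard NFD label scheme, is:

```shell
# List nodes that NFD has labeled as carrying a Mellanox (vendor 15b3) PCI device.
kubectl get nodes -l feature.node.kubernetes.io/pci-15b3.present=true
```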

System Requirements:

  • RDMA-capable hardware: NVIDIA Mellanox ConnectX-5 NIC or newer

  • NVIDIA GPU and driver supporting GPUDirect

  • NVIDIA GPU Operator v1.5.2 or later

  • Operating Systems: Ubuntu 20.04 LTS, 22.04 LTS, and 24.04 LTS