Key Concepts#
Operators are Kubernetes applications#
Kubernetes-native applications are called Operators. An Operator is a method of packaging, deploying, and managing a Kubernetes application: an application that is deployed on Kubernetes and managed using the Kubernetes APIs and kubectl tooling. Operators are software extensions to Kubernetes that use custom resources to manage applications and their components.
Operator SDK#
The Operator SDK provides the tools to build, test and package Kubernetes native applications (called Operators).
CustomResourceDefinitions (CRDs)#
A resource is an endpoint in the Kubernetes API that stores a collection of API objects of a certain kind; for example, the built-in pods resource contains a collection of Pod objects.
A custom resource is an extension of the Kubernetes API that is not necessarily available in a default Kubernetes installation. It represents a customization of a particular Kubernetes installation.
NVIDIA Network Operator relies on Kubernetes custom resources to configure SR-IOV networking, attach secondary networks, and control how node-level changes are rolled out across the cluster.
The following CRDs are used by the NVIDIA Network Operator:
NICClusterPolicy: defines the desired network configuration for the entire Kubernetes cluster and is the main configuration resource for the operator deployment.
IPPool: defines and manages the IP address ranges used by secondary networks configured through NVIDIA Network Operator.
SriovNetworkNodePolicy: defines how SR-IOV-capable NICs are configured on selected nodes, including VF creation, RDMA settings, and the resource names exposed to Kubernetes workloads.
SriovIBNetwork: defines an InfiniBand secondary network that allows pods to attach to SR-IOV-based high-performance network interfaces.
SriovNetworkPoolConfig: defines how SR-IOV-related configuration changes are rolled out across a selected node group, including update parallelism and node availability controls during reconfiguration.
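As an illustration, a minimal NICClusterPolicy might look like the following sketch. Field names follow the operator's published CRD schema, but the version strings are placeholders, not recommendations; consult the NVIDIA Network Operator documentation for the exact schema and supported values.

```yaml
apiVersion: mellanox.com/v1alpha1
kind: NICClusterPolicy
metadata:
  name: nic-cluster-policy        # the operator expects a single cluster-scoped policy
spec:
  ofedDriver:                     # deploy the NVIDIA driver container on matching nodes
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: <driver-version>     # placeholder
  rdmaSharedDevicePlugin:         # expose RDMA-capable devices to pods
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: <plugin-version>     # placeholder
```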
Kubernetes Node Feature Discovery (NFD)#
NVIDIA Network Operator relies on the existence of specific Kubernetes node labels to operate properly, for example, a label marking a node as having NVIDIA networking hardware available. This can be achieved either by manually labeling Kubernetes nodes or by using NFD to perform the labeling.
NFD is a Kubernetes add-on for detecting hardware features such as CPU cores, memory, and GPUs on each node and advertising those features using node labels. By default, the NVIDIA Network Operator deploys NFD to perform node labeling in the cluster to allow proper scheduling of NVIDIA Network Operator resources.
NFD is used to label nodes with the following labels:
PCI vendor and device information
RDMA capability
GPU features
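For example, NFD-generated labels on a node with an NVIDIA (Mellanox) NIC might include entries like the following excerpt of node metadata. Label names follow NFD's `feature.node.kubernetes.io` convention, and `15b3` is the NVIDIA (Mellanox) PCI vendor ID; the exact set of labels depends on the NFD version and configuration.

```yaml
# Excerpt of node metadata, as shown by: kubectl get node <node-name> -o yaml
metadata:
  labels:
    feature.node.kubernetes.io/pci-15b3.present: "true"   # NVIDIA (Mellanox) PCI device detected
    feature.node.kubernetes.io/rdma.capable: "true"       # node has RDMA-capable hardware
```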
Secondary Network#
A secondary network in Kubernetes is an additional network, separate from the default CNI (Container Network Interface) plugin that provides pod-to-pod connectivity. It is used for advanced use cases such as network isolation, high-performance workloads, or separation of the data and control planes. The secondary network definition tells Kubernetes the type of network so that it can work with it correctly. This enables the creation of network environments tailored to the specific needs of Kubernetes applications.
Specifically, for AI/ML workloads, a secondary network helps Kubernetes applications communicate faster and more efficiently with the storage layer and GPUs by integrating with technologies such as RDMA, RoCE, SR-IOV, and GPUDirect*.
The following CNIs are used to facilitate the deployment of secondary networks in Kubernetes:
CNI plugins: currently only containernetworking-plugins is supported
Multus-CNI: enables Kubernetes to use multiple CNI plugins simultaneously
IPAM CNI: helps the secondary CNIs with IP address management
* GPUDirect depends on disabling the PCIe ACS (Access Control Services) feature on the DGX through the nvidia-acs-disable.service service (this is done by default on DGX systems).
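To illustrate how these pieces fit together, a Multus NetworkAttachmentDefinition describing a macvlan-based secondary network with whereabouts IPAM might be sketched as follows. The master interface name and the address range are illustrative assumptions, not values prescribed by the operator.

```yaml
# A sketch of a secondary network definition consumed by Multus-CNI.
# The embedded JSON is the CNI plugin configuration (macvlan + whereabouts IPAM).
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-net
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens1f0",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.2.0/24"
      }
    }
```

A pod attaches to this secondary network by adding the annotation `k8s.v1.cni.cncf.io/networks: macvlan-net`; Multus then invokes the macvlan plugin in addition to the cluster's default CNI.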
SR-IOV Network Operator#
SR-IOV (Single Root I/O Virtualization) enables a single physical network interface card (NIC) to appear as multiple virtual NICs for virtualized environments. NVIDIA Network Operator can operate in unison with SR-IOV Network Operator to enable SR-IOV workloads in a Kubernetes cluster.
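For instance, a SriovNetworkNodePolicy that creates RDMA-enabled virtual functions on selected nodes might be sketched as below. The node selector, NIC selector values, namespace, and resource name are illustrative assumptions for a typical NVIDIA NIC deployment; refer to the SR-IOV Network Operator documentation for the authoritative schema.

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-rdma
  namespace: nvidia-network-operator    # operator's namespace; deployment-specific assumption
spec:
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"   # target nodes with NVIDIA NICs
  resourceName: sriov_rdma              # resource name advertised to Kubernetes workloads
  priority: 99
  numVfs: 8                             # number of virtual functions to create per PF
  nicSelector:
    vendor: "15b3"                      # NVIDIA (Mellanox) PCI vendor ID
    pfNames: ["ens1f0"]                 # physical function name; illustrative
  deviceType: netdevice
  isRdma: true                          # enable RDMA on the created VFs
```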