Use Cases#

The NVIDIA Network Operator simplifies and automates the deployment of the high-performance networking components required for accelerated workloads in a Kubernetes cluster running on DGX SuperPOD and BasePOD environments.

The most common use cases are high-performance computing, machine learning, and data processing applications that need direct access to networking hardware. These include scientific simulations, distributed training or inference across GPU nodes, and analytics and storage workloads that require high throughput.

NVIDIA Network Operator is especially useful in scale-out training and inference environments where GPUs on different nodes need low-latency, high-throughput communication. It works with GPU Operator to enable GPUDirect RDMA, which helps data move more efficiently between GPU nodes by reducing CPU involvement.

Distributed AI and Deep Learning: Enables GPUDirect RDMA by automating the deployment of the DOCA-OFED drivers and the NVIDIA Peer Memory driver (`nvidia_peermem`). This is critical for achieving high-bandwidth, low-latency multi-node GPU communication within containerized environments.
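Once the operator has deployed the drivers and an RDMA device plugin, a training pod can request both a GPU and an RDMA resource. The sketch below is illustrative: the container image, the RDMA resource name (`rdma/rdma_shared_device_a`), and the exact plugin configuration depend on your cluster's `NicClusterPolicy`.

```yaml
# Illustrative pod spec for a multi-node GPU training workload.
# Resource names and image tag are assumptions; adjust to your cluster.
apiVersion: v1
kind: Pod
metadata:
  name: multinode-trainer
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.05-py3   # example framework image
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]                     # needed for RDMA memory registration
    resources:
      limits:
        nvidia.com/gpu: 1                     # exposed by the GPU Operator
        rdma/rdma_shared_device_a: 1          # assumed name from the RDMA device plugin config
```

With both resources granted, communication libraries such as NCCL can use GPUDirect RDMA paths between nodes instead of staging transfers through host memory.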

Automated Lifecycle Management: Eliminates the need for manual node-by-node configuration. The operator automatically deploys, configures, and manages the lifecycle of host networking drivers, device plugins, secondary CNIs, and IPAM across the cluster, so pods can use RDMA, GPUDirect RDMA, SR-IOV, and related capabilities without per-node manual work.
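This lifecycle is driven declaratively through the operator's `NicClusterPolicy` custom resource. The sketch below shows the general shape only; version strings, repositories, and the interface name in the device-plugin config are placeholders that will differ per deployment.

```yaml
# Minimal NicClusterPolicy sketch, assuming illustrative versions and
# an example InfiniBand interface name ("ibs1f0").
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:                      # deploys the DOCA-OFED host drivers
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.04-0.6.6.0-0      # placeholder version
  rdmaSharedDevicePlugin:          # exposes RDMA devices as pod resources
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: v1.4.0                # placeholder version
    config: |
      {
        "configList": [{
          "resourceName": "rdma_shared_device_a",
          "rdmaHcaMax": 63,
          "selectors": { "ifNames": ["ibs1f0"] }
        }]
      }
```

Applying this single resource causes the operator to roll out and reconcile the corresponding DaemonSets on every matching node, replacing what would otherwise be manual driver and plugin installation.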

Multi-Network Pods: Integrates seamlessly with Multus CNI to attach multiple network interfaces to a single pod. This allows workloads to separate standard Kubernetes management traffic from high-speed, accelerated data plane traffic.
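Concretely, the separation works through a `NetworkAttachmentDefinition` that describes the secondary network, which pods then reference by annotation. The CNI type, interface name, and IPAM settings below are assumptions for illustration; the pod keeps its default cluster network for management traffic and gains a second interface for the data plane.

```yaml
# Illustrative Multus secondary network ("rdma-net"); the master
# interface and IPAM range are placeholders.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-net
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ibs1f0",
      "ipam": { "type": "whereabouts", "range": "192.168.2.0/24" }
    }
---
apiVersion: v1
kind: Pod
metadata:
  name: dual-homed-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net   # attaches the secondary interface
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
```

Inside the pod, `eth0` carries standard Kubernetes traffic while the Multus-attached interface (typically `net1`) carries the accelerated data-plane traffic.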