Common Issues and Troubleshooting Tips

Common NVIDIA Network Operator issues usually fall into a few buckets: dependency conflicts, nodes not being discovered correctly, driver or interface disruptions, and RDMA/SR-IOV features not actually becoming usable inside Kubernetes pods.

Frequent issues

A common installation issue is that Network Operator installs successfully, but networking features are not exposed to workloads because the NicClusterPolicy is missing, incomplete, or not aligned with the intended RDMA, SR-IOV, Multus, or IPAM setup.
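As a rough illustration of what "missing or incomplete" can mean, the sketch below compares the features you intend to use against the sections present in a NicClusterPolicy that has already been loaded as a Python dict (for example from `kubectl get ... -o json`). The spec field names in `FEATURE_PATHS` and the helper `missing_features` are assumptions made for this example; verify the field names against the CRD shipped with your operator version.

```python
# Hypothetical mapping from a feature name to the NicClusterPolicy spec path
# assumed to enable it. These field names are assumptions; check your CRD.
FEATURE_PATHS = {
    "driver": ("ofedDriver",),
    "rdma": ("rdmaSharedDevicePlugin",),
    "sriov": ("sriovDevicePlugin",),
    "multus": ("secondaryNetwork", "multus"),
    "cni-plugins": ("secondaryNetwork", "cniPlugins"),
    "ipam": ("nvIpam",),
}


def missing_features(policy, wanted):
    """Return the wanted features whose spec section is absent from the policy."""
    spec = policy.get("spec") or {}
    missing = []
    for feature in wanted:
        node = spec
        # Walk the nested path; stop as soon as a key is absent.
        for key in FEATURE_PATHS[feature]:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if node is None:
            missing.append(feature)
    return missing


# Illustrative policy: driver and Multus are configured, nothing else.
example_policy = {
    "spec": {
        "ofedDriver": {"image": "doca-driver"},
        "secondaryNetwork": {"multus": {"image": "multus-cni"}},
    }
}

print(missing_features(example_policy, ["driver", "sriov", "multus", "ipam"]))
```

A section being present does not prove it is configured correctly; the check only catches the case where an intended feature was never declared in the policy.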

NIC driver problems

When NVIDIA NICs are used as the primary network interface and the DOCA-OFED driver container is deployed, unloading and replacing the active driver can disrupt those interfaces. Only basic configuration, such as IP addresses, may be restored automatically.

Feature not working

If RDMA, GPUDirect RDMA, or SR-IOV is not working, the usual causes are mismatched node labels, unsupported hardware or driver combinations, or missing secondary-network configuration.

What to check first

Start with these checks:

  • Confirm the operator pods and NicClusterPolicy are healthy and reconciled.

  • Check whether NFD is already present on the cluster and disable duplicate NFD deployment if needed with nfd.enabled=false.

  • Verify node labels expose the expected NIC capabilities and that the hardware supports the requested features.

  • Review whether the NIC is the primary interface before enabling DOCA-OFED replacement, because reloading the driver can interrupt networking and advanced settings may not be restored.

  • Validate Multus, SR-IOV device plugin, CNI plugins, and NVIDIA IPAM configuration if pods are not receiving the expected secondary interfaces.
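Two of the checks above, node labels and policy health, can be sketched against JSON that has already been fetched, for example with `kubectl get nodes -o json`. This is a minimal sketch under stated assumptions: the label key assumes NFD's PCI vendor label for vendor ID 15b3 (Mellanox/NVIDIA networking), the `status.appliedStates` layout and the `ready`/`ignore` values are assumptions about the NicClusterPolicy status, and the node and state names in the sample data are invented for illustration.

```python
# Assumption: NFD publishes a PCI vendor label for NVIDIA/Mellanox NICs (vendor 15b3).
MELLANOX_PCI_LABEL = "feature.node.kubernetes.io/pci-15b3.present"

# Assumption: states that need no attention in status.appliedStates.
OK_STATES = {"ready", "ignore"}


def nodes_missing_nic_label(node_list):
    """Return names of nodes that do not carry the NIC vendor label."""
    missing = []
    for item in node_list.get("items", []):
        meta = item.get("metadata", {})
        labels = meta.get("labels") or {}
        if labels.get(MELLANOX_PCI_LABEL) != "true":
            missing.append(meta.get("name"))
    return missing


def unhealthy_states(policy):
    """Return names of applied states that are neither ready nor ignored."""
    status = policy.get("status") or {}
    return [
        entry.get("name")
        for entry in status.get("appliedStates") or []
        if entry.get("state") not in OK_STATES
    ]


# Sample data with invented names, shaped like parsed kubectl JSON output.
sample_nodes = {"items": [
    {"metadata": {"name": "worker-a", "labels": {MELLANOX_PCI_LABEL: "true"}}},
    {"metadata": {"name": "worker-b", "labels": {}}},
]}
sample_policy = {"status": {"state": "notReady", "appliedStates": [
    {"name": "state-OFED", "state": "notReady"},
    {"name": "state-multus-cni", "state": "ready"},
    {"name": "state-ib-kubernetes", "state": "ignore"},
]}}

print(nodes_missing_nic_label(sample_nodes))
print(unhealthy_states(sample_policy))
```

A node missing the label usually points at NFD or hardware detection, while an unhealthy applied state points at a specific operand, which helps decide where to look next.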

Practical troubleshooting flow

A practical troubleshooting sequence is: first confirm cluster dependencies such as NFD and CRDs, then inspect the NicClusterPolicy, then verify node labels and operand pods, and only after that move into driver-level debugging.
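The ordering matters more than any single check, so the sequence above can be sketched as a list of stages evaluated in order, stopping at the first one that fails. The stage names mirror the text; the helper `first_failing_stage` and the stubbed check functions are hypothetical placeholders that you would replace with real cluster queries.

```python
from typing import Callable, List, Optional, Tuple

# A stage is a label plus a zero-argument callable returning True when healthy.
Stage = Tuple[str, Callable[[], bool]]


def first_failing_stage(stages: List[Stage]) -> Optional[str]:
    """Run stages in order and return the name of the first failing one."""
    for name, check in stages:
        if not check():
            return name
    return None


# Stubbed checks for illustration only; real checks would query the cluster.
stages: List[Stage] = [
    ("cluster dependencies (NFD, CRDs)", lambda: True),
    ("NicClusterPolicy", lambda: True),
    ("node labels and operand pods", lambda: False),
    ("driver-level debugging", lambda: True),
]

print(first_failing_stage(stages))
```

Because evaluation stops at the first failure, later and more expensive stages such as driver-level debugging are only reached once the cheaper prerequisites have passed.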

If the issue appeared after a driver change, focus on NIC interface restoration and primary-interface behavior; if the issue is “pods run but fast networking is absent,” focus on SR-IOV, RDMA, Multus, and offload validation.